
WO2018132385A1 - Audio zoom in a natural audio video content service - Google Patents

Audio zoom in a natural audio video content service

Info

Publication number
WO2018132385A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
stream
region
content
zoomed
Prior art date
Application number
PCT/US2018/012992
Other languages
English (en)
Inventor
Pasi Sakari OJALA
Original Assignee
Pcms Holdings, Inc.
Priority date
Filing date
Publication date
Application filed by Pcms Holdings, Inc. filed Critical Pcms Holdings, Inc.
Publication of WO2018132385A1

Classifications

    • H04N 21/4728: End-user interface for requesting content, additional data or services; end-user interface for interacting with content, e.g. for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • G10L 21/0364: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
    • G10L 25/57: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination, for processing of video signals
    • H04N 21/233: Processing of audio elementary streams
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04S 7/30: Control circuits for electronic adaptation of the sound field
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H04N 21/23439: Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements, for generating different versions
    • H04N 21/26258: Content or additional data distribution scheduling for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H04R 2227/003: Digital PA systems using, e.g., LAN or internet
    • H04R 27/00: Public address systems
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction
    • H04S 2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H04S 2420/03: Application of parametric coding in stereophonic audio systems
    • H04S 7/305: Electronic adaptation of stereophonic audio signals to reverberation of the listening space

Definitions

  • Camera-captured streaming and VOD services offer several functionalities for managing the video stream.
  • the content owner may, for example, produce new content options by zooming in on a given portion of or location within the video stream, or by following a particular visual object in detail. Basically, a content editor may select a region of the video stream and zoom in. The resulting new video stream may then be added as a new adaptation block within a media presentation description (MPD) content manifest file of the streaming or download service.
  • Live media streams accompanied by additional sensor data provide rich contextual information regarding the recording environment. It may be beneficial to have third-party interfaces for augmenting new content, focusing the existing stream, and helping the content creator find focus points and points of interest in the presentation.
  • Described herein are systems and methods generally related to streaming of media content. More specifically, embodiments of the herein disclosed systems and methods relate to zooming an immersive audio presentation in connection with video content.
  • Systems and methods described herein provide mechanisms for "zooming" an immersive audio image of a video presentation.
  • Exemplary embodiments enable a user to concentrate on a selected area and/or objects in the video presentation.
  • the visual content has augmented material that is meant to capture a user's attention in a certain direction within the presentation
  • users may be provided with the ability to control the experience from an audio perspective.
  • a method includes accessing, at a server, a primary audio and video stream; preparing, at the server, a custom audio stream that enhances audio content associated with an identified spatial region of the primary video stream by classifying portions of audio content of the primary audio stream as either inside or outside the identified spatial region via spatial rendering of the primary audio stream, including determining a plurality of coherence parameters identifying a diffuseness associated with one or more sound sources in the primary audio stream, and determining a plurality of directional parameters associated with the primary audio stream, and filtering the primary audio stream using the plurality of coherence parameters and the plurality of directional parameters.
  • a method includes accessing, at a server, a first media stream comprising a first audio stream and a first video stream, determining a first spatial region of the first video stream, and determining a first zoomed audio region of the first audio stream associated with the first spatial region of the first video stream.
  • the method includes generating a focused audio stream based at least in part on processing of the first zoomed audio region of the first audio stream via spatial rendering of the first audio stream in time and frequency, including determining a plurality of coherence values as a measure of diffuseness associated with one or more sound sources in the first audio stream, and determining a plurality of directional values associated with the first audio stream, and generating a custom media stream based on the focused audio stream and a first custom video stream based at least in part on the first video stream.
  • the method further includes streaming the custom media stream from the server to a first receiving client.
  • the first media stream is received at the server from one or more of a first live content capture client and a video on demand server, the first media stream including client video data corresponding to a user-selected region of interest and audio data corresponding to the selected region of interest.
  • a system comprising a processor and a non-transitory storage medium storing instructions operative, when executed on the processor, to perform functions including those set forth above, and others.
  • FIG. 1 illustrates an overview of one embodiment of a system architecture for a streaming media server as disclosed herein.
  • FIG. 2 illustrates an overall process flow of one embodiment of content editing at a live-streaming server, as disclosed herein.
  • FIG. 3 illustrates a block diagram of one embodiment of an analysis filter bank.
  • FIG. 4 illustrates a process flow of one embodiment of audio zooming.
  • FIG. 5 illustrates a block diagram of one embodiment of audio image filtering in sub band domain based on the zoom information.
  • FIG. 6 illustrates one embodiment of classification of time-frequency slots based on the content.
  • FIG. 7 illustrates a process flow of one embodiment of audio zooming.
  • FIG. 8A illustrates a block diagram of one embodiment of a synthesis filter bank.
  • FIG. 8B illustrates a block diagram for one embodiment of combined analysis, directional filtering, and synthesis.
  • FIG. 9 illustrates a sequence diagram for an exemplary embodiment for live operation of audio zooming with user target selection.
  • FIG. 10 illustrates an exemplary embodiment of selection of a zoomed region of a video stream.
  • FIG. 11 illustrates an exemplary embodiment of a zoomed video stream added in a media presentation description (MPD).
  • FIG. 12 illustrates an exemplary embodiment of how a user may hear the zoomed audio image and targets.
  • FIG. 13 illustrates an exemplary wireless transmit/receive unit (WTRU) that may be employed as a server or user client in some embodiments.
  • FIG. 14 illustrates an exemplary network entity that may be employed in some embodiments.
  • Various hardware elements of one or more of the described embodiments are referred to as "modules" that carry out (i.e., perform, execute, and the like) various functions that are described herein in connection with the respective modules.
  • a module includes hardware (e.g., one or more processors, one or more microprocessors, one or more microcontrollers, one or more microchips, one or more application-specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs), one or more memory devices) deemed suitable by those of skill in the relevant art for a given implementation.
  • Each described module may also include instructions executable for carrying out the one or more functions described as being carried out by the respective module, and it is noted that those instructions could take the form of or include hardware (i.e., hardwired) instructions, firmware instructions, software instructions, and/or the like, and may be stored in any suitable non-transitory computer-readable medium or media, such as commonly referred to as RAM, ROM, etc.
  • a streaming server enables an audio/visual presentation zooming functionality for camera captured natural audio/visual live-streaming or video-on-demand (VOD) content.
  • a content editor, director, or an automatic contextual tool may access the content through a third party interface to create a zoomed audio/video stream to focus on a particular area or object of the stream.
  • the zooming operation of the audio image may provide a direction or area of interest in the presentation, around which the viewer is expected or desired to concentrate.
  • Live camera-captured natural video streams typically have a very vivid audio image with a plurality of individual sources and ambient sounds.
  • a "zoomed" region or a direction of interest in a presentation does not necessarily have a single distinct sound source that can be traced and emphasized.
  • an object-based media content approach does not generally work with live natural content. That is, the number of sound sources and their locations in the audio image of live natural content vary, and the existing sound sources may move in and out of the audio image.
  • Conventional management and manipulation of individual distinct sound sources, "audio objects", in predetermined and fixed locations does not generally work for live natural content.
  • Camera capture for a natural content stream may have an undetermined number of sources and ambient sounds without any particular location cues. Therefore, tracing sound sources and moving their locations artificially is unreliable and may cause annoying audio effects.
  • Some embodiments provide an external content editor or automatic contextual tool, such as a third party, which controls a media presentation including a zooming function; the third party may, in some embodiments, augment new content to the stream in a selected direction of interest.
  • Horizontal and vertical angles of the zoomed or selected area of the media presentation may be extracted and applied to select the corresponding area in an immersive audio image.
  • the zooming area selection may be conducted on the audio image only.
  • the content creation service or user may track target objects within the captured audio/visual image and determine the zoomed region based on contextual cues.
  • Multi-channel audio parameters in the time-frequency domain may be classified as within the given direction of interest or outside the zoomed area. Parameters within the zoomed area in the direction of interest may be focused by increasing the coherence and sharpening the directional information to produce a clear audio image. Parameters outside the zoomed area may be "blurred" to fade away possible audio components with distinct directional cues, as reverb and decorrelation (random level and phase shift) are introduced. This may make the image outside the zoomed area more ambient and diffuse.
  • the new tuned audio/visual stream is identified in a manifest file such as a media presentation description (MPD), in some cases together with augmented content.
  • a third-party may tune a media presentation in a streaming service by selecting a certain detail in the audio/visual stream, and thereby drive a viewer's interest towards a given direction.
  • the selection may be done manually by a content editor/director, or based on an automatic contextual tool.
  • a zoomed target may be traced automatically using contextual information of the captured content. For example, when a content capturing tool is collecting location information of all targets in the audio/visual image, the presentation may concentrate and zoom in on a desired target.
  • tuning of existing content may be accompanied with augmentation of third-party content in a presentation based on the selected area and contextual information existing in the content.
  • a zoomed area that is selected from the video stream may sound sharp with (dry) directional cues, while the rest of the audio image (e.g., corresponding objects not seen in the zoomed video) may become more ambient without clear audible location information (e.g., direction of arrival), thus automatically driving the viewer's attention in the desired direction and creating an artificial "cocktail party effect".
  • a content server may create a new version of a video stream containing only the zoomed area.
  • the server may compose a new stereo or multi-channel audio signal that is focused to the zoomed area.
  • the content identified in an MPD or other manifest file may include both the full and zoomed presentations.
  • Audio image processing at a content server
  • a streaming media server tunes natural camera-captured content from live streaming clients or VOD services and creates an additional media presentation for the same or other streaming clients.
  • a content editor/director, or an automatic contextual editor tool processes the received natural media stream to meet the expected requirements.
  • consumers receive improved and focused media streams.
  • providers or clients may simply edit an ongoing live stream.
  • the server collects audio/visual content and creates a new user experience by editing an alternative stream, for example, by zooming the visual content, thereby creating more options for viewers.
  • a media content server 10 may receive media streams from recording live-streaming clients 20 or from VOD servers 30 and may prepare the content for live-streaming.
  • the server may access the content, and create new content, for example by zooming the audio/visual image and/or augmenting new components to the stream.
  • the composed media including the original stream as well as the tuned, and in some instances augmented, content may be collected in segments and bundled into an MPD file.
  • Receiving clients 40 may then apply, for example, MPEG DASH protocol 50 to stream the content from the server and render the presentation for the user.
  • One function of the media server 10 may be to bundle the new content streams as additional content identified in the MPD file.
  • a receiving client may have an option to stream the original content or the tuned and/or augmented content.
  • the server creates a segment representing a zoomed version of the audio/visual content and adds the segment information to the MPD.
  • Receivers 40 will then have an option to retrieve and view either the original or the tuned content.
  • the media content server 10 in FIG. 1 may conduct the video stream zooming according to predetermined instructions from an external party.
  • a content editor/director 60 may interact with the content stream through a third-party API.
  • the third-party may have access to the content itself, as well as all possible contextual information from accompanied sensor data.
  • a zooming operation may be steered by a human director or it may be based on contextual analysis of the content stream itself.
  • the server 10 may, for example, have a special task to trace a certain object in the audio/visual stream.
  • the streaming client may capture the audio/visual content and may also extract contextual information about the objects appearing in the content 210.
  • a capturing device may also retrieve the location information of targets.
  • the content may also be obtained from a VOD service.
  • the content may be a camera- and/or microphone-captured natural audio/visual stream, together with collected context information about the environment.
  • the audio/visual content and accompanying contextual metadata may be forwarded to the live-streaming server at step 220 for distribution to streaming clients.
  • the server may have the potential to improve user experiences and tune the content.
  • the metadata regarding the target context may be applied by the streaming service.
  • an external party 240, such as a content editor/director 250 or even a contextual content analysis tool, may analyze the content and select an area or direction in the presentation that is cropped from the stream.
  • the selection of the area and direction of interest, e.g., the zoomed area 260, may be performed based on the visual and/or audio content.
  • the context information, such as a target location relative to the camera, may also drive the zoomed area selection.
  • the zooming action instructions, such as vertical and horizontal angles of the selected area, may be returned to the server through the third-party API, in some embodiments.
  • the server may then conduct the zooming operation 270 for both visual and audio content. That is, the content editor/director or contextual tool chooses the view coordinates after which the server does the zooming operation.
  • the server then segments and identifies new zoomed content in an MPD file, together with the original content in step 280.
  • the live-streaming (or VOD) content may be available for streaming at step 290, for example with the MPEG DASH protocol.
  • the zooming operation provides information about the view coordinates, as well as the horizontal and vertical angles of the selected area in the view. This information is used for audio/visual signal processing. Audio signal processing may be conducted in the time-frequency domain to extract information regarding both time evolving and frequency spectrum related phenomena.
  • the input for the spatial filtering is the stereo or multi-channel audio signal captured with a microphone array, or the like.
  • the processing may define the zooming operation of the presentation, and thus the filtering input may comprise the view angles (horizontal and vertical) of an identified spatial region (zoomed region).
  • FIG. 3 illustrates one embodiment of time- frequency domain analysis of a multi-channel signal.
  • In the illustrated example, two input channels 310 and 320 carry a stereo sound signal.
  • the filter bank can be scaled according to the number of input channels.
  • a filter bank may include several stages of band splitting filters to achieve sufficient frequency resolution.
  • the filter configuration may also differ to achieve a non-uniform band split.
  • when the resulting band-limited signals are segmented, the result is a time-frequency domain parameterization of the original time-series signal.
  • the resulting signal components limited by frequency and time slots are considered as the time-frequency parameterization of the original signal.
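  • As a rough illustration of such a decomposition, the short Python sketch below produces a comparable time-frequency parameterization of a stereo signal. It uses an STFT (which, as noted below, is one way the sub-band processing may be realized) rather than the cascaded band-splitting filter bank of FIG. 3, and the signal, sample rate and window length are made-up example values, not details from the patent.

```python
import numpy as np
from scipy.signal import stft

fs = 48000                         # assumed sample rate (Hz)
t = np.arange(fs) / fs
# Synthetic stereo capture: a 440 Hz source panned towards the left channel
# plus a little wideband ambience in both channels.
left = 0.8 * np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(fs)
right = 0.3 * np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(fs)

# One complex value per (frequency band, time frame) "slot" and per channel.
freqs, frames, L = stft(left, fs=fs, nperseg=1024)
_, _, R = stft(right, fs=fs, nperseg=1024)
print(L.shape)                     # (bands, frames): time-frequency parameterization
```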
  • an aim of the analysis filter bank of FIG. 3 is to classify the incoming "audio image" into a first area within the zoomed area and a second area outside of the zoomed area.
  • the selection is conducted by analyzing, as shown in block 330, the spatial location of sound sources and the presence of ambient sounds without any clear direction of arrival against the zooming area coordinates.
  • the classified parameters in the time-frequency domain 340 may then be further filtered with spatial filtering to enable the zooming effect.
  • the sub band domain processing 330 may be performed in the Discrete Fourier Transform (DFT) domain using a Short Term Fourier Transform (STFT) method. In some cases, such processing may be preferable, as the complex transform-domain parameters may be easier to analyze and manipulate regarding the level and phase shift.
  • Conventional binaural cue coding (BCC) analysis comprises computation of inter-channel level difference (ILD), inter-channel time difference (ITD), and inter-channel coherence (ICC) parameters estimated within each transform-domain time-frequency slot, i.e., in each frequency band of each input frame.
  • the ICC parameter may be utilized for capturing the ambient components that are not correlated with the "dry" sound components represented by phase and magnitude parameters.
  • the coherence cue represents the diffuseness of the signal component.
  • High coherence values indicate that the sound source is point like with an accurate direction of arrival, whereas low coherence represents a diffuse sound that does not have any clear direction of arrival. For example, reverberation and reflected signals coming from many different directions typically have low coherence.
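  • As an illustration of these cues, the sketch below estimates a per-slot level difference, a phase difference (a proxy for the time difference) and a coherence value from a stereo STFT pair such as L and R above. It is a simplified stand-in for a BCC analyzer, not the estimator used in the patent; the smoothing window length is an arbitrary assumption.

```python
import numpy as np

def bcc_cues(L, R, eps=1e-12, win=8):
    """Per-slot ILD (dB), inter-channel phase difference (rad) and coherence in [0, 1]."""
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))            # phase difference; ITD ~ ipd / (2*pi*f)

    # Coherence: |<L R*>| / sqrt(<|L|^2> <|R|^2>), expectations approximated by
    # a moving average over `win` neighbouring time frames.
    kernel = np.ones(win) / win
    smooth = lambda x: np.apply_along_axis(np.convolve, 1, x, kernel, mode="same")
    cross = smooth(L * np.conj(R))
    icc = np.abs(cross) / np.sqrt(smooth(np.abs(L) ** 2) * smooth(np.abs(R) ** 2) + eps)
    return ild, ipd, np.clip(icc, 0.0, 1.0)
```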
  • An exemplary audio zooming process may include spatial filtering of the time-frequency domain parameters based on the zooming details.
  • FIG. 4 One embodiment of an overall process is illustrated in FIG. 4.
  • the zooming operation is conducted after the signal is decomposed in the time-frequency domain 410.
  • the process includes classification of the parameters based on their location relative to the zoomed area in the audio image 420.
  • the processing of the parameters in spatial filtering depends on the classification.
  • Parameters classified within the zoomed area are focused in process step 430 by reducing their diffuseness through reducing level and time difference variations in the time-frequency domain. Outside the zoomed area, diffuseness is increased by adding random variations and reverberation to decorrelate parameters outside the zoomed area. Such decorrelation makes the sound source appear more ambient.
  • FIG. 5 illustrates one embodiment of a filtering operation 510 of sub band domain time-frequency parameters and the details for the zoomed area.
  • the content editor/director or contextual editing tool may provide the control information for the zooming.
  • the zooming information drives the classification of the sub band domain audio parameterization.
  • the spatial filtering conducts the audio zooming effect.
  • the output of the process 450 is the focused audio within, and the ambient audio image outside, the identified spatial region (zoomed region).
  • Audio image classification is conducted by estimating the direction of arrival cues and diffuseness in each time-frequency defined area (slot).
  • when the parameterization in a given slot indicates a sound source with a direction of arrival in the identified spatial region (zoomed region), the slot is classified as zoomed content. All other slots may be classified as out-of-focus parameterization.
  • input x(z) 520 and zoomed area details 530 enable sub band filtering to produce y(z) 540.
  • FIG. 6 illustrates an embodiment of audio classification in the time-frequency domain.
  • the output of the filter bank is presented in the time-frequency domain.
  • FIG. 6 indicates parameters limited by the frequency 610 and time 630 axes to identify "slots" 620 shaded as solid, hashed or null.
  • direction-of-arrival analysis of the parameters, using for example the BCC method, classifies the slots as within the zoomed region (solid slots), as out-of-focus parameterizations containing directional components outside the zoomed region (hashed slots), or as ambient sound without any particular directional cues (null slots).
  • the classified content is then processed to enable the audio zooming functionality.
  • the audio zooming may serve to focus a user's attention on the zoomed region of the audio image, and reduce the emphasis of content outside the zoomed area.
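  • A minimal sketch of such a slot classification is shown below. It assumes a stereo signal in which the level-difference cue is mapped to a crude azimuth estimate by a simple amplitude-panning approximation, and uses a coherence threshold to separate directional from ambient slots; the mapping, thresholds and label values are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def classify_slots(ild_db, icc, zoom_azimuth_deg, zoom_width_deg,
                   icc_ambient=0.5, max_ild_db=18.0):
    """Label each time-frequency slot: 0 = inside zoom, 1 = outside (dry), 2 = ambient."""
    # Crude direction-of-arrival estimate: map the ILD linearly onto +/- 90 degrees.
    azimuth = np.clip(ild_db / max_ild_db, -1.0, 1.0) * 90.0
    inside = np.abs(azimuth - zoom_azimuth_deg) <= zoom_width_deg / 2.0

    labels = np.full(ild_db.shape, 1, dtype=np.int8)   # default: dry slot outside the zoom
    labels[icc < icc_ambient] = 2                       # diffuse slots -> ambient
    labels[inside & (icc >= icc_ambient)] = 0           # dry slots inside the zoomed region
    return labels
```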
  • Audio zooming process: In various embodiments, the herein disclosed systems and methods connect the audio processing to the video zooming operation. When the visual content is zoomed into a certain area, the "audio image" is also focused in that area.
  • the time-frequency domain classification discussed above operates to decompose the signal into the zoomed region, the outside area, and ambient sounds. The audio zooming then focuses the user experience on the identified spatial region (zoomed region) of the image.
  • the time and level differences and their variation in each time-frequency parameter are analyzed. The variation of the phase and level differences is then determined, e.g., in the discrete Fourier transform (DFT) domain. This information reveals the diffuseness of the audio image within and outside the identified spatial region (zoomed region).
  • the parameters classified within the identified spatial region (zoomed area) are focused by reducing the time and level differences.
  • a variance of the parameters can also be reduced by averaging the values. This may be performed in the DFT domain, where the amplitudes of the complex-valued parameters relate to the level differences and their phase differences relate to the time differences. Manipulating these values reduces the variance and diffuseness. Effectively, the operation moves the sounds towards the center of the zoomed region, and makes them more coherent and focused.
  • the parameters classified outside the identified spatial region (zoomed area) are decorrelated, such as by adding a random component to the time and level differences. For example, in one embodiment, adding a random component to the level and phase differences of DFT domain parameters decorrelates the area outside the identified spatial region (zoomed area).
  • the server may also add a reverberation effect by filtering the parameters with an appropriate transform domain filter.
  • the reverberation filtering may be conducted by multiplying with an appropriate filter that has also been transformed into the DFT domain, instead of performing a convolution operation in the time domain.
  • a decorrelation module may increase the reverberation effect, as well as random time and level differences, the further away the audio components were classified from the identified spatial region (zoomed region).
  • random time and level differences and reverberation may be applied to time-frequency slots that have "dry" components (with distinct direction of arrival cues). There is generally no need to manipulate ambient sounds that already lack a location cue/directional parameter.
  • sounds outside the identified spatial region may be made more diffuse so as to lack any particular direction of arrival.
  • the sounds outside the identified spatial region thus become more ambient.
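  • The sketch below illustrates the two filtering branches just described, operating on a stereo STFT pair and the slot labels from the previous sketch: slots inside the zoomed region are pulled towards the channel mean (reducing inter-channel level and phase differences), while dry slots outside the region receive random level and phase perturbations plus a crude transform-domain reverberation tail. The mixing gains, perturbation ranges and decay constants are arbitrary illustrative choices, not values from the patent.

```python
import numpy as np

def zoom_filter(L, R, labels, focus=0.7, blur_db=6.0, blur_phase=1.0,
                reverb_taps=8, reverb_decay=0.6, rng=None):
    """Focus slots labelled 0 (inside zoom); decorrelate dry slots labelled 1 (outside)."""
    rng = np.random.default_rng() if rng is None else rng
    Lf, Rf = L.copy(), R.copy()

    # Inside the zoomed region: mix each channel towards the mid signal, which
    # shrinks level and time (phase) differences and makes the slots more coherent.
    mid = 0.5 * (L + R)
    inside = labels == 0
    Lf[inside] = (1 - focus) * L[inside] + focus * mid[inside]
    Rf[inside] = (1 - focus) * R[inside] + focus * mid[inside]

    # Outside the zoomed region (dry slots only): independent random level and phase
    # shifts per channel plus a leaky sum over past frames as a simple reverberation.
    outside = labels == 1
    for X in (Lf, Rf):
        gain = 10 ** (rng.uniform(-blur_db, blur_db, size=X.shape) / 20.0)
        phase = np.exp(1j * rng.uniform(-blur_phase, blur_phase, size=X.shape))
        tail = np.zeros_like(X)
        for k in range(1, reverb_taps + 1):
            tail[:, k:] += (reverb_decay ** k) * X[:, :-k]
        X[outside] = gain[outside] * phase[outside] * X[outside] + 0.3 * tail[outside]
    return Lf, Rf
```

Ambient slots (label 2) are intentionally left untouched, matching the note above that sounds already lacking directional cues need no further manipulation.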
  • FIG. 7 One embodiment of the zooming process for the audio image zooming is illustrated in FIG. 7.
  • classification of the audio image parameters in the time-frequency domain 710 results in spatial filtering split into two branches, inside the zoomed region 720, and outside the zoomed region 730, such as based on the classification of slots as inside the identified spatial region (zoomed region) or outside the identified spatial region (zoomed region).
  • inside the identified spatial region (zoomed region), decorrelation is reduced 740.
  • a next step is to make the zoomed area more focused and coherent, and sounds outside of the identified spatial region are decorrelated to generate ambient sound.
  • the filtering operations are different for the different branches: the audio inside the identified spatial region is focused by reducing the level and phase shift and averaging out the variation 750, resulting in focused and "dry" sounds (sounds with directional information).
  • outside the identified spatial region, the audio content is diffused by increasing decorrelation 770.
  • one method of increasing decorrelation is adding a random component, such as to level and phase shift parameters, thereby increasing random variations, and adding reverberation in the time-frequency domain 780.
  • Ambient sounds in the audio content are already diffuse and do not need to have phase shift processing such as adding randomization or reverberation.
  • the branches may be transformed back to the time domain 794.
  • "random" variations incudes pseudo-random variations, including predetermined pseudo-random variations.
  • the filtered sub-band domain signal may be transformed back to the time domain using a synthesis filter bank.
  • FIG. 8A illustrates one embodiment of sampled filter bank structure which reconstructs the decomposition of the analysis filter bank signal from FIG. 3 back to the time domain.
  • the output of the sub-band domain filtering 810 is upsampled by 2 in blocks 812, 814 for output y1(z) 830 and in blocks 822, 824 for output y2(z) 840.
  • inverse filters G(z) 832, 834 for output y1(z) 830 and 843, 844 for output y2(z) 840 combine to produce the two time-domain audio signals of a stereo-type signal.
  • FIG. 8A can be scaled for multi-channel audio signals.
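  • As a stand-in for such a synthesis filter bank, the inverse STFT can return the filtered sub-band parameters to the time domain. A minimal round-trip sketch (with the spatial filtering step omitted) is shown below; the sample rate and window length are the same assumed example values used earlier.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 48000
x = np.random.randn(fs)                        # stand-in for one captured channel
_, _, X = stft(x, fs=fs, nperseg=1024)         # analysis (as in the earlier sketch)
# ... the sub-band domain spatial filtering would modify X here ...
_, x_out = istft(X, fs=fs, nperseg=1024)       # synthesis back to the time domain
print(np.allclose(x, x_out[:len(x)], atol=1e-8))   # near-perfect reconstruction
```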
  • FIG. 8B illustrates a block diagram for one embodiment of combined analysis, directional filtering, and synthesis, using the modules discussed in relation to FIGS. 3, 5, and 8A.
  • the processed audio stream is segmented and the access details can be added to the MPD file to complete the media presentation with the zoomed content.
  • Information identifying the associated video portion is also included in the MPD.
  • the streaming media server, such as the server shown in FIG. 1, is able to provide live or VOD content in both original and zoomed versions.
  • a receiving user may select between the two using the details in the MPD file.
  • more than one zoomed region may exist for given video content.
  • each zoomed region may be processed as discussed above, and then the plurality of zoomed regions may be included in the MPD file.
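  • For illustration, the sketch below shows one way a server might append extra adaptation sets describing a zoomed rendition to a DASH MPD using Python's standard library. The element and attribute names follow the MPEG-DASH MPD schema, but the identifiers, codec strings and bitrates are made-up example values, not details from the patent.

```python
import xml.etree.ElementTree as ET

def add_zoomed_adaptation_sets(period, zoom_id):
    """Append video and audio AdaptationSets for one zoomed rendition to an MPD Period."""
    video = ET.SubElement(period, "AdaptationSet", id=f"zoom-{zoom_id}-video",
                          contentType="video", mimeType="video/mp4")
    ET.SubElement(video, "Representation", id=f"zoom-{zoom_id}-v1",
                  codecs="avc1.640028", bandwidth="3000000", width="1280", height="720")
    audio = ET.SubElement(period, "AdaptationSet", id=f"zoom-{zoom_id}-audio",
                          contentType="audio", mimeType="audio/mp4")
    ET.SubElement(audio, "Representation", id=f"zoom-{zoom_id}-a1",
                  codecs="mp4a.40.2", bandwidth="128000")
    return video, audio

mpd = ET.Element("MPD", xmlns="urn:mpeg:dash:schema:mpd:2011", type="dynamic")
period = ET.SubElement(mpd, "Period", id="1")
add_zoomed_adaptation_sets(period, zoom_id="A")        # one call per zoomed region
print(ET.tostring(mpd, encoding="unicode"))
```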
  • Audio classification functionality falls into two classes: components having directional cues within the identified spatial region (zoomed area), e.g., the direction of interest, and the rest. However, the classification may be further separated. For example, components that are not in the zoomed area may be classified as either ambient sounds or components having directional cues. In such a case, the "blurring" and decorrelation operation may be conducted only on the components having directional cues, or "dry" areas.
  • a receiving user may have control of the presentation and may steer a zooming target, and thus control the content tuning.
  • the receiving user may, for example, use head tracking and a post filtering type of audio image control to "zoom in" on the presentation.
  • the focused audio image of this solution drives the user's attention towards the zoomed area and emphasizes the target in the video stream.
  • User control may further be provided to the streaming server as additional contextual information. It carries valuable information about the user's interests; hence, the data is available, e.g., for third-party content augmentation.
  • Target selection: The receiving user may, in some embodiments, select a desired target on the screen.
  • when contextual information, such as location, is available, the user may tap the object on the screen, after which the consumer application may determine the selected object or area on the screen and return the information to the streaming server.
  • the server may lock on the target and create a new (or additional) zoomed stream that is focused on the selected object or area.
  • FIG. 9 depicts a sequence diagram for live operation of audio zooming with user target selection.
  • More particularly, camera-captured streaming and VOD services may enable new functionality for managing a video stream by zooming in on a particular location or following a particular visual object in the video stream.
  • a media server may create an alternative presentation for a live video stream.
  • the content may be available at the server for streaming, such as using MPEG DASH protocol.
  • the video content may be constructed in an MPD manifest file that contains details for streaming the content.
  • the server may create a zoomed version of the video stream, such as by following the particular area or section of the image or the particular target in the stream.
  • a content director or automatic contextual editor tool may crop a part of a video presentation and compose a new video stream.
  • FIGS. 10-12 illustrate particular stages in the process of the exemplary embodiment.
  • a user or content editor may make a target selection for a given video stream.
  • the target zoomed region may be the area within the box of the larger video feed.
  • the selection may be performed from the video stream only.
  • the content editor may follow an object in the visual stream and maintain the zoomed area.
  • automatic tools may be used.
  • the captured (or VOD) content may have contextual metadata associated with the absolute positions of objects appearing in the video. By combining this context with information about the camera location, objects may be pinpointed in the visual stream, and video zooming to a given object may be performed automatically.
  • each individual player in an NFL game may, for example, be instrumented with location tracking; by adding this information as metadata to a content stream covering the stadium, the streaming server may be enabled to trace any object (e.g., player) in the stream.
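  • A minimal geometric sketch of such a mapping is shown below: given a tracked target position and the camera position in a shared coordinate frame, the horizontal and vertical view angles of the target can be computed and used to steer the zoomed region. The coordinate convention and the example numbers are assumptions for illustration only.

```python
import numpy as np

def target_view_angles(target_xyz, camera_xyz, camera_yaw_deg=0.0):
    """Horizontal and vertical angles (degrees) of a tracked target as seen from the camera."""
    d = np.asarray(target_xyz, dtype=float) - np.asarray(camera_xyz, dtype=float)
    azimuth = np.degrees(np.arctan2(d[1], d[0])) - camera_yaw_deg    # horizontal angle
    elevation = np.degrees(np.arctan2(d[2], np.hypot(d[0], d[1])))   # vertical angle
    return azimuth, elevation

# Example: a player tracked at (40 m, 10 m, 0 m) with the camera at the origin facing +x.
print(target_view_angles((40.0, 10.0, 0.0), (0.0, 0.0, 0.0)))        # ~ (14.0, 0.0) degrees
```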
  • the server may create a new visual adaptation set in the MPD file based on the zoomed region.
  • An exemplary resulting video stream is illustrated in FIG. 11, for the selected region based on the original video stream of FIG. 10.
  • the server may also create a new audio stream that matches the zoomed video presentation, such that the audio experience reflects the new presentation. For example, a viewer may be guided to the zoomed area or target with the help of the processed audio.
  • an immersive audio image may be focused within the zoomed region while the remaining area is processed to sound more like ambient or background noise.
  • Possible sound sources outside the zoomed area are still present in the audio image but the viewer (listener) is not able to trace their location as in the full presentation (e.g., point sound sources are diffused).
  • FIG. 12 illustrates an exemplary effect in the audio image, e.g., how a user hears the audio image.
  • treating optical focus as comparable to aural focus, only the zoomed area is in focus while the remainder is "blurred."
  • distinct audio sources are not distracting to a viewer when they are not easily recognizable. As such, the viewer's focus is also drawn to the zoomed area by the audio image.
  • the server may apply only the audio zooming and possibly augment artificial visual cues to emphasize a certain area in the visual stream.
  • the target may be highlighted with an augmented frame, such as in FIG. 10.
  • the corresponding audio stream may be zoomed to drive the user's attention to the selected target, e.g., the audio image may appear as in FIG. 12.
  • the audio zooming may introduce an "artificial cocktail party effect" to the audio presentation.
  • the user is influenced to concentrate on the target when the rest of the image is "blurred,” without clear details (e.g., clear alternative point sound sources) that the user may follow.
  • Exemplary embodiments disclosed herein are implemented using one or more wired and/or wireless network nodes, such as a wireless transmit/receive unit (WTRU) or other network entity.
  • FIG. 13 is a system diagram of an exemplary WTRU 3102, which may be employed as a server or user device in embodiments described herein.
  • the WTRU 3102 may include a processor 3118, a communication interface 3119 including a transceiver 3120, a transmit/receive element 3122, a speaker/microphone 3124, a keypad 3126, a display/touchpad 3128, a non-removable memory 3130, a removable memory 3132, a power source 3134, a global positioning system (GPS) chipset 3136, and sensors 3138.
  • the processor 3118 may be a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), a state machine, and the like.
  • the processor 3118 may perform signal coding, data processing, power control, input/output processing, and/or any other functionality that enables the WTRU 3102 to operate in a wireless environment.
  • the processor 3118 may be coupled to the transceiver 3120, which may be coupled to the transmit/receive element 3122. While FIG. 13 depicts the processor 3118 and the transceiver 3120 as separate components, it will be appreciated that the processor 3118 and the transceiver 3120 may be integrated together in an electronic package or chip.
  • the transmit/receive element 3122 may be configured to transmit signals to, or receive signals from, a base station over the air interface 3116.
  • the transmit/receive element 3122 may be an antenna configured to transmit and/or receive RF signals.
  • the transmit/receive element 3122 may be an emitter/detector configured to transmit and/or receive IR, UV, or visible light signals, as examples.
  • the transmit/receive element 3122 may be configured to transmit and receive both RF and light signals. It will be appreciated that the transmit/receive element 3122 may be configured to transmit and/or receive any combination of wireless signals.
  • the WTRU 3102 may include any number of transmit/receive elements 3122. More specifically, the WTRU 3102 may employ MIMO technology. Thus, in one embodiment, the WTRU 3102 may include two or more transmit/receive elements 3122 (e.g., multiple antennas) for transmitting and receiving wireless signals over the air interface 3116.
  • the transceiver 3120 may be configured to modulate the signals that are to be transmitted by the transmit/receive element 3122 and to demodulate the signals that are received by the transmit/receive element 3122.
  • the WTRU 3102 may have multi-mode capabilities.
  • the transceiver 3120 may include multiple transceivers for enabling the WTRU 3102 to communicate via multiple RATs, such as UTRA and IEEE 802.11, as examples.
  • the processor 3118 of the WTRU 3102 may be coupled to, and may receive user input data from, the speaker/microphone 3124, the keypad 3126, and/or the display/touchpad 3128 (e.g., a liquid crystal display (LCD) display unit or organic light-emitting diode (OLED) display unit).
  • the processor 3118 may also output user data to the speaker/microphone 3124, the keypad 3126, and/or the display/touchpad 3128.
  • the processor 3118 may access information from, and store data in, any type of suitable memory, such as the non-removable memory 3130 and/or the removable memory 3132.
  • the non-removable memory 3130 may include random-access memory (RAM), read-only memory (ROM), a hard disk, or any other type of memory storage device.
  • the removable memory 3132 may include a subscriber identity module (SIM) card, a memory stick, a secure digital (SD) memory card, and the like.
  • the processor 3118 may also access information from, and store data in, memory that is not physically located on the WTRU 3102, such as on a server or a home computer.
  • the processor 3118 may receive power from the power source 3134, and may be configured to distribute and/or control the power to the other components in the WTRU 3102.
  • the power source 3134 may be any suitable device for powering the WTRU 3102.
  • the power source 3134 may include one or more dry cell batteries (e.g., nickel-cadmium (NiCd), nickel-zinc (NiZn), nickel metal hydride (NiMH), lithium-ion (Li-ion), and the like), solar cells, fuel cells, and the like.
  • the processor 3118 may also be coupled to the GPS chipset 3136, which may be configured to provide location information (e.g., longitude and latitude) regarding the current location of the WTRU 3102.
  • the WTRU 3102 may receive location information over the air interface 3116 from a base station and/or determine its location based on the timing of the signals being received from two or more nearby base stations. It will be appreciated that the WTRU 3102 may acquire location information by way of any suitable location-determination method while remaining consistent with an embodiment.
  • the processor 3118 may further be coupled to other peripherals 3138, which may include one or more software and/or hardware modules that provide additional features, functionality and/or wired or wireless connectivity.
  • the peripherals 3138 may include sensors such as an accelerometer, an e-compass, a satellite transceiver, a digital camera (for photographs or video), a universal serial bus (USB) port, a vibration device, a television transceiver, a hands free headset, a Bluetooth® module, a frequency modulated (FM) radio unit, a digital music player, a media player, a video game player module, an Internet browser, and the like.
  • FIG. 14 depicts an exemplary network entity 4190 that may be used in embodiments of the present disclosure.
  • network entity 4190 includes a communication interface 4192, a processor 4194, and non-transitory data storage 4196, all of which are communicatively linked by a bus, network, or other communication path 4198.
  • Communication interface 4192 may include one or more wired communication interfaces and/or one or more wireless-communication interfaces. With respect to wired communication, communication interface 4192 may include one or more interfaces such as Ethernet interfaces, as an example. With respect to wireless communication, communication interface 4192 may include components such as one or more antennae and one or more transceivers/chipsets designed and configured for one or more types of wireless (e.g., LTE or Wi-Fi) communication.
  • communication interface 4192 may be equipped at a scale and with a configuration appropriate for acting on the network side, as opposed to the client side, of wireless communications (e.g., LTE communications, Wi-Fi communications, and the like). Thus, communication interface 4192 may include the appropriate equipment and circuitry (perhaps including multiple transceivers) for serving multiple mobile stations, UEs, or other access terminals in a coverage area.
  • Processor 4194 may include one or more processors of any type deemed suitable by those of skill in the relevant art, some examples including a general-purpose microprocessor and a dedicated DSP.
  • Data storage 4196 may take the form of any non-transitory computer-readable medium or combination of such media, some examples including flash memory, read-only memory (ROM), and random-access memory (RAM) to name but a few, as any one or more types of non-transitory data storage deemed suitable by those of skill in the relevant art could be used. As depicted in FIG. 14, data storage 4196 contains program instructions 4197 executable by processor 4194 for carrying out various combinations of the various network-entity functions described herein.
  • Examples of computer-readable storage media include a read-only memory (ROM), a random-access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).
  • a processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Systems and methods related to zooming an immersive audio presentation in connection with video content are disclosed. In one embodiment, a method comprises accessing, at a server, a primary audio and video stream. The method also comprises preparing, at the server, a custom video stream to enhance a spatial region of the primary audio and video stream. The method further comprises preparing, at the server, a custom audio stream that corresponds to the custom video stream. The audio stream may be processed by classifying audio as inside or outside the spatial region. Audio inside the region may be focused, and audio outside the region may be diffused and/or decorrelated. The processed audio stream may be paired with an enhanced video stream and provided to a client device.
PCT/US2018/012992 2017-01-12 2018-01-09 Audio zoom in a natural audio video content service WO2018132385A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762445641P 2017-01-12 2017-01-12
US62/445,641 2017-01-12

Publications (1)

Publication Number Publication Date
WO2018132385A1 true WO2018132385A1 (fr) 2018-07-19

Family

ID=61569373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/012992 WO2018132385A1 (fr) 2017-01-12 2018-01-09 Audio zoom in a natural audio video content service

Country Status (1)

Country Link
WO (1) WO2018132385A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110129095A1 (en) * 2009-12-02 2011-06-02 Carlos Avendano Audio Zoom
EP2346028A1 (fr) * 2009-12-17 2011-07-20 Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. Apparatus and method for converting a first parametric spatial audio signal into a second parametric spatial audio signal
US20140348342A1 (en) * 2011-12-21 2014-11-27 Nokia Corporation Audio lens

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SCHULTZ-AMLING RICHARD ET AL: "Acoustical Zooming Based on a Parametric Sound Field Representation", AES CONVENTION 128; MAY 2010, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 1 May 2010 (2010-05-01), XP040509503 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020193851A1 (fr) 2019-03-25 2020-10-01 Nokia Technologies Oy Associated spatial audio playback
CN113632496A (zh) * 2019-03-25 2021-11-09 诺基亚技术有限公司 Associated spatial audio playback
EP3949432A4 (fr) * 2019-03-25 2022-12-21 Nokia Technologies Oy Associated spatial audio playback
US11902768B2 (en) 2019-03-25 2024-02-13 Nokia Technologies Oy Associated spatial audio playback
EP3849202A1 (fr) * 2020-01-10 2021-07-14 Nokia Technologies Oy Audio and video processing
US11342001B2 (en) 2020-01-10 2022-05-24 Nokia Technologies Oy Audio and video processing
WO2023118643A1 (fr) * 2021-12-22 2023-06-29 Nokia Technologies Oy Apparatus, methods and computer programs for generating spatial audio output

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18709151

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18709151

Country of ref document: EP

Kind code of ref document: A1

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载