
WO2014087277A1 - Generating drive signals for audio transducers - Google Patents

Generating drive signals for audio transducers

Info

Publication number
WO2014087277A1
Authority
WO
WIPO (PCT)
Prior art keywords
drive signal
audio
rendering
decorrelation
signal
Prior art date
Application number
PCT/IB2013/059875
Other languages
French (fr)
Inventor
Jeroen Gerardus Henricus Koppens
Erik Gosuinus Petrus Schuijers
Werner Paulus Josephus De Bruijn
Arnoldus Werner Johannes Oomen
Original Assignee
Koninklijke Philips N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips N.V. filed Critical Koninklijke Philips N.V.
Publication of WO2014087277A1 publication Critical patent/WO2014087277A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/02 Systems employing more than two channels, e.g. quadraphonic, of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • the invention relates to generation of drive signals for audio transducers and in particular, but not exclusively, to generation of drive signals from audio signals
  • Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication increasingly has replaced analogue representation and communication.
  • distribution of audio content, such as speech and music, is increasingly based on digital content encoding.
  • audio consumption has increasingly become an enveloping three dimensional experience with e.g. surround sound and home cinema setups becoming prevalent.
  • Audio encoding formats have been developed to provide increasingly capable, varied and flexible audio services and in particular audio encoding formats supporting spatial audio services have been developed.
  • Well known audio coding technologies like DTS and Dolby Digital produce a coded multi-channel audio signal that represents the spatial image as a number of channels that are placed around the listener at fixed positions. For a speaker setup that is different from the setup that corresponds to the multi-channel signal, the spatial image will be suboptimal. Also, these channel based audio coding systems are typically not able to cope with a number of speakers that is different from the number of speakers represented by the multi-channel signal.
  • MPEG Surround provides a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel audio applications.
  • Fig. 1 illustrates an example of elements of an MPEG Surround system.
  • an MPEG Surround decoder can recreate the spatial image by a controlled upmix of the mono- or stereo signal to obtain a multichannel output signal. Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround allows for decoding of the same multi-channel bit-stream by rendering devices that do not use a multichannel speaker setup.
  • An example is virtual surround reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided while using regular headphones.
  • Another example is the pruning of higher order multichannel outputs, e.g. 7.1 channels, to lower order setups, e.g. 5.1 channels.
  • MPEG standardized a format known as 'Spatial Audio Object Coding' (MPEG-D SAOC).
  • MPEG-D SAOC provides efficient coding of individual audio objects rather than audio channels.
  • each speaker channel can be considered to originate from a different mix of sound objects
  • SAOC makes individual sound objects available at the decoder side for interactive manipulation as illustrated in Fig. 2.
  • multiple sound objects are coded into a mono or stereo downmix together with parametric data allowing the sound objects to be extracted prior to the rendering thereby allowing the individual audio objects to be available for manipulation e.g. by the end-user.
  • similarly to MPEG Surround, SAOC also creates a mono or stereo downmix.
  • object parameters are calculated and included.
  • the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb.
  • Fig. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream.
  • SAOC allows a more flexible approach and in particular allows more rendering based adaptability by transmitting audio objects instead of only reproduction channels. This allows the decoder-side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by speakers.
  • in SAOC there is no relation between the transmitted audio and the reproduction or rendering setup; hence arbitrary speaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the speakers are rarely at the intended positions.
  • in SAOC it is decided at the decoder side where the objects are placed in the sound scene, which is often not desired from an artistic point-of-view.
  • the SAOC standard does provide ways to transmit a default rendering matrix in the bitstream, eliminating the decoder responsibility.
  • the provided methods rely on either fixed reproduction setups or on unspecified syntax.
  • SAOC does not provide normative means to fully transmit an audio scene independently of the speaker setup.
  • SAOC is not well equipped for the faithful rendering of diffuse signal components. Although there is the possibility to include a so-called multichannel background object to capture the diffuse sound, this object is tied to one specific speaker configuration.
  • the 3D Audio Alliance (3DAA) is dedicated to developing standards for the transmission of 3D audio that "will facilitate the transition from the current speaker feed paradigm to a flexible object-based approach".
  • in 3DAA, a bitstream format is to be defined that allows the transmission of a legacy multichannel downmix along with individual sound objects.
  • object positioning data is included. The principle of generating a 3DAA audio stream is illustrated in Fig. 4.
  • the sound objects are received separately in the extension stream and these may be extracted from the multi-channel downmix.
  • the resulting multi-channel downmix is rendered together with the individually available objects.
  • the objects may consist of so called stems. These stems are basically grouped (downmixed) tracks or objects. Hence, an object may consist of multiple sub-objects packed into a stem.
  • a multichannel reference mix can be transmitted with a selection of audio objects. 3DAA transmits the 3D positional data for each object. The objects can then be extracted using the 3D positional data. Alternatively, the inverse mix-matrix may be transmitted, describing the relation between the objects and the reference mix.
  • from the description of 3DAA, sound-scene information is likely transmitted by assigning an angle and distance to each object, indicating where the object should be placed relative to e.g. the default forward direction. This is useful for point-sources but fails to describe wide sources (like e.g. a choir or applause) or diffuse sound fields (such as ambience). When all point-sources are extracted from the reference mix, an ambient multichannel mix remains. Similar to SAOC, the residual in 3DAA is fixed to a specific speaker setup.
  • both the SAOC and 3DAA approaches incorporate the transmission of individual audio objects that can be individually manipulated at the decoder side.
  • SAOC provides information on the audio objects by providing parameters characterizing the objects relative to the downmix (i.e. such that the audio objects are generated from the downmix at the decoder side)
  • 3DAA provides audio objects as full and separate audio objects (i.e. that can be generated independently from the downmix at the decoder side).
  • FIG. 5 provides an illustration of the current high level block diagram of the intended MPEG 3D Audio system.
  • object based and scene based formats are also to be supported.
  • An important aspect of the system is that its quality should scale to transparency for increasing bitrate. This puts a burden on the use of parametric coding techniques that have been used quite heavily in the past (viz. MPEG-4 HE-AAC v2, MPEG-D MPEG Surround, MPEG-D SAOC, MPEG-D USAC).
  • Envisioned reproduction possibilities include flexible loudspeaker setups (envisaged up to 22.2 channels), virtual surround over headphones, and closely spaced speakers. Flexible loudspeaker setups refer to any number of speakers at arbitrary physical locations.
  • the decoder of MPEG 3D Audio is intended to comprise a rendering module that is responsible for translating the decoded individual audio channels/objects into speaker feeds based on the physical location of the speakers, i.e. based on the specific rendering speaker configuration/ setup.
  • the rendering of the audio is accordingly dependent on the physical locations of the speakers of the rendering configuration. These positions may be determined or provided in various ways. For example, they may simply be provided by a direct user input, such as by the user directly providing a user input indicating the floor plan of speakers location, e.g. using a mobile app interface.
  • acoustic methods, using both ultrasound and audible sound, may for example be used to determine the positions of the speakers.
  • the acoustic methods are typically based on the concept of acoustic Time-Of- Flight, which means that the distance between any two speakers is determined by measuring the time it takes for sound to travel from one speaker to the other. This requires a microphone (or ultrasound receiver) to be integrated into each loudspeaker.
  • the positioning of the loudspeakers set within the room may also be relevant. Again this information may be provided manually or via automated methods. E.g. ultrasound reflections may be used to automatically detect the distance to room boundaries (walls, ceiling, floor) and general room dimensions. Together this information gives a full description of the rendering configuration.
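
For illustration only (this computation is not spelled out in the patent text; the function name and speed-of-sound value are assumptions), the acoustic Time-Of-Flight approach reduces to a simple distance computation:

```python
def pairwise_distance_from_tof(t_flight_s: float, c: float = 343.0) -> float:
    """Distance between two loudspeakers from the measured time a test
    sound takes to travel from one to the other (c: speed of sound, m/s)."""
    return c * t_flight_s

# e.g. a measured flight time of 5 ms corresponds to roughly 1.7 m:
# pairwise_distance_from_tof(0.005) -> 1.715
```
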
  • Another requirement resulting from the speaker configuration independent audio provision is that the individual rendering device must position the different audio sources. Such positioning is traditionally performed at the content creation side, and is often manually performed or directly results from the recording signals. Furthermore, the positioning is conventionally performed based on a set of audio channels that are each associated with a fixed nominal position. Therefore, the rendering device merely needs to render the received audio signals and does not need to perform any positioning.
  • the rendering device needs to position the sound sources appropriately in the audio scene generated by the rendering of audio from the specific speaker configuration.
  • the positioning may often be based on position information received from the source, e.g. a desired position may be received for each audio object, but may be locally modified or changed.
  • based on the position of a given audio signal, the rendering device must generate drive signals for the individual loudspeakers such that the audio, at a (nominal) listening position, is perceived to originate from the given position.
  • An approach for positioning sound sources is to use a panning algorithm, where the relative levels of the resulting drive signals for the individual speakers are adjusted such that the audio signal is perceived as a sound source at the desired position.
  • two loudspeakers can radiate coherent signals with different amplitudes (except for the situation where the sound source is positioned exactly midway between the speakers). The listener perceives this as a virtual sound source positioned at a position between the speakers given by the relative amplitude levels.
  • the relation of amplitudes of emanating signals controls the perceived direction of the virtual source.
  • a virtual source can be positioned to any direction on the plane using two adjacent loudspeakers surrounding the virtual source. This method is called a pair-wise panning paradigm.
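
As a concrete illustration of pair-wise amplitude panning (the patent does not commit to a specific gain law; the tangent law below is one classic choice), the gains $g_1$ and $g_2$ for a source at angle $\theta$ between a symmetric speaker pair at $\pm\theta_0$ satisfy:

$$\frac{\tan\theta}{\tan\theta_0} = \frac{g_1 - g_2}{g_1 + g_2}, \qquad g_1^2 + g_2^2 = 1$$
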
  • the loudspeaker pair need not be in front of the listener. There typically exist, however, some limitations in the effectiveness of the approach for loudspeaker placement to the side of the listener.
  • the loudspeakers should furthermore preferably both either be in front of the listener or behind the listener. If a loudspeaker configuration has loudspeakers both behind and in front of the listener, the use of such a pair of speakers results in a gap in the directions at which the virtual sources can be positioned.
  • the loudspeaker setup will include speakers that are not in the same horizontal plane, e.g. it may include elevated loudspeakers.
  • a suitable approach for 3D audio rendering is so-called Vector Base Amplitude Panning (VBAP) described in Pulkki V. Virtual source positioning using vector base amplitude panning, Journal of the Audio Engineering Society 1997; 45(6):456-466.
  • the loudspeaker setup can be divided into triangles (loudspeaker triplets), with the audio signal for a given position being positioned by a panning of one triplet.
  • a loudspeaker triplet may be formulated using vectors.
  • the unit-length vectors $\mathbf{l}_m$, $\mathbf{l}_n$ and $\mathbf{l}_k$ point from the listening position to the loudspeakers.
  • the direction of the virtual source is represented by the unit-length vector $\mathbf{p}$, which is expressed as a linear weighted sum of the loudspeaker vectors: $\mathbf{p} = g_m \mathbf{l}_m + g_n \mathbf{l}_n + g_k \mathbf{l}_k$.
  • $g_m$, $g_n$ and $g_k$ are called the gain factors of the respective loudspeakers.
  • the loudspeaker setup is divided into triangles forming a triangle set.
  • a single triangle from the set is chosen to be used for the panning.
  • the selection can be made by calculating the gain factors in each loudspeaker triangle in the triangle set and selecting the triangle that produced non-negative factors. If the triangles in the set are non-overlapping, the selection is unambiguous.
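
A minimal numpy sketch of the VBAP gain computation and triangle selection described above (function name and normalization choice are illustrative, not from the patent):

```python
import numpy as np

def vbap_gains(p, triplets):
    """Find a loudspeaker triplet and gain factors for source direction p.

    p        : unit-length 3-vector from the listening position to the source.
    triplets : list of 3x3 arrays whose rows l_m, l_n, l_k are unit-length
               vectors from the listening position to three loudspeakers.
    """
    for i, L in enumerate(triplets):
        # Solve p = g_m*l_m + g_n*l_n + g_k*l_k for the gain factors.
        g = p @ np.linalg.inv(L)
        if np.all(g >= 0.0):                  # non-negative gains: p lies inside this triplet
            return i, g / np.linalg.norm(g)   # normalise for constant overall power
    raise ValueError("direction not covered by the triangle set")
```
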
  • transmission is envisioned to be independent of the rendering speaker setup. Therefore, the received bitstream can be used for rendering to an arbitrary speaker setup.
  • the scene intended by the audio engineer is mapped to the available speakers using their actual positions.
  • In practice this may result in a transmission of audio objects along with position information indicating where each object should be rendered in (3D) space.
  • a multitude of algorithms is available to generate speaker signals from this information, for example Vector-Based Amplitude Panning.
  • panning between speakers that are widely spaced does not yield a well-placed source.
  • front-back confusion may arise when panning between the two surround speakers of a 5.1 configuration.
  • part of the audio is perceived at the location of the speakers.
  • many speaker configurations such as e.g. a 5.1 loudspeaker configuration, utilize speakers that are relatively far apart and which accordingly provide a suboptimal perception of the virtual sound source at the desired position.
  • Panning between two speakers introduces a sweet-spot, or in fact a 'sweet-plane', which is the plane where the distance to both speakers is equal.
  • this 'sweet-plane' becomes a vertical 'sweet-line'.
  • when elevated speakers are used to pan elevated objects, the sweet-spot is also limited in height. This is even more problematic than the 'sweet-line', since people are generally not equally tall and therefore do not listen at the same height.
  • Solutions based on crosstalk cancelation can be used to introduce improved localization cues at the ears of the listener.
  • approaches are complex, sensitive to imperfections, have a narrow sweet-spot due to phase manipulation, and require personalized components in order to work well.
  • an improved approach would be advantageous and in particular an approach allowing increased flexibility, improved positioning of audio sources, improved adaptability to different rendering configurations, reduced complexity, an improved user experience, and/or improved performance would be advantageous.
  • The invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.
  • an apparatus for generating drive signals for audio transducers comprising: an audio receiver for receiving an audio signal; a position receiver for receiving position data indicative of a desired rendering position for the audio signal; a drive signal generator for generating at least a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position, the drive signal generator being arranged to generate the drive signals in response to a panning for the audio signal in response to the desired rendering position; and wherein the drive signal generator is arranged to decorrelate the first drive signal relative to the second drive signal, a degree of decorrelation being dependent on an indication of the first position.
  • the invention may provide an improved audio experience, and in particular an improved spatial audio experience.
  • the approach may support rendering over a wide range of loudspeaker configurations with increased adaptability of the user experience to the given configuration.
  • an improved perception of a sound source at a desired position may be provided, and often with a reduced sensitivity to specific loudspeaker configurations.
  • improved performance may be achieved for loudspeaker configurations having a relatively large distance between loudspeakers.
  • the approach may in many scenarios result in mitigation of imperfections of a panning operation.
  • the perception of a given sound source as also originating from the positions of the speakers involved in the panning may be reduced substantially.
  • the approach may specifically reduce the correlation between the speaker signals used to generate a panned phantom source thereby reducing the perceptibility of imperfections of the panning operation. For example, using panning for localization between widely spaced speakers tends to result in artifacts, including the perception of additional sound sources at the speaker positions.
  • the sound source is rendered more diffusely but still with a directional component originating from the desired position. It is often preferable to have a perceived sound source which is perceived as coming from a less-defined, but still more or less correct direction, than to have a sound source which is perceived as, for example, coming from two distinct loudspeaker positions or from a completely wrong position (e.g. front-back reversal).
  • the panning operation may comprise and/or consist in setting relative levels and/or gains for the first and second drive signal in response to the desired rendering position.
  • the levels/gains may be set such that, at a (nominal) listening position, the audio signal will be perceived to originate from the desired rendering position.
  • the desired rendering position may be a three dimensional, two dimensional or one dimensional position.
  • the panning operation may be a three dimensional, two dimensional or one dimensional panning operation.
  • a three dimensional system may consider a horizontal angular direction (azimuth), a vertical angular direction (elevation), and a distance from a (nominal) listening position.
  • a two dimensional system may e.g. consider a horizontal angular direction (azimuth) and a vertical angular direction (elevation).
  • a one dimensional system may e.g. consider only a horizontal angular direction (azimuth).
  • the desired rendering position may be an angular direction (azimuth) from a (nominal) listening position.
  • the apparatus may be arranged to receive audio transducer position data indicative of the positions of the first and second audio transducers, i.e. to receive an indication of at least the first position.
  • the data may e.g. be received from an internal source (such as a memory), a user input, or a remote source.
  • the audio signal may be received from an internal or external source.
  • the desired rendering position may also be received from any internal or external source, and may for example be received from a remote source together with the audio signal, or may be locally provided or generated.
  • the first position may be a three dimensional, two dimensional, or one dimensional position.
  • the second position may be a three dimensional, two dimensional, or one dimensional position.
  • the first position may be represented by any indication of a position including a three dimensional, two dimensional, or one dimensional position indication.
  • the first (and/or second) position may be represented by an angular direction (azimuth) from a (nominal) listening position.
  • the position receiver may receive an indication of the first position (from an external or internal source), and the drive signal generator may determine the degree of decorrelation dependent on the indication of the first position.
  • the indication of the first position may be an indication of an absolute position or may e.g. be an indication of a relative position, such as an indication of the first position relative to the second position and/or to a listening position.
  • the indication of the first position may be a partial indication of the first position (e.g. may only provide an indication in one dimension, such as an indication of an angle from a listening position to the first position, e.g. relative to a reference direction).
  • the audio signal may for example be an audio object, audio scene, audio channel or audio component.
  • the audio signal may be part of a set of audio signals, such as e.g. an audio component in an encoded data stream comprising a plurality of (possibly different types of) audio items.
  • the degree of decorrelation is dependent on an indication of the first position relative to the second position.
  • This may provide improved rendering in many embodiments, and may in particular allow efficient and accurate adaptation of the characteristics of the rendering to the specific audio transducer configuration.
  • the relative positions of audio transducers involved in a panning operation may have a strong influence on the performance, accuracy and possible artifacts of the operation, and thus an adaptation of the decorrelation based on a measure of a relative positioning of the audio transducers may provide a particularly suitable adaptation of the rendering.
  • the dependency of the degree of decorrelation on the (indication of) the first position may specifically be a dependency on the (indication of) the first position relative to the second position.
  • the indication of the first position relative to the second position may for example be an indication of the difference between the positions, e.g. measured as a distance along a line between the first and second position, or as an angular distance measured relative to a (typically nominal) listening position.
  • the degree of decorrelation is dependent on an indication of an angle between a direction from a listening position to the first position and a direction from the listening position to the second position.
  • This may provide improved rendering in many embodiments, and may in particular allow efficient and accurate adaptation of the characteristics of the rendering to the specific audio transducer configuration.
  • the angular difference/distance between the audio transducers involved in a panning operation, as seen from a listening position, may have a strong effect on the performance, accuracy and possible artifacts of the operation; thus an adaptation of the decorrelation based on a measure of the angular difference/distance may provide a particularly suitable adaptation of the rendering.
  • the dependency of the degree of decorrelation on the (indication of) the first position may specifically be a dependency on the (indication of) the angle between a direction from a listening position to the first position and a direction from the listening position to the second position.
  • the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an indication of a distance between the first position and the second position.
  • This may provide an improved adaptation to specific audio transducer configurations, and may in particular allow an improved trade-off between degradations resulting from imperfect panning and the definiteness of the perceived sound source position.
  • An improved user experience is typically provided with the localization effect being adapted to the specific audio transducer setup.
  • the distance may be an angular distance.
  • the angular distance may be measured from a (nominal) listening position.
  • the drive signal generator is arranged to increase decorrelation for an indication of increasing distance.
  • the distance may be an angular distance.
  • the angular distance may be measured from a (nominal) listening position.
  • the drive signal generator may be arranged to increase decorrelation for increasing angular distance between the first and second positions. The degree of decorrelation may specifically be a monotonically increasing function of the distance.
  • the drive signal generator is arranged to only decorrelate the first drive signal relative to the second drive signal when the indication of the distance is indicative of a distance above a threshold.
  • the threshold may in many embodiments advantageously correspond to an angular difference (from a nominal listening position) belonging to the interval [45°;75°], [50°;70°], or [55°;65°], and may specifically advantageously be substantially 60°.
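
One way to express this thresholded behaviour in code (a hypothetical mapping consistent with the intervals above, assuming the correlation falls off linearly beyond the threshold):

```python
def target_correlation(angle_deg: float, threshold_deg: float = 60.0) -> float:
    """Desired correlation between the two drive signals as a function of
    the angular distance between the two audio transducers: full correlation
    (no decorrelation) up to the threshold, then a linear fall-off to zero
    at a spacing of 180 degrees."""
    if angle_deg <= threshold_deg:
        return 1.0
    return max(0.0, (180.0 - angle_deg) / (180.0 - threshold_deg))
```
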
  • the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an indication of a distance between the desired rendering position and at least one of the first position and the second position.
  • This may provide an improved adaptation to specific audio transducer configurations, and may in particular allow an improved trade-off between degradations resulting from imperfect panning and a degree of localization.
  • An improved user experience is typically provided with the localization effect being adapted to the specific audio transducer setup and position being rendered.
  • the distance may be an angular distance.
  • the angular distance may be measured from a (nominal) listening position.
  • the degree of decorrelation is dependent on a distance between the desired rendering position and a closest speaker position of at least one of the first position and the second position.
  • the drive signal generator is arranged to increase decorrelation for an indication of increasing distance.
  • the distance may be an angular distance.
  • the angular distance may be measured from a (nominal) listening position.
  • the drive signal generator may be arranged to increase decorrelation for increasing angular distance between the desired rendering position and at least one of the first and second positions.
  • the degree of decorrelation may specifically be a monotonically increasing function of the distance.
  • the degree of decorrelation may be increased for an increasing distance between the desired rendering position and a closest speaker position of at least one of the first position and the second position.
  • the drive signal generator furthermore comprises a frequency response modifier arranged to modify a frequency response for at least the first drive signal in response to the desired rendering position.
  • This may provide an improved rendering in many embodiments and may in particular allow improved direction perception by a listener.
  • the feature may allow improved back to front resolution in many scenarios.
  • the modification of the frequency response is dependent on an ear response for a direction from a listening position to the desired rendering position.
  • This may provide an improved rendering in many embodiments and may in particular allow improved direction perception by a listener.
  • the feature may allow improved back to front resolution in many scenarios.
  • the modification of the frequency response may specifically be dependent on an ear response for a direction from the listening position to the desired rendering position relative to a reference direction, e.g. corresponding to a nominal listener orientation.
  • the drive signal generator furthermore comprises a frequency response modifier arranged to modify a frequency response for at least the first drive signal dependent on the first position.
  • This may provide an improved rendering in many embodiments and may in particular allow improved direction perception by a listener.
  • the feature may allow improved back to front resolution in many scenarios.
  • different frequency equalization/ coloration may be used for different speakers.
  • the drive signal generator may further comprise means arranged to modify a frequency response for the second drive signal dependent on the first position.
  • the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an angular direction from a listening position to the desired rendering position relative to a reference direction.
  • the reference direction may typically be a listening direction, such as a nominal forward direction of a listener at a nominal listening position.
  • the signal generator is further arranged to generate a third drive signal for a third audio transducer associated with a third position in response to the panning operation for the audio signal in response to the desired rendering position; and the drive signal generator is arranged to decorrelate the third drive signal relative to the first drive signal and to decorrelate the third drive signal relative to the second drive signal.
  • the signal generator comprises a decorrelator for decorrelating the first drive signal relative to the second drive signal.
  • a method of generating drive signals for audio transducers for rendering an audio signal comprising: receiving the audio signal; receiving position data indicative of a desired rendering position for the audio signal; generating at least a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position, the drive signals being generated in response to a panning for the audio signal in response to the desired rendering position; and wherein generating the first drive signal comprises decorrelating the first drive signal relative to the second drive signal, a degree of decorrelation being dependent on an indication of the first position.
  • Fig. 1 illustrates an example of elements of an MPEG Surround system in accordance with the prior art
  • Fig. 2 exemplifies the manipulation of audio objects possible in MPEG SAOC
  • Fig. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream
  • Fig. 4 illustrates an example of the principle of audio encoding of 3DAA in accordance with the prior art
  • Fig. 5 illustrates an example of the principle of audio encoding envisaged for MPEG 3D Audio in accordance with the prior art
  • Fig. 6 illustrates an example of an audio rendering system in accordance with some embodiments of the invention
  • Fig. 7 illustrates an example of a loudspeaker rendering configuration
  • Fig. 8 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention
  • Fig. 9 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention.
  • Fig. 10 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention
  • Fig. 11 illustrates an example of a three speaker rendering configuration
  • Fig. 12 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention
  • Fig. 13 illustrates an example of a panning operation processing for rendering of an audio signal in accordance with some embodiments of the invention
  • Fig. 14 illustrates an example of ear frequency responses for audio signals from different directions.
  • Fig. 15 illustrates an example of a frequency response modification for rendering of an audio signal in accordance with some embodiments of the invention.
  • Fig. 6 illustrates an example of an audio renderer in accordance with some embodiments of the invention.
  • the audio renderer comprises an audio receiver 601 which is arranged to receive audio data for audio that is to be rendered.
  • the audio data may be received from any internal or external source.
  • the audio data may be received from any suitable communication medium including direct communication or broadcast links.
  • communication may be via the Internet, data networks, radio broadcasts etc.
  • the audio data may be received from a physical storage medium such as a CD, Blu-Ray™ disc, memory card etc.
  • the audio data may be generated locally, e.g. by a 3D audio model (as e.g. used by a gaming application).
  • the audio data comprises a plurality of audio components which may include audio channel components associated with a specific rendering loudspeaker configuration (such as a spatial audio channel of a 5.1 surround signal) or audio objects that are not associated with any specific rendering loudspeaker configuration.
  • the audio signal is specifically one component of the received audio data, and indeed the following description will focus on a rendering of an audio object of the received audio data.
  • the described approach may be used with other audio components, including for example audio channels and audio signals extracted e.g. from audio channels (e.g. corresponding to individual sound sources embedded in the audio channels).
  • rendering of other audio signals may be performed in parallel and these audio signals may be rendered simultaneously from the same loudspeakers, and indeed the rendering of these other audio signals may follow the same approach.
  • all received audio components of the received audio data will be rendered in parallel thereby generating an audio scene represented by the audio data. It will also be appreciated that the described approach may only be applied to some of the audio components, or may indeed be applied to all received audio components.
  • the audio renderer of Fig. 6 further comprises a position receiver 603 which is arranged to receive position data which is indicative of a desired rendering position for the audio signal.
  • a single data stream may be received, e.g. via the Internet, with the single data stream comprising a number of audio signals defining audio objects and position data defining a recommended rendering position for each of the audio objects.
  • the following description will focus on an example wherein an audio signal corresponding to an audio object is rendered such that it is perceived to originate at a desired position indicated by position data received together with the audio signal.
  • the audio signal may be rendered at other positions.
  • the position indicated by the received position data may be modified locally, e.g. in response to a manual user input.
  • the desired position at which the audio renderer tries to render the audio signal may be determined by a local modification or manipulation of the received indicated position.
  • the position data may not be received with the audio data but may be received from another source, including both external and internal sources.
  • the audio renderer may include a position processor which automatically or in response to user inputs generates desired positions for various audio objects.
  • Such an embodiment may be particularly suitable for scenarios wherein the audio data is also locally generated. For example, for a gaming or virtual world application, a three dimensional model may be generated and used to generate both audio signals and associated positions.
  • the audio receiver 601 and position receiver 603 are coupled to a rendering unit 605 which is arranged to generate signals for individual audio transducers.
  • the rendering unit 605 generates one signal for each of the audio transducers, and thus the output set of signals comprises one individual signal for each audio transducer of a set of audio transducers.
  • the system of Fig. 6 renders the audio using a plurality of audio transducers in the form of a set of loudspeakers 607, 609 that are (assumed to be) arranged in a given speaker configuration.
  • Fig. 7 illustrates an example of speaker configuration comprising five speakers, namely a center speaker C, a left front speaker L, a right front speaker R, a left surround (or rear) speaker LS, and a right surround (or rear) speaker RS.
  • the speakers are in this example positioned at positions in a circle around a listening position.
  • the speaker configuration is in the example referenced to a listening position and furthermore to a listening orientation.
  • a nominal listening position and orientation is assumed for the rendering.
  • the rendering seeks to position the audio signal such that it, for a listener positioned at the nominal listening position and with the nominal listening orientation, will be perceived to originate from a sound source in the desired direction.
  • the positions may specifically be positions that are defined with respect to the (nominal) listening position and to the (nominal) listening orientation. In many embodiments, positions may only be considered in a horizontal plane, and distance may often be ignored. In such examples, the position may be considered as a one-dimensional position which is given by an angular direction relative to the reference direction, which is given as a specific direction from the listening position.
  • the reference direction may typically correspond to the direction assumed to be directly in front of the nominal listener, i.e. to the forward direction. Specifically, in Fig. 7, the reference direction is that from the listening position to the front center speaker C.
  • the angle between the reference direction and the direction from the listening position to a given speaker will simply be referred to as the angular position of the speaker.
  • the angular positions of the front speakers are at ±30°, and the angular positions of the rear speakers are at ±110°.
  • the angular distance between the front right speaker R and the surround right speaker RS is 80°.
  • Fig. 7 corresponds to a five channel surround sound configuration.
  • other loudspeaker rendering configurations may be used, including for example a larger number of speakers, elevated speakers, asymmetric speaker locations etc.
  • the rendering unit 605 is arranged to render the audio signal to be perceived to originate from the desired position.
  • the desired position is given as an angle with respect to the reference (forward) direction from the listening position to the center speaker C.
  • the rendering unit 605 is arranged to position the audio signal at a sound source position using a panning operation.
  • the positioning of the audio signal is specifically by panning using the two nearest speakers.
  • for an angle in the interval [0°;−30°], the rendering unit 605 will perform a panning between the center speaker C and the right speaker R
  • for an angle in the interval [0°;30°], the rendering unit 605 will perform a panning between the center speaker C and the left speaker L
  • for an angle in the interval [−30°;−110°], the rendering unit 605 will perform a panning between the right speaker R and the right surround speaker RS
  • for an angle in the interval [30°;110°], the rendering unit 605 will perform a panning between the left speaker L and the left surround speaker LS
  • for an angle in the interval [110°;180°] or [−110°;−180°], the rendering unit 605 will perform a panning between the left surround speaker LS and the right surround speaker RS.
  • the rendering unit 605 is arranged to select the two speakers nearest to the desired position and to position the audio signal between the two speakers using panning.
  • the two selected speakers are illustrated as speakers 607, 609 which e.g. may represent any speaker pair of Fig. 7 as described above.
  • the rendering unit 605 comprises a panning processor 611 which is arranged to perform a panning operation in order to generate output signals that when rendered will result in the audio signal being perceived by a listener at the nominal listening position to predominantly originate from the desired position.
  • the panning operation specifically determines relative signal levels for the sound rendered from the first speaker 607 and the second speaker 609 of the selected speaker pair 607, 609.
  • the panning includes determining a relative level difference between the first drive signal and the second drive signal corresponding to the desired rendering position.
  • the amplitude gains for the first and second speakers' drive signals are determined by means of so-called panning.
  • Panning is a process where depending on the position of a virtual sound source between two or more speakers, the signal corresponding to the virtual sound source position is played over these speakers, with amplitude gains that are determined based on relative distance of the virtual sound source with respect to the speakers.
  • if the virtual sound source is positioned close to the first speaker, the amplitude gain for the drive signal of that speaker will be relatively high, e.g. 0.9, whereas the gain for the second speaker will be relatively low, e.g. 0.1, thereby creating the impression of a virtual sound source between the first and second speaker, close to the first speaker.
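
As a sketch of such a panning operation (a constant-power law is a common choice; the patent does not prescribe this exact gain law):

```python
import numpy as np

def pan_pair(audio, frac):
    """Constant-power panning of `audio` between two speakers.

    frac: position of the virtual source between the speakers,
          0.0 = at the first speaker, 1.0 = at the second speaker.
    """
    g1 = np.cos(0.5 * np.pi * frac)   # gain for the first speaker
    g2 = np.sin(0.5 * np.pi * frac)   # gain for the second speaker
    return g1 * audio, g2 * audio     # g1**2 + g2**2 == 1 for every frac
```

For example, `frac = 0.1` gives gains of roughly 0.99 and 0.16, placing the virtual source close to the first speaker.
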
  • the panning processor 611 is coupled to the audio receiver 601 and the position receiver 603 and receives the audio signal and the desired position. It then proceeds to generate a signal for each of the first and second speaker 607, 609, i.e. the panning processor 611 generates two signals from the audio signal, namely one for the first speaker 607 and one for the second speaker 609. The two generated signals have an amplitude value which when rendered from the positions of the first and second speaker 607, 609 corresponds to an audio source perceived to be at the desired position.
  • the first signal is used to drive the first speaker 607 and the second signal is used to drive the second speaker 609. It will be appreciated that the first and/or second signals may in some embodiments directly be used to drive the loudspeakers 607, 609 but in many embodiments the signal paths may include further processing including for example amplification, filtering, impedance matching etc.
  • the rendering unit 605 of Fig. 6 is furthermore arranged to decorrelate one of the signals relative to the other signal.
  • the decorrelation may be performed prior to the panning, as part of the panning operation, or after the panning operation.
  • the rendering unit 605 may initially generate a decorrelated version of the audio signal and then generate the first and second signals by a gain adjustment of the original audio signal and the decorrelated version of this.
  • the panning operation is performed on the audio signal thereby generating the first and second signal.
  • the decorrelation is then performed by applying a decorrelation to the first signal.
  • the rendering unit 605 comprises a decorrelator 613 which decorrelates the first signal.
  • the decorrelator 613 is a processing component that generates an output signal that preserves the spectral and temporal amplitude envelopes of the input signal but has a cross-correlation of less than one between the input and output.
  • Many practical decorrelators will have a cross-correlation of close to zero between the input and the output.
  • the decorrelator 613 of Fig. 6 is an adjustable decorrelator for which the degree of correlation may be varied.
  • the decorrelator 613 may provide a partial and variable decorrelation thus allowing the input signal and output signal to have a cross-correlation of lower than one but possibly higher than zero.
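
A minimal sketch of such an adjustable decorrelator, assuming a chain of Schroeder all-pass sections as the underlying full decorrelator (the filter structure and delay lengths are illustrative, not from the patent):

```python
import numpy as np

def _allpass(x, m, g=0.5):
    """Schroeder all-pass section, H(z) = (-g + z^-m) / (1 - g*z^-m):
    flat magnitude response, scrambled phase."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        xm = x[n - m] if n >= m else 0.0
        ym = y[n - m] if n >= m else 0.0
        y[n] = -g * x[n] + xm + g * ym
    return y

def adjustable_decorrelator(x, lam):
    """Mix the input with a (nearly) fully decorrelated version of itself
    to reach a target cross-correlation lam in [0, 1] with the input."""
    d = x
    for m in (149, 211, 293):                   # illustrative prime delay lengths
        d = _allpass(d, m)
    return lam * x + np.sqrt(1.0 - lam**2) * d  # energy-preserving mix
```
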
  • while panning is highly suitable for positioning virtual sound sources between physical loudspeaker positions, it can also introduce artifacts in some scenarios. Specifically, if the distance between the speakers is too large, the listener will, in addition to the desired phantom sound source position, also tend to perceive secondary sound sources at the positions of the used loudspeakers. Indeed, for typical speaker distances, the panning operation will tend to result in clearly noticeable artifacts.
  • the inventors have specifically realized that it is often preferable to have a sound source which is perceived as coming from a less-defined, but still more or less correct direction, than to have a source which is perceived, e.g., as coming from two distinct loudspeaker positions or from a wrong position (e.g. front-back reversal).
  • the degree of decorrelation is dependent on the rendering speaker configuration. Specifically, it will depend on at least the position of one of the speakers, and typically on the positions of two speakers relative to each other.
  • the rendering unit 605 may receive information of the position of the speakers, e.g. simply provided as an azimuth relative to a nominal direction (typically the forward direction). In some embodiments, such information may be received by the rendering unit 605 from a measurement unit that performs measurements to determine relative positions. Alternatively or additionally, the position information may be received as a user input, e.g. by the user simply inputting the approximate angle from the listening position to each of the speakers relative to a forward direction. In yet other embodiments, the position information may be stored in the audio renderer and simply retrieved from memory by the rendering unit 605. In some embodiments, the position information may be assumed position information.
  • the degree of decorrelation is dependent on a distance between the positions of the first and second loudspeakers 607, 609, i.e. between the speakers used for the panning.
  • the distance may be determined as an actual distance, e.g. along a desired direction. However, in many scenarios, the distance may be determined as an angular distance measured from the (nominal) listening position.
  • the degree of decorrelation may be dependent on the angle between the lines from the listening position to the two loudspeakers 607, 609.
  • the system may specifically be arranged to only introduce additional decorrelation if the distance exceeds a threshold. Indeed, for an angular distance sufficiently low, it has been found that panning provides a very accurate perception with only very low and typically insignificant artifacts. It has been found that in many scenarios and for most people, loudspeakers that are at an angle of less than 60° provide an accurate perception of a sound source at the desired position and with typically insignificant degradations.
  • the rendering unit 605 may be arranged to increase decorrelation for increasing distance, or equivalently the rendering unit 605 may determine the decorrelation amount as a monotonically increasing function of the distance.
  • the rendering unit 605 thus generates a desired correlation ($\lambda$) between the speaker signals.
  • This desired correlation is dependent on the distance between the speakers between which a source is panned, where a lower correlation (and thus higher decorrelation) is chosen for wider spaced speakers.
  • Speaker spacing is never larger than 180°, since for larger angles there exists a smaller alternative angle describing the same configuration. With panning over a speaker spacing of 180° there is only lateralization, so a corresponding correlation of zero may be used for this spacing.
  • a suitable function may be selected, such as e.g. a linear interpolation:
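
The text leaves the exact function open; one linear interpolation consistent with the endpoints discussed above (full correlation up to roughly 60° of spacing, zero correlation at 180°) would be:

$$\lambda(\Delta\varphi) = \begin{cases} 1, & \Delta\varphi \le 60^\circ \\[4pt] \dfrac{180^\circ - \Delta\varphi}{180^\circ - 60^\circ}, & \Delta\varphi > 60^\circ \end{cases}$$

where $\Delta\varphi$ is the angular spacing between the two speakers as seen from the listening position.
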
  • rendering unit 605 may be implemented in different ways in different embodiments.
  • Fig. 8 illustrates an example of an implementation of the rendering unit 605.
  • a decorrelator which performs a full decorrelation (i.e. with a cross correlation between input and output of substantially zero) is used.
  • the audio signal to be panned between two speakers 607, 609 is first split and decorrelated in order to achieve a desired correlation level before panning gains are applied to the two signals.
  • the degree of decorrelation is controlled by the first drive signal being generated as a weighted summation of the original audio signal and the fully decorrelated signal.
  • the relation between the decorrelation gains ($a_1$, $a_2$) may for example be chosen to preserve signal energy:
  • $a_1 = \cos(\arccos(\lambda)) = \lambda$
  • $a_2 = \sin(\arccos(\lambda)) = \sqrt{1-\lambda^2}$, where $\lambda$ indicates the desired correlation ($\lambda \in [0, \ldots, 1]$) between the two speaker signals.
  • the panning is then performed by scaling the resulting signals using appropriate panning gains ($g_1$, $g_2$).
  • An advantage of the example of Fig. 8 is that no adjustments need to be made to the panning gains ($g_1$, $g_2$) obtained from a panning algorithm which does not consider any decorrelation (e.g. from a conventional algorithm for determining panning gains).
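
Putting the Fig. 8 structure into a short sketch (signal names are illustrative; `d` is assumed to be a fully decorrelated copy of `s`):

```python
import numpy as np

def decorrelate_then_pan(s, d, lam, g1, g2):
    """Fig. 8 style processing: first establish the target correlation lam
    between the two branch signals, then apply the unmodified panning
    gains g1 and g2."""
    a1 = lam                       # = cos(arccos(lam))
    a2 = np.sqrt(1.0 - lam**2)     # = sin(arccos(lam)); a1**2 + a2**2 == 1
    branch1 = s                    # first branch: the original signal
    branch2 = a1 * s + a2 * d      # second branch: partially decorrelated
    return g1 * branch1, g2 * branch2
```
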
  • Fig. 9 illustrates another example wherein the panning gains are applied before decorrelation.
  • the decorrelation gains ($a_1$, $a_2$) should be chosen differently to preserve the correct energy in the output signals.
  • the decorrelation and panning may be performed jointly in a matrix operation on the audio signal and a decorrelated version thereof:
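
Using the same energy-preserving gains as above, such a joint operation could take the following form (a sketch; the actual matrix from the patent figure is not reproduced here):

$$\begin{bmatrix} s_1 \\ s_2 \end{bmatrix} = \begin{bmatrix} g_1 & 0 \\ g_2 \lambda & g_2 \sqrt{1-\lambda^2} \end{bmatrix} \begin{bmatrix} s \\ d(s) \end{bmatrix}$$

where $s$ is the audio signal, $d(s)$ its fully decorrelated version, $g_1$ and $g_2$ the panning gains, and $\lambda$ the desired correlation.
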
  • the degree of decorrelation may additionally or alternatively be dependent on the desired rendering position.
  • the decorrelation may be dependent on a relative distance from the rendering position to the nearest loudspeaker used for the panning.
  • the rendering unit 605 may increase the decorrelation for an increasing distance, or equivalently the amount of decorrelation may be a monotonically increasing function of the distance from the desired rendering position to the nearest speaker position.
  • a sound source panned closely to one of the speakers will have higher correlation than for a sound source panned halfway between the speakers.
  • Such an approach may provide an improved user experience in many scenarios.
  • the approach may reflect that panning works better for positions close to the speaker positions used in the panning than when further apart.
  • the degree of diffuseness in the rendering of the audio signal is adapted to reflect the degree of artifacts, thereby automatically achieving that the artifacts are obscured in dependence on the significance of the artifacts.
  • the amount of decorrelation may depend on the direction towards the desired rendering position with respect to a reference direction.
  • a nominal listening position and nominal front direction may be defined.
  • the nominal front direction may be from the listening position to the center speaker C.
  • the amount of decorrelation may be varied dependent on the angle between this frontal direction and the direction towards the desired rendering position.
  • the frontal direction may be assumed to correspond to the way a user is facing when listening to the rendered sound.
  • the rendering unit 605 may in such an embodiment introduce a higher decorrelation for a desired rendering position at an angle of 90° than for an angle of 0°.
  • more decorrelation may be introduced for desired rendering positions that are typically to the side of the user than for desired rendering positions that are typically in front or to the back of the user.
  • the system may automatically adjust for variations in the degree of degradation that is perceived as a function of the rendering position with respect to the user.
  • the approach may be used to adapt the rendering to reflect that interpolation by the human brain tends to be more reliable for sound sources in front of or behind the listener, and less reliable for rendering positions to the sides of the listener.
  • the exact amount of decorrelation may depend on a plurality of factors.
  • an algorithm for determining the desired decorrelation may take into account the distance between the speakers, the distance from the desired rendering position to the nearest speaker position, and whether the desired rendering position is in front of or to the side of the listener.
  • the panning operation may include more than two speakers thereby allowing positioning in more dimensions.
  • panning not only takes place in the horizontal plane, but can also be performed in the vertical direction.
  • Such an approach may specifically use three speakers instead of two, as shown in Fig. 11.
  • three correlations are relevant and in order to control these, two (possibly) partial decorrelations can be applied as shown in Fig. 12.
  • the combined panning and decorrelation gains may be determined as:
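
The text does not reproduce the matrix itself; one plausible form, assuming the first speaker receives the clean input signal $s$ and the other two are mixed with the outputs of two decorrelators $d_1(s)$ and $d_2(s)$, is the lower-triangular sketch:

$$\begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix} = R_3 \begin{bmatrix} s \\ d_1(s) \\ d_2(s) \end{bmatrix}, \qquad R_3 = \begin{bmatrix} g_1 & 0 & 0 \\ a_{21} g_2 & a_{22} g_2 & 0 \\ a_{31} g_3 & a_{32} g_3 & a_{33} g_3 \end{bmatrix}$$

with each mixing row of unit energy and the coefficients $a_{ij}$ chosen to realize the three pairwise target correlations.
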
  • the exemplary approach of Fig. 12 uses the input signal (S) for one driver and decorrelates the two other driver signals from the input signal accordingly.
  • Another approach may reorder the rows of matrix $R_3$ such that the speaker with the most energy is fed the scaled input signal (i.e. the first row is permuted to the speaker signal (1, 2 or 3) that has the highest panning gain).
  • An advantage of this approach is that the loudest output signal does not contain any decorrelator signal.
  • a decorrelator signal may introduce artifacts affecting audio quality.
  • the remaining rows may be permuted such that with decreasing energy more decorrelation signal energy is added.
  • the amount of decorrelation applied to a drive signal may depend on an energy of the drive signal, and in particular the decorrelation applied to a drive signal may depend on its energy relative to an energy of another drive signal. In some embodiments, no decorrelation is applied to the drive signal having the highest energy.
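
A sketch of the reordering described above (assuming unit-energy mixing rows ordered from least to most decorrelator content, with row 0 being the pure-input row, as in Fig. 12; names are illustrative):

```python
import numpy as np

def build_r3(panning_gains, mix_rows):
    """Assign the least decorrelated mixes of [s, d1(s), d2(s)] to the
    loudest speakers, then scale each row by that speaker's own panning
    gain, so the loudest output contains no decorrelator signal."""
    order = np.argsort(panning_gains)[::-1]   # speaker indices, loudest first
    R3 = np.empty_like(mix_rows)
    for rank, spk in enumerate(order):
        R3[spk] = panning_gains[spk] * mix_rows[rank]
    return R3
```
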
  • R_3 defines three vectors describing the individual speaker signals:
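(The expression is again not reproduced in this extract; in the sketched form above, z_i would simply be the i-th row of R_3, so that speaker signal y_i = z_i · (s, d_1, d_2)^T.)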
  • the final degree of freedom, i.e. rotation around the s axis, may be used to ensure maximum signal continuity, e.g. by aligning the signals such that the vector with the maximum length of the three vectors z_1, z_2 and z_3 is always associated with a single decorrelator, e.g. d_1.
  • the contribution of one of the decorrelators may be minimized by rotating the vectors z_1, z_2 and z_3 around the s axis. This can be used beneficially to reduce the complexity of the decorrelator that contributes least.
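The rotation in the last two bullets can be made explicit. Since d_1 and d_2 are mutually uncorrelated and uncorrelated with s, post-multiplying R_3 by diag(1, Q(θ)), with Q(θ) a 2×2 rotation acting on the decorrelator columns, leaves all inter-channel correlations unchanged while redistributing the decorrelator contributions. A minimal sketch combining this with the energy-ordered row permutation described above (the R_3 form, the angles and all names are illustrative assumptions, not the embodiments' own formulation):

```python
import numpy as np

def panning_decorrelation_matrix(gains, alpha, beta, theta=0.0):
    """Illustrative R_3 mapping [s, d1, d2] to three speaker feeds.

    gains: panning gains (g1, g2, g3); the loudest speaker gets the dry signal.
    alpha, beta: mixing angles setting the decorrelator energy (choose
                 beta >= alpha so that quieter speakers get more decorrelation).
    theta: rotation around the s axis, remixing the two decorrelator columns
           without changing any inter-channel correlation.
    """
    templates = [
        [1.0, 0.0, 0.0],                               # dry signal only
        [np.cos(alpha), np.sin(alpha), 0.0],
        [np.cos(beta), 0.0, np.sin(beta)],
    ]
    r3 = np.zeros((3, 3))
    for rank, spk in enumerate(np.argsort(gains)[::-1]):  # loudest speaker first
        r3[spk] = gains[spk] * np.asarray(templates[rank])
    rot = np.eye(3)
    rot[1:, 1:] = [[np.cos(theta), -np.sin(theta)],       # Q(theta) on (d1, d2)
                   [np.sin(theta),  np.cos(theta)]]
    return r3 @ rot
```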
  • the rendering unit 605 may be arranged to modify the frequency response for at least one of the first and second signals dependent on the desired rendering position.
  • the transfer function representing the signal path from the audio signal (or decorrelated audio signal) to the drive signal for the loudspeaker may be dependent on the desired rendering position.
  • the frequency response may be modified to reflect an ear response.
  • the (front-back) asymmetry of a person's head, and specifically of the ears, introduces a frequency-selective variation that depends on the direction from which the sound is received.
  • ear and head shadowing may introduce a frequency response dependent on the direction from which the sound is received.
  • the rendering unit 605 may emulate elements of such a frequency variation to improve the position perception for the listener.
  • the frequency response may alternatively or additionally be modified dependent on the position of the loudspeakers.
  • the frequency response may be dependent on the angle between the speakers and a reference direction, which may specifically be a direction corresponding to a (nominal) forward direction.
  • the frequency response may accordingly be different for speakers at different positions.
  • the frequency response may be dependent on both the desired rendering position and the speaker positions.
  • equalization may be applied to account for coloration differences due to speaker positions vs. intended source position.
  • sources in the back rendered with the surround speakers of a 5.1 configuration may benefit from a lowered level of high frequencies to account for increased head- and ear-shadowing for rear sources compared to shadowing from the position of the speakers.
  • coloration may also be applied to improve the perception of a virtual sound source. For example, a virtual phantom center back speaker can be realized by playing a coherent (or decorrelated) sound through both the left and right surround speakers.
  • often, however, such a virtual back speaker is perceived in front of the listener (known as so-called front-back confusion).
  • one of the effects that cause this front-back confusion is a mismatch in the spectral cues between an actual center back speaker and the phantom sound source.
  • the frequency modification applied by the head and ear shadowing for sounds arriving from the back of the listener is not present for either of the sounds arriving from the surround speakers since these are substantially to the side of the user.
  • this effect can be emulated thereby reducing the risk of front-back confusion.
  • a position dependent filtering may be applied to the signals for the speakers.
  • the speaker signal is filtered with a filter h(p_spkX, p_spkY, p_obj) (e.g. an FIR filter) to obtain a processed speaker signal.
  • the filter h(p_spkX, p_spkY, p_obj) is a function of the actual speaker positions p_spkX, p_spkY and the object/virtual channel position, i.e. the desired rendering position.
  • the filter may be tabulated or may be parameterized.
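A minimal sketch of the tabulated variant (the table layout, the angular quantization and all names are hypothetical and for illustration only):

```python
import numpy as np

STEP = 30  # hypothetical angular quantization of the filter table, in degrees

def _quantize(az_deg):
    return int(round(az_deg / STEP)) * STEP

# Hypothetical table of short FIR filters indexed by quantized
# (speaker X azimuth, speaker Y azimuth, object azimuth); in practice the
# entries would be derived from measured ear responses (cf. Figs. 14 and 15).
FILTER_TABLE = {
    (sx, sy, ob): np.array([1.0, 0.0, 0.0])  # placeholder: identity FIR
    for sx in range(-180, 181, STEP)
    for sy in range(-180, 181, STEP)
    for ob in range(-180, 181, STEP)
}

def apply_h(signal, p_spk_x_deg, p_spk_y_deg, p_obj_deg):
    """Filter a speaker feed with the tabulated h(p_spkX, p_spkY, p_obj)."""
    h = FILTER_TABLE[(_quantize(p_spk_x_deg),
                      _quantize(p_spk_y_deg),
                      _quantize(p_obj_deg))]
    return np.convolve(signal, h)[: len(signal)]
```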
  • Fig. 14 shows the ear responses for a physical center rear source, a phantom rear source and a physical center front source.
  • the coloration of the phantom source is clearly different from both physical sources, and clearly contains more high frequency content than the physical rear source. This may give rise to front-back confusion.
  • Fig. 15 illustrates the difference between the coloration of the phantom source and the physical rear source. This may be used for coloration compensation to compensate for the differences between the physical speaker and the phantom source. The compensation will vary with the position of the phantom source and the position of the physical sources used to create the phantom source.
  • the invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
  • the invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors.
  • an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Stereophonic System (AREA)

Abstract

An apparatus generates signals for audio transducers from an audio signal so that the audio signal can be spatially rendered. The apparatus comprises a receiver (603) which receives position data indicating a desired rendering position for the audio signal. A drive signal generator (605) generates a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position. The first and second signals are generated using a panning operation for the audio signal based on the desired rendering position. The drive signal generator (605) furthermore decorrelates the first drive signal relative to the second drive signal. The degree of decorrelation depends on the first position, and may specifically depend on the distance (including angular distance) between the speakers and/or to the desired rendering position. The invention may reduce the perceptibility of artifacts introduced by the panning.

Description

Generating drive signals for audio transducers
FIELD OF THE INVENTION
The invention relates to generation of drive signals for audio transducers and in particular, but not exclusively, to generation of drive signals from audio signals
representing audio objects not associated with any specific audio transducer rendering configuration.
BACKGROUND OF THE INVENTION
Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication increasingly has replaced analogue representation and communication. For example, audio content, such as speech and music, is increasingly based on digital content encoding. Furthermore, audio consumption has increasingly become an enveloping three dimensional experience with e.g. surround sound and home cinema setups becoming prevalent.
Audio encoding formats have been developed to provide increasingly capable, varied and flexible audio services and in particular audio encoding formats supporting spatial audio services have been developed.
Well known audio coding technologies like DTS and Dolby Digital produce a coded multi-channel audio signal that represents the spatial image as a number of channels that are placed around the listener at fixed positions. For a speaker setup that is different from the setup that corresponds to the multi-channel signal, the spatial image will be suboptimal. Also, these channel based audio coding systems are typically not able to cope with a number of speakers that is different from the number of speakers represented by the multi-channel signal.
MPEG Surround provides a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel audio applications. Fig. 1 illustrates an example of elements of an MPEG Surround system. Using spatial parameters obtained by analysis of the original multichannel input, an MPEG Surround decoder can recreate the spatial image by a controlled upmix of the mono- or stereo signal to obtain a multichannel output signal. Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround allows for decoding of the same multi-channel bit-stream by rendering devices that do not use a multichannel speaker setup. An example is virtual surround reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided while using regular headphones. Another example is the pruning of higher order multichannel outputs, e.g. 7.1 channels, to lower order setups, e.g. 5.1 channels.
Indeed, the variation and flexibility in the rendering configurations used for rendering spatial sound has increased significantly in recent years with more and more reproduction formats becoming available to the mainstream consumer. This requires flexible representation of audio. Important steps have been taken with the introduction of the MPEG Surround codec. Nevertheless, audio is still produced and transmitted for a specific loudspeaker setup. Reproduction over different setups and over non-standard (i.e. flexible or user-defined) speaker setups is not specified. Indeed, there is a desire to make audio encoding and representation increasingly independent of specific predetermined and nominal speaker setups. It is increasingly preferred that flexible adaptation to a wide variety of different speaker setups can be performed at the decoder/rendering side.
In order to provide for a more flexible representation of audio, MPEG standardized a format known as 'Spatial Audio Object Coding' (MPEG-D SAOC). In contrast to multichannel audio coding systems such as DTS, Dolby Digital and MPEG Surround, SAOC provides efficient coding of individual audio objects rather than audio channels. Whereas in MPEG Surround, each speaker channel can be considered to originate from a different mix of sound objects, SAOC makes individual sound objects available at the decoder side for interactive manipulation as illustrated in Fig. 2. In SAOC, multiple sound objects are coded into a mono or stereo downmix together with parametric data allowing the sound objects to be extracted prior to the rendering thereby allowing the individual audio objects to be available for manipulation e.g. by the end-user.
Indeed, similarly to MPEG Surround, SAOC also creates a mono or stereo downmix. In addition, object parameters are calculated and included. At the decoder side, the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb. Fig. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream. By means of a rendering matrix, individual sound objects are mapped onto speaker channels. SAOC allows a more flexible approach and in particular allows more rendering-based adaptability by transmitting audio objects instead of only reproduction channels. This allows the decoder side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by speakers. This way there is no relation between the transmitted audio and the reproduction or rendering setup, hence arbitrary speaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the speakers are rarely at the intended positions. In SAOC, it is decided at the decoder side where the objects are placed in the sound scene, which is often not desired from an artistic point-of-view. The SAOC standard does provide ways to transmit a default rendering matrix in the bitstream, eliminating the decoder responsibility. However, the provided methods rely on either fixed reproduction setups or on unspecified syntax. Thus SAOC does not provide normative means to fully transmit an audio scene independently of the speaker setup. Also, SAOC is not well equipped for the faithful rendering of diffuse signal components. Although there is the possibility to include a so-called multichannel background object to capture the diffuse sound, this object is tied to one specific speaker configuration.
Another specification for an audio format for 3D audio is being developed by the 3D Audio Alliance (3DAA), which is an industry alliance. 3DAA is dedicated to developing standards for the transmission of 3D audio that "will facilitate the transition from the current speaker feed paradigm to a flexible object-based approach". In 3DAA, a bitstream format is to be defined that allows the transmission of a legacy multichannel downmix along with individual sound objects. In addition, object positioning data is included. The principle of generating a 3DAA audio stream is illustrated in Fig. 4.
In the 3DAA approach, the sound objects are received separately in the extension stream and these may be extracted from the multi-channel downmix. The resulting multi-channel downmix is rendered together with the individually available objects.
The objects may consist of so called stems. These stems are basically grouped (downmixed) tracks or objects. Hence, an object may consist of multiple sub-objects packed into a stem. In 3DAA, a multichannel reference mix can be transmitted with a selection of audio objects. 3DAA transmits the 3D positional data for each object. The objects can then be extracted using the 3D positional data. Alternatively, the inverse mix-matrix may be transmitted, describing the relation between the objects and the reference mix.
From the description of 3DAA, sound-scene information is likely transmitted by assigning an angle and distance to each object, indicating where the object should be placed relative to e.g. the default forward direction. This is useful for point-sources but fails to describe wide sources (like e.g. a choir or applause) or diffuse sound fields (such as ambiance). When all point-sources are extracted from the reference mix, an ambient multichannel mix remains. Similar to SAOC, the residual in 3DAA is fixed to a specific speaker setup.
Thus, both the SAOC and 3DAA approaches incorporate the transmission of individual audio objects that can be individually manipulated at the decoder side. A difference between the two approaches is that SAOC provides information on the audio objects by providing parameters characterizing the objects relative to the downmix (i.e. such that the audio objects are generated from the downmix at the decoder side) whereas 3DAA provides audio objects as full and separate audio objects (i.e. that can be generated independently from the downmix at the decoder side).
In MPEG, a new work item on 3D Audio, referred to as MPEG-H 3D Audio, is currently being initiated. Fig. 5 provides an illustration of the current high level block diagram of the intended MPEG 3D Audio system.
In addition to the traditional channel based format, object based and scene based formats are also to be supported. An important aspect of the system is that its quality should scale to transparency for increasing bitrate. This puts a burden on the use of parametric coding techniques that have been used quite heavily in the past (viz. MPEG-4 HE-AAC v2, MPEG-D MPEG Surround, MPEG-D SAOC, MPEG-D USAC).
An important feature of the standard is that the encoded bitstream should be independent of the reproduction/rendering setup. Envisioned reproduction possibilities include flexible loudspeaker setups (envisaged up to 22.2 channels), virtual surround over headphones, and closely spaced speakers. Flexible loudspeaker setups refer to any number of speakers at arbitrary physical locations.
The decoder of MPEG 3D Audio is intended to comprise a rendering module that is responsible for translating the decoded individual audio channels/objects into speaker feeds based on the physical location of the speakers, i.e. based on the specific rendering speaker configuration/ setup.
Thus, audio distribution and transmission approaches and standards are increasingly being driven towards an independence of the rendering setup. This requires the receiving/ decoding end to be able to adapt the processing and rendering to match the specific rendering setup used.
The rendering of the audio is accordingly dependent on the physical locations of the speakers of the rendering configuration. These positions may be determined or provided in various ways. For example, they may simply be provided by a direct user input, such as by the user directly providing a user input indicating the floor plan of the speaker locations, e.g. using a mobile app interface.
Several fully or semi-automatic methods also exist for determining speaker positions. Most methods comprise relative speaker position location algorithms. They use e.g. ultrasound or audible signals to determine the relative positions. The acoustic methods (both those using ultrasonic and audible sound) are typically based on the concept of acoustic Time-Of-Flight, which means that the distance between any two speakers is determined by measuring the time it takes for sound to travel from one speaker to the other. This requires a microphone (or ultrasound receiver) to be integrated into each loudspeaker.
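As a concrete illustration of how relative positions could be recovered from such pairwise Time-Of-Flight measurements, classical multidimensional scaling can be applied; this sketch is an assumption about one possible implementation, not a method specified here:

```python
import numpy as np

def speaker_positions_from_tof(tof_s, c=343.0):
    """Relative 2D speaker positions from pairwise acoustic time of flight.

    tof_s: symmetric (n, n) matrix of travel times in seconds.
    Classical multidimensional scaling: positions are recovered only up to
    rotation, reflection and translation.
    """
    d2 = (c * np.asarray(tof_s)) ** 2        # squared inter-speaker distances
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    b = -0.5 * j @ d2 @ j                    # Gram matrix of centered positions
    w, v = np.linalg.eigh(b)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:2]            # two largest eigenvalues -> 2D
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```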
For a truly transparent rendering, the positioning of the loudspeaker set within the room may also be relevant. Again, this information may be provided manually or via automated methods. E.g. ultrasound reflections may be used to automatically detect the distance to room boundaries (walls, ceiling, floor) and general room dimensions. Together, this information gives a full description of the rendering configuration.
Another requirement resulting from the speaker configuration independent audio provision is that the individual rendering device must position the different audio sources. Such positioning is traditionally performed at the content creation side, and is often manually performed or directly results from the recording signals. Furthermore, the positioning is conventionally performed based on a set of audio channels that are each associated with a fixed nominal position. Therefore, the rendering device merely needs to render the received audio signals and does not need to perform any positioning.
However, for rendering configuration independent audio provision, the rendering device needs to position the sound sources appropriately in the audio scene generated by the rendering of audio from the specific speaker configuration. The positioning may often be based on position information received from the source, e.g. a desired position may be received for each audio object, but may be locally modified or changed. Based on the position of a given audio signal, the rendering device must generate drive signals for the individual loudspeakers such that the audio, at a (nominal) listening position, is perceived to originate from the given position.
An approach for positioning sound sources is to use a panning algorithm where the relative levels of the resulting drive signals for individual speakers are adjusted such that the audio signal is perceived as a sound source at the desired position. In a simple, two speaker amplitude panning approach, two loudspeakers can radiate coherent signals with different amplitudes (except for the situation where the sound source is positioned exactly midway between the speakers). The listener perceives this as a virtual sound source positioned at a position between the speakers given by the relative amplitude levels. Thus, the relation of amplitudes of the emanating signals controls the perceived direction of the virtual source. When multiple loudspeakers are applied to a horizontal plane, a virtual source can be positioned in any direction on the plane using two adjacent loudspeakers surrounding the virtual source. This method is called a pair-wise panning paradigm.
The loudspeaker pair need not be in front of the listener. There typically exist, however, some limitations in the effectiveness of the approach for loudspeaker placement to the side of the listener. The loudspeakers should furthermore preferably both be either in front of the listener or behind the listener. If a loudspeaker configuration has loudspeakers both behind and in front of the listener, the use of such a pair of speakers results in a gap in the directions at which the virtual sources can be positioned.
For 3D audio positioning, the loudspeaker setup will include speakers that are not in the same horizontal plane, e.g. it may include elevated loudspeakers. A suitable approach for 3D audio rendering is so-called Vector Base Amplitude Panning (VBAP) described in Pulkki V. Virtual source positioning using vector base amplitude panning, Journal of the Audio Engineering Society 1997; 45(6):456-466. The loudspeaker setup can be divided into triangles (loudspeaker triplets), with the audio signal for a given position being positioned by a panning of one triplet. A loudspeaker triplet may be formulated using vectors. The unit-length vectors l_m, l_n and l_k point from the listening position to the loudspeakers. The direction of the virtual source is presented with the unit-length vector p, which is expressed as a linear weighted sum of the loudspeaker vectors:
p = g_m·l_m + g_n·l_n + g_k·l_k
Here g_m, g_n and g_k are called the gain factors of the respective loudspeakers. The gain factors can be solved as g = p^T L_mnk^(-1), where g = [g_m g_n g_k]^T and L_mnk = [l_m l_n l_k].
The loudspeaker setup is divided into triangles forming a triangle set. During the panning process, a single triangle from the set is chosen to be used for the panning. The selection can be made by calculating the gain factors in each loudspeaker triangle in the triangle set and selecting the triangle that produced non-negative factors. If the triangles in the set are non-overlapping, the selection is unambiguous.
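A minimal sketch of the gain computation and triangle selection just described (function and variable names are illustrative; the gain normalization is a common convention rather than something stated above):

```python
import numpy as np

def vbap_gains(p, speaker_dirs, triangles):
    """Select a loudspeaker triplet and compute VBAP gains for direction p.

    p: unit vector towards the virtual source.
    speaker_dirs: (n, 3) array of unit vectors towards the loudspeakers.
    triangles: iterable of index triplets (m, n, k) forming the triangle set.
    """
    for tri in triangles:
        l_mnk = np.asarray([speaker_dirs[i] for i in tri])  # rows l_m, l_n, l_k
        g = p @ np.linalg.inv(l_mnk)                        # g = p^T L_mnk^-1
        if np.all(g >= -1e-9):                              # non-negative gains found
            return tri, g / np.linalg.norm(g)               # constant-power scaling
    raise ValueError("direction not covered by the triangle set")
```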
In summary, due to the increasing number of different speaker setups in cinemas and households, the conventional channel-based format for audio transmission is becoming inefficient. Either multiple mixes must be transmitted (e.g. multicast of stereo and 5.1 multichannel) or the transmitted configuration must be mapped to correspond to the speaker setup (e.g. playing 5.1 over a 7.1 setup). Moreover, even with a matching number of speakers, the actual setup in a living room rarely matches the specification of the nominal setup.
In the upcoming trend of 3D audio, transmission is envisioned to be independent of the rendering speaker setup. Therefore, the received bitstream can be used for rendering to an arbitrary speaker setup. The scene intended by the audio engineer is mapped to the available speakers using their actual positions.
This moves the responsibility of the generation of suitable drive signals providing a perception of the sound originating at a given position to the
receiver/decoder/renderer. In practice this may result in a transmission of audio objects along with position information indicating where the object should be rendered in (3D) space. A multitude of algorithms is available to generate speaker signals from this information, for example Vector-Based Amplitude Panning.
However, where audio engineers take into account the limitations of a certain setup, most panning algorithms do not. Indeed, known panning rules require the speakers to be positioned relatively close together. When speakers are further apart, the introduced binaural cues, consisting of, amongst others, the inter-aural time difference and level difference in the listener's ears, do not correspond with the direction of the phantom source. Rather, the sound is perceived to be generated from two or three other positions. The physical sources still introduce their own localization cues in the form of coloration, so-called spectral cues.
Hence, panning between speakers that are widely spaced does not yield a well-placed source. Also, front-back confusion may arise when panning between the two surround speakers of a 5.1 configuration. In addition, part of the audio is perceived at the location of the speakers.
In practice, many speaker configurations, such as e.g. a 5.1 loudspeaker configuration, utilize speakers that are relatively far apart and which accordingly provide a suboptimal perception of the virtual sound source at the desired position.
Panning between two speakers (e.g. front-left and front-right) introduces a sweet-spot, or in fact a 'sweet-plane', which is the plane where the distance to both speakers is equal. When other pairs are introduced (e.g. front-left and left-surround) this 'sweet-plane' becomes a vertical 'sweet-line'. When elevated speakers are used to pan elevated objects the sweet-spot is also limited in height. This is even more problematic than the 'sweet-line' since people are generally not equally tall and therefore not listening at the same height.
Solutions based on crosstalk cancelation can be used to introduce improved localization cues at the ears of the listener. However, such approaches are complex, sensitive to imperfections, have a narrow sweet-spot due to phase manipulation, and require personalized components in order to work well.
Hence, an improved approach would be advantageous and in particular an approach allowing increased flexibility, improved positioning of audio sources, improved adaptability to different rendering configurations, reduced complexity, an improved user experience, and/or improved performance would be advantageous.
SUMMARY OF THE INVENTION
Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.
According to an aspect of the invention, there is provided an apparatus for generating drive signals for audio transducers, the apparatus comprising: an audio receiver for receiving an audio signal; a position receiver for receiving position data indicative of a desired rendering position for the audio signal; a drive signal generator for generating at least a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position, the drive signal generator being arranged to generate the drive signals in response to a panning for the audio signal in response to the desired rendering position; and wherein the drive signal generator is arranged to decorrelate the first drive signal relative to the second drive signal, a degree of decorrelation being dependent on an indication of the first position.
The invention may provide an improved audio experience, and in particular an improved spatial audio experience. The approach may support rendering over a wide range of loudspeaker configurations with increased adaptability of the user experience to the given configuration.
In particular, an improved perception of a sound source at a desired position may be provided, and often with a reduced sensitivity to specific loudspeaker configurations. For example, improved performance may be achieved for loudspeaker configurations having a relatively large distance between loudspeakers. The approach may in many scenarios result in mitigation of imperfections of a panning operation. Specifically, in many embodiments, the perception of a given sound source as also originating from the positions of the speakers involved in the panning may be reduced substantially.
The approach may specifically reduce the correlation between the speaker signals used to generate a panned phantom source thereby reducing the perceptibility of imperfections of the panning operation. For example, using panning for localization between widely spaced speakers tends to result in artifacts, including the perception of additional sound sources at the speaker positions. By reducing the correlation between the speaker signals, the sound source is rendered more diffusely but still with a directional component originating from the desired position. It is often preferable to have a perceived sound source which is perceived as coming from a less-defined, but still more or less correct direction, than to have a sound source which is perceived as, for example, coming from two distinct loudspeaker positions or from a completely wrong position (e.g. front-back reversal).
The panning operation may comprise and/or consist in setting relative levels and/or gains for the first and second drive signal in response to the desired rendering position. In particular, the levels/ gains may be set such that the audio signal will be perceived to originate from the desired rendering position at a (nominal) listening position.
The desired rendering position may be a three dimensional, two dimensional or one dimensional position. Similarly, the panning operation may be a three dimensional, two dimensional or one dimensional operation. For example, a three dimensional system may consider both a horizontal angular direction (azimuth), a vertical angular direction (elevation), and a distance from a (nominal) listening position. A two dimensional system may e.g.
consider a horizontal angular direction (azimuth) as well as an elevation or distance. A one dimensional system may e.g. consider only a horizontal angular direction (azimuth).
Specifically, the desired rendering position may be an angular direction (azimuth) from a (nominal) listening position.
The apparatus may be arranged to receive audio transducer position data indicative of the positions of the first and second audio transducers, i.e. to receive an indication of at least the first position. The data may e.g. be received from an internal source (such as a memory), a user input, or a remote source. Likewise, the audio signal may be received from an internal or external source. The desired rendering position may also be received from any internal or external source, and may for example be received from a remote source together with the audio signal, or may be locally provided or generated. The first position may be a three dimensional, two dimensional, or one dimensional position. Similarly, the second position may be a three dimensional, two dimensional, or one dimensional position. The first position may be represented by any indication of a position including a three dimensional, two dimensional, or one dimensional position indication. In particular, the first (and/or second) position may be represented by an angular direction (azimuth) from a (nominal) listening position. The position receiver may receive an indication of the first position (from an external or internal source), and the drive signal generator may determine the degree of decorrelation dependent on the indication of the first position.
The indication of the first position may be an indication of an absolute position or may e.g. be an indication of a relative position, such as an indication of the first position relative to the second position and/or to a listening position. The indication of the first position may be a partial indication of the first position (e.g. may only provide an indication in one dimension, such as an indication of an angle from a listening position to the first position, e.g. relative to a reference direction).
The audio signal may for example be an audio object, audio scene, audio channel or audio component. The audio signal may be part of a set of audio signals, such as e.g. an audio component in an encoded data stream comprising a plurality of (possibly different types of) audio items.
In accordance with an optional feature of the invention, the degree of decorrelation is dependent on an indication of the first position relative to the second position.
This may provide improved rendering in many embodiments, and may in particular allow efficient and accurate adaptation of the characteristics of the rendering to the specific audio transducer configuration. In many embodiments, the relative positions of audio transducers involved in a panning operation may have a strong influence on the performance, accuracy and possible artifacts of the operation, and thus an adaptation of the decorrelation based on a measure of a relative positioning of the audio transducers may provide a particularly suitable adaptation of the rendering.
The dependency of the degree of decorrelation on the (indication of) the first position may specifically be a dependency on the (indication of) the first position relative to the second position. The indication of the first position relative to the second position may for example be an indication of the difference between the positions, e.g. measured as a distance along a line between the first and second position, or as an angular distance measured relative to a (typically nominal) listening position.
In accordance with an optional feature of the invention, the degree of decorrelation is dependent on an indication of an angle between a direction from a listening position to the first position and a direction from the listening position to the second position.
This may provide improved rendering in many embodiments, and may in particular allow efficient and accurate adaptation of the characteristics of the rendering to the specific audio transducer configuration. In many embodiments, the angular
difference/distance to audio transducers involved in a panning operation from a listening position may have a strong effect on the performance, accuracy and possible artifacts of the operation, and thus an adaptation of the decorrelation based on a measure of the angular difference/ distance may provide a particularly suitable adaptation of the rendering.
The dependency of the degree of decorrelation on the (indication of) the first position may specifically be a dependency on the (indication of) the angle between a direction from a listening position to the first position and a direction from the listening position to the second position.
In accordance with an optional feature of the invention, the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an indication of a distance between the first position and the second position.
This may provide an improved adaptation to specific audio transducer configurations, and may in particular allow an improved trade-off between degradations resulting from imperfect panning and the definiteness of the perceived sound source position. An improved user experience is typically provided with the localization effect being adapted to the specific audio transducer setup.
The distance may be an angular distance. The angular distance may be measured from a (nominal) listening position.
In accordance with an optional feature of the invention, the drive signal generator is arranged to increase decorrelation for an indication of increasing distance.
This may provide improved rendering in many embodiments. The distance may be an angular distance. The angular distance may be measured from a (nominal) listening position. The drive signal generator may be arranged to increase decorrelation for increasing angular distance between the first and second positions. The degree of
decorrelation may specifically be a monotonically increasing function of the distance.
In accordance with an optional feature of the invention, the drive signal generator is arranged to only decorrelate the first drive signal relative to the second drive signal when the indication of the distance is indicative of a distance above a threshold.
This may allow improved performance, and may specifically allow the increased diffuseness of the rendering to be limited to scenarios wherein the panning operation is considered to provide insufficient rendering quality. The threshold may in many embodiments advantageously correspond to an angular difference (from a nominal listening position) belonging to the interval [45°;75°], [50°;70°], or [55°;65°], and may specifically advantageously be substantially 60°.
In accordance with an optional feature of the invention, the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an indication of a distance between the desired rendering position and at least one of the first position and the second position.
This may provide an improved adaptation to specific audio transducer configurations, and may in particular allow an improved trade-off between degradations resulting from imperfect panning and a degree of localization. An improved user experience is typically provided with the localization effect being adapted to the specific audio transducer setup and position being rendered.
The distance may be an angular distance. The angular distance may be measured from a (nominal) listening position.
In some embodiments, the degree of decorrelation is dependent on a distance between the desired rendering position and a closest speaker position of at least one of the first position and the second position.
In accordance with an optional feature of the invention, the drive signal generator is arranged to increase decorrelation for an indication of increasing distance.
This may provide improved rendering in many embodiments. The distance may be an angular distance. The angular distance may be measured from a (nominal) listening position. The drive signal generator may be arranged to increase decorrelation for increasing angular distance between the desired rendering position and at least one of the first and second positions. The degree of decorrelation may specifically be a monotonically increasing function of the distance.
In some embodiments, the degree of decorrelation may be increased for an increasing distance between the desired rendering position and a closest speaker position of at least one of the first position and the second position.
In accordance with an optional feature of the invention, the drive signal generator furthermore comprises a frequency response modifier arranged to modify a frequency response for at least the first drive signal in response to the desired rendering position.
This may provide an improved rendering in many embodiments and may in particular allow improved direction perception by a listener. In particular, the feature may allow improved back to front resolution in many scenarios.
In accordance with an optional feature of the invention, the modification of the frequency response is dependent on an ear response for a direction from a listening position to the desired rendering position.
This may provide an improved rendering in many embodiments and may in particular allow improved direction perception by a listener. In particular, the feature may allow improved back to front resolution in many scenarios.
The modification of the frequency response may specifically be dependent on an ear response for a direction from the listening position to the desired rendering position relative to a reference direction, e.g. corresponding to a nominal listener orientation.
In accordance with an optional feature of the invention, the drive signal generator furthermore comprises a frequency response modifier arranged to modify a frequency response for at least the first drive signal dependent on the first position.
This may provide an improved rendering in many embodiments and may in particular allow improved direction perception by a listener. In particular, the feature may allow improved back to front resolution in many scenarios. Specifically, different frequency equalization/ coloration may be used for different speakers.
In some embodiments, the drive signal generator may further comprise means arranged to modify a frequency response for the second drive signal dependent on the first position.
In accordance with an optional feature of the invention, the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an angular direction from a listening position to the desired rendering position relative to a reference direction.
This may provide improved performance in many scenarios, and may specifically improve the spatial perception.
The reference direction may typically be a listening direction, such as a nominal forward direction of a listener at a nominal listening position.
In accordance with an optional feature of the invention, the signal generator is further arranged to generate a third drive signal for a third audio transducer associated with a third position in response to the panning operation for the audio signal in response to the desired rendering position; and the drive signal generator is arranged to decorrelate the third drive signal relative to the first drive signal and to decorrelate the third drive signal relative to the second drive signal.
This may in many embodiments allow improved rendering, and may allow positioning in an additional dimension.
In accordance with an optional feature of the invention, the signal generator comprises a decorrelator for decorrelating the first drive signal relative to the second drive signal.
This may allow a high performance and/or low complexity implementation.
According to an aspect of the invention there is provided a method of generating drive signals for audio transducers for rendering an audio signal, the method comprising: receiving the audio signal; receiving position data indicative of a desired rendering position for the audio signal; generating at least a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position, the drive signals being generated in response to a panning for the audio signal in response to the desired rendering position; and wherein generating the first drive signal comprises decorrelating the first drive signal relative to the second drive signal, a degree of decorrelation being dependent on an indication of the first position.
These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which
Fig. 1 illustrates an example of elements of an MPEG Surround system in accordance with the prior art;
Fig. 2 exemplifies the manipulation of audio objects possible in MPEG SAOC;
Fig. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream;
Fig. 4 illustrates an example of the principle of audio encoding of 3DAA in accordance with the prior art;
Fig. 5 illustrates an example of the principle of audio encoding envisaged for MPEG 3D Audio in accordance with the prior art;
Fig. 6 illustrates an example of an audio rendering system in accordance with some embodiments of the invention;
Fig. 7 illustrates an example of a loudspeaker rendering configuration;
Fig. 8 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention;
Fig. 9 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention;
Fig. 10 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention;
Fig. 11 illustrates an example of a three speaker rendering configuration;
Fig. 12 illustrates an example of an audio rendering unit in accordance with some embodiments of the invention;
Fig. 13 illustrates an example of a panning operation processing for rendering of an audio signal in accordance with some embodiments of the invention;
Fig. 14 illustrates an example of ear frequency responses for audio signals from different directions; and
Fig. 15 illustrates an example of a frequency response modification for rendering of an audio signal in accordance with some embodiments of the invention.
DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION
The following description focuses on embodiments of the invention applicable to rendering of spatial audio data which includes audio objects that are not associated with any specific speaker configuration. However, it will be appreciated that the invention is not limited to this application but may be applied to many other audio signals and audio renderings.
Fig. 6 illustrates an example of an audio renderer in accordance with some embodiments of the invention.
The audio renderer comprises an audio receiver 601 which is arranged to receive audio data for audio that is to be rendered. The audio data may be received from any internal or external source. For example, the audio data may be received from any suitable communication medium including direct communication or broadcast links. For example, communication may be via the Internet, data networks, radio broadcasts etc. As another example, the audio data may be received from a physical storage medium such as a CD, Blu-Ray™ disc, memory card etc. As yet another example, the audio data may be generated locally, e.g. by a 3D audio model (as e.g. used by a gaming application).
In the example, the audio data comprises a plurality of audio components which may include audio channel components associated with a specific rendering loudspeaker configuration (such as a spatial audio channel of a 5.1 surround signal) or audio objects that are not associated with any specific rendering loudspeaker configuration.
In the following, the rendering of an audio signal will be described. The audio signal is specifically one component of the received audio data, and indeed the following description will focus on a rendering of an audio object of the received audio data. However, it will be appreciated that the described approach may be used with other audio components, including for example audio channels and audio signals extracted e.g. from audio channels (e.g. corresponding to individual sound sources embedded in the audio channels). It will also be appreciated that whereas the following description focusses on rendering of a single audio signal, rendering of other audio signals may be performed in parallel and these audio signals may be rendered simultaneously from the same loudspeakers, and indeed the rendering of these other audio signals may follow the same approach. Indeed, in a typical embodiment, all received audio components of the received audio data will be rendered in parallel thereby generating an audio scene represented by the audio data. It will also be appreciated that the described approach may only be applied to some of the audio components, or may indeed be applied to all received audio components.
The audio renderer of Fig. 6 further comprises a position receiver 603 which is arranged to receive position data which is indicative of a desired rendering position for the audio signal.
The following description focusses on a scenario wherein the position data is received together with the audio data from a remote source. For example, a single data stream may be received, e.g. via the Internet, with the single data stream comprising a number of audio signals defining audio objects and position data defining a recommended rendering position for each of the audio objects. Thus, the following description will focus on an example wherein an audio signal corresponding to an audio object is rendered such that it is perceived to originate at a desired position indicated by position data received together with the audio signal.
It will be appreciated that in other embodiments, the audio signal may be rendered at other positions. For example, the position indicated by the received position data may be modified locally, e.g. in response to a manual user input. Thus, the desired position at which the audio renderer tries to render the audio signal may be determined by a local modification or manipulation of the received indicated position.
In other embodiments, the position data may not be received with the audio data but may be received from another source, including both external and internal sources. For example, the audio renderer may include a position processor which automatically or in response to user inputs generates desired positions for various audio objects. Such an embodiment may be particularly suitable for scenarios wherein the audio data is also locally generated. For example, for a gaming or virtual world application, a three dimensional model may be generated and used to generate both audio signals and associated positions.
The audio receiver 601 and position receiver 603 are coupled to a rendering unit 605 which is arranged to generate signals for individual audio transducers. Thus, for a given set of audio transducers the rendering unit 605 generates one signal for each of the audio transducers, and thus the output set of signals comprises one individual signal for each audio transducer of a set of audio transducers.
The system of Fig. 6 renders the audio using a plurality of audio transducers in the form of a set of loudspeakers 607, 609 that are (assumed to be) arranged in a given speaker configuration.
Fig. 7 illustrates an example of a speaker configuration comprising five speakers, namely a center speaker C, a left front speaker L, a right front speaker R, a left surround (or rear) speaker LS, and a right surround (or rear) speaker RS. The speakers are in this example positioned on a circle around a listening position. The speaker configuration is in the example referenced to a listening position and furthermore to a listening orientation; thus, a nominal listening position and orientation is assumed for the rendering. The rendering seeks to position the audio signal such that it, for a listener positioned at the nominal listening position and with the nominal listening orientation, will be perceived to originate from a sound source in the desired direction.
In the example, the positions may specifically be positions that are defined with respect to the (nominal) listening position and to the (nominal) listening orientation. In many embodiments, positions may only be considered in a horizontal plane, and distance may often be ignored. In such examples, the position may be considered as a one-dimensional position which is given by an angular direction relative to the reference direction which is given as a specific direction from the listening position. The reference direction may typically correspond to the direction assumed to be directly in front of the nominal listener, i.e. to the forward direction. Specifically, in Fig. 7, the reference direction is that from the listening position to the front center speaker C.
In the following, the angle between the reference direction and the direction from the listening position to a given speaker will simply be referred to as the angular position of the speaker.
In the example of Fig. 7, the angular positions of the front speakers are ±30°, and the angular positions of the rear speakers are ±110°. Thus, the angular distance between the front right speaker R and the right surround speaker RS is 80°.
The example of Fig. 7 corresponds to a five channel surround sound configuration. However, in other scenarios, other loudspeaker rendering configurations may be used, including for example a larger number of speakers, elevated speakers, asymmetric speaker locations etc.
The rendering unit 605 is arranged to render the audio signal to be perceived to originate from the desired position. For brevity, the following description will focus on an example wherein the desired position is given as an angle with respect to the reference (forward) direction from the listening position to the center speaker C. Thus, a simple one dimensional positioning given by the angular direction will be considered, and distance and elevation characteristics will be ignored.
The rendering unit 605 is arranged to position the audio signal at a sound source position using a panning operation. In the example, the positioning of the audio signal is specifically done by panning using the two nearest speakers. Thus, for an angle in the interval [0°;30°], the rendering unit 605 will perform a panning between the center speaker C and the right speaker R; for an angle in the interval [0°;-30°], between the center speaker C and the left speaker L; for an angle in the interval [30°;110°], between the right speaker R and the right surround speaker RS; for an angle in the interval [-30°;-110°], between the left speaker L and the left surround speaker LS; and for an angle in the interval [-110°;-180°] or [110°;180°], between the left surround speaker LS and the right surround speaker RS.
More generally, the rendering unit 605 is arranged to select the two speakers nearest to the desired position and to position the audio signal between the two speakers using panning. In Fig. 6, the two selected speakers are illustrated as speakers 607, 609 which e.g. may represent any speaker pair of Fig. 7 as described above. The rendering unit 605 comprises a panning processor 611 which is arranged to perform a panning operation in order to generate output signals that when rendered will result in the audio signal being perceived by a listener at the nominal listening position to predominantly originate from the desired position.
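For the five-speaker layout of Fig. 7, this nearest-pair selection can be sketched as follows (the azimuths are those of the example; the wrap-around handling and all names are implementation assumptions):

```python
SPEAKER_AZ = {"C": 0.0, "R": 30.0, "RS": 110.0, "LS": -110.0, "L": -30.0}

def bracketing_pair(src_deg):
    """Select the two speakers adjacent to the desired azimuth on the circle."""
    ordered = sorted(SPEAKER_AZ.items(), key=lambda kv: kv[1])   # by azimuth
    n = len(ordered)
    for i in range(n):
        name_a, az_a = ordered[i]
        name_b, az_b = ordered[(i + 1) % n]
        upper = az_b if az_b > az_a else az_b + 360.0  # the LS..RS span wraps at 180
        src = src_deg if src_deg >= az_a else src_deg + 360.0
        if az_a <= src <= upper:
            return name_a, name_b
    raise ValueError("unreachable for a full-circle layout")
```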
The panning operation specifically determines relative signal levels for the sound rendered from the first speaker 607 and the second speaker 609 of the selected speaker pair 607, 609. Thus, the panning includes determining a relative level difference between the first drive signal and the second drive signal corresponding to the desired rendering position.
In more detail, based on the desired rendering position and the (possibly assumed or nominal) positions of the first and second speakers 607, 609, the amplitude gains are determined for the first and second speaker's driver signals by means of so-called panning. Panning is a process where, depending on the position of a virtual sound source between two or more speakers, the signal corresponding to the virtual sound source position is played over these speakers, with amplitude gains that are determined based on the relative distance of the virtual sound source with respect to the speakers. E.g. in case of a virtual sound source between two speakers, if the virtual sound source is relatively close to the first speaker, the amplitude gain for the driver signal corresponding to that speaker will be relatively high, e.g. 0.9, whereas the gain for the second speaker will be relatively low, e.g. 0.1, thereby creating the impression of a virtual sound source between the first and second speaker, close to the first speaker.
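The embodiments do not prescribe a particular pan law; a minimal sketch consistent with the 0.9/0.1 example above (distance-proportional gains; names are assumed for illustration):

```python
def linear_pan_gains(src_deg, spk_a_deg, spk_b_deg):
    """Linear amplitude pan between two speakers; the gains sum to one.

    A source one tenth of the way from speaker A towards speaker B gets
    the gains (0.9, 0.1) of the example in the text.
    """
    frac = (src_deg - spk_a_deg) / (spk_b_deg - spk_a_deg)  # 0 at A, 1 at B
    frac = min(max(frac, 0.0), 1.0)                          # clamp to the pair
    return 1.0 - frac, frac                                  # (gain_a, gain_b)
```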
In the example of Fig. 6, the panning processor 611 is coupled to the audio receiver 601 and the position receiver 603 and receives the audio signal and the desired position. It then proceeds to generate a signal for each of the first and second speaker 607, 609, i.e. the panning processor 611 generates two signals from the audio signal, namely one for the first speaker 607 and one for the second speaker 609. The two generated signals have an amplitude value which when rendered from the positions of the first and second speaker 607, 609 corresponds to an audio source perceived to be at the desired position.
The first signal is used to drive the first speaker 607 and the second signal is used to drive the second speaker 609. It will be appreciated that the first and/or second signals may in some embodiments directly be used to drive the loudspeakers 607, 609 but in many embodiments the signal paths may include further processing including for example amplification, filtering, impedance matching etc.
However, rather than merely using a panning operation, the rendering unit 605 of Fig. 6 is furthermore arranged to decorrelate one of the signals relative to the other signal. It will be appreciated that any suitable way of decorrelating one signal relative to the other may be used and at any stage of the processing. For example, the decorrelation may be performed prior to the panning, as part of the panning operation, or after the panning operation. Indeed, in some embodiments, the rendering unit 605 may initially generate a decorrelated version of the audio signal and then generate the first and second signals by a gain adjustment of the original audio signal and the decorrelated version of this.
In the example of Fig. 6, the panning operation is performed on the audio signal thereby generating the first and second signal. The decorrelation is then performed by applying a decorrelation to the first signal. Specifically, in the example, the rendering unit 605 comprises a decorrelator 613 which decorrelates the first signal. The decorrelator 613 is a processing component that generates an output signal that preserves the spectral and temporal amplitude envelopes of the input signal but has a cross-correlation of less than one between the input and output. Many practical decorrelators will have a cross-correlation of close to zero between the input and the output. However, the decorrelator 613 of Fig. 6 is an adjustable decorrelator for which the degree of correlation may be varied. Thus, the decorrelator 613 may provide a partial and variable decorrelation thus allowing the input signal and output signal to have a cross-correlation of lower than one but possibly higher than zero.
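As a minimal sketch of such an adjustable decorrelator, the following Python fragment mixes the input with a fully decorrelated copy so that the input/output cross-correlation is approximately a chosen value λ. The noise-burst FIR is an assumption of this sketch and, unlike the decorrelator described above, does not strictly preserve the spectral envelope; it only serves to make the mixing structure concrete.

    import numpy as np

    def full_decorrelator(x, seed=0):
        # Convolve with a fixed, energy-normalized noise burst so the output
        # is essentially uncorrelated with the input.
        rng = np.random.default_rng(seed)
        h = rng.standard_normal(512)
        h /= np.linalg.norm(h)
        return np.convolve(x, h)[: len(x)]

    def adjustable_decorrelator(x, lam):
        # Output has a cross-correlation of approximately lam (0 <= lam <= 1)
        # with the input, while the signal energy is preserved.
        return lam * x + np.sqrt(1.0 - lam ** 2) * full_decorrelator(x)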
Although panning is highly suitable for positioning virtual sound sources in positions between physical loudspeaker positions, it can also introduce artifacts in some scenarios. Specifically, if the distance between the speakers is too large, the listener will in addition to the desired phantom sound source position also tend to perceive secondary sound sources at the positions of the used loudspeakers. Indeed, for typical speaker distances, the panning operation will tend to result in clearly noticeable artifacts.
In the system of Fig. 6, correlation between the speaker signals used to generate the panned phantom source is actively reduced by the rendering unit 605. This results in the audio signal being rendered more diffusely but still with a directional component that results in a perception of a sound source at the desired position. In this way, a more "smeared out" sound source may be perceived but still with the central position of the sound source being substantially at the desired position. The decorrelation furthermore results in the perception of additional sound sources at the speaker positions becoming less significant. The inventors have specifically realized that it is often preferable to have a sound source which is perceived as coming from a less-defined, but still more or less correct direction, than to have a source which is perceived, e.g., as coming from two distinct loudspeaker positions or from a wrong position (e.g. front-back reversal).
In the system of Fig. 6, the degree of decorrelation is dependent on the rendering speaker configuration. Specifically, it will depend on at least the position of one of the speakers, and typically on the positions of two speakers relative to each other. Thus, the rendering unit 605 may receive information of the position of the speakers, e.g. simply provided as an azimuth relative to a nominal direction (typically the forward direction). In some embodiments, such information may be received by the rendering unit 605 from a measurement unit that performs measurements to determine relative positions. Alternatively or additionally, the position information may be received as a user input, e.g. by the user simply inputting the approximate angle from the listening position to each of the speakers relative to a forward direction. In yet other embodiments, the position information may be stored in the audio renderer and simply retrieved from memory by the rendering unit 605. In some embodiments, the position information may be assumed position information.
In the system of Fig. 6, the degree of decorrelation is dependent on a distance between the positions of the first and second loudspeakers 607, 609, i.e. between the speakers used for the panning. The distance may be determined as an actual distance, e.g. along a desired direction. However, in many scenarios, the distance may be determined as an angular distance measured from the (nominal) listening position. Thus, the degree of decorrelation may be dependent on the angle between the lines from the listening position to the two loudspeakers 607, 609.
The system may specifically be arranged to only introduce additional decorrelation if the distance exceeds a threshold. Indeed, for an angular distance sufficiently low, it has been found that panning provides a very accurate perception with only very low and typically insignificant artifacts. It has been found that in many scenarios and for most people, loudspeakers that are at an angle of less than 60° provide an accurate perception of a sound source at the desired position and with typically insignificant degradations.
Accordingly, for angular distances of less than a threshold e.g. selected in the interval from 50° to 70°, no additional decorrelation will be applied. However, for higher angular distances, the amount of decorrelation may be gradually increased. Thus, the rendering unit 605 may be arranged to increase decorrelation for increasing distance, or equivalently the rendering unit 605 may determine the decorrelation amount as a monotonically increasing function of the distance between the first and second speaker positions, and specifically as a monotonically increasing function of the angular distance between the speakers 607, 609.
The rendering unit 605 thus generates a desired correlation (λ) between the speaker signals. This desired correlation is dependent on the distance between the speakers between which a source is panned, where a lower correlation (and thus higher decorrelation) is chosen for wider spaced speakers.
Since panning typically works well for angular distances between the speakers of up to around 60°, it is often not necessary to apply decorrelation below a threshold of around 60°. Therefore, for speaker distances of less than approximately 60°, no decorrelation is applied, and the two speaker signals may have a correlation of one, or close to one.
Speaker spacing never exceeds 180°, since for any larger angle there exists a smaller alternative angle describing the same configuration. With panning across a speaker spacing of 180° there is only lateralization, and therefore a corresponding correlation of zero may be used for this spacing.
For angular distances between these extreme values, a suitable function may be selected, such as e.g. a linear interpolation:
λ = (180 − max(θ, 60)) / 120

where θ denotes the angular distance in degrees between the two speakers.
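A minimal Python sketch of this interpolation, directly following the formula above (the function name is an assumption):

    def desired_correlation(angle_deg):
        # Linear interpolation: lambda = 1 up to 60 degrees of speaker
        # spacing (plain panning), falling to lambda = 0 at 180 degrees.
        return (180.0 - max(angle_deg, 60.0)) / 120.0

    # desired_correlation(60.0)  -> 1.0  (no decorrelation applied)
    # desired_correlation(120.0) -> 0.5
    # desired_correlation(180.0) -> 0.0  (pure lateralization)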
It will be appreciated that this is merely an example, and that the mentioned thresholds and correlation values may be chosen differently.
It will be appreciated that the rendering unit 605 may be implemented in different ways in different embodiments.
Fig. 8 illustrates an example of an implementation of the rendering unit 605. In the example, a decorrelator which performs a full decorrelation (i.e. with a cross-correlation between input and output of substantially zero) is used.
In the example, the audio signal to be panned between two speakers 607, 609 is first split and decorrelated in order to achieve a desired correlation level before panning gains are applied to the two signals.
In the example, the degree of decorrelation is controlled by the first drive signal being generated as a weighted summation of the original audio signal and the fully decorrelated signal. The relation between the decorrelation gains (α₁, α₂) may for example be chosen to preserve signal energy:

α₁ = cos(arccos(λ))
α₂ = sin(arccos(λ))

where λ indicates the desired correlation (λ ∈ [0, 1]) between the two speaker signals.
The panning is then performed by scaling the resulting signals using appropriate panning gains (β₁, β₂).
An advantage of the example of Fig. 8 is that no adjustments need to be made to the panning gains (β₁, β₂) obtained from a panning algorithm which does not consider any decorrelation (e.g. from a conventional algorithm for determining panning gains).
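A sketch of this decorrelate-then-pan structure of Fig. 8 in Python, reusing the full_decorrelator() sketched earlier (an assumption of these examples); the conventional panning gains β₁, β₂ are passed in unchanged:

    import numpy as np

    def render_pair_fig8(s, lam, beta1, beta2):
        # Decorrelation gains from the text: a1 = cos(arccos(lam)),
        # a2 = sin(arccos(lam)); the weighted sum preserves signal energy
        # and yields a correlation of lam between the two drive signals.
        a = np.arccos(lam)
        x1 = np.cos(a) * s + np.sin(a) * full_decorrelator(s)
        y1 = beta1 * x1   # first drive signal: partially decorrelated
        y2 = beta2 * s    # second drive signal: unmodified audio signal
        return y1, y2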
Fig. 9 illustrates another example wherein the panning gains are applied before decorrelation. In this example, the decorrelation gains (α₁, α₂) should be chosen differently to preserve the correct energy in the output signals, e.g.:

α₁ = cos(arccos(λ))
α₂ = (β₁ / β₂) · sin(arccos(λ))
As illustrated in Fig. 10, the decorrelation and panning may be performed jointly in a matrix operation on the audio signal and a decorrelated version thereof:

R = [ β₁ · cos(a/2)    β₁ · sin(a/2)
      β₂ · cos(a/2)   −β₂ · sin(a/2) ]

with a = arccos(λ).
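Equivalently, a sketch of this joint matrix form as reconstructed above (again assuming the full_decorrelator() of the earlier examples):

    import numpy as np

    def render_pair_matrix(s, lam, beta1, beta2):
        # Single 2x2 matrix applied to the audio signal and its decorrelated
        # version; the rows have energies beta1**2 and beta2**2 and a mutual
        # correlation of cos(a) = lam.
        a = np.arccos(lam)
        R = np.array([[beta1 * np.cos(a / 2.0),  beta1 * np.sin(a / 2.0)],
                      [beta2 * np.cos(a / 2.0), -beta2 * np.sin(a / 2.0)]])
        y = R @ np.vstack([s, full_decorrelator(s)])
        return y[0], y[1]   # drive signals for the two speakers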
In some embodiments, the degree of decorrelation may additionally or alternatively be dependent on the desired rendering position. For example, the decorrelation may be dependent on a relative distance from the rendering position to the nearest loudspeaker used for the panning. Specifically, the rendering unit 605 may increase the decorrelation for an increasing distance, or equivalently the amount of decorrelation may be a monotonically increasing function of the distance from the desired rendering position to the nearest speaker position.
Thus, in such an embodiment, a sound source panned closely to one of the speakers will have higher correlation than for a sound source panned halfway between the speakers. Such an approach may provide an improved user experience in many scenarios. In particular, the approach may reflect that panning works better for positions close to the speaker positions used in the panning than when further apart. Thus, the degree of diffuseness in the rendering of the audio signal is adapted to reflect the degree of artifacts, thereby automatically achieving that the artifacts are obscured in dependence on the significance of the artifacts.
In some embodiments, the amount of decorrelation may depend on the direction towards the desired rendering position with respect to a reference direction. For example, a nominal listening position and nominal front direction may be defined. E.g. for the example of Fig. 7, the nominal front direction may be from the listening position to the center speaker C. In the example, the amount of decorrelation may be varied dependent on the angle between this frontal direction and the direction towards the desired rendering position. The frontal direction may be assumed to correspond to the way a user is facing when listening to the rendering sound.
The rendering unit 605 may in such an embodiment introduce a higher decorrelation for a desired rendering position at an angle of 90° than for an angle of 0°. Thus, more decorrelation may be introduced for desired rendering positions that are typically to the side of the user than for desired rendering positions that are typically in front or to the back of the user.
In this way, the system may automatically adjust for variations in the degree of degradation that is perceived as a function of the rendering position with respect to the user. For example, the approach may be used to adapt the rendering to reflect that interpolation by the human brain tends to be more reliable for sound sources in front of or behind the listener and less reliable for rendering positions to the sides of the listener. It will be appreciated that in many embodiments, the exact amount of decorrelation may depend on a plurality of factors. For example, an algorithm for determining the desired decorrelation may take into account the distance between the speakers, the distance from the desired rendering position to the nearest speaker position, and whether the desired rendering position is in front of or to the side of the listener; a hypothetical combination is sketched below.
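Purely as a hypothetical illustration of such a combined algorithm (the weights and ranges below are inventions of this sketch, not values from the described embodiment):

    import math

    def decorrelation_amount(speaker_spacing_deg, dist_to_nearest, side_angle_deg):
        # speaker_spacing_deg: angular distance between the two speakers.
        # dist_to_nearest: normalized (0..1) distance from the rendering
        #   position to the nearest speaker position.
        # side_angle_deg: angle of the rendering position relative to the
        #   frontal direction (90 degrees = directly to the side).
        spacing = max(speaker_spacing_deg - 60.0, 0.0) / 120.0
        nearness = min(max(dist_to_nearest, 0.0), 1.0)
        sideness = abs(math.sin(math.radians(side_angle_deg)))
        return min(spacing * (0.5 + 0.25 * nearness + 0.25 * sideness), 1.0)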
The previous examples focused on a panning involving only two speakers. However, it will be appreciated that in other embodiments, the panning operation may include more than two speakers thereby allowing positioning in more dimensions.
For example, by adding elevated speakers to a configuration, panning not only takes place in the horizontal plane, but can also be performed in the vertical direction. Such an approach may specifically use three speakers instead of two as shown in Fig. 11. In such a system, three correlations are relevant and in order to control these, two (possibly partial) decorrelations can be applied as shown in Fig. 12. In this example, and assuming that full decorrelation is performed by the decorrelators, the combined panning and decorrelation gains may be determined as:
[3×3 combined panning/decorrelation gain matrix R₃ — given as an equation image (imgf000026_0001) in the original publication]
with βᵢ the three-channel panning gains (e.g. from VBAP) and the inter-driver correlations indicated by λᵢ, as depicted in Fig. 11.
The freedom of choosing the three correlations is typically limited:

0 ≤ λ₁ ≤ 1
0 ≤ λ₂ ≤ 1
λ₁² + λ₂² + λ₃² ≤ 1 + 2 · λ₁ · λ₂ · λ₃
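The last inequality matches the condition for the three pairwise correlations to form a valid (positive semi-definite) 3×3 correlation matrix; a short check might look like the following sketch:

    def correlations_valid(l1, l2, l3):
        # Given correlations in [0, 1], a 3x3 correlation matrix with
        # off-diagonal entries l1, l2, l3 is positive semi-definite iff its
        # determinant is non-negative, i.e. the scalar inequality below.
        in_range = 0.0 <= l1 <= 1.0 and 0.0 <= l2 <= 1.0
        psd = l1**2 + l2**2 + l3**2 <= 1.0 + 2.0 * l1 * l2 * l3
        return in_range and psd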
The exemplary approach of Fig. 12 uses the input signal (S) for one driver and decorrelates the two other driver signals from the input signal accordingly. Another approach may reorder the rows of matrix R₃ such that the speaker with the most energy is fed the scaled input signal (i.e. the first row is permuted to the speaker signal (1, 2 or 3) that has the highest panning gain βᵢ). An advantage of this approach is that the loudest output signal does not contain any decorrelator signal. A decorrelator signal may introduce artifacts affecting audio quality.
The remaining rows may be permuted such that with decreasing energy more decorrelation signal energy is added.
Thus, in some embodiments the amount of decorrelation applied to a drive signal may depend on an energy of the drive signal, and in particular the decorrelation applied to a drive signal may depend on its energy relative to an energy of another drive signal. In some embodiments, no decorrelation is applied to the drive signal having the highest energy.
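A sketch of this row permutation (assuming, as in the examples above, that row 0 of the matrix R₃ carries the scaled dry input signal and later rows carry progressively more decorrelator energy):

    import numpy as np

    def permute_rows_by_gain(R3, betas):
        # The loudest speaker (largest panning gain) receives row 0 (no
        # decorrelator content); quieter speakers receive the rows with
        # increasing decorrelator energy.
        order = np.argsort(np.asarray(betas))[::-1]   # loudest first
        P = np.zeros((3, 3))
        for row, spk in enumerate(order):
            P[spk, row] = 1.0    # speaker `spk` gets R3 row `row`
        return P @ R3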
In another approach one may choose to apply an additional 3D rotation to the output signals (an operation which can be combined with the R₃ matrix) to align the sum of the driver signals to coincide with the signal S.
R3 defines three vectors describing the individual speaker signals:
v₁, v₂ and v₃, i.e. the rows of R₃ expressed in the coordinate system spanned by the input signal s and the decorrelator outputs d₁ and d₂.
A sum signal can be described as the vector v: v = v₁ + v₂ + v₃.
Rotating this vector around the d₂ axis (ref. Fig. 13) gives a vector w such that the second component (the component reflecting the d₁ contribution) equals zero. This results in:

ψ₁ = arctan(v(2) / v(1))
Rotating the resulting vector w around the d₁ axis forms the vector x, such that the third component (the component reflecting the d₂ contribution) equals zero. This gives:

ψ₂ = arctan(w(3) / w(1))
This leads to the total operation:

R₃' = R_d₁(ψ₂) · R_d₂(ψ₁) · R₃

where R_d₂(ψ₁) and R_d₁(ψ₂) denote the rotation matrices around the d₂ and d₁ axes respectively.
The final degree of freedom, i.e. rotation around the s axis, may be used to ensure maximum signal continuity, e.g. by aligning the signals such that the vector with the maximum length of the three vectors z₁, z₂ and z₃ is always associated with a single decorrelator, e.g. d₁. Alternatively, the contribution of one of the decorrelators may be minimized by rotating the vectors z₁, z₂ and z₃ around the s axis. This can be used beneficially to reduce the complexity of the least present decorrelator.
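A numerical sketch of the two alignment rotations as reconstructed above (0-based indexing; the rotation-matrix conventions are assumptions of this sketch):

    import numpy as np

    def alignment_rotation(v):
        # v = v1 + v2 + v3 in (s, d1, d2) coordinates. First zero the d1
        # component by rotating about the d2 axis, then zero the d2
        # component by rotating about the d1 axis.
        psi1 = np.arctan2(v[1], v[0])
        c1, s1 = np.cos(psi1), np.sin(psi1)
        R_d2 = np.array([[ c1,  s1, 0.0],
                         [-s1,  c1, 0.0],
                         [0.0, 0.0, 1.0]])
        w = R_d2 @ v
        psi2 = np.arctan2(w[2], w[0])
        c2, s2 = np.cos(psi2), np.sin(psi2)
        R_d1 = np.array([[ c2, 0.0,  s2],
                         [0.0, 1.0, 0.0],
                         [-s2, 0.0,  c2]])
        return R_d1 @ R_d2   # combined rotation, to be applied to R3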
In some embodiments, the rendering unit 605 may be arranged to modify the frequency response for at least one of the first and second signals dependent on the desired rendering position. Thus, the transfer function representing the signal path from the audio signal (or decorrelated audio signal) to the drive signal for the loudspeaker may be dependent on the desired rendering position. Specifically, the frequency response may be modified to reflect an ear response. The (front-back) asymmetry of a person's head, and specifically of the ears, introduces a frequency-selective variation that depends on the direction from which the sound is received. In particular, ear and head shadowing may introduce a frequency response dependent on the direction from which the sound is received. In the system, the rendering unit 605 may emulate elements of such a frequency variation to improve the position perception for the listener.
In some embodiments, the frequency response may alternatively or additionally be modified dependent on the position of the loudspeakers. In particular, the frequency response may be dependent on the angle between the speakers and a reference direction, which may specifically be a direction corresponding to a (nominal) forward direction. The frequency response may accordingly be different for speakers at different positions.
In many embodiments, the frequency response may be dependent on both the desired rendering position and the speaker positions.
Specifically, equalization may be applied to account for coloration differences due to speaker positions vs. intended source position. E.g. sources in the back rendered with the surround speakers of a 5.1 configuration may benefit from a lowered level of high frequencies to account for increased head- and ear-shadowing for rear sources compared to shadowing from the position of the speakers.
In addition to the introduction of decorrelation, colorization may also be applied to improve the perception of a virtual sound source. For example, for the
configuration of Fig. 7, a virtual phantom center back speaker can be realized by playing a coherent (or decorrelated) sound through both the left and right surround speakers. However, often such a virtual back speaker is perceived in front of the listener (known as so called front-back confusion). One of the effects that cause this front-back confusion is a mismatch in the spectral cues between an actual center back speaker and the phantom sound source. Indeed, the frequency modification applied by the head and ear shadowing for sounds arriving from the back of the listener is not present for either of the sounds arriving from the surround speakers since these are substantially to the side of the user. However, by applying a virtual speaker location controlled colorization to the speaker signals this effect can be emulated thereby reducing the risk of front-back confusion.
Thus, a position dependent filtering may be applied to the signals for the speakers. For example, for a two dimensional example with a virtual phantom speaker in between two speakers "spkX" and "spkY", this can be represented as:

s'_spkX = h(p_spkX, p_spkY, p_obj) * s_spkX

where the speaker signal is filtered with a filter h(p_spkX, p_spkY, p_obj) (e.g. an FIR filter) to obtain a processed speaker signal, and * denotes convolution. The filter h(p_spkX, p_spkY, p_obj) is a function of the actual speaker positions p_spkX, p_spkY and the object/virtual channel position p_obj, i.e. the desired rendering position. For a given speaker configuration (e.g. 5.1 speakers), the filter may be tabulated or may be parameterized.
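A sketch of this position-dependent filtering, assuming a tabulated filter as suggested above (the table layout and function name are assumptions of this sketch):

    import numpy as np

    def colorize(speaker_signal, p_spk_x, p_spk_y, p_obj, filter_table):
        # Look up the FIR coefficients h(p_spkX, p_spkY, p_obj) for this
        # speaker pair and rendering position, then filter the speaker
        # signal by convolution.
        h = filter_table[(p_spk_x, p_spk_y, p_obj)]
        return np.convolve(speaker_signal, h)[: len(speaker_signal)]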
Fig. 14 shows the ear responses for a physical center rear source, a phantom rear source and a physical center front source. The coloration of the phantom source is clearly different from both physical sources, and clearly contains more high frequency content than the physical rear source. This may give rise to front-back confusions. Fig. 15 illustrates the difference between the coloration of the phantom source and the physical rear source. This may be used for coloration compensation to compensate for the differences between the physical speaker and the phantom source. The compensation will vary with the position of the phantom source and the position of the physical sources used to create the phantom source.
It will be appreciated that the order of any of the operations may be different in different embodiments. In particular, the order of the panning, decorrelation and equalization/ colorization may be different in different embodiments.
It will be appreciated that the above description for clarity has described embodiments of the invention with reference to different functional circuits, units and processors. However, it will be apparent that any suitable distribution of functionality between different functional circuits, units or processors may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units or circuits are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization.
The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention may optionally be
implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.
Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.
Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor.
Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate.
Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked, and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims

1. An apparatus for generating drive signals for audio transducers, the apparatus comprising:
an audio receiver (601) for receiving an audio signal;
a position receiver (603) for receiving position data indicative of a desired rendering position for the audio signal;
a drive signal generator (605) for generating at least a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position, the drive signal generator (605) being arranged to generate the drive signals in response to a panning for the audio signal in response to the desired rendering position;
and wherein
the drive signal generator (605) is arranged to decorrelate the first drive signal relative to the second drive signal, a degree of decorrelation being dependent on an indication of the first position.
2. The apparatus of claim 1 wherein the degree of decorrelation is dependent on an indication of the first position relative to the second position.
3. The apparatus of claim 1 wherein the degree of decorrelation is dependent on an indication of an angle between a direction from a listening position to the first position and a direction from the listening position to the second position.
4. The apparatus of claim 1 wherein the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an indication of a distance between the first position and the second position.
5. The apparatus of claim 4 wherein the drive signal generator (605) is arranged to increase decorrelation for an indication of increasing distance.
6. The apparatus of claim 4 wherein the drive signal generator (605) is arranged to only decorrelate the first drive signal relative to the second drive signal when the indication of the distance is indicative of a distance above a threshold.
7. The apparatus of claim 1 wherein the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an indication of a distance between the desired rendering position and at least one of the first position and the second position.
8. The apparatus of claim 7 wherein the drive signal generator (605) is arranged to increase decorrelation for an indication of increasing distance.
9. The apparatus of claim 1 wherein the drive signal generator (605) furthermore comprises a frequency response modifier arranged to modify a frequency response for at least the first drive signal in response to the desired rendering position.
10. The apparatus of claim 9 wherein the modification of the frequency response is dependent on an ear response for a direction from a listening position to the desired rendering position.
11. The apparatus of claim 1 or 9 wherein the drive signal generator (605) furthermore comprises a frequency response modifier arranged to modify a frequency response for at least the first drive signal dependent on the first position.
12. The apparatus of claim 1 wherein the degree of decorrelation of the first drive signal relative to the second drive signal is dependent on an angular direction from a listening position to the desired rendering position relative to a reference direction.
13. The apparatus of claim 1 wherein the signal generator (605) is further arranged to generate a third drive signal for a third audio transducer associated with a third position in response to the panning operation for the audio signal in response to the desired rendering position; and the drive signal generator is arranged to decorrelate the third drive signal relative to the first drive signal and to decorrelate the third drive signal relative to the second drive signal.
14. The apparatus of claim 1 wherein the signal generator (605) comprises a decorrelator (613) for decorrelating the first drive signal relative to the second drive signal.
15. The apparatus of claim 1 wherein the panning comprises determining a relative level difference between the first drive signal and the second drive signal corresponding to the desired rendering position.
16. A method of generating drive signals for audio transducers for rendering an audio signal, the method comprising:
receiving the audio signal;
receiving position data indicative of a desired rendering position for the audio signal;
generating at least a first drive signal for a first audio transducer associated with a first position and a second drive signal for a second audio transducer associated with a second position, the drive signals being generated in response to a panning for the audio signal in response to the desired rendering position;
and wherein
generating the first drive signal comprises decorrelating the first drive signal relative to the second drive signal, a degree of decorrelation being dependent on an indication of the first position.
17. A computer program product comprising computer program code means adapted to perform all the steps of claim 16 when said program is run on a computer.
PCT/IB2013/059875 2012-12-06 2013-11-04 Generating drive signals for audio transducers WO2014087277A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261733971P 2012-12-06 2012-12-06
US61/733,971 2012-12-06

Publications (1)

Publication Number Publication Date
WO2014087277A1 2014-06-12

Family

ID=49641813

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2013/059875 WO2014087277A1 (en) 2012-12-06 2013-11-04 Generating drive signals for audio transducers

Country Status (1)

Country Link
WO (1) WO2014087277A1 (en)


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BREEBAART J ET AL: "Background, concept, and architecture for the recent MPEG surround standard on multichannel audio compression", JOURNAL OF THE AUDIO ENGINEERING SOCIETY, AUDIO ENGINEERING SOCIETY, NEW YORK, NY, US, vol. 55, no. 5, 1 May 2007 (2007-05-01), pages 331 - 351, XP008099918, ISSN: 0004-7554 *
KENDALL G S: "THE DECORRELATION OF AUDIO SIGNALS AND ITS IMPACT ON SPATIAL IMAGERY", COMPUTER MUSIC JOURNAL, CAMBRIDGE, MA, US, vol. 19, no. 4, 1 January 1995 (1995-01-01), pages 71 - 87, XP008026420, ISSN: 0148-9267 *
KHOURY S ET AL: "Volumetric modeling of acoustic fields in CNMAT's sound spatialization theatre", VISUALIZATION '98. PROCEEDINGS RESEARCH TRIANGLE PARK, NC, USA 18-23 OCT. 1998, PISCATAWAY, NJ, USA,IEEE, US, 1 January 1998 (1998-01-01), pages 439 - 442, XP031172561, ISBN: 978-0-8186-9176-8, DOI: 10.1109/VISUAL.1998.745338 *
PULKKI V.: "Virtual source positioning using vector base amplitude panning", JOURNAL OF THE AUDIO ENGINEERING SOCIETY, vol. 45, no. 6, 1997, pages 456 - 466, XP002719359
WILLIAM G GARDNER: "3-D Audio Using Loudspeakers", 1 September 1997 (1997-09-01), Massachusetts Institute of Technology, pages 1 - 153, XP055098835, Retrieved from the Internet <URL:http://sound.media.mit.edu/Papers/gardner_thesis.pdf> [retrieved on 20140128] *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106688253A (en) * 2014-09-12 2017-05-17 杜比实验室特许公司 Rendering audio objects in a reproduction environment that includes surround and/or height speakers
WO2016040623A1 (en) * 2014-09-12 2016-03-17 Dolby Laboratories Licensing Corporation Rendering audio objects in a reproduction environment that includes surround and/or height speakers
US20170289724A1 (en) * 2014-09-12 2017-10-05 Dolby Laboratories Licensing Corporation Rendering audio objects in a reproduction environment that includes surround and/or height speakers
US11170796B2 (en) 2015-06-19 2021-11-09 Sony Corporation Multiple metadata part-based encoding apparatus, encoding method, decoding apparatus, decoding method, and program
TWI607655B (en) * 2015-06-19 2017-12-01 Sony Corp Coding apparatus and method, decoding apparatus and method, and program
KR101880844B1 (en) * 2016-02-08 2018-07-20 소니 주식회사 Ultrasonic speaker assembly for audio spatial effect
KR20170094078A (en) * 2016-02-08 2017-08-17 소니 주식회사 Ultrasonic speaker assembly for audio spatial effect
CN107205202A (en) * 2016-03-16 2017-09-26 索尼公司 The ultrasonic speaker component surveyed and drawn with ultrasonic wave room
CN107205202B (en) * 2016-03-16 2020-03-20 索尼公司 System, method and apparatus for generating audio
EP3220657A1 (en) * 2016-03-16 2017-09-20 Sony Corporation Ultrasonic speaker assembly with ultrasonic room mapping
CN110431853A (en) * 2017-03-29 2019-11-08 索尼公司 Loudspeaker apparatus, audio data provide equipment and voice data reproducing system
CN110431853B (en) * 2017-03-29 2022-05-31 索尼公司 Speaker apparatus, audio data providing apparatus, and audio data reproducing system
CN111800731A (en) * 2019-04-03 2020-10-20 雅马哈株式会社 Audio signal processing device and audio signal processing method
US11089422B2 (en) * 2019-04-03 2021-08-10 Yamaha Corporation Sound signal processor and sound signal processing method
CN114846821A (en) * 2019-12-18 2022-08-02 杜比实验室特许公司 Audio device auto-location
CN118102569A (en) * 2023-10-20 2024-05-28 国电投核力同创(北京)科技有限公司 Three-section type penning ion source anode cavity


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13795586; Country of ref document: EP; Kind code of ref document: A1)
NENP: non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 13795586; Country of ref document: EP; Kind code of ref document: A1)