WO2018193161A1 - Spatially extending in the elevation domain by spectral extension - Google Patents
Spatially extending in the elevation domain by spectral extension
- Publication number: WO2018193161A1
- Application: PCT/FI2018/050274
- Authority: WO / WIPO (PCT)
Classifications (all under H—ELECTRICITY, H04—ELECTRIC COMMUNICATION TECHNIQUE, H04S—STEREOPHONIC SYSTEMS)
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/307—Frequency adjustment, e.g. tone control
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
- H04S2420/07—Synergistic effects of band splitting and sub-band processing
- H04S2420/11—Application of ambisonics in stereophonic audio systems
- H04S2420/13—Application of wave-field synthesis in stereophonic audio systems
Definitions
- the means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise means for dividing the at least one audio signal into the second part below a determined frequency value and the first part above the determined frequency value.
- the means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise means for dividing the at least one audio signal into the second part comprising the at least one audio signal without spectral extensions and the first part comprising the at least one audio signal with spectral extensions.
- the means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise means for dividing the at least one audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at a determined frequency value, wherein the first part comprises a high pass filter version of the mixing filter and the second part comprises a low pass filter version of the mixing filter.
- the means for combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising vertical spatial extent may comprise means for mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter.
- the user input may define a head-pose parameter set of yaw, pitch and roll wherein the means for mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter may comprise: means for analysing the at least one audio signal to determine whether there is elevation extent with a spectral band extension; means for updating a source position for mixing using all of the head-pose yaw, pitch and roll parameters based on determining there is elevation extent and controlling the mixing based on the updated source position; means for updating a source position for mixing using only the head-pose yaw parameter based on determining there is no elevation extent; means for determining whether the source position needs to be perceivable; means for controlling the mixing based on the updated positions based on the source actual position being determined as not being needed to be perceivable; and means for controlling the mixing based on a source position before updating based on the source actual position being determined as being needed to be perceivable.
- the apparatus may further comprise means for receiving a user input for controlling the vertically spatially extending of the second part of the at least one audio signal.
- the means for analysing at least one audio signal to determine spectral content of the at least one audio signal may comprise: means for analysing at least one audio signal to determine spectral content of the at least one audio signal; and means for storing a result of the analysis as metadata associated with the at least one audio signal prior to the spectrally extending of the at least one audio signal based on the metadata spectral content of the at least one audio signal.
- a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
- An electronic device may comprise apparatus as described herein.
- a chipset may comprise apparatus as described herein.
- Figure 1 shows schematically an example system for spatial audio mixing featuring original and spatially extended audio signals in the horizontal and vertical domain according to some embodiments
- Figure 3 shows schematically an example spatially extending synthesizer implementation shown in figure 1 in further detail according to some embodiments
- Figure 4 shows schematically an example mixer control system for the mixer shown in figure 1 according to some embodiments
- Figure 5 shows the operation of the example mixer shown in figure 4.
- Figure 6 shows schematically an example device suitable for implementing the apparatus shown in Figures 1, 3 and 5.
- a conventional approach to the capturing and mixing of sound sources with respect to an audio background or environment audio field signal would be for a professional producer to utilize an external microphone (a close or Lavalier microphone worn by the user, or a microphone attached to an instrument or some other microphone) to capture audio signals close to the sound source, and further utilize a 'background' microphone or microphone array to capture an environmental audio signal. These signals or audio tracks may then be manually mixed to produce an output audio signal such that the produced sound features the sound source coming from an intended (though not necessarily the original) direction.
- the concept as discussed in detail hereafter is a system that implements spatially extending synthesis, which ensures that a perceivable spatial extent in the elevation domain is also created. Synthesis of elevation extent increases the immersive effect of sound compared to spatially extending only in the azimuth. However, elevation-based spatial extending requires a different approach than azimuth-based spatial extending, as the listener perceives elevation strongly only with high frequency content. Thus, the embodiments as discussed herein show a system in two parts: firstly, the system is configured to analyse the input signal to determine whether there is sufficient high frequency content present and to generate high frequency content when necessary; secondly, the system is configured to perform the spatially extending synthesis in the elevation domain as well as the azimuth domain.
- with respect to figure 1 is shown an example system for generating a vertically and horizontally spatially extended audio signal according to some embodiments.
- the system in some embodiments may comprise an audio signal input 101.
- the audio signal input 101 is configured to receive or generate a mono audio signal.
- the mono audio signal may be one from a microphone such as an external microphone.
- the external microphone may be any microphone external or separate to a microphone array (for example a Lavalier microphone) which may capture a spatial audio signal.
- the external microphones can be worn/carried by persons or mounted as close-up microphones for instruments or a microphone in some relevant location which the designer wishes to capture accurately.
- a Lavalier microphone typically comprises a small microphone worn around the ear or otherwise close to the mouth.
- the audio signal may be provided either by a Lavalier microphone or by an internal microphone system of the instrument (e.g., pick-up microphones in the case of an electric guitar) or an internal audio output (e.g., an electric keyboard output).
- the close microphone may be configured to output the captured audio signals to a mixer.
- the external microphone may be connected to a transmitter unit (not shown), which wirelessly transmits the audio signal to a receiver unit (not shown).
- the positions of the external microphones, and thus of the performers and/or the instruments being played, may be tracked by using position tags located on or associated with the microphone source.
- the external microphone comprises or is associated with a microphone position tag.
- the microphone position tag may be configured to transmit a radio signal such that an associated receiver may determine information identifying the position or location of the close microphone. It is important to note that microphones worn by people can be freely moved in the acoustic space, and a system supporting location sensing of wearable microphones has to support continuous sensing of user or microphone location.
- the close microphone position tag may be configured to output this signal to a position tracker.
- HAIP: high accuracy indoor positioning
- the system comprises a spectral content analyser 103.
- the spectral content analyser 103 may be configured to receive the audio signal (for example from the microphone).
- the spectral content analyser 103 may be configured to analyse the audio signal for its spectral content distribution.
- the spectral content analyser 103 may be configured to check how much of the signal energy is located above a 3 kHz frequency boundary compared to the signal energy below the boundary.
- the frequency boundary may be any suitable frequency.
- the spectral content analyser 103 is configured to determine whether the energy content of the audio signal above the boundary is greater than a determined threshold value and control a spectral band extender 105, a vertical signal selector 107 and the horizontal and vertical spatially extending synthesizers 109, 111.
- the determined threshold value may be, for example, that at least 10% of the signal energy is located above 3 kHz. In some embodiments any other suitable threshold may be used, and the threshold may be adjustable or adjusted depending on a user or sound engineer's preferences.
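As an illustration of this analysis step, the following is a minimal sketch (Python/NumPy; the function name and defaults are illustrative, restating the 3 kHz boundary and 10% threshold given above, not code from the patent):

```python
import numpy as np

def needs_spectral_extension(signal, sample_rate, boundary_hz=3000.0,
                             threshold=0.10):
    """Flag a signal for spectral band extension when too little of its
    energy lies above the boundary frequency (here 3 kHz / 10%)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    energy = np.abs(spectrum) ** 2
    total = energy.sum()
    if total == 0.0:
        return False  # silent frame: nothing to extend
    high_ratio = energy[freqs >= boundary_hz].sum() / total
    return high_ratio < threshold
```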
- the audio signal can be used as is without any spectral extending.
- the spectral content analyser is configured to control the spectral band extender 105.
- the system comprises a spectral band extender 105.
- the spectral band extender 105 may be configured to receive the audio signal and furthermore receive a control input from the spectral content analyser 103.
- the spectral band extender 105 in some embodiments may be configured to add high (or higher) frequency spectral content to the audio signal in order to create energy for the vertical spatially extending synthesis.
- the spectral band extender 105 is configured to add harmonic distortion to the signal with a specific distortion effect. However, to some listeners this can be perceived as annoying.
- the spectral band extender 105 is configured to apply specific spectral bandwidth extension methods such as spectral band replication (SBR).
- any suitable spectral bandwidth extension methods may be implemented, such as for example those shown in Larsen, E., Aarts, R., Audio Bandwidth Extension, 2004, and Larsen, E., Aarts, R., Danessis, M., Efficient High-Frequency Bandwidth Extension of Music and Speech. These methods are normally used, for example, in audio coding to reduce the data rate used for high frequencies.
- the implementations create energy in the higher frequency regions by picking or selecting content from lower frequencies, transposing the content to higher frequencies (i.e., pitch shifts or moves in the spectral domain), and matching the assumed harmonic structure of the signal.
- where the signal is more noise-like, noise is added instead of, or in addition to, the harmonic signal.
- These spectral bandwidth extension processes may be performed until the specified threshold for energy is met. In some circumstances this spectral bandwidth extension process can generate artefacts in the output audio signal.
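A rough sketch of one such extension step follows; it mirrors only the "transpose low-frequency content upwards" family of methods named above, not any particular SBR codec, and the gain value is an arbitrary illustrative choice:

```python
import numpy as np

def extend_high_band(signal, sample_rate, boundary_hz=3000.0, gain=0.5):
    """Crudely create high-frequency energy by shifting an attenuated copy
    of the upper low-band spectrum above the boundary. Real bandwidth
    extension additionally matches the harmonic structure and spectral
    envelope, and adds shaped noise for noise-like signals."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    k = int(np.searchsorted(freqs, boundary_hz))  # first bin above the boundary
    n_copy = min(k, len(spectrum) - k)            # bins available to fill
    # Shift the top n_copy low-band bins up into the high band, attenuated.
    spectrum[k:k + n_copy] += gain * spectrum[k - n_copy:k]
    return np.fft.irfft(spectrum, n=len(signal))
```

In practice the step would be repeated, or the gain adjusted, until the energy threshold used by the analyser is met.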
- the vertical signal selector 107 is configured to select or filter the audio signals and pass the selected audio signals to the horizontal spatially extending synthesizer (H-Spatially Extending synthesizer) 109 and to the vertical spatially extending synthesizer (V-Spatially Extending synthesizer) 111.
- the vertical signal selector 107 is configured such that, as the lower frequencies are generally important for azimuth perception, more of the lower-frequency energy is selected for the horizontal extension.
- the vertical signal selector 107 is configured such that, as higher frequencies are important for elevation perception, the higher frequency energy is selected for the vertical extension.
- the vertical signal selector 107 is configured to divide the audio signal using the 3 kHz border with a wide crossover bandwidth.
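One way such a divide could look is sketched below: complementary spectral weights with approximately 3 dB per octave slopes and an equal-power (50/50) point at the 3 kHz border, matching the mixing-filter behaviour described in the summary. The weight formulas are an illustrative choice, not taken from the patent:

```python
import numpy as np

def split_for_extension(signal, sample_rate, centre_hz=3000.0):
    """Split a signal into a low part (for horizontal extension) and a
    high part (for vertical extension) with a wide, gentle crossover."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    ratio = freqs / centre_hz
    high_w = np.sqrt(ratio / (1.0 + ratio))  # rises ~3 dB/octave, half power at centre
    low_w = np.sqrt(1.0 / (1.0 + ratio))     # power-complementary: low^2 + high^2 = 1
    low_part = np.fft.irfft(low_w * spectrum, n=len(signal))
    high_part = np.fft.irfft(high_w * spectrum, n=len(signal))
    return low_part, high_part
```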
- the system may further comprise a spatial mixer 113.
- the spatial mixer 113 is configured to receive the horizontally (or azimuth) spatially extended signal 110 and the vertically (or elevation) spatially extended signal 112 and generate a horizontally and vertically (azimuth and elevation) spatially extended audio signal. The operation of the spatial mixer 113 is shown in further detail later.
- the spectral content of the audio signal is analysed as shown in figure 2 by step 203.
- the vertical audio signal output may be spatially extended by the application of the vertical spatially extending synthesizer as shown in figure 2 by step 209.
- the horizontal spatially extended audio signal and the vertical spatially extended audio signal may then be combined using separate horizontal and vertical sources as shown in figure 2 by step 211.
- the spatially extending synthesizer may further comprise an object position input/determiner 402.
- the object position input/determiner 402 may be configured to determine the spatial position of sound sources. This information may be determined in some embodiments by a sound object processor.
- the spatially extending synthesizer may further comprise a series of multipliers.
- in Figure 3 is shown one multiplier for each frequency band.
- the series of multipliers comprises multipliers 407₁ to 407₉; however, any suitable number of multipliers may be used.
- Each frequency domain band signal may be multiplied in the multiplier 407 with the determined VBAP gains.
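A sketch of this per-band panning stage follows; the array shapes are assumptions for illustration, and the VBAP gain computation itself is taken as given:

```python
import numpy as np

def render_extended_source(band_signals, band_directions, vbap_gains):
    """Apply per-direction VBAP gains to each frequency-band signal.

    band_signals:    (n_bands, n_samples) band-filtered signals
    band_directions: (n_bands,) direction index assigned to each band
    vbap_gains:      (n_directions, n_speakers) precomputed VBAP gains
    Returns (n_speakers, n_samples) loudspeaker signals.
    """
    n_speakers = vbap_gains.shape[1]
    out = np.zeros((n_speakers, band_signals.shape[1]))
    for b, direction in enumerate(band_directions):
        gains = vbap_gains[direction]                  # the multiplier for this band
        out += gains[:, np.newaxis] * band_signals[b]  # weighted copy to each speaker
    return out
```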
- a spatial mixer may be implemented in some embodiments by combining the audio signals using a simple mixer. However, in some situations, for example those implementing binaural representation where a normal head-tracking operation is used to change the directions of the signals, the vertical spatially extended signal may become the horizontally extended signal and vice versa. This happens when the elevation or tilt (also called pitch and roll) of the head are non-zero and especially when either is ±90 degrees. In these situations the output audio signals may be incorrect.
- the system may comprise a head-locking mixer where the direction compensation driven by head-tracking does not affect the extended signal when the orientation changes.
- Figure 4 shows the example spectral mixer comprising a spectral band extension determiner 501.
- the source position is updated using the yaw, pitch and roll parameters from the headpose and the mixing is controlled based on the updated position as shown in figure 5 by step 609 and the output as shown in figure 5 by step 610.
- the source position is updated using yaw parameters from the headpose as shown in figure 5 by step 611.
- the source is then analysed to determine whether the source actual position needs to be perceivable as shown in figure 5 by step 613.
- the mixing is controlled based on the updated positions and output as shown in figure 5 by step 614.
- the original signal position is updated using the yaw, pitch and roll parameters from the headpose and the mixing is controlled based on the updated positions as shown in figure 5 by step 615 and the output as shown in figure 5 by step 616.
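Reading the figure 5 flow above as pseudocode, one hedged interpretation of which head-pose parameters drive the position update is:

```python
def pose_for_mixing(yaw, pitch, roll, has_elevation_extent,
                    must_stay_perceivable):
    """Effective head pose used when updating the source position for
    mixing; an interpretation of steps 609-616, not the patent's code."""
    if has_elevation_extent:
        # Elevation extent present: compensate with the full head pose
        # (steps 609/610).
        return yaw, pitch, roll
    if must_stay_perceivable:
        # The source's actual position must remain perceivable, so the
        # original position is updated with the full pose (steps 615/616).
        return yaw, pitch, roll
    # Otherwise yaw-only compensation, so head pitch/roll cannot swap the
    # vertical and horizontal extensions (steps 611/614).
    return yaw, 0.0, 0.0
```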
- the system could receive a user input to control the spatially extending synthesis, for example where the user can decide whether to implement vertical spatial extension of an audio signal and the degree of the vertical spatial extent. Furthermore, in some embodiments the user can also monitor the output signal, for example a binaurally rendered version, and determine how aggressive the vertical extent and/or high frequency content creation algorithms are. For example, the user can determine that less vertical extent (than the automatic algorithm produces) is enough, and then control the system manually, for example by forcing the system to use a less aggressive high frequency content creation scheme and/or by narrowing the extent.
- as the parameters for controlling extent creation are only dependent on the input signal, it is possible to precompute the analysis beforehand and store it as separate metadata.
- this analysis may be integrated within an audio file input to the system and can be very advantageous as the user can also tune the parameters just right beforehand.
- the parameters are fetched from the metadata and used to control the extent. Additionally, these parameters could be time dependent and thus change through time if the user so desires.
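For instance, the precomputed analysis could be serialised alongside the audio as a small metadata record; the field names below are illustrative, not from the patent:

```python
import json
import numpy as np

def analysis_to_metadata(signal, sample_rate, boundary_hz=3000.0):
    """Run the extent-control analysis once and store the result as
    metadata so a renderer can control the extent without re-analysis."""
    energy = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = energy.sum()
    high_ratio = float(energy[freqs >= boundary_hz].sum() / total) if total else 0.0
    record = {
        "high_band_energy_ratio": high_ratio,
        "needs_spectral_extension": high_ratio < 0.10,
        "boundary_hz": boundary_hz,
        # Parameters may be time dependent; a fuller record could store
        # (time, value) pairs per parameter instead of single values.
    }
    return json.dumps(record)
```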
- Another example may be where the user wants to spatialize an electric guitar signal in a music mix for VR content. This is desired to be done both in the horizontal and vertical planes. As most of the spectral content for the guitar is below the 3 kHz mark, the system uses bandwidth extension to extend the frequencies for the vertical extension.
- the device may be any suitable electronics device or apparatus.
- the device 1200 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
- the device 1200 may comprise a microphone 1201 .
- the microphone 1201 may comprise a plurality (for example a number N) of microphones. However it is understood that there may be any suitable configuration of microphones and any suitable number of microphones.
- the microphone 1201 is separate from the apparatus and the audio signal is transmitted to the apparatus by a wired or wireless coupling.
- the microphone 1201 may in some embodiments be the microphone array as shown in the previous figures.
- the microphone may be a transducer configured to convert acoustic waves into suitable electrical audio signals.
- the microphones can be solid state microphones. In other words the microphone may be capable of capturing audio signals and outputting a suitable digital format signal.
- the microphone 1201 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone.
- the microphone can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 1203.
- the device 1200 may further comprise an analogue-to-digital converter 1203.
- the analogue-to-digital converter 1203 may be configured to receive the audio signals from each of the microphones 1201 and convert them into a format suitable for processing. In some embodiments where the microphone is an integrated microphone the analogue-to-digital converter is not required.
- the analogue-to-digital converter 1203 can be any suitable analogue-to-digital conversion or processing means.
- the analogue-to-digital converter 1203 may be configured to output the digital representations of the audio signal to a processor 1207 or to a memory 1211.
- the device 1200 comprises at least one processor or central processing unit 1207.
- the processor 1207 can be configured to execute various program codes such as the methods such as described herein.
- the device 1200 comprises a user interface 1205.
- the user interface 1205 can be coupled in some embodiments to the processor 1207.
- the processor 1207 can control the operation of the user interface 1205 and receive inputs from the user interface 1205.
- the user interface 1205 can enable a user to input commands to the device 1200, for example via a keypad.
- the user interface 1205 can enable the user to obtain information from the device 1200.
- the user interface 1205 may comprise a display configured to display information from the device 1200 to the user.
- the user interface 1205 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1200 and further displaying information to the user of the device 1200.
- the user interface 1205 may be the user interface for communicating with the position determiner as described herein.
- the transceiver 1209 can communicate with further apparatus by any suitable known communications protocol.
- the transceiver 1209 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
- the device 1200 can comprise in some embodiments an audio subsystem output 1215.
- An example as shown in Figure 6 shows the audio subsystem output 1215 as an output socket configured to enable a coupling with headphones 121.
- the audio subsystem output 1215 may be any suitable audio output or a connection to an audio output.
- the audio subsystem output 1215 may be a connection to a multichannel speaker system.
- the digital to analogue converter 1213 and audio subsystem 1215 may be implemented within a physically separate output device.
- the DAC 1213 and audio subsystem 1215 may be implemented as cordless earphones communicating with the device 1200 via the transceiver 1209.
- while the device 1200 is shown having audio capture, audio processing and audio rendering components, it would be understood that in some embodiments the device 1200 can comprise just some of the elements.
- the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
- some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
- While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
- Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
- the design of integrated circuits is by and large a highly automated process.
- Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Abstract
An apparatus for generating a spatially extended audio signal, the apparatus configured to: analyse at least one audio signal to determine spectral content of the at least one audio signal; determine whether to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal is to include a determined portion of frequencies above a defined frequency; and vertically spatially extend at least part of the at least one audio signal when the determined spectral content of the at least one audio signal is to be processed.
Description
SPATIALLY EXTENDING IN THE ELEVATION DOMAIN BY SPECTRAL EXTENSION
Field
The present application relates to apparatus and methods for spatially extending audio signals in the elevation domain by spectrally extending the audio signals.
Background
Capture of audio signals from multiple sources and mixing of audio signals when these sources are moving in the spatial field requires significant effort. For example, the capture and mixing of an audio signal source, such as a speaker or artist within an audio environment such as a theatre or lecture hall, to be presented to a listener so as to produce an effective audio atmosphere requires significant investment in equipment and training.
A commonly implemented system is where one or more 'external' microphones, for example a Lavalier microphone worn by the user or an audio channel associated with an instrument, is mixed with a suitable spatial (or environmental or audio field) audio signal such that the produced sound comes from an intended direction. This system is known in some areas as Spatial Audio Mixing (SAM).
The SAM system enables the creation of immersive sound scenes comprising "background spatial audio" or ambiance and sound objects for Virtual Reality (VR) applications. Often, the scene can be designed such that the overall spatial audio of the scene, such as a concert venue, is captured with a microphone array (such as one contained in the OZO virtual camera) and the most important sources captured using the 'external' microphones.
One of the aspects of the SAM system is the generation and use of volumetric virtual sound sources. The term volumetric virtual sound source refers to a virtual sound source with a spatial volume, whereas a point-like virtual source is perceived from a single point in space. Volumetric virtual sound sources are useful in various applications including virtual and augmented reality and computer gaming. They enable creative opportunities for sound engineers and facilitate more realistic representation of sounds with a natural size, such as large sound-emitting objects. Consider, for example, a fountain, sea, or a large machine. Such volumetric virtual sound sources are discussed in Pihlajamaki et al., Synthesis of Spatially Extended Virtual Sources with Time-Frequency Decomposition of Mono Signals, JAES 2014.
The creation of volumetric virtual sound sources can be simply implemented by the creation of sounds with a perceived spatial extent. This is because listeners' ability to perceive sounds at different distances is limited. A sound with a perceived spatial extent may surround the listener or it may have a specific width.
Because of a hearing effect called summing localization, the listener perceives simultaneously presented coherent audio signals as a virtual sound source between the original sources. If the coherence is lower, the signals may be perceived as separate audio objects or as a spatially extended auditory effect. Coherence can be measured with the interaural cross-correlation value between signals (IACC). When identical signals are played from both headphones, the listener will perceive an auditory event in the center of the head. With identical signals the IACC value equals one. With an IACC value of zero, one auditory event will be perceived near each ear. When the IACC value is between one and zero, the listener may perceive a spatially extended or spread auditory event inside the head, with the extent varying according to the IACC value.
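For concreteness, a minimal sketch of how the IACC mentioned above can be computed from a pair of ear signals follows; the ±1 ms lag search is a common convention in the literature, not a value specified in this text:

```python
import numpy as np

def iacc(left, right, sample_rate, max_lag_ms=1.0):
    """Normalised interaural cross-correlation, maximised over short
    interaural lags: 1 for identical signals, near 0 for independent ones."""
    n = min(len(left), len(right))
    left, right = left[:n], right[:n]
    max_lag = int(sample_rate * max_lag_ms / 1000.0)
    denom = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    if denom == 0.0:
        return 0.0
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            num = np.sum(left[lag:] * right[:n - lag])
        else:
            num = np.sum(left[:n + lag] * right[-lag:])
        best = max(best, abs(num) / denom)
    return best
```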
To synthesize a sound source with a perceived spatial extent, one approach is to divide the signal into non-overlapping frequency bands, and then present the frequency bands at distinct spatial positions around the listener. The area from which the frequency bands are presented may be used to control the perceived spatial extent. Special care needs to be taken on how to distribute the frequency bands, such that no degradation in the timbre of the sound occurs, and that the sound is perceived as a single spatially extended source rather than several sound objects.
When spatially extending a sound source, as described in Pihlajamaki et al., the audio is split into frequency bands (512, for example). These are then rendered from a number of (9, for example) different directions defined by the desired spatial extent. The frequency bands are divided into the different directions using what is called a low-discrepancy sequence, e.g., a Halton sequence. This provides random-looking uniformly distributed frequency component sets for the different directions. Thus, for each direction we have a filter which selects frequency components of the original signal based on the Halton sequence. Using these filters, we have signals for the different directions that, ideally, have similar frequency content (shape) as the original signal, but do not contain common frequency components with each other. This results in the sound being heard as having spatial extent.
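A small sketch of this band-to-direction assignment follows (512 bands and 9 directions are the examples from the text; the base-2 Halton sequence, equivalent here to a van der Corput sequence, is one possible low-discrepancy choice):

```python
import numpy as np

def halton(index, base=2):
    """Element `index` (1-based) of the base-`base` Halton sequence."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def assign_bands(n_bands=512, n_directions=9):
    """Return, per direction, the set of frequency-band indices it renders:
    uniformly distributed but random looking."""
    choice = np.array([int(halton(i + 1) * n_directions) for i in range(n_bands)])
    return [np.flatnonzero(choice == d) for d in range(n_directions)]
```

Each returned index set defines the filter that selects that direction's frequency components; because every band is assigned to exactly one direction, the per-direction signals share no frequency components.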
However, the current spatially extending synthesizer is designed and tested only for horizontal extension.
Summary
There is provided according to a first aspect an apparatus for generating a spatially extended audio signal, the apparatus configured to: analyse at least one audio signal to determine spectral content of the at least one audio signal; determine whether to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal is to include a determined portion of frequencies above a defined frequency; and vertically spatially extend at least part of the at least one audio signal when the determined spectral content of the at least one audio signal is to be processed.
The apparatus may be further configured to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal includes a determined portion of frequencies above a defined frequency defined as a determined portion of energy of the audio signal above a defined frequency, wherein the at least part of the at least one audio signal is at least part of the spectrally extended at least one audio signal.
The apparatus may be further configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal, wherein the at least part of the at least one audio signal is the first part of the at least one audio signal.
The apparatus may be further configured to: horizontally spatially extend at least part of the at least one audio signal; and combine the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising horizontal and vertical spatial extent.
The apparatus may be further configured to at least one of: receive the at least one audio signal from a microphone; and generate the at least one audio signal in a synthetic sound generator.
The apparatus configured to analyse at least one audio signal to determine spectral content of the at least one audio signal may be further configured to: determine a first energy content of the at least one audio signal below a first frequency value; and determine a second energy content of the at least one audio signal above the first frequency value.
The first frequency value may be 3 kHz.
The apparatus configured to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal may be configured to apply at least one of: add content with a frequency above the first frequency value to the at least one audio signal; apply spectral band replication to the at least one audio signal; select content from lower frequencies of the at least one audio signal, transpose the content to higher frequencies, and match a harmonic structure of the signal; add noise with a frequency above the first frequency value to the at least one audio signal; and apply a spectral tilt which amplifies higher frequencies above the first frequency value to the at least one audio signal.
The apparatus configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may be configured to divide the at least one audio signal into the second part below a determined frequency value and the first part above the determined frequency value.
The apparatus configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may be configured to divide the at least one audio signal into the second part comprising the at least one audio signal without spectral extensions and the first part comprising the at least one audio signal with spectral extensions.
The apparatus configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may be configured to divide the at least one audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at a determined frequency value, wherein the first part comprises a high pass filter version of the mixing filter and the second part comprises a low pass filter version of the mixing filter.
The apparatus configured to combine the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising vertical spatial extent may be configured to mix the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter.
The apparatus configured to mix the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter set of yaw, pitch and roll may be configured to: analyse the at least one audio signal to determine whether there is elevation extent with a spectral band extension; update a source position for mixing using all of the head-pose yaw, pitch and roll parameters based on determining there is elevation extent and control the mixing based on the updated source position; update a source position for mixing using only the head-pose yaw parameter based on determining there is no elevation extent; determine whether the source position needs to be perceivable; control the mixing based on the updated positions based on the source actual position being determined as not being needed to be perceivable; and control the mixing based on a source position before updating based on the source actual position being determined as being needed to be perceivable.
The apparatus may be further configured to receive a user input for controlling the vertically spatially extending of the second part of the at least one audio signal.
The apparatus configured to analyse at least one audio signal to determine spectral content of the at least one audio signal may be configured to: analyse at least one audio signal to determine spectral content of the at least one audio signal; and store a result of the analysis as metadata associated with the at least one audio signal prior to the apparatus being configured to spectrally extend the at least one audio signal based on the metadata spectral content of the at least one audio signal.
According to a second aspect there is provided a method for generating a spatially extended audio signal, comprising: analysing at least one audio signal to determine spectral content of the at least one audio signal; determining whether to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal is to include a determined portion of frequencies above a defined frequency; and vertically spatially extending at least part of the at least one audio signal when the determined spectral content of the at least one audio signal is to be processed.
The method may further comprise spectrally extending the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal includes a determined portion of frequencies above a defined frequency defined as a determined portion of energy of the audio signal above a defined frequency, wherein the at least part of the at least one audio signal is at least part of the spectrally extended at least one audio signal.
The method may further comprise dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal, wherein the at least part of the at least one audio signal is the first part of the at least one audio signal.
The method may further comprise: horizontally spatially extending at least part of the at least one audio signal; and combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising horizontal and vertical spatial extent.
The method may further comprise at least one of: receiving the at least one audio signal from a microphone; and generating the at least one audio signal in a synthetic sound generator.
Analysing at least one audio signal to determine spectral content of the at least one audio signal may further comprise: determining a first energy content of the at least one audio signal below a first frequency value; and determining a second energy content of the at least one audio signal above the first frequency value.
The first frequency value may be 3 kHz.
Spectrally extending the at least one audio signal based on the spectral content of the at least one audio signal may comprise at least one of: adding content with a frequency above the first frequency value to the at least one audio signal; applying spectral band replication to the at least one audio signal; selecting content from lower frequencies of the at least one audio signal, transposing the content to higher frequencies, and matching a harmonic structure of the signal; adding noise with a frequency above the first frequency value to the at least one audio signal; and applying a spectral tilt which amplifies higher frequencies above the first frequency value to the at least one audio signal.
Dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise dividing the at least one audio signal into the second part below a determined frequency value and the first part above the determined frequency value.
Dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise dividing the at least one audio signal into the second part comprising the at least one audio signal without spectral extensions and the first part comprising the at least one audio signal with spectral extensions.
Dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise dividing the at least one audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at a determined frequency value, wherein the first part comprises a high pass filter version of the mixing filter and the second part comprises a low pass filter version of the mixing filter.
Combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising vertical spatial extent may comprise mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter.
The user input may define a head-pose parameter set of yaw, pitch and roll, wherein mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter may comprise: analysing the at least one audio signal to determine whether there is elevation extent with a spectral band extension; updating a source position for mixing using all of the head-pose yaw, pitch and roll parameters based on determining there is elevation extent and controlling the mixing based on the updated source position; updating a source position for mixing using only the head-pose yaw parameter based on determining there is no elevation extent; determining whether the source position needs to be perceivable; controlling the mixing based on the updated positions based on the source actual position being determined as not being needed to be perceivable; and controlling the mixing based on a source position before updating based on the source actual position being determined as being needed to be perceivable.
The method may further comprise receiving a user input for controlling the vertically spatially extending of the second part of the at least one audio signal.
Analysing at least one audio signal to determine spectral content of the at least one audio signal may comprise: analysing at least one audio signal to determine spectral content of the at least one audio signal; and storing a result of the analysis as metadata associated with the at least one audio signal prior to the spectrally extending of the at least one audio signal based on the metadata spectral content of the at least one audio signal.
According to a third aspect there is provided an apparatus for generating a spatially extended audio signal, comprising: means for analysing at least one audio signal to determine spectral content of the at least one audio signal; means for determining whether to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal is to include a determined portion of frequencies above a defined frequency; and means for vertically spatially extending at least part of the at least one audio signal when the determined spectral content of the at least one audio signal is to be processed.
The apparatus may further comprise means for spectrally extending the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal includes a determined portion of frequencies above a defined frequency defined as a determined portion of energy of the audio signal above a defined frequency, wherein the at least part of the at least one audio signal is at least part of the spectrally extended at least one audio signal.
The apparatus may further comprise means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal, wherein the at least part of the at least one audio signal is the first part of the at least one audio signal.
The apparatus may further comprise: means for horizontally spatially extending at least part of the at least one audio signal; and means for combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising horizontal and vertical spatial extent.
The apparatus may further comprise at least one of: means for receiving the at least one audio signal from a microphone; and means for generating the at least one audio signal in a synthetic sound generator.
Means for analysing at least one audio signal to determine spectral content of the at least one audio signal may further comprise: means for determining a first energy content of the at least one audio signal below a first frequency value; and means for determining a second energy content of the at least one audio signal above the first frequency value.
The first frequency value may be 3 kHz.
The means for spectrally extending the at least one audio signal based on the spectral content of the at least one audio signal may comprise at least one of: means for adding content with a frequency above the first frequency value to the at least one audio signal; means for applying spectral band replication to the at least one audio signal; means for selecting content from lower frequencies of the at least one audio signal, means for transposing the content to higher frequencies, and means for matching a harmonic structure of the signal; means for adding noise with a frequency above the first frequency value to the at least one audio signal; and means for applying a spectral tilt which amplifies higher frequencies above the first frequency value to the at least one audio signal.
The means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise means for dividing the at least one audio signal into the second part below a determined frequency value and the first part above the determined frequency value.
The means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise means for dividing the at least one audio signal into the second part comprising the at least one audio signal without spectral extensions and the first part comprising the spectral extensions of the at least one audio signal.
The means for dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal may comprise means for dividing the at least one audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at a determined frequency value, wherein the first part comprises a high pass filter version of the mixing filter and the second part comprises a low pass filter version of the mixing filter.
The means for combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising vertical spatial extent may comprise means for mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter.
The user input may define a head-pose parameter set of yaw, pitch and roll, wherein the means for mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter may comprise: means for analysing the at least one audio signal to determine whether there is elevation extent with a spectral band extension; means for updating a source position for mixing using all of the head-pose yaw, pitch and roll parameters based on determining there is elevation extent and controlling the mixing based on the updated source position; means for updating a source position for mixing using only the head-pose yaw parameter based on determining there is no elevation extent; means for determining whether the source position needs to be perceivable; means for controlling the mixing based on the updated positions based on the source actual position being determined as not being needed to be perceivable; and means for controlling the mixing based on a source position before updating based on the source actual position being determined as being needed to be perceivable.
The apparatus may further comprise means for receiving a user input for controlling the vertically spatially extending of the second part of the at least one audio signal.
The means for analysing at least one audio signal to determine spectral content of the at least one audio signal may comprise: means for analysing at least one audio signal to determine spectral content of the at least one audio signal; and means for storing a result of the analysis as metadata associated with the at least one audio signal prior to the spectrally extending of the at least one audio signal based on the metadata spectral content of the at least one audio signal.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example system for spatial audio mixing featuring original and spatially extended audio signals in the horizontal and vertical domain according to some embodiments;
Figure 2 shows a flow diagram of the operation of the system shown in figure 1;
Figure 3 shows schematically an example spatially extending synthesizer implementation shown in figure 1 in further detail according to some embodiments;
Figure 4 shows schematically an example mixer control system for the mixer shown in figure 1 according to some embodiments;
Figure 5 shows the operation of the example mixer shown in figure 4; and
Figure 6 shows schematically an example device suitable for implementing the apparatus shown in Figures 1, 3 and 5.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective audio signal generation, including the generation of volumetric virtual sound sources with elevation spatial extent. Elevation perception contributes to full spatial immersion, and it is therefore important to include it. It is known that human perception of sound source localization differs in azimuth and elevation (Butler, R.; Humanski, R.; Localization of sound in the vertical plane with and without high-frequency spectral cues, Perception & Psychophysics, 1992). Elevation perception requires high-frequency content to be present. Thus, spatial extent synthesis needs to focus on high frequencies when synthesizing elevation extent, or no elevation extent is perceived. Likewise, if there is no content at high frequencies, no elevation extent is perceived.
Without implementing the embodiments described hereafter there is no guarantee that the output of the spatially extending synthesizer for different sound types contains perceivable extent in the elevation domain.
A conventional approach to the capturing and mixing of sound sources with respect to an audio background or environment audio field signal would be for a professional producer to utilize an external microphone (a close or Lavalier microphone worn by the user, or a microphone attached to an instrument or some other microphone) to capture audio signals close to the sound source, and further utilize a 'background' microphone or microphone array to capture an environmental audio signal. These signals or audio tracks may then be manually mixed to produce an output audio signal such that the produced sound features the sound source coming from an intended (though not necessarily the original) direction.
The concept as discussed in detail hereafter is a system that implements spatially extending synthesis, which ensures that perceivable spatial extent in the elevation domain is also created. Synthesis of elevation extent increases the immersive effect of sound compared to spatially extending only in the azimuth. However, elevation based spatial extending requires a different approach than azimuth based spatial extending, as the listener perceives elevation strongly only with high frequency content. Thus, the embodiments as discussed herein show a system in two parts: firstly, the system is configured to analyse the input signal to determine whether there is sufficient high frequency content present and to generate high frequency content when necessary. Secondly, the system is configured to perform the spatially extending synthesis in the elevation domain as well as the azimuth domain.
With respect to figure 1 is shown an example system for generating a vertically and horizontally spatially extended audio signal according to some embodiments.
The system in some embodiments may comprise an audio signal input 101. In the example shown in figure 1 the audio signal input 101 is configured to receive or generate a mono audio signal. The mono audio signal may be one from a microphone such as an external microphone. The external microphone may be any microphone external or separate to a microphone array (for example a Lavalier microphone) which may capture a spatial audio signal. Thus the concept is applicable to any external/additional microphones, be they Lavalier microphones, hand held microphones, or mounted microphones. The external microphones can be worn/carried by persons or mounted as close-up microphones for instruments, or be a microphone in some relevant location which the designer wishes to capture accurately. A Lavalier microphone typically comprises a small microphone worn around the ear or otherwise close to the mouth. For other sound sources, such as musical instruments, the audio signal may be provided either by a Lavalier microphone or by an internal microphone system of the instrument (e.g., pick-up microphones in the case of an electric guitar) or an internal audio output (e.g., an electric keyboard output). In some embodiments the close microphone may be configured to output the captured audio signals to a mixer. The external microphone may be connected to a transmitter unit (not shown), which wirelessly transmits the audio signal to a receiver unit (not shown).
In some embodiments the positions of the external microphones, and thus of the performers and/or the instruments that are being played, may be tracked by using position tags located on or associated with the microphone source. Thus for example the external microphone comprises or is associated with a microphone position tag. The microphone position tag may be configured to transmit a radio signal such that an associated receiver may determine information identifying the position or location of the close microphone. It is important to note that microphones worn by people can be freely moved in the acoustic space, and a system supporting location sensing of wearable microphones has to support continuous sensing of the user or microphone location. The close microphone position tag may be configured to output this signal to a position tracker. Although the following examples show the use of the HAIP (high accuracy indoor positioning) radio frequency signal to determine the location of the close microphones, it is understood that any suitable position estimation system may be used (for example satellite-based position estimation systems, inertial position estimation, beacon based position estimation etc.).
Although in the following example the mono audio signal is determined from an external microphone, the audio signal may be a stored audio signal or a synthetic audio signal (for example a generated or significantly processed audio signal).
In some embodiments the system comprises a spectral content analyser 103. The spectral content analyser 103 may be configured to receive the audio signal (for example from the microphone). The spectral content analyser 103 may be configured to analyse the audio signal for its spectral content distribution.
As there needs to be enough spectral content above 3 kHz in order to perform the vertical spatial extension of the audio signal, the spectral content analyser 103 may be configured to check how much of the signal energy is located above a 3 kHz frequency boundary compared to the signal energy below the boundary. In some embodiments the frequency boundary may be any suitable frequency.
In some embodiments the spectral content analyser 103 is configured to determine whether the energy content of the audio signal above the boundary is greater than a determined threshold value and to control a spectral band extender 105, a vertical signal selector 107 and the horizontal and vertical spatially extending synthesizers 109, 111. In some embodiments the determined threshold value may be, for example, that at least 10% of the signal energy is located above 3 kHz. In some embodiments any other suitable threshold may be used, and the threshold may be adjustable or adjusted depending on a user or sound engineer's preferences.
Thus for example where the spectral content analyser 103 determines that there is enough energy present, the audio signal can be used as is without any spectral extending.
In some embodiments where there is not enough high-frequency energy present in the signal, then energy must be added to create good elevation extent perception. In such examples the spectral content analyser is configured to control the spectral band extender 105.
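Purely as an illustration of this analysis stage, the following sketch (not part of the original disclosure; the function names are hypothetical, while the 3 kHz boundary and the 10% threshold follow the text above) estimates the fraction of signal energy above the boundary:

```python
import numpy as np

def high_frequency_energy_ratio(x, sample_rate, boundary_hz=3000.0):
    """Fraction of the signal energy located above boundary_hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    energy = np.abs(spectrum) ** 2
    total = np.sum(energy)
    if total == 0.0:
        return 0.0
    return float(np.sum(energy[freqs >= boundary_hz]) / total)

def needs_spectral_extension(x, sample_rate, threshold=0.10):
    """True when less than 10% of the energy lies above 3 kHz."""
    return high_frequency_energy_ratio(x, sample_rate) < threshold
```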
In some embodiments the system comprises a spectral band extender 105. The spectral band extender 105 may be configured to receive the audio signal and furthermore receive a control input from the spectral content analyser 103. The spectral band extender 105 in some embodiments may be configured to add high (or higher) frequency spectral content to the audio signal in order to create energy for the vertical spatially extending synthesis.
In some embodiments the spectral band extender 105 is configured to add harmonic distortion to the signal with a specific distortion effect. However, to some listeners this can be perceived as annoying. In some embodiments the spectral band extender 105 is configured to apply specific spectral bandwidth extension methods such as spectral band replication (SBR). However any suitable spectral bandwidth extension methods may be implemented, such as for example those shown in Larsen, E; Aarts, R; Audio Bandwidth Extension, 2004 and Larsen, E; Aarts, R; Danessis, M; Efficient high-frequency bandwidth extension of music and speech. These methods are normally used, for example, in audio coding to reduce the used data rate for high frequencies. These implementations create energy in the higher frequency regions by picking or selecting content from lower frequencies, transposing the content to higher frequencies (i.e., pitch shifting or moving it in the spectral domain), and matching the assumed harmonic structure of the signal. In some embodiments where the signal is more noise-like, noise is added instead of or in addition to the harmonic signal. These spectral bandwidth extension processes may be performed until the specified threshold for energy is met. In some circumstances this spectral bandwidth extension process can generate artefacts in the output audio signal. Where only a small increase in energy is needed, in some embodiments the spectral bandwidth extension may be achieved by the spectral band extender applying a spectral tilt (i.e., a filter that amplifies higher frequencies slightly, e.g., by +3 to +6 dB), which may be more suitable. In some embodiments the spectral bandwidth extender is configured to apply speech codec based spectral extension, especially where the audio signal is determined to comprise a significant proportion of speech or voice activity.
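For the spectral tilt option, a minimal frequency-domain sketch could look as follows; the +4.5 dB boost sits within the +3 to +6 dB range mentioned above, and the smooth transition band below the boundary is an assumption made to avoid a hard spectral edge:

```python
import numpy as np

def apply_spectral_tilt(frame_spectrum, freqs, boundary_hz=3000.0, boost_db=4.5):
    """Amplify bins above boundary_hz by boost_db, with a linear ramp
    starting at half the boundary to avoid a hard spectral edge."""
    lo = 0.5 * boundary_hz
    ramp = np.clip((freqs - lo) / (boundary_hz - lo), 0.0, 1.0)
    gain = 10.0 ** (boost_db * ramp / 20.0)  # unity gain below the ramp
    return frame_spectrum * gain
```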
Thus in some embodiments the spectral band extender 105 generates one of three options with respect to the audio signal output to the vertical signal selector (a selection sketch follows the list):
1. No spectral bandwidth extension, where there is enough higher-frequency energy present.
2. High-frequency spectrum amplified with a spectral tilt, where a small amount of spectral bandwidth extension was required to reach the required higher-frequency energy level.
3. Spectral bandwidth extension used to create new spectral content, where a larger amount of spectral bandwidth extension was required to reach the required higher-frequency energy level.
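A simple selector between these three options, assuming the energy-ratio analysis sketched earlier (the margin separating the tilt and full-extension cases is a hypothetical tuning parameter), might read:

```python
def choose_extension_mode(hf_ratio, threshold=0.10, tilt_margin=0.05):
    """Select among the three options above: no extension, a spectral tilt
    for a small shortfall, or full spectral bandwidth extension."""
    if hf_ratio >= threshold:
        return "none"   # option 1: enough high-frequency energy present
    if hf_ratio >= threshold - tilt_margin:
        return "tilt"   # option 2: small boost via spectral tilt
    return "sbr"        # option 3: create new spectral content
```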
However, care must be taken not to make the sound too abnormal with these additions. In some embodiments the user of the system may be able to control the spectral band extension operation and thus tone down the extension if necessary.
In some embodiments, the system is implemented in a digital audio workstation (DAW). In this case, the different operations may be configured on a settings GUI of the DAW, and the sound engineer can customize the system operation to their liking.
The output of the spectral band extender 105 may be passed to the vertical signal selector 107.
In some embodiments the system comprises a vertical signal selector 107. The vertical signal selector 107 is configured to receive the input audio signal (from the microphone or synthetic sound generator) and the spectral bandwidth extended audio signal from the spectral band extender 105. Furthermore the vertical signal selector 107 is configured to receive a control signal from the spectral content analyser 103.
The vertical signal selector 107 is configured to select or filter the audio signals and pass the selected audio signals to the horizontal spatially extending synthesizer (H-Spatially Extending synthesizer) 109 and to the vertical spatially extending synthesizer (V-Spatially Extending synthesizer) 111. In some embodiments the vertical signal selector 107 is configured such that, as the lower frequencies are generally important for azimuth perception, more of the lower-frequency energy is selected for the horizontal extension. Furthermore the vertical signal selector 107 is configured such that, as higher frequencies are important for elevation perception, the higher frequency energy is selected for the vertical extension. Thus in some embodiments the vertical signal selector 107 is configured to divide the audio signal using the 3 kHz border with a large crossover bandwidth.
In some embodiments, in order to reduce any artefacts produced by a strict division of the audio signal, a different division of the audio signal can be used to create the separate signals for spatially extending the audio signal. Thus in some embodiments the vertical signal selector 107 is configured to divide the audio signal such that the horizontal audio signal, the audio signal passed to the horizontally spatially extending synthesizer 109, contains the whole bandwidth of the original signal to preserve the quality. In other words the vertical signal selector 107 is configured to output the original audio signal to the horizontal spatially extending synthesizer 109. The vertical signal selector 107 furthermore in some embodiments is configured such that if spectral bandwidth extension has been used to create most of the high-frequency energy, then the spectrally extended part of the audio signal is passed to the vertical spatially extending synthesiser 111 (in other words the extended spectral energy is used for the vertical spatial extension). Furthermore in some embodiments the vertical signal selector 107 is configured such that, where there is already enough or almost enough energy present at the high frequencies, a gradual frequency-dependent filtering of the original or spectrally extended audio signal is performed. For example in some embodiments the vertical signal selector 107 is configured to divide the audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at 3 kHz, with a rising curve for vertical content and a falling curve for horizontal content. In some embodiments a spectral tilt can additionally be applied to the vertical content if used.
In some embodiments where there is only a small amount of spectral bandwidth extension, a combination of the above methods can be used.
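The gradual frequency-dependent division described above can be sketched as per-bin amplitude weights. The exact crossover law around the 50/50 centre-point is not specified in detail, so the form below (both branches at -3 dB at the centre, the vertical weight rising and the horizontal weight falling at 3 dB per octave, each capped at unity) is an illustrative assumption:

```python
import numpy as np

def mixing_filter_weights(freqs, centre_hz=3000.0, slope_db_per_octave=3.0):
    """Amplitude weights per bin for the horizontal (falling) and vertical
    (rising) branches of the mixing filter."""
    octaves = np.log2(np.maximum(freqs, 1.0) / centre_hz)
    v_db = np.minimum(-3.0 + slope_db_per_octave * octaves, 0.0)  # rising
    h_db = np.minimum(-3.0 - slope_db_per_octave * octaves, 0.0)  # falling
    return 10.0 ** (h_db / 20.0), 10.0 ** (v_db / 20.0)
```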
In some embodiments the system may comprise a horizontal spatially extending synthesizer 109 configured to receive the horizontal audio signal and generate a horizontally (or azimuth) spatially extended signal 110 output which is passed to a spatial mixer 113.
In some embodiments the system may further comprise a vertical spatially extending synthesizer 111 configured to receive the vertical audio signal and generate a vertically (or elevation) spatially extended signal 112 output which is passed to the spatial mixer 113. The operation of the spatially extending synthesizer is described in further detail later.
In some embodiments the system may further comprise a spatial mixer 113. The spatial mixer 113 is configured to receive the horizontally (or azimuth) spatially extended signal 110 and the vertically (or elevation) spatially extended signal 112 and generate a horizontally and vertically (azimuth and elevation) spatially extended audio signal. The operation of the spatial mixer 113 is shown in further detail later.
With respect to figure 2 a flow diagram of the operation of the system shown in figure 1 is shown.
Firstly the audio signal is input, in the form of a microphone generated audio signal or a determined or generated synthetic audio signal, as shown in figure 2 by step 201.
Then the spectral content of the audio signal is analysed as shown in figure 2 by step 203.
The audio signal may be spectrally bandwidth extended based on the analysis of the spectral content of the audio signal as shown in figure 2 by step 205.
The audio signal and the spectrally bandwidth extended audio signal may then be divided or filtered based on the analysis of the spectral content of the audio signal as shown in figure 2 by step 207.
The horizontal audio signal output may be spatially extended by the application of the horizontal spatially extending synthesizer as shown in figure 2 by step 208.
The vertical audio signal output may be spatially extended by the application of the vertical spatially extending synthesizer as shown in figure 2 by step 209.
The horizontal spatially extended audio signal and the vertical spatially extended audio signal may then be combined using separate horizontal and vertical sources as shown in figure 2 by step 211.
The combined spatially extended audio signal may then be output as shown in figure 2 by step 213.
With respect to figure 3 an example spatially extending synthesizer is shown. The spatially extending synthesizer may be the horizontal and/or vertical spatially extending synthesizer 109/111. The input audio signals are different for each of the spatially extending synthesizers and furthermore the spatial arrangement of the 'desired loudspeaker' arrangement discussed herein will be different. The spatially extending synthesizer is configured to receive the input audio signal and generate a spatially extended audio signal. In some embodiments the spatially extending synthesizer is configured to receive a further input from the spectral content analyser 103. As described herein the spatially extending synthesizer 141 receives the input audio signal and spatially extends the audio signal to a defined spatial range using spatially extending methods.
In some embodiments where the audio signal input is a time domain signal the spatially extending synthesizer comprises a suitable time to frequency domain transformer. For example as shown in Figure 3 the spatially extending synthesizer comprises a Short-Time Fourier Transform (STFT) 401 configured to receive the audio signal and output a suitable frequency domain output. In some embodiments the input is a time-domain signal which is processed with a hop size of 512 samples. A processing frame of 1024 samples is used, formed from the current 512 samples and the previous 512 samples. The processing frame is Hann windowed and zero-padded to twice its length (2048 samples). The Fourier transform is calculated from the windowed frame producing the Short-Time Fourier Transform (STFT) output. The STFT output is symmetric, thus it is sufficient to process the positive half of 1024 samples plus the DC component, totalling 1025 samples. Although the STFT is shown in Figure 3 any suitable time to frequency domain transform may be used.
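The framing described above can be sketched as follows (the window-then-pad ordering is an assumption, as is common practice):

```python
import numpy as np

def stft_frame(current_512, previous_512):
    """One analysis frame: 1024 samples (previous + current hop), Hann
    windowed, zero-padded to 2048 samples; returns the 1025
    positive-frequency bins including the DC component."""
    frame = np.concatenate([previous_512, current_512])        # 1024 samples
    windowed = frame * np.hanning(len(frame))
    padded = np.concatenate([windowed, np.zeros(len(frame))])  # 2048 samples
    return np.fft.rfft(padded)                                 # 1025 bins
```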
In some embodiments the spatially extending synthesizer further comprises a filter bank 403. The filter bank 403 is configured to receive the output of the STFT 401 and, using a set of filters generated based on a Halton sequence (and with some default parameters), generate a number of frequency bands 405. In statistics, Halton sequences are sequences used to generate points in space for numerical methods such as Monte Carlo simulations. Although these sequences are deterministic, they are of low discrepancy, that is, they appear to be random for many purposes. In some embodiments the filter bank 403 comprises a set of 9 different distribution filters, which are used to create 9 different frequency domain signals where the signals do not contain overlapping frequency components. These signals are denoted Band 1 F 405₁ to Band 9 F 405₉ in Figure 3. The filtering can be implemented in the frequency domain by multiplying the STFT output with stored filter coefficients for each band.
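One way to realise such a Halton-sequence-based allocation of bins to the nine bands is sketched below; the precise allocation rule and the 'default parameters' are assumptions, but the construction guarantees that no bin appears in more than one band:

```python
import numpy as np

def halton(n, base=2):
    """First n values of the Halton (van der Corput) sequence in `base`."""
    seq = np.zeros(n)
    for i in range(n):
        f, k, value = 1.0, i + 1, 0.0
        while k > 0:
            f /= base
            value += f * (k % base)
            k //= base
        seq[i] = value
    return seq

def band_filters(num_bins=1025, num_bands=9):
    """0/1 filter coefficients assigning every STFT bin to exactly one band,
    so the resulting band signals share no frequency components."""
    band_of_bin = np.floor(halton(num_bins) * num_bands).astype(int)
    filters = np.zeros((num_bands, num_bins))
    filters[band_of_bin, np.arange(num_bins)] = 1.0
    return filters
```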
In some embodiments the spatially extending synthesizer 141 further comprises a spatial extent input 400. The spatial extent input 400 may be configured to define the spatial extent of the audio signal.
Furthermore in some embodiments the spatially extending synthesizer may further comprise an object position input/determiner 402. The object position input/determiner 402 may be configured to determine the spatial position of sound sources. This information may be determined in some embodiments by a sound object processor.
In some embodiments the spatially extending synthesizer may further comprise a band position determiner 404. The band position determiner 404 may be configured to receive the outputs from the object position input/determiner 402 and the spatial extent input 400 and from these generate an output passed to the vector base amplitude panning processor 406.
In some embodiments the spatially extending synthesizer 141 may further comprise a vector base amplitude panning (VBAP) processor 406. The VBAP processor 406 may be configured to generate control signals to control the panning of the frequency domain signals to desired spatial positions. Given the spatial position of the sound source (azimuth, elevation) and the desired spatial extent for the source (width in degrees), the system calculates a spatial position for each frequency domain signal. For example, if the spatial position of the sound source is zero degrees azimuth (front), and the spatial extent is 90 degrees azimuth (in the horizontal spatially extending synthesizer), the VBAP may position the frequency bands at azimuths 45, 33.75, 22.5, 11.25, 0, -11.25, -22.5, -33.75 and -45 degrees. Thus, a linear allocation of bands around the source position is used, with the span defined by the spatial extent.
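This linear allocation can be expressed compactly; with a source azimuth of 0 degrees and an extent of 90 degrees, the sketch below reproduces the positions quoted above:

```python
import numpy as np

def band_azimuths(source_azimuth_deg, extent_deg, num_bands=9):
    """Linearly allocate band positions across the extent, centred on the source."""
    half = extent_deg / 2.0
    return np.linspace(source_azimuth_deg + half,
                       source_azimuth_deg - half, num_bands)

print(band_azimuths(0.0, 90.0))
# [ 45.    33.75  22.5   11.25   0.   -11.25 -22.5  -33.75 -45.  ]
```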
The VBAP processor 406 may therefore be used to calculate a suitable gain for the signal, given the 'desired' loudspeaker positions. The VBAP processor 406 may provide gains for a signal such that it can be spatially positioned to a suitable position. These gains may be passed to a series of multipliers 407. In the following example the spatially extending synthesizer (or spatially extending controller) is implemented using a vector base amplitude panning operation. However it is understood that the spatial extent synthesis or spatially extending control may be implementation agnostic and any suitable implementation may be used to generate the spatially extending control. For example in some embodiments the spatially extending control may implement direct binaural panning (using head related transfer function filters for directions), direct assignment to the output channel locations (for example direct assignment to the loudspeakers without using any panning), synthesized ambisonics, or wave-field synthesis.
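For completeness, a minimal two-dimensional (azimuth-only) VBAP gain computation in the classic Pulkki formulation is sketched below; the loudspeaker layout and the power normalisation are assumptions, and a full implementation would also handle elevation:

```python
import numpy as np

def vbap_2d_gains(source_az_deg, speaker_az_deg):
    """2-D VBAP: find the loudspeaker pair bracketing the source direction,
    invert the 2x2 base matrix of the pair's unit vectors and normalise."""
    def unit(az_deg):
        a = np.radians(az_deg)
        return np.array([np.cos(a), np.sin(a)])

    p = unit(source_az_deg)
    order = np.argsort(speaker_az_deg)
    gains = np.zeros(len(speaker_az_deg))
    for i in range(len(order)):                      # adjacent pairs, wrapping
        a, b = order[i], order[(i + 1) % len(order)]
        base = np.column_stack([unit(speaker_az_deg[a]),
                                unit(speaker_az_deg[b])])
        try:
            g = np.linalg.solve(base, p)
        except np.linalg.LinAlgError:
            continue
        if np.all(g >= -1e-9):                       # source lies between pair
            gains[[a, b]] = g / np.linalg.norm(g)    # power normalisation
            break
    return gains

# e.g. an assumed 4.0 layout: FL, FR, RL, RR
print(vbap_2d_gains(10.0, np.array([45.0, -45.0, 135.0, -135.0])))
```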
In some embodiments the spatially extending synthesizer may further comprise a series of multipliers. In Figure 3 one multiplier is shown for each frequency band. Thus the series of multipliers comprises multipliers 407₁ to 407₉; however any suitable number of multipliers may be used. Each frequency domain band signal may be multiplied in the multiplier 407 with the determined VBAP gains.
The products of the VBAP gains and each frequency band signal may be passed to a series of output channel sum devices 409.
In some embodiments the spatially extending synthesizer may further comprise a series of sum devices 409. The sum devices 409 may receive the outputs from the multipliers and combine them to generate an output channel band signal 411. In the example shown in Figure 3, a 4.0 loudspeaker format output is implemented with outputs for front left (Band FL F 411₁), front right (Band FR F 411₂), rear left (Band RL F 411₃), and rear right (Band RR F 411₄) channels which are generated by sum devices 409₁, 409₂, 409₃ and 409₄ respectively. In some other embodiments other loudspeaker formats or numbers of channels can be supported.
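The multiply-and-sum structure of the multipliers 407 and sum devices 409 amounts to a gain-weighted accumulation of the band spectra into the output channels, for example as follows (the array shapes are assumptions):

```python
import numpy as np

def pan_bands_to_channels(band_spectra, band_gains):
    """Multiply each band's spectrum by its per-channel panning gains and
    sum into the output channels (4.0 in the example: FL, FR, RL, RR).
    band_spectra: (num_bands, num_bins), band_gains: (num_bands, num_channels)."""
    num_bands, num_bins = band_spectra.shape
    num_channels = band_gains.shape[1]
    out = np.zeros((num_channels, num_bins), dtype=complex)
    for b in range(num_bands):
        for c in range(num_channels):
            out[c] += band_gains[b, c] * band_spectra[b]
    return out
```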
Furthermore in some embodiments other panning methods can be used such as panning laws, or the signals could be assigned to the closest loudspeakers directly.
In some embodiments the spatially extending synthesizer may further comprise a series of inverse Short-Time Fourier Transforms (ISTFT) 413. For example as shown in Figure 3 there is an ISTFT 413₁ associated with the FL signal, an ISTFT 413₂ associated with the FR signal, an ISTFT 413₃ associated with the RL signal output and an ISTFT 413₄ associated with the RR signal. In other words the synthesizer provides N component audio signals to be played from different directions based on the spatial extent parameters. The signals are subjected to an Inverse Short-Time Fourier Transform (ISTFT) and overlap-added to produce time-domain outputs.
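A bare-bones reconstruction matching the framing used earlier might look as follows; windowing compensation is omitted for brevity, so this is illustrative only:

```python
import numpy as np

def istft_overlap_add(frames, hop=512, frame_len=1024):
    """Invert each frame's rfft and overlap-add the first frame_len samples
    of every 2048-point inverse transform at the given hop size."""
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, spectrum in enumerate(frames):
        time_frame = np.fft.irfft(spectrum)[:frame_len]
        out[i * hop:i * hop + frame_len] += time_frame
    return out
```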
With respect to figure 4, an example spatial mixer implementation according to some embodiments is shown. A spatial mixer may be implemented in some embodiments by combining the audio signals using a simple mixer. However, in some situations, for example those implementing a binaural representation where a normal head-tracking operation is used to change the directions of the signals, the vertically spatially extended signal may become the horizontally extended signal and vice versa. This happens when the elevation or tilt (also called pitch and roll) of the head are non-zero, and especially when either is ±90 degrees. In these situations the output audio signals may be incorrect. In some embodiments the system may comprise a head-locking mixer where the direction compensation driven by head-tracking does not affect the extended signal when the orientation changes. Figure 4 shows the example spatial mixer comprising a spectral band extension determiner 501. The spectral band extension determiner 501 may be configured to receive the head-tracker input (the head-pose in the form of yaw/pitch/roll or other) and the spatially extended audio signals, and be configured to determine whether the source has an elevation extent with a spectral band extension. This determination may be output to the extended position updater 503.
The mixer may furthermore comprise an extended position updater 503 configured to receive the determination from the spectral band extension determiner 501 and configured to update the sound source position using yaw, pitch and roll where it is determined that the source has an elevation extent and using yaw only otherwise.
The system may furthermore comprise a source perception determiner 505.
The source perception determiner may be configured to determine whether the source actual position is needed to be perceivable. The output of this determination can be used to control the original signal determiner 507.
In some embodiments the original signal determiner 507 can be configured to receive the original signal and furthermore the output of the source perception determiner 505. The original signal determiner can be configured to update the original signal position with the head-tracker (yaw/pitch/roll) and control the mixing of the audio signals based on this updating.
With respect to figure 5 the operation of the system mixer shown in figure 4 is shown by a flow diagram.
Firstly the extended signal is received as shown in figure 5 by step 601, the original signal is received as shown in figure 5 by step 603, and the head-pose information from the head-tracker is received as shown in figure 5 by step 605.
The source is then analysed to determine whether there is elevation extent with a spectral band extension as shown in figure 5 by step 609.
Where there is elevation extent with a spectral band extension, the source position is updated using the yaw, pitch and roll parameters from the head-pose and the mixing is controlled based on the updated position, as shown in figure 5 by step 609, and output as shown in figure 5 by step 610.
Where there is no elevation extent with a spectral band extension, the source position is updated using only the yaw parameter from the head-pose as shown in figure 5 by step 611.
The source is then analysed to determine whether the actual source position needs to be perceivable as shown in figure 5 by step 613.
Where the actual source position is determined as not needing to be perceivable, the mixing is controlled based on the updated positions and output as shown in figure 5 by step 614.
Where the actual source position is determined as needing to be perceivable, the original signal position is updated using the yaw, pitch and roll parameters from the head-pose and the mixing is controlled based on the updated positions, as shown in figure 5 by step 615, and output as shown in figure 5 by step 616.
In other words the mixer can be configured to control the mixing such that the correct horizontal/vertical audio signals are used where head-tracking is performed only in azimuth (also known as yaw). However, the above system enables taking into account that, if the direction of the source is important, a direct part of the sound needs to be included that has a point-like clear direction with full head-tracking.
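The decision flow of figures 4 and 5 can be summarised in a small sketch (the position representation and field names are hypothetical): it determines which head-pose components update the extended signal's position, and whether a fully head-tracked, point-like direct part of the original signal is mixed in:

```python
def mixing_position_updates(head_pose, has_elevation_extent,
                            position_must_be_perceivable):
    """head_pose is (yaw, pitch, roll). Returns the head-pose components
    used to update the extended signal's position, plus an optional fully
    tracked update for a point-like direct part of the original signal."""
    yaw, pitch, roll = head_pose
    if has_elevation_extent:
        # elevation extent with spectral band extension: full compensation
        return {"extended": (yaw, pitch, roll), "direct": None}
    updates = {"extended": (yaw, 0.0, 0.0), "direct": None}  # yaw-only update
    if position_must_be_perceivable:
        updates["direct"] = (yaw, pitch, roll)  # point-like direct part
    return updates
```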
In some embodiments in addition to automatic processing, the system could receive a user input to control the spatially extending synthesis. For example the user can decide whether to vertically spatially extend an audio signal and the amount of the vertical spatial extent. Furthermore in some embodiments the user can also monitor the output signal, for example a binaurally rendered version, and determine how aggressive the vertical extent and/or high frequency content creation algorithms are. For example, the user can determine that less vertical extent (than the automatic algorithm produces) is enough, and then control the system manually, for example by forcing the system to use a less aggressive high frequency content creation scheme and/or by making the extent narrower.
In some embodiments, as the parameters for controlling extent creation are dependent only on the input signal, it is possible to precompute the analysis beforehand and store it as separate metadata. For example this analysis may be integrated within an audio file input to the system, which can be very advantageous as the user can also tune the parameters beforehand. When the spectral bandwidth extent is then created, the parameters are fetched from the metadata and used to control the extent. Additionally, these parameters could be time dependent and thus change through time if the user so desires.
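As a sketch of such precomputed metadata (the file layout and field names are purely illustrative assumptions), the analysis results could be stored alongside the audio file and fetched at extent-creation time:

```python
import json

def store_extent_metadata(path, analysis):
    """Store precomputed spectral-content analysis next to an audio file so
    the extent-creation parameters can be fetched at playback time."""
    with open(path, "w") as f:
        json.dump(analysis, f)

# example: the parameters could also be made time dependent,
# e.g. keyed by frame time, if the user so desires
store_extent_metadata("guitar_take1.extent.json", {
    "hf_energy_ratio": 0.04,      # fraction of energy above 3 kHz
    "extension_mode": "sbr",      # none / tilt / sbr
    "vertical_extent_deg": 60.0,
})
```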
Thus for example the systems as described above may be employed where a user wants to spatialize a combination of bird sounds into fully spatial sound. Horizontal spatially extending of the audio signal is determined to be not enough, so the user may enable the application of vertical spatial extension of the audio signal. As bird song produces a relatively wideband signal, no spectral band extension is needed and the audio signal is simply divided into horizontal and vertical parts and reproduced as surrounding the listener.
Another example may be where the user wants to spatialize an electric guitar signal in a music mix for VR content. This is desired to be done both in the horizontal and vertical planes. As most of the spectral content for the guitar is below the 3 kHz mark, the system uses bandwidth extension to extend the frequencies for the vertical extension.
With respect to Figure 6 an example electronic device which may be used as the mixer and/or system is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1200 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
The device 1200 may comprise a microphone 1201. The microphone 1201 may comprise a plurality (for example a number N) of microphones. However it is understood that there may be any suitable configuration of microphones and any suitable number of microphones. In some embodiments the microphone 1201 is separate from the apparatus and the audio signal is transmitted to the apparatus by a wired or wireless coupling. The microphone 1201 may in some embodiments be the microphone array as shown in the previous figures.
The microphone may be a transducer configured to convert acoustic waves into suitable electrical audio signals. In some embodiments the microphone can be a solid state microphone. In other words the microphone may be capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone 1201 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or micro-electrical-mechanical system (MEMS) microphone. The microphone can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 1203.
The device 1200 may further comprise an analogue-to-digital converter 1203. The analogue-to-digital converter 1203 may be configured to receive the audio signals from the microphone 1201 and convert them into a format suitable for processing. In some embodiments where the microphone is an integrated microphone the analogue-to-digital converter is not required. The analogue-to-digital converter 1203 can be any suitable analogue-to-digital conversion or processing means. The analogue-to-digital converter 1203 may be configured to output the digital representations of the audio signal to a processor 1207 or to a memory 1211.
In some embodiments the device 1200 comprises at least one processor or central processing unit 1207. The processor 1207 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1200 comprises a memory 1211. In some embodiments the at least one processor 1207 is coupled to the memory 1211. The memory 1211 can be any suitable storage means. In some embodiments the memory 1211 comprises a program code section for storing program codes implementable upon the processor 1207. Furthermore in some embodiments the memory 1211 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1207 whenever needed via the memory-processor coupling.
In some embodiments the device 1200 comprises a user interface 1205. The user interface 1205 can be coupled in some embodiments to the processor 1207. In some embodiments the processor 1207 can control the operation of the user interface 1205 and receive inputs from the user interface 1205. In some embodiments the user interface 1205 can enable a user to input commands to the device 1200, for example via a keypad. In some embodiments the user interface 1205 can enable the user to obtain information from the device 1200. For example the user interface 1205 may comprise a display configured to display information from the device 1200 to the user. The user interface 1205 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1200 and further displaying information to the user of the device 1200. In some embodiments the user interface 1205 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1200 comprises a transceiver 1209. The transceiver 1209 in such embodiments can be coupled to the processor 1207 and configured to enable communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 1209 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
For example as shown in Figure 6 the transceiver 1209 may be configured to communicate with the renderer as described herein.
The transceiver 1209 can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver 1209 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IrDA).
In some embodiments the device 1200 may be employed as at least part of the renderer. As such the transceiver 1209 may be configured to receive the audio signals and positional information from the microphone/close microphones/position determiner as described herein, and generate a suitable audio signal rendering by using the processor 1207 executing suitable code. The device 1200 may comprise a digital-to-analogue converter 1213. The digital-to-analogue converter 1213 may be coupled to the processor 1207 and/or memory 1211 and be configured to convert digital representations of audio signals (such as from the processor 1207 following an audio rendering of the audio signals as described herein) to a suitable analogue format for presentation via an audio subsystem output. The digital-to-analogue converter (DAC) 1213 or signal processing means can in some embodiments be any suitable DAC technology.
Furthermore the device 1200 can comprise in some embodiments an audio subsystem output 1215. The example shown in Figure 6 shows the audio subsystem output 1215 as an output socket configured to enable a coupling with headphones 121. However the audio subsystem output 1215 may be any suitable audio output or a connection to an audio output. For example the audio subsystem output 1215 may be a connection to a multichannel speaker system.
In some embodiments the digital to analogue converter 1213 and audio subsystem 1215 may be implemented within a physically separate output device. For example the DAC 1213 and audio subsystem 1215 may be implemented as cordless earphones communicating with the device 1200 via the transceiver 1209.
Although the device 1200 is shown having audio capture, audio processing and audio rendering components, it would be understood that in some embodiments the device 1200 can comprise just some of these elements.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Claims
1. An apparatus for generating a spatially extended audio signal, the apparatus configured to:
analyse at least one audio signal to determine spectral content of the at least one audio signal;
determine whether to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal is to include a determined portion of frequencies above a defined frequency; and
vertically spatially extend at least part of the at least one audio signal when the determined spectral content of the at least one audio signal is to be processed.
2. The apparatus as claimed in claim 1, further configured to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal includes a determined portion of frequencies above a defined frequency defined as a determined portion of energy of the audio signal above a defined frequency, wherein the at least part of the at least one audio signal is at least part of the spectrally extended at least one audio signal.
3. The apparatus as claimed in any of claims 1 and 2, further configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal, wherein the at least part of the at least one audio signal is the first part of the at least one audio signal.
4. The apparatus as claimed in claim 3, further configured to:
horizontally spatially extend at least part of the at least one audio signal; and combine the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising horizontal and vertical spatial extent.
5. The apparatus as claimed in any of claims 1 to 4, further configured to at least one of:
receive the at least one audio signal from a microphone; and
generate the at least one audio signal in a synthetic sound generator.
6. The apparatus as claimed in any of claims 1 to 5, configured to analyse at least one audio signal to determine spectral content of the at least one audio signal is further configured to:
determine a first energy content of the at least one audio signal below a first frequency value; and
determine a second energy content of the at least one audio signal above the first frequency value.
7. The apparatus as claimed in claim 6, wherein the first frequency value is 3 kHz.
8. The apparatus as claimed in any of claims 6 and 7, when dependent on claim 2, configured to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal is configured to apply at least one of:
add content with a frequency above the first frequency value to the at least one audio signal;
apply spectral band replication to the at least one audio signal;
select content from lower frequencies of the at least one audio signal, transpose the content to higher frequencies, and match a harmonic structure of the signal;
add noise with a frequency above the first frequency value to the at least one audio signal; and
apply a spectral tilt which amplifies higher frequencies above the first frequency value to the at least one audio signal.
9. The apparatus as claimed in any of claims 6 to 8, when dependent on claim 3 configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal is configured to divide the at least one audio signal into the second part below a determined frequency value and the first part above the determined frequency value.
10. The apparatus as claimed in any of claims 6 to 8, when dependent on claims 3 and 2, configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal is configured to divide the at least one audio signal into the second part comprising the at least one audio signal without spectral extensions and the first part comprising the spectral extensions of the at least one audio signal.
11. The apparatus as claimed in any of claims 6 to 8, when dependent on claim 3, configured to divide the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal is configured to divide the at least one audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at a determined frequency value, wherein the first part comprises a high pass filter version of the mixing filter and the second part comprises a low pass filter version of the mixing filter.
12. The apparatus as claimed in any of claims 6 to 8 when dependent on claim 4, configured to combine the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising vertical spatial extent is configured to mix the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter.
13. The apparatus as claimed in claim 12, configured to mix the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter set of yaw, pitch and roll is configured to:
analyse the at least one audio signal to determine whether there is elevation extent with a spectral band extension;
update a source position for mixing using all of the head-pose yaw, pitch and roll parameters based on determining there is elevation extent and controlling the mixing based on the updated source position;
update a source position for mixing using only head-pose yaw parameter based on determining there is no elevation extent;
determine whether the source position needs to be perceivable;
control the mixing based on the updated source position when the actual source position is determined as not needing to be perceivable; and
control the mixing based on the source position before updating when the actual source position is determined as needing to be perceivable.
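By way of illustration only: a sketch of the decision flow of claim 13, assuming Cartesian source positions and a z-y-x (yaw-pitch-roll) rotation convention, which the claim itself does not specify.

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """Yaw about z, pitch about y, roll about x, all in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return rz @ ry @ rx

def position_for_mixing(source_pos, yaw, pitch, roll,
                        has_elevation_extent, must_be_perceivable):
    """Select the source position used to control the mixing."""
    source_pos = np.asarray(source_pos, dtype=float)
    if has_elevation_extent:
        # Elevation extent present: track the full head pose.
        updated = rotation_matrix(yaw, pitch, roll) @ source_pos
    else:
        # No elevation extent: only yaw matters for the mix.
        updated = rotation_matrix(yaw, 0.0, 0.0) @ source_pos
    # If the actual source position must remain perceivable, mix at the
    # pre-update position; otherwise follow the head-pose-updated one.
    return source_pos if must_be_perceivable else updated
```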
14. The apparatus as claimed in any of claims 1 to 13, further configured to receive a user input for controlling the vertically spatially extending of the second part of the at least one audio signal.
15. The apparatus as claimed in any of claims 1 to 14, wherein the apparatus configured to analyse at least one audio signal to determine spectral content of the at least one audio signal is configured to (an illustrative sketch follows this claim):
analyse at least one audio signal to determine spectral content of the at least one audio signal; and
store a result of the analysis as metadata associated with the at least one audio signal prior to the apparatus spectrally extending the at least one audio signal based on the spectral content stored as the metadata.
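By way of illustration only: the analyse-once, store-as-metadata pattern of claim 15 can be sketched as below. The dictionary layout is an assumption, and `analyse_spectral_content` and `spectrally_extend` are hypothetical callables standing in for the analysis and extension stages.

```python
def tag_with_spectral_metadata(x, fs, analyse_spectral_content):
    """Run the spectral analysis once and store the result with the signal."""
    return {"samples": x, "rate": fs,
            "spectral_content": analyse_spectral_content(x, fs)}

def extend_using_metadata(tagged, spectrally_extend):
    """Later stage: read the stored analysis instead of re-analysing."""
    return spectrally_extend(tagged["samples"], tagged["rate"],
                             tagged["spectral_content"])
```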
16. A method for generating a spatially extended audio signal, comprising:
analysing at least one audio signal to determine spectral content of the at least one audio signal;
determining whether to spectrally extend the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal is to include a determined portion of frequencies above a defined frequency; and
vertically spatially extending at least part of the at least one audio signal when the determined spectral content of the at least one audio signal is to be processed.
17. The method as claimed in claim 16, further comprising spectrally extending the at least one audio signal based on the spectral content of the at least one audio signal, such that the at least one audio signal includes a determined portion of frequencies above a defined frequency defined as a determined portion of energy of the audio signal above a defined frequency, wherein the at least part of the at least one audio signal is at least part of the spectrally extended at least one audio signal.
18. The method as claimed in any of claims 16 and 17, further comprising dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal, wherein the at least part of the at least one audio signal is the first part of the at least one audio signal.
19. The method as claimed in claim 18, further comprising:
horizontally spatially extending at least part of the at least one audio signal; and
combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising horizontal and vertical spatial extent.
20. The method as claimed in any of claims 16 to 19, further comprising at least one of:
receiving the at least one audio signal from a microphone; and
generating the at least one audio signal in a synthetic sound generator.
21. The method as claimed in any of claims 16 to 20, wherein analysing at least one audio signal to determine spectral content of the at least one audio signal further comprises:
determining a first energy content of the at least one audio signal below a first frequency value; and
determining a second energy content of the at least one audio signal above the first frequency value.
22. The method as claimed in claim 21, wherein the first frequency value is 3 kHz.
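By way of illustration only: the first and second energy contents of claims 21 and 22, computed here from an FFT power spectrum with the 3 kHz split of claim 22 as the default.

```python
import numpy as np

def energy_contents(x, fs, f1=3000.0):
    """Return (energy below f1, energy above f1) for signal x."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    return spec[f < f1].sum(), spec[f >= f1].sum()
```

Comparing the two values, for example testing their ratio against a threshold, gives the kind of decision input that claim 16 uses to determine whether spectral extension is needed.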
23. The method as claimed in any of claims 21 and 22, when dependent on claim 17, wherein spectrally extending the at least one audio signal based on the spectral content of the at least one audio signal comprises at least one of the following (an illustrative sketch of the transposition option follows this list):
adding content with a frequency above the first frequency value to the at least one audio signal;
applying spectral band replication to the at least one audio signal;
selecting content from lower frequencies of the at least one audio signal, transposing the content to higher frequencies, and matching a harmonic structure of the signal;
adding noise with a frequency above the first frequency value to the at least one audio signal; and
applying a spectral tilt which amplifies higher frequencies above the first frequency value to the at least one audio signal.
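By way of illustration only: a crude FFT-domain stand-in for the transposition option of claim 23. Bins in the octave below the first frequency value are copied to twice their frequency, which keeps the transposed content harmonically related to the original; production spectral band replication instead operates on QMF sub-bands with envelope adjustment, and the -6 dB gain here is an illustrative assumption.

```python
import numpy as np

def transpose_band_up(x, fs, f1=3000.0, gain_db=-6.0):
    """Copy the octave [f1/2, f1) up to [f1, 2*f1) by mapping bin f -> 2f."""
    n = len(x)
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(n, 1.0 / fs)
    src = np.flatnonzero((f >= f1 / 2) & (f < f1))
    dst = 2 * src                      # doubling the bin index doubles the frequency
    keep = dst < len(X)                # stay inside the spectrum
    X[dst[keep]] += 10.0 ** (gain_db / 20.0) * X[src[keep]]
    return np.fft.irfft(X, n)
```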
24. The method as claimed in any of claims 21 to 23, when dependent on claim 18, wherein dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal comprises dividing the at least one audio signal into the second part below a determined frequency value and the first part above the determined frequency value.
25. The method as claimed in any of claims 21 to 23, when dependent on claims 18 and 17, wherein dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal comprises dividing the at least one audio signal into the second part comprising the at least one audio signal without the spectral extensions and the first part comprising the spectral extensions of the at least one audio signal.
26. The method as claimed in any of claims 21 to 23, when dependent on claim 18, wherein dividing the at least one audio signal into a first part and a second part based on the spectral content of the at least one audio signal comprises dividing the at least one audio signal using a 3 dB per octave mixing filter with a 50/50 centre-point at a determined frequency value, wherein the first part comprises a high pass filter version of the mixing filter and the second part comprises a low pass filter version of the mixing filter.
27. The method as claimed in any of claims 21 to 23, when dependent on claim 19, wherein combining the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal to generate at least one spatially extended audio signal comprising vertical spatial extent comprises mixing the horizontally spatially extended at least part of the
at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on a user input defining a head-pose parameter.
28. The method as claimed in claim 27, wherein the user input defines a head-pose parameter set of yaw, pitch and roll, and wherein mixing the horizontally spatially extended at least part of the at least one audio signal and the vertically spatially extended at least part of the at least one audio signal based on the user input comprises:
analysing the at least one audio signal to determine whether there is elevation extent with a spectral band extension;
updating a source position for mixing using all of the head-pose yaw, pitch and roll parameters based on determining there is elevation extent, and controlling the mixing based on the updated source position;
updating a source position for mixing using only the head-pose yaw parameter based on determining there is no elevation extent;
determining whether the source position needs to be perceivable;
controlling the mixing based on the updated source position when the actual source position is determined as not needing to be perceivable; and
controlling the mixing based on the source position before updating when the actual source position is determined as needing to be perceivable.
29. The method as claimed in any of claims 16 to 28, further comprising receiving a user input for controlling the vertically spatially extending of the second part of the at least one audio signal.
30. The method as claimed in any of claims 16 to 29, wherein analysing at least one audio signal to determine spectral content of the at least one audio signal comprises:
analysing at least one audio signal to determine spectral content of the at least one audio signal; and
storing a result of the analysis as metadata associated with the at least one audio signal prior to spectrally extending the at least one audio signal based on the spectral content stored as the metadata.
Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
GB1706287.8 | 2017-04-20 | |
GB1706287.8A GB2561594A (en) | 2017-04-20 | 2017-04-20 | Spatially extending in the elevation domain by spectral extension
Publications (1)

Publication Number | Publication Date
---|---
WO2018193161A1 (en) | 2018-10-25
Family

ID=58795702

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/FI2018/050274 WO2018193161A1 (en) | Spatially extending in the elevation domain by spectral extension | 2017-04-20 | 2018-04-19
Family Cites Families (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US8428269B1 * | 2009-05-20 | 2013-04-23 | The United States Of America As Represented By The Secretary Of The Air Force | Head related transfer function (HRTF) enhancement for improved vertical-polar localization in spatial audio systems
JP2015163909A * | 2014-02-28 | 2015-09-10 | Fujitsu Limited | Acoustic reproduction device, acoustic reproduction method, and acoustic reproduction program

Filing history:
- 2017-04-20: GB application GB1706287.8A (published as GB2561594A), status: not active, withdrawn
- 2018-04-19: PCT application PCT/FI2018/050274 (published as WO2018193161A1), status: active, application filing
Patent Citations (4)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US20070127748A1 * | 2003-08-11 | 2007-06-07 | Simon Carlile | Sound enhancement for hearing-impaired listeners
WO2010086461A1 * | 2009-01-28 | 2010-08-05 | Dolby International Ab | Improved harmonic transposition
US20100262427A1 * | 2009-04-14 | 2010-10-14 | Qualcomm Incorporated | Low complexity spectral band replication (SBR) filterbanks
US8804971B1 * | 2013-04-30 | 2014-08-12 | Dolby International Ab | Hybrid encoding of higher frequency and downmixed low frequency content of multichannel audio
Non-Patent Citations (1)

HABIGT, T. et al.: "Enhancing 3D Audio Using Blind Bandwidth Extension", AES 129th Convention, 4-7 November 2010, San Francisco, CA, USA, pages 1-5, XP055548817. Retrieved from the Internet: <http://mediatum.ub.tum.de/doc/1070615/597003.pdf> [retrieved on 2018-08-30] *
Also Published As

Publication number | Publication date
---|---
GB2561594A | 2018-10-24
GB201706287D0 | 2017-06-07
Similar Documents

Publication | Title
---|---
JP7683101B2 | Generating binaural audio in response to multi-channel audio using at least one feedback delay network
JP6818841B2 | Generating binaural audio in response to multi-channel audio using at least one feedback delay network
JP4921470B2 | Method and apparatus for generating and processing parameters representing head related transfer functions
US9131305B2 | Configurable three-dimensional sound system
US8081762B2 | Controlling the decoding of binaural audio signals
CN112806030B | Method and apparatus for processing spatial audio signals
EP3446309A1 | Merging audio signals with spatial metadata
EP3320692A1 | Spatial audio processing apparatus
GB2543275A | Distributed audio capture and mixing
WO2018193163A1 | Enhancing loudspeaker playback using a spatial extent processed audio signal
JP2024028526A | Sound field related rendering
US10708679B2 | Distributed audio capture and mixing
CN114270878B | A method and device for sound field correlation rendering
WO2018193162A2 | Audio signal generation for spatial audio mixing
JP2022502872A | Methods and equipment for bass management
US11388540B2 | Method for acoustically rendering the size of a sound source
EP3613043B1 | Ambience generation for spatial audio mixing featuring use of original and extended signal
WO2018193161A1 | Spatially extending in the elevation domain by spectral extension
KR101993585B1 | Apparatus realtime dividing sound source and acoustic apparatus
Tom | Automatic mixing systems for multitrack spatialization based on unmasking properties and directivity patterns
Legal Events

Code | Title | Description
---|---|---
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18787261; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 18787261; Country of ref document: EP; Kind code of ref document: A1