WO2007036844A2 - Method and apparatus for automatic structure analysis of audio - Google Patents
Method and apparatus for automatic structure analysis of audio
- Publication number
- WO2007036844A2 (PCT/IB2006/053388)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- content item
- audio content
- location
- outro
- time interval
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
- G10H1/0025—Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/46—Volume control
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/076—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction of timing, tempo; Beat detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/101—Music Composition or musical creation; Tools or processes therefor
- G10H2210/125—Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/025—Envelope processing of music signals in, e.g. time domain, transform domain or cepstrum domain
- G10H2250/031—Spectrum envelope processing
Abstract
Method and apparatus for determining the intro and/or outro of an audio content item in which the location (EOIBS; SOOBS) at which the beat strength of the audio content item remains above a first threshold level for a minimum time interval and the location (EOILVL; SOOLVL) at which the signal level of the audio content item remains above a second threshold level for a minimum time interval are used to identify the end of the intro (EOI) and/or the start of the outro (SOO).
Description
Method and apparatus for automatic structure analysis of audio
FIELD OF THE INVENTION
The present invention relates to a method and apparatus for automatic analysis of the structure of an audio content item. In particular, but not exclusively, it relates to locating the intro and outro of a music track.
BACKGROUND OF THE INVENTION
Generally a music track (or song) can be segmented into three major parts: the intro, the meat and the outro. The intro is the portion of the song which gradually builds up to the steady-state part of the song, and the outro is the portion of the song that gradually fades away from the steady-state part of the song. The steady-state part of the song, which more or less defines the properties of the song such as genre, mode, etc., is referred to as the meat of the song. In an automatic DJ (hereinafter referred to as "AutoDJ"), the aim is to provide a smooth transition between music tracks, reflected by a smooth continuation of rhythmical behaviours such as tempo, playback level, etc. This can prove difficult in the intro and/or outro sections of the audio track as they generally differ greatly from the main part of the track. It is thus desirable to automatically determine, and if necessary remove, the intro and/or outro of a song in order to enable a rhythmically and artistically smooth transition between the tracks.
A generic schematic of the function of a commonly used AutoDJ system is shown in Figure 1. First, music tracks stored in database 101 are analysed to extract representative parameters 103. These include, among other things, the end of the intro section, the beginning of the outro section, phrase or bar boundaries, tempo, beat locations and harmonic signature. These parameters are usually computed offline and stored in a linked database 105. A playlist generator 107 uses the pre-computed representative parameters in the database 105 and a set of user preferences 117 to generate a playlist at the output of the playlist generator 107. Given such a playlist, a transition planner 109 compares the extracted parameters corresponding to the music tracks in the playlist and generates a set of mix commands to be used by the player 111. The mix commands are such that an appropriately defined penalty function combining user preferences and similarity between the computed features is minimised.
Finally, the player 111 streams the music tracks in the playlist to the output device 113, for example a loudspeaker. For a smooth and natural transition between the tracks in the playlist, it is desirable to rhythmically synchronise the songs to be mixed. This implies equalisation of tempo and synchronisation of selected events of successive playlist entries. The selection of the events has a high impact on the perceived artistic quality of the transition between the music tracks. For transitions where tempo and level are intended to slide smoothly from one track to the next, intro and outro areas that differ greatly from the rest of the song must be identified and excluded from the streaming.
Some attempts have been made in existing automatic DJ systems to determine the location of the intro and/or outro of a music track. However, these have proved to be unreliable.
SUMMARY OF THE INVENTION
Therefore, it is desirable to provide a reliable method to determine the intro/outro areas of an audio content item, in particular the intro/outro areas of songs to be mixed. These areas can then be exempted from the beat-mixing process or, preferably, to minimise complexity, synchronisation events are simply not determined in these areas. It is understood that the audio content item may be the analogue or digital audio component of a multimedia file, or it may be a music track or song.
According to an aspect of the present invention, there is provided a method and apparatus for determining the intro and/or outro of an audio content item, comprising: determining a location at which a beat strength of the audio content item remains above a first threshold level for a minimum time interval to identify the end of the intro and/or start of the outro.
In this way a reliable method is provided to determine the location of the intro and outro sections by use of the beat strength of the audio content item. The beat strength of an audio content item reflects the spectral and power properties of a beat and a measure of its conformance to the other beats in the audio content item. The article "Human perception and computer extraction of musical beat strength" by George Tzanetakis, Georg Essl and Perry Cook (Proc. of the 5th Int. Conference on Digital Audio Effects, Hamburg, Germany, September 26-28, 2002) gives an example of how beat strength can be determined.
In order to locate the intro and/or the outro areas, a significant and definite change in beat strength is detected. For that purpose, statistical filters may be applied to the data. Possible filters are moving median and moving minimum functions. In the filtered functions, locations where a certain threshold is "definitely" crossed are determined. The value of the threshold is preferably related to a representation of the beat strength through the entire audio content item, for example its mean, median or peak value.
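As a concrete illustration (not part of the patent), the sketch below shows such statistical filters in Python/NumPy: a moving minimum and a moving median over a forward-looking window, plus a threshold derived from a track-wide statistic. The window lengths and the 0.5 fraction of the track median are illustrative assumptions, not values prescribed by this document.

```python
import numpy as np

def moving_minimum(x: np.ndarray, window: int) -> np.ndarray:
    """Moving minimum over the forward-looking window [k, k+window]."""
    n = len(x)
    return np.array([x[k:min(k + window + 1, n)].min() for k in range(n)])

def moving_median(x: np.ndarray, window: int) -> np.ndarray:
    """Moving median over the forward-looking window [k, k+window]."""
    n = len(x)
    return np.array([np.median(x[k:min(k + window + 1, n)]) for k in range(n)])

def threshold_from_track(values: np.ndarray, fraction: float = 0.5) -> float:
    """Threshold tied to a whole-track statistic, here a fraction of the
    median (the fraction is an illustrative choice)."""
    return fraction * float(np.median(values))
```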
The end of the intro may be determined as the location at which the beat strength of the audio content item remains, for the first time, above the first threshold level for a minimum time interval and the start of the outro may be determined as the location at which the beat strength of the audio content item remains, for the last time, above the first threshold level for a minimum time interval.
In this way, the end of the intro is found by searching from the start of the audio file and demanding that the beat strength evolution remains above the threshold for a minimum amount of time. The outro can be located in a similar way, but searching from the end of the audio file.
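A minimal sketch of this two-sided search, assuming the filtered beat-strength evolution is available as a NumPy array and interpreting "remains above the threshold" as a run of at least min_len consecutive samples:

```python
import numpy as np

def first_sustained_crossing(m: np.ndarray, threshold: float, min_len: int) -> int:
    """Start of the first region where m stays above threshold for at
    least min_len samples (candidate end of intro); -1 if none exists."""
    run = 0
    for k, above in enumerate(m > threshold):
        run = run + 1 if above else 0
        if run >= min_len:
            return k - min_len + 1  # start of the qualifying region
    return -1

def last_sustained_crossing(m: np.ndarray, threshold: float, min_len: int) -> int:
    """End of the last such region (candidate start of outro), found by
    running the same search from the end of the track; -1 if none exists."""
    k = first_sustained_crossing(m[::-1], threshold, min_len)
    return len(m) - 1 - k if k >= 0 else -1
```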
The beat strength may be derived from the moving minimum of the tempo data of the audio content item.
Preferably, the method further comprises the step of determining the location at which the signal level is above a second threshold level for a minimum time interval. In this way, the end of the intro is determined as the maximum of the location at which the beat strength of the audio content item remains, for the first time, above the first threshold level for a minimum time interval and the location at which the signal level remains, for the first time, above the second threshold level for a minimum time interval. Correspondingly, the start of the outro is determined as the minimum of the location at which the beat strength of the audio content item remains, for the last time, above the first threshold level for a minimum time interval and the location at which the signal level remains, for the last time, above the second threshold level for a minimum time interval.
In the preferred embodiment, the beat onsets are identified and the evolution of the corresponding beat strengths is recorded. In parallel, the evolution of the signal level representation in the music track is determined and recorded as well. This representation can, for example, be the RMS value in a window that has the same order of magnitude as the beat period of a song. For maximal correlation between information derived from beat strength and signal level, it is preferable to synchronise the level detection time windows with the beat locations, i.e. measuring the level between beats.
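A possible realisation of such beat-synchronous level measurement, assuming the beat onsets are available as strictly increasing sample indices into the waveform; the RMS window is exactly one inter-beat interval, in line with the suggestion above. This is a sketch, not the patent's implementation.

```python
import numpy as np

def beat_synchronous_rms(signal: np.ndarray, beat_onsets: np.ndarray) -> np.ndarray:
    """RMS level measured between consecutive beat onsets (sample indices),
    so the level-detection windows are synchronised with the beat locations."""
    levels = []
    for start, end in zip(beat_onsets[:-1], beat_onsets[1:]):
        frame = signal[start:end].astype(float)
        levels.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(levels)
```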
The signal level may be derived from the moving median of the signal level data of the audio content item.
BRIEF DESCRIPTION OF DRAWINGS
For a more complete understanding of the present invention, reference is made to the following description taken in conjunction with the accompanying drawings, in which:
Figure 1 is a schematic diagram of the functions of a known automatic DJ system;
Figure 2 is a schematic diagram of an algorithm for determining the beat strength for locating the intro and/or outro according to an embodiment of the present invention; and
Figure 3 is a graphical representation of determining the location of the intro and/or outro according to the embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
With reference to Figure 2, a basic algorithm for detecting the beat and beat strength of an audio content item will be described.
The algorithm 200 for detecting the beat and beat strength of an input audio content item x[n] comprises framing means 203 which is connected to an input 201 of the algorithm 200. The input 201 is provided with the input audio content item (track) x[n]. The output of the framing means 203 is connected to a Fast Fourier Transform unit (FFT unit) 205. The output of the FFT unit 205 is connected to a weighted energy computing unit 207 in which a weighted energy per frame is computed. If E[k] represents the output of the energy computing unit and Xk[m] the FFT of the k-th frame signal, then E[k] is computed as

E[k] = Σm w[m] |Xk[m]|²

where w[m] is the weighting value of the m-th FFT bin. The output of the weighted energy computing unit 207 is connected to a buffer 209 where the function E[k] is stored for further processing. Subsequently, in the event detector unit 211, the energy function E[k] is evaluated to determine hypotheses on the position and strength of the beats. The output of the event detector 211 is then provided to an event selector 213. The event selector 213 is additionally provided with a rough estimate of the tempo in beats per minute (BPM) of the content on the input 215 of the algorithm 200. This rough estimate could, for example, be computed by analysing the harmonic property of the energy signal E[k] or obtained from the
metadata of the song. The output of the event selector 213 provides the tempo in beats per minute (BPM), beat onset positions and the beat strength on the output 217 of the algorithm 200.
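The weighted per-frame energy stage might be sketched as follows in Python/NumPy. The frame length, hop size and weighting vector w are assumptions of this sketch; this document does not specify them.

```python
import numpy as np

def weighted_frame_energy(x: np.ndarray, frame_len: int, hop: int,
                          w: np.ndarray) -> np.ndarray:
    """E[k] = sum_m w[m] * |Xk[m]|^2: weighted spectral energy of frame k,
    where Xk is the FFT of the k-th frame and w weights the FFT bins.
    w must cover the one-sided spectrum (frame_len // 2 + 1 bins)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    E = np.empty(n_frames)
    for k in range(n_frames):
        frame = x[k * hop : k * hop + frame_len]
        X = np.fft.rfft(frame)          # one-sided spectrum of the frame
        E[k] = np.sum(w[: len(X)] * np.abs(X) ** 2)
    return E
```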
The algorithm is based on statistics of the spectral evolution of the audio track. The output of the algorithm provides the estimates of beat onsets and instantaneous tempo. It also calculates certain parameters related to each detected beat, reflecting among other things its spectral and power related properties as well as a measure of conformance to the other beats in the song. This value is referred to as the beat strength.
The method according to an embodiment of the present invention delivers four candidate locations in the music track:
- the level-based End Of Intro (EOILVL), indicating the location of the start of a region in which, for the first time, the signal level of the audio content item exceeds and remains above a threshold for at least a certain time interval;
- the beat-strength-based End Of Intro (EOIBS), indicating the location of the start of a region in which, for the first time, the beat strength of the audio content item exceeds and remains above a threshold for at least a certain time interval;
- the level-based Start Of Outro (SOOLVL), indicating the location of the end of a region in which, for the last time, the signal level of the audio content item exceeds and remains above a threshold for at least a certain time interval; and
- the beat-strength-based Start Of Outro (SOOBS), indicating the location of the end of a region in which, for the last time, the beat strength of the audio content item exceeds and remains above a threshold for at least a certain time interval.
The definitive End Of Intro (EOI) and Start Of Outro (SOO) are calculated using the relations
EOI = max{EOILVL, EOIBS} (Eqn 1)
SOO = min{SOOLVL, SOOBS} (Eqn 2)
Let x[k] and y[k] be the raw tempo and level data respectively. Then, in one embodiment of the intro/outro extraction stage, the following computations are made. First the moving minimum mBS[k] of the tempo data and the moving median mLVL[k] of the level data are computed as

mBS[k] = min{x[n]}, n ∈ [k, k+DBS]
mLVL[k] = median{y[n]}, n ∈ [k, k+DLVL]

where DBS and DLVL are the window sizes used to compute the sliding quantities. Typically DBS = 16 and DLVL = 200 ms. Once these quantities are determined, the corresponding candidates for the end of intro (EOI) and start of outro (SOO) are computed:

EOIBS = min{arg{mBS[k] > TBS}}
EOILVL = min{arg{mLVL[k] > TLVL}}

SOOBS = max{arg{mBS[k] > TBS}}
SOOLVL = max{arg{mLVL[k] > TLVL}}

wherein TBS and TLVL are the thresholds corresponding to the tempo and level data, respectively. Finally, the EOI and SOO are computed using Equations 1 and 2. To prevent outstanding spikes from triggering a detection, the moving minimum of the beat strength and the moving median of the level must stay long enough above the threshold. A graphical representation of the above process is shown in Figure 3.
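Putting the pieces together, the candidate extraction and the combination of Equations 1 and 2 could be sketched as below, reusing the helper functions from the earlier sketches. DLVL and the minimum run length are illustrative values expressed in samples of the respective data streams (the text gives DLVL as 200 ms), and the function assumes all four candidates are found.

```python
def intro_outro(tempo_data, level_data, T_BS, T_LVL,
                D_BS=16, D_LVL=10, min_run=8):
    """Compute EOI and SOO from raw tempo (beat-strength) and level data.
    Uses moving_minimum, moving_median, first_sustained_crossing and
    last_sustained_crossing from the sketches above."""
    m_bs = moving_minimum(tempo_data, D_BS)    # suppresses short spikes
    m_lvl = moving_median(level_data, D_LVL)
    eoi_bs = first_sustained_crossing(m_bs, T_BS, min_run)
    eoi_lvl = first_sustained_crossing(m_lvl, T_LVL, min_run)
    soo_bs = last_sustained_crossing(m_bs, T_BS, min_run)
    soo_lvl = last_sustained_crossing(m_lvl, T_LVL, min_run)
    eoi = max(eoi_bs, eoi_lvl)   # Eqn 1
    soo = min(soo_bs, soo_lvl)   # Eqn 2
    return eoi, soo
```

Note that eoi_bs and eoi_lvl index different data streams (beats versus level windows); in practice both would be mapped to a common time axis before taking the maximum and minimum.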
The graphical representation 300 illustrates the variation of the beat strength of an audio content item over time from the start of the track until the end. EOIBS is the point at which the beat strength of the audio content item goes above the threshold TBS and remains there for at least a certain predetermined time interval. Therefore, if the beat strength exceeds the threshold TBS only for a short interval, as shown at point 301, this is not taken as an indication of the end of the intro.
In a similar way the start of the outro SOOBS is located by finding the point at which the beat strength exceeds the threshold TBS for a predetermined time interval, searching from the end of the music track. Since this is located from the end of the track, if at any point, for example point 303, the beat strength goes below the threshold level in the main part of the track, this is not taken as the location for the start of the outro.
The graphical representation 350 illustrates the variation of the signal level of the audio content item over time from the start of the track until the end. EOILVL is the point at which the signal level of the audio content item goes above the threshold TLVL for a predetermined time interval.
In a similar way the start of the outro SOOLVL is located by finding the point at which the signal level exceeds the threshold TLVL for a predetermined time interval, searching from the end of the music track. Therefore, if the signal level exceeds the threshold TLVL only for a short interval, as shown for example at point 351, this is not taken as an indication of the start of the outro.
Using Equations 1 and 2 above, the end of the intro EOI of the track is located, in the example shown in Figure 3, as EOIBS, shown graphically on the time line 370, and the start of the outro SOO of the track is located as SOOLVL, also shown graphically on the time line 370.
Although a preferred embodiment of the present invention has been illustrated in the accompanying drawings and described in the foregoing detailed description, it will be understood that the invention is not limited to the embodiment disclosed, but is capable of numerous modifications without departing from the scope of the invention as set out in the following claims.
Claims
1. A method for determining the intro and/or outro of an audio content item, the method comprising the step of: determining a location at which a beat strength of the audio content item remains above a first threshold level for a minimum time interval to identify the end of the intro and/or start of the outro.
2. A method according to claim 1, wherein the end of the intro is determined as the location at which the beat strength of the audio content item remains, for the first time, above the first threshold level for a minimum time interval.
3. A method according to claim 1 or 2, wherein the start of the outro is determined as the location at which the beat strength of the audio content item remains, for the last time, above the first threshold level for a minimum time interval.
4. A method according to any one of the preceding claims, wherein the beat strength is derived from the moving minimum of the tempo data of the audio content item.
5. A method according to any one of the preceding claims, wherein the method further comprises the step of: determining the location at which the signal level is above a second threshold level for a minimum time interval.
6. A method according to claim 5, wherein the end of the intro is determined as the maximum of the location at which the beat strength of the audio content item remains, for the first time, above the first threshold level for a minimum time interval and the location at which the signal level remains, for the first time, above the second threshold level for a minimum time interval.
7. A method according to claim 5 or 6, wherein the start of the outro is determined as the minimum of the location at which the beat strength of the audio content item remains, for the last time, above the first threshold level for a minimum time interval and the location at which the signal level remains, for the last time, above the second threshold level for a minimum time interval.
8. A method according to any one of claims 5 to 7, wherein the signal level is derived from the moving median of the signal level data of the audio content item.
9. Apparatus for determining the intro and/or outro of an audio content item, the apparatus comprising: means for determining the location at which the beat strength of the audio content item remains above a first threshold level for a minimum time interval to identify the end of the intro and/or start of the outro.
10. Apparatus according to claim 9, wherein the beat strength is derived from the moving minimum of the tempo data of the audio content item.
11. Apparatus according to claim 9 or 10, wherein the apparatus further comprises: means for determining the location at which the signal level is above a second threshold level for a minimum time interval.
12. Apparatus according to claim 11, wherein the signal level is derived from the moving median of the signal level data of the audio content item.
13. A computer program product comprising a plurality of program code portions for carrying out the method according to any one of claims 1 to 8.
14. An automatic DJ system for providing a smooth transition between audio tracks, the system including apparatus according to any one of claims 9 to 12.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05109084 | 2005-09-30 | ||
EP05109084.3 | 2005-09-30 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2007036844A2 true WO2007036844A2 (en) | 2007-04-05 |
WO2007036844A3 WO2007036844A3 (en) | 2007-07-19 |
Family
ID=37852301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2006/053388 WO2007036844A2 (en) | 2005-09-30 | 2006-09-20 | Method and apparatus for automatic structure analysis of audio |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2007036844A2 (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002341888A (en) * | 2001-05-18 | 2002-11-29 | Pioneer Electronic Corp | Beat density detecting device and information reproducing apparatus |
JP3886372B2 (en) * | 2001-12-13 | 2007-02-28 | 松下電器産業株式会社 | Acoustic inflection point extraction apparatus and method, acoustic reproduction apparatus and method, acoustic signal editing apparatus, acoustic inflection point extraction method program recording medium, acoustic reproduction method program recording medium, acoustic signal editing method program recording medium, acoustic inflection point extraction method Program, sound reproduction method program, sound signal editing method program |
- 2006-09-20: WO PCT/IB2006/053388 (WO2007036844A2), active Application Filing
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10101960B2 (en) * | 2015-05-19 | 2018-10-16 | Spotify Ab | System for managing transitions between media content items |
US10599388B2 (en) | 2015-05-19 | 2020-03-24 | Spotify Ab | System for managing transitions between media content items |
US11262974B2 (en) | 2015-05-19 | 2022-03-01 | Spotify Ab | System for managing transitions between media content items |
US11829680B2 (en) | 2015-05-19 | 2023-11-28 | Spotify Ab | System for managing transitions between media content items |
Also Published As
Publication number | Publication date |
---|---|
WO2007036844A3 (en) | 2007-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11461389B2 (en) | Transitions between media content items | |
EP1377959B1 (en) | System and method of bpm determination | |
TWI861142B (en) | Auditory augmentation system and related methods of aligning audio sections in one or more digital media audio tracks with temporally-varying events data to compose a media product | |
US8069036B2 (en) | Method and apparatus for processing audio for playback | |
US8918316B2 (en) | Content identification system | |
CN107203571B (en) | Song lyric information processing method and device | |
Arzt et al. | Fast Identification of Piece and Score Position via Symbolic Fingerprinting. | |
US7214868B2 (en) | Acoustic signal processing apparatus and method, signal recording apparatus and method and program | |
US8543387B2 (en) | Estimating pitch by modeling audio as a weighted mixture of tone models for harmonic structures | |
CN102568454B (en) | A kind of method and apparatus analyzing music BPM | |
CN104978962A (en) | Query by humming method and system | |
US20110067555A1 (en) | Tempo detecting device and tempo detecting program | |
Holzapfel et al. | Beat tracking using group delay based onset detection | |
CN101399035A (en) | Method and equipment for extracting beat from audio file | |
CN102543052B (en) | A kind of method and apparatus analyzing music BPM | |
US7276656B2 (en) | Method for music analysis | |
WO2015170126A1 (en) | Methods, systems and computer program products for identifying commonalities of rhythm between disparate musical tracks and using that information to make music recommendations | |
WO2007036844A2 (en) | Method and apparatus for automatic structure analysis of audio | |
CA2439596C (en) | Method and apparatus for identifying electronic files | |
CN111976329A (en) | Staff automatic following method and automatic tracking system in musical instrument playing | |
Gainza et al. | Tempo detection using a hybrid multiband approach | |
US7251597B2 (en) | Method for tracking a pitch signal | |
WO2007036846A2 (en) | Method and apparatus for automatic structure analysis of music | |
Bhatta et al. | Laya Estimation for Hindustani Classical Vocals, Devoid of Rhythmic Indicators | |
Gärtner | Tempo estimation from urban music using non-negative matrix factorization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 06821122; Country of ref document: EP; Kind code of ref document: A2 |