WO2007020544A2 - Method and apparatus for extracting feature information from a multimedia file - Google Patents
- Publication number
- WO2007020544A2 (PCT/IB2006/052588)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- multimedia file
- event
- analysis window
- data
- occurrence
- Prior art date
Classifications
- G06F16/40 — Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/634 — Querying; query formulation; query by example, e.g. query by humming
- G06F16/639 — Querying; presentation of query results using playlists
- G06F16/683 — Retrieval characterised by using metadata automatically derived from the content
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Feature information from a multimedia file is extracted from an analysis window. To ensure that feature information is extracted from a relevant portion of the file, the location of an analysis window is determined according to the occurrence of an event, steps 101 to 117.
Description
Method and apparatus for extracting feature information from a multimedia file
TECHNICAL FIELD
The present invention relates to a method and apparatus for extracting feature information from a multimedia file. The feature information extracted may be used to classify the multimedia file. In particular, but not exclusively, it relates to identifying audio files (songs) from a large collection of songs which are similar to a seed song to assist a user in compiling playlists that are similar to a given song.
BACKGROUND OF THE INVENTION
Many systems exist that assist users in compiling playlists. Many such systems compare all available songs to a seed song. To achieve this, all candidate songs in the collection are analysed and classified in accordance with a number of extracted features, which are stored in a database. The corresponding features of a seed song are extracted and compared to those stored in the database. The matching features identified in the database point to the matching/similar candidate songs in the collection and, if desired, these can be added to the playlist.
The features of the audio files (songs) are extracted by means of a feature extraction algorithm. These algorithms are generally very expensive in terms of required computational power, especially when integrated in consumer devices.
Invariably, these known algorithms analyse the whole audio file, extracting features (e.g. MFCC coefficients) from frames or chunks of data at a time. The extraction process is very time-consuming because complex operations must be performed for every frame. These algorithms process the audio file from beginning to end, extracting a feature vector in respect of each frame (known as a local feature vector). These local feature vectors are collated for the whole file and averaged; in this way, the average feature vector represents the song being analysed. The average feature vector f for M features has the form
f = [f1, f2, ..., fM]
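This frame-wise averaging can be sketched as follows (a minimal illustration; the extractor producing the local vectors is assumed, since the text does not prescribe one):

```python
def average_feature_vector(local_vectors):
    """Average per-frame (local) feature vectors into the single average
    feature vector f = [f1, ..., fM] that represents the whole song.

    `local_vectors` is assumed to be a non-empty list of equal-length
    per-frame vectors produced by some feature extractor (e.g. MFCCs).
    """
    n, m = len(local_vectors), len(local_vectors[0])
    # Element-wise mean over all frames for each of the M features.
    return [sum(vec[k] for vec in local_vectors) / n for k in range(m)]
```

For instance, averaging the local vectors [1, 3] and [3, 5] yields [2.0, 4.0].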
With a well-chosen set of features, each music track or audio file can be classified as belonging to a given music genre. The distance between two such average feature vectors is an indication of how similar the corresponding songs are. To be more specific, let f_i and f_j be the average feature vectors corresponding to the i-th and j-th song, respectively, and let N be the total number of items in the database. Given the M×M data covariance matrix C defined as having components
the distance between the i-th and the j-th song is given by
A distance of zero means that both songs are equal (actually, that they have the same average features); a small distance indicates that they are similar songs (actually, that they have similar average features) whereas a large distance indicates that the songs are not related.
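The covariance and distance equations themselves are not reproduced in this text. A standard covariance-weighted (Mahalanobis) formulation consistent with the surrounding description is sketched below; treating it as the patent's exact formula is an assumption:

```python
import numpy as np

def song_distance(f_i, f_j, database):
    """Covariance-weighted distance between two average feature vectors.

    `database` is the N x M matrix of all songs' average feature vectors;
    its M x M covariance C whitens the feature space so that no single
    feature dominates the distance. (This Mahalanobis form is assumed for
    illustration; the patent's equations are not reproduced above.)
    """
    C = np.cov(np.asarray(database), rowvar=False)   # M x M data covariance
    diff = np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(C) @ diff))
```

As the text states, identical average feature vectors give a distance of zero, and the measure is symmetric in the two songs.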
Furthermore, it has been recognised that songs belonging to a given genre will have local feature vectors that are normally distributed around the average feature vector. Therefore, there is no need for extracting the local feature vectors from the whole song but from a representative part of the song. However, it is desirable to extract local feature vectors from relevant parts of the song and not from irrelevant parts such as silence or background noise since these will lower the average values.
The problem here is to choose the appropriate part of the song, which is representative for the song as a whole.
Current implementations of extraction algorithms take one of a few approaches: they analyse the whole song, they analyse a fixed portion of audio from the middle of the song, or they analyse a fixed portion of audio after skipping a fixed portion. Of these, the first is very inefficient but has the largest probability of representing the song well in its average. However, including the intro and outro of the song, plus potential zero-valued samples, can lead to average feature vectors that are not the best choice for the given song.
The other two approaches tend to solve the above problem by choosing a region in the middle of the song or after a given time. However, they may fail in that they do not take the actual music content into consideration.
Furthermore, many audio files are stored in a compressed data format, for example MPEG-1 (MP3), MPEG-2 or MPEG-4 AAC, to maximise the available storage capacity of portable devices used for playback. Existing extraction algorithms have to decode the data of these files before extracting the feature information, which requires additional computational resources.
SUMMARY OF THE INVENTION
The present invention aims to reduce the amount of processing required to extract feature information from a multimedia file. This is achieved according to an aspect of the present invention by a method for extracting feature information from a multimedia file, the method comprising the steps of: determining the location of an analysis window of the multimedia file in accordance with occurrence of an event within the multimedia file; and extracting the feature information from data within the analysis window. Since the occurrence of an event is determined, such as, for example, the maximum energy of the data of the multimedia file or the first occurrence of the signal amplitude of the data exceeding a predetermined threshold, the chosen region of the file for analysis is more likely to comprise a relevant portion which takes the content of the file into consideration. Extracting the feature information in this way produces an average feature vector which is representative of the file and prevents analysis from taking place in regions with low amplitude.
In the case of compressed data format in which the multimedia file comprises a plurality of frames, each frame having a global gain associated therewith, the event is the maximum global gain. In utilising the existing global gain values of the frames, the creation of the analysis window can be easily established by merely parsing the header of the frames and reading the global gain values without decoding the whole file which speeds up the extraction of the feature information and reduces the computational resources required.
BRIEF DESCRIPTION OF DRAWINGS
For a more complete understanding of the present invention and by way of example, reference is made to the following description taken in conjunction with the accompanying drawings, in which: Figure 1a is a flow diagram of a first embodiment of the present invention;
Figure 1b is a flow diagram of a second embodiment of the present invention;
Figure 2 is a schematic diagram of an example of frame structure of compressed data of an audio file;
Figure 3 illustrates determination of an analysis window according to an embodiment of the present invention; and
Figure 4 is a schematic diagram of apparatus according to a further embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
A method according to an embodiment of the present invention will now be described with reference to Fig. 1a. At step 101, the duration of an input multimedia file, t0, is tested. If the input multimedia file is shorter than a predetermined time interval, t1, for example 15 seconds, an error is generated, step 103, since the file is considered too short for feature extraction. If the input file is longer than t1, the method proceeds to step 105 for feature extraction.
In step 105, the duration of the input multimedia file, t0, is tested again. If the duration of the input file, t0, is less than the predetermined duration of an analysis window, t2, for example 90 seconds, the entire input file becomes the analysis window, step 107. If the duration of the input file is greater than the predetermined duration of an analysis window, t2, the input file is scanned at step 109 for the maximum signal amplitude or acoustic energy level of the acoustic signal of the data contained in the multimedia file. The location a1 of the maximum signal amplitude, energy level, etc., of the acoustic signal is determined at step 111.
The input file is then scanned again, step 113, to establish the extreme of the analysis window, a2. This is determined either as a time interval t2 subsequent to a1, that is, a2 = a1 + t2, step 115, or, alternatively, as the point when the signal amplitude or power of the acoustic signal of the multimedia file first reaches a level 6 dB below the maximum, step 125.
The location of the analysis window is then determined at step 117. The analysis window may be located between a1 and a2 or, alternatively, centred at a1.
Following this, the feature vectors are extracted from the data within the analysis window, step 119. The extracted feature vectors are averaged, step 121, and stored, step 123, in a feature database for later reference.
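Steps 101 to 117 of Fig. 1a can be sketched as follows. The default values of t1 and t2 (15 and 90 seconds) follow the description; the function name and sample-rate handling are illustrative assumptions:

```python
def locate_analysis_window(signal, sr, t1=15.0, t2=90.0):
    """Return (start, end) sample indices of the analysis window (Fig. 1a).

    t1: minimum file duration in seconds (step 101); t2: analysis-window
    duration in seconds (step 105). The window starts at the sample of
    maximum absolute amplitude, a1 (steps 109-111), and ends a time
    interval t2 later, a2 = a1 + t2 (step 115).
    """
    n = len(signal)
    if n < t1 * sr:
        raise ValueError("file too short for feature extraction (step 103)")
    if n <= t2 * sr:
        return 0, n                      # whole file is the window (step 107)
    a1 = max(range(n), key=lambda i: abs(signal[i]))  # peak location
    a2 = min(n, a1 + int(t2 * sr))       # clamp to end of file
    return a1, a2
```

Extraction and averaging (steps 119 to 121) would then be applied only to `signal[start:end]`.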
Fig. 1b illustrates an alternative, preferred embodiment of the present invention. The method according to this embodiment follows the steps in respect of the first embodiment of Fig. 1a, except that step 111 of Fig. 1a is replaced by steps 111-1, 111-2 and 111-3 shown in Fig. 1b.
In the case of a compressed audio file, for example the MP3 format shown in Fig. 2, the absolute value of the maximum signal amplitude or power cannot be determined without decoding the file. To overcome this, the global gain value of every granule of the left and right channels in each frame is read, step 111-1 of Fig. 1b. As illustrated in Fig. 2, a compressed MP3 audio file comprises a plurality of frames 200_1 to 200_n (only frames 200_1 to 200_4 are shown in Fig. 2). Each frame comprises a header portion 201_1 to 201_n, an error-check portion 203_1 to 203_n, a side-information portion 205_1 to 205_n, and a main-data portion 207_1 to 207_n. The side-information portions 205_1 to 205_n comprise a plurality of granules of the left and right channels 209L0_1, 209L1_1, 209R0_1 and 209R1_1. Each granule contains a global gain value.
To avoid decoding the entire frame, the global gain values provided in each frame are read, step 111-1. The global gain values are then filtered, step 111-2, for example using a moving average filter with a depth of 100 granules. The location of the maximum filtered global gain value is determined to provide a1 of the analysis window, step 111-3.
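Steps 111-1 to 111-3 can be sketched as follows, assuming the per-granule global gain values have already been parsed from the side information (the parsing itself is format-specific and omitted here):

```python
def locate_window_start_from_gains(global_gains, depth=100):
    """Find a1: the granule index of the maximum smoothed global gain.

    `global_gains` holds the per-granule global gain values read from the
    MP3 side information (step 111-1). A moving-average filter of the
    given depth smooths out isolated peaks (step 111-2) before the
    maximum is located (step 111-3).
    """
    n = len(global_gains)
    if n < depth:
        raise ValueError("need at least `depth` granules")
    # Slide a summing window; dividing by depth would not change argmax.
    window_sum = sum(global_gains[:depth])
    best_sum, best_start = window_sum, 0
    for start in range(1, n - depth + 1):
        window_sum += global_gains[start + depth - 1] - global_gains[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start + depth // 2        # centre of the best window
```

The filtering matters: a single loud granule is suppressed, so a1 lands in a sustained loud passage rather than on a transient.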
Fig. 3 illustrates a plot of the amplitude of the acoustic signal of an uncompressed audio file over time. In this embodiment, a1 of the analysis window is determined as the first occurrence of the signal amplitude exceeding a predetermined threshold value S_T. The analysis window W is then determined as starting at a1 and lasting a subsequent time interval t2, say 90 seconds.
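A minimal sketch of this threshold variant (the threshold value S_T and the sample rate are assumptions, not values given in the text):

```python
def window_from_threshold(signal, sr, s_t=0.1, t2=90.0):
    """Start the analysis window W at the first sample whose absolute
    amplitude exceeds the threshold S_T, lasting t2 seconds (Fig. 3)."""
    for i, x in enumerate(signal):
        if abs(x) > s_t:                  # first threshold crossing -> a1
            return i, min(len(signal), i + int(t2 * sr))
    raise ValueError("no sample exceeds the threshold S_T")
```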
The apparatus according to an embodiment of the present invention will now be described with reference to Fig. 4. The apparatus 400 comprises a pre-processor 401 connected to an input terminal 402 of the apparatus 400. The pre-processor 401 is connected to a processor 403. The output of the processor 403 is connected to a comparator 409 and to a feature database 405. The outputs of the comparator 409 and the feature database 405 are connected to a multimedia file store 407. The output of the file store 407 is connected to the output terminal 408 of the apparatus 400.
In use, a multimedia file is input on the input terminal 402. The pre-processor 401 scans the input file to determine the location of an analysis window according to steps 101 to 117 (and 125) of Fig. 1a. The input file and the location of the analysis window are fed to the processor 403. The processor 403 executes the feature extraction algorithm in accordance with steps 119 to 123 of Fig. 1a. The feature vectors extracted from the data within the analysis window are averaged and stored in the feature database 405. The multimedia file is stored in the file store 407.
Upon input of a seed song (or multimedia file), the feature vectors are extracted and averaged as described above and are fed to the input of the comparator 409, whereupon the feature vectors of the seed song and the candidate songs stored in the feature database 405 are compared. The comparator 409 determines the distance between the seed song and each candidate song. The candidate songs considered "similar" to the seed song are then selected from the file store 407 and placed on the output terminal 408 of the apparatus 400 to be forwarded to a user interface device or playlist generator (not shown here) for consideration by the user.
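The comparator's ranking role can be sketched as follows; a plain Euclidean distance stands in here for the covariance-weighted measure described earlier, and the `top_k` parameter is an illustrative assumption:

```python
import math

def similar_songs(seed, database, top_k=5):
    """Rank candidate songs by the distance of their average feature
    vectors to the seed's; smaller distance means more similar.

    `database` is a list of average feature vectors (one per candidate).
    Returns the indices of the top_k nearest candidates.
    """
    def dist(f):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(seed, f)))
    order = sorted(range(len(database)), key=lambda i: dist(database[i]))
    return order[:top_k]
```

The returned indices would be used to pull the corresponding files from the file store 407 for the playlist.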
Although preferred embodiments of the present invention have been illustrated in the accompanying drawings and described in the foregoing detailed description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous modifications without departing from the scope of the invention as set out in the following claims.
Claims
1. A method for extracting feature information from a multimedia file, the method comprising the steps of: determining the location of an analysis window of the multimedia file in accordance with occurrence of an event within the multimedia file; and extracting the feature information from data within the analysis window.
2. A method according to claim 1, wherein the occurrence of the event determines the starting point of the analysis window.
3. A method according to claim 1, wherein the occurrence of the event determines the centre of the analysis window.
4. A method according to any one of the preceding claims, wherein the event is the maximum energy of the data of the multimedia file.
5. A method according to any one of claims 1 to 3, wherein the event is the first occurrence of the signal amplitude of the data of the multimedia file exceeding a predetermined threshold.
6. A method according to any one of claims 1 to 4, wherein the multimedia file comprises compressed data, the compressed data comprising a plurality of frames, each frame having a global gain associated therewith, and the event is the maximum global gain.
7. A method according to any one of the preceding claims, wherein the step of extracting feature information includes the steps of: extracting a plurality of feature vectors from data within the analysis window; and averaging the plurality of feature vectors.
8. A method according to any one of the preceding claims, wherein the duration of the analysis window comprises a predetermined time interval.
9. A method according to any one of claims 1 to 7, wherein the duration of the analysis window is determined on the basis of a change in the event.
10. Apparatus for extracting feature information from a multimedia file comprising: a preprocessor for determining the location of an analysis window in the multimedia file in accordance with occurrence of an event within the multimedia file; and a processor for extracting feature information from data within the analysis window.
11. Apparatus according to claim 10, wherein the preprocessor further comprises scanning means for scanning the multimedia file to determine the occurrence of the event.
12. Apparatus according to claim 10 or 11, wherein the event is the maximum energy of the data of the multimedia file.
13. Apparatus according to claim 10 or 11, wherein the event is the first occurrence of the signal amplitude of the data of the multimedia file exceeding a predetermined threshold.
14. Apparatus according to any one of claims 10 to 12, wherein the multimedia file comprises compressed data, the compressed data comprising a plurality of frames, each frame having a global gain associated therewith and the event is the maximum global gain.
15. Apparatus according to claim 14, wherein the apparatus further comprises means for reading the global gain values for each frame; and a moving average filter for filtering the read global gain values to determine the maximum global gain value.
16. Apparatus according to any one of claims 10 to 15, wherein the processor comprises: extraction means for extracting the feature vectors from the data of the analysis window of the multimedia file; and averaging means for averaging the extracted feature vectors.
17. A computer program product comprising a plurality of program code portions for carrying out the method according to any one of claims 1 to 9.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05107419.3 | 2005-08-12 | ||
EP05107419 | 2005-08-12 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2007020544A2 true WO2007020544A2 (en) | 2007-02-22 |
WO2007020544A3 WO2007020544A3 (en) | 2007-05-31 |
Family
ID=37668113
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2006/052588 WO2007020544A2 (en) | 2005-08-12 | 2006-07-28 | Method and apparatus for extracting feature information from a multimedia file |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2007020544A2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3575989A4 (en) * | 2017-02-28 | 2020-01-15 | Samsung Electronics Co., Ltd. | METHOD AND DEVICE FOR PROCESSING MULTIMEDIA DATA |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0529786A2 (en) * | 1991-08-30 | 1993-03-03 | Loral Aerospace Corporation | Apparatus and method for detecting vibration patterns |
- 2006-07-28: PCT/IB2006/052588 filed as WO2007020544A2 (active Application Filing)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0529786A2 (en) * | 1991-08-30 | 1993-03-03 | Loral Aerospace Corporation | Apparatus and method for detecting vibration patterns |
Non-Patent Citations (4)
Title |
---|
ANON.: "Speech recognition with hidden markov models of speech waveforms" IBM TECHNICAL DISCLOSURE BULLETIN, vol. 34, no. 1, June 1991 (1991-06), pages 7-16, XP000210093 Armonk, NY, USA * |
LU, L. ET AL.: "Content analysis for audio classification and segmentation" IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, vol. 10, no. 7, October 2002 (2002-10), pages 504-516, XP002417102 USA * |
WOLD, E. ET AL.: "Content-based classification, search, and retrieval of audio" IEEE MULTIMEDIA, vol. 3, no. 3, 1996, pages 27-36, XP002417103 * |
XU, C. ET AL.: "Automatic music summarization based on temporal, spectral and cepstral features" PROC. 2002 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, vol. 1, 2002, pages 117-120, XP010604320 Piscataway NJ, USA * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3575989A4 (en) * | 2017-02-28 | 2020-01-15 | Samsung Electronics Co., Ltd. | METHOD AND DEVICE FOR PROCESSING MULTIMEDIA DATA |
US10819884B2 (en) | 2017-02-28 | 2020-10-27 | Samsung Electronics Co., Ltd. | Method and device for processing multimedia data |
Also Published As
Publication number | Publication date |
---|---|
WO2007020544A3 (en) | 2007-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10210884B2 (en) | Systems and methods facilitating selective removal of content from a mixed audio recording | |
CN100472515C (en) | System for managing audio information | |
US6990453B2 (en) | System and methods for recognizing sound and music signals in high noise and distortion | |
US20060155399A1 (en) | Method and system for generating acoustic fingerprints | |
US20050249080A1 (en) | Method and system for harvesting a media stream | |
US20060149533A1 (en) | Methods and Apparatus for Identifying Media Objects | |
KR100676863B1 (en) | System and method for providing music search service | |
WO2005122141A1 (en) | Effective audio segmentation and classification | |
Cotton et al. | Soundtrack classification by transient events | |
US20080235267A1 (en) | Method and Apparatus For Automatically Generating a Playlist By Segmental Feature Comparison | |
WO2006132596A1 (en) | Method and apparatus for audio clip classification | |
JP2005532763A (en) | How to segment compressed video | |
JP3757719B2 (en) | Acoustic data analysis method and apparatus | |
US8543228B2 (en) | Coded domain audio analysis | |
CN102214219B (en) | Audio/video content retrieval system and method | |
US7680654B2 (en) | Apparatus and method for segmentation of audio data into meta patterns | |
Kim et al. | Quick audio retrieval using multiple feature vectors | |
CN103294696A (en) | Audio and video content retrieval method and system | |
US7985915B2 (en) | Musical piece matching judging device, musical piece recording device, musical piece matching judging method, musical piece recording method, musical piece matching judging program, and musical piece recording program | |
US8341161B2 (en) | Index database creating apparatus and index database retrieving apparatus | |
WO2007020544A2 (en) | Method and apparatus for extracting feature information from a multimedia file | |
KR101002732B1 (en) | Online Digital Content Management System | |
KR100869643B1 (en) | Summary device, method, and program for realizing MP3 type of flexible sound using music structure | |
Petridis et al. | A multi-class method for detecting audio events in news broadcasts | |
KR101002731B1 (en) | Feature vector extraction method of audio data, computer readable recording medium recording the method and matching method of audio data using same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 06780235; Country of ref document: EP; Kind code of ref document: A2 |