SYSTEM FOR ONLINE RULE-BASED VIDEO CLASSIFICATION
STATEMENT OF GOVERNMENT RIGHTS This invention is related to work performed in contract with the U.S. Government under DARPA ITO Contract #N66001-97-2-8901, and the U.S. Government may have certain rights in this invention.
PRIORITY CLAIM This application claims the benefit of priority to provisional application number 60/223,555, filed in the United States on August 5, 2000, and titled "System for Online Rule-Based Video
Classification" and utility application number 09/708,272 filed in the United States on November 7, 2000, and titled "System for Online Rule-Based Video Classification".
TECHNICAL FIELD The present invention relates to a video classification method and apparatus and, more particularly, to a means for categorizing video streams into a fixed set of classes.
BACKGROUND
Automated or semi-automated understanding of multimedia data is critical in navigating the plethora of data that is available. This need continues to grow as multimedia data become increasingly prevalent, and it is most acute in areas such as multimedia database retrieval, data filtering, and content-based navigation. Video continues to distinguish itself as an indispensable data type. It follows that video classification, allowing for quick and easy retrieval, is essential in many applications, ranging from database searching to target recognition. Most work in the area of video classification and database retrieval has focused on query-by-example systems with standard similarity metrics directly applied to low-level features. One problem with this approach is that combining multiple low-level features to classify video into semantic categories is difficult because generic classification models are difficult to establish.
Most of the research on video classification has used a model for each specific type of video. Based on the model, classification is conducted using heuristic rules. Because the nature of video content varies, it is hard to establish full models for all different kinds of video, and thus model-based video classification cannot be applied to arbitrary real-time applications. Previous research focused mostly on off-line video classification for "pull" applications, where there is normally enough time to extract video low-level features and perform preprocessing to find patterns for each video. In this approach, the video classifier has no indication of which low-level video feature is most associated with a given video category, or which feature has the most differentiating power across multiple video categories, i.e., what weight, priority, or order should be applied in video classification.
SUMMARY OF INVENTION
This invention discloses a video classification system that relies on supervised learning (a rule-based decision-tree classifier) to categorize video streams into a fixed set of classes. Video classification is important in many different applications, including defense applications where automatic scene understanding is important and useful in areas ranging from Unmanned Aerial Vehicle (UAV) video surveillance monitoring to special missions.
In the case of commercial applications such as satellite-based media, this invention can be used to enable innovative new services offered by satellite-based media systems, such as broadband-based access and novel interactive TV applications.
In one embodiment, the invention includes an apparatus, such as a computer, which is configured to process video programming or footage. The processing sequence includes the assignment of semantic designations to a predetermined number of events. For instance, the processor might begin by identifying a number of general characteristics unique to a basketball game. Once having made this threshold determination as to the general nature of a sequence of footage, the processor further processes the footage by breaking it down into a plurality of footage segments. These segments might correspond to "cuts" or breaks in the footage, as might occur when video editors switch from one camera to another camera, or where the camera pans. By utilizing low-level features extracted from the footage segments, and applying the features to one or more if-then-else rules, each segment can be classified into at least one semantic category. For instance, in the case of a basketball game, the processor might extract a segment and, based on the color, motion, and other factors, determine that the segment shows team A scoring. The processor would then assign at least one semantic classification to that segment of footage. In many situations, more than one classification might be appropriate. For instance, three possible classifications that might be appropriate to a single segment of footage are: team A scores, the score, and a three-point score tally.
The invention may be utilized to classify any type of video; therefore, the invention could be used to extract information related to news programming, weather forecasting, financial news, or other programming having distinguishable characteristics. The invention also allows a user, or operator, to specify the general characteristics of a footage sequence. Thus, the apparatus allows an external source to input the general nature of footage.
The low-level features, or primitives, that are extracted and used to semantically classify data include color information such as the color histogram, dominant color, and regional color. Additionally, a gradient edge operator may be utilized to detect edges along different directions, such as the horizontal, vertical, diagonal, and cross-diagonal axes. Edge densities along different directions and having different lengths may be calculated as edge features.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 shows a forward prediction codec for MPEG-1 video;
FIGs. 2a and 2b show one technique for motion direction extraction;
FIG. 3 depicts various key frames of basketball scenes;
FIG. 4 shows a flow chart for key-frame classification and clustering;
FIG. 5 shows a knowledge based video classification system;
FIG. 6 shows a rule tree for video classification;
FIG. 7 shows a decision tree and rule tree for basketball video classification;
Table 1 represents the results of basketball video classification from the rule-based system; and
FIG. 8 depicts a flow chart for online video filtering based on a user's profile.
DETAILED DESCRIPTION
The present invention provides a method and an apparatus useful for rule-based video classification, thereby enhancing data categorization. The present invention may be tailored to a variety of other applications. The following description, taken in conjunction with the referenced drawings, is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications, will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. Furthermore, it should be noted that, unless explicitly stated otherwise, the figures included herein are illustrated diagrammatically and without any specific scale, with the express clarification that, as the figures are merely examples, no restrictive or exclusive character should be assigned to them, their purpose being merely illustrative of the fundamental concept on which the invention is based.
This specification discloses a video classification apparatus that relies on supervised learning (rule-based decision-tree classifier) to categorize video streams into a fixed set of classes. The system of the present invention utilizes the decision-tree method described herein to select a set of if-then-else rules for video classification which can be directly applied to a set of matching functions for low level features such as, but not limited to, color, motion, and texture that are identifiable in video data.
In embodiments of the present invention, an unsupervised learning method can be utilized to cluster the low-level features of video data in order to simplify the representation of the features for further analysis. Motion features, for example, provide effective cues for video analysis, as they are an integral part of a motion sequence. In addition, motion is typically already calculated in most video codecs, and motion compensation information is present in the compressed data stream. In MPEG-1 video, for example, as shown in FIG. 1, there are three types of frames: the I frame 100, the B frame 102, and the P frame 104. In one embodiment of the present invention, in order to find the motion patterns for certain video, the direction and magnitude of the video sequence motion flows 106 of only the P frames 104 were evaluated. P frames 104 provide only forward prediction, which permits calculation of the direction of the "flow" of video data.
In the aforementioned embodiments of the present invention, statistical motion descriptors can be utilized to provide macro information about the clip. These statistical descriptors can include, but are not limited to, dominant motion direction and motion magnitude. Use of such descriptors requires significantly less time and computational complexity as compared to the extraction of object motion. Extraction of motion direction is depicted for one embodiment of the present invention in FIGs. 2a and 2b. Each motion vector in the video data, shown in FIG. 2a as regions 1 through 12 200, can be clustered and classified. In the embodiment presented in FIG. 2b, vectors in regions 1 and 8 202 are termed RIGHT; vectors in regions 2 and 3 204, UP; vectors in regions 4 and 5 206, LEFT; and vectors in regions 6 and 7 208, DOWN. The amount of motion along each direction can be evaluated by counting the total number of vectors along that direction in each class for the whole video sequence. This results in a four-dimensional motion direction vector.
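By way of illustration only, the following Python sketch quantizes motion vectors into the four named directions and accumulates the four-dimensional direction vector. The ±45-degree region boundaries and the function names are assumptions made for the sketch; the patent specifies only the RIGHT/UP/LEFT/DOWN groupings of the regions in FIG. 2b.

import math
from collections import Counter

def quantize_direction(dx, dy):
    """Map a motion vector (dx, dy) to RIGHT, UP, LEFT, or DOWN."""
    angle = math.degrees(math.atan2(dy, dx))  # -180..180 degrees, 0 = right
    if -45 <= angle < 45:
        return "RIGHT"
    if 45 <= angle < 135:
        return "UP"
    if angle >= 135 or angle < -135:
        return "LEFT"
    return "DOWN"                             # -135 <= angle < -45

def motion_direction_vector(vectors):
    """Count vectors along each direction -> 4-D motion direction vector."""
    counts = Counter(quantize_direction(dx, dy) for dx, dy in vectors)
    return [counts[d] for d in ("RIGHT", "UP", "LEFT", "DOWN")]

# Example: a clip dominated by rightward motion.
print(motion_direction_vector([(5, 1), (4, -2), (0, 6), (7, 0)]))  # [3, 1, 0, 0]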
Normally, the magnitude of the motion is also encoded in the motion vector. In order to obtain magnitude and speed descriptors for the motion along both the X and Y directions, the average motion magnitude of the whole frame can be calculated according to the following equation:
x_ave = ( Σ_{i=0}^{n} x(i) ) / n ,    y_ave = ( Σ_{j=0}^{m} y(j) ) / m    (1)
where n and m are the total number of motion vectors with respect to the X direction and Y direction, respectively, in a frame of video data.
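A minimal sketch of equation (1) follows, under the assumption that x(i) and y(j) are the per-vector magnitudes of the X and Y components and that every vector contributes to both counts (n = m); the patent leaves these details to the reconstruction:

def average_motion(vectors):
    """Per-frame average motion magnitude along X and Y (equation (1))."""
    if not vectors:
        return 0.0, 0.0
    x_ave = sum(abs(dx) for dx, _ in vectors) / len(vectors)
    y_ave = sum(abs(dy) for _, dy in vectors) / len(vectors)
    return x_ave, y_ave

print(average_motion([(5, 1), (4, -2), (0, 6)]))  # (3.0, 3.0)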
In addition to motion, color and edge information may also be utilized in video classification according to the present invention. In embodiments of the present invention utilizing these characteristics for classification, video data can be divided into key image categories. An example of such categorization is provided in FIG. 3. In this example, video from a basketball game is divided into key frames, specifically left court 300, right court 302, middle court 304, and close-up images 306. Color information including, but not necessarily limited to, the color histogram, dominant color, and regional color may be extracted from the video data to develop the categorization. In addition, a gradient edge operator may be utilized to detect edges along different directions including, but not necessarily limited to, horizontal, vertical, diagonal, or cross-diagonal. Edge densities along different directions and with different lengths may then be calculated as edge features. In addition, the edge information can be analyzed along one or more directions by the distribution of visible edges and by clustering into horizontal, vertical, and/or other category edge types. A flow chart for key-frame clustering as might be applied to basketball game video data is shown in FIG. 4. According to this flow chart, data in the form of a frame or image 400 is input into an apparatus. The data is then exposed to elements configured to extract edge features 402 and color features 404. Thereafter, the data is sent to a clustering or classification step 406; the clusters are then assigned to their respective categories 408. It is anticipated that either additional or fewer categories may be appropriate depending on the nature of the data.
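As a hedged illustration of the FIG. 4 pipeline, the sketch below extracts a coarse color histogram and simple gradient-based edge densities from each key frame and then clusters the frames. The bin count, the gradient threshold, and the choice of k-means are assumptions for the sketch; the patent does not fix a particular clustering algorithm.

import numpy as np
from sklearn.cluster import KMeans

def color_histogram(frame, bins=8):
    """Normalized per-channel color histogram of an H x W x 3 frame."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(frame.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def edge_densities(gray, threshold=20.0):
    """Fraction of pixels with strong horizontal / vertical gradients."""
    gy, gx = np.gradient(gray.astype(float))
    return np.array([np.mean(np.abs(gx) > threshold),   # vertical edges
                     np.mean(np.abs(gy) > threshold)])  # horizontal edges

def cluster_key_frames(frames, n_clusters=4):
    """Steps 402-408 of FIG. 4: features -> clustering -> category labels."""
    feats = [np.concatenate([color_histogram(f),
                             edge_densities(f.mean(axis=-1))])
             for f in frames]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))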
As noted above, the present invention utilizes decision-tree learning in order to develop an effective video classification system. Decision-tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned decision trees are constructed by first evaluating which attribute should be tested at the root of the tree. Next, the most useful attribute for classifying examples must be identified, along with a quantitative measurement of its value to the classification. The ultimate purpose is to develop a quantitative measurement of how well a given attribute separates the training examples according to their target classification.
With regard to video data classification, these attributes can be in the form of the low level video data features discussed in the preceding paragraphs. A broad variety of features are used as indexing keys to query video/image databases, but each feature is not always the best selection
for all queries under different semantic environments. In order to achieve efficient feature indexing and classification, the effectiveness of a particular feature for identifying the proper classification must be properly assessed.
In the decision-tree learning method, each potential attribute is statistically evaluated to determine how well it alone classifies the training data. The best attribute for achieving classification is then selected and used as a test at the root node of the decision tree. Branches from the root node are then created corresponding to the possible values of the chosen attribute, or descendant nodes, and the training data are sorted to the appropriate descendant nodes. The entire process is then repeated such that the possible attributes are statistically evaluated for relevance at each of the descendant nodes.
Evaluation of attributes is based on a statistical value known as information gain. The information gain for a given attribute is the expected reduction in entropy resulting from classification of a data set according to the attribute. The entropy itself of a given collection of data S wherein S can be classified into two target concepts is defined according to equation 2:
Entropy(S) = -p_1 log2 p_1 - p_2 log2 p_2    (2)

wherein p_1 is the proportion of S belonging to the first category and p_2 is the proportion of S belonging to the second category of target concepts.
More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined according to equation 3:
Entropy(S) = Σ_{i=1}^{c} -p_i log2 p_i    (3)

where p_i is the proportion of S belonging to category i. As the target attribute can take on c possible values, the entropy can be as large as log2 c. Entropy is equal to 0 (its minimum) when all the examples in a set of data belong to the same class, and entropy reaches its maximum (log2 c, which equals 1 in the two-class case) when each class is equally distributed in the given set of data.
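For concreteness, a direct Python rendering of equations (2) and (3) can be sketched as follows:

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of class labels, per equation (3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

print(entropy(["a", "a", "b", "b"]))  # 1.0: two equally distributed classes
print(entropy(["a", "a", "a"]))       # 0.0: all examples in the same class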
The value of the information gain, Gain(S, A), for a particular attribute A and the same collection of data S can thus be calculated according to equation 4:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} ( |S_v| / |S| ) Entropy(S_v)    (4)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}). Note that the first term in equation (4) is just the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A.
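Reusing the entropy() helper sketched above, equation (4) can be rendered as follows; the dictionary-based example representation is an assumption made for the sketch:

def information_gain(examples, attribute, label_key="label"):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    labels = [e[label_key] for e in examples]
    n = len(examples)
    remainder = 0.0
    for v in {e[attribute] for e in examples}:            # Values(A)
        subset = [e[label_key] for e in examples if e[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)  # weighted entropy
    return entropy(labels) - remainder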
The information gain value, as discussed above, is utilized for each attribute to identify the best attribute to serve as a test at the root node of the tree. Descendant nodes are created based on the possible values for the selected attribute, and the process then repeats itself at each of these descendant nodes. Once an attribute is identified for a node, it is no longer evaluated for subsequent nodes in the tree. The process of evaluating attributes and generating additional descendants continues until either of two conditions is met: (1) every attribute has already been included along this path through the tree, or (2) the training examples associated with a given "leaf" node all have the same target attribute value (i.e., their entropy is zero).
The development of the tree depends upon the split criterion and the stop criterion for the generation of nodes. A good tree should have few levels, as it is better to classify with as few decisions as possible. In addition, a good tree should have a large leaf population, as it is better to classify with as many cases as possible. In the learning algorithm, the training data is split upon the variable that maximizes Gain(S, A). Here the "gain" value is based only on the class distribution, which makes the computation easy to perform. Further, the entropy gain measure does not take popularity into consideration. If the stop criterion is chosen with entropy equal to 0 (the same class for all cases), over-fitting will result, and undesirably deep trees with few cases on the leaf nodes will be created. So, it is preferable to choose the stop criterion either as a minimum popularity allowed for the leaves, as a certain entropy value to be reached, or, most preferably, as a combination of these two conditions. Typically, the technique will over-fit the tree first and then proceed with "pruning".
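An illustrative ID3-style builder incorporating this combined stop criterion (a minimum leaf population together with an entropy floor) might look like the sketch below. The threshold values are assumptions, and the entropy() and information_gain() helpers are those sketched earlier:

from collections import Counter

def build_tree(examples, attributes, min_leaf=5, min_entropy=0.2,
               label_key="label"):
    """Grow a decision tree, stopping on small or near-pure nodes."""
    labels = [e[label_key] for e in examples]
    if (len(examples) <= min_leaf or entropy(labels) <= min_entropy
            or not attributes):
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    best = max(attributes, key=lambda a: information_gain(examples, a))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]  # each attribute used once
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        node["branches"][v] = build_tree(subset, remaining,
                                         min_leaf, min_entropy, label_key)
    return node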
In practice, one successful method for finding high-accuracy hypotheses for classification is a technique called rule post-pruning. A variant of this pruning method is used by the well-known program C4.5. Rule post-pruning involves the following steps:
1. Inferring the decision tree from the training data set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur;
2. Converting the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node;
3. Pruning (generalizing) each rule by removing any preconditions whose removal improves its estimated accuracy; and
4. Sorting the pruned rules by their estimated accuracy, and considering them in this sequence when classifying subsequent instances.
In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition), and the classification at the leaf node becomes a rule consequent (postcondition). Next, each such rule is pruned by removing any antecedent, or precondition, whose removal does not worsen its estimated accuracy. The method selects whichever candidate pruning step produces the greatest improvement in estimated rule accuracy, and then considers pruning the next precondition as a further pruning step. No pruning step is performed if it reduces the estimated rule accuracy.
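A hedged sketch of steps 2 and 3, operating on the dictionary-based tree of the earlier sketch, follows; estimate_accuracy is a hypothetical stand-in for whatever accuracy estimator is used (for example, the pessimistic estimate discussed next):

def tree_to_rules(node, path=()):
    """One rule (preconditions, consequent) per root-to-leaf path."""
    if not isinstance(node, dict):                 # leaf node reached
        return [(list(path), node)]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, path + ((node["attribute"], value),))
    return rules

def prune_rule(preconditions, consequent, estimate_accuracy):
    """Greedily drop preconditions while estimated accuracy does not fall."""
    best = list(preconditions)
    improved = True
    while improved and best:
        improved = False
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]    # drop one precondition
            if (estimate_accuracy(candidate, consequent)
                    >= estimate_accuracy(best, consequent)):
                best, improved = candidate, True
                break
    return best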
In the C4.5 technique, performance is evaluated based on the training set itself, using a pessimistic estimate to make up for the fact that the training data gives an estimate biased in favor of the rules. More specifically, C4.5 calculates its pessimistic estimate by calculating the
standard deviation in this estimated accuracy assuming a binomial distribution. For a given confidence level, the low-bound estimate is then taken as the measure of rule performance. For large data sets, this pessimistic estimate is very close to the observed accuracy.
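As a simplified illustration (a plain normal-approximation lower bound rather than C4.5's exact formula), the pessimistic estimate can be sketched as:

import math

def pessimistic_accuracy(correct, covered, z=1.96):
    """Lower confidence bound on a rule's training accuracy (z=1.96 ~ 95%)."""
    if covered == 0:
        return 0.0
    acc = correct / covered
    return max(0.0, acc - z * math.sqrt(acc * (1.0 - acc) / covered))

print(pessimistic_accuracy(18, 20))  # about 0.77, well below the raw 0.90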
The rules can be generally expressed as follows:

IF (feature1 = value1) & (feature2 = value2)
THEN Class = class1
ELSE Class ≠ class1
Converting decision trees to rules permits distinguishing among the different contexts in which a decision node is used and removes the distinction between attribute tests that occur near the root of the tree and those that occur near the "leaves" of the tree. Further, conversion to rules greatly improves the readability of the classification system and renders the system easier to incorporate with a knowledge base for further intelligent inference and reasoning.
An embodiment of the video classification system of the present invention is shown in FIG. 5. This embodiment of the present invention initiates classification with offline training 500. Sample video clips 502 of different categories are identified, and appropriate low-level features are identified. Thereafter, an entropy-based inductive tree-learning algorithm is utilized to establish the trained knowledge base. This knowledge base is represented as a decision tree, with each node in the tree being an if-then-else rule applied to a similarity metric utilizing an appropriate low-level feature that is either user specified or generated, along with a good threshold. This threshold, similarly, may also be user specified or derived; generally it will be derived. The classifier 504 then accepts video or data 506 to be classified and utilizes the rules in the decision tree, in conjunction with visual features extracted from the input video or data 506, to classify the input video or data 506. The classifier 504 then provides the classification results 508. The rule scheme for this embodiment of the invention is shown in FIG. 6, where the rule at each level is depicted as <F, θ> 600, wherein F represents the low-level feature and θ represents the derived threshold. In most situations, as was stated above, the appropriate feature F and a good threshold θ are automatically created by the training process. Furthermore, the semantic
categories for classification form the leaves of the tree. New video data is classified according to the decision tree as follows. First, the feature utilized at the root level is extracted, and the corresponding rule is applied to direct the data to the next level of the rule tree. Second, at the next level, the same step is carried out, whereby an appropriate feature is selected and the corresponding rule applied. Note that if a feature was already calculated earlier in the tree, it need not be recalculated.
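This traversal, with caching so that no feature is recalculated, can be sketched as follows; the dictionary node layout ("feature", "threshold", "left", "right") and the extractor table are assumptions made for the sketch, not structures mandated by the patent:

def classify_clip(clip, node, extractors):
    """Walk the <F, theta> rule tree of FIG. 6, caching extracted features."""
    cache = {}
    while isinstance(node, dict):            # internal <F, theta> node
        f, theta = node["feature"], node["threshold"]
        if f not in cache:                   # reuse features computed earlier
            cache[f] = extractors[f](clip)
        node = node["left"] if cache[f] <= theta else node["right"]
    return node                              # leaf = semantic category

# Hypothetical usage:
# tree = {"feature": "motion_right", "threshold": 0.4,
#         "left": "close-up", "right": "fastbreak to right"}
# category = classify_clip(clip, tree, {"motion_right": extract_motion_right})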
As an example of the above embodiments of the present invention, the decision-tree training algorithm described in the preceding paragraphs was applied to motion, color, and edge features in video data of a basketball game. These features were extracted directly from MPEG-1 video. Nine classes of data for the basketball game data were identified and are outlined in FIG. 7. These classes were as follows: 1. team offense at left court 700; 2. team offense at right court 702; 3. fastbreak to left 704; 4. fastbreak to right 706; 5. dunk in left court 708; 6. dunk in right court 710; 7. score in right court 712; 8. score in left court 714; 9. close-ups of audience members or players 716. In one embodiment of the present invention, a set of data from one basketball game was utilized to train the learning algorithm and obtain the critical patterns of the classes for efficient rule classification. Using the training data for the identified classes, a decision tree of at most three levels, containing fourteen rules, was generated. Thus, classification required at most three calculations for each class. No more than six features were needed to classify all nine basketball event classes.
The resultant decision tree, as outlined in FIG. 7, was prepared according to the present invention. The decision tree prepared according to the present invention was applied to approximately 55 video data clips from a different set of basketball game video data. The results of the classification are outlined in Table 1. Use of the classification method according to the present invention, when applied to basketball game footage, provided classification accuracies of 70% to 82% for all identified classes of the basketball game video data. Results may be improved through additional training.
The rule-based video classification system of the present invention is useful both for on-line and off-line video classification, and thus has applications in video indexing systems, video scene
understanding and data mining, on-line video filtering, intelligent video summarization, and fast video browsing. General video classification problems can be easily resolved by following the embodiment of the present invention illustrated in FIG. 5.
The present invention can further be easily applied to "smart" video filtering based upon a specified user profile. FIG. 8 illustrates the data flow for this type of application according to the present invention. The classification system of the present invention may be utilized for real-time, user-preference-oriented multimedia data distribution via the Internet in "push" applications. This system can provide on-line feature extraction and semantic classification that would be used for matching data to user profiles and for filtering tools for real-time multimedia distribution and sharing over the Internet. The system can also be extended to other related applications, including interactive TV broadcast services, training services, and collaboration.