CN101404032A

CN101404032A - Video retrieval method and system based on contents

Info

Publication number: CN101404032A
Application number: CNA2008102262681A
Authority: CN
Inventors: 尹浩; 惠雯; 张焕强; 黄东; 李铮; 陈文涛
Original assignee: Beijing Blue It Technologies Co ltd; Tsinghua University
Current assignee: Beijing Blue It Technologies Co ltd; Tsinghua University
Priority date: 2008-11-11
Filing date: 2008-11-11
Publication date: 2009-04-08
Anticipated expiration: 2028-11-11
Also published as: CN101404032B

Abstract

The invention discloses a content-based video retrieval method and system. In order to solve the problem of low efficiency of content-based video retrieval, the method disclosed in the invention includes: obtaining the feature vector of the video frame to be detected; The index number of the feature vector of the sensitive video frame to be compared and the predefined conversion rule are used to find the corresponding Cartesian coordinates; according to the found Cartesian coordinates, the feature vector of the video frame to be detected is sent to the retrieval module of the corresponding server; The retrieval module judges the similarity between the eigenvectors of the video frames to be detected and the eigenvectors of the sensitive video frames to be compared. Since the content addressing network is used to organize the video fingerprint database reasonably, establish an index, and obtain the video instances to be detected, some kind of retrieval method is used. The algorithm finds the most matching sensitive video feature vector from the fingerprint library, which improves the efficiency of video retrieval.

Description

A kind of Content-based Video Retrieval method and system

Technical field

The invention belongs to the multimedia application field, particularly a kind of Content-based Video Retrieval method and system.

Background technology

Content-based Video Retrieval (CBID) is can be according to the technology of the quick search video object of video content.By feature extraction object video is mapped as point (promptly extracting video finger print) in the high dimension vector space, so just the arest neighbors that the similarity searching problem of object video is converted in the higher dimensional space searches problem.

For mass data, how setting up effective index structure is the key issue that improves retrieval precision and efficient.Popular multi-dimensional indexing technology comprises gridfile, k-d-B tree, quaternary tree, hB tree, R tree and mutation R+ tree and R* tree etc. now, these all are based on the space or based on the division methods of DATA DISTRIBUTION, have good performance under the situation of dimension not too high (below 10 dimensions).

The multi-dimensional indexing technology comprises that also some are suitable for the more indexing means of higher-dimension, as vector approximation method (VA-file), LSH (Locality Sensitive Hashing, local sensitivity Hash table), space filling curve (space-filling curve) etc.

Adopting which kind of multi-dimensional indexing technology in the prior art all is to finish relevant treatment in a station server, and the efficient of Content-based Video Retrieval is lower like this.

Summary of the invention

For the lower problem of efficient that solves Content-based Video Retrieval, the embodiment of the invention provides a kind of Content-based Video Retrieval method, this method is applied to content addressed network, is provided with a plurality of servers on the node in content addressed network cartesian coordinate space, comprising:

Acquisition module obtains the proper vector of frame of video to be detected;

Search the call number of module according to sensitive video frequency frame proper vector to be compared in the video finger print storehouse, and predefined transformation rule, from content addressed network, find the Cartesian coordinates that correspondence is preserved the server of sensitive video frequency frame proper vector to be compared in the Cartesian coordinates of each server;

Sending module sends to the proper vector of frame of video to be detected the retrieval module of corresponding with service device according to the Cartesian coordinates that finds;

Retrieval module is judged the proper vector of frame of video to be detected and the similarity of sensitive video frequency frame proper vector to be compared, and determines the sensitive video frequency proper vector of coupling.

Simultaneously the embodiment of the invention also provides a kind of Content-based Video Retrieval system, comprising:

The video finger print storehouse: be used to preserve the sensitive video frequency frame proper vector with call number, described video finger print storehouse is evenly distributed in a plurality of servers on the content addressed network cartesian coordinate space node;

Acquisition module: the proper vector that is used to obtain frame of video to be detected;

Search module: be used for call number according to video finger print storehouse sensitive video frequency frame to be compared proper vector, and predefined transformation rule, from the Cartesian coordinates of each server, find the Cartesian coordinates that correspondence is preserved the server of sensitive video frequency frame proper vector to be compared;

Sending module: be used for the proper vector of frame of video to be detected being sent to the retrieval module of corresponding with service device according to the Cartesian coordinates that finds;

Retrieval module: be used to judge the proper vector of frame of video to be detected and the similarity of sensitive video frequency frame proper vector to be compared, and determine the sensitive video frequency proper vector of coupling.

The specific embodiments that is provided by the invention described above as can be seen, just because of application content addressing network rationalization is carried out in the video finger print storehouse, set up index, after obtaining video example to be detected, adopt certain searching algorithm from fingerprint base, to search the sensitive video frequency proper vector of mating most, make the efficient of video frequency searching be improved.

Description of drawings

Fig. 1 is the first embodiment method flow diagram provided by the invention;

Fig. 2 is a content addressed network diagram provided by the invention;

Fig. 3 is the second embodiment system construction drawing provided by the invention.

Embodiment

For the lower problem of efficient that solves Content-based Video Retrieval, the embodiment of the invention provides a kind of Content-based Video Retrieval method, to improve recall precision, adapts to the needs of large scale network video frequency searching.When the user imports video example to be measured, adopt certain searching algorithm from fingerprint base, to search the sensitive video frequency frame and the corresponding sensitive video frequency segment of coupling, and pairing similar sensitive video frequency fragment is returned to the user.The video finger print here is meant the frame of video proper vector of extracting from original video data, can represent the content of this video.

Wherein, the vector approximation method can solve the problem of accurate arest neighbors retrieval, and additive method is then only at approximate arest neighbors retrieval.Because video finger print itself is exactly the approximate representation of video content, the arest neighbors of spatial signature vectors does not also mean that arest neighbors on the video content, so even accurately arest neighbors retrieval does not guarantee to obtain Query Result the most accurately yet.And, under many circumstances, select suitable approximate query algorithm can return and the accurate identical result of search algorithm, and have higher efficient.What video frequency searching needed is the balance of a precision and efficient.Particularly in big, higher to the response time requirement occasion of data scale, approximate arest neighbors retrieval will be brought into play more importantly effect, therefore adopt LSH (LocalitySensitive Hashing) algorithm as preferred version.

The LSH algorithm is at first proposed by Indyk and Motwani, utilizes statistical theory, can solve the k-neighbour fast and inquire about problem under the prerequisite that guarantees certain accuracy (with probabilistic manner).Paper " SimilaritySearch in High Dimensions via Hashing " has provided the specific implementation step of this algorithm, its basic thought is, for the point data collection, utilize one group of hash function to set up a plurality of Hash tables with certain constraint condition, make under certain similarity measure condition, the probability that similar point clashes is bigger, and the probability that dissimilar point clashes is less relatively.

First embodiment provided by the invention is a kind of Content-based Video Retrieval method, adopts the LSH algorithm to carry out index in the present embodiment and sets up, and the LSH function definition is: one group of hash function H={h1, ..., hm}, m are positive integer, for data point p, if q is p, distance D (p between the q, q)＜R, then P[hi (q)=hi (p)]＞P1, if D (p, q)＞cR, then P[hi (q)=hi (p)]＜P2.Wherein function P (.) is a probability function, and P1, P2 are given probability, and P1＞P2, i are random number, i ∈ 1 ..., m}.This group hash function is called as so that (P1 P2) is the LSH group of functions of parameter for R, cR.Data point p wherein, different sensitive video frequency frame proper vector in the corresponding present embodiment of q

The LSH function is:

Wherein vectorial

Satisfy normal distribution (Gaussian distribution), w is any real number, and b is any real number between [0, w].Method flow comprises as shown in Figure 1:

Step 101: n sensitive video frequency frame proper vector is mapped among L the Hash table g.

N sensitive video frequency frame proper vector arranged in the video finger print storehouse

Adopt the LSH algorithm to carry out index and set up, use L hash function g (.) n sensitive video frequency frame proper vector

Be mapped among L the Hash table g, for example: 10 sensitive video frequency frame proper vectors are arranged

With

Wherein

With

Be mapped among the Hash table g1 all the other 5

With

Be mapped among the Hash table g2, adopt different hash function g (.) to be mapped to the sensitive video frequency frame proper vector of Hash table

Can be different, as

With Be mapped among the Hash table g1 all the other 5

With

Be mapped among the Hash table g2.

Step 102: the sensitive video frequency frame proper vector among each Hash table g is carried out hash by the LSH function, and the gained result is carried out the secondary hash again, and the sensitive video frequency frame proper vector in each Hash table is mapped in a plurality of hash buckets.

Hash table gj=[h1 wherein ^(j)..., hk ^(j)] (j=1,2 ..., L), hi (.) ∈ H (i=2 ... k), wherein H is a LSH family of functions, i.e. one group of hash function H={h1 ..., hm}.As in the g1 Hash table

With

Carry out the secondary hash, will

With

Be mapped in 7 hash buckets.Hash table g1=[h1 ⁽¹⁾, h2 ⁽¹⁾, h3 ⁽¹⁾, h4 ⁽¹⁾] expression employing h1 ⁽¹⁾, h2 ⁽¹⁾, h3 ⁽¹⁾, h4 ⁽¹⁾Function will

With

Be mapped in 7 hash buckets.As pass through h1 ⁽¹⁾Will

With

Be mapped to first hash bucket, will

With

Be mapped to second hash bucket, pass through h2 ⁽¹⁾Will

With

Be mapped to the 3rd hash bucket, pass through h3 ⁽¹⁾Will

With Be mapped to the 4th hash bucket, will

With

Be mapped to the 5th hash bucket, pass through h4 ⁽¹⁾Will With

Be mapped to the 6th hash bucket, will

With

Be mapped to the 7th hash bucket.

Step 103:, determine the Cartesian coordinates of L Hash table numbering correspondence in content addressed network according to the numbering and the predefined transformation rule of L Hash table.

The number table of Hash table g1 is shown binary sequence x, as 0100001010.If

Wherein d is the dimension such as the d=2 of Virtual Space, and m is the figure place 10 of binary sequence x.X is divided into groups to a high position from low level, and per 8 is one group, and being divided into is 2 groups (last group can be discontented with 8), and first group is 00001010, the second group is 10, the dimension of 2 in the corresponding Virtual Space.Calculate every group decimal value xi (i=1 ..., d), first group is 10, the second groups is 1, xi%2 ^dBe the i dimension coordinate of pairing node, promptly the Cartesian coordinates of 0100001010 correspondence is (2,1).

Adopt the call number of the numbering of Hash table g in the present embodiment as sensitive video frequency frame proper vector.

Step 104: according to the corresponding relation of numbering and the Cartesian coordinates of Hash table g, L Hash table (including the sensitive video frequency frame proper vector that is mapped to wherein) be distributed in N the server in the content addressed network preserve, each server all has corresponding Cartesian coordinates, wherein N≤L in content addressed network.

The Cartesian coordinates that No. 8 servers (Hash table is numbered 0100001010) are corresponding is (2,1), preserves Hash table g1 in this server.Wherein have Cartesian coordinates each server content addressed network diagram as shown in Figure 2.

Step 105: the acquisition module of arbitrary server obtains the proper vector of frame of video to be detected in the content addressed network

One section video to be detected is arranged from network, and No. 4 servers therefrom obtain the proper vector of the frame of video to be detected of this video in the content addressed network

Step 106: obtain

Server search module, determine the Cartesian coordinates of this Hash table place server successively according to the numbering of sensitive video frequency frame proper vector to be compared place Hash table.

As sensitive video frequency frame proper vector to be compared in Hash table g1, No. 4 servers search the Cartesian coordinates that module at first will be determined No. 8 servers in Hash table g1 place, determine that according to the numbering of Hash table g1 the Cartesian coordinates of No. 8 server correspondences is (2,1).

Step 107: obtain

Server sending module will

Send to the retrieval module of the server of storage sensitive video frequency frame proper vector to be compared place Hash table.

The Cartesian coordinates of No. 4 server correspondences is (1,1), is that (2,1) are adjacent because Cartesian coordinates is (1,1) and Cartesian coordinates, and the sending module of No. 4 servers directly will

Send to the retrieval module of No. 8 servers.

If acquisition module obtains the proper vector of frame of video to be detected

Server be not No. 4 servers but No. 8 servers, then the sending module of No. 8 servers directly sends

Retrieval module to book server.

Step 108: the retrieval module of server of storing sensitive video frequency frame proper vector to be compared place Hash table is right

Carry out the secondary hash, it is mapped in the hash bucket of Hash table.

Retrieval module is right

Carrying out the secondary hash is mapped to it in first hash bucket of Hash table g1.

Step 109: take out with

Fall into the sensitive video frequency frame proper vector of same hash bucket, and calculate wherein each sensitive video frequency frame proper vector and

Euclidean distance, judge similarity each other, and determine the sensitive video frequency proper vector of coupling.

In first hash bucket of taking-up Hash table g1

With

And calculate respectively with Euclidean distance, judge similarity each other, and determine

For with (similar) sensitive video frequency proper vector of coupling.Up to obtaining abundant similar sensitive video frequency frame proper vector, or relatively finish with whole sensitive video frequency frame proper vectors.

Wherein in the step 102 in order to guarantee the performance of LSH algorithm, need to consider two important parameters here---the number k of LSH function h (.) in the number L of Hash table g and each Hash table.The value of L and k can directly have influence on the performance of this algorithm.Consider following performance index: index Time Created: O (nLkt), wherein t is for calculating required time of each h (.), space: O (the nL)+required space of preservation data point, query time: O (L (kt+dnP ₂ ^k)), should guarantee that L and k have following relation:

L = P_{1}^{- k} = {(\frac{1}{P_{1}})}^{k}

Wherein P1 is a given probability in the LSH function as previously mentioned.

Also can adopt in the present embodiment as the multi-dimensional indexing technology as: gridfile, k-d-B tree, quaternary tree, hB tree, R tree and mutation R+ tree and R* tree etc., these all are based on the space or based on the division methods of DATA DISTRIBUTION, generate the index of each sensitive video frequency frame proper vector in the video finger print storehouse by said method, sensitive video frequency frame proper vector with different index number (as 1-10000), be distributed in 10 servers in the content addressed network and preserve, each server all has corresponding Cartesian coordinates in content addressed network.When Content-based Video Retrieval, after wherein the acquisition module of 3# server obtains the proper vector of frame of video to be detected, the 3# server search the call number 1000 of module according to sensitive video frequency frame proper vector to be compared, and predefined transformation rule, from the Cartesian coordinates of 10 servers, find correspondence and preserve the Cartesian coordinates of server of sensitive video frequency frame proper vector to be compared for (0,0); The sending module of 3# server sends to the proper vector of frame of video to be detected the retrieval module of 2# server (Cartesian coordinates is (0,0)) according to the Cartesian coordinates (0,0) that finds;

The retrieval module of 2# server is judged the proper vector of frame of video to be detected and the similarity of sensitive video frequency frame proper vector to be compared.

Second embodiment provided by the invention is an a kind of Content-based Video Retrieval system, and its structure comprises as shown in Figure 3:

Video finger print storehouse 201: be used to preserve the sensitive video frequency frame proper vector with call number, described video finger print storehouse is evenly distributed in a plurality of servers on the content addressed network cartesian coordinate space node;

Acquisition module 202: the proper vector that is used to obtain frame of video to be detected;

Search module 203: be used for call number according to video finger print storehouse sensitive video frequency frame to be compared proper vector, and predefined transformation rule, from the Cartesian coordinates of each server, find the Cartesian coordinates that correspondence is preserved the server of sensitive video frequency frame proper vector to be compared;

Sending module 204: be used for the proper vector of frame of video to be detected being sent to the retrieval module of corresponding with service device according to the Cartesian coordinates that finds;

Retrieval module 205: be used to judge the proper vector of frame of video to be detected and the similarity of sensitive video frequency frame proper vector to be compared, and determine the sensitive video frequency proper vector of coupling.

Further, the video finger print storehouse 201 of each server comprises Hash table 2011: be used to preserve sensitive video frequency frame proper vector, the numbering of described Hash table is as the call number of sensitive video frequency frame proper vector;

Described system also comprises:

Secondary Hash module 206: be used for using the LSH algorithm that each the sensitive video frequency frame proper vector that is kept at each server Hash table is carried out hash, again the gained result carried out the secondary hash, obtain a plurality of hash buckets;

Hash table 2011 comprises a plurality of hash buckets 20111: each the sensitive video frequency frame proper vector that is used for Hash table is preserved and is carried out the sensitive video frequency frame proper vector after the hash twice after carrying out twice hash;

Retrieval submodule 2051: be used for the proper vector of frame of video to be detected is carried out hash twice, obtain the hash bucket of the correspondence that the proper vector of frame of video to be detected is mapped to;

Judge the similarity of sensitive video frequency frame proper vector to be compared in the proper vector of frame of video to be detected and the corresponding hash bucket.

Further, the secondary Hash module 206: also be used for L hash function g (.) with each sensitive video frequency frame proper vector

After being mapped to L the Hash table gj that is kept on each server, by k LSH function hi (.) to each the sensitive video frequency frame proper vector among the Hash table gj Carry out hash, i.e. gj=[h1 (j) ..., hk (j)] (j=1,2 ..., L), hi (.) ∈ H (i=1,2 ... k), H is a LSH family of functions.

Further, retrieval module 205: also be used for the proper vector by calculating frame of video to be detected and the Euclidean distance of sensitive video frequency frame proper vector to be compared and judge similarity each other.

Obviously, those skilled in the art can carry out various changes and modification to the present invention and not break away from the spirit and scope of the present invention.Like this, if of the present invention these are revised and modification belongs within the scope of claim of the present invention and equivalent technologies thereof, then the present invention also is intended to comprise these changes and modification interior.

Claims

1, a kind of Content-based Video Retrieval method is characterized in that, is provided with a plurality of servers on the node in content addressed network cartesian coordinate space, and this method comprises:

Acquisition module obtains the proper vector of frame of video to be detected;

2, the method for claim 1 is characterized in that, also comprises before finding corresponding Cartesian coordinates step:

Use local sensitivity Hash table LSH algorithm that each the sensitive video frequency frame proper vector that is kept in each server Hash table is carried out hash, obtain a plurality of hash buckets, each sensitive video frequency frame proper vector is mapped in each hash bucket, simultaneously with the numbering of each Hash table call number as sensitive video frequency frame proper vector;

Also comprise after sending the retrieval module step of proper vector to the corresponding with service device of frame of video to be detected:

The proper vector of frame of video to be detected is carried out hash twice, obtain the hash bucket of the correspondence that the proper vector of frame of video to be detected is mapped to;

The step of judging characteristic vector similarity is specially:

3, method as claimed in claim 2 is characterized in that, uses the LSH algorithm to carry out hash and be specially being kept at each sensitive video frequency frame proper vector in each server Hash table:

With L hash function g (.) with each sensitive video frequency frame proper vector

Be mapped among L the Hash table gj that is kept on each server;

By k LSH function hi (.) to each the sensitive video frequency frame proper vector among the Hash table gj

Carry out hash, i.e. gj=[h1 (j) ..., hk (j)] (j=1,2 ..., L), hi (.) ∈ H (i=1,2 ... k), H is a LSH family of functions.

4, method as claimed in claim 3, it is characterized in that, use the LSH algorithm that each the sensitive video frequency frame proper vector that is kept in each server Hash table is carried out hash, the number k of LSH function hi (.) has following relation in the number L of Hash table and each Hash table:

Or

At the LSH of LSH algorithm group of functions H={h1 ..., among the hm}, 2 sensitive video frequency frame proper vectors wherein

Between distance

D (\overset{&RightArrow;}{v}, \overset{&RightArrow;}{v} 2) < R,

Then

P [hi (\overset{&RightArrow;}{v} 1) = hi (\overset{&RightArrow;}{v} 2)] > P 1,

Wherein R is predefined distance, P ₁Be predefined probable value.

5, method as claimed in claim 3 is characterized in that, the LSH function is:

Wherein vectorial

Satisfy normal distribution (Gaussian distribution), w is any real number, and b is any real number between [0, w].

6, method as claimed in claim 2 is characterized in that, the step of comparative feature vector similarity is specially:

The Euclidean distance of sensitive video frequency frame proper vector to be compared is judged similarity each other in proper vector by calculating frame of video to be detected and the corresponding hash bucket.

7, a kind of Content-based Video Retrieval system is characterized in that, comprising:

8, system as claimed in claim 7 is characterized in that,

The video finger print storehouse of each server comprises Hash table: be used to preserve sensitive video frequency frame proper vector, the numbering of described Hash table is as the call number of sensitive video frequency frame proper vector;

Described system also comprises:

Secondary Hash module: be used for using the LSH algorithm that each the sensitive video frequency frame proper vector that is kept at each server Hash table is carried out hash, again the gained result carried out the secondary hash, obtain a plurality of hash buckets;

Hash table comprises a plurality of hash buckets: each the sensitive video frequency frame proper vector that is used for Hash table is preserved and is carried out the sensitive video frequency frame proper vector after the hash twice after carrying out twice hash;

Described retrieval module also comprises:

Retrieval submodule: be used for the proper vector of frame of video to be detected is carried out hash twice, obtain the hash bucket of the correspondence that the proper vector of frame of video to be detected is mapped to;

9, system as claimed in claim 8 is characterized in that,

Secondary Hash module: also be used for L hash function g (.) with each sensitive video frequency frame proper vector

After being mapped to L the Hash table gj that is kept on each server, by k LSH function hi (.) to each the sensitive video frequency frame proper vector among the Hash table gj

10, system as claimed in claim 7 is characterized in that,

Retrieval module: also be used for the proper vector by calculating frame of video to be detected and the Euclidean distance of sensitive video frequency frame proper vector to be compared and judge similarity each other.