Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the application in any way.
The technical field to which the present application relates is first described below:
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, involving both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make a machine "see"; more specifically, it uses a camera and a computer in place of human eyes to perform machine vision tasks such as recognition, tracking and measurement on a target, and further performs graphic processing so that the computer produces an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The video text tracking method provided by the embodiments of the present application can be used in various video processing application scenarios, including but not limited to video content analysis, video promotion, video review, and the like. Taking the review of text content in a video as an example, the video text tracking method first detects a certain video frame in the video, determines a text region in that frame, and obtains a text box; then, based on the position of this text box in the video frame, the text box with the highest similarity to it in the adjacent video frame (which may be the video frame before or after it) is determined as the position of the text region in that adjacent frame; finally, the target tracking track of the video text is determined from these corresponding positions. When the content of the video text needs to be extracted, it can be extracted from the corresponding text box using the position information in the target tracking track.
When extracting or reviewing video text, the recognized objects are text in video frames, whose refresh rate is typically high. Therefore, if the text information in every video frame is detected and extracted directly, a large amount of redundant data appears, which is unfavorable for subsequent analysis and processing. In the related art, a track tracking approach is generally adopted, that is, it is first determined in which video frames the same text information is continuously displayed, so that the repeated text information only needs to be extracted or reviewed once. However, in this approach, when tracking video text, the position of the video text must be detected in every frame and the target tracking track of the video text then determined by matching, which requires a long processing time, consumes a large amount of computing resources, and has high cost and low application benefit.
In view of this, in the embodiments of the present application, matching for video text track tracking is achieved by determining a text box in a first video frame of a video and then determining the possible position of the text box in a second video frame adjacent to the first video frame from the positions of particles generated around the text box. When the video text tracking method processes adjacent video frames, the step of detecting the first text box in the first video frame does not need to be executed every time, that is, the position of the text box does not need to be determined from scratch in every video frame. On the one hand, this effectively reduces the time required for video text track tracking, saves computing resources, and improves the processing speed; on the other hand, after the target tracking track of the video text is determined, when the text content of the video needs to be extracted or reviewed, any video frame covered by the target tracking track of the video text can be processed, which facilitates the analysis and review of the video content. It should be noted that, in the embodiments of the present application, a video may refer to an aggregate formed by a plurality of consecutive pictures, and a video frame refers to one of the pictures in the aggregate. The aggregate includes, but is not limited to, files in formats such as MPEG (Moving Picture Experts Group), AVI (Audio Video Interleaved), nAVI (new AVI), ASF (Advanced Streaming Format), MOV (the movie format of the QuickTime software), WMV (Windows Media Video), 3GP (3rd Generation Partnership Project), RM (RealMedia), RMVB (RealMedia Variable Bitrate), FLV (Flash Video), MP4 (MPEG-4), etc., or a plurality of pictures that change during the playing of music.
Fig. 1 is a schematic diagram of an alternative application environment of a video text tracking method according to an embodiment of the present application. Referring to fig. 1, the video text tracking method provided by the embodiment of the present application may be applied to a video text tracking system 100, where the video text tracking system 100 may include a terminal 110 and a server 120, and the specific numbers of terminals 110 and servers 120 may be set arbitrarily. The terminal 110 and the server 120 may establish a communication connection through a wireless network or a wired network. The wireless or wired network may use standard communication techniques and/or protocols and may be configured as the internet or any other network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired or wireless network, a private network, or any combination of virtual private networks. The terminal 110 may send the video whose text is to be tracked to the server 120 over the established communication connection; the server 120 performs the corresponding processing by executing the video text tracking method provided by the embodiment of the present application to obtain the target tracking track of the video text in the video, and then returns the processing result to the terminal 110.
In some embodiments, the terminal 110 may be any electronic product that can interact with a user through one or more of a keyboard, a touchpad, a touch screen, a remote control, a mouse, voice interaction or a handwriting device, and the electronic product may include, but is not limited to, a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a palmtop computer (Pocket PC, PPC), a tablet computer, and the like. The server 120 may be configured as an independent physical server, as a server cluster or a distributed system formed by a plurality of physical servers, or as a cloud server that provides services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms.
It should be understood that fig. 1 illustrates an alternative implementation environment of the video text tracking method according to an embodiment of the present application, and in practical applications the method is not necessarily implemented by the video text tracking system 100 in fig. 1. For example, in some embodiments, the video text tracking method may be implemented locally and independently by the terminal 110, for example by an application program installed on the terminal 110, where the application program may be video playing software, a web platform, or the like. Similarly, the video text tracking method may also be implemented independently by the server 120.
Referring to fig. 2, fig. 2 is an optional flowchart of a video text tracking method according to an embodiment of the present application, where the method in fig. 2 includes steps S201 to S205, and the method may be applied to the video text tracking system 100 described above.
Step S201, determining a first text box from a first video frame of the video.
In the embodiment of the application, a video segment can be divided in advance into a plurality of continuous video frames through a video stream decoding technology, and one frame is arbitrarily selected from these video frames as the first video frame. A text region in the first video frame is then determined, and a text box is marked according to the text region and recorded as the first text box. Specifically, the size and shape of the first text box may be determined according to the text region in the video frame. For example, if the detection targets subtitles in the video, the first text box may be set as a rectangular box. It can be understood that the first text box is mainly used for representing the position information of the text in the first video frame, and the video text may be the text of a subtitle or text at any position in the video picture. A text box in the embodiment of the application includes the border line of the box and the text content, whether it is characters or picture content presented in the form of characters.
Specifically, step S201 in the embodiment of the present application may be implemented through step S210, or through steps S220 to S230, where:
Step S210, detecting a text region in the first video frame to obtain a first text box.
In the embodiment of the present application, the text region refers to a region containing text, and as described above, the region may be any region in the first video frame picture. Here, the detection of the text region may be accomplished using a machine learning model in an artificial intelligence technique. For the machine learning model, the task of detecting the text region may be regarded as a segmentation task, i.e. a partial picture containing text is segmented from a complete video frame picture, and then the first text box may be determined according to the peripheral shape of the partial picture. Specifically, in some embodiments, the shape of the first text box may be preset, for example, set to be rectangular, and the first text box may be as small as possible on the premise that the divided text images are included, for example, a minimum circumscribed rectangle of the text region in the first video frame detected may be taken as the first text box. In some embodiments, the shape and size of the first text box may be preset, and then the center position of the text region in the detected first video frame is used as the center of the first text box to determine the first text box. Here, it should be noted that, the text area is an area where some video text is gathered, when text exists in multiple places in the first video frame, multiple text areas may be determined in the first video frame at the same time, and each text area may determine a corresponding first text box.
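As an illustration of the minimum circumscribed rectangle approach described above, the following is a minimal sketch assuming the segmentation model outputs a binary text mask; OpenCV is used here as one possible toolkit, and the function name and the minimum-area filter are assumptions of this sketch rather than part of the method itself:

```python
import cv2
import numpy as np

def text_boxes_from_mask(text_mask: np.ndarray, min_area: int = 20):
    """Derive axis-aligned first text boxes from a binary text-segmentation mask.

    text_mask: uint8 array (H, W), non-zero where the detector marked text.
    Returns a list of (x, y, w, h) rectangles, one per connected text region.
    Requires OpenCV >= 4 (two return values from findContours).
    """
    contours, _ = cv2.findContours(text_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # ignore tiny noise regions
        # minimum circumscribed (bounding) rectangle of the detected text region
        x, y, w, h = cv2.boundingRect(contour)
        boxes.append((x, y, w, h))
    return boxes
```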
In the embodiment of the application, an initial tracking track can be established in advance. For example, a video frame adjacent to the first video frame is acquired, the text region in the first video frame is detected to obtain one text box, the text region in the adjacent video frame is detected to obtain another text box, the similarity between the two text boxes is determined, and whether the two text boxes match is determined according to the similarity; if they match, the initial tracking track is established according to the position information of the two text boxes. It should be noted that, in the embodiment of the present application, the video frame adjacent to the above-mentioned adjacent video frame may be further acquired, its text boxes identified through the above steps, their similarity determined, and whether to add them to the initial tracking track decided accordingly; that is, the initial tracking track may include the position information of text boxes in two or more video frames. In the embodiment of the present application, the position information includes, but is not limited to, the coordinates of a text box, which may be the coordinates of the center of the text box or of each of its vertices. Optionally, a preset threshold may be set in the embodiment of the present application: when the same matched text box is found in a preset threshold number of consecutive video frames, the initial tracking track is established; when the same matched text box is not found in a preset threshold number of consecutive video frames, the initial tracking track is not established.
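A minimal sketch of the optional consecutive-frame rule described above is given below; `detect(frame)` and `similarity(b1, b2)` are hypothetical stand-ins for the detection network and the text-box similarity computation, the threshold values are illustrative, and for simplicity only a single text box per frame is followed:

```python
def build_initial_track(frames, detect, similarity,
                        match_threshold=0.8, min_consecutive=3):
    """Keep an initial tracking track only if the same text box is matched
    across at least `min_consecutive` consecutive frames."""
    track = []
    prev_box = None
    for frame in frames:
        boxes = detect(frame)
        if prev_box is None:
            if boxes:
                prev_box = boxes[0]
                track.append(prev_box)
            continue
        # take the most similar box in the next frame if it clears the threshold
        best = max(boxes, key=lambda b: similarity(prev_box, b), default=None)
        if best is not None and similarity(prev_box, best) > match_threshold:
            track.append(best)
            prev_box = best
        else:
            break
    return track if len(track) >= min_consecutive else None
```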
Step S220, acquiring a third video frame adjacent to the first video frame from the video.
In the embodiment of the present application, adjacent video frames refer to video frames that are adjacent in time sequence, for example, a plurality of video frames of a video have a video frame a, a video frame B, and a video frame C that are continuous in time sequence, and then the video frame a and the video frame C are adjacent to the video frame B, and it is understood that a third video frame may be located before or after a first video frame in time sequence, where the first video frame is located between a second video frame and the third video frame. The third video frame may be an adjacent video frame adjacent to the first video frame in the initial tracking track.
Step S230, a text box matched with the first video frame and the third video frame is used as a first text box.
In the embodiment of the application, the text box in the first video frame is matched with the text box in the third video frame, so that the first text box is determined. Here, the purpose of matching the first video frame with the third video frame is to determine an initial tracking trajectory of the video text. Specifically, step S230 may be implemented by the following steps S240 to S280.
Step S240, inputting the first video frame and the third video frame into a first text tracking network, wherein the first text tracking network comprises a first detection branch network and a second detection branch network.
As shown in fig. 3, taking the first video frame 101 and the third video frame 102 as an example, the first video frame 101 and the third video frame 102 each have a text region, the text region has text "XXXXX", and the first video frame 101 and the third video frame 102 are input into the first text tracking network 310. In the embodiment of the present application, the first text tracking network 310 may be trained in advance, and the first detection branch network and the second detection branch network are used for detecting a text box and outputting a detection result. In the embodiment of the present application, the first detection branch network and the second detection branch network may have the same structure and may share weights. Specifically, in the embodiment of the present application, the detection branch network is a network capable of performing text detection, and may be, but not limited to, yolo networks (You Only Look Once), CNNs (Convolutional Neural Networks ), LSTM (Long-Short Term Memory, long-short-term memory artificial neural networks), and the like.
Step S250, detecting the first video frame through the first detection branch network to obtain a fourth text box.
Specifically, the first video frame 101 is input into the first detection branch network, and a fourth text box is obtained through the processing of the first detection branch network, where the fourth text box is used for representing the position information of the characters in the first video frame.
Step S260, detecting the third video frame through the second detection branch network to obtain a fifth text box.
Specifically, the third video frame 102 is input into the second detection branch network, and a fifth text box is obtained through the processing of the second detection branch network, where the fifth text box is used for representing the position information of the characters in the third video frame.
Step S270, determining a second similarity of the fourth text box and the fifth text box.
Referring to fig. 4, in an embodiment of the present application, the first text tracking network 310 may further include a first tracking branch network and a second tracking branch network, and step S270 may include steps S301 to S303:
Step S301, extracting a fourth text box through a first tracking branch network to obtain a first feature vector.
Specifically, in the embodiment of the application, the first tracking branch network is connected with the first detection branch network, and the first tracking branch network receives the output of the first detection branch network as input, so that the feature extraction is performed on the fourth text box, and a first feature vector is obtained.
Step S302, extracting the fifth text box through a second tracking branch network to obtain a second feature vector.
Specifically, the second tracking branch network is connected with the second detection branch network, and the second tracking branch network receives output of the second detection branch network as input, so that feature extraction is performed on the fifth text box, and a second feature vector is obtained. In the embodiment of the application, the structures of the first tracking branch network and the second tracking branch network can be the same, and weights can be shared.
Step S303, determining a second similarity according to the first feature vector and the second feature vector.
Specifically, the second similarity may be determined by the Euclidean distance, Manhattan distance, Mahalanobis distance, cosine similarity, or the like, of the first feature vector and the second feature vector.
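For example, a minimal sketch of computing such a similarity from the two feature vectors; the distance-to-similarity conversions used for the Euclidean and Manhattan cases are one possible convention and are assumptions of this sketch:

```python
import numpy as np

def feature_similarity(v1: np.ndarray, v2: np.ndarray, metric: str = "cosine") -> float:
    """Second similarity between the two tracking-branch feature vectors."""
    if metric == "cosine":
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))
    if metric == "euclidean":
        # convert a distance into a similarity in (0, 1]
        return float(1.0 / (1.0 + np.linalg.norm(v1 - v2)))
    if metric == "manhattan":
        return float(1.0 / (1.0 + np.abs(v1 - v2).sum()))
    raise ValueError(f"unknown metric: {metric}")
```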
Step S280, when the second similarity is greater than the first threshold, determining the fourth text box as the first text box.
It will be appreciated that the first threshold may be adjusted as desired, and that the fourth text box is determined to be the first text box when the second similarity is greater than the first threshold. As shown in fig. 3, after the fourth text box is subjected to similarity matching with the fifth text box to obtain a second similarity, the second similarity is compared with a first threshold value, and when the second similarity is greater than the first threshold value, a recognition result is obtained, namely the fourth text box is determined to be the first text box.
Referring to fig. 5, in particular, step S230 may also be implemented through steps S310 to S360.
Step S310, inputting the first video frame and the third video frame into a second text tracking network, wherein the second text tracking network comprises a third detection branch network and a fourth detection branch network, the third detection branch network comprises a first sub-network and a second sub-network, and the fourth detection branch network comprises a third sub-network and a fourth sub-network.
Specifically, in fig. 5, the first video frame 101 and the third video frame 102 are still described as an example, and the first video frame 101 and the third video frame 102 are input into the second text tracking network 320. Similarly, the second text tracking network may also be trained in advance, where the first sub-network and the second sub-network in the third detection branch network, and the third sub-network and the fourth sub-network in the fourth detection branch network are all used to detect the text box, and output a detection result. Wherein the first sub-network and the second sub-network may receive the same input information and the third sub-network and the fourth sub-network may receive the same input information. In the embodiment of the present application, the structures and weight parameters of the third detection branch network and the fourth detection branch network may be set to be the same. In the embodiment of the present application, the third detection branch network and the fourth detection branch network may further include more than two sub-networks, and the number of sub-networks in the third detection branch network and the fourth detection branch network may be the same or different.
Step S320, detecting the first video frame through a first sub-network to obtain a sixth text box, and detecting the first video frame through a second sub-network to obtain a seventh text box;
Specifically, step S320 may be implemented through steps S401 to S402:
Step S401, extracting features of the first video frame through the first sub-network with downsampling by a first multiple, and detecting to obtain a sixth text box;
Step S402, extracting features of the first video frame through the second sub-network with downsampling by a second multiple, and detecting to obtain a seventh text box.
In the embodiment of the application, downsampling is a processing mode for compressing an image: the size of the image is reduced after the downsampling operation, and the degree of reduction is related to the downsampling period. In the embodiment of the application, feature extraction is performed with downsampling by the first multiple and by the second multiple, so that image features of different depths are extracted at different image scales for detecting the text region. Specifically, the difference between the first multiple and the second multiple may be adjusted according to actual needs, which is not limited here. It can be understood that, when there are more than two sub-networks, feature extraction may also be performed on the first video frame with a third multiple different from the first multiple and the second multiple, so as to obtain more different text box detection results and improve the matching accuracy.
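A toy sketch of two sub-networks with different downsampling multiples is given below; the tiny stack of strided convolutions is only a stand-in for a real detection backbone and is meant solely to show how the first multiple and the second multiple differ:

```python
import torch
import torch.nn as nn

class DownsampleSubNet(nn.Module):
    """One detection sub-network: extracts features at a fixed downsampling factor.

    `factor` is assumed to be a power of two; each stride-2 convolution halves
    the spatial resolution, so stacking log2(factor) of them gives the multiple.
    """
    def __init__(self, factor: int, channels: int = 64):
        super().__init__()
        layers, in_ch, stride_left = [], 3, factor
        while stride_left > 1:
            layers += [nn.Conv2d(in_ch, channels, 3, stride=2, padding=1), nn.ReLU()]
            in_ch, stride_left = channels, stride_left // 2
        self.features = nn.Sequential(*layers)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.features(frame)   # (N, C, H/factor, W/factor)

# first and second sub-networks at different multiples, e.g. 8x and 16x
subnet_8x, subnet_16x = DownsampleSubNet(8), DownsampleSubNet(16)
```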
Step S330, detecting the third video frame through a third sub-network to obtain an eighth text box, and detecting the third video frame through a fourth sub-network to obtain a ninth text box;
Specifically, step S330 may be implemented by step S403 to step S404:
Step S403, extracting features of the third video frame through the third sub-network with downsampling by the first multiple, and detecting to obtain an eighth text box;
Step S404, extracting features of the third video frame through the fourth sub-network with downsampling by the second multiple, and detecting to obtain a ninth text box.
In the embodiment of the application, when features are extracted from the third video frame, the third sub-network and the fourth sub-network with different downsampling multiples can be used; the downsampling multiple of the third sub-network can be the same as that of the first sub-network, and that of the fourth sub-network the same as that of the second sub-network, which facilitates subsequent matching.
Step S340, determining third similarity of the sixth text box and the eighth text box, and determining fourth similarity of the seventh text box and the ninth text box;
In the embodiment of the present application, the third similarity between the sixth text box and the eighth text box and the fourth similarity between the seventh text box and the ninth text box may be determined in the manner of step S270.
Step S350, determining a fifth similarity according to the third similarity and the fourth similarity;
Specifically, step S350 may be implemented through steps S501 to S504:
Step S501, obtaining confidence degrees of a sixth text box, a seventh text box, an eighth text box and a ninth text box;
step S502, determining a first weight according to the average confidence degrees of the sixth text box and the eighth text box;
Specifically, the average of the confidence of the sixth text box and the confidence of the eighth text box is calculated, and this average confidence is taken as the first weight.
Step S503, determining a second weight according to the average confidence of the seventh text box and the ninth text box;
Specifically, the average of the confidence of the seventh text box and the confidence of the ninth text box is calculated, and this average confidence is taken as the second weight.
Step S504, performing a weighted summation of the third similarity and the fourth similarity according to the first weight and the second weight to obtain a fifth similarity.
Specifically, the fifth similarity can be calculated by the following formula:
$$S(b_1,b_2)=\sum_{i}\frac{c_i^{1}+c_i^{2}}{\sum_{j}\left(c_j^{1}+c_j^{2}\right)}\, s_i(b_1,b_2)$$
where $c_i^{1}$ is the confidence of text box $b_1$ in the i-th sub-network of the third detection branch network, $c_i^{2}$ is the confidence of text box $b_2$ in the i-th sub-network of the fourth detection branch network, $s_i(b_1,b_2)$ is the similarity of $b_1$ and $b_2$ in the corresponding i-th sub-network, and $S(b_1,b_2)$ is the resulting similarity of $b_1$ and $b_2$; the weights are therefore the average confidences of each sub-network pair, normalized so that they sum to 1.
For example, when i = 1, $c_1^{1}$ is the confidence of text box $b_1$ in the 1st sub-network of the third detection branch network (i.e. the confidence of the sixth text box in the first sub-network), $c_1^{2}$ is the confidence of text box $b_2$ in the 1st sub-network of the fourth detection branch network (i.e. the confidence of the eighth text box in the third sub-network), and $s_1(b_1,b_2)$ is the similarity between the sixth text box and the eighth text box (the third similarity); the same applies when i = 2, and the description is not repeated here.
It will be understood that when the third detection branch network and the fourth detection branch network have more than two sub-networks, the confidence levels of different text boxes may be determined according to steps S501-S504, and the average confidence levels and weights between the different text boxes may be determined, and the weighted summation may be performed by using the above formula to determine the fifth similarity.
When the initial tracking track is determined, if the video text has more than two text boxes in two consecutive video frames, all the text boxes in the two video frames can be detected first, the similarity between each text box in one video frame and each text box in the other video frame can be calculated, and a similarity matrix can be formed in combination with the intersection-over-union of the text boxes. The text box combinations are then paired using a bipartite-graph maximum-weight matching method, so that the pairing result maximizes the sum of the similarity and the intersection-over-union, thereby completing the pairing of each text box. It should be noted that a pairing may be considered successful, that is, a successful match, only when the sum of the similarity and the intersection-over-union of the paired text boxes is greater than or equal to a pairing threshold. The similarity of the text boxes here refers to the similarity result determined by the calculation formula in step S504.
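One common way to realize the bipartite-graph maximum-weight matching described above is the Hungarian algorithm; the sketch below uses SciPy's `linear_sum_assignment` on the combined similarity/IoU matrix, and the pairing threshold value is illustrative only:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_text_boxes(sim_matrix, iou_matrix, pairing_threshold=1.0):
    """Pair text boxes between two consecutive frames.

    sim_matrix[i, j]: similarity between box i of frame A and box j of frame B
    iou_matrix[i, j]: intersection-over-union of the same pair
    Pairs are chosen so the total (similarity + IoU) is maximal; pairs whose
    score falls below the pairing threshold are then discarded.
    """
    sim_matrix = np.asarray(sim_matrix)
    iou_matrix = np.asarray(iou_matrix)
    cost = -(sim_matrix + iou_matrix)      # maximize total weight = minimize negated weight
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols)
            if sim_matrix[i, j] + iou_matrix[i, j] >= pairing_threshold]
```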
Step S360, when the fifth similarity is larger than the second threshold, determining the sixth text box or the seventh text box as the first text box.
It may be appreciated that the second threshold may be adjusted according to the actual situation, and when the fifth similarity is greater than the second threshold, one of the sixth text box or the seventh text box may be randomly used as the first text box, or a text box with a higher confidence may be used as the first text box according to the confidence of the sixth text box and the seventh text box.
Step S202, generating a plurality of particles at a position corresponding to the first text box in a second video frame of the video, wherein the first video frame and the second video frame are adjacent.
It is understood that when the first text box is determined in step S220, the sequence of the first video frame, the second video frame, and the third video frame may be the third video frame, the first video frame, the second video frame, or the second video frame, the first video frame, and the third video frame.
In an embodiment of the present application, the plurality of particles are generated at corresponding positions, including but not limited to, within the corresponding positions, or around the corresponding positions, or centered around corners of the corresponding positions. For example, when the first text box is rectangular, a plurality of particles may be generated in a position of the rectangle corresponding to the first text box in the second video frame, or around the position of the rectangle, or around one of four corner positions of the position of the rectangle. It will be appreciated that the number of particles generated may be adjusted, the particles may be used to characterize the positional information of the text box, such as the particles including, but not limited to, coordinates characterizing one of the vertices of the text box or the center coordinates of the text box, the size, shape, etc. of the particles may be adjusted as desired.
Step S203, determining a plurality of second text boxes in the second video frame according to the positions of the respective particles.
Specifically, step S203 may be determined by:
a plurality of second text boxes are determined in the second video frame with the location of each particle as the midpoint or any vertex of the text box.
Specifically, the position of each particle is taken as the midpoint of the text box or any vertex of the text box, and the size information of the first text box is combined, wherein the size information includes, but is not limited to, the length and the width, so that a plurality of second text boxes are determined in the second video frame, and it should be noted that the size of the second text boxes is the same as the size of the first text boxes.
As shown in fig. 6, a schematic diagram of generating particles 1031 in a second video frame 103 by a first text box 1011 of a first video frame 101 is shown in fig. 6. Taking the first text box 1011 in the first video frame 101 as an example, a plurality of particles 1031 are generated at the position of the rectangle corresponding to the first text box 1011 in the second video frame 103. Specifically, for example, a plurality of particles 1031 may be generated around the vertex of the upper left corner of the rectangle, where each particle 1031 may characterize the coordinates of the vertex of the upper left corner of the rectangular text box, and in combination with the size information of the first text box 1011, the second text box corresponding to each particle 1031 may be determined. It should be noted that the particles 1031 may be generated according to a predetermined rule or randomly generated, and the predetermined rule includes, but is not limited to, generation in a graph formed centering on the upper left corner. It will be appreciated that the particles 1031 may also be generated at the center of the rectangle or at other vertices of the rectangle, and accordingly, the particles 1031 may correspond to the center coordinates characterizing the rectangular text box or the coordinates of other vertices of the rectangular text box.
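A minimal sketch of the particle generation described above, assuming rectangular boxes, a Gaussian scatter around the upper-left corner, and inheriting the size of the first text box; the spread and particle count are illustrative:

```python
import numpy as np

def generate_candidate_boxes(first_box, num_particles=100, spread=10.0, rng=None):
    """Generate candidate second text boxes in the adjacent frame.

    first_box: (x, y, w, h) of the first text box in the first video frame.
    Each particle is a perturbed top-left corner; the box size is inherited
    from the first text box, as described in the text above.
    """
    rng = rng or np.random.default_rng()
    x, y, w, h = first_box
    offsets = rng.normal(0.0, spread, size=(num_particles, 2))  # scatter around (x, y)
    particles = np.array([x, y]) + offsets
    return [(float(px), float(py), w, h) for px, py in particles]
```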
Step S204, determining first similarity between the first text box and each second text box, and taking the second text box with the highest first similarity as a third text box;
In the embodiment of the present application, the first similarity is determined by the calculation result of the formula in step S504; it can be understood that the first similarity may also be determined by the method in step S270.
Step S205, determining a target tracking track of the video text according to the first text box and the third text box, wherein the target tracking track is used for representing the position information of the video text.
Specifically, step S205 may include step S601 or step S602.
Step S601, when the first similarity between the first text box and the third text box is greater than a third threshold, adding the position information of the third text box to the target tracking track.
Step S602, when the first similarity between the first text box and the third text box is smaller than a fourth threshold, ending the track tracking of the video text to obtain the target tracking track.
In the embodiment of the present application, two thresholds may be set for the first similarity between the first text box and the third text box, and the two thresholds are denoted as a third threshold and a fourth threshold, where the third threshold and the fourth threshold may be set at the same time, and the size of the third threshold should be greater than or equal to the fourth threshold. For example, in a manner in which the percentage is a measure of similarity, when the first similarity between the first text box and the third text box is 100%, it is explained that the first text box and the third text box are identical, the third threshold may be set to 80%, and the fourth threshold may be set to 50%. Of course, the above values are for convenience of illustration, and the actual threshold size may be flexibly adjusted as required.
When the first similarity of the first text box and the third text box is greater than the third threshold, for example, the first similarity of the first text box and the third text box is 90%, indicating that there is a text box in the second video frame that is very similar to the first text box, i.e., the content in the third text box is likely to be consistent with the content in the first text box. Thus, it can be considered that these text contents exist in both the first video frame and the second video frame, so that the position information of the third text box can be added to the target track of the text. Specifically, in the embodiment of the present application, the target tracking track refers to the position information of the video text in each continuous video frame sequence, and includes two aspects, wherein the first aspect is that the video text is distributed in which video frames, and the second aspect is that the specific position of the video text in each video frame.
Conversely, when the first similarity between the first text box and the third text box is less than the fourth threshold, for example 30%, this indicates that even the second text box most similar to the first text box in the second video frame (i.e., the third text box) is not actually very similar to the first text box. It can therefore be considered that the text content of the first text box in the first video frame no longer appears at the corresponding position in the second video frame, i.e. the last frame in which this video text exists is the first video frame. At this point, the track tracking of the video text in the first text box is finished and can be considered complete, so that the target tracking track is obtained.
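The dual-threshold decision of steps S601 and S602 can be sketched as follows; the threshold values are the illustrative ones from the text above, and the behaviour for similarities falling between the fourth and third thresholds is not specified there, so simply continuing to track without committing the box is an assumption of this sketch:

```python
def update_track(track, third_box, first_similarity,
                 third_threshold=0.8, fourth_threshold=0.5):
    """Dual-threshold track update. Returns True if tracking should continue
    with the next pair of video frames, False if the track is complete."""
    if first_similarity > third_threshold:
        track.append(third_box)   # same text still present: extend the target tracking track
        return True
    if first_similarity < fourth_threshold:
        return False              # text no longer present: the track is complete
    # scores between the two thresholds: keep tracking without committing the box
    return True
```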
It should be noted that, in the embodiment of the present application, the processing object of step S601 and step S602 may be understood as any pair of video frames in the target tracking track processing procedure. For example, suppose the target tracking track of a piece of video text characterizes the video text as continuously existing in the 15th to 25th video frames of a video. When the target tracking track is determined by the video text tracking method in the embodiment of the application, assuming that processing starts from the 15th frame according to the video frame labels, then for the pair of video frames consisting of the 17th frame and the 18th frame, a first text box containing the video text can be determined from the 17th frame and a third text box determined in the 18th frame; the comparison shows that the first similarity between the third text box of the 18th frame and the first text box of the 17th frame is greater than the third threshold, so the position information of the third text box of the 18th frame can be added to the target tracking track of the video text.
For the pair of video frames consisting of the 25th frame and the 26th frame, the first text box containing the video text can be determined from the 25th frame and the third text box determined in the 26th frame; the comparison shows that the first similarity between the third text box of the 26th frame and the first text box of the 25th frame is smaller than the fourth threshold, so the track tracking of the video text can be considered complete at this point, and the target tracking track of the video text covers the frames from the starting frame at which recognition began (the 15th frame) to the first video frame at the position where tracking finished (the 25th frame). In addition, when the first text box is determined, for frames other than the starting frame, the third text box from the previous recognition may be used as the first text box in the next recognition; for example, for the pair of video frames consisting of the 18th frame and the 19th frame, the third text box in the 18th frame determined in the previous recognition of the 17th and 18th frames may be used as the first text box of the 18th frame in the current recognition.
The following describes the technical solution of the present application in detail with reference to specific application embodiments, and it should be understood that the model types and model structures adopted in the following description do not limit the practical application of the present application.
In the embodiment of the application, a Yolo-v3 (You Only Look Once v3) network can be used as the detection branch network to build the text tracking network. Specifically, the Yolo-v3 network is a classical neural network in the field of target detection. The network is a fully convolutional network that makes extensive use of skip connections with a residual mechanism and performs downsampling between feature maps with convolutions of stride 2. For image feature extraction, the Yolo-v3 network adopts part of the Darknet-53 network structure (containing 53 convolution layers). Notably, the Yolo-v3 network detects targets at 32×, 16× and 8× downsampling respectively, i.e. it identifies target positions on feature maps at three scales (13 × 13, 26 × 26 and 52 × 52 respectively for the commonly used 416 × 416 input) and generates a feature representation of the target box, i.e. the feature map of the target box region. For the embodiment of the present application, the region where the text is located is the part that the Yolo-v3 network detects and frames, i.e. the target text box; the network produces prediction results for three feature maps under the three scale predictions, and the prediction process on each feature map can be regarded as one text box detection.
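For reference, the relationship between the downsampling multiple and the detection grid size can be checked with a few lines, assuming the commonly used 416 × 416 Yolo-v3 input resolution:

```python
# Relationship between input resolution, downsampling multiple and detection grid.
input_size = 416
for factor in (8, 16, 32):
    grid = input_size // factor
    print(f"{factor:>2}x downsampling -> {grid}x{grid} feature map")
# 8x -> 52x52, 16x -> 26x26, 32x -> 13x13
```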
Referring to fig. 7, fig. 7 is a schematic diagram of a portion of the text tracking network constructed with Yolo-v3 detection branch networks during the processing of a text tracking task. In fig. 7, the video frame 401 and the video frame 402 are two adjacent video frames, which are respectively input into the text tracking network; the video frame 401 is processed, for example, through the first Yolo-v3 network to obtain detection results at three scales. For example, in the embodiment of the present application, the feature map A1 is the target text box feature map detected and generated at 8× downsampling, the feature map A2 is the target text box feature map detected and generated at 16× downsampling, and the feature map A3 is the target text box feature map detected and generated at 32× downsampling. Here, the three feature maps may be aligned by ROI Align (a target region alignment layer) so that they have a uniform size, for example 14 × 14 each. A feature vector is then extracted from the feature map A1, that is, the feature map is mapped to a vector space, and the obtained feature vector is denoted as feature vector C1; the same processing is performed on the feature map A2 and the feature map A3 to obtain feature vectors denoted as feature vector C2 and feature vector C3 respectively.
The processing of the video frame 402 is similar to that of the video frame 401, except that it is detected by another Yolo-v3 network, denoted as the second Yolo-v3 network. It should be noted that the same Yolo-v3 network could also be used to process the video frame 402; another Yolo-v3 network is used here in order to process the video frame 401 and the video frame 402 synchronously, that is, there is no need to wait for the processing of the video frame 401 to finish before processing the video frame 402, which greatly shortens the required processing time.
The first Yolo-v3 network and the second Yolo-v3 network may share weights, i.e. the network parameters in the first Yolo-v3 network and the second Yolo-v3 network may be set to be the same, so as to reduce interference of the network parameter difference on the obtained identification result. Similarly, after the second Yolo-v3 network detects the video frame 402, three detection results of the feature maps are generated, the feature map B1 of the target text box detected and generated during 8 times of downsampling is marked, the feature map B2 of the target text box detected and generated during 16 times of downsampling is marked, and the feature map B3 of the target text box detected and generated during 32 times of downsampling is marked. And then extracting the feature vectors of the feature map B1, the feature map B2 and the feature map B3 respectively to obtain a feature vector D1, a feature vector D2 and a feature vector D3. Here, the network used for extracting the feature map B1, the feature map B2, and the feature map B3 may be the same as the network structure and parameters used for extracting the feature map A1, the feature map A2, and the feature map A3, so as to reduce interference of the network structure and the parameters on the obtained recognition result.
After the feature vectors C1, C2, C3, D1, D2 and D3 are extracted, similarity matching is performed on the pairs C1/D1, C2/D2 and C3/D3. The similarity between C1 and D1 is denoted as similarity S1, the similarity between C2 and D2 as similarity S2, and the similarity between C3 and D3 as similarity S3. Since the structure and network parameters of the first Yolo-v3 network and the second Yolo-v3 network are the same, and the networks used to extract feature vector C1 and feature vector D1 also share the same structure and parameters, the size of the similarity S1 between C1 and D1 can effectively reflect the similarity of the text boxes in video frame 401 and video frame 402; similarly, similarity S2 and similarity S3 can also effectively reflect the similarity of the text boxes in the two frames. Therefore, in the embodiment of the present application, the similarity of the text boxes in video frame 401 and video frame 402 may be comprehensively determined from similarity S1, similarity S2 and similarity S3, and is denoted as X. In some embodiments, the similarity X may be determined from the average of S1, S2 and S3, so that the detection results of the neural network at different scales are weighed evenly and the resulting similarity X reduces the negative effect of any single inaccurate prediction. In some embodiments, confidence information of the detection results generated by the Yolo-v3 network at each prediction scale may also be obtained, and the reliability of S1, S2 and S3 determined from these confidences. For example, suppose the confidence of the first Yolo-v3 network at the prediction scale of feature map A1 is 0.8, its confidence at the scale of feature map A2 is 0.9, and its confidence at the scale of feature map A3 is 0.85; because the structure and parameters of the second Yolo-v3 network are identical to those of the first Yolo-v3 network in the embodiment of the present application, its confidences are considered to be the same. Since the confidence of feature map A1 and the confidence of feature map B1 are both 0.8, the reliability of similarity S1 can be represented by the confidence 0.8, and similarly the reliability of similarity S2 and similarity S3 can be represented by the confidences 0.9 and 0.85 respectively. The weights of S1, S2 and S3 can then be determined from the magnitudes of the three confidences. Taking the case where the weights of S1, S2 and S3 sum to 1 as an example, the calculation formula of similarity X can be expressed as follows:
X=0.314*S1+0.353*S2+0.333*S3
S1, S2 and S3 in the above formula represent the magnitudes of similarity S1, similarity S2 and similarity S3 respectively, and X represents the magnitude of similarity X.
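The weights in the formula above can be reproduced by normalizing the average per-scale confidences; a small sketch follows, where the similarity values passed in are placeholders and only the computed weights are of interest:

```python
def fuse_similarities(similarities, confidences_a, confidences_b):
    """Confidence-weighted fusion of per-scale similarities into one score."""
    avg_conf = [(ca + cb) / 2.0 for ca, cb in zip(confidences_a, confidences_b)]
    total = sum(avg_conf)
    weights = [c / total for c in avg_conf]
    fused = sum(w * s for w, s in zip(weights, similarities))
    return fused, weights

# With the confidences from the example above (identical for both branches):
_, weights = fuse_similarities([0.0, 0.0, 0.0], [0.8, 0.9, 0.85], [0.8, 0.9, 0.85])
print([round(w, 3) for w in weights])   # [0.314, 0.353, 0.333]
```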
Of course, it can be understood that when the two detection branch network structures or parameters of the text tracking network are different, the weights of the respective similarities can be determined in a similar manner, and only the corresponding confidence coefficients need to be averaged. For example, when the structures or parameters of the first Yolo-v3 network and the second Yolo-v3 network are different, the average value of the confidence coefficient of the feature map A1 and the confidence coefficient of the feature map B1 can be used as a judging basis for measuring the reliability of the similarity S1.
It should be noted that, in the embodiment of the present application, for two consecutive video frames, the identified text box may be either a text box of a subtitle or a text box of a text in the picture content. When there are multiple groups of text boxes in two consecutive video frames, which may contain the same text content, to be identified, the method in the foregoing embodiment may be used to identify each group of text boxes sequentially, or may also identify multiple groups of text boxes simultaneously.
Referring to fig. 8, fig. 8 shows a specific structure of a text tracking network constructed with Yolo-v3 detection branch networks together with a first tracking branch network and a second tracking branch network, and a schematic diagram of obtaining the similarities from the output feature vectors. Specifically, the first tracking branch network includes an ROI Align layer 501 (target region alignment layer), an ROI Align layer 502, an ROI Align layer 503, a layer 601, a layer 602, a layer 603, a connection layer 701, a connection layer 702, and a connection layer 703, where the layers 601, 602 and 603 each include a convolution layer and an average pooling layer. Similarly, the second tracking branch network includes an ROI Align layer 504, an ROI Align layer 505, an ROI Align layer 506, a layer 604, a layer 605, a layer 606, a connection layer 704, a connection layer 705 and a connection layer 706, where the layers 604, 605 and 606 each include a convolution layer and an average pooling layer.
In the embodiment of the present application, the two adjacent video frames 401 and 402 are likewise input into the text tracking network, and the feature maps A1, A2, A3, B1, B2 and B3 are obtained through detection at the first, second and third scales. The feature map A1 is input into the ROI Align layer 501, the output of the ROI Align layer 501 is input into the layer 601, and the output of the layer 601 is input into the connection layer 701 to obtain the feature vector C1; the feature map A2 is input into the ROI Align layer 502, the output of the ROI Align layer 502 is input into the layer 602, and the output of the layer 602 is input into the connection layer 702 to obtain the feature vector C2; the feature map A3 is input into the ROI Align layer 503, the output of the ROI Align layer 503 is input into the layer 603, and the output of the layer 603 is input into the connection layer 703 to obtain the feature vector C3. In the same way, the feature map B1 is processed through the ROI Align layer 504, the layer 604 and the connection layer 704 to obtain the feature vector D1, the feature map B2 through the ROI Align layer 505, the layer 605 and the connection layer 705 to obtain the feature vector D2, and the feature map B3 through the ROI Align layer 506, the layer 606 and the connection layer 706 to obtain the feature vector D3. Then, the feature vector C1 and the feature vector D1 are input into the connection layer 801 to obtain the similarity S1, the feature vector C2 and the feature vector D2 are input into the connection layer 802 to obtain the similarity S2, and the feature vector C3 and the feature vector D3 are input into the connection layer 803 to obtain the similarity S3.
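A per-scale block of a tracking branch might be sketched as follows; the channel sizes are illustrative, `torchvision.ops.roi_align` stands in for the ROI Align layers, and a plain cosine similarity is used in place of the connection layers 801-803, whose exact form is not specified above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class TrackingBranchScale(nn.Module):
    """One per-scale block of a tracking branch: ROI Align, then a convolution
    with average pooling, then a fixed-length feature vector (e.g. C1 or D1)."""
    def __init__(self, in_channels: int = 256, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(embed_dim, embed_dim)

    def forward(self, feature_map: torch.Tensor, boxes: torch.Tensor, spatial_scale: float):
        # boxes: (K, 5) tensor [batch_index, x1, y1, x2, y2] in input-image coordinates
        roi = roi_align(feature_map, boxes, output_size=(14, 14), spatial_scale=spatial_scale)
        x = self.pool(F.relu(self.conv(roi))).flatten(1)
        return self.fc(x)   # one feature vector per text box

def scale_similarity(vec_c: torch.Tensor, vec_d: torch.Tensor) -> torch.Tensor:
    # e.g. S1 computed from C1 and D1
    return F.cosine_similarity(vec_c, vec_d, dim=1)
```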
Referring to fig. 9, an embodiment of the present application further provides a video processing method. The video processing method may be applied to a terminal, to a server, or to software in the terminal or the server to implement part of the software functions. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, big data, an artificial intelligence platform, or the like; and the software may be an application program for playing video, or the like, but is not limited to the above forms. Fig. 9 is a schematic flowchart of an alternative video processing method according to an embodiment of the present application, where the method mainly includes steps 701 to 703:
step 701, obtaining a plurality of continuous video frames of a video;
step 702, obtaining tracking tracks of a plurality of pieces of video text in the video through the above video text tracking method;
step 703, extracting the video frames according to the tracking track to obtain a keyframe set of the video.
In the embodiment of the application, a video processing method is provided, by which a key frame set of a video can be effectively extracted. Here, a key frame set refers to a set of video frames that can reflect and describe the video content. For example, the lines of dialogue in a video are very helpful for understanding the video content; the text presented in the subtitle at each moment can be taken as one line, and the video frames covering each line can be selected as the key frame set, which facilitates the review or recommendation of the video content. For example, referring to fig. 10, suppose a certain video clip includes 50 consecutive video frames in which five lines are displayed in total: the first line T1 is distributed from frame 1 to frame 15, the second line T2 from frame 16 to frame 21, the third line T3 from frame 22 to frame 37, the fourth line T4 from frame 37 to frame 42, and the fifth line T5 from frame 43 to frame 50. Then, with the lines as the tracking targets of the video text, five tracking tracks of the five lines can be obtained by the video text tracking method; the first tracking track records the position information of the first line T1, and this position information characterizes the video frames in which the first line T1 is distributed, namely the 1st to 15th frames. Therefore, according to the tracking track of the first line T1, one frame can be extracted from the video frames covered by the track to reflect the text content of the first line T1. Similarly, for the second line T2 to the fifth line T5, one frame is extracted from the video frames covered by the corresponding tracking track to reflect the text content of the corresponding line, and the extracted video frames serve as the key frame set. For example, from the aforementioned 50 consecutive video frames, a key frame set consisting of the 10th, 18th, 29th, 41st and 44th video frames may be extracted for this segment of video. It can be understood that, in the embodiment of the present application, the choice of frame numbers and key frames of the video is only for convenience of illustration, and can be flexibly adjusted as needed in the actual implementation process.
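A sketch of the key-frame extraction in step 703 is given below, assuming each tracking track is reduced to the list of frame indices it covers; picking the middle frame of each track is just one possible policy, which is why the resulting frame numbers differ from the illustrative ones above:

```python
def extract_key_frames(tracks):
    """Pick one representative frame per tracking track, e.g. the middle frame
    of the span the track covers. `tracks` maps a text line to the list of
    frame indices in which it appears."""
    key_frames = set()
    for frame_indices in tracks.values():
        if frame_indices:
            key_frames.add(frame_indices[len(frame_indices) // 2])
    return sorted(key_frames)

# e.g. for the five lines described above
tracks = {"T1": list(range(1, 16)), "T2": list(range(16, 22)),
          "T3": list(range(22, 38)), "T4": list(range(37, 43)),
          "T5": list(range(43, 51))}
print(extract_key_frames(tracks))   # one key frame per line: [8, 19, 30, 40, 47]
```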
The above video processing method mainly applies the target tracking track of the video text to determine in which video frames the video text is distributed, so that one of those video frames can be selected for analyzing and reviewing the video content. In other embodiments, the specific position of the video text in each video frame can also be determined using the target tracking track of the video text; for example, when text that does not meet the relevant specifications is found in the picture of a certain video segment and needs to be shielded, the specific position of the video text in each video frame can be rapidly determined from the target tracking track of the video text, so that the staff can conveniently mask it in time.
Referring to fig. 11, the embodiment of the application also discloses a video text tracking device, which comprises:
a first processing module 910, configured to determine a first text box from a first video frame of a video;
a particle generation module 920, configured to generate a plurality of particles in a second video frame of the video at a position corresponding to the first text box;
a second processing module 930, configured to determine a plurality of second text boxes in the second video frame according to the positions of the respective particles;
a similarity determining module 940, configured to determine a first similarity between the first text box and each of the second text boxes, and to take the second text box with the highest first similarity as a third text box;
a track determining module 950, configured to determine a target tracking track of the video text according to the first text box and the third text box, where the target tracking track is used for representing the position information of the video text. An illustrative sketch of how these modules may cooperate follows the list.
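A minimal Python sketch of how the modules 920 to 950 may cooperate is given below. It assumes that candidate boxes are generated by random perturbation of the first text box and that intersection over union stands in for the first similarity; the class and function names are illustrative, and the text detection performed by the first processing module 910 is left abstract.

import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); an assumed box representation


def iou(a: Box, b: Box) -> float:
    """Intersection over union, used here as a stand-in similarity measure."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


class VideoTextTracker:
    """Illustrative cooperation of modules 920-950; text detection (module 910)
    is assumed to be provided elsewhere."""

    def generate_particles(self, box: Box, n: int = 100, sigma: float = 5.0) -> List[Box]:
        """Module 920: scatter candidate boxes by randomly perturbing the
        position of the first text box in the second video frame."""
        x1, y1, x2, y2 = box
        particles = []
        for _ in range(n):
            dx, dy = random.gauss(0, sigma), random.gauss(0, sigma)
            particles.append((int(x1 + dx), int(y1 + dy), int(x2 + dx), int(y2 + dy)))
        return particles

    def second_text_boxes(self, particles: List[Box]) -> List[Box]:
        """Module 930: in this sketch each particle directly yields one
        candidate second text box."""
        return particles

    def third_text_box(self, first_box: Box, candidates: List[Box]) -> Box:
        """Module 940: the candidate with the highest first similarity to the
        first text box is taken as the third text box."""
        return max(candidates, key=lambda c: iou(first_box, c))

    def target_track(self, first_box: Box) -> List[Box]:
        """Module 950: the target tracking track records the position of the
        video text in the first and second video frames."""
        candidates = self.second_text_boxes(self.generate_particles(first_box))
        return [first_box, self.third_text_box(first_box, candidates)]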
It can be understood that the content of the video text tracking method embodiment shown in fig. 2 is applicable to the present video text tracking device embodiment, the functions specifically implemented by the present device embodiment are the same as those of the video text tracking method embodiment shown in fig. 2, and the beneficial effects achieved by the present device embodiment are the same as those achieved by the video text tracking method embodiment shown in fig. 2.
Referring to fig. 12, the embodiment of the application also discloses an electronic device, which comprises:
At least one processor 1010;
At least one memory 1020 for storing at least one program;
The at least one program, when executed by the at least one processor 1010, causes the at least one processor 1010 to implement the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9.
It can be understood that the content in the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9 is suitable for the present electronic device embodiment, and the functions specifically implemented by the present electronic device embodiment are the same as those in the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9, and the beneficial effects achieved by the present electronic device embodiment are the same as those achieved by the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9.
The embodiment of the application also discloses a computer readable storage medium, in which a program executable by a processor is stored, which when executed by the processor is used to implement the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9.
It can be understood that the content of the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9 is applicable to the present computer-readable storage medium embodiment, the functions specifically implemented by the present computer-readable storage medium embodiment are the same as those of the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9, and the beneficial effects achieved by the present computer-readable storage medium embodiment are the same as those achieved by the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, wherein the computer instructions are stored in the computer readable storage medium, and the processor of the electronic device shown in fig. 12 can read the computer instructions from the computer readable storage medium, and execute the computer instructions to cause the electronic device to execute the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9.
It will be appreciated that the content of the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9 is applicable to the present computer program product or the computer program embodiment, and the functions implemented by the present computer program product or the computer program embodiment are the same as those of the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9, and the advantages achieved are the same as those achieved by the video text tracking method embodiment shown in fig. 2 or the video processing method embodiment shown in fig. 9.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to limit the scope of the application, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the related art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having appropriate combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
In the description of the present specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the application as defined by the appended claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and the equivalent modifications or substitutions are intended to be included in the scope of the present application as defined in the appended claims.