CN112291614A - Video generation method and device - Google Patents
Video generation method and device
- Publication number
- CN112291614A (application number CN201910677074.1A)
- Authority
- CN
- China
- Prior art keywords
- characters
- video
- event
- group
- keywords
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4398—Processing of audio elementary streams involving reformatting operations of audio signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440218—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the application provides a video generation method and device, which can acquire characters corresponding to an event keyword and pictures corresponding to the characters, then convert the characters corresponding to the event keyword into the voice of a video and convert the pictures corresponding to the characters into video frames of the video, so as to generate the video corresponding to the event keyword. That is, a video related to an event keyword, for example a hot event keyword, does not need to be generated by recording a video; instead, the video corresponding to the event keyword can be generated by using the characters corresponding to the event keyword and the pictures corresponding to the characters. The steps of obtaining the characters corresponding to the event keyword and the pictures corresponding to the characters and generating the video corresponding to the event keyword from them take little time, so compared with recording a video the efficiency of generating the video is higher, and a video related to a hot event can be generated soon after the hot event occurs.
Description
Technical Field
The present application relates to the field of internet, and in particular, to a video generation method and apparatus.
Background
With the development of science and technology, many video websites and applications for playing videos have appeared. In order to attract more users to the video websites and to the applications for playing videos, the video websites and applications can play videos of hot events.
It is understood that if a video of a hot event is to be played, the video of the hot event must first be generated. Most current methods for generating a video of a hot event are to record the video directly.
It is understood that recording a video takes a certain amount of time, and video editing and the like are required after the video is recorded, so a video of a hot event cannot be generated quickly after the hot event occurs, and the video of the hot event cannot be broadcast as soon as possible after the hot event occurs.
Disclosure of Invention
The technical problem to be solved by the application is that the traditional way of generating a video of a hot event cannot generate the video quickly after the hot event occurs, so the video of the hot event cannot be broadcast as soon as possible after the hot event occurs; the application therefore provides a video generation method and device.
In a first aspect, an embodiment of the present application provides a video generation method, where the method includes:
acquiring characters corresponding to the event keywords and pictures corresponding to the characters;
and converting the characters corresponding to the event keywords into voice of a video, and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
Optionally, the method further includes:
obtaining a material according to a preset rule;
extracting candidate keywords from the materials;
and if the search quantity and/or the click quantity corresponding to the candidate keyword meet preset conditions, determining the candidate keyword as the event keyword.
Optionally, the obtaining of the text corresponding to the event keyword and the picture corresponding to the text includes:
acquiring at least one group of characters corresponding to the event keywords and pictures corresponding to each group of characters;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
converting each group of characters in the at least one group of characters into corresponding video voices respectively;
the converting the picture corresponding to the text into the video frame of the video comprises:
determining the playing time of the voice of the video corresponding to each group of characters in the at least one group of characters;
and determining video frames respectively corresponding to the pictures corresponding to each group of characters in the at least one group of characters according to the playing duration.
Optionally, the converting the text corresponding to the event keyword into a voice of a video, and converting the picture corresponding to the text into a video frame of the video to generate the video corresponding to the event keyword includes:
acquiring time information corresponding to each group of characters in the at least one group of characters;
and converting the characters corresponding to the event keywords into video voices and converting the pictures of the characters into video frames of the videos according to the time information corresponding to each group of characters in the at least one group of characters so as to generate the videos corresponding to the event keywords.
Optionally, the at least one group of words includes a plurality of groups of words;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the multiple groups of characters corresponding to the event keywords into video voices according to the logic relation among the multiple groups of characters.
Optionally, the method further includes:
acquiring a picture corresponding to the event keyword;
identifying the picture corresponding to the event keyword to obtain the picture content of the picture corresponding to the event keyword;
determining the association degree between the picture content of the picture corresponding to the event keyword and the characters;
and if the association degree is greater than or equal to a first threshold value, determining the picture corresponding to the event keyword as the picture corresponding to the character.
Optionally, the text corresponding to the event keyword includes multiple groups;
after obtaining a plurality of groups of texts corresponding to the event keywords, the converting the texts corresponding to the event keywords into the voice of the video includes:
determining the association degree of each group of characters in the plurality of groups of characters and the event keywords on the content;
the converting the characters corresponding to the event keywords into the voice of the video comprises the following steps:
and converting the characters of which the corresponding association degree is greater than or equal to a second threshold value in the plurality of groups of characters into the voice of the video.
Optionally, the method further includes:
determining words representing time in the characters;
if the word representing the time is not in the preset format, obtaining the publishing time of the characters, determining a word conforming to the preset format according to the publishing time of the characters, and replacing the word representing the time with the word conforming to the preset format to obtain replaced characters;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the replaced characters into voice of the video.
Optionally, the method further includes:
and converting the characters corresponding to the event keywords into subtitles of the video.
In a second aspect, an embodiment of the present application provides a video generating apparatus, where the apparatus includes:
the system comprises a first acquisition unit, a second acquisition unit and a display unit, wherein the first acquisition unit is used for acquiring characters corresponding to event keywords and pictures corresponding to the characters;
and the generating unit is used for converting the characters corresponding to the event keywords into voice of a video and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
Optionally, the apparatus further comprises:
the second acquisition unit is used for acquiring the material according to a preset rule;
an extracting unit for extracting candidate keywords from the material;
and the first determining unit is used for determining the candidate keyword as the event keyword if the search quantity and/or click quantity corresponding to the candidate keyword meets a preset condition.
Optionally, the first obtaining unit is specifically configured to:
acquiring at least one group of characters corresponding to the event keywords and pictures corresponding to each group of characters;
the generating unit is specifically configured to:
converting each group of characters in the at least one group of characters into corresponding video voices respectively; determining the playing time of the voice of the video corresponding to each group of characters in the at least one group of characters; and determining video frames respectively corresponding to the pictures corresponding to each group of characters in the at least one group of characters according to the playing duration so as to generate a video corresponding to the event keyword.
Optionally, the generating unit is specifically configured to:
acquiring time information corresponding to each group of characters in the at least one group of characters;
and converting the characters corresponding to the event keywords into video voices and converting the pictures of the characters into video frames of the videos according to the time information corresponding to each group of characters in the at least one group of characters so as to generate the videos corresponding to the event keywords.
Optionally, the at least one group of words includes a plurality of groups of words;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the multiple groups of characters corresponding to the event keywords into video voices according to the logic relation among the multiple groups of characters.
Optionally, the apparatus further comprises:
a third acquiring unit, configured to acquire a picture corresponding to the event keyword;
the identification unit is used for identifying the picture corresponding to the event keyword to obtain the picture content of the picture corresponding to the event keyword;
the second determining unit is used for determining the association degree between the picture content of the picture corresponding to the event keyword and the characters;
and the third determining unit is used for determining the picture corresponding to the event keyword as the picture corresponding to the character if the association degree is greater than or equal to a first threshold value.
Optionally, the text corresponding to the event keyword includes multiple groups;
after obtaining a plurality of groups of texts corresponding to the event keywords, the converting the texts corresponding to the event keywords into the voice of the video includes:
determining the association degree of each group of characters in the plurality of groups of characters and the event keywords on the content;
the converting the characters corresponding to the event keywords into the voice of the video comprises the following steps:
and converting the characters of which the corresponding association degree is greater than or equal to a second threshold value in the plurality of groups of characters into the voice of the video.
Optionally, the apparatus further comprises:
a fourth determining unit, configured to determine a word indicating time in the text;
the replacing unit is used for acquiring the publishing time of the characters if the word representing the time is not in a preset format, determining a word conforming to the preset format according to the publishing time of the characters, and replacing the word representing the time with the word conforming to the preset format to obtain replaced characters;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the replaced characters into voice of the video.
Optionally, the apparatus further comprises:
and the conversion unit is used for converting the characters corresponding to the event keywords into the subtitles of the video.
In a third aspect, embodiments of the present application provide a video generation apparatus, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring characters corresponding to the event keywords and pictures corresponding to the characters;
and converting the characters corresponding to the event keywords into voice of a video, and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the electronic device to perform the video generation method of any one of the above first aspects.
Compared with the prior art, the embodiment of the application has the following advantages:
the embodiment of the application provides a video generation method, and particularly, in practical application, after an event, such as a hot event, occurs, pictures and characters related to the hot event often appear on a network. Therefore, in the embodiment of the application, the characters corresponding to the event keywords and the pictures corresponding to the characters can be acquired, then the characters corresponding to the event keywords are converted into the voice of the video, and the pictures corresponding to the characters are converted into the video frames of the video, so that the video corresponding to the event keywords is generated. That is to say, with the solution provided in the embodiment of the present application, it is not necessary to generate a video related to an event keyword, for example, a hot event keyword, by recording a video, but a video corresponding to the event keyword is generated by using a text corresponding to the event keyword and an image corresponding to the text. The time spent in the steps of acquiring the characters corresponding to the event keywords and the pictures corresponding to the characters, generating the videos corresponding to the event keywords according to the characters corresponding to the event keywords and the pictures corresponding to the characters and the like is short, and compared with the video recording, the video generating efficiency is higher. Compared with the prior art, the scheme provided by the embodiment of the application can generate the video related to the hot event soon after the hot event occurs.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a video generation method according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for determining an event keyword according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for determining a picture corresponding to the characters corresponding to an event keyword according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video generating device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Most current methods for generating a video of a hot event are to record the video directly. However, recording a video takes a certain amount of time, and video editing and the like are performed after the video is recorded, so it is impossible to generate a video of a hot event soon after the hot event occurs. In view of this, embodiments of the present application provide a video generation method and apparatus, which can generate a video related to a hot event soon after the hot event occurs.
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Exemplary method
Referring to fig. 1, the figure is a schematic flowchart of a video generation method according to an embodiment of the present application.
The video generation method provided in the embodiment of the present application may be executed by a server, where the server may be a dedicated server for generating a video corresponding to an event keyword, and the server may also be a server further having other data processing functions, and the embodiment of the present application is not particularly limited.
The video generation method provided by the embodiment of the application may include the following steps S101 to S102, for example.
S101: and acquiring characters corresponding to the event keywords and pictures corresponding to the characters.
The event keywords mentioned in the embodiment of the present application may be keywords related to a trending event, and may also be keywords related to other events, for example, an event to be researched, and the embodiment of the present application is not particularly limited.
In this embodiment, the event keyword may include a plurality of characters, and the characters may be, for example, Chinese characters, English characters, Korean characters, or the like. It is to be understood that the event keyword may include a plurality of Chinese characters when the characters are Chinese characters, a plurality of English words when the characters are English characters, and a plurality of Korean characters when the characters are Korean characters.
The specific number of characters included in the event keyword is not specifically limited in the embodiments of the present application, and may be determined according to the event corresponding to the event keyword. It should be noted that, in the embodiment of the present application, the characters corresponding to the event keyword and the pictures corresponding to the characters may be obtained through a network. As an example, the characters corresponding to the event keyword and the pictures corresponding to the characters may be crawled by a web crawler using the event keyword as a search keyword. The crawling scope of the web crawler is not particularly limited in the embodiment of the application; it may include the World Wide Web and may further include content published on corresponding social applications. The social application is not particularly limited in the embodiments of the present application and may be, for example, a microblog, a forum, a community, and the like.
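The following is a minimal sketch of such a crawling step, assuming a hypothetical search URL and using the common requests and beautifulsoup4 packages; it only illustrates the idea of collecting text and picture links for an event keyword and is not the implementation claimed here.

```python
import requests
from bs4 import BeautifulSoup

def crawl_text_and_pictures(event_keyword, search_url="https://example.com/search"):
    """Fetch a page matching the event keyword and collect its text and image URLs.

    `search_url` is a placeholder; a real crawler would target specific news
    sites or social applications, as described above.
    """
    response = requests.get(search_url, params={"q": event_keyword}, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    text = soup.get_text(separator=" ", strip=True)  # characters corresponding to the keyword
    picture_urls = [img.get("src") for img in soup.find_all("img") if img.get("src")]
    return text, picture_urls
```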
S102: and converting the characters corresponding to the event keywords into voice of a video, and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
After the words corresponding to the event keywords and the pictures corresponding to the words are obtained, the videos corresponding to the event keywords can be generated by using the words corresponding to the event keywords and the pictures corresponding to the words. It is understood that video includes both voice and video frames, with pictures included in the video frames. Therefore, in the embodiment of the application, the characters corresponding to the event keywords can be converted into the voice of the video, the pictures corresponding to the characters can be converted into the video frames of the video, and the video frames and the voice are synthesized, so that the video corresponding to the event keywords can be obtained.
The embodiment of the present application does not specifically limit a specific implementation manner of converting the text corresponding to the event keyword into the speech, and as an example, the text corresponding to the event keyword may be converted into the speech by using a corresponding speech generation tool.
The embodiment of the present application does not specifically limit an implementation manner of converting the picture corresponding to the text into the video frame of the video, and as an example, the duration of the picture corresponding to the text appearing in the video frame may be determined according to the play duration of the voice obtained through the conversion, so as to convert the picture corresponding to the text into the video frame of the video. For example, if the playing time of the voice is 5 minutes, it may be determined that the time length of the picture corresponding to the text appearing in the video frame is 5 minutes, and further, the picture corresponding to the text is converted into the video frame of the video.
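As an illustration of the conversion described above, the following sketch assumes the gTTS package for speech synthesis and the moviepy 1.x editor API for building the video frames; both library choices, the file names, and the frame rate are assumptions for illustration only, not the claimed implementation.

```python
from gtts import gTTS
from moviepy.editor import AudioFileClip, ImageClip

def text_and_picture_to_clip(text, picture_path, audio_path="speech.mp3"):
    """Turn one piece of text and its picture into a video clip.

    The picture is shown for exactly as long as the synthesized voice plays,
    mirroring the duration rule described above.
    """
    gTTS(text=text).save(audio_path)                          # characters -> voice of the video
    voice = AudioFileClip(audio_path)
    clip = ImageClip(picture_path, duration=voice.duration)   # picture -> video frames
    return clip.set_audio(voice)

# Example: a 5-minute narration yields a 5-minute still-picture segment.
# clip = text_and_picture_to_clip(event_text, "event_picture.jpg")
# clip.write_videofile("event_video.mp4", fps=24)
```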
As can be seen from the above description, with the video generation method provided in the embodiment of the present application, it is not necessary to generate a video related to an event keyword, such as a hot event keyword, in a manner of recording a video, but a video corresponding to the event keyword is generated by using a text corresponding to the event keyword and an image corresponding to the text. The time spent in the steps of acquiring the characters corresponding to the event keywords and the pictures corresponding to the characters, generating the videos corresponding to the event keywords according to the characters corresponding to the event keywords and the pictures corresponding to the characters and the like is short, and compared with the video recording, the video generating efficiency is higher. Compared with the prior art, the scheme provided by the embodiment of the application can generate the video related to the hot event soon after the hot event occurs.
Moreover, generally speaking, the characters corresponding to the event keyword and the pictures corresponding to the characters acquired in S101 are related in content. Therefore, in a video generated from the characters corresponding to the event keyword and the pictures corresponding to the characters, while a video frame is played, the voice related to the picture in that video frame is played at the same time; the strong correlation between the video frames and the voice can bring a better viewing experience to the user.
As described above, the aforementioned event keywords may be keywords related to a trending event. Considering that, in practical application, the video of a hot event should be broadcast as soon as possible after the hot event occurs, the method provided by the embodiment of the application may further automatically determine the event keyword, so as to generate the video corresponding to the hot event before the event develops into a hot event. In this way, the video associated with the event can be played as soon as the event develops into a trending event.
Referring to fig. 2, the figure is a schematic flowchart of a method for determining an event keyword according to an embodiment of the present application. The method for determining the event keyword provided by the embodiment of the application can be implemented through the following steps S201 to S203, for example.
S201: and acquiring the material according to a preset rule.
S202: and extracting candidate keywords from the materials.
In the embodiment of the present application, it is considered that, in practical applications, many websites and applications provide hot event columns, and the events mentioned in these columns have a high possibility of developing into hot events. In view of this, in an implementation manner of the embodiment of the present application, S201 may, in a specific implementation, obtain the material from the hot event column of a preset website and/or the hot event column of a preset application by using a data mining technology.
After the material is obtained, keywords related to the event in the material can be determined as candidate keywords. The embodiment of the present application does not specifically limit the implementation manner of determining the candidate keywords; as an example, considering that the title of the material is generally closely related to the event, the candidate keywords may be extracted from the title of the material.
S203: and if the search quantity and/or the click quantity corresponding to the candidate keyword meet preset conditions, determining the candidate keyword as the event keyword.
It is contemplated that not all events mentioned in the hot event columns will become hot events. In the embodiment of the present application, after determining the candidate keywords, a candidate keyword corresponding to an event with a higher possibility of becoming a trending event may be further determined from the determined candidate keywords, that is, an event keyword may be further determined from the determined candidate keywords.
It can be understood that, in practical applications, the attention of users to hot events is relatively high. The attention of the user to an event can be reflected by the search volume and the click volume of the user to the keyword corresponding to the event. In view of this, in the embodiment of the present application, the event keyword may be determined from the candidate keywords by the search volume and/or click volume corresponding to the candidate keywords. The search amount of a candidate keyword may be the amount of the candidate keyword searched by the user in the search engine; the click rate of a candidate keyword may be the number of the user clicking on the web page corresponding to the candidate keyword, posting a message corresponding to the candidate keyword on the social networking site, and the like.
In view of this, in the embodiment of the present application, if the search volume and/or the click volume corresponding to a candidate keyword meet a preset condition, the candidate keyword is determined as the event keyword. The search volume and/or click volume meeting the preset condition indicates that the attention of users to the event corresponding to the candidate keyword is high. As an example, the preset condition may be that the search volume and/or the click volume corresponding to the candidate keyword is greater than or equal to a preset threshold; the specific value of the preset threshold may be determined according to the actual situation and is not specifically limited in the embodiment of the present application.
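A minimal sketch of this filtering step is given below; the threshold values and the way the search and click counts are supplied are illustrative assumptions.

```python
def select_event_keywords(candidates, search_counts, click_counts,
                          search_threshold=10000, click_threshold=50000):
    """Keep candidate keywords whose search and/or click volume meets the preset condition.

    The thresholds are illustrative placeholders; in practice they would be
    tuned according to the actual situation, as noted above.
    """
    event_keywords = []
    for keyword in candidates:
        searches = search_counts.get(keyword, 0)
        clicks = click_counts.get(keyword, 0)
        if searches >= search_threshold or clicks >= click_threshold:
            event_keywords.append(keyword)
    return event_keywords
```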
As described above, the text corresponding to the event keyword and the image corresponding to the text may be acquired through a network. It can be understood that when words corresponding to event keywords and pictures corresponding to the words are acquired by using a network, words corresponding to the keywords and pictures corresponding to the words from various channels can be acquired. For example, words corresponding to the keywords and pictures corresponding to the words from various news websites can be acquired; characters corresponding to the keywords and pictures corresponding to the characters issued on each social application program can also be acquired. In view of this, in this embodiment of the application, in the step S101 "obtaining the text corresponding to the event keyword and the picture corresponding to the text" in a specific implementation, the text may be obtained by obtaining at least one group of text corresponding to the event keyword and a picture corresponding to each group of text.
It should be noted that, in this embodiment of the application, the text and the picture corresponding to the event keyword, which are obtained from a certain channel, may be defined as a group of text corresponding to the event keyword and a group of picture corresponding to the text. The embodiment of the present application does not specifically limit the range covered by the channel, and as an example, a webpage may be defined as a channel; as yet another example, a website may be defined as a channel; as yet another example, a social application may be defined as a channel, and so on.
When the obtained text corresponding to the event keyword and the picture corresponding to the text include at least one group of text corresponding to the event keyword and a picture corresponding to each group of text, respectively, in the specific implementation of "converting the text corresponding to the event keyword into the voice of the video" in S102, each group of text in the at least one group of text may be converted into the voice of the corresponding video, respectively. It can be understood that, in practical applications, in the pictures corresponding to the at least one group of words and each group of words, since a group of words and a picture corresponding to the group of words may be obtained from the same channel, for example, the same web page, the content correlation between the group of words and the picture corresponding to the group of words is relatively high. In view of this, in the embodiment of the present application, when the video corresponding to the event keyword is generated, the voice corresponding to a group of characters and the video frame corresponding to the picture corresponding to the group of characters may be played correspondingly, so that the video frame of the video and the voice of the video have a relatively high correlation in content. Specifically, in the foregoing S102, when the step of converting the picture corresponding to the text into the video frame of the video is specifically implemented, the playing time of the voice of the video corresponding to each group of the at least one group of the text may be determined, and then the video frame corresponding to the picture corresponding to each group of the at least one group of the text is determined according to the playing time. For example, the text corresponding to the event keyword and the picture corresponding to the text include 3 groups of texts and pictures corresponding to the 3 groups of texts, specifically, the playing time of the voice corresponding to the first group of texts is a first time, the playing time of the voice corresponding to the second group of texts is a second time, and the playing time of the voice corresponding to the third group of texts is a third time. Determining that the playing time length of the video frame corresponding to the picture corresponding to the first group of characters is a first time length, and further generating a video frame corresponding to the voice corresponding to the first group of characters, wherein the playing time length of the video frame is the first time length; determining the playing time length of the video frame corresponding to the picture corresponding to the second group of characters as a second time length, and further generating the video frame corresponding to the voice corresponding to the second group of characters, wherein the playing time length of the video frame is the second time length; and determining the playing time length of the video frame corresponding to the picture corresponding to the third group of characters as a third time length, and further generating the video frame corresponding to the voice corresponding to the third group of characters, wherein the playing time length of the video frame is the third time length.
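Continuing the earlier sketch (and reusing the hypothetical text_and_picture_to_clip helper defined there, together with the moviepy 1.x editor API), the following shows how each group's picture could be displayed for exactly the playing duration of that group's voice and the per-group segments then joined into one video.

```python
from moviepy.editor import concatenate_videoclips

def build_video_for_groups(groups, output_path="keyword_video.mp4"):
    """Build one video from several (text, picture_path) groups.

    Each group's picture is shown for the playing duration of that group's
    voice, and the per-group segments are then joined in order.
    """
    segments = []
    for index, (text, picture_path) in enumerate(groups):
        segment = text_and_picture_to_clip(text, picture_path,
                                           audio_path=f"speech_{index}.mp3")
        segments.append(segment)
    concatenate_videoclips(segments).write_videofile(output_path, fps=24)
```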
It can be understood that, in practical applications, when the text corresponding to the event keyword and the picture corresponding to the text include at least one group of texts and pictures corresponding to each group of texts, in order to enable the generated voice of the video corresponding to the event keyword to describe the process of the event occurrence according to a certain time sequence when playing, in this embodiment of the present application, time information corresponding to each group of texts in the at least one group of texts may be obtained, and according to the time information corresponding to each group of texts in the at least one group of texts, the text corresponding to the event keyword is converted into voice of the video, and the picture of the text is converted into a video frame of the video, so as to generate the video corresponding to the event keyword.
In the embodiment of the present application, the time information corresponding to each group of characters in the at least one group of characters is obtained, so as to describe a process of the occurrence of the event according to a time sequence. In this embodiment of the application, considering that the publication information of each of the at least one group of texts can indicate the development sequence of the event corresponding to the event keyword to a certain extent, the time information corresponding to each of the at least one group of texts may include the publication time corresponding to the at least one group of texts. In addition, considering that each of the at least one group of texts may be information describing an event corresponding to the event keyword within a certain time period, in another implementation manner of the embodiment of the present application, the time information corresponding to each of the at least one group of texts may also include information describing a time included in each of the at least one group of texts.
In this embodiment of the application, after the time information corresponding to each group of words in the at least one group of words is obtained, the voices corresponding to the at least one group of words may be sorted according to a time sequence, for example, the voices corresponding to the at least one group of words are sorted according to a time described by the time information corresponding to the each group of words from early to late, and a playing sequence of the voices corresponding to the each group of words in the at least one group of words in the video is determined according to the sorting sequence, so that the generated video can describe a process of occurrence of an event according to a development sequence of the event corresponding to the event keyword.
Of course, in the embodiment of the present application, the voices corresponding to the at least one group of characters may also be sorted in other orders, for example, the voices corresponding to the at least one group of characters are sorted in order from late to early according to the time described by the time information corresponding to each group of characters. Further, according to the arrangement sequence, determining the playing sequence of the voice corresponding to each group of characters in the at least one group of characters in the video, so that the generated video can describe the process of the event in a reverse narrative manner.
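A minimal sketch of this ordering step, assuming each group carries a timestamp such as its publication time, is shown below; the data layout is an assumption for illustration.

```python
from datetime import datetime

def order_groups_by_time(groups, reverse=False):
    """Sort (publish_time, text, picture_path) triples so the narration follows
    the event's development; reverse=True gives the flashback-style ordering."""
    return sorted(groups, key=lambda group: group[0], reverse=reverse)

# Example with illustrative publication times:
# groups = [(datetime(2019, 7, 1, 18, 30), text_b, pic_b),
#           (datetime(2019, 7, 1, 9, 0), text_a, pic_a)]
# chronological = order_groups_by_time(groups)
```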
In an implementation manner of the embodiment of the present application, when the characters corresponding to the event keyword include multiple groups, in order to make the voice of the finally generated video follow a more rigorous logical order when played, the multiple groups of characters corresponding to the event keyword may be converted into the voice of the video according to the logical relationship between the multiple groups of characters. The logical relationship between groups of characters mentioned in the embodiments of the present application may include, for example, any one or more of: from cause to effect, from primary to secondary, from whole to part, from general to specific, from phenomenon to essence, and from specific to general.
In an implementation manner of the embodiment of the application, each group of characters in the multiple groups of characters may be analyzed, a connecting word representing a logical relationship in each group of characters in the multiple groups of characters is extracted, then the logical relationship between the multiple groups of characters is determined according to the connecting word representing the logical relationship in each group of characters in the multiple groups of characters, and then the multiple groups of characters corresponding to the event keywords are converted into the voice of the video according to the logical relationship between the multiple groups of characters, so that the voice corresponding to the multiple groups of characters is in the playing sequence in the video, and conforms to the logical relationship between the multiple groups of characters corresponding to the event keywords.
For example, the words corresponding to the event keywords include two groups of words, wherein the logical relationship between the first group of words and the second group of words is a relationship from a cause to a result. Therefore, in the embodiment of the application, the first group of characters can be converted into the first voice and the second group of characters can be converted into the second voice according to the logical relationship between the first group of characters and the second group of characters, and the playing sequence of the first voice and the second voice in the video is that the first voice is played first and then the second voice is played.
In another implementation manner of the embodiment of the present application, a pre-trained logical relationship determination model may be used to determine the logical relationship between the multiple groups of characters. Specifically, the multiple groups of characters may be input into the logical relationship determination model to obtain an output result of the model. It will be appreciated that the output result of the logical relationship determination model is the logical relationship between the groups of characters.
It should be noted that, in this embodiment of the application, the logical relationship determination model may be obtained by training based on training texts and labels carried by the training texts, where the training texts may include multiple groups of texts, and the labels of the training texts are used to represent logical relationships between the multiple groups of texts in the training texts. The logical relationship determination model is not specifically limited in the embodiments of the present application, and as an example, the logical relationship determination model may be a deep learning model, for example, the logical relationship determination model may be a Convolutional Neural Networks (CNN) model; for another example, the logic relationship determination model may be a Recurrent Neural Network (RNN) model; as another example, the logical relationship determination model may be a Deep Neural Network (DNN) model, or the like. And are not described in detail herein.
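The connecting-word approach described above can be sketched as follows; the marker list is an illustrative assumption, and a trained logical relationship determination model (CNN, RNN or DNN) would replace this heuristic in practice.

```python
EFFECT_MARKERS = ("therefore", "as a result", "consequently")  # illustrative connectives

def order_cause_before_effect(group_a, group_b):
    """Order two groups of text so that a cause-describing group precedes its effect."""
    def looks_like_effect(text):
        lowered = text.lower()
        return any(marker in lowered for marker in EFFECT_MARKERS)

    if looks_like_effect(group_a) and not looks_like_effect(group_b):
        return [group_b, group_a]   # play the cause first, then the effect
    return [group_a, group_b]
```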
As described above, the picture corresponding to the characters may be converted into video frames of the video according to the playing duration of the voice converted from the characters corresponding to the event keyword. In an implementation manner of the embodiment of the application, in order to improve the user's viewing experience, a single picture should not appear continuously in too many consecutively played video frames. However, if the playing duration of the voice obtained by converting the characters is greater than or equal to a certain threshold, the number of pictures available within that playing duration is relatively small, that is, one picture may appear in many consecutively played video frames. For example, if the playing duration of the voice obtained from the characters is 120 seconds and the number of pictures corresponding to the characters is 2, then within this playing duration the video frames played in the first 60 seconds may all contain the first picture, and the video frames played in the second 60 seconds may all contain the second picture. For this situation, in this embodiment of the application, pictures corresponding to the characters corresponding to the event keyword may be added through steps S301 to S304 shown in fig. 3. Fig. 3 is a flowchart illustrating a method for determining a picture corresponding to the characters corresponding to an event keyword according to an embodiment of the present application.
S301: and acquiring a picture corresponding to the event keyword.
It should be noted that, in the embodiment of the present application, the picture corresponding to the event keyword may be obtained through a network, for example, the picture corresponding to the event keyword may be searched by using the event keyword as a search keyword, so as to obtain the picture corresponding to the event keyword.
S302: and identifying the picture corresponding to the event keyword to obtain the picture content of the picture corresponding to the event keyword.
It should be noted that the embodiment of the present application is not particularly limited to a specific implementation manner of performing image identification on the image corresponding to the event keyword, and as an example, the image feature of the image corresponding to the event keyword may be extracted, and the image content of the image corresponding to the event keyword is determined according to the extracted image feature.
S303: and determining the association degree between the picture content of the picture corresponding to the event keyword and the characters corresponding to the event keyword.
In the embodiment of the present application, a specific implementation manner of determining the association degree between the picture content of the picture corresponding to the event keyword and the text is not particularly limited, and as an example, a model capable of determining the association degree between the picture content and the text corresponding to the event keyword may be trained in advance, so that the association degree between the picture content and the text corresponding to the event keyword is determined by using the trained model. In the embodiment of the present application, the model may be, for example, a Convolutional Neural Network (CNN) model. In the embodiment of the present application, for example, the CNN model may be trained according to picture content carrying a label and corresponding text, where the label is used to represent a degree of association between the picture content and the text. In order to further improve the accuracy of the trained CNN model for determining the degree of association between the picture content and the text corresponding to the event keyword, when the CNN model is trained, the picture content input as training data may further include, for example, the position of the picture in the web page where the picture is obtained, the size of the picture, the position relationship between the picture and the text in the web page where the picture is obtained, and the like.
S304: and if the association degree is greater than or equal to a first threshold value, determining the picture corresponding to the event keyword as the picture corresponding to the character corresponding to the event keyword.
It should be noted that the association degree is greater than or equal to the first threshold, which indicates that the association degree between the picture corresponding to the event keyword and the text corresponding to the event keyword is relatively high, so that the picture with the association degree greater than or equal to the first threshold in the picture corresponding to the event keyword can be determined as the picture corresponding to the text corresponding to the event keyword. The first threshold is not specifically limited in the embodiment of the application, and the specific value of the first threshold can be determined according to the actual situation.
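A minimal sketch of steps S301 to S304 is shown below; relevance_model stands in for the trained association model described above and is a hypothetical callable, and the first threshold value is illustrative.

```python
def pick_pictures_for_text(candidate_pictures, text, relevance_model, first_threshold=0.7):
    """Keep candidate pictures whose recognized content is sufficiently related to the text."""
    selected = []
    for picture in candidate_pictures:
        association = relevance_model(picture, text)  # S302 + S303: recognize content and score it
        if association >= first_threshold:            # S304: keep sufficiently related pictures
            selected.append(picture)
    return selected
```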
It should be noted that, in practical applications, in addition to adding the image corresponding to the text corresponding to the event keyword through the foregoing steps S301 to S304, other methods may also be used to add the image corresponding to the text corresponding to the event keyword. For example, in a possible implementation manner, an entity in the text corresponding to the event keyword may be identified, then an image related to the identified entity is obtained, and the obtained image related to the entity is determined as the image corresponding to the text corresponding to the event keyword. The embodiments of the present application do not specifically limit the entity, and the entity may include one or more of a name of a person, a name of an object, and the like. The embodiment of the present application does not specifically limit an implementation manner of obtaining the picture related to the identified entity, and as an example, the picture related to the identified entity may be obtained by searching using the identified entity as a search keyword.
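The entity-based alternative can be sketched as follows; extract_entities and search_pictures are hypothetical stand-ins for a named-entity recognizer and a picture search interface.

```python
def pictures_from_entities(text, extract_entities, search_pictures):
    """Add pictures by recognizing entities (person or object names) in the text
    and searching for pictures related to each recognized entity."""
    pictures = []
    for entity in extract_entities(text):
        pictures.extend(search_pictures(entity))
    return pictures
```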
As described above, the characters corresponding to the event keyword may be obtained through a network. In the embodiment of the present application, it is considered that when the obtained characters corresponding to the event keyword include multiple groups, the degrees of association between the groups of characters and the event keyword may differ: some groups may be highly associated with the event keyword, while others are only weakly associated. In order to make the degree of association between the voice of the generated video and the event keyword relatively high, the obtained multiple groups of characters can be further screened, and only the screened characters whose degree of association with the event keyword is relatively high are converted into the voice of the video. Specifically, the degree of association in content between each group of characters in the multiple groups and the event keyword may be determined, and when converting the characters corresponding to the event keyword into the voice of the video, only the groups of characters whose degree of association is greater than or equal to a second threshold are converted into the voice of the video.
It should be noted that, in this embodiment of the application, a distance between each group of texts in the plurality of groups of texts and the event keyword may be respectively calculated, and a degree of association between each group of texts in the plurality of groups of texts and the event keyword is determined according to the distance. In general, the greater the distance between a group of words and the event keyword, the smaller the degree of association between the group of words and the event keyword. The embodiment of the present application does not specifically limit a specific implementation manner of calculating a distance between a group of characters and the event keyword, and as an example, may determine a word embedding vector corresponding to the event keyword, determine a word embedding vector corresponding to the group of characters, and calculate a distance between the word embedding vector corresponding to the event keyword and the word embedding vector corresponding to the group of characters, to obtain a distance between the group of characters and the event keyword.
In the embodiment of the present application, a degree of association between a group of text and the event keyword that is greater than or equal to the second threshold indicates that the group is closely associated with the event keyword. The specific value of the second threshold is not limited and can be determined according to the actual situation.
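As a rough illustration of the screening described above, the sketch below averages word-embedding vectors for the event keyword and for each group of text, takes cosine similarity as the degree of association, and keeps only the groups that reach the second threshold. The embedding table, the whitespace tokenization and the threshold value are all simplifying assumptions.

```python
import numpy as np


def text_vector(text, embeddings, dim=300):
    """Average the word-embedding vectors of the tokens in `text`.
    `embeddings` maps token -> np.ndarray; unknown tokens are skipped.
    Whitespace tokenization is a simplification (Chinese text would need a segmenter)."""
    vectors = [embeddings[tok] for tok in text.split() if tok in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


def association_degree(keyword, text, embeddings):
    """Cosine similarity between the two vectors; a larger embedding distance
    corresponds to a smaller degree of association."""
    a, b = text_vector(keyword, embeddings), text_vector(text, embeddings)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def filter_text_groups(keyword, text_groups, embeddings, second_threshold=0.5):
    """Keep only the groups whose degree of association reaches the second threshold."""
    return [t for t in text_groups
            if association_degree(keyword, t, embeddings) >= second_threshold]
```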
As described above, the picture corresponding to the text corresponding to the event keyword may be obtained through a network. In practice, the obtained pictures may also include some that are unrelated to the text, for example advertisement pictures. In view of this, after a picture is acquired in step S101, the degree of association between its picture content and the text corresponding to the event keyword may be further determined, and only if that degree of association is relatively high is the picture determined as the picture corresponding to the text. For a specific implementation of determining this degree of association, reference may be made to the description of S303, which is not repeated here.
It is understood that, in practical applications, the text corresponding to the event keyword may include words representing time, and some of those words may be expressed in formats other than a preset format.
The present embodiment does not specifically limit the preset format. It may, for example, be an absolute time format expressed in terms of year, month, day, hour and minute. The other formats are not specifically limited either; they may, for example, be relative time formats such as "this morning", "3 p.m. yesterday" or "15:30 today".
It can be understood that if the text corresponding to the event keyword contains time words that are not in the preset format, errors may arise when the video is played, because the publication time of the video generated with the scheme of the embodiment of the present application may differ from the publication time of the text. Converting the text into the voice of the video without handling time words that are not in a preset format, such as an absolute time format, may therefore introduce errors. Accordingly, if a time word is not in the preset format, the publication time of the text corresponding to the event keyword can be obtained, the word in the preset format corresponding to that time word can be determined from the publication time, and the time word can be replaced with it to obtain replaced text. When the text corresponding to the event keyword is converted into the voice of the video, the replaced text is converted instead, avoiding the errors that time words not in the preset format would otherwise cause.
For example, if the text corresponding to the event keyword includes "11 am today", which is not a time expression in the preset format, the publication time of the text may be obtained. If that publication time is June 3, 2019, the time expression "11 am today" may be converted into "11 a.m. on June 3, 2019", which then replaces "11 am today" in the text; the replaced text is subsequently converted into the voice of the video, as sketched below.
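The replacement just described can be sketched as follows; the two regular-expression patterns cover only the example phrasings above and are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta


def normalize_time_words(text, publication_time):
    """Replace relative time words such as '11 am today' with an absolute expression
    derived from the publication time of the text. Only two illustrative patterns
    are handled here; a real system would need far broader coverage."""
    def day_of(offset):
        return (publication_time + timedelta(days=offset)).strftime("%B %d, %Y")

    replacements = [
        (r"\b(\d{1,2})\s*am today\b", lambda m: f"{m.group(1)} a.m. on {day_of(0)}"),
        (r"\b(\d{1,2})\s*pm yesterday\b", lambda m: f"{m.group(1)} p.m. on {day_of(-1)}"),
    ]
    for pattern, repl in replacements:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text


# With a publication time of June 3, 2019, "11 am today" becomes "11 a.m. on June 03, 2019".
print(normalize_time_words("The press conference starts at 11 am today.",
                           datetime(2019, 6, 3)))
```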
It is understood that, in practical applications, it may be inconvenient for a user to listen to speech while watching a video. For example, when a user is in a car, the ambient noise may be too loud for the voice of the video to be heard clearly. To let users watch videos normally in such scenarios, in one implementation of the embodiment of the present application, the text corresponding to the event keyword can also be converted into subtitles of the video, and the subtitles can be displayed synchronously during playback, so that the user can learn the specific content of the video frames from the subtitles.
It should be noted that, in the embodiment of the present application, the content of the subtitles may correspond exactly to the voice of the video, that is, subtitles and voice are played synchronously. Alternatively, the subtitles need not correspond exactly to the voice, as long as they are determined from the text corresponding to the event keyword. If the subtitles do not correspond exactly to the voice, a user who cannot conveniently listen can still learn the content of the video frames from the subtitles, while a user who can listen learns the content from both the voice and the subtitles; subtitles that go beyond the voice can even convey additional content of the video frames.
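One way to realize the subtitle variant, sketched under the assumption that the playback duration of each group's voice is already known (for example, from the text-to-speech step), is to lay the groups of text out back to back as cues in a standard SRT file:

```python
def to_srt_timestamp(seconds):
    """Format a time offset in seconds as an SRT timestamp, e.g. 00:01:02,500."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(text_groups, durations):
    """Turn each group of text into one subtitle cue, placed back to back so the
    cues stay aligned with the voice generated from the same text."""
    lines, start = [], 0.0
    for i, (text, duration) in enumerate(zip(text_groups, durations), start=1):
        end = start + duration
        lines += [str(i), f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}", text, ""]
        start = end
    return "\n".join(lines)
```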
Exemplary device
Based on the methods provided by the above embodiments, the embodiments of the present application further provide a video generating apparatus, which is described below with reference to the accompanying drawings.
Referring to fig. 4, this figure is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application.
The video generating apparatus 400 illustrated in fig. 4 may specifically include: a first obtaining unit 401 and a generating unit 402.
A first obtaining unit 401, configured to obtain text corresponding to an event keyword and a picture corresponding to the text;
a generating unit 402, configured to convert the text corresponding to the event keyword into the voice of a video, and convert the picture corresponding to the text into video frames of the video, so as to generate a video corresponding to the event keyword.
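As a concrete illustration of what these two units produce together, the sketch below synthesizes the voice, turns each picture into a timed clip and muxes the two into one file. It assumes moviepy 1.x and a placeholder `synthesize_to_file` text-to-speech helper, neither of which is specified by the embodiments.

```python
# Assumed stack: moviepy 1.x for muxing; `synthesize_to_file` stands in for any TTS engine.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips


def generate_video(text, picture_paths, synthesize_to_file, out_path="event.mp4", fps=25):
    """Convert `text` into the voice of the video and the pictures into its frames."""
    audio_path = synthesize_to_file(text)           # e.g. returns "voice.mp3"
    voice = AudioFileClip(audio_path)
    # Show the pictures one after another, splitting the voice duration evenly.
    per_picture = voice.duration / max(1, len(picture_paths))
    clips = [ImageClip(p, duration=per_picture) for p in picture_paths]
    video = concatenate_videoclips(clips, method="compose").set_audio(voice)
    video.write_videofile(out_path, fps=fps)
    return out_path
```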
Optionally, the apparatus further comprises:
the second acquisition unit is used for acquiring the material according to a preset rule;
an extracting unit for extracting candidate keywords from the material;
and the first determining unit is used for determining the candidate keyword as the event keyword if the search quantity and/or click quantity corresponding to the candidate keyword meets a preset condition.
Optionally, the first obtaining unit 401 is specifically configured to:
acquiring at least one group of characters corresponding to the event keywords and pictures corresponding to each group of characters;
the generating unit 402 is specifically configured to:
converting each group of characters in the at least one group of characters into a corresponding voice of the video; determining the playback duration of the voice of the video corresponding to each group of characters; and determining, according to the playback durations, the video frames corresponding to the pictures of each group of characters, so as to generate the video corresponding to the event keyword.
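A rough sketch of this duration-driven mapping is shown below. The `synthesize` call is a placeholder for the text-to-speech step, the frame rate is an assumed value, and each group is assumed to come with at least one picture.

```python
def build_video_plan(text_groups, pictures_per_group, synthesize, fps=25):
    """For each group of text, synthesize its voice, measure the playback duration,
    and spread that group's pictures evenly over the corresponding video frames.

    `synthesize(text)` is a placeholder expected to return (audio, duration_seconds).
    Returns, per group, the audio together with a (picture, frame_count) schedule
    that exactly covers the playback duration. Each group must have >= 1 picture.
    """
    plan = []
    for text, pictures in zip(text_groups, pictures_per_group):
        audio, duration = synthesize(text)
        total_frames = max(len(pictures), round(duration * fps))
        base, extra = divmod(total_frames, len(pictures))
        # The first `extra` pictures get one additional frame so the counts sum exactly.
        schedule = [(pic, base + (1 if i < extra else 0)) for i, pic in enumerate(pictures)]
        plan.append((audio, schedule))
    return plan
```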
Optionally, the generating unit 402 is specifically configured to:
acquiring time information corresponding to each group of characters in the at least one group of characters;
and converting, according to the time information corresponding to each group of characters in the at least one group of characters, the characters corresponding to the event keyword into the voice of the video and the pictures corresponding to the characters into video frames of the video, so as to generate the video corresponding to the event keyword.
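For instance, before synthesis the groups can simply be ordered by the time information attached to them, as in the small sketch below; the timestamp representation is an assumption.

```python
def order_groups_by_time(text_groups, pictures_per_group, timestamps):
    """Sort the groups of text (and their pictures) by the time information attached
    to each group, so the generated voice and frames follow the event's timeline."""
    order = sorted(range(len(text_groups)), key=lambda i: timestamps[i])
    return ([text_groups[i] for i in order],
            [pictures_per_group[i] for i in order])
```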
Optionally, the at least one group of words includes a plurality of groups of words;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the multiple groups of characters corresponding to the event keywords into the voice of the video according to the logical relation among the multiple groups of characters.
Optionally, the apparatus further comprises:
a third acquiring unit, configured to acquire a picture corresponding to the event keyword;
the identification unit is used for identifying the picture corresponding to the event keyword to obtain the picture content of the picture corresponding to the event keyword;
the second determining unit is used for determining the association degree between the picture content of the picture corresponding to the event keyword and the characters;
and the third determining unit is used for determining the picture corresponding to the event keyword as the picture corresponding to the character if the association degree is greater than or equal to a first threshold value.
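The picture-content screening performed by these units can be approximated with an off-the-shelf image-text matching model. The sketch below uses CLIP through the `transformers` library purely as an illustrative substitute for the recognition model described in the embodiments; the model name and the first-threshold value are assumptions, and long text would need to be shortened to fit CLIP's token limit.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def keep_relevant_pictures(picture_paths, text, first_threshold=0.25):
    """Return the pictures whose content is sufficiently associated with `text`."""
    kept = []
    for path in picture_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Cosine similarity between the image and text embeddings, in [-1, 1].
        score = torch.nn.functional.cosine_similarity(
            outputs.image_embeds, outputs.text_embeds).item()
        if score >= first_threshold:
            kept.append(path)
    return kept
```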
Optionally, the characters corresponding to the event keyword include a plurality of groups;
after the plurality of groups of characters corresponding to the event keyword are obtained, the association degree in content between each group of characters in the plurality of groups of characters and the event keyword is determined;
and the converting the characters corresponding to the event keyword into the voice of the video comprises:
converting the characters, of the plurality of groups of characters, whose corresponding association degree is greater than or equal to a second threshold into the voice of the video.
Optionally, the apparatus further comprises:
a fourth determining unit, configured to determine a word indicating time in the text;
the replacing unit is used for acquiring the publication time of the characters if the word representing time is not in a preset format, determining a word conforming to the preset format according to the publication time of the characters, and replacing the word representing time with the word conforming to the preset format to obtain the replaced characters;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the replaced characters into voice of the video.
Optionally, the apparatus further comprises:
and the conversion unit is used for converting the characters corresponding to the event keywords into the subtitles of the video.
Since the apparatus 400 is an apparatus corresponding to the method provided in the above method embodiment, and the specific implementation of each unit of the apparatus 400 is the same as that of the above method embodiment, for the specific implementation of each unit of the apparatus 400, reference may be made to the description part of the above method embodiment, and details are not repeated here.
As can be seen from the above description, with the video generating apparatus provided in the embodiment of the present application, a video related to an event keyword, such as a hot-event keyword, does not need to be generated by recording; instead, the text corresponding to the event keyword and the pictures corresponding to the text are used to generate the video corresponding to the event keyword. Acquiring the text and the pictures and generating the video from them takes little time, so video generation is more efficient than recording. Compared with the prior art, the scheme provided by the embodiment of the present application can therefore generate a video related to a hot event soon after the event occurs.
Fig. 5 is a block diagram illustrating a video generation apparatus 500 according to an example embodiment. For example, the apparatus 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 500 may include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output (I/O) interface 512, sensor component 514, and communication component 516.
The processing component 502 generally controls overall operation of the device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operation at the device 500. Examples of such data include instructions for any application or method operating on device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 506 provides power to the various components of the device 500. The power components 506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 500.
The multimedia component 508 includes a screen that provides an output interface between the device 500 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, audio component 510 includes a Microphone (MIC) configured to receive external audio signals when apparatus 500 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing status assessments of various aspects of the apparatus 500. For example, the sensor assembly 514 may detect an open/closed state of the apparatus 500 and the relative positioning of components such as its display and keypad; it may also detect a change in position of the apparatus 500 or of one of its components, the presence or absence of user contact with the apparatus 500, the orientation or acceleration/deceleration of the apparatus 500, and a change in its temperature. The sensor assembly 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communication between the apparatus 500 and other devices in a wired or wireless manner. The apparatus 500 may access a wireless network based on a communication standard, such as WiFi, 2G or 5G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the apparatus 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a schematic structural diagram of a video generating device in an embodiment of the present invention. The video generating device 600 may vary considerably in configuration or performance and may include one or more Central Processing Units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. The memory 632 and the storage media 630 may provide transient or persistent storage. A program stored in a storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the video generating device. Further, the central processing unit 622 may be configured to communicate with the storage medium 630 and to execute, on the video generating device 600, the series of instruction operations in the storage medium 630.
The video generating device 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, one or more keyboards 656, and/or one or more operating systems 661, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Embodiments of the present application also provide a non-transitory computer-readable storage medium, where instructions, when executed by a processor of a video generation device, enable the video generation device to perform a video generation method, the method including:
acquiring characters corresponding to the event keywords and pictures corresponding to the characters;
and converting the characters corresponding to the event keywords into voice of a video, and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. A method of video generation, the method comprising:
acquiring characters corresponding to the event keywords and pictures corresponding to the characters;
and converting the characters corresponding to the event keywords into voice of a video, and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
2. The method of claim 1, further comprising:
obtaining a material according to a preset rule;
extracting candidate keywords from the materials;
and if the search quantity and/or the click quantity corresponding to the candidate keyword meet preset conditions, determining the candidate keyword as the event keyword.
3. The method according to claim 1, wherein the obtaining of the text corresponding to the event keyword and the picture corresponding to the text comprises:
acquiring at least one group of characters corresponding to the event keywords and pictures corresponding to each group of characters;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
converting each group of characters in the at least one group of characters into a corresponding voice of the video;
the converting the picture corresponding to the text into the video frame of the video comprises:
determining the playback duration of the voice of the video corresponding to each group of characters in the at least one group of characters;
and determining, according to the playback durations, video frames respectively corresponding to the pictures corresponding to each group of characters in the at least one group of characters.
4. The method of claim 3, wherein converting the text corresponding to the event keyword into a voice of a video, and converting the picture corresponding to the text into a video frame of the video, so as to generate the video corresponding to the event keyword comprises:
acquiring time information corresponding to each group of characters in the at least one group of characters;
and converting, according to the time information corresponding to each group of characters in the at least one group of characters, the characters corresponding to the event keywords into the voice of the video and the pictures of the characters into video frames of the video, so as to generate the video corresponding to the event keywords.
5. The method of claim 3, wherein the at least one set of words comprises a plurality of sets of words;
the converting the characters corresponding to the event keywords into the voice of the video comprises:
and converting the multiple groups of characters corresponding to the event keywords into the voice of the video according to the logical relation among the multiple groups of characters.
6. The method of claim 1, further comprising:
acquiring a picture corresponding to the event keyword;
identifying the picture corresponding to the event keyword to obtain the picture content of the picture corresponding to the event keyword;
determining the association degree between the picture content of the picture corresponding to the event keyword and the characters;
and if the association degree is greater than or equal to a first threshold value, determining the picture corresponding to the event keyword as the picture corresponding to the character.
7. The method according to claim 1, wherein the characters corresponding to the event keywords comprise a plurality of groups;
after the plurality of groups of characters corresponding to the event keywords are obtained, the method further comprises: determining the association degree in content between each group of characters in the plurality of groups of characters and the event keywords;
and the converting the characters corresponding to the event keywords into the voice of the video comprises:
converting the characters, of the plurality of groups of characters, whose corresponding association degree is greater than or equal to a second threshold into the voice of the video.
8. A video generation apparatus, characterized in that the apparatus comprises:
the system comprises a first acquisition unit, a second acquisition unit and a display unit, wherein the first acquisition unit is used for acquiring characters corresponding to event keywords and pictures corresponding to the characters;
and the generating unit is used for converting the characters corresponding to the event keywords into voice of a video and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
9. A video generation apparatus, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring characters corresponding to the event keywords and pictures corresponding to the characters;
and converting the characters corresponding to the event keywords into voice of a video, and converting the pictures corresponding to the characters into video frames of the video so as to generate the video corresponding to the event keywords.
10. A non-transitory computer readable storage medium, instructions in which, when executed by a processor of an electronic device, enable the electronic device to perform the video generation method of any of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910677074.1A CN112291614A (en) | 2019-07-25 | 2019-07-25 | Video generation method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910677074.1A CN112291614A (en) | 2019-07-25 | 2019-07-25 | Video generation method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112291614A true CN112291614A (en) | 2021-01-29 |
Family
ID=74418846
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910677074.1A Pending CN112291614A (en) | 2019-07-25 | 2019-07-25 | Video generation method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112291614A (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113301389A (en) * | 2021-05-19 | 2021-08-24 | 北京沃东天骏信息技术有限公司 | Comment processing method and device for generating video |
| CN113423010A (en) * | 2021-06-22 | 2021-09-21 | 深圳市大头兄弟科技有限公司 | Video conversion method, device and equipment based on document and storage medium |
| CN113873290A (en) * | 2021-09-14 | 2021-12-31 | 联想(北京)有限公司 | Video processing method and device and electronic equipment |
| CN115277650A (en) * | 2022-07-13 | 2022-11-01 | 深圳乐播科技有限公司 | Screen projection display control method, electronic equipment and related device |
| CN118828105A (en) * | 2023-04-19 | 2024-10-22 | 北京字跳网络技术有限公司 | Video generation method, device, equipment, storage medium and program product |
| US12148451B2 (en) | 2023-04-19 | 2024-11-19 | Beijing Zitiao Network Technology Co., Ltd. | Method, apparatus, device, storage medium and program product for video generating |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101834731A (en) * | 2009-03-10 | 2010-09-15 | 华硕电脑股份有限公司 | Method for correcting relative time of information context |
| US20120177345A1 (en) * | 2011-01-09 | 2012-07-12 | Matthew Joe Trainer | Automated Video Creation Techniques |
| US20170169853A1 (en) * | 2015-12-09 | 2017-06-15 | Verizon Patent And Licensing Inc. | Automatic Media Summary Creation Systems and Methods |
| CN107832382A (en) * | 2017-10-30 | 2018-03-23 | 百度在线网络技术(北京)有限公司 | Method, apparatus, equipment and storage medium based on word generation video |
| CN108228612A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | A kind of method and device for extracting network event keyword and mood tendency |
| CN108965737A (en) * | 2017-05-22 | 2018-12-07 | 腾讯科技(深圳)有限公司 | media data processing method, device and storage medium |
| CN109344291A (en) * | 2018-09-03 | 2019-02-15 | 腾讯科技(武汉)有限公司 | A kind of video generation method and device |
| CN109584648A (en) * | 2018-11-08 | 2019-04-05 | 北京葡萄智学科技有限公司 | Data creation method and device |
- 2019-07-25: CN CN201910677074.1A patent/CN112291614A/en active Pending
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101834731A (en) * | 2009-03-10 | 2010-09-15 | 华硕电脑股份有限公司 | Method for correcting relative time of information context |
| US20120177345A1 (en) * | 2011-01-09 | 2012-07-12 | Matthew Joe Trainer | Automated Video Creation Techniques |
| US20170169853A1 (en) * | 2015-12-09 | 2017-06-15 | Verizon Patent And Licensing Inc. | Automatic Media Summary Creation Systems and Methods |
| CN108228612A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | A kind of method and device for extracting network event keyword and mood tendency |
| CN108965737A (en) * | 2017-05-22 | 2018-12-07 | 腾讯科技(深圳)有限公司 | media data processing method, device and storage medium |
| CN107832382A (en) * | 2017-10-30 | 2018-03-23 | 百度在线网络技术(北京)有限公司 | Method, apparatus, equipment and storage medium based on word generation video |
| CN109344291A (en) * | 2018-09-03 | 2019-02-15 | 腾讯科技(武汉)有限公司 | A kind of video generation method and device |
| CN109584648A (en) * | 2018-11-08 | 2019-04-05 | 北京葡萄智学科技有限公司 | Data creation method and device |
Non-Patent Citations (1)
| Title |
|---|
| 张文宇, 李栋 (Zhang Wenyu, Li Dong): 《物联网智能技术》 [Intelligent Technology of the Internet of Things], 30 April 2012 * |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113301389A (en) * | 2021-05-19 | 2021-08-24 | 北京沃东天骏信息技术有限公司 | Comment processing method and device for generating video |
| CN113301389B (en) * | 2021-05-19 | 2023-04-07 | 北京沃东天骏信息技术有限公司 | Comment processing method and device for generating video |
| CN113423010A (en) * | 2021-06-22 | 2021-09-21 | 深圳市大头兄弟科技有限公司 | Video conversion method, device and equipment based on document and storage medium |
| CN113873290A (en) * | 2021-09-14 | 2021-12-31 | 联想(北京)有限公司 | Video processing method and device and electronic equipment |
| CN115277650A (en) * | 2022-07-13 | 2022-11-01 | 深圳乐播科技有限公司 | Screen projection display control method, electronic equipment and related device |
| CN115277650B (en) * | 2022-07-13 | 2024-01-09 | 深圳乐播科技有限公司 | Screen-throwing display control method, electronic equipment and related device |
| CN118828105A (en) * | 2023-04-19 | 2024-10-22 | 北京字跳网络技术有限公司 | Video generation method, device, equipment, storage medium and program product |
| WO2024217011A1 (en) * | 2023-04-19 | 2024-10-24 | 北京字跳网络技术有限公司 | Video generation method and apparatus, device, storage medium, and program product |
| US12148451B2 (en) | 2023-04-19 | 2024-11-19 | Beijing Zitiao Network Technology Co., Ltd. | Method, apparatus, device, storage medium and program product for video generating |
| CN118828105B (en) * | 2023-04-19 | 2025-09-30 | 北京字跳网络技术有限公司 | Video generation method, device, equipment, storage medium and program product |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111970577B (en) | Subtitle editing method and device and electronic equipment | |
| CN112291614A (en) | Video generation method and device | |
| CN108847214B (en) | Voice processing method, client, device, terminal, server and storage medium | |
| RU2640632C2 (en) | Method and device for delivery of information | |
| CN108227950B (en) | Input method and device | |
| CN107527619B (en) | Method and device for positioning voice control service | |
| CN110147467A (en) | A kind of generation method, device, mobile terminal and the storage medium of text description | |
| CN113705210B (en) | A method and device for generating article outline and a device for generating article outline | |
| CN110874145A (en) | Input method and device and electronic equipment | |
| CN107515870B (en) | Searching method and device and searching device | |
| CN113901241B (en) | Page display method and device, electronic equipment and storage medium | |
| CN107967271A (en) | A kind of information search method and device | |
| CN110929176A (en) | Information recommendation method and device and electronic equipment | |
| CN112784142A (en) | Information recommendation method and device | |
| CN110020106B (en) | Recommendation method, recommendation device and device for recommendation | |
| CN113343028B (en) | Method and device for training intention determination model | |
| CN107515869B (en) | Searching method and device and searching device | |
| CN109753205B (en) | Display method and device | |
| CN113239183A (en) | Training method and device of ranking model, electronic equipment and storage medium | |
| CN110929122A (en) | Data processing method and device and data processing device | |
| CN110633391A (en) | Information searching method and device | |
| CN111629270A (en) | Candidate item determination method and device and machine-readable medium | |
| CN111984767A (en) | Information recommendation method and device and electronic equipment | |
| CN112004033B (en) | Video cover determining method and device and storage medium | |
| CN113259754B (en) | Video generation method, device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210129 |