WO2018130173A1

WO2018130173A1 - Dubbing method, terminal device, server and storage medium

Info

Publication number: WO2018130173A1
Application number: PCT/CN2018/072201
Authority: WO
Inventors: 李钟伟
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2017-01-16
Filing date: 2018-01-11
Publication date: 2018-07-19
Also published as: CN107071512A; CN107071512B

Abstract

Disclosed are a dubbing method, apparatus and system. The dubbing method comprises: in response to a user instruction, playing back a video; acquiring a video start point and a video termination point in the video, said points being selected by the user; generating, according to the video start point and the video termination point, video information of the video to be dubbed; and sending the video information to a server so that the server generates, according to the video information, a video to be dubbed.

Description

Dubbing method, terminal device, server and storage medium

The present application claims priority to Chinese Patent Application, filed Jan.

Technical field

The present application relates to the field of video processing, and in particular, to a dubbing method, a terminal device, a server, and a storage medium.

Background

At present, some dubbing software provides a user dubbing function: it receives audio data submitted by the user for a to-be-recorded video selected by the user, and thereby generates a video dubbed by the user. The to-be-recorded video is generally provided by the dubbing software operator, and the user selects one of interest. Some dubbing software also allows users to upload self-shot video files as the to-be-recorded video.

Technical content

The embodiment of the present application proposes a dubbing method, device and system.

A dubbing method of the embodiment of the present application can be applied to a terminal device, where the method includes:

Playing a video in response to a user instruction;

Obtaining a video starting point and a video ending point selected by the user in the video;

Generating video information of the to-be-recorded video according to the video starting point and the video ending point;

Sending the video information to a server, so that the server generates a to-be-recorded video according to the video information.

A dubbing method applied to a server, where the method includes:

Obtaining video information of the to-be-recorded video from the terminal device, where the video information is generated by the terminal device according to a video starting point and a video termination point selected by the user in the played video;

Generating a to-be-recorded video according to the video information.

A terminal device, comprising a processor and a memory, wherein the memory stores computer readable instructions that cause the processor to:

Playing a video in response to a user instruction;

Sending the video information to a server, so that the server obtains a to-be-recorded video according to the video information.

A server comprising: a processor and a memory, the memory storing computer readable instructions, the instructions causing the processor to:

And generating a to-be-recorded video according to the video information.

The embodiment of the present application further provides a non-transitory computer readable storage medium storing computer readable instructions, which can cause at least one processor to perform the method as described above.

The technical solution of the embodiment of the present application can intercept the video content specified by the user from the video that the terminal device plays according to the user instruction, generate the to-be-dubbed video, enrich the material sources of the dubbing system, and improve the service capability of the dubbing system.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without any inventive effort.

FIG. 1 is a schematic diagram of an implementation environment provided by an example of the present application;

FIG. 2 is a schematic diagram of a server cluster architecture provided by an embodiment of the present application;

FIG. 3 is a flowchart of a dubbing method provided by an embodiment of the present application;

FIG. 4A is a flowchart of a method for a first client to obtain a to-be-recorded video according to an embodiment of the present application;

FIG. 4B is a flowchart of a method for obtaining a to-be-recorded video according to an embodiment of the present application;

FIG. 5 illustrates a video editing method provided by an embodiment of the present application;

FIG. 6 is a schematic flowchart of editing a video according to an embodiment of the present application;

FIG. 7A is a flowchart of a dubbing method provided by an embodiment of the present application;

FIG. 7B is a flowchart of a dubbing method provided by an embodiment of the present application;

FIG. 8 is a flowchart of a method for generating a target video according to an embodiment of the present application;

FIG. 9 is a flowchart of a method for acquiring a caption provided by an embodiment of the present application;

FIG. 10 is a flowchart of a method for voice recognition provided by an embodiment of the present application;

FIG. 11 is a block diagram of a dubbing device provided by an embodiment of the present application;

FIG. 12 is a block diagram of a target video generating module according to an embodiment of the present application;

FIG. 13 is a block diagram of an identity generation module provided by an embodiment of the present application;

FIG. 14 is a structural block diagram of a terminal according to an embodiment of the present application;

FIG. 15 is a structural block diagram of a server provided by an embodiment of the present application.

Implementation

The embodiments described herein are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative efforts are within the scope of the present application.

Please refer to FIG. 1, which shows a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment includes a first terminal 120, a server 140, and a second terminal 160.

The first terminal 120 runs a first client. The first terminal 120 may be a mobile phone, a tablet computer, an OTT device, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, or the like. An OTT device connects a television to the Internet so that the television can play content obtained from the Internet. OTT devices may include smart televisions, set-top boxes, network television boxes, and the like. A network television box, also called a smart TV box, obtains network program data from the Internet and provides it to the television set.

The server 140 can be a single server, a server cluster consisting of several servers, or a cloud computing service center.

A second client is running in the second terminal 160. The second terminal 160 can be a cell phone, a tablet, a laptop portable computer, a desktop computer, and the like.

The server 140 can establish a communication connection with the first terminal 120 and the second terminal 160 through the communication network, respectively. The network can be either a wireless network or a wired network.

In the embodiment of the present application, the first client may be any client that has a user interface (UI) and is capable of communicating with the server 140. For example, the first client may be a video service client, a cable client, a game client, a browser, a client dedicated to video dubbing, and the like.

In the embodiment of the present application, the second client may be any client that has a user interface (UI) and is capable of communicating with the server 140. For example, the second client may be a video editing client, a social application client, an instant messaging client, a payment application client, a client dedicated to video dubbing, and the like.

In practical applications, the first client and the second client may be two clients with different functions or two clients with the same function. Correspondingly, the first terminal and the second terminal are both terminal devices. When the client running in a terminal device implements the first-client side of the method in the embodiments, the terminal device acts as the first terminal; when the client running in the terminal device implements the second-client side of the method, the terminal device acts as the second terminal. The same client can serve as either the first client or the second client, and the same terminal can serve as either the first terminal or the second terminal.

In one example, as shown in FIG. 2, when the background server 140 is a cluster architecture, the background server 140 may include a communication server 142, a management server 144, and a video server 146.

The communication server 142 is for providing communication services with the first client and the second client, and for providing communication services with the management server 144 and the video server 146.

The management server 144 is used to provide functions for managing video files as well as audio files.

Video server 146 is used to provide editing and dubbing functions for the video.

A communication connection can be established between the above various servers through a communication network. The network can be either a wireless network or a wired network.

Please refer to FIG. 3, which shows a flowchart of a dubbing method provided by an embodiment of the present application. This method can be applied to the implementation environment shown in FIG. The method can include the following steps.

Step 301: The first client obtains a to-be-recorded video in response to a user instruction.

If the first client runs on a terminal device with a remote controller, such as a smart TV or a set-top box, the user instruction can be triggered by pressing or long-pressing a designated button on the remote controller, or by clicking or double-clicking a specified icon through the remote controller. If the first client runs on a terminal device with buttons and a screen, such as a television, a desktop computer, or a portable computer, the user instruction can be triggered by pressing or long-pressing a designated button, or by clicking or double-clicking a specified icon. If the first client runs on a mobile phone or tablet, the user instruction can also be triggered by clicking, double-clicking, moving, dragging, and the like. In response to the user instruction, the first client enters a dubbing mode. Please refer to FIG. 4A, which shows a flowchart of a method for the first client to obtain a video to be dubbed in the dubbing mode.

Step 3011A: Obtain a video identifier selected by a user.

Step 3012A: Obtain a video starting point and a video ending point selected by the user;

Step 3013A: In the video file corresponding to the video identifier, copy the video content between the video starting point and the video ending point to obtain a to-be-recorded video.

In some examples, the to-be-recorded video may also be obtained by the method shown in FIG. 4B. The method can include the following steps.

Step 3011B: Play a video in response to a user instruction.

Step 3012B: Acquire a video starting point and a video termination point selected by the user in the video.

Step 3013B: Generate video information of the to-be-recorded video according to the video starting point and the video ending point;

Step 3014B: Send the video information to a server, so that the server generates a to-be-recorded video according to the video information.

In various embodiments, the played video is a video obtained by the terminal device via the Internet, such as an OTT video.

In some examples, the terminal device may intercept video data between the video start point and the video termination point in the video, and send the video data as the video information to the server, so that the server stores the video data as the to-be-recorded video.

In some examples, the terminal device may send a video identifier of the video, information of the video starting point, and information of the video termination point to the server as the video information, so that the server intercepts the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point.

In some examples, the information of the video starting point includes a first video screenshot of the video corresponding to the video starting point, and the information of the video termination point includes a second video screenshot of the video corresponding to the video termination point. The terminal device may send the video information to the server, so that the server determines, according to the first video screenshot and the second video screenshot, the video starting point and the video termination point in the video corresponding to the video identifier, and intercepts the to-be-recorded video from the video according to the video starting point and the video termination point.

In some examples, the information of the video starting point includes a first time in the video corresponding to the video starting point, and the information of the video ending point includes a second time in the video corresponding to the video ending point. The terminal device may send the video information to the server, so that the server intercepts the to-be-recorded video from the video according to the first time and the second time.
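The time-based variant of the video information can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names `video_id`, `start_seconds`, and `end_seconds`, the helper names, and the JSON wire format are hypothetical and are not specified by the application.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VideoInfo:
    """Video information sent from the terminal to the server (hypothetical layout)."""
    video_id: str         # identifier of the played video
    start_seconds: float  # "first time": the video starting point
    end_seconds: float    # "second time": the video termination point


def build_video_info(video_id: str, start_seconds: float, end_seconds: float) -> VideoInfo:
    """Validate the user's selection and package it as video information."""
    if start_seconds < 0:
        raise ValueError("starting point must not be negative")
    if end_seconds <= start_seconds:
        raise ValueError("termination point must come after the starting point")
    return VideoInfo(video_id, start_seconds, end_seconds)


def to_request_body(info: VideoInfo) -> str:
    """Serialize the video information as JSON (the wire format is an assumption)."""
    return json.dumps(asdict(info), sort_keys=True)


# Example: the user marked 12.5 s and 47.0 s in the video "video-123".
info = build_video_info("video-123", 12.5, 47.0)
body = to_request_body(info)
```

Sending only the identifier and the two times keeps the request small; the variant that uploads the intercepted video data itself would carry the clip instead.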

In some instances, the terminal device can also edit the to-be-recorded video by interacting with the server. The editing operations include, but are not limited to, screen cropping, video clipping, video addition, mute, dubbing, and graphics processing.

In some examples, the method can also include:

Generating an audio file corresponding to the to-be-recorded video in response to the voice-over instruction;

Transmitting the audio file to a server, so that the server generates a dubbed video file according to the to-be-recorded video corresponding to the video identifier and the audio file corresponding to the video identifier.

Here, the terminal device can acquire audio input by the user through various devices with pickups to generate an audio file. These devices with pickups can include a microphone, a remote control with a microphone, a cell phone, and the like. The terminal device can communicate with the device with the pickup using a wired connection or a wireless connection (e.g., infrared, Bluetooth, or Wi-Fi).

Step 302: The first client sends the to-be-recorded video to the server.

Further, the first client may also save the to-be-recorded video locally before sending the to-be-recorded video to the server.

Step 303: The server acquires the to-be-dubbed video and generates a target video according to the to-be-dubbed video.

Specifically, if the to-be-dubbed video meets the relevant definition of the target video, the to-be-dubbed video may directly serve as the target video; if the to-be-dubbed video does not meet the relevant definition of the target video, the target video is generated after the to-be-dubbed video is edited. The relevant definition of the target video includes, but is not limited to, the absence of audio data in the target video.

Step 304: The server generates a management identifier corresponding to the target video, and obtains an interaction identifier corresponding to the management identifier.

Specifically, the management identifier may be an ID number or a key value for identifying the target video. All audio files and video files associated with the target video have the same management identifier, and the server manages the video files and/or audio files according to the management identifier.

The interaction identifier is configured to enable the second client to obtain the target video generated by the server and the management identifier; the interaction identifier may be the same as the management identifier, or may be different from the management identifier. The interaction identifier is generated according to the management identifier, and the interaction identifier includes, but is not limited to, a web address, a two-dimensional code, a barcode, and a combination thereof.

In an embodiment of the present application, the interaction identifier includes a web address corresponding to the management identifier, with the web address represented as a two-dimensional code. The target video and the management identifier are stored at the location of the web address.
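The relationship between the management identifier and a web-address interaction identifier can be sketched as follows. The base address `https://dubbing.example.com/target`, the query parameter `mid`, and both function names are hypothetical; the application does not specify a URL layout. Rendering the address as a two-dimensional code would need an additional library and is omitted here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical location under which the server stores target videos.
BASE_URL = "https://dubbing.example.com/target"


def interaction_url(management_id: str) -> str:
    """Derive a web address (interaction identifier) from a management identifier."""
    return f"{BASE_URL}?{urlencode({'mid': management_id})}"


def management_id_from_url(url: str) -> str:
    """Recover the management identifier on the second client's side."""
    values = parse_qs(urlparse(url).query).get("mid")
    if not values:
        raise ValueError("address carries no management identifier")
    return values[0]


url = interaction_url("m-42")
recovered = management_id_from_url(url)
```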

Step 305: The server sends the interaction identifier to the first client.

Step 306: The first client acquires the interaction identifier from the server, and enables the interaction identifier to be acquired by the second client.

In some examples, the method can also include:

The terminal device displays an interaction identifier of the to-be-recorded video sent by the server, and the interaction identifier can be recognized by a terminal device to obtain the to-be-recorded video from the server. Here, the second client can be run on the terminal device.

Step 307: The second client obtains the target video and the management identifier from the server according to the interaction identifier.

The first client obtains the two-dimensional code. The second client can obtain the two-dimensional code by scanning it, and can then access the web address represented by the two-dimensional code, thereby obtaining the target video and the management identifier.

Further, the second client may perform an editing operation on the target video, where the editing operation includes, but is not limited to, screen cropping, video clipping, video addition, mute, dubbing, and graphics processing, thereby obtaining the edited target video. The second client may then send the edited target video and the management identifier to the server to replace the target video corresponding to the management identifier on the server side.

Further, the second client may also issue a video editing instruction to the server by interacting with the server, where the editing instruction further includes the management identifier. Editing operations are performed by the server on the target video corresponding to the management identifier, including but not limited to screen cropping, video clipping, video addition, mute, dubbing, and graphics processing. The server obtains the edited target video, replaces the original target video with the edited target video, and pushes the edited target video to the second client.

Step 308: In response to a dubbing instruction, the second client generates an audio file corresponding to the management identifier and sends the audio file to the server.

Specifically, in response to the dubbing instruction, the second client may acquire the audio file by recording an audio file, selecting an existing audio file, or the like, and send the audio file and the management identifier to the server, so that the server can obtain the audio file.

Further, if the audio file is generated by recording, the target video is played during the recording for the user to dub. If, before step 308, the second client has edited the target video by interacting with the server or by its own editing function, the edited target video is played during the recording for the user to dub.

In some examples, the method can also include:

In response to the dubbing instruction, the terminal device may generate an audio file corresponding to the to-be-recorded video and send the audio file to a server, so that the server generates a dubbed video file according to the to-be-recorded video corresponding to the video identifier and the audio file corresponding to the video identifier.

For example, the terminal device can record audio through a voice input device, such as a microphone, and generate an audio file. During the recording process, the terminal device can simultaneously play the video for the user to perform dubbing.

Step 309: The server generates the dubbed video file according to the audio file corresponding to the management identifier and the target video corresponding to the management identifier.

If, before step 308, the second client has edited the target video by interacting with the server or by its own editing function, the target video in the server has already been replaced, and the server generates the dubbed video file according to the audio file and the replaced target video.
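The server-side bookkeeping around steps 304 to 309 can be sketched as an in-memory store keyed by the management identifier. The class and method names are hypothetical, and a real server would persist files rather than hold path strings in a dictionary; the sketch only shows that an edited target video replaces the original before the dubbed file is generated.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    target_video: str                 # path of the current target video
    audio_file: Optional[str] = None  # audio submitted by the second client


class DubbingStore:
    """In-memory stand-in for files grouped under one management identifier."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def register(self, management_id: str, target_video: str) -> None:
        """Step 304: associate a new target video with its management identifier."""
        self._entries[management_id] = Entry(target_video)

    def replace_target(self, management_id: str, edited_video: str) -> None:
        """An edited target video replaces the original one."""
        self._entries[management_id].target_video = edited_video

    def attach_audio(self, management_id: str, audio_file: str) -> None:
        """Step 308: store the audio file sent with the management identifier."""
        self._entries[management_id].audio_file = audio_file

    def dubbing_inputs(self, management_id: str) -> Tuple[str, str]:
        """Step 309: return the (target video, audio file) pair to be combined."""
        entry = self._entries[management_id]
        if entry.audio_file is None:
            raise LookupError("no audio file submitted yet")
        return entry.target_video, entry.audio_file


store = DubbingStore()
store.register("m-42", "target.mp4")
store.replace_target("m-42", "target_edited.mp4")  # edit made before step 308
store.attach_audio("m-42", "voice.wav")
inputs = store.dubbing_inputs("m-42")
```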

Further, in response to the sending instruction of the second client, the server may send the video file to the second client.

Further, in response to the sharing instruction sent by the second client, the server may also share the video file to other users.

In summary, the method provided in this embodiment implements dubbing of a video through three-way interaction among the first client, the second client, and the server. The actual dubbing work is done on the server side, and the user only needs to select the video to be dubbed and record the audio file, which simplifies the dubbing process for the user. Further, the source of the to-be-dubbed video is not limited: it may be a video resource selected by the user from a video library, or a video resource that the user watches on the television, such as an OTT video.

OTT is an abbreviation of "Over The Top", which refers to providing various application services to users through the Internet. This kind of application is different from the communication service provided by the current operator. It only uses the operator's network, and the service is provided by a third party other than the operator. Currently, typical OTT services include Internet TV services, Apple App Store, and others. Internet companies use telecom operators' broadband networks to develop their own businesses, such as Google, Apple, Skype, Netflix, and domestic QQ. Netflix web video and apps in various mobile app stores are OTT. The embodiment of the present application can directly perform dubbing based on the OTT video, thereby significantly broadening the source of the dubbing material.

Further, before the step 308, the target video may be edited by the server or the second client. Referring to FIG. 5, the video editing method of the present application includes the following steps:

Step S310: Decompose the target video, frame by frame in time-axis order, into a combination of video frames. The time axis refers to a straight line on which two or more time points are arranged in order.

A decomposed temporary file is generated according to the combination of the video frames, and each video frame includes graphic data.

Step S320: Receive a video editing instruction, and edit the decomposed video frames according to the video editing instruction.

Step S330: Obtain the edited target video according to the editing result.

Taking the screen cropping as an example, if the video editing command is a screen cropping instruction, the screen cropping instruction includes width data and height data of the video screen.

(1) If the screen cropping is performed on the second client, the second client directly edits each video frame in the temporary file according to the width data and the height data of the video screen, and obtains the target file after screen cropping according to the editing result.

(2) If the screen cropping is performed on the server side, the second client obtains the width data and the height data of the cropped video screen in response to the screen cropping instruction, and transmits the width data and the height data to the server, so that the server performs screen cropping on the target video in the server according to the width data and the height data. The cropping method is the same as in (1).
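Screen cropping by width and height can be sketched as follows. The application gives only the width data and the height data, so anchoring the kept region at the top-left corner is an assumption made here, as is the representation of a frame as rows of pixel values.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values, a simplified frame model


def crop_frame(frame: Frame, width: int, height: int) -> Frame:
    """Keep the top-left width x height region of one frame."""
    if not frame or not frame[0]:
        raise ValueError("empty frame")
    if not (0 < height <= len(frame) and 0 < width <= len(frame[0])):
        raise ValueError("crop size exceeds the frame")
    return [row[:width] for row in frame[:height]]


def crop_video(frames: List[Frame], width: int, height: int) -> List[Frame]:
    """Apply the same crop to every decomposed video frame."""
    return [crop_frame(frame, width, height) for frame in frames]


# A 4x3 frame whose bottom row stands in for a burned-in subtitle line.
frames = [[[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 9, 9, 9]]]
cropped = crop_video(frames, width=4, height=2)
```

Cropping away the bottom rows in this way corresponds to removing the original subtitle area from the picture.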

Further, other video editing instructions of the user may be received, including video clipping, video addition, mute, dubbing, and graphics processing.

By performing various kinds of editing on the target video, the example of the present application can satisfy various editing requirements of the user and finally obtain a better dubbing effect; by performing screen cropping, the original subtitle in the target video can be removed.

Further, FIG. 6 shows a schematic flowchart of editing according to video editing instructions such as video clipping, video addition, mute, dubbing, and graphics processing. The above step S320 specifically includes:

S3201. Receive a video editing instruction, where the video editing instruction includes a start point and an end point of video editing and a type of video editing;

S3202: Match the start point and the end point with time points on the time axis, respectively, and obtain a first matching time point corresponding to the start point and a second matching time point corresponding to the end point;

S3203. Search for a first video frame corresponding to the first matching time point and a second video frame corresponding to the second matching time point.

S3204: Edit a video frame between the first video frame and the second video frame according to the type of the video editing.
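The matching of a start point and an end point to time points on the time axis can be sketched with a binary search. Choosing the nearest time point (with ties going to the earlier frame) is an assumption; the application does not state the matching rule. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def nearest_index(timeline: List[float], t: float) -> int:
    """Index of the time point closest to t; timeline must be sorted ascending."""
    if not timeline:
        raise ValueError("empty time axis")
    pos = bisect_left(timeline, t)
    if pos == 0:
        return 0
    if pos == len(timeline):
        return len(timeline) - 1
    before, after = timeline[pos - 1], timeline[pos]
    # Ties go to the earlier time point.
    return pos if after - t < t - before else pos - 1


def frame_range(timeline: List[float], start: float, end: float) -> Tuple[int, int]:
    """Return the indices of the first and second video frames to edit."""
    first = nearest_index(timeline, start)
    second = nearest_index(timeline, end)
    if second < first:
        raise ValueError("end point precedes start point")
    return first, second


# One frame per second, for readability only.
timeline = [0.0, 1.0, 2.0, 3.0, 4.0]
first, second = frame_range(timeline, start=1.2, end=2.9)
```

The frames with indices `first` through `second` are then the ones edited according to the type of video editing.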

Step S320 is described below for each type of video editing.

(1) Video clipping processing

If the type of the video editing is video clipping processing, the start point and the end point are respectively matched with the time points on the time axis to obtain the first matching time point corresponding to the start point and the second matching time point corresponding to the end point. The first video frame corresponding to the first matching time point and the second video frame corresponding to the second matching time point are found, and the video frames between the first video frame and the second video frame in the temporary file are clipped.

(2) Video addition processing

If the type of the video editing is video addition processing, the start point and the end point are respectively matched with the time points on the time axis to obtain the first matching time point corresponding to the start point and the second matching time point corresponding to the end point. The first video frame corresponding to the first matching time point and the second video frame corresponding to the second matching time point are found. If the start point and the end point are time points corresponding to two adjacent frames of image data, the video frame to be added is inserted between the first video frame and the second video frame. If time points corresponding to multiple frames of image data lie between the start point and the end point, the video frame to be added may be inserted at a preset position between the first video frame and the second video frame according to a preset rule.

(3) Mute processing

If the type of the video editing is mute processing, the start point and the end point are respectively matched with the time points on the time axis to obtain the first matching time point corresponding to the start point and the second matching time point corresponding to the end point. The first video frame corresponding to the first matching time point and the second video frame corresponding to the second matching time point are found. Then, the sound data between the first video frame and the second video frame is deleted.

(4) Dubbing processing

If the type of the video editing is dubbing processing, the start point and the end point are respectively matched with the time points on the time axis to obtain the first matching time point corresponding to the start point and the second matching time point corresponding to the end point. The first video frame corresponding to the first matching time point and the second video frame corresponding to the second matching time point are found. Then, the sound data selected by the user is added between the first video frame and the second video frame. If the video frames between the first video frame and the second video frame originally carry sound data, the original sound data is erased first and then the sound data selected by the user is added.

(5) Graphics processing

If the type of the video editing is graphics processing, the start point and the end point are respectively matched with the time points on the time axis to obtain the first matching time point corresponding to the start point and the second matching time point corresponding to the end point. The first video frame corresponding to the first matching time point and the second video frame corresponding to the second matching time point are found. Then, the contrast, brightness, and color saturation of the image data of the video frames between the first video frame and the second video frame are adjusted.

Of course, the video editing process of step S320 is not limited to the above processes; other processing can also be included. Moreover, the above processing can be flexibly combined. For example, the video frames can be muted first and then dubbed; or the video frames can be clipped first and the video frame to be added then inserted at the corresponding position of the clipped video frames, and so on. It should be noted that if the video editing instruction does not include the start point and the end point, the start point is set by default to the first time point of the time axis of the entire video, and the end point is set by default to the last time point of the time axis of the entire video.
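The default range and the combination of mute and dubbing processing can be sketched as follows. The sketch assumes that the sound data is stored as one chunk per video frame and that the start and end points have already been matched to frame indices; both are simplifications, and all names are hypothetical.

```python
from typing import List, Optional

AudioTrack = List[Optional[str]]  # one sound chunk per frame; None means silence


def resolve_range(frame_count: int, start: Optional[int] = None,
                  end: Optional[int] = None) -> range:
    """A missing start defaults to the first frame, a missing end to the last."""
    first = 0 if start is None else start
    last = frame_count - 1 if end is None else end
    if not 0 <= first <= last < frame_count:
        raise ValueError("invalid editing range")
    return range(first, last + 1)


def mute(audio: AudioTrack, start: Optional[int] = None,
         end: Optional[int] = None) -> AudioTrack:
    """Delete the sound data between the first and second video frames."""
    result = list(audio)
    for index in resolve_range(len(audio), start, end):
        result[index] = None
    return result


def dub(audio: AudioTrack, new_audio: List[str], start: Optional[int] = None,
        end: Optional[int] = None) -> AudioTrack:
    """Erase the original sound in the range, then add the user-selected sound."""
    result = mute(audio, start, end)
    for offset, index in enumerate(resolve_range(len(audio), start, end)):
        if offset < len(new_audio):
            result[index] = new_audio[offset]
    return result


original = ["a0", "a1", "a2", "a3"]
muted_all = mute(original)                    # no range given: whole video
dubbed = dub(original, ["n0", "n1"], 1, 2)    # replace sound on frames 1..2
```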

The example of the present application can decompose the target video to be processed frame by frame, so that the target video can be accurately processed to each frame, which improves the accuracy of the video processing and improves the editing effect.

Please refer to FIG. 7A, which illustrates a dubbing method, the method comprising the following steps:

Step S401A: Acquire video information of a to-be-recorded video from a terminal device, where the video information is generated by the terminal device according to a video starting point and a video termination point selected by the user in the played video;

Step S402A: Generate a to-be-recorded video according to the video information.

In some examples, the server may intercept the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point.

In some examples, the information of the video starting point includes a first video screenshot of the video corresponding to the video starting point, and the information of the video termination point includes a second video screenshot of the video corresponding to the video termination point. The server may determine, according to the first video screenshot and the second video screenshot, the video starting point and the video termination point in the video corresponding to the video identifier, and intercept the video data between the video starting point and the video termination point from the video as the to-be-recorded video.
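One possible way to locate the two screenshots in the video is to pick, for each screenshot, the frame with the smallest pixel difference. The application does not state how the server matches screenshots to frames, so the sum-of-absolute-differences measure, the flattened grayscale frame model, and the function names below are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[int]  # flattened grayscale pixels, a simplifying assumption


def frame_distance(a: Frame, b: Frame) -> int:
    """Sum of absolute pixel differences between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b))


def locate_screenshot(frames: List[Frame], screenshot: Frame) -> int:
    """Index of the frame that is closest to the screenshot."""
    if not frames:
        raise ValueError("video has no frames")
    return min(range(len(frames)),
               key=lambda i: frame_distance(frames[i], screenshot))


def locate_clip(frames: List[Frame], first_shot: Frame,
                second_shot: Frame) -> Tuple[int, int]:
    """Frame indices of the video starting point and the video termination point."""
    start = locate_screenshot(frames, first_shot)
    end = locate_screenshot(frames, second_shot)
    if end < start:
        raise ValueError("termination point precedes starting point")
    return start, end


frames = [[0, 0, 0, 0], [10, 10, 10, 10], [50, 60, 70, 80], [200, 200, 200, 200]]
clip = locate_clip(frames, [11, 9, 10, 10], [198, 201, 200, 199])
```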

In some examples, the information of the video starting point includes a first time in the video corresponding to the video starting point, and the information of the video termination point includes a second time in the video corresponding to the video termination point. The server may intercept, from the video, the video data between the first time and the second time as the to-be-recorded video.
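As a hedged illustration of the time-based variant, the following Python sketch packages the video information and builds an `ffmpeg` command line that cuts the clip. The `VideoInfo` class and the file paths are assumptions; `ffmpeg` is used only as an example tool and is not named in the present application. The command is built but not executed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VideoInfo:
    """Hypothetical video information sent by the terminal device."""
    video_id: str
    start_s: float  # first time, in seconds
    end_s: float    # second time, in seconds


def build_clip_command(info: VideoInfo, source_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg call that copies the span [start_s, end_s) out of the source."""
    if not 0 <= info.start_s < info.end_s:
        raise ValueError("first time must be non-negative and before second time")
    duration = info.end_s - info.start_s
    return [
        "ffmpeg",
        "-ss", f"{info.start_s:.3f}",  # seek to the first time
        "-i", source_path,
        "-t", f"{duration:.3f}",       # keep only the selected duration
        "-c", "copy",                  # stream copy, no re-encoding
        output_path,
    ]
```

With stream copy the cut typically snaps to key frames, so a frame-accurate cut would require re-encoding instead.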

In some examples, the server may further receive an audio file sent by the terminal device, and generate a dubbed video file according to the to-be-dubbed video corresponding to the video identifier and the audio file corresponding to the video identifier.

Please refer to FIG. 7B, which illustrates a dubbing method, the method comprising the following steps:

Step S401B: Acquire a to-be-recorded video from the first client.

Step S402B: Generate a target video according to the to-be-dubbed video.

Please refer to FIG. 8, which shows a target video generation method:

S4021: Determine whether there is audio data in the to-be-recorded video;

S4022, if yes, eliminating audio data in the to-be-recorded video to obtain a target video;

S4023. If no, directly use the to-be-recorded video as the target video.

Specifically, eliminating the audio data in the to-be-recorded video can be implemented in either of the following two manners:

(1) decoding the file in which the to-be-recorded video is located, obtaining video data and audio data; re-encoding the obtained video data to obtain a target video;

(2) directly canceling the audio data in the to-be-recorded video by means of digital filtering to obtain a target video.
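Steps S4021 to S4023 together with manner (1) can be sketched as follows. The `MediaFile` class is a toy model of a decoded file and is an assumption for illustration; actual decoding and re-encoding are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MediaFile:
    """Toy model of a decoded file: names of its video and audio streams."""
    video_streams: List[str]
    audio_streams: List[str] = field(default_factory=list)


def make_target_video(to_be_dubbed: MediaFile) -> MediaFile:
    """Return the target video: the input without any audio data."""
    # S4021: check whether the to-be-dubbed video carries audio data.
    if to_be_dubbed.audio_streams:
        # S4022, manner (1): keep only the video data for re-encoding.
        return MediaFile(video_streams=list(to_be_dubbed.video_streams))
    # S4023: no audio data, so the input is used directly as the target video.
    return to_be_dubbed
```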

Step S403B: Generate a management identifier corresponding to the target video, and obtain an interaction identifier corresponding to the management identifier, so that the second client can obtain the target video and the management identifier according to the interaction identifier.

In this embodiment, the management identifier corresponding to the target video may be generated according to a preset identifier generation method. The identifier generation method includes, but is not limited to, randomly generating an identifier, generating an identifier according to the target video generation time, and generating an identifier according to the target video generation time and other attribute parameters.

In this embodiment, a web address may be generated according to the management identifier and a preset web address generation algorithm. The generated web address serves as the interaction identifier and is in one-to-one correspondence with the management identifier. Once generated, the web address is pushed to the first client. Further, the web address pushed to the first client may take the form of a character string, a QR code, or a barcode.
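A minimal sketch of identifier and web address generation is given below. It assumes random generation of the management identifier (one of the options listed above) and a trivial address algorithm that embeds the identifier as a query parameter; the base address `https://dub.example.com/dub` and the parameter name `mid` are hypothetical.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://dub.example.com/dub"  # hypothetical service address


def new_management_id() -> str:
    """Randomly generate a management identifier (32 hex characters)."""
    return uuid.uuid4().hex


def to_interaction_url(management_id: str) -> str:
    """Derive the web address that serves as the interaction identifier."""
    return f"{BASE_URL}?{urlencode({'mid': management_id})}"


def from_interaction_url(url: str) -> str:
    """Recover the management identifier from the web address."""
    return parse_qs(urlparse(url).query)["mid"][0]
```

Because the mapping is reversible, the address and the management identifier stay in one-to-one correspondence. Rendering the address as a QR code or barcode would be a separate step.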

Step S404B: Acquire an audio file corresponding to the management identifier from the second client.

Step S405B: Generate a dubbed video file according to the audio file corresponding to the management identifier and the target video corresponding to the management identifier.

Further, please refer to FIG. 9, which shows a flowchart of a subtitle acquisition method. After the obtaining the audio file corresponding to the management identifier from the second client, the method further includes:

Step S410, performing voice recognition on the audio in the audio file.

Specifically, please refer to FIG. 10, which shows a flowchart of a method for voice recognition of audio in the audio file, and step S410 includes the following steps:

In step S4101, audio data in the audio file is obtained.

Step S4102, the audio data is segmented according to the time interval of the speech, the audio data segment is obtained, and the time information of the audio data segment is recorded.

Specifically, the audio data is segmented at the pauses between utterances, which speech recognition determines from the waveform of the audio in the audio data. Because speakers talk at normal, faster, or slower speech rates, the pause interval used for segmentation can be set according to the speech rate of the voice in the audio data, which further improves the accuracy of sentence segmentation. Segmenting the audio data into audio data segments also keeps the amount of subtitle text presented in each picture at a level that viewers can comfortably read, digest, and understand.
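The pause-based segmentation of step S4102 can be sketched as below. This is an illustration under assumptions: the audio is modeled as a list of amplitude samples, silence is detected with a fixed amplitude threshold, and the speech-rate thresholds and pause lengths in `pause_for_speech_rate` are invented values, not values from the present application.

```python
from typing import List, Optional, Tuple


def pause_for_speech_rate(words_per_minute: float) -> float:
    """Pick a pause interval in seconds; all thresholds here are illustrative."""
    if words_per_minute > 180:
        return 0.2   # faster speech: shorter pauses separate sentences
    if words_per_minute < 120:
        return 0.5   # slower speech: require a longer pause
    return 0.35


def split_on_pauses(
    samples: List[float], sample_rate: int, min_pause_s: float, threshold: float = 0.02
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for speech segments separated by long enough pauses."""
    min_gap = int(min_pause_s * sample_rate)  # minimum number of silent samples
    segments: List[Tuple[float, float]] = []
    seg_start: Optional[int] = None  # first loud sample of the current segment
    last_loud = 0                    # most recent loud sample seen so far
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            continue
        if seg_start is None:
            seg_start = i
        elif i - last_loud > min_gap:
            # The silence was long enough: close the segment and record its times.
            segments.append((seg_start / sample_rate, (last_loud + 1) / sample_rate))
            seg_start = i
        last_loud = i
    if seg_start is not None:
        segments.append((seg_start / sample_rate, (last_loud + 1) / sample_rate))
    return segments
```

The recorded start and end times are what later steps would use to align each text data segment with the audio.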

Step S4103, obtaining a corresponding text data segment by voice recognition.

Specifically, obtaining the corresponding text data segment by voice recognition includes: matching the audio data segment against a thesaurus to obtain the classified lexicon corresponding to the audio data segment; and performing voice recognition according to the matched classified lexicon. The classified lexicons include two or more language classification lexicons and two or more professional subject classification lexicons. By matching the audio data segment against the thesaurus, the language classification lexicon corresponding to the original language of the audio data can be obtained, and the vocabulary in that lexicon can be used to accelerate speech recognition. The matching can likewise yield the professional subject classification lexicon corresponding to the subject matter of the audio data. For example, audio data on a historical subject can be matched to the history classification lexicon, whose vocabulary can then be used to speed up speech recognition and obtain the corresponding text data.

Specifically, obtaining the corresponding text data segment by voice recognition may consist of directly recognizing the audio content in the audio data segment as text data in the language of the original sound. Of course, the audio content in the audio data segment may also be recognized as text in another language. In that case, the language category selected by the user is acquired, the audio data segment is recognized as text data in the language of the original sound, and that text data is then translated into text data of the language category selected by the user.

In various embodiments, an interval identifier is added to the corresponding text data segment based on the length of the pause between utterances. The text data segment obtained by speech recognition contains a large number of punctuation marks, many of which do not fit the context. To facilitate further proofreading, the recognized text data segment can be filtered so that the bytes occupied by each punctuation mark are converted into an interval identifier occupying the same number of bytes. During manual proofreading, the interval identifiers can then be replaced with punctuation marks that fit the context.
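The byte-preserving replacement of punctuation can be sketched as follows. The underscore used as the interval identifier and the assumption of UTF-8 encoding are illustrative choices, not requirements of the present application.

```python
import unicodedata

INTERVAL_MARK = "_"  # hypothetical one-byte interval identifier


def mask_punctuation(text: str) -> str:
    """Replace each punctuation mark with interval marks of equal UTF-8 byte length."""
    out = []
    for ch in text:
        # Unicode general categories starting with "P" are punctuation.
        if unicodedata.category(ch).startswith("P"):
            out.append(INTERVAL_MARK * len(ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Keeping the byte length unchanged means that byte offsets recorded for the text data segment remain valid after filtering.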

Specifically, after the text data segments are obtained by voice recognition, the text data may be divided and line-wrapped according to the start time and end time of each text data segment to form the caption text corresponding to the audio data in the audio file. The criterion for dividing and wrapping the text data is mainly how well the subtitles match the audio in the audio and video.

Step S420, generating a subtitle file corresponding to the management identifier according to the recognized result.

The above text data segments are recorded in the form of a subtitle file. It should be noted that after the subtitle file of the audio and video data is generated, the output mode of the subtitle file may be selected according to actual conditions. The output modes include, but are not limited to: generating a subtitle file in a specific format that conforms to a subtitle format standard; or, when the video is played, integrating the subtitle file into the audio and video output stream so that the player handles the subtitle display.
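As one example of a subtitle file that conforms to a common format, the sketch below writes the recognized segments as SubRip (SRT) text. The choice of SRT is an assumption; the present application does not name a particular format. Each segment is a `(start_s, end_s, text)` tuple.

```python
from typing import List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: List[Tuple[float, float, str]]) -> str:
    """Render numbered SRT cues, separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```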

Step S430, transmitting the subtitle file to the second client, so that the second client can correct the subtitle file and return the correction result.

Step S440, obtaining a target subtitle file according to the correction result.

The correction result includes a confirmation instruction or a corrected subtitle file. If the second client corrects the subtitle file, it returns the corrected subtitle file, which is used as the target subtitle file; if the second client does not correct the subtitle file, it directly returns the confirmation instruction, and the original subtitle file is used as the target subtitle file. The target subtitle file also corresponds to the management identifier.
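The selection of the target subtitle file can be expressed in a few lines. The `CorrectionResult` class is a hypothetical container in which a missing corrected file stands for the confirmation instruction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectionResult:
    """Hypothetical reply from the second client."""
    management_id: str
    corrected_subtitle: Optional[str] = None  # None models a confirmation instruction


def target_subtitle(original: str, result: CorrectionResult) -> str:
    """Use the corrected file if one was returned, otherwise the original file."""
    if result.corrected_subtitle is None:
        return original
    return result.corrected_subtitle
```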

Further, after the target subtitle file is acquired, in step S405B the audio file, the target video, and the target subtitle file that correspond to the same management identifier may be combined to obtain a dubbed video file.
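The combination of the three files that share a management identifier can be sketched as follows. The in-memory `store` dictionary, its keys, and the file names are assumptions, and `ffmpeg` is again used only as an example muxing tool; the command is built but not executed.

```python
from typing import Dict, List


def assets_for(management_id: str, store: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Look up the files registered under one management identifier."""
    assets = store[management_id]
    missing = {"video", "audio", "subtitle"} - assets.keys()
    if missing:
        raise KeyError(f"missing assets for {management_id}: {sorted(missing)}")
    return assets


def build_mux_command(assets: Dict[str, str], output_path: str) -> List[str]:
    """Build an ffmpeg call that muxes target video, dubbed audio, and subtitles."""
    return [
        "ffmpeg",
        "-i", assets["video"],
        "-i", assets["audio"],
        "-i", assets["subtitle"],
        "-map", "0:v:0", "-map", "1:a:0", "-map", "2:s:0",
        "-c:v", "copy",      # keep the silent target video as is
        "-c:a", "aac",       # encode the recorded voice
        "-c:s", "mov_text",  # subtitle codec suitable for MP4 output
        output_path,
    ]
```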

The embodiment provides a dubbing method that automatically generates a subtitle file by means of voice recognition and generates a dubbed file based on the management identifier. The user only needs to record the sound corresponding to the target video to obtain an audio file; the dubbing work is then completed and the subtitles are generated automatically, which spares the user the complex work of generating the dubbed file and improves the user experience.

The following is an embodiment of the apparatus of the present application, which may be used to implement the method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Referring to FIG. 11, there is shown a dubbing apparatus having the functions of the server in the above method examples. The functions may be implemented by hardware, or by hardware executing corresponding software. The device can include:

The to-be-dubbed video acquisition module 501 is configured to acquire a to-be-recorded video from the first client. It can be used to perform the above steps 303 and S401.

The target video generating module 502 is configured to generate a target video according to the to-be-dubbed video. It can be used to perform the above steps 303 and S402.

The identifier generating module 503 is configured to generate a management identifier corresponding to the target video, and obtain an interaction identifier corresponding to the management identifier, so that the second client can obtain the target video and the management identifier according to the interaction identifier. It can be used to perform the above steps 304 and S403.

The audio file obtaining module 504 is configured to obtain an audio file corresponding to the management identifier from the second client. It can be used to perform the above steps 308 and S404.

The synthesizing module 505 is configured to generate a dubbed video file according to the audio file corresponding to the management identifier and the target video corresponding to the management identifier. It can be used to perform the above steps 309 and S405.

Specifically, please refer to FIG. 12, which shows a block diagram of a target video generation module. The target video generating module 502 can include:

The determining unit 5021 is configured to determine whether there is audio data in the to-be-recorded video. It can be used to perform the above step S4021.

The muffling unit 5022 is configured to eliminate audio data in the to-be-recorded video. It can be used to perform step 3022 above.

Specifically, please refer to FIG. 13, which shows a block diagram of the identity generation module. The identifier generating module 503 can include:

The management identifier generating unit 5031 is configured to generate a management identifier corresponding to the target video according to a preset identifier generation method. It can be used to perform the above steps 304 and S403.

The website generating unit 5032 is configured to generate a web address according to the management identifier and a preset web address generation algorithm. It can be used to perform the above steps 304 and S403.

The two-dimensional code generating unit 5033 is configured to generate a two-dimensional code according to the web address. It can be used to perform the above steps 304 and S403.

Correspondingly, the device may further include: a two-dimensional code pushing module 506, configured to push the two-dimensional code to the first client. It can be used to perform the above step 304.

Further, the device may further include:

The voice recognition module 507 is configured to perform voice recognition on the audio in the audio file. It can be used to perform the above step S410.

The subtitle file generating module 508 is configured to generate a subtitle file according to the recognized result. It can be used to perform the above step S420.

Further, the device may further include:

The video editing module 509 is used for video editing.

The video file sending module 510 is configured to send the dubbed video file to the second client.

The video file sharing module 511 is configured to share the dubbed video file to other users.

An exemplary embodiment of the present application further provides a voice over system, the system including a first client 601, a second client 602, and a server 603;

The first client 601 is configured to obtain a to-be-dubbed video in response to a user instruction, send the to-be-dubbed video to the server, acquire an interaction identifier from the server, and make the interaction identifier available for acquisition by the second client;

The second client 602 is configured to acquire a target video from the server according to the interaction identifier; generate an audio file corresponding to the management identifier and send the audio file to the server in response to the voiceover instruction;

The server 603 is configured to acquire the to-be-recorded video; generate a target video according to the to-be-recorded video; generate a management identifier corresponding to the target video, and obtain an interaction identifier corresponding to the management identifier; send the interaction identifier to the first client; send the target video to the second client; and obtain the dubbed video file according to the audio file and the target video in the server.
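The interaction among the first client 601, the second client 602, and the server 603 can be simulated with a small in-memory sketch. The `DubbingServer` class and its method names are hypothetical; audio removal and the actual synthesis are omitted, and the final step simply returns the pair of files that would be combined.

```python
import uuid
from typing import Dict, Tuple


class DubbingServer:
    """Toy in-memory stand-in for the server 603."""

    def __init__(self) -> None:
        self._targets: Dict[str, bytes] = {}    # management id -> target video
        self._interaction: Dict[str, str] = {}  # interaction id -> management id
        self._audio: Dict[str, bytes] = {}      # management id -> audio file

    def submit_video(self, to_be_dubbed: bytes) -> str:
        """First client: upload the video and receive the interaction identifier."""
        management_id = uuid.uuid4().hex
        interaction_id = uuid.uuid4().hex
        self._targets[management_id] = to_be_dubbed  # audio removal omitted here
        self._interaction[interaction_id] = management_id
        return interaction_id

    def fetch(self, interaction_id: str) -> Tuple[bytes, str]:
        """Second client: obtain the target video and the management identifier."""
        management_id = self._interaction[interaction_id]
        return self._targets[management_id], management_id

    def submit_audio(self, management_id: str, audio: bytes) -> None:
        """Second client: send the audio file recorded for the target video."""
        if management_id not in self._targets:
            raise KeyError(management_id)
        self._audio[management_id] = audio

    def synthesize(self, management_id: str) -> Tuple[bytes, bytes]:
        """Return the (target video, audio) pair that would be combined."""
        return self._targets[management_id], self._audio[management_id]
```

In use, the first client would call `submit_video` and display the returned identifier, and the second client would call `fetch`, record the voice, and call `submit_audio` with the management identifier it received.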

Specifically, the server 603 may be the above-mentioned dubbing device;

The first client 601 can include:

The video identifier selection module 6011 is configured to acquire a video identifier selected by the user.

The time point obtaining module 6012 is configured to acquire a video starting point and a video ending point selected by the user;

The to-be-dubbed video acquisition module 6013 is configured to: in the video file corresponding to the video identifier, copy the video content between the video starting point and the video termination point to obtain a to-be-recorded video;

The second client 602 can include:

The interaction identifier obtaining module 6021 is configured to acquire an interaction identifier.

The interaction result obtaining module 6022 is configured to obtain a target video and a management identifier from the server according to the interaction identifier.

The audio file obtaining module 6023 is configured to generate an audio file corresponding to the management identifier;

The audio file sending module 6024 is configured to send the audio file to the server.

Further, the second client may further include:

The screen cropping module 6025 is configured to obtain, in response to a screen cropping instruction, the width data and the height data of the video screen after cropping.

It should be noted that the device and the system provided by the foregoing embodiments are illustrated only with the above division of functional modules. In actual applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.

Please refer to FIG. 14, which is a schematic structural diagram of a terminal provided by an embodiment of the present application. The terminal is used to implement the dubbing method provided in the above embodiments.

The terminal may include an RF (Radio Frequency) circuit 110, a memory 120 including one or more computer readable storage media, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a WiFi (Wireless Fidelity) module 170, a processor 180 including one or more processing cores, a power supply 190, and the like. It will be understood by those skilled in the art that the terminal structure shown in FIG. 14 does not constitute a limitation on the terminal, which may include more or fewer components than those illustrated, a combination of certain components, or a different arrangement of components.

The memory 120 can be used to store software programs and modules, and the processor 180 executes various functional applications and data processing by running the software programs and modules stored in the memory 120. The memory 120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for a function, and the like; the data storage area may store data created according to the use of the terminal, and the like. Moreover, memory 120 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 120 may also include a memory controller to provide access to memory 120 by processor 180 and input unit 130.

The processor 180 is the control center of the terminal. It connects the various portions of the entire terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 120 and recalling data stored in the memory 120.

The terminal also includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors. The one or more programs include instructions for executing the terminal-side method described above.

Please refer to FIG. 15 , which is a schematic structural diagram of a server provided by an embodiment of the present application. This server is used to implement the dubbing method of the server provided in the above embodiment. Specifically:

The server 1200 includes a central processing unit (CPU) 1201, a system memory 1204 including a random access memory (RAM) 1202 and a read only memory (ROM) 1203, and a system bus 1205 that connects the system memory 1204 and the central processing unit 1201. The server 1200 also includes a basic input/output system (I/O system) 1206 that facilitates transfer of information between various devices within the computer, and a mass storage device 1207 for storing the operating system 1213, applications 1214, and other program modules 1215.

The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209 such as a mouse, keyboard, etc. for user input of information. The display 1208 and the input device 1209 are both connected to the central processing unit 1201 via an input-output controller 1210 that is coupled to the system bus 1205. The basic input/output system 1206 can also include an input output controller 1210 for receiving and processing input from a plurality of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1210 also provides output to a display screen, printer, or other type of output device.

The mass storage device 1207 is connected to the central processing unit 1201 by a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer readable medium provide non-volatile storage for the server 1200. That is, the mass storage device 1207 can include a computer readable medium (not shown) such as a hard disk or a CD-ROM drive.

Without loss of generality, the computer readable medium can include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage medium is not limited to the above. The system memory 1204 and the mass storage device 1207 described above may be collectively referred to as a memory.

According to various embodiments of the present application, the server 1200 can also be operated by a remote computer connected to the network through a network such as the Internet. That is, the server 1200 can be connected to the network 1212 through the network interface unit 1211 connected to the system bus 1205, or can also be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1211.

The memory also includes one or more programs, the one or more programs being stored in a memory and configured to be executed by one or more processors. The one or more programs described above include instructions for executing the method of the server described above.

In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium comprising instructions, such as a memory comprising instructions. The instructions are executable by a processor of a terminal to perform the steps in the above method embodiments, or executable by a processor of the server to complete the server-side steps in the above method embodiments. For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.

It should be understood that "a plurality" as referred to herein means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, A and B exist at the same time, and B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

A person skilled in the art may understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing related hardware, and the program may be stored in a computer readable storage medium. The storage medium mentioned may be a read only memory, a magnetic disk, an optical disk, or the like.

The above is only a part of the embodiments of the present application and is not intended to limit the present application. Any modifications, equivalent substitutions, improvements, and the like made within the scope of the present application should be included in the scope of protection of the present application.

Claims

A voice-over method is applied to a terminal device, and the method includes:

Playing a video in response to a user instruction;

Obtaining a video starting point and a video ending point selected by the user in the video;

Generating video information of the to-be-recorded video according to the video starting point and the video ending point;

Sending the video information to a server, so that the server generates a to-be-recorded video according to the video information.
The method according to claim 1, wherein said generating video information of the to-be-recorded video according to the video starting point and the video ending point, and sending said video information to a server so that said server generates the to-be-recorded video according to said video information, includes:

Capturing, in the video, the video data between the video starting point and the video termination point, and transmitting the video data as the video information to the server, so that the server stores the video data as the to-be-recorded video.
The method according to claim 1, wherein the generating the video information according to the video starting point and the video ending point, and sending the video information to a server so that the server generates the to-be-recorded video according to the video information, includes:

Transmitting the video identifier of the video, the information of the video starting point, and the information of the video termination point to the server, so that the server intercepts the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point.
The method according to claim 3, wherein the information of the video starting point comprises a first video screenshot of the video corresponding to the video starting point, and the information of the video termination point comprises a second video screenshot of the video corresponding to the video termination point,

The sending the video information to the server, so that the server generates the to-be-recorded video according to the video information, including:

Sending the video information to the server, so that the server determines the video starting point and the video termination point in the video corresponding to the video identifier according to the first video screenshot and the second video screenshot, and intercepts the to-be-recorded video from the video according to the video starting point and the video termination point.
The method according to claim 3, wherein the information of the video starting point comprises a first time in the video corresponding to the video starting point, and the information of the video ending point comprises a second time in the video corresponding to the video termination point,

The sending the video information to the server, so that the server generates the to-be-recorded video according to the video information, including:

Sending the video information to a server, so that the server intercepts the to-be-recorded video from the video according to the first time and the second time.
The method of claim 1 wherein the method further comprises:

Generating an audio file corresponding to the to-be-recorded video in response to the voice-over instruction;

Transmitting the audio file to a server, so that the server generates a dubbed video file according to the to-be-dubbed video corresponding to the video identifier and the audio file corresponding to the video identifier.
The method of claim 1 further comprising:

Displaying an interaction identifier of the to-be-recorded video sent by the server, the interaction identifier being recognizable by a terminal device to obtain the to-be-recorded video from the server.
A voice-over method is applied to a server, and the method includes:

Obtaining video information of the to-be-recorded video from the terminal device, where the video information is generated by the terminal device according to a video starting point and a video termination point selected by the user in the played video;

Generating a to-be-recorded video according to the video information.
The method according to claim 8, wherein the video information comprises a video identifier of the video, information of the video starting point, and information of the video termination point, and the generating the to-be-recorded video according to the video information includes:

Intercepting the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point.
The method according to claim 9, wherein the information of the video starting point comprises a first video screenshot of the video corresponding to the video starting point, and the information of the video termination point comprises a second video screenshot of the video corresponding to the video termination point,

The intercepting the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point includes:

Determining, according to the first video screenshot and the second video screenshot, the video starting point and the video termination point in the video corresponding to the video identifier, and intercepting from the video the video data between the video starting point and the video termination point as the to-be-recorded video.
The method according to claim 9, wherein the information of the video starting point includes a first time in the video corresponding to the video starting point, and the information of the video ending point includes a second time in the video corresponding to the video termination point,

The intercepting the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point includes:

Video data between the first time and the second time is intercepted from the video as the to-be-recorded video.
The method of claim 8 wherein the method further comprises:

Receiving an audio file sent by the terminal device,

Generating a dubbed video file according to the to-be-dubbed video corresponding to the video identifier and the audio file corresponding to the video identifier.
A terminal device, comprising a processor and a memory, wherein the memory stores computer readable instructions that cause the processor to:

Playing a video in response to a user instruction;

Obtaining a video starting point and a video ending point selected by the user in the video;

Generating video information of the to-be-recorded video according to the video starting point and the video ending point;

Sending the video information to a server, so that the server generates a to-be-recorded video according to the video information.
The terminal device of claim 13, wherein the instructions cause the processor to perform the following operations:

In the video, capturing video data between the video start point and the video end point,

Transmitting the video data as the video information to the server such that the server stores the video data as the to-be-recorded video.
The terminal device of claim 13, wherein the instructions cause the processor to perform the following operations:

Transmitting, to the server, the video identifier of the video, the information of the video starting point, and the information of the video termination point, so that the server intercepts the to-be-recorded video from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point.
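The video information transmitted in this claim could be serialized as in the sketch below. The JSON encoding, the field names `video_id`, `start_time` and `end_time`, and the function name `build_video_info` are all assumptions for illustration; the application does not specify a wire format.

```python
import json


def build_video_info(video_id: str, start_time: float, end_time: float) -> str:
    """Serialize the video identifier and the two selected points as a JSON string."""
    if start_time < 0 or end_time <= start_time:
        raise ValueError("expected 0 <= start_time < end_time")
    # Hypothetical field names; only the three pieces of information come from the claim.
    return json.dumps(
        {"video_id": video_id, "start_time": start_time, "end_time": end_time},
        sort_keys=True,
    )
```

Sending only the identifier and the two points, rather than the clipped video data itself, keeps the upload small and leaves the interception to the server, which already holds the source video.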
The terminal device of claim 13, wherein the instructions cause the processor to perform the following operations:

Generating an audio file corresponding to the to-be-recorded video in response to the voice-over instruction;

Sending the audio file to the server, so that the server generates a dubbed video file according to the to-be-recorded video corresponding to the video identifier and the audio file corresponding to the video identifier.
The terminal device of claim 13, wherein the instructions cause the processor to perform the following operations:

Displaying an interaction identifier of the to-be-recorded video sent by the server, the interaction identifier being recognizable by a terminal device to obtain the to-be-recorded video from the server.
A server comprising: a processor and a memory, the memory storing computer readable instructions, the instructions causing the processor to:

Obtaining video information of the to-be-recorded video from the terminal device, where the video information is generated by the terminal device according to a video starting point and a video termination point selected by the user in the played video;

And generating a to-be-recorded video according to the video information.
The server of claim 18, wherein the instructions cause the processor to perform the following operations:

The video information includes a video identifier of the video, information of a starting point of the video, and information of a termination point of the video,

And the to-be-recorded video is intercepted from the video corresponding to the video identifier according to the information of the video starting point and the information of the video termination point.
The server of claim 18, wherein the instructions cause the processor to perform the following operations:

The information of the video starting point includes a first video screenshot of the video corresponding to the video starting point, and the information of the video termination point includes a second video screenshot of the video corresponding to the video termination point,

Determining, according to the first video screenshot and the second video screenshot, the video starting point and the video termination point in the video corresponding to the video identifier, and intercepting, from the video, the video data between the video starting point and the video termination point as the to-be-recorded video.
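One way the screenshot-based determination in this claim might work is sketched below. The application does not say how a screenshot is matched to a position in the video; the sum-of-absolute-differences comparison, the representation of a frame as a flat sequence of grayscale pixel values, and all function names here are assumptions.

```python
from typing import List, Sequence


def frame_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute pixel differences between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b))


def locate(frames: List[Sequence[int]], screenshot: Sequence[int]) -> int:
    """Index of the frame that is closest to the screenshot."""
    return min(range(len(frames)), key=lambda i: frame_distance(frames[i], screenshot))


def intercept_by_screenshots(
    frames: List[Sequence[int]],
    first_screenshot: Sequence[int],
    second_screenshot: Sequence[int],
) -> List[Sequence[int]]:
    """Return the frames from the first matched position to the second, inclusive."""
    start = locate(frames, first_screenshot)
    end = locate(frames, second_screenshot)
    if start > end:
        raise ValueError("starting point lies after termination point")
    return frames[start:end + 1]
```

A nearest-match search tolerates small differences between the screenshot and the stored frame, such as those introduced by re-encoding or scaling on the terminal device.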
The server of claim 18, wherein the instructions cause the processor to perform the following operations:

The information of the video starting point includes a first time in the video corresponding to the video starting point, and the information of the video ending point includes a second time in the video corresponding to the video termination point,

Video data between the first time and the second time is intercepted from the video as the to-be-recorded video.
The server of claim 18, wherein the instructions cause the processor to perform the following operations:

Receiving an audio file sent by the terminal device,

Generating a dubbed video file according to the to-be-recorded video corresponding to the video identifier and the audio file corresponding to the video identifier.
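The generation of the dubbed video file in this claim could, for example, be delegated to an external tool. The sketch below only builds a command line for the ffmpeg program; the use of ffmpeg, the file paths, and the function name `build_mux_command` are assumptions, since the application names no particular tool.

```python
from typing import List


def build_mux_command(video_path: str, audio_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg command that pairs the video stream with the user's audio."""
    return [
        "ffmpeg",
        "-i", video_path,   # input 0: the to-be-recorded video
        "-i", audio_path,   # input 1: the audio file received from the terminal device
        "-map", "0:v:0",    # take the first video stream of input 0
        "-map", "1:a:0",    # take the first audio stream of input 1
        "-c:v", "copy",     # copy the video stream without re-encoding
        "-shortest",        # stop at the end of the shorter input
        output_path,
    ]
```

The returned list can be passed to `subprocess.run` on a server where ffmpeg is installed. Mapping only the video stream of the first input discards the original soundtrack, so the output carries the user's dubbing alone.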
A non-transitory computer readable storage medium storing computer readable instructions for causing at least one processor to perform the method of any one of claims 1 to 7.
A non-transitory computer readable storage medium storing computer readable instructions, which may cause at least one processor to perform the method of any one of claims 8 to 12.