CN117788654B — Three-dimensional face driving method based on voice, model training method and device

Publication number: CN117788654B (granted); earlier publication: CN117788654A
Application number: CN202311766861.6A
Authority: CN (China); original language: Chinese (zh)
Inventors: 杨少雄, 徐颖, 崔宪坤
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Active
Classification (Landscapes): Processing Or Creating Images

Abstract

The present disclosure provides a voice-based three-dimensional face driving method, a model training method and a device, relating to the fields of computer vision, deep learning, augmented reality and virtual reality within artificial intelligence technology, and applicable to scenes such as the metaverse, digital humans and generative artificial intelligence. The method includes: determining a to-be-processed driving sequence of the to-be-processed voice; performing style conversion on the to-be-processed driving sequence according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used for indicating the three-dimensional facial motion of a second object when it outputs the to-be-processed voice, the style conversion model is obtained by training according to a first driving sequence and a second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of a first object when it outputs the target voice, and the second driving sequence is used for indicating the three-dimensional facial motion of the second object when it outputs the target voice; and driving the three-dimensional face model corresponding to the second object to perform facial motions according to the target driving sequence.

Description

Three-dimensional face driving method based on voice, model training method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, augmented reality and virtual reality, can be applied to scenes such as the metaverse, digital humans and generative artificial intelligence, and particularly relates to a voice-based three-dimensional face driving method, a model training method and a model training device.
Background
Currently, with the continued development of artificial intelligence technology, a large number of avatars are created for information transfer. To ensure that the user has a good visual experience, the facial motion of the avatar while it is displayed, particularly its lip motion, needs to be adaptively adjusted to match the voice to be played.
Disclosure of Invention
The present disclosure provides a voice-based three-dimensional face driving method, a model training method and a device, so that the facial motion of a three-dimensional face model better matches the facial motion style of the speaking object.
According to a first aspect of the present disclosure, there is provided a voice-based three-dimensional face driving method, including:
determining a to-be-processed driving sequence of the to-be-processed voice, wherein the to-be-processed driving sequence is used for indicating a three-dimensional facial action of a first object when the first object outputs the to-be-processed voice;
Performing style conversion processing on the driving sequence to be processed according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used for indicating a three-dimensional facial motion of a second object when outputting the voice to be processed, the style conversion model is obtained by training an initial model according to at least one group of first driving sequence and second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of the first object when outputting the target voice, the second driving sequence is used for indicating the three-dimensional facial motion of the second object when outputting the target voice, and the style conversion model is used for outputting the driving sequence conforming to the facial style of the second object when speaking;
And driving the three-dimensional face model corresponding to the second object to perform facial action according to the target driving sequence.
According to a second aspect of the present disclosure, there is provided a training method of a style conversion model, including:
The method comprises the steps of obtaining at least one group of training sets, wherein the training sets comprise a first driving sequence and a second driving sequence, the first driving sequence is used for indicating three-dimensional facial actions when a first object outputs target voice, and the second driving sequence is used for indicating three-dimensional facial actions when a second object outputs the target voice;
Training the initial model according to the at least one training set to obtain a style conversion model, wherein the style conversion model is used for outputting a driving sequence conforming to a target style, and the target style is a face style of a second object when speaking.
According to a third aspect of the present disclosure, there is provided a three-dimensional face driving device based on voice, comprising:
a determining unit, configured to determine a to-be-processed driving sequence of the to-be-processed voice, wherein the to-be-processed driving sequence is used for indicating a three-dimensional facial action of a first object when the first object outputs the to-be-processed voice;
a processing unit, configured to perform style conversion processing on the to-be-processed driving sequence according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used for indicating a three-dimensional facial motion of a second object when outputting the voice to be processed, the style conversion model is obtained by training an initial model according to at least one group of a first driving sequence and a second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of the first object when outputting the target voice, the second driving sequence is used for indicating the three-dimensional facial motion of the second object when outputting the target voice, and the style conversion model is used for outputting a driving sequence which accords with the facial style of the second object when speaking;
and the driving unit is used for driving the three-dimensional facial model corresponding to the second object to perform facial actions according to the target driving sequence.
According to a fourth aspect of the present disclosure, there is provided a training apparatus of a style conversion model, including:
an acquisition unit, configured to acquire at least one group of training sets, wherein the training sets comprise a first driving sequence and a second driving sequence, the first driving sequence is used for indicating the three-dimensional facial action of a first object when outputting target voice, and the second driving sequence is used for indicating the three-dimensional facial action of a second object when outputting the target voice;
The training unit is used for training the initial model according to the at least one group of training sets to obtain a style conversion model, wherein the style conversion model is used for outputting a driving sequence conforming to a target style, and the target style is the face style of the second object when speaking.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or to enable the at least one processor to perform the method of the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect or for causing a computer to perform the method of the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, the computer program being readable by at least one processor of an electronic device from the readable storage medium, the at least one processor executing the computer program causing the electronic device to perform the method of the first aspect or the at least one processor executing the computer program causing the electronic device to perform the method of the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a three-dimensional face driving method based on voice according to an embodiment of the disclosure;
fig. 2 is a flow chart of a second three-dimensional face driving method based on voice according to an embodiment of the disclosure;
fig. 3 is a flow chart of a training method of a style conversion model according to an embodiment of the disclosure;
FIG. 4 is a flowchart of a training method of a second style conversion model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a three-dimensional face driving device based on voice according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of a second voice-based three-dimensional face driving device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of a second training device for a style conversion model according to an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure;
Fig. 10 is a block diagram of an electronic device used to implement a speech-based three-dimensional face driving method or model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, when driving the facial motion of an avatar according to voice, how to control the facial motion of the three-dimensional face model corresponding to the avatar so that the avatar's facial motion style better matches the facial motion style of the user when speaking is a problem to be solved.
In one possible implementation manner, a driving sequence corresponding to each phoneme produced when the user speaks may be constructed manually, where the driving sequence corresponding to a phoneme is used to indicate the facial action of the user when uttering that phoneme. When the facial action of the avatar needs to be driven according to voice, the phonemes contained in the voice are first determined, and a driving sequence set corresponding to the current voice is constructed based on the pre-built correspondence between phonemes and driving sequences. The facial actions of the avatar are then controlled in turn according to the driving sequences included in the set. However, in this implementation manner, the facial actions of the avatar corresponding to different phonemes tend to be disjointed, giving the user a poor viewing experience. It should be noted that the facial motions mentioned in the present disclosure include motions of various parts of the avatar's face, such as the lips, cheeks and forehead.
In another possible implementation manner, facial scanning data of a person while speaking may be acquired, and the driving sequence corresponding to that facial scanning data determined from it. End-to-end model training is then performed on the voice uttered by the user and the obtained driving sequence, so that the model can output, based on the input voice, a driving sequence conforming to the user's facial style when speaking. However, this model training approach requires a large number of data sets, resulting in high data acquisition costs.
In order to avoid at least one of the above technical problems, the inventors of the present disclosure arrived at the inventive concept of the present disclosure through creative effort: when the three-dimensional face model corresponding to a second object needs to be driven, a to-be-processed driving sequence conforming to the facial style of a first object when speaking is first generated; style conversion processing is then performed on the to-be-processed driving sequence according to a style conversion model to obtain a target driving sequence conforming to the facial action style of the second object when speaking; the three-dimensional face model is then driven according to the target driving sequence, so that the three-dimensional face model performs facial actions more smoothly and closer to the facial action style of the second object when speaking.
The present disclosure provides a voice-based three-dimensional face driving method, a model training method and a device, which are applied to the technical fields of computer vision, deep learning, augmented reality and virtual reality within artificial intelligence technology, and can be applied to scenes such as the metaverse, digital humans and generative artificial intelligence, so that the motion of the three-dimensional face model is smoother and conforms to a specific style.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Fig. 1 is a schematic flow chart of a three-dimensional face driving method based on voice according to an embodiment of the disclosure, as shown in fig. 1, the method includes:
S101, determining a to-be-processed driving sequence of the to-be-processed voice, wherein the to-be-processed driving sequence is used for indicating a three-dimensional face action of the first object when the first object outputs the to-be-processed voice.
For example, the execution subject of the present embodiment may be a three-dimensional face driving device based on voice, and the three-dimensional face driving device based on voice may be a server (such as a local server or a cloud server), may be a computer, may be a processor, may be a chip, or the like, and the present embodiment is not limited.
The first object and the second object in the embodiment may be a real person or a cartoon character, which is not particularly limited in the disclosure.
When the driving control is required to be performed on the three-dimensional face model corresponding to the second object according to the voice to be processed, a driving sequence to be processed can be generated according to the voice to be processed.
It should be noted that the to-be-processed driving sequence generated here is a driving sequence conforming to the facial action style of the first object when speaking, and is specifically used to indicate the facial actions of the first object when it outputs the to-be-processed voice.
In one example, the to-be-processed driving sequence may be generated by a model that generates driving sequences corresponding to the first object: the to-be-processed voice is input into the model to obtain the to-be-processed driving sequence. In particular, the model in this example may be an end-to-end model from voice to driving sequence as described above.
S102, performing style conversion processing on a driving sequence to be processed according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used for indicating a three-dimensional facial motion of a second object when the second object outputs voice to be processed, the style conversion model is obtained by training an initial model according to at least one group of first driving sequence and second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of the first object when the first object outputs the target voice, the second driving sequence is used for indicating the three-dimensional facial motion of the second object when the second object outputs the target voice, and the style conversion model is used for outputting the driving sequence conforming to the facial style of the second object when the second object speaks.
In this embodiment, after the to-be-processed driving sequence conforming to the style of the first object is obtained, the pre-trained style conversion model may perform style conversion processing on it to obtain a target driving sequence conforming to the facial action style of the second object when speaking, where the target driving sequence specifically indicates the facial actions of the second object when it speaks the to-be-processed voice.
It should be noted that, the style conversion model in this embodiment is used to convert the style of the driving sequence. Where a driving sequence may be understood as a series of parameters for indicating the action of the model face. Style conversion may be understood as a conversion from a facial action style when one subject speaks to a facial action style when another subject speaks.
In addition, at least one training set may be acquired in advance when training the style conversion model. Each training set includes, for the same voice (i.e., the target voice), a first driving sequence corresponding to the first object and a second driving sequence corresponding to the second object. Specifically, the first driving sequence is a driving sequence conforming to the facial style of the first object when speaking, and is used to indicate the facial actions of the first object when it outputs the target voice. Likewise, the second driving sequence is a driving sequence conforming to the facial style of the second object when speaking, and is used to indicate the facial actions of the second object when it outputs the target voice. Different training sets may correspond to different target voices. The initial model is then trained on these training sets to obtain the style conversion model, which can convert a driving sequence conforming to the first object's speaking facial style into a driving sequence conforming to the second object's speaking facial style.
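For clarity, one group of training sets can be pictured as a simple paired data structure. The following sketch is illustrative only; the class name, field names, array shapes and use of NumPy are assumptions rather than part of the disclosure.

```python
# Illustrative sketch of one training pair; names and shapes are assumptions.
from dataclasses import dataclass
import numpy as np


@dataclass
class DrivingSequencePair:
    """One training set: two driving sequences for the same target voice."""
    target_voice_id: str
    # N frames x D facial parameters, in the first object's speaking style.
    first_sequence: np.ndarray
    # N frames x D facial parameters, in the second object's speaking style.
    second_sequence: np.ndarray

    def __post_init__(self):
        # Both sequences describe the same utterance, so their shapes must match.
        assert self.first_sequence.shape == self.second_sequence.shape
```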
It should be noted that, in this embodiment, the model architecture of the style conversion model is not particularly limited.
And S103, driving the three-dimensional face model corresponding to the second object to perform face action according to the target driving sequence.
In this embodiment, after obtaining the target driving sequence that conforms to the face style of the second object when speaking, the face motion control may be performed on the three-dimensional face model corresponding to the second object according to the target driving sequence, so that the face motion of the three-dimensional face model may conform to the face motion style of the second object when speaking the voice to be processed.
It can be appreciated that in this embodiment, when the three-dimensional face model of the second object is driven according to voice, a to-be-processed driving sequence conforming to the facial style of the first object may be generated first, and style conversion processing is then performed on it with the style conversion model to obtain a target driving sequence conforming to the facial action style of the second object. Compared with generating the target driving sequence with an end-to-end model that maps voice directly to a driving sequence conforming to the second object's facial style, the style conversion model for driving-sequence style conversion provided in this embodiment is easier to train, requires fewer training sets, and can improve the efficiency of three-dimensional face driving.
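The overall flow of S101 to S103 can be summarized in the following sketch; the three callables passed in are hypothetical placeholders for the components described above, not an API defined by the disclosure.

```python
# Hedged sketch of S101-S103; all callables are illustrative placeholders.
def drive_second_object_face(pending_speech,
                             generate_first_style_sequence,
                             style_conversion_model,
                             drive_face_model):
    # S101: driving sequence in the first object's speaking style.
    pending_sequence = generate_first_style_sequence(pending_speech)
    # S102: convert it into the second object's facial-action style.
    target_sequence = style_conversion_model(pending_sequence)
    # S103: drive the second object's three-dimensional face model frame by frame.
    for parameter_set in target_sequence:
        drive_face_model(parameter_set)
```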
Fig. 2 is a flow chart of a second voice-based three-dimensional face driving method according to an embodiment of the disclosure, as shown in fig. 2, the method includes:
s201, determining phoneme information corresponding to the voice to be processed, wherein the phoneme information is a set of phonemes forming the voice to be processed.
In this embodiment, when the three-dimensional face model corresponding to the second object needs to be controlled so that the facial action of the second object when outputting the to-be-processed voice can be accurately simulated, the phoneme information corresponding to the to-be-processed voice is first determined. The phoneme information may be understood as the ordered set of phonemes obtained by splitting the to-be-processed voice into phonemes.
S202, determining a to-be-processed driving sequence according to the mapping relation and the phoneme information, wherein the mapping relation is a corresponding relation between the phonemes and a preset parameter set, the preset parameter set is used for indicating a three-dimensional face action when a first object sends the phonemes, and the to-be-processed driving sequence comprises at least one preset parameter set. The pending driver sequence is for indicating a three-dimensional facial motion of the first object when outputting the pending speech.
In this embodiment, after the phoneme information is obtained, the preset parameter sets corresponding to the phonemes contained in the phoneme information may be determined according to the preset correspondence between phonemes and preset parameter sets, and the sequence formed by these preset parameter sets may be used as the to-be-processed driving sequence in this embodiment.
It should be noted that, the above preset parameter set may be understood as a face parameter corresponding to a face action of the first object when the first object emits a phoneme corresponding to the set.
It can be appreciated that in this embodiment, the to-be-processed driving sequence may be generated by splicing the preset parameter sets corresponding to the phonemes in the to-be-processed voice, which avoids the long training time that would be required to train an end-to-end model from voice to a driving sequence conforming to the first object's speaking facial style.
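A minimal sketch of the splicing described in S201 to S202 is given below, assuming the phoneme-to-parameter-set mapping is already available as a lookup table; the function name and data layout are illustrative assumptions.

```python
# Minimal sketch of S201-S202, assuming the mapping relation is a Python dict.
import numpy as np


def build_pending_driving_sequence(pending_speech_phonemes, phoneme_to_params):
    """Concatenate the preset parameter set of every phoneme in order.

    pending_speech_phonemes: ordered list of phonemes forming the speech (S201).
    phoneme_to_params: dict mapping a phoneme to a (frames, D) array describing
        the first object's facial action when uttering that phoneme.
    """
    parameter_sets = [phoneme_to_params[p] for p in pending_speech_phonemes]
    # The spliced result is the to-be-processed driving sequence (S202).
    return np.concatenate(parameter_sets, axis=0)
```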
S203, performing style conversion processing on the driving sequence to be processed according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used for indicating a three-dimensional facial motion of a second object when the second object outputs voice to be processed, the style conversion model is obtained by training an initial model according to at least one group of first driving sequence and second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of the first object when the first object outputs the target voice, the second driving sequence is used for indicating the three-dimensional facial motion of the second object when the second object outputs the target voice, and the style conversion model is used for outputting the driving sequence conforming to the facial style of the second object when the second object speaks.
For example, in this embodiment, the specific principle of step S203 may refer to step S102 and is not described here again. In addition, when the to-be-processed driving sequence is acquired in the manner of S201 to S202, the first driving sequence used in training the style conversion model may also be acquired in the manner of S201 to S202.
S204, driving the three-dimensional face model corresponding to the second object to perform face action according to the target driving sequence.
For example, the specific principle of step S204 may be referred to step S103, which is not described herein.
It can be understood that in this embodiment, style conversion processing is performed, through the style conversion model, on the to-be-processed driving sequence spliced from the parameter sets corresponding to the phonemes, and facial action control is performed on the three-dimensional face model corresponding to the second object based on the converted target driving sequence. This improves the smoothness of the facial action of the three-dimensional face model and makes it conform to the facial style of the second object when speaking, thereby improving the user's viewing experience.
In one example, the first driving sequence comprises N first parameter sets, the first parameter sets are used for indicating three-dimensional facial actions of the first object under a time frame corresponding to the first parameter sets, N is a positive integer greater than 1, the style conversion model is obtained by carrying out parameter adjustment on the initial model according to a third driving sequence and a second driving sequence, the third driving sequence is output by the initial model according to the N first parameter sets, and the third driving sequence comprises N third parameter sets. It should be noted that, the specific principles herein may be referred to the description in the embodiment of fig. 4, and will not be repeated herein.
In one example, the style conversion model is obtained by performing parameter adjustment on the initial model according to a first face model and a second face model, wherein the first face model is obtained by performing parameter adjustment on a preset face model according to a third driving sequence, and the second face model is obtained by performing parameter adjustment on the preset face model according to a second driving sequence. It should be noted that, the specific principles herein may be referred to the description in the embodiment of fig. 4, and will not be repeated herein.
In one example, the style conversion model is obtained by performing parameter adjustment on an initial model according to a first loss function and a second loss function, the first loss function is obtained according to a third driving sequence and a second driving sequence, the second driving sequence comprises N second parameter sets, the second parameter sets are used for indicating three-dimensional facial actions of a second object under a time frame corresponding to the second parameter sets, the second loss function is obtained according to a first facial model and a second facial model, the first facial model is obtained by performing parameter adjustment on a preset facial model according to the third driving sequence, and the second facial model is obtained by performing parameter adjustment on the preset facial model according to the second driving sequence. It should be noted that, the specific principles herein may be referred to the description in the embodiment of fig. 4, and will not be repeated herein.
Fig. 3 is a flow chart of a training method of a style conversion model according to an embodiment of the present disclosure, as shown in fig. 3, where the method includes:
S301, at least one group of training sets is obtained, wherein the training sets comprise a first driving sequence and a second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of a first object when the first object outputs target voice, and the second driving sequence is used for indicating the three-dimensional facial motion of a second object when the second object outputs target voice.
The execution body of the embodiment may be a training device of a style conversion model, and the training device of the style conversion model may be a server (such as a local server or a cloud server), may also be a computer, may also be a processor, may also be a chip, or the like, which is not limited in this embodiment. The training device of the style conversion model may be the same device as the three-dimensional face driving device based on voice, or may be a different device.
The technical principle of step S301 may be referred to step S102, and will not be described herein.
S302, training the initial model according to at least one training set to obtain a style conversion model, wherein the style conversion model is used for outputting a driving sequence conforming to a target style, and the target style is a face style when a second object speaks.
In one example, the first driving sequence may include a plurality of driving parameter sets, each with a corresponding time frame. A driving parameter set in the first driving sequence may be understood as a parameter set describing the facial action of the first object, under the time frame corresponding to that set, in the process of the first object outputting the target voice. Likewise, the second driving sequence may include a plurality of driving parameter sets; a driving parameter set in the second driving sequence may be understood as a parameter set describing the facial action of the second object, under the time frame corresponding to that set, in the process of the second object outputting the target voice.
When training the initial model according to the training set, a driving parameter set corresponding to a target time frame can be selected according to a first driving sequence, and a driving parameter set corresponding to the target time frame can be selected in a second driving sequence to perform model training. That is, the input of the initial model is a set of parameters in the corresponding first driving sequence at one time frame.
It can be appreciated that, in this embodiment, the style conversion model is obtained by training with the training set, so that the driving sequence can be subsequently style-converted based on the style conversion model, so as to quickly obtain the driving sequence conforming to the style of the speaking object, and control the three-dimensional facial model corresponding to the speaking object.
Fig. 4 is a flow chart of a training method of a second style conversion model according to an embodiment of the present disclosure, as shown in fig. 4, the method includes:
S401, at least one group of training sets is obtained, wherein the training sets comprise a first driving sequence and a second driving sequence, the first driving sequence is used for indicating the three-dimensional facial motion of a first object when the first object outputs target voice, and the second driving sequence is used for indicating the three-dimensional facial motion of a second object when the second object outputs target voice.
The execution body of the embodiment may be a training device of a style conversion model, and the training device of the style conversion model may be a server (such as a local server or a cloud server), may also be a computer, may also be a processor, may also be a chip, or the like, which is not limited in this embodiment. The training device of the style conversion model may be the same device as the three-dimensional face driving device based on voice, or may be a different device.
The first driving sequence comprises N first parameter sets, wherein the first parameter sets are used for indicating three-dimensional facial actions of the first object under a time frame corresponding to the first parameter sets, and N is a positive integer greater than 1.
That is, in this embodiment, the first driving sequence is a sequence composed of N first parameter sets, and the first parameter sets correspond to time frames in the target voice. The first parameter set may be specifically understood as a three-dimensional facial action corresponding to the first object when outputting the voice under the corresponding time frame in the target voice.
In one example, step S401 includes: determining phoneme information corresponding to the target voice, wherein the phoneme information is the set of phonemes forming the target voice; determining a first driving sequence according to a mapping relation and the phoneme information, wherein the mapping relation is the correspondence between phonemes and preset parameter sets, a preset parameter set is used for representing the three-dimensional facial action of the first object when sending out the phoneme, and the first driving sequence includes at least one preset parameter set; acquiring a second driving sequence; and taking the second driving sequence and the first driving sequence as one group of training sets.
The specific principle of the acquisition manner of the first driving sequence in this embodiment is similar to that of steps S201 to S202, and will not be described here again.
In addition, the second driving sequence may be obtained by using the end-to-end voice-to-driving-sequence model mentioned in the related art, which is not described here again, or may be obtained by capturing facial images of the second object when the second object speaks.
It can be understood that in this embodiment, the style conversion model is trained by using the first driving sequence spliced from the preset parameter sets corresponding to the phonemes as the sequence conforming to the first object's facial style, so that the smoothness of the final three-dimensional face model's actions can be improved, the training data is simpler to obtain, and the complexity of training an additional model to generate the corresponding driving sequence is avoided.
S402, inputting the N first parameter sets into an initial model to obtain a third driving sequence, wherein the third driving sequence comprises the N third parameter sets.
In this embodiment, when the initial model is trained, N first parameter sets corresponding to N time frames are input into the initial model at the same time, and style conversion processing is performed on the N first parameter sets by the initial model, so that the initial model can combine correlations between facial actions indicated by the first parameter sets of the time frames to obtain a driving sequence composed of N third parameter sets.
S403, according to the third driving sequence and the second driving sequence, carrying out parameter adjustment on the initial model to obtain a style conversion model, wherein the style conversion model is used for outputting the driving sequence which accords with a target style, and the target style is the face style of the second object when speaking.
In this embodiment, after the third driving sequence output by the initial model is obtained, the second driving sequence and the third driving sequence in the training set may be combined to perform parameter adjustment on the initial model, so that the style conversion model obtained after training meets the preset training stop condition. It should be noted that, the stopping conditions of the model training in this embodiment are similar to the setting manner in the related art, and will not be described here again.
In one example, when the initial model is parameter-adjusted based on the third driving sequence and the second driving sequence, it may be parameter-adjusted in combination with a loss function obtained from the third driving sequence and the second driving sequence. The loss function generated from the third driving sequence and the second driving sequence may be any of the loss functions provided in the related art, such as the L1 norm loss function, the mean square error loss function or the cross entropy loss function, which is not particularly limited in this embodiment.
It can be understood that when a user speaks, the change of the user's facial motion is temporally continuous, that is, the facial motion at a given moment is generally affected by the facial motion at the previous moment. Therefore, when the initial model is trained, the first parameter sets under a plurality of consecutive time frames are input into the initial model simultaneously, so that when the initial model performs style conversion on the input data it can fully exploit the correlation between facial motions under adjacent time frames, improving the accuracy of the style conversion result. Moreover, because two different objects outputting the same segment of voice do not open and close their mouths at exactly the same moments, frame-by-frame training would introduce large noise and make the final facial action unsmooth; inputting the whole sequence avoids this problem.
In one example, step S403 includes the steps of:
according to the third driving sequence, parameter adjustment is carried out on the preset face model to obtain a first face model, parameter adjustment is carried out on the preset face model according to the second driving sequence to obtain a second face model, and parameter adjustment is carried out on the initial model according to the first face model and the second face model to obtain a style conversion model.
In this embodiment, when the initial model parameters are adjusted according to the third driving sequence and the second driving sequence, the same preset face model may be adjusted according to each of them respectively: the facial actions in the preset face model are adjusted according to the third driving sequence to obtain the first face model, and the facial actions in the preset face model are adjusted according to the second driving sequence to obtain the second face model.
The preset face model may be understood as a three-dimensional face model constructed according to a face corresponding to a speaking object, specifically, the preset face model herein may select a three-dimensional face model corresponding to a second object, or may select three-dimensional face models corresponding to other speaking objects, which is not limited in this embodiment.
After the first and second facial models are obtained, parameters of the initial model may be adjusted based on differences between the first and second facial models. For example, the loss function may be constructed by extracting position information corresponding to at least one key point in the first face model and position information corresponding to at least one key point in the second face model, and then performing parameter adjustment on the initial model based on the obtained loss function.
It can be understood that in this embodiment, the preset face model is driven by the third driving sequence obtained by the model, and further facial motion verification is performed on the preset face model, so as to further ensure accuracy and rationality of the third driving sequence output by the model, avoid the phenomenon of over-fitting of the model, and improve the training efficiency of the model.
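The second loss built from the two driven face models might be computed as in the following sketch, which assumes a linear blendshape-style face parameterization and a squared-distance comparison of key points; neither choice is fixed by the disclosure.

```python
# Sketch of the face-model loss: drive the same preset face model with the
# third and the second driving sequence and compare key-point positions.
# The linear blendshape parameterization is an illustrative assumption.
import torch


def keypoint_loss(third_sequence, second_sequence, template_keypoints, blendshapes):
    """third_sequence, second_sequence: (N, D) driving parameters.
    template_keypoints: (K, 3) neutral key-point positions of the preset model.
    blendshapes: (D, K, 3) per-parameter key-point offsets (assumed).
    """
    # Parameter-adjust the preset face model with each driving sequence.
    first_face = template_keypoints + torch.einsum("nd,dkc->nkc", third_sequence, blendshapes)
    second_face = template_keypoints + torch.einsum("nd,dkc->nkc", second_sequence, blendshapes)
    # Compare the key-point positions of the two driven face models.
    return torch.mean((first_face - second_face) ** 2)
```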
In one example, step S403 includes the steps of:
Determining a first loss function according to a third driving sequence and a second driving sequence, wherein the second driving sequence comprises N second parameter sets, the second parameter sets are used for indicating three-dimensional facial actions of a second object under a time frame corresponding to the second parameter sets, according to the third driving sequence, parameter adjustment is conducted on a preset facial model to obtain a first facial model, parameter adjustment is conducted on the preset facial model according to the second driving sequence to obtain a second facial model, the second loss function is determined according to the first facial model and the second facial model, and according to the first loss function and the second loss function, parameter adjustment is conducted on an initial model to obtain a style conversion model.
In this embodiment, on the basis of the above example, the second driving sequence also includes N second parameter sets corresponding to time frames one by one, where the second parameter sets may be understood as facial parameters of the second object under one time frame in the process of sending out the target voice.
After the third driving sequence and the second driving sequence are acquired, a first loss function may be constructed based on the third driving sequence and the second driving sequence. In addition, a second loss function may be constructed according to the first face model and the second face model obtained by driving the preset face model with the third driving sequence and the second driving sequence respectively. Then, the initial model is parameter-adjusted by combining the first loss function and the second loss function, so as to train the style conversion model.
It can be appreciated that in this embodiment, the initial model may be trained by combining the first loss function constructed by the differences between the driving sequences and the second loss function constructed by the differences between the face models, so that the model training efficiency may be improved, and the model over-fitting phenomenon and the phenomenon of unreasonable facial actions during face driving are avoided, so as to improve the accuracy of style conversion performed by the style conversion model.
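One training step combining the first loss function and the second loss function might look like the following sketch; the L1 choice for the first loss, the loss weights, the optimizer handling and the externally supplied face-model loss (for example, the key-point sketch above) are assumptions for illustration.

```python
# Hedged sketch of a training step with both losses; hyperparameters assumed.
import torch


def train_step(initial_model, optimizer, first_sequence, second_sequence,
               face_model_loss, w1=1.0, w2=1.0):
    """face_model_loss: callable that drives a preset face model with two
    driving sequences and returns their difference (second loss)."""
    # S402: input all N first parameter sets at once so the model can exploit
    # correlations between adjacent time frames.
    third_sequence = initial_model(first_sequence)
    # First loss: sequence-level difference between third and second sequences.
    loss_1 = torch.nn.functional.l1_loss(third_sequence, second_sequence)
    # Second loss: difference between the two driven preset face models.
    loss_2 = face_model_loss(third_sequence, second_sequence)
    loss = w1 * loss_1 + w2 * loss_2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```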
In one example, the style conversion model includes M one-dimensional convolution layers, where the one-dimensional convolution layers are used to convolve input data to obtain a processing result, the size of the processing result is the same as that of the input data, and M is a positive integer greater than 1.
Illustratively, the style conversion model in this embodiment specifically includes M one-dimensional convolution layers. When each one-dimensional convolution layer performs a convolution operation on its input data, the resulting convolution result has the same size as that input data. That is, the convolution operation of the one-dimensional convolution layers in this embodiment does not change the size of the input data, and this model construction ensures that the input and output of the style conversion model have consistent sizes.
For example, a full convolution network may be used as the model architecture of the style conversion model in practical applications, and further the step size of each convolution layer in the full convolution network may be set to 1 to ensure consistency of the input and output sizes of each convolution layer.
It can be understood that in this embodiment, the style conversion processing is performed on the input driving sequence by setting a plurality of one-dimensional convolution layers, so that the model structure is simple, and consistency of the input and output data sizes can be ensured, so as to ensure fluency of the facial motion during final facial driving.
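A possible realization of such a model with M one-dimensional convolution layers is sketched below in PyTorch; the layer count, channel width, kernel size and activation are assumptions, since the disclosure does not fix them.

```python
# Sketch of a fully convolutional 1D style conversion model whose output has
# the same size as the input; all hyperparameters are illustrative assumptions.
import torch
from torch import nn


class StyleConversionModel(nn.Module):
    def __init__(self, num_params, hidden=128, num_layers=4, kernel_size=3):
        super().__init__()
        layers = []
        channels = [num_params] + [hidden] * (num_layers - 1) + [num_params]
        for i in range(num_layers):
            # stride=1 with symmetric padding keeps the temporal length
            # unchanged, so every layer's output matches its input size.
            layers.append(nn.Conv1d(channels[i], channels[i + 1], kernel_size,
                                    stride=1, padding=kernel_size // 2))
            if i < num_layers - 1:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, driving_sequence):
        # driving_sequence: (batch, N frames, D parameters)
        x = driving_sequence.transpose(1, 2)  # Conv1d expects (batch, D, N)
        return self.net(x).transpose(1, 2)
```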
In one example, the style conversion model may also obtain a model output result consistent with the size of the input driving sequence after a plurality of downsampling processes and upsampling processes are sequentially performed on the input driving sequence by a plurality of sequentially connected convolution layers.
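A hedged sketch of this down-sampling/up-sampling variant is given below; the specific strides, kernel sizes and the assumption that the sequence length is divisible by four are illustrative choices, not values from the disclosure.

```python
# Sketch of the down/upsampling variant: strided Conv1d layers halve the
# temporal length and ConvTranspose1d layers restore it, so the output size
# again matches the input driving sequence. Assumes N is divisible by 4.
import torch
from torch import nn


class DownUpStyleConverter(nn.Module):
    def __init__(self, num_params, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(num_params, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, num_params, 4, stride=2, padding=1),
        )

    def forward(self, driving_sequence):
        x = driving_sequence.transpose(1, 2)  # (batch, D, N)
        return self.decoder(self.encoder(x)).transpose(1, 2)
```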
Fig. 5 is a schematic structural diagram of a three-dimensional voice-based face driving device according to an embodiment of the present disclosure, and as shown in fig. 5, a three-dimensional voice-based face driving device 500 includes:
a determining unit 501, configured to determine a to-be-processed driving sequence of the to-be-processed voice, where the to-be-processed driving sequence is used to instruct the first object to output the three-dimensional facial motion when the to-be-processed voice is output.
The processing unit 502 is configured to perform style conversion processing on a driving sequence to be processed according to a style conversion model to obtain a target driving sequence, where the target driving sequence is configured to instruct a second object to output a three-dimensional facial motion when the voice to be processed is output, the style conversion model is obtained by training an initial model according to at least one set of a first driving sequence and a second driving sequence, the first driving sequence is configured to instruct the first object to output the three-dimensional facial motion when the target voice is output, the second driving sequence is configured to instruct the second object to output the three-dimensional facial motion when the target voice is output, and the style conversion model is configured to output a driving sequence conforming to a facial style when the second object speaks.
And a driving unit 503, configured to drive the three-dimensional facial model corresponding to the second object to perform facial motion according to the target driving sequence.
The device provided in this embodiment is configured to implement the technical scheme provided by the method, and the implementation principle and the technical effect are similar and are not repeated.
Fig. 6 is a schematic structural diagram of a second three-dimensional voice-based face driving device according to an embodiment of the present disclosure, as shown in fig. 6, a three-dimensional voice-based face driving device 600, including:
a determining unit 601, configured to determine a to-be-processed driving sequence of the to-be-processed voice, where the to-be-processed driving sequence is used to instruct the first object to output the three-dimensional facial motion when the to-be-processed voice is output.
The processing unit 602 is configured to perform style conversion processing on a driving sequence to be processed according to a style conversion model to obtain a target driving sequence, where the target driving sequence is used to instruct a second object to output a three-dimensional facial motion when the voice to be processed is output, the style conversion model is obtained by training an initial model according to at least one set of a first driving sequence and a second driving sequence, the first driving sequence is used to instruct the first object to output the three-dimensional facial motion when the target voice is output, the second driving sequence is used to instruct the second object to output the three-dimensional facial motion when the target voice is output, and the style conversion model is used to output the driving sequence conforming to the facial style when the second object speaks.
The driving unit 603 is configured to drive the three-dimensional face model corresponding to the second object to perform a facial action according to the target driving sequence.
In one example, the determining unit 601 includes:
A first determining module 6011, configured to determine phoneme information corresponding to a to-be-processed voice, where the phoneme information is a set of phonemes that form the to-be-processed voice;
the second determining module 6012 is configured to determine a to-be-processed driving sequence according to the mapping relationship and the phoneme information, where the mapping relationship is a correspondence relationship between the phonemes and a preset parameter set, the preset parameter set is configured to instruct a three-dimensional facial motion when the first object sends the phonemes, and the to-be-processed driving sequence includes at least one preset parameter set.
In one example, the first driving sequence comprises N first parameter sets, the first parameter sets are used for indicating three-dimensional facial actions of the first object under a time frame corresponding to the first parameter sets, N is a positive integer greater than 1, the style conversion model is obtained by carrying out parameter adjustment on the initial model according to a third driving sequence and a second driving sequence, the third driving sequence is output by the initial model according to the N first parameter sets, and the third driving sequence comprises N third parameter sets.
In one example, the style conversion model is obtained by performing parameter adjustment on the initial model according to a first face model and a second face model, wherein the first face model is obtained by performing parameter adjustment on a preset face model according to a third driving sequence, and the second face model is obtained by performing parameter adjustment on the preset face model according to a second driving sequence.
In one example, the style conversion model is obtained by performing parameter adjustment on an initial model according to a first loss function and a second loss function, the first loss function is obtained according to a third driving sequence and a second driving sequence, the second driving sequence comprises N second parameter sets, the second parameter sets are used for indicating three-dimensional facial actions of a second object under a time frame corresponding to the second parameter sets, the second loss function is obtained according to a first facial model and a second facial model, the first facial model is obtained by performing parameter adjustment on a preset facial model according to the third driving sequence, and the second facial model is obtained by performing parameter adjustment on the preset facial model according to the second driving sequence.
In one example, the style conversion model includes M one-dimensional convolution layers, where the one-dimensional convolution layers are used to convolve input data to obtain a processing result, the size of the processing result is the same as that of the input data, and M is a positive integer greater than 1.
The device provided in this embodiment is configured to implement the technical scheme provided by the method, and the implementation principle and the technical effect are similar and are not repeated.
Fig. 7 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present disclosure, and as shown in fig. 7, a training device 700 for a style conversion model includes:
the obtaining unit 701 is configured to obtain at least one set of training sets, where the training sets include a first driving sequence and a second driving sequence, the first driving sequence is configured to instruct a three-dimensional facial motion when the first object outputs the target voice, and the second driving sequence is configured to instruct a three-dimensional facial motion when the second object outputs the target voice.
Training unit 702, configured to train the initial model according to at least one training set to obtain a style conversion model, where the style conversion model is configured to output a driving sequence conforming to a target style, and the target style is the facial style of the second object when speaking.
The device provided in this embodiment is configured to implement the technical scheme provided by the method, and the implementation principle and the technical effect are similar and are not repeated.
Fig. 8 is a schematic structural diagram of a training device for a style conversion model according to an embodiment of the present disclosure, and as shown in fig. 8, a training device 800 for a style conversion model includes:
The obtaining unit 801 is configured to obtain at least one set of training sets, where the training sets include a first driving sequence and a second driving sequence, the first driving sequence is configured to instruct a three-dimensional facial motion when the first object outputs the target voice, and the second driving sequence is configured to instruct the three-dimensional facial motion when the second object outputs the target voice.
Training unit 802, configured to train the initial model according to at least one training set to obtain a style conversion model, where the style conversion model is configured to output a driving sequence conforming to a target style, and the target style is the facial style of the second object when speaking.
In one example, the first driving sequence comprises N first parameter sets, wherein the first parameter sets are used for indicating three-dimensional facial actions of the first object under a time frame corresponding to the first parameter sets;
training unit 802, comprising:
the first obtaining module 8021 is configured to input N first parameter sets to the initial model to obtain a third driving sequence, where the third driving sequence includes N third parameter sets;
The adjusting module 8022 is configured to perform parameter adjustment on the initial model according to the third driving sequence and the second driving sequence, so as to obtain a style conversion model.
In one example, the adjustment module 8022 includes:
a first adjusting submodule 80221, configured to perform parameter adjustment on a preset face model according to a third driving sequence, so as to obtain a first face model;
The second adjusting submodule 80222 is configured to perform parameter adjustment on the preset face model according to a second driving sequence to obtain a second face model;
And a third adjustment submodule 80223, configured to perform parameter adjustment on the initial model according to the first face model and the second face model, so as to obtain a style conversion model.
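One possible reading of "performing parameter adjustment on a preset face model according to a driving sequence" is a linear blendshape evaluation, sketched below purely as an assumption; the names base_vertices, blendshapes, and drive_face_model are hypothetical and not taken from this disclosure.

```python
# Hedged sketch: drive a preset face model with a driving sequence by applying
# each parameter set as blendshape weights. The blendshape formulation is an
# assumption made for illustration only.
import torch

def drive_face_model(base_vertices, blendshapes, parameter_sets):
    """
    base_vertices : (V, 3) preset (neutral) face mesh
    blendshapes   : (param_dim, V, 3) per-parameter vertex offsets
    parameter_sets: (N, param_dim) one parameter set per time frame
    returns       : (N, V, 3) driven face model, one mesh per frame
    """
    offsets = torch.einsum('np,pvk->nvk', parameter_sets, blendshapes)
    return base_vertices.unsqueeze(0) + offsets

# first_face_model  = drive_face_model(base_vertices, blendshapes, third_sequence)
# second_face_model = drive_face_model(base_vertices, blendshapes, second_sequence)
# A loss between the two driven meshes can then be back-propagated into the initial model.
```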
In one example, the adjustment module 8022 includes:
The first determining submodule is configured to determine a first loss function according to the third driving sequence and the second driving sequence, where the second driving sequence includes N second parameter sets, and each second parameter set is used to indicate the three-dimensional facial action of the second object in the time frame corresponding to that second parameter set;
The fourth adjustment sub-module is used for carrying out parameter adjustment on the preset face model according to the third driving sequence to obtain a first face model;
A fifth adjustment sub-module, configured to perform parameter adjustment on the preset face model according to the second driving sequence, to obtain a second face model;
A second determination submodule for determining a second loss function from the first face model and the second face model;
And the sixth adjustment sub-module is used for carrying out parameter adjustment on the initial model according to the first loss function and the second loss function to obtain a style conversion model.
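A minimal sketch of how the first and second loss functions might be combined for the parameter adjustment is given below; the equal weighting and the mean-squared-error form are assumptions rather than requirements of this disclosure.

```python
# Possible combination of the two loss terms: a sequence-level loss between the
# third and second driving sequences, plus a vertex-level loss between the two
# driven face models. The weighting factor is an assumption.
import torch.nn.functional as F

def combined_loss(third_sequence, second_sequence,
                  first_face_model, second_face_model, vertex_weight=1.0):
    sequence_loss = F.mse_loss(third_sequence, second_sequence)    # first loss function
    vertex_loss = F.mse_loss(first_face_model, second_face_model)  # second loss function
    return sequence_loss + vertex_weight * vertex_loss
```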
In one example, the obtaining unit 801 includes:
A third determining module 8011, configured to determine phoneme information corresponding to the target speech, where the phoneme information is a set of phonemes that form the target speech;
A fourth determining module 8012, configured to determine the first driving sequence according to the mapping relationship and the phoneme information, where the mapping relationship is the correspondence between phonemes and preset parameter sets;
a second obtaining module 8013, configured to obtain a second driving sequence;
Fifth determining module 8014 is configured to take the second driving sequence and the first driving sequence as a set of training sets.
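The mapping from phoneme information to a first driving sequence could, for example, be a simple lookup table, as in the hypothetical sketch below; the phoneme symbols and parameter values are invented placeholders, not values taken from this disclosure.

```python
# Toy illustration: build the first driving sequence by looking up a preset
# parameter set for each phoneme of the target speech.
import numpy as np

phoneme_to_parameters = {            # mapping relationship: phoneme -> preset parameter set
    "AA": np.array([0.8, 0.1, 0.0]),
    "M":  np.array([0.0, 0.9, 0.0]),
    "S":  np.array([0.1, 0.0, 0.7]),
}

def build_first_driving_sequence(phoneme_info):
    """phoneme_info: list of phonemes making up the target speech, one per time frame."""
    return np.stack([phoneme_to_parameters[p] for p in phoneme_info])

sequence = build_first_driving_sequence(["M", "AA", "S"])  # shape (3, 3): N preset parameter sets
```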
The device provided in this embodiment is configured to implement the technical solution provided by the foregoing method; the implementation principle and technical effect are similar and are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
The present disclosure provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method provided in any one of the embodiments described above.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure, and as shown in fig. 9, an electronic device 900 in the disclosure may include a processor 901 and a memory 902.
The memory 902 is used to store programs. The memory 902 may include volatile memory, such as random-access memory (RAM), static random-access memory (SRAM), and double data rate synchronous dynamic random-access memory (DDR SDRAM), and may also include non-volatile memory, such as flash memory. The memory 902 is used to store computer programs (e.g., application programs and functional modules that implement the methods described above), computer instructions, and data, which may be stored in one or more memories 902 in a partitioned manner and may be called by the processor 901.
The processor 901 is configured to execute the computer programs stored in the memory 902 to implement the steps of the methods in the above embodiments. Reference may be made in particular to the description of the foregoing method embodiments.
The processor 901 and the memory 902 may be separate structures or may be integrated structures. When the processor 901 and the memory 902 are separate structures, the memory 902 and the processor 901 may be coupled by a bus 903.
The electronic device in this embodiment may execute the technical solution of the above methods; the specific implementation process and technical principle are the same and are not described here again.
The present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method provided by any one of the embodiments described above.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The various components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, e.g., a keyboard, a mouse, etc.; an output unit 1007, e.g., various types of displays, speakers, etc.; a storage unit 1008, e.g., a magnetic disk, an optical disk, etc.; and a communication unit 1009, e.g., a network card, a modem, a wireless communication transceiver, etc. The communication unit 1009 allows the device 1000 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, such as the speech-based three-dimensional face driving method and the model training method. For example, in some embodiments, the speech-based three-dimensional face driving method and the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the speech-based three-dimensional face driving method and the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the speech-based three-dimensional face driving method and the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship between the client and the server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (27)

1. A speech-based three-dimensional face driving method, comprising: determining a to-be-processed driving sequence of to-be-processed speech, wherein the to-be-processed driving sequence is used to indicate a three-dimensional facial action of a first object when outputting the to-be-processed speech; performing style conversion processing on the to-be-processed driving sequence according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used to indicate a three-dimensional facial action of a second object when outputting the to-be-processed speech, the style conversion model is obtained by performing parameter adjustment on an initial model according to a third driving sequence and a second driving sequence, the third driving sequence is obtained by inputting N first parameter sets included in a first driving sequence into the initial model, N is a positive integer greater than 1, the first driving sequence is used to indicate a three-dimensional facial action of the first object when outputting target speech, the second driving sequence is used to indicate a three-dimensional facial action of the second object when outputting the target speech, and the style conversion model is used to output a driving sequence that conforms to the facial style of the second object when speaking; and driving, according to the target driving sequence, a three-dimensional facial model corresponding to the second object to perform facial actions.
2. The method according to claim 1, wherein determining the to-be-processed driving sequence of the to-be-processed speech comprises: determining phoneme information corresponding to the to-be-processed speech, wherein the phoneme information is a set of phonemes constituting the to-be-processed speech; and determining the to-be-processed driving sequence according to a mapping relationship and the phoneme information, wherein the mapping relationship is a correspondence between phonemes and preset parameter sets, the preset parameter set is used to indicate a three-dimensional facial action of the first object when producing a phoneme, and the to-be-processed driving sequence includes at least one preset parameter set.
3. The method according to claim 1 or 2, wherein the first parameter set is used to indicate a three-dimensional facial action of the first object in the time frame corresponding to the first parameter set, and the third driving sequence includes N third parameter sets.
4. The method according to claim 3, wherein the style conversion model is obtained by performing parameter adjustment on the initial model according to a first facial model and a second facial model, wherein the first facial model is obtained by performing parameter adjustment on a preset facial model according to the third driving sequence, and the second facial model is obtained by performing parameter adjustment on the preset facial model according to the second driving sequence.
5. The method according to claim 3, wherein the style conversion model is obtained by performing parameter adjustment on the initial model according to a first loss function and a second loss function; the first loss function is obtained according to the third driving sequence and the second driving sequence, wherein the second driving sequence includes N second parameter sets, and the second parameter set is used to indicate a three-dimensional facial action of the second object in the time frame corresponding to the second parameter set; and the second loss function is obtained according to a first facial model and a second facial model, wherein the first facial model is obtained by performing parameter adjustment on a preset facial model according to the third driving sequence, and the second facial model is obtained by performing parameter adjustment on the preset facial model according to the second driving sequence.
6. The method according to any one of claims 1-2 and 4-5, wherein the style conversion model includes M one-dimensional convolution layers, the one-dimensional convolution layers are used to perform convolution processing on input data to obtain a processing result, the size of the processing result is the same as the size of the input data, and M is a positive integer greater than 1.
7. A training method for a style conversion model, comprising: acquiring at least one training set, wherein the training set includes a first driving sequence and a second driving sequence, the first driving sequence is used to indicate a three-dimensional facial action of a first object when outputting target speech, and the second driving sequence is used to indicate a three-dimensional facial action of a second object when outputting the target speech; inputting N first parameter sets included in the first driving sequence into an initial model to obtain a third driving sequence; and performing parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain a style conversion model, wherein N is a positive integer greater than 1, the style conversion model is used to output a driving sequence that conforms to a target style, and the target style is the facial style of the second object when speaking.
8. The method according to claim 7, wherein the first parameter set is used to indicate a three-dimensional facial action of the first object in the time frame corresponding to the first parameter set, and the third driving sequence includes N third parameter sets.
9. The method according to claim 8, wherein performing parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain the style conversion model comprises: performing parameter adjustment on a preset facial model according to the third driving sequence to obtain a first facial model; performing parameter adjustment on the preset facial model according to the second driving sequence to obtain a second facial model; and performing parameter adjustment on the initial model according to the first facial model and the second facial model to obtain the style conversion model.
10. The method according to claim 8, wherein performing parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain the style conversion model comprises: determining a first loss function according to the third driving sequence and the second driving sequence, wherein the second driving sequence includes N second parameter sets, and the second parameter set is used to indicate a three-dimensional facial action of the second object in the time frame corresponding to the second parameter set; performing parameter adjustment on a preset facial model according to the third driving sequence to obtain a first facial model, performing parameter adjustment on the preset facial model according to the second driving sequence to obtain a second facial model, and determining a second loss function according to the first facial model and the second facial model; and performing parameter adjustment on the initial model according to the first loss function and the second loss function to obtain the style conversion model.
11. The method according to any one of claims 7-10, wherein acquiring at least one training set comprises: determining phoneme information corresponding to the target speech, wherein the phoneme information is a set of phonemes constituting the target speech; determining the first driving sequence according to a mapping relationship and the phoneme information, wherein the mapping relationship is a correspondence between phonemes and preset parameter sets, the preset parameter set is used to characterize a three-dimensional facial action when producing a phoneme, and the first driving sequence includes at least one preset parameter set; and acquiring the second driving sequence, and using the second driving sequence and the first driving sequence as one training set.
12. The method according to any one of claims 7-10, wherein the style conversion model includes M one-dimensional convolution layers, the one-dimensional convolution layers are used to perform convolution processing on input data to obtain a processing result, the size of the processing result is the same as the size of the input data, and M is a positive integer greater than 1.
13. A speech-based three-dimensional face driving device, comprising: a determining unit, configured to determine a to-be-processed driving sequence of to-be-processed speech, wherein the to-be-processed driving sequence is used to indicate a three-dimensional facial action of a first object when outputting the to-be-processed speech; a processing unit, configured to perform style conversion processing on the to-be-processed driving sequence according to a style conversion model to obtain a target driving sequence, wherein the target driving sequence is used to indicate a three-dimensional facial action of a second object when outputting the to-be-processed speech, the style conversion model is obtained by performing parameter adjustment on an initial model according to a third driving sequence and a second driving sequence, the third driving sequence is obtained by inputting N first parameter sets included in a first driving sequence into the initial model, N is a positive integer greater than 1, the first driving sequence is used to indicate a three-dimensional facial action of the first object when outputting target speech, the second driving sequence is used to indicate a three-dimensional facial action of the second object when outputting the target speech, and the style conversion model is used to output a driving sequence that conforms to the facial style of the second object when speaking; and a driving unit, configured to drive, according to the target driving sequence, a three-dimensional facial model corresponding to the second object to perform facial actions.
14. The device according to claim 13, wherein the determining unit comprises: a first determining module, configured to determine phoneme information corresponding to the to-be-processed speech, wherein the phoneme information is a set of phonemes constituting the to-be-processed speech; and a second determining module, configured to determine the to-be-processed driving sequence according to a mapping relationship and the phoneme information, wherein the mapping relationship is a correspondence between phonemes and preset parameter sets, the preset parameter set is used to indicate a three-dimensional facial action of the first object when producing a phoneme, and the to-be-processed driving sequence includes at least one preset parameter set.
15. The device according to claim 13 or 14, wherein the first parameter set is used to indicate a three-dimensional facial action of the first object in the time frame corresponding to the first parameter set, and the third driving sequence includes N third parameter sets.
16. The device according to claim 15, wherein the style conversion model is obtained by performing parameter adjustment on the initial model according to a first facial model and a second facial model, wherein the first facial model is obtained by performing parameter adjustment on a preset facial model according to the third driving sequence, and the second facial model is obtained by performing parameter adjustment on the preset facial model according to the second driving sequence.
17. The device according to claim 15, wherein the style conversion model is obtained by performing parameter adjustment on the initial model according to a first loss function and a second loss function; the first loss function is obtained according to the third driving sequence and the second driving sequence, wherein the second driving sequence includes N second parameter sets, and the second parameter set is used to indicate a three-dimensional facial action of the second object in the time frame corresponding to the second parameter set; and the second loss function is obtained according to a first facial model and a second facial model, wherein the first facial model is obtained by performing parameter adjustment on a preset facial model according to the third driving sequence, and the second facial model is obtained by performing parameter adjustment on the preset facial model according to the second driving sequence.
18. The device according to any one of claims 13-14 and 16-17, wherein the style conversion model includes M one-dimensional convolution layers, the one-dimensional convolution layers are used to perform convolution processing on input data to obtain a processing result, the size of the processing result is the same as the size of the input data, and M is a positive integer greater than 1.
19. A training device for a style conversion model, comprising: an acquiring unit, configured to acquire at least one training set, wherein the training set includes a first driving sequence and a second driving sequence, the first driving sequence is used to indicate a three-dimensional facial action of a first object when outputting target speech, and the second driving sequence is used to indicate a three-dimensional facial action of a second object when outputting the target speech; and a training unit, configured to input N first parameter sets included in the first driving sequence into an initial model to obtain a third driving sequence, and perform parameter adjustment on the initial model according to the third driving sequence and the second driving sequence to obtain a style conversion model, wherein N is a positive integer greater than 1, the style conversion model is used to output a driving sequence that conforms to a target style, and the target style is the facial style of the second object when speaking.
20. The device according to claim 19, wherein the first parameter set is used to indicate a three-dimensional facial action of the first object in the time frame corresponding to the first parameter set, and the third driving sequence includes N third parameter sets.
21. The device according to claim 20, wherein an adjusting module comprises: a first adjusting submodule, configured to perform parameter adjustment on a preset facial model according to the third driving sequence to obtain a first facial model; a second adjusting submodule, configured to perform parameter adjustment on the preset facial model according to the second driving sequence to obtain a second facial model; and a third adjusting submodule, configured to perform parameter adjustment on the initial model according to the first facial model and the second facial model to obtain the style conversion model.
22. The device according to claim 20, wherein an adjusting module comprises: a first determining submodule, configured to determine a first loss function according to the third driving sequence and the second driving sequence, wherein the second driving sequence includes N second parameter sets, and the second parameter set is used to indicate a three-dimensional facial action of the second object in the time frame corresponding to the second parameter set; a fourth adjusting submodule, configured to perform parameter adjustment on a preset facial model according to the third driving sequence to obtain a first facial model; a fifth adjusting submodule, configured to perform parameter adjustment on the preset facial model according to the second driving sequence to obtain a second facial model; a second determining submodule, configured to determine a second loss function according to the first facial model and the second facial model; and a sixth adjusting submodule, configured to perform parameter adjustment on the initial model according to the first loss function and the second loss function to obtain the style conversion model.
23. The device according to any one of claims 19-22, wherein the acquiring unit comprises: a third determining module, configured to determine phoneme information corresponding to the target speech, wherein the phoneme information is a set of phonemes constituting the target speech; a fourth determining module, configured to determine the first driving sequence according to a mapping relationship and the phoneme information, wherein the mapping relationship is a correspondence between phonemes and preset parameter sets, the preset parameter set is used to characterize a three-dimensional facial action when producing a phoneme, and the first driving sequence includes at least one preset parameter set; a second acquiring module, configured to acquire the second driving sequence; and a fifth determining module, configured to use the second driving sequence and the first driving sequence as one training set.
24. The device according to any one of claims 19-22, wherein the style conversion model includes M one-dimensional convolution layers, the one-dimensional convolution layers are used to perform convolution processing on input data to obtain a processing result, the size of the processing result is the same as the size of the input data, and M is a positive integer greater than 1.
25. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-12.
26. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-12.
27. A computer program product, comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-12.
CN202311766861.6A 2023-12-20 2023-12-20 Three-dimensional face driving method based on voice, model training method and device Active CN117788654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311766861.6A CN117788654B (en) 2023-12-20 2023-12-20 Three-dimensional face driving method based on voice, model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311766861.6A CN117788654B (en) 2023-12-20 2023-12-20 Three-dimensional face driving method based on voice, model training method and device

Publications (2)

Publication Number Publication Date
CN117788654A CN117788654A (en) 2024-03-29
CN117788654B true CN117788654B (en) 2025-06-10

Family

ID=90379211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311766861.6A Active CN117788654B (en) 2023-12-20 2023-12-20 Three-dimensional face driving method based on voice, model training method and device

Country Status (1)

Country Link
CN (1) CN117788654B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN117115317A (en) * 2023-08-10 2023-11-24 北京百度网讯科技有限公司 Virtual image driver and model training method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568864B2 (en) * 2018-08-13 2023-01-31 Carnegie Mellon University Processing speech signals of a user to generate a visual representation of the user
CN110751708B (en) * 2019-10-21 2021-03-19 北京中科深智科技有限公司 Method and system for driving face animation in real time through voice
US11417041B2 (en) * 2020-02-12 2022-08-16 Adobe Inc. Style-aware audio-driven talking head animation from a single image
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN115546575B (en) * 2022-06-21 2025-08-22 北京字跳网络技术有限公司 Training method, driving method, device, readable medium and equipment for driving model
CN115345968B (en) * 2022-10-19 2023-02-07 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
CN115393486B (en) * 2022-10-27 2023-03-24 科大讯飞股份有限公司 Method, device and equipment for generating virtual image and storage medium
CN116342769A (en) * 2023-03-23 2023-06-27 平安科技(深圳)有限公司 Digital human video generation method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN117115317A (en) * 2023-08-10 2023-11-24 北京百度网讯科技有限公司 Virtual image driver and model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN117788654A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
US12387411B2 (en) Puppeteering a remote avatar by facial expressions
US10217260B1 (en) Real-time lip synchronization animation
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
RU2721180C1 (en) Method for generating an animation model of a head based on a speech signal and an electronic computing device which implements it
CN113052962A (en) Model training method, information output method, device, equipment and storage medium
CN117115317B (en) Virtual image driving and model training method, device, equipment and storage medium
US20240221260A1 (en) End-to-end virtual human speech and movement synthesization
CN115170703A (en) Virtual image driving method, device, electronic equipment and storage medium
CN117528135A (en) Speech-driven face video generation method and device, electronic equipment and medium
CN111833391B (en) Image depth information estimation method and device
CN114792355A (en) Virtual image generation method and device, electronic equipment and storage medium
CN113380269B (en) Video image generation method, apparatus, device, medium, and computer program product
CN118229815B (en) Video generation method, deep learning model training method, device, equipment and storage medium
CN117788654B (en) Three-dimensional face driving method based on voice, model training method and device
CN113240780A (en) Method and device for generating animation
EP4152269B1 (en) Method and apparatus of training model, device, and medium
CN117593424A (en) Virtual object face driver and model training methods, devices, equipment and media
CN116843808A (en) Rendering, model training and virtual image generating method and device based on point cloud
CN113327311B (en) Virtual character-based display method, device, equipment and storage medium
CN116402914A (en) Method, device and product for determining a stylized image generation model
CN119417969B (en) Texture image generation method, training method and training device
US20240394830A1 (en) Distortion-based image rendering
CN116229583B (en) Driving information generation method, driving device, electronic equipment and storage medium
CN117274528B (en) Method, device, electronic device and readable storage medium for acquiring three-dimensional grid data
CN119379867B (en) Virtual object driving and model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载