Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
A first embodiment of the present application provides an AI-driven personalized speech training and pronunciation correction system. Fig. 1 shows a schematic diagram of this first embodiment, and the embodiment is described in detail below with reference to fig. 1.
The system comprises a voice acquisition unit 101, a recognition construction unit 102, a feature generation unit 103, a difference analysis unit 104 and an error correction feedback unit 105.
The voice acquisition unit 101 is configured to collect the original voice data of a user during the voice training process, and to extract from the original voice data multidimensional voice feature vectors including pitch, speech speed, intonation, formant parameters and the text corresponding to the voice.
In the AI-driven personalized speech training and pronunciation correction system of the present embodiment, the speech acquisition unit 101 is the front-end input component of the system; it acquires the original speech data of a user during speech training and extracts multidimensional speech feature vectors from the speech data. The unit may comprise one or more audio acquisition devices, such as an electret microphone array, a MEMS microphone or a pick-up module integrated in a mobile device, a computer or an ear-mounted device. The sampling rate is preferably 16 kHz or higher to ensure that the resolution of the audio data is high enough to facilitate subsequent feature extraction.
The collected original voice signal is first subjected to noise reduction and dereverberation by a preprocessing module. Front-end enhancement algorithms based on spectral subtraction, Wiener filtering or deep neural networks can be employed to remove the interference of ambient background noise and reverberation that would otherwise degrade the accuracy of the speech training. The processed speech signal is then fed into the acoustic feature extraction stage.
The feature extraction module extracts the multidimensional voice feature vectors using time-domain and frequency-domain analysis methods. The multidimensional voice features at least comprise: pitch, which can be estimated with an autocorrelation method or a deep-learning method based on a masked spectrogram; speech speed, obtained by counting the time distribution of phonemes, syllables or words; intonation, i.e. the pitch variation trend of the voice over the whole sentence, which can be modeled with an average pitch curve; and formant parameters, which are usually analyzed by LPC (Linear Predictive Coding) to extract formant frequencies such as F1, F2 and F3. In addition, the system extracts the text information corresponding to the voice, and an ASR (Automatic Speech Recognition) model is usually used to recognize the text sequence corresponding to the original voice. The preferred ASR model may be based on a CTC architecture, a Transformer encoder or a Conformer architecture to achieve higher recognition accuracy and adaptability to accents.
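As a minimal illustrative sketch of this feature extraction step (not the embodiment itself), pitch can be estimated per frame with a plain autocorrelation search and speech speed from ASR-aligned phoneme timestamps; the frame length, pitch search range and helper names below are assumptions for illustration:

import numpy as np

def estimate_pitch_autocorr(frame, sr=16000, fmin=60.0, fmax=400.0):
    # Autocorrelation pitch estimate for one analysis frame (illustrative sketch)
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(sr / fmax), min(int(sr / fmin), len(corr) - 1)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sr / lag if corr[lag] > 0 else 0.0  # 0.0 marks an unvoiced frame

def estimate_speech_rate(phoneme_timestamps, utterance_duration_s):
    # Speech rate as phonemes per second, using time stamps from the ASR alignment
    return len(phoneme_timestamps) / max(utterance_duration_s, 1e-6)

# Example: a 40 ms synthetic 200 Hz frame sampled at 16 kHz yields a pitch of about 200 Hz
t = np.arange(0, 0.04, 1 / 16000)
print(estimate_pitch_autocorr(np.sin(2 * np.pi * 200 * t)))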
To facilitate invocation by subsequent modules, all of the features described above are uniformly encoded into one structured multidimensional vector as the output of the speech acquisition unit 101. The multidimensional voice feature vector not only retains the original acoustic information but also integrates time-series features and semantic tags, and has good expandability and system interface compatibility. The vector is passed directly to the recognition construction unit 102 for subsequent context analysis and pronunciation feature modeling.
It should be noted that the voice acquisition unit 101 may also cooperate with the wake-up mechanism and the authority control mechanism of the user device, so as to implement automatic data acquisition and uploading across multiple training sessions on the premise of ensuring user privacy. In addition, to improve personalized adaptability, the voice acquisition unit 101 may further include a user ID binding mechanism that associates each piece of training data with the user's historical data, so that the system can conveniently accumulate an individual pronunciation profile and support subsequent dynamic optimization and long-term tracking correction.
In summary, the speech acquisition unit 101 not only completes the high quality audio input and noise reduction preprocessing in the present system, but also constructs the high-dimensional speech feature representation required for the subsequent recognition, analysis and training control.
The recognition construction unit 102 is configured to recognize a context label where the current voice of the user is located based on semantic information and emotion features in the multi-dimensional voice feature vector, and construct a corresponding user pronunciation feature vector based on the multi-dimensional voice feature vector.
In the AI-driven personalized speech training and pronunciation correction system of the present embodiment, the recognition construction unit 102 is configured to process the multidimensional speech feature vector output by the speech acquisition unit 101, and its function mainly includes two aspects, namely, recognizing the context label where the current speech is located, and constructing the pronunciation feature vector of the user based on the feature vector.
First, the recognition construction unit receives the aforementioned multidimensional speech feature vector including pitch, speech rate, intonation, formant parameters, and the text corresponding to the speech. In order to obtain semantic and emotion information from these features, the recognition construction unit internally comprises a semantic understanding sub-module and an emotion analysis sub-module. The semantic understanding sub-module analyzes the text content and language structure of the sentence spoken by the user, preferably using a natural language processing model, such as a context-aware model based on BERT, RoBERTa, or a Transformer encoder, to encode the meaning of the sentence. In combination with the text information corresponding to the voice, the system can identify the functional type of the current sentence, such as a statement, a question or an exclamation, or distinguish scene types, such as daily dialogue, news broadcasting, voice instructions and the like.
The emotion analysis sub-module uses pitch, rhythm, formant shape and other signal features that are highly correlated with emotion, in combination with a deep neural network model (such as a multilayer LSTM or CNN-BiGRU structure), to judge the emotional tendency expressed by the current voice, such as neutral, aggressive, anxious or angry. Based on this dual recognition of semantic tags and emotion tags, the system fuses the results and generates a definite context label through a context discrimination network. The label is selected from a preset context set, such as 'child dialogue', 'news broadcasting', 'customer service scene', 'role playing', and the like, and each context label corresponds to a group of subsequent pronunciation targets and evaluation references.
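The fusion of the semantic and emotion recognition results into a single context label can be sketched as a small discrimination network; the embedding sizes, emotion dimensionality and label set below are illustrative assumptions rather than a fixed design of the embodiment:

import torch
import torch.nn as nn

CONTEXT_LABELS = ["child dialogue", "news broadcasting", "customer service scene", "role playing"]

class ContextDiscriminator(nn.Module):
    # Fuses a sentence embedding (semantic sub-module) with an emotion probability vector
    # (emotion sub-module) and predicts one label from the preset context set.
    def __init__(self, sem_dim=768, emo_dim=4, hidden=128, num_contexts=len(CONTEXT_LABELS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + emo_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_contexts),
        )

    def forward(self, semantic_emb, emotion_probs):
        logits = self.net(torch.cat([semantic_emb, emotion_probs], dim=-1))
        return torch.softmax(logits, dim=-1)  # distribution over preset context labels

# Usage: pick the most probable context label for the current utterance
model = ContextDiscriminator()
probs = model(torch.randn(1, 768), torch.tensor([[0.7, 0.1, 0.1, 0.1]]))
print(CONTEXT_LABELS[int(probs.argmax(dim=-1))])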
After the context recognition is completed, the recognition construction unit further aggregates the original multidimensional voice feature vector to construct the user pronunciation feature vector. This vector characterizes the overall speech expression state of the user in the current context, including the articulation positions of the actual pronunciation, speech patterns, pitch changes and their emotional consistency with the spoken content. Preferably, the user pronunciation feature vector adopts a fixed-dimension vector structure, and different feature dimensions can be weighted through an attention mechanism so as to enhance the expressive power of features that are sensitive to pronunciation differences. In addition, the construction process can also draw on the pronunciation habits or known problems recorded in the user's historical training data to realize personalized adjustment of the user model, so that the constructed pronunciation feature vector reflects both the user's current state and past characteristics.
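One way to realize the attention-based weighting and history-aware adjustment just described is sketched below; the sensitivity scores and the blending coefficient are illustrative assumptions:

import numpy as np

def build_pronunciation_vector(current_features, sensitivity_scores, history_profile=None, alpha=0.8):
    # Weight each feature dimension by a softmax over sensitivity scores, so that dimensions
    # known to be sensitive to pronunciation differences contribute more strongly.
    scores = np.asarray(sensitivity_scores, dtype=float)
    attn = np.exp(scores - scores.max())
    attn = attn / attn.sum()
    weighted = attn * np.asarray(current_features, dtype=float)
    # Optionally blend in the user's historical pronunciation profile for personalization
    if history_profile is not None:
        weighted = alpha * weighted + (1 - alpha) * np.asarray(history_profile, dtype=float)
    return weighted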
Finally, the output of the recognition construction unit 102 includes two parts: a context label indicating the expression context in which the user's current speech is located, and a user pronunciation feature vector used for the subsequent difference analysis against the target pronunciation features. These two outputs are transmitted to the feature generation unit 103 and the difference analysis unit 104, respectively, as the basis for generating the subsequent correction advice.
Further, the recognition construction unit comprises a context transformation simulation module which is specifically used for analyzing pronunciation feature continuity between different context labels in user history training data, extracting transition expression patterns between adjacent context categories and establishing a multi-context mixed expression model based on the transition expression patterns;
When the recognition construction unit detects that the current multi-dimensional voice feature vector simultaneously contains feature indexes corresponding to two or more context labels and the feature indexes have high correlation in historical data, the context transformation simulation module calls a cross context interference decoder to decouple and separate different context components in the multi-dimensional voice feature vector, generates voice feature sub-vectors corresponding to each context, and calculates a context conflict confidence score of each sub-vector based on the separation result;
Under the condition that any context conflict confidence score exceeds a preset threshold, the context transformation simulation module triggers a feature masking mechanism to partially mask the context feature sub-vectors with the confidence lower than the threshold, and calculates fusion weights for the reserved parts;
The fusion weight is used for carrying out weighted combination on target pronunciation feature vectors corresponding to a plurality of context labels in a preset standard pronunciation database to generate a dynamic target pronunciation feature vector containing interpretable weight vector labels, and the dynamic target pronunciation feature vector is provided with transitional expression features and is used for comparing with the current user pronunciation feature vector so as to cope with nonlinear expression deviation caused by context rapid switching in the natural voice training process of the user.
In an AI-driven personalized speech training and pronunciation correction system, the recognition construction unit includes a context transformation simulation module for recognizing and handling the multi-context mixing or context transition expressions that may occur while a user trains with natural speech. The system first establishes a continuous mapping relationship of pronunciation features between different context labels by analyzing the user's historical voice samples. Context labels are expression-scene categories predefined by the system, such as "news broadcast", "emotion reading", "customer service dialogue", etc., and each context label corresponds to a set of known standard pronunciation feature sequences. These context labels are automatically identified by the model, and the initial training set can also be provided by manually labeling training samples.
In order to construct the transitional expression mode, the system extracts key transition characteristics, including the pitch-curve change rate, the speech-speed fluctuation interval and emotion-parameter mutation points, based on the feature change sequences exhibited when the user naturally transitions between two adjacent contexts in the historical samples. In the processing flow, the system adopts a sliding-window mechanism to segment the voice fragments with cross-context characteristics from the user's voice data. These segments are then aligned using a Dynamic Time Warping (DTW) algorithm, and similar transition samples are grouped using K-means feature clustering. Each cluster represents a common context transition path, e.g., a gradual transition from the "formal broadcast" to the "emotional reading" mode. After clustering, these time-series features are modeled using a bidirectional long short-term memory network (BiLSTM), and a transitional expression prediction model is trained that can output a predicted "transitional state" feature vector for subsequent dynamic target generation when signs of context mixing are detected.
When the recognition construction unit receives the multidimensional speech feature vector of the current user speech, the context transformation simulation module first invokes the existing context recognition model (e.g., a joint encoding network based on BERT and BiGRU) to analyze the potential context label distribution in the current sample. If key features corresponding to two or more context labels are detected to be present at the same time and have a high-frequency co-occurrence relationship in the history (e.g., present simultaneously in more than 30% of the transitional training samples), the system determines that a context mix or context jump is likely.
At this point, the context transformation simulation module invokes the cross-context interference decoder. The decoder adopts a multi-head attention structure, decouples the input multidimensional voice feature vector along the context dimension, and decomposes it into a plurality of voice feature sub-vectors, each corresponding to one candidate context component. For example, one sub-vector may mainly reflect the pace of the "news broadcast" and another the pitch jitter and emotional expression of the "emotion reading".
To determine the validity of each context component in the current expression, the system calculates a context conflict confidence score for each speech feature sub-vector. The score comprehensively considers three factors: the feature similarity between the sub-vector and the standard sample of the context to which it belongs, preferably computed as cosine similarity; the proportion of the sub-vector in the total energy of the current voice, which measures its expression intensity; and the matching degree between the semantics expressed by the current voice text and the context label, evaluated by sentence-vector comparison (such as vectors after BERT encoding). After the three indexes are normalized, the confidence score is calculated as a weighted average. The weights can be set based on statistical experience, for example semantic consistency 0.4, acoustic similarity 0.4 and energy ratio 0.2, or can be trained through supervised learning.
Once the confidence score for a subvector is below a system-set threshold (e.g., 0.55), the subvector will be considered as an invalid or interfering context component, and the system triggers a feature masking mechanism to de-weight the subvector or directly mask its participation in subsequent fusion computations with a zero-value mask. For the reserved high-confidence subvectors, the system re-normalizes the confidence of the high-confidence subvectors by adopting a Softmax function to generate fusion weights.
The fusion weights are used for calling target pronunciation feature vectors corresponding to the context labels from the standard pronunciation database, and carrying out weighted combination on the target vectors according to the weights to finally form a dynamic target pronunciation feature vector. The vector not only expresses the characteristics of the dominant context, but also fuses the pronunciation style of the reserved components in the secondary context, and attaches weight information to the structure for subsequent comparison or visual presentation.
The dynamic target pronunciation feature vector has transitional expression capability, is particularly suitable for complex situations of context fast switching, emotion cross-scene fluctuation and the like of user expression in natural language communication scenes, and is beneficial to improving the accuracy and adaptability of a voice training system under various expressions. After the dynamic target pronunciation feature vector is generated, the dynamic target pronunciation feature vector is used as a comparison standard to be input into a difference analysis unit in the system for multidimensional comparison with the pronunciation feature vector of the current user. The system generates a difference parameter set by quantitatively analyzing the difference of the two vectors in the dimensions of pitch, speech speed, intonation, emotion expression and the like. This difference parameter set will be used in the subsequent error correction feedback unit to help the system give pronunciation correction advice that is more context-friendly and compatible with the transitional state.
The following are the reference implementation codes of the recognition building unit and the context-transformation simulation module:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw
# BiLSTM used for transitional expression modeling
class BiLSTMTransitionModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(BiLSTMTransitionModel, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        out, _ = self.lstm(x)         # output shape: [batch_size, seq_len, hidden_dim * 2]
        out = self.fc(out[:, -1, :])  # take the output of the last time step
        return out                    # transitional-state prediction feature vector
# Multi-head attention mechanism for decoupling context components
class MultiHeadContextSeparator(nn.Module):
    def __init__(self, input_dim, num_heads):
        super(MultiHeadContextSeparator, self).__init__()
        self.attn = nn.MultiheadAttention(embed_dim=input_dim, num_heads=num_heads, batch_first=True)

    def forward(self, x):
        # x is the multidimensional speech feature vector that may mix several context components
        attn_output, _ = self.attn(x, x, x)  # self-attention produces the decoupled sub-vectors
        return attn_output                   # [batch, seq, input_dim], split per candidate context downstream
# Context conflict confidence score calculation function
def calculate_confidence_score(subvector, context_vector, energy_ratio, semantic_similarity,
                               weights=(0.4, 0.4, 0.2)):
    # Weighted average of semantic consistency, acoustic similarity and energy ratio
    acoustic_sim = cosine_similarity(subvector.reshape(1, -1), context_vector.reshape(1, -1))[0][0]
    score = weights[0] * semantic_similarity + weights[1] * acoustic_sim + weights[2] * energy_ratio
    return score
# Feature masking and fusion weight calculation module
def apply_mask_and_compute_weights(confidences, threshold):
    # Zero out (mask) sub-vectors whose confidence falls below the threshold
    masked_confidences = np.array([c if c >= threshold else 0.0 for c in confidences])
    if masked_confidences.sum() == 0:
        masked_confidences += 1e-6  # prevent division by zero when every component is masked
    weights = masked_confidences / masked_confidences.sum()
    return weights  # normalized fusion weights
# Dynamic target vector generation (weighted combination)
def generate_dynamic_target_vector(context_vectors, weights):
    # context_vectors: List[np.ndarray], one standard pronunciation vector per context
    # weights: List[float], corresponding fusion weights
    weighted_sum = sum(w * v for w, v in zip(weights, context_vectors))
    return weighted_sum  # dynamic target pronunciation feature vector
# Example flow-control function of the recognition construction unit
def process_user_input(user_feature_vector, candidate_contexts, context_templates,
                       transition_model, separator_model, semantic_model, threshold=0.55):
    """
    user_feature_vector: np.ndarray, multidimensional feature matrix of the current speech ([seq_len, dim])
    candidate_contexts: List[str], preliminarily identified context labels
    context_templates: Dict[str, np.ndarray], standard pronunciation vector for each context label
    transition_model: transitional-expression BiLSTM model
    separator_model: multi-context decoupling module
    semantic_model: callable sentence-vector model (e.g., a BERT encoder)
    """
    # Step 1: context decoupling
    x_tensor = torch.tensor(user_feature_vector, dtype=torch.float32).unsqueeze(0)  # [1, seq_len, dim]
    separated = separator_model(x_tensor).squeeze(0)  # [seq_len, dim]
    # Step 2: compute the context conflict confidence of each sub-vector
    confidences = []
    context_subvectors = []
    for i, ctx in enumerate(candidate_contexts):
        subvec = separated[i].detach().numpy()
        context_subvectors.append(subvec)
        energy_ratio = np.linalg.norm(subvec) / np.linalg.norm(user_feature_vector)
        semantic_similarity = cosine_similarity(semantic_model.encode(user_feature_vector).reshape(1, -1),
                                                semantic_model.encode(context_templates[ctx]).reshape(1, -1))[0][0]
        score = calculate_confidence_score(subvec, context_templates[ctx], energy_ratio, semantic_similarity)
        confidences.append(score)
    # Step 3: masking and normalized fusion weights
    fusion_weights = apply_mask_and_compute_weights(confidences, threshold)
    # Step 4: generate the dynamic target pronunciation feature vector
    context_vecs = [context_templates[c] for c in candidate_contexts]
    dynamic_target_vector = generate_dynamic_target_vector(context_vecs, fusion_weights)
    return dynamic_target_vector, confidences, fusion_weights
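The sliding-window segmentation, DTW alignment and K-means clustering of transition segments described earlier are imported but not exercised in the reference code above; a minimal sketch of that step (the window size, hop length and cluster count are illustrative assumptions) is as follows:

import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean
from sklearn.cluster import KMeans

def extract_transition_segments(feature_seq, window=20, hop=10):
    # Slide a fixed-size window over the per-frame feature sequence [T, D] and keep
    # each window as a candidate cross-context transition segment
    return [feature_seq[i:i + window] for i in range(0, len(feature_seq) - window + 1, hop)]

def align_to_reference(segments, reference):
    # DTW-align every segment to a common reference segment and flatten the warped
    # segments into equal-length vectors so that they can be clustered
    aligned = []
    for seg in segments:
        _, path = fastdtw(reference, seg, dist=euclidean)
        warped = np.zeros_like(np.asarray(reference, dtype=float))
        for i, j in path:
            warped[i] = seg[j]  # keep the latest frame of seg aligned to reference frame i
        aligned.append(warped.flatten())
    return np.stack(aligned)

def cluster_transition_paths(aligned_segments, n_clusters=3):
    # Each cluster represents a common context transition path, e.g. a gradual shift
    # from the "formal broadcast" mode to the "emotional reading" mode
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(aligned_segments), km.cluster_centers_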
The feature generating unit 103 is configured to invoke, based on the context label, a target pronunciation feature vector corresponding to the context label from a preset standard pronunciation database.
In the AI-driven personalized speech training and pronunciation correction system of the present invention, the feature generation unit 103 retrieves and invokes, from the preset standard pronunciation database, the target pronunciation feature vector matching the context according to the context label output by the recognition construction unit 102, thereby providing an accurate and context-adapted reference standard for the subsequent difference analysis.
Specifically, the feature generation unit first receives a context tag, which may represent a voice communication context in which the user is currently located, such as a child question-answer, a news broadcast, an emotion recitation, a formal speech, etc. Each context label corresponds to a set of predefined pronunciation reference samples in the system, which has been classified, sorted and encoded into standardized pronunciation feature vectors based on a large amount of high quality speech data annotated by professional phonetic teachers in the design stage. The target pronunciation feature vector and the user pronunciation feature vector are completely consistent in structure, and also comprise a plurality of dimensions such as pitch, speech speed, intonation outline, formant distribution, rhythm pattern, corresponding semantic-emotion labels and the like, so that good comparability and algorithm compatibility are ensured during comparison.
The feature generation unit typically embeds an efficient feature retrieval mechanism that quickly locates the reference pronunciation set of the corresponding context in the database based on the input context label. Preferably, the retrieval mechanism can be implemented with a hash index, vector similarity retrieval, or a deep context matching network, ensuring that the system can invoke target features at millisecond latency even when the database is large. In a specific implementation, the feature generation unit may provide several invocation policies, such as invoking only the optimal reference pronunciation vector in the context, or invoking multiple reference samples and generating the final target pronunciation feature vector through an aggregation algorithm (such as averaging, weighted fusion or confidence-based optimization), so as to improve the robustness and flexibility of the comparison.
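A simplified sketch of such a retrieval-and-aggregation policy, using cosine similarity over an in-memory reference set and confidence-weighted fusion (the database layout, query vector and top-k value are illustrative assumptions):

import numpy as np

def retrieve_target_vector(context_label, query_vector, reference_db, top_k=3):
    # reference_db: dict mapping each context label to an array [N, D] of standardized
    # reference pronunciation feature vectors; query_vector: a personalization query,
    # e.g. derived from the user's age, gender and pronunciation habits.
    refs = np.asarray(reference_db[context_label], dtype=float)              # [N, D]
    q = np.asarray(query_vector, dtype=float)
    sims = refs @ q / (np.linalg.norm(refs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(sims)[::-1][:top_k]                                      # best-matching references
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()                     # confidence-based fusion weights
    return weights @ refs[top]                                                # aggregated target pronunciation vector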
In order to further improve the matching degree, the feature generation unit can also refer to personalized parameters such as the user's age, gender and pronunciation habits, and call standard samples that are closer to the user's physiological features and expression style under the same context. For example, for a child user the system preferentially calls standard samples recorded by speakers of a similar age group, and for a user with a pronounced accent a standard sample with a slight dialect bias may be called as the target reference. This design ensures that the training reference is both context-adapted and individually matched, which helps improve user acceptance and learning efficiency.
Once the target pronunciation feature vector is generated, it is sent to the difference analysis unit 104 as a reference standard inside the system, and is compared with the current pronunciation feature vector of the user in a multi-dimensional manner. In order to ensure maintainability and expansibility of the system, the feature generation unit needs to consider an updating mechanism of a standard pronunciation database when designing, and supports dynamic supplement of context samples according to newly recorded samples of teachers or automatic optimization of weights or labels of existing target features based on user training feedback. The design can ensure that the system is continuously adapted to different users and continuously evolving language expression scenes in long-term use.
The difference analysis unit 104 is configured to perform multidimensional comparison on the user pronunciation feature vector and the target pronunciation feature vector, and obtain a difference parameter set including a pronunciation part deviation, a speech speed difference and an emotion expression difference.
In the AI-driven personalized speech training and pronunciation correction system of the present embodiment, the difference analysis unit 104 performs multidimensional comparison on speech features between the user pronunciation feature vector and the target pronunciation feature vector after receiving the two, so as to obtain a set of parameters capable of quantifying the difference between the current pronunciation and the standard pronunciation of the user, i.e., a difference parameter set. The unit plays roles of core analysis and precision feedback in the whole system, and is a key bridge for connecting 'recognition construction' and 'correction feedback'.
The difference analysis unit first receives the user pronunciation feature vector output by the recognition construction unit. The vector contains parameters of the user's speech in the current context such as the pitch curve, speech-rate time series, intonation profile, formant distribution and cadence structure, as well as emotion-related speech metrics such as short-term energy fluctuation and pitch shift trend. At the same time, the difference analysis unit receives the target pronunciation feature vector provided by the feature generation unit, which is the feature representation, in the same dimensions, of a standard pronunciation sample matched to the user's current context label. To ensure a consistent comparison, the data dimensions, sampling lengths and parameter sequences of the two vectors must be aligned exactly; this is typically achieved in the preprocessing stage through normalization, resampling, time alignment (e.g., based on dynamic time warping, DTW), and the like.
In the comparison stage, the difference analysis unit adopts a dimension-by-dimension comparison method to calculate the distance or deviation between the user pronunciation and the target pronunciation in each feature dimension. For continuous numerical features (such as pitch, speech speed and the intonation curve), indexes such as mean square error, correlation coefficient, bias integral and dynamic matching distance can be used for quantitative analysis; for the emotion dimension, matching can be evaluated using KL divergence, cross entropy or confidence differences between emotion classification probability distributions. Discrete event characteristics such as rhythm and accent position can be judged through a similarity matrix or an event matching rate. Finally, the difference values of all dimensions are uniformly encoded into a difference parameter set, which contains not only the quantized difference value of each type of feature but also the comprehensive score, problem priority label and recommended intervention strength automatically generated by the system according to empirical rules or a learned model.
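A compact sketch of this dimension-by-dimension comparison, computing a mean-square error for pitch, a relative speech-rate difference, a DTW distance for the intonation contour and a KL divergence between emotion probability distributions (the metric choices follow the description above; the dictionary layout and helper names are illustrative):

import numpy as np
from fastdtw import fastdtw

def kl_divergence(p, q, eps=1e-9):
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def compute_difference_set(user, target):
    # user/target: dicts holding time-aligned feature arrays of the two pronunciation vectors
    pitch_mse = float(np.mean((np.asarray(user["pitch"]) - np.asarray(target["pitch"])) ** 2))
    rate_diff = float((user["speech_rate"] - target["speech_rate"]) / target["speech_rate"])
    intonation_dtw, _ = fastdtw(user["intonation"], target["intonation"],
                                dist=lambda a, b: abs(float(a) - float(b)))
    emotion_kl = kl_divergence(user["emotion_probs"], target["emotion_probs"])
    return {
        "pitch_mse": pitch_mse,               # pitch deviation
        "speech_rate_diff": rate_diff,        # speech speed difference (relative)
        "intonation_dtw": float(intonation_dtw),
        "emotion_kl": emotion_kl,             # emotion expression difference
    }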
In order to facilitate calls by subsequent modules, the difference analysis unit can also output the difference parameter set in a structured form organized into three main subsets: the first is pronunciation part deviation information, reflecting abnormal use of articulators such as the tongue tip, tongue root, lips and teeth; the second contains speech-speed-related parameters, including the overall speech speed, local delays and uneven rhythm distribution; the third is emotion expression deviation, covering mismatched intonation curves, inconsistent emotion and biased expression tendency. In addition, the unit can combine historical training results with the current comparison result to judge the trend of the deviation, such as whether certain errors recur and whether correction progress is stable, so as to provide a time-dimension reference for personalized recommendation.
In practical implementation, the difference analysis unit can be deployed as a combined model framework, whose front end is a rule engine or index extraction module constructed from expert experience and whose back end is a deep learning model or graph neural network, so as to improve the recognition capability for complex pronunciation deviations or emotional inconsistency. To improve execution efficiency, the system can adopt a parallel processing architecture that performs the comparison operations on different feature dimensions separately and merges the difference parameters into a unified output in the final stage.
In summary, the difference analysis unit 104 outputs the difference parameter set which can be quantized, parsed and applied to personalized feedback through the refined comparison of the multi-dimensional voice features, and provides accurate, comprehensive and clear basis for the subsequent pronunciation correction suggestion.
Still further, the difference analysis unit comprises a user-individuated behavior deviation prediction network for identifying long-term expression deviation features of the user relative to the target pronunciation sample in each pronunciation dimension based on non-uniform pronunciation behavior tracks repeatedly appearing under the same semantic content and context expression conditions in the user history voice sample, and performing type classification and confidence labeling on the deviation features;
the behavior deviation prediction network comprises a sequence modeling structure for constructing a pronunciation stability feature matrix, wherein the feature matrix is used for statistically modeling the deviation direction, the deviation amplitude and the variation trend of each pronunciation dimension in a target semantic-context pair to output a deviation trend vector representing the individual behavior deviation trend, and the deviation trend vector comprises a mark dimension for distinguishing structural deviation and emotional deviation;
The difference analysis unit fuses the offset trend vector with the user pronunciation feature vector in the current training, and introduces an individual expression boundary weight factor, wherein the individual expression boundary weight factor is a feature tolerance parameter generated based on the labeled or learned identification of 'individual expression feature without correction' in the user history voice data and is used for inhibiting the class of expression offset from being incorrectly identified as pronunciation error in the comparison, so as to generate a corrected user pronunciation feature vector;
the corrected user pronunciation characteristic vector is used for carrying out multi-dimensional comparison with the target pronunciation characteristic vector so as to improve the tolerance and comparison accuracy of the system to the individuation expression difference.
In the AI-driven personalized speech training and pronunciation correction system, in order to improve the adaptability of the system to the long-term pronunciation habits of different users, a user personalized behavior deviation prediction network is introduced into the difference analysis unit. The core purpose of the network is to identify and model the frequently repeated inconsistent pronunciation behaviors of users under the same semantic content and context expression condition, and reasonably process the individual differences in the subsequent pronunciation comparison process, so as to prevent the system from misjudging the individual differences as mispronunciations.
In a specific implementation, the personalized behavior shift prediction network first receives historical speech training data from the speech acquisition unit and the recognition construction unit. The system preprocesses the historical samples and sorts them based on the semantic labels (such as "how are you" or "may I ask how to get there") and context labels (such as "customer service context", "daily dialogue context") corresponding to the user's voice. Under the same semantic-context pair, the system extracts the pronunciation feature vectors of the user across multiple training rounds and calculates the degree of deviation in each pronunciation dimension (such as pitch, speech speed, intonation, formant position, emotion intensity and the like). For example, if a user consistently speaks about 15% slower than the standard speech speed and with 20% lower emotion intensity across multiple rounds of the "customer service context", this shift pattern is recorded and forms a track.
The sequence modeling structure in the network can adopt a mainstream sequence learning model such as a bidirectional LSTM (BiLSTM) or a Transformer to model the historical offset track. For each pronunciation dimension the system calculates the offset direction (positive or negative), the offset magnitude (e.g., percentage above the standard value) and the trend of change (e.g., linear increase, convergence, fluctuation) across multiple rounds of training, and generates a pronunciation stability feature matrix. Each row of the matrix corresponds to a pronunciation dimension in a semantic-context pair, and each column represents a statistical feature such as mean shift, variance or slope. Based on the matrix, the system further outputs an offset trend vector describing the user's long-term pronunciation offset characteristic in each dimension, and at the same time divides the offset characteristics into two categories, structural offset and emotional offset: structural offsets are usually stable, such as a consistently slow speech speed, while emotional offsets are susceptible to context fluctuation, such as insufficient emotion expression in a recitation context but normal emotion expression in a question-answer context.
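A minimal sketch of the pronunciation stability feature matrix and the offset trend vector described above; the per-dimension statistics (mean shift, standard deviation, slope) follow the text, while the array shapes and the structural/emotional flag rule are illustrative assumptions:

import numpy as np

def stability_matrix(offset_history):
    # offset_history: array [rounds, dims] of relative offsets of the user versus the
    # target for one semantic-context pair across training rounds
    rounds = np.arange(offset_history.shape[0])
    rows = []
    for d in range(offset_history.shape[1]):
        series = offset_history[:, d]
        slope = np.polyfit(rounds, series, 1)[0] if len(series) > 1 else 0.0
        rows.append([series.mean(), series.std(), slope])  # mean shift, spread, change trend
    return np.array(rows)                                   # one row per pronunciation dimension

def offset_trend_vector(matrix, std_threshold=0.05):
    # Stable (low-variance) offsets are flagged as structural, fluctuating ones as emotional
    flags = (matrix[:, 1] >= std_threshold).astype(float)   # 0 = structural, 1 = emotional
    return np.concatenate([matrix[:, 0], flags])            # offset means plus flag dimensions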
When the difference analysis is executed, the system fuses the user pronunciation feature vector of the current training round with the offset trend vector; the fusion can adopt weighted combination or a residual compensation model. At this time, in order to prevent the system from misidentifying the user's personalized long-term expression style as a pronunciation error, the system introduces an individual expression boundary weight factor, which is a feature tolerance parameter generated from historical pronunciation data for behaviors identified, through manual labeling or model learning, as 'expression offsets that do not require correction'. Specifically, in the training stage the system records as sample features the expression habits manually marked as acceptable by experts in the historical data (such as a slight intonation glide or a slightly prolonged word ending), or automatically identifies, through a clustering algorithm, behaviors in which a pronunciation offset exists but does not affect the clarity and acceptability of the semantic expression. Under these conditions, the system calculates the offset interval of each type of individual expression feature in each pronunciation dimension and forms a feature tolerance boundary; for example, "pitch lower by not more than 10 Hz and speech speed slower by not more than 20 ms" can be regarded as being within the individual expression tolerance range.
In the phase of difference analysis, the weight factor is used to adjust the alignment sensitivity of each pronunciation dimension. For the offset dimension in the expression boundary range, the system can reduce the difference weight of the offset dimension, and avoid mismarking the pronunciation offset as the correction error, so as to generate the corrected user pronunciation characteristic vector. The correction process can be technically realized by adopting mechanisms such as differential masking (masking), weight attenuation or error tolerance threshold value and the like.
In the invention, the individual expression boundary weight factors are used for adjusting the comparison sensitivity of the system to each pronunciation dimension in the difference analysis stage, the calculation mode is based on the long-term pronunciation offset characteristic shown in the user history voice training data, and the calculation is obtained by combining the judgment result of the system on pronunciation accuracy, and the whole calculation process is completely and automatically executed by the system without user intervention.
Specifically, the system first collects and stores the original pronunciation sample of the user under each semantic-context label during the multiple rounds of speech training of the user, and extracts the characteristic values of the sample in each pronunciation dimension, including but not limited to pitch, speech speed, intonation, formant frequency, emotion expression parameters and the like. For each dimension, the system records the actual value of the dimension of the user in each round of training, and calculates the difference value with the target value of the same dimension in the preset standard pronunciation sample to obtain the offset value. The system sorts and sums the offset values according to the semantic-context labels and forms a sequence of offset values for each dimension.
Next, the system performs a statistical calculation on the offset value sequence. For each pronunciation dimension, the system calculates the offset mean and standard deviation, and records the frequency of occurrence of the offset for that dimension. If the offset direction of a certain dimension is consistent over multiple training cycles, the offset values all fall within a stable range (i.e., the standard deviation is smaller than a preset system threshold, for example 5%), the frequency of the shift in that dimension exceeds a set proportion (for example, 70%) of all training samples, and the shift has not been recognized by the system as erroneous or triggered a manual correction mark, the system determines that the shift belongs to the individual expression differences of the user.
On this basis, the system assigns a boundary weight factor to the pronunciation dimension according to the following rules. If the absolute value of the offset mean is within a tolerance range (for example, lower than 5% of the mean value of the dimension), the weight factor of the dimension is set to 1, meaning that no tolerance processing is performed. If the offset mean is larger but the standard deviation is still in a low-fluctuation range, indicating a stable expression style, the weight factor is set to 0.5. If the offset mean is large and the fluctuation is high, indicating that the offset is most likely the result of a regional or individual language habit that does not affect semantic accuracy, the weight factor can be set lower, for example 0.2. All weight factors are ultimately retained in the system as an individual expression tolerance template for each user and each semantic-context combination, and are invoked as comparison weights in the subsequent difference analysis.
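The rule above maps the offset statistics of a dimension to an individual expression boundary weight factor; a direct sketch (the 5% tolerance and the 1/0.5/0.2 factors come from the text, the fluctuation threshold is an illustrative assumption):

def boundary_weight_factor(mean_offset, std_offset, dim_mean,
                           tolerance_ratio=0.05, low_fluctuation=0.05):
    # Returns the comparison weight applied to this pronunciation dimension
    rel_mean = abs(mean_offset) / max(abs(dim_mean), 1e-9)
    rel_std = std_offset / max(abs(dim_mean), 1e-9)
    if rel_mean < tolerance_ratio:
        return 1.0   # negligible offset: no tolerance processing is applied
    if rel_std <= low_fluctuation:
        return 0.5   # larger but stable offset: treated as a consistent personal style
    return 0.2       # large and fluctuating offset: tolerated as an individual/regional habit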
Finally, the corrected user pronunciation feature vector will be used for multi-dimensional comparison with the standard target pronunciation feature vector. By introducing the individual behavior deviation trend and expressing the boundary factor, the system can accurately distinguish the deviation correction and the individual tolerance difference, and obviously reduce the misjudgment rate while improving the comparison precision, so that the system has higher man-machine adaptability, personalized feedback capability and voice evaluation reliability under multiple environments.
An error correction feedback unit 105 for generating a pronunciation correction suggestion based on the difference parameter set, the pronunciation correction suggestion including pronunciation action guide, intonation adjustment prompt and semantic emotion enhancement instruction, and for outputting the pronunciation correction suggestion to a user and receiving exercise feedback information of the user based on the pronunciation correction suggestion.
In the AI-driven personalized speech training and pronunciation correction system of the present invention, the error correction feedback unit 105 generates a targeted pronunciation correction suggestion according to the difference parameter set output by the difference analysis unit 104, and outputs the suggestion to the user in an understandable and interactive manner, and receives exercise feedback information generated by the user in the process of executing the suggestion, so as to construct a closed-loop speech training process. The unit directly influences the effectiveness of voice training, the intuitiveness of user experience and the dynamic adaptability of a personalized training mechanism, and is an important link of connection analysis and learning feedback in a system.
The error correction feedback unit first receives the difference parameter set, which contains difference information between the user's current pronunciation and the target pronunciation in multiple dimensions, such as a deviated pronunciation part, speech that is too fast or too slow, an unnatural intonation pitch, or emotion expression inconsistent with the context expectation. Based on this difference information, the system calls a rule base and a training strategy model to interpret the deviation in each dimension and generate corresponding pronunciation correction suggestions. Preferably, the correction advice is represented as a structured instruction set comprising three types. The first is pronunciation action guidance, such as prompting that the tongue tip needs to be raised, that aspiration should be noted, or that the lips are not rounded enough and should be opened wider; this guidance is usually presented by combining anatomical illustration with audio and is suitable for users with different native-language backgrounds. The second is intonation adjustment prompting, such as prompting that the end of a word should rise to convey a questioning tone, or that the overall intonation should fall smoothly to mark a declarative sentence. The third is semantic emotion enhancement instructions, such as prompting that the emotional expression of a passage should be strengthened, or that a moderately angry tone should be added to a sentence so that it better fits a children's dialogue context.
The error correction feedback unit not only outputs text prompts, but can also enhance understanding and execution in a multi-modal manner, for example by demonstrating with dynamic images, schematic animations of the oral structure, or standard pronunciation videos, so that the user can more intuitively understand the difference between the error and the standard. In addition, to enhance real-time interactivity, the unit supports sentence-by-sentence or word-by-word feedback modes and allows the user to train repeatedly at his or her own rhythm. After the user finishes an exercise, the system can acquire new voice data in real time and re-extract multidimensional voice features to form an updated user pronunciation feature vector. This vector is again transmitted to the difference analysis unit for comparison, and the system judges from it the user's correction effect on the specific error dimension.
In the process of receiving exercise feedback information, the error correction feedback unit also records the exercise times, the error repetition rate, the local correction success rate, the user response time delay and other behavior indexes, and the feedback information is used as the basis for the subsequent dynamic adjustment of training strategies and model parameters. In practice, the unit can be deployed as an integrated feedback service module, which internally contains a feedback generation engine built based on deep reinforcement learning or expert system rules, and supports remote model updating and local feedback caching.
In order to ensure the personalized effect, the error correction feedback unit can also dynamically adjust the recommendation strategy by combining the historical training data of the user and the deviation evolution track. For example, for a type of recurring pronunciation errors, the system will raise the feedback priority of its error correction suggestions while giving more explicit or basic instructions, while for the grasped pronunciation points, prompts may be simplified, interventions may be reduced, to improve training efficiency and user satisfaction.
In summary, the error correction feedback unit 105 converts the difference parameter set into a hierarchical, multidimensional, personalized pronunciation correction suggestion, and constructs an interactive feedback channel, so as to ensure that the user can fully understand the error cause and obtain an effective improved path, and finally realize the whole process of closed loop control from detection, diagnosis, intervention and verification of voice training.
Still further, the error correction feedback unit includes a feedback trajectory modeling module configured to record and generate, in a plurality of training rounds, an error correction history sequence for a user over a plurality of pronunciation dimensions including pitch, speech speed, intonation, formant parameters, and emotion expression characteristics;
The feedback track modeling module is further used for calculating error correction performance parameters of each type of pronunciation errors under the corresponding context label based on the error correction history sequence, wherein the error correction performance parameters comprise error stability parameters, error volatility parameters and self-recovery capacity parameters, and the error correction rate is calculated respectively through occurrence frequency variation of the errors in a local time window, error variation amplitude between adjacent training rounds and error correction rate without prompting;
the feedback track modeling module is further used for generating a feedback state code according to the error correction performance parameters, wherein the feedback state code is used for representing feedback intervention priority of each type of pronunciation errors under a specific context;
The error correction feedback unit determines an output strategy of pronunciation correction suggestion based on the feedback state code, wherein the output strategy comprises intervention frequency, prompt content form and feedback content structure;
When a specified class of pronunciation error exhibits high stability and low self-recovery capability over a plurality of training rounds and similar expressions occur under two or more context labels, the error correction feedback unit generates multi-modal combined training content comprising a pronunciation part diagram, a standard pronunciation animation and a target context intonation example, marks the pronunciation error as a key tracked error, and continuously records its performance and updates the corresponding feedback state code in the subsequent training process.
In the AI-driven personalized speech training and pronunciation correction system, the error correction feedback unit is not only responsible for outputting correction suggestions aiming at the pronunciation deviation of the user, but also further comprises a feedback track modeling module for continuously tracking, modeling and regulating the pronunciation performance of the user in a plurality of training rounds, thereby realizing refined personalized error correction management.
In a specific implementation, the feedback track modeling module records the pronunciation analysis result of each training session of the user and constructs an error correction history sequence in time-series form. The sequence is organized by pronunciation dimension and separately tracks multiple pronunciation parameters such as pitch, speech speed, intonation, formant parameters (such as F1, F2 and F3) and emotion expression characteristics (such as energy distribution, tone fluctuation and emotion category). After each training session, the system identifies the deviation types of the user in these dimensions and records the performance of each error in that round, including the error intensity, the number of prompts, whether it was corrected, and the context label, so as to form a multidimensional error correction log indexed by dimension, time and context.
Based on the constructed error correction history sequence, the system further calculates the error correction performance parameters of each type of pronunciation error in different contexts, so as to evaluate the evolution trend and intervention difficulty of the error. The error stability parameter is the standard deviation or coefficient of variation of the error occurrence frequency within a sliding time window (such as the last 5 rounds of training), used to judge whether the error is a persistent, fixed deviation. The error volatility parameter reflects whether the user exhibits repeated, unstable fluctuation by calculating the rate of change of the error between adjacent training rounds, such as the percentage difference in intonation deviation between the nth and (n+1)th rounds. The self-recovery capacity parameter quantifies whether the user can complete error correction without receiving a prompt; it can be calculated as the proportion of errors that disappear automatically in prompt-free rounds.
After normalization processing, the three parameters are used as input to generate feedback state codes. Feedback status encoding is a structured vector or identifier that indicates the feedback priority level of a current class of pronunciation errors in a particular context. The system may set several levels, such as low, medium, high, or numeric priority weights, to dynamically adjust the feedback behavior of the system when generating pronunciation correction suggestions.
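A sketch of how the three error-correction performance parameters and the resulting feedback state code could be computed; the window length, normalization and priority thresholds are illustrative assumptions consistent with the description:

import numpy as np

def correction_performance(error_flags, prompted_flags, window=5):
    # error_flags[t] = 1 if the error occurred in round t; prompted_flags[t] = 1 if a prompt was given
    recent = np.asarray(error_flags[-window:], dtype=float)
    stability = 1.0 - float(recent.std())                   # little variation in occurrence -> stable error
    diffs = np.abs(np.diff(error_flags)) if len(error_flags) > 1 else np.array([0.0])
    volatility = float(diffs.mean())                        # change rate between adjacent rounds
    unprompted = [e for e, p in zip(error_flags, prompted_flags) if p == 0]
    self_recovery = (unprompted.count(0) / len(unprompted)) if unprompted else 0.0
    return stability, volatility, float(self_recovery)

def feedback_state_code(stability, volatility, self_recovery):
    # Higher intervention priority for stable errors that the user rarely corrects unprompted
    priority = 0.5 * stability + 0.2 * volatility + 0.3 * (1.0 - self_recovery)
    return "high" if priority > 0.7 else "medium" if priority > 0.4 else "low"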
The error correction feedback unit selects an output strategy corresponding to the feedback state code based on the feedback state code. The output strategies include frequency of intervention (e.g., per round of feedback or interval feedback), prompt content form (e.g., text instructions, speech replay, organ-guided animation), and feedback content structure (e.g., whether or not target context-specific pronunciation feature interpretation is involved). The system can quickly correspond the feedback state codes to specific strategy schemes through table lookup mapping or strategy rule diagrams, and personalized training path adjustment is realized.
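The table-lookup mapping from a feedback state code to an output strategy can be as simple as a dictionary; the strategy fields mirror those named above (intervention frequency, prompt content form, feedback content structure), and the concrete values are illustrative:

OUTPUT_STRATEGIES = {
    "high":   {"intervention": "every round",    "prompt_form": ["text", "speech replay", "organ-guided animation"],
               "content": "full explanation including target-context pronunciation features"},
    "medium": {"intervention": "every 2 rounds", "prompt_form": ["text", "speech replay"],
               "content": "focused correction hint"},
    "low":    {"intervention": "on request",     "prompt_form": ["text"],
               "content": "brief reminder"},
}

def select_output_strategy(state_code):
    return OUTPUT_STRATEGIES[state_code]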
In further enhanced control, when a certain class of pronunciation errors exhibits high stability over multiple training rounds (i.e., the error persists and does not significantly improve) and low self-recovery capability (i.e., the user can hardly correct it without prompting), and similar error manifestations are detected under two or more context labels, such as speech-speed deviations and weak emotional expression in both the "statement context" and the "emotional reading context", the system marks the error type as a key tracked error. For such key errors the system does not merely provide brief text prompts, but outputs multi-modal combined training material containing three types of content: a pronunciation part structure diagram that clearly shows the corresponding states such as tongue position and mouth shape; a standard pronunciation animation in which the pronunciation process of a standard sample can be replayed repeatedly and decomposed in slow motion; and a target intonation-semantic example corresponding to the context, such as "it is suggested to lengthen the intonation of this sentence to convey a relaxed emotion" or "slightly raise the end of this sentence to convey a questioning context".
In addition, the error is added by the system to the key tracking list, its performance is continuously recorded in subsequent training rounds, and its feedback state code is updated after each training session to dynamically adjust the training strategy, until the performance parameters of the error type reach the set convergence criteria (e.g., the error is completely eliminated or its score is significantly reduced in three consecutive rounds of training).
The feedback track modeling mechanism solves the technical bottlenecks of 'feedback stiffness', 'no continuous tracking' and 'difficult personalized regulation' in the traditional voice training system by introducing the technical paths of quantized error correction behavior history, feedback strategy dynamic adaptation, cross-context expression fusion and the like.
Still further, the error correction feedback unit includes a multi-mode visual feedback module, configured to construct a multi-source fused correction guidance content based on a pronunciation part deviation parameter, a pitch contour difference curve and an emotion score difference value generated in a current training round of a user when generating a pronunciation correction suggestion, and output the following three types of visual feedback information at the same time:
A first type of feedback information: based on a three-dimensional pronunciation organ modeling library and an articulatory anatomical mapping template built into the system, a dynamic graphical path of the pronunciation organs corresponding to the user's current error type is constructed; the path contains standard motion trajectories of the tongue tip, tongue root, soft palate, pharyngeal wall and lips, and a comparison-and-correction animation path sequence is generated by adapting the offset angle and offset amplitude parameters of the user's erroneous part;
A second type of feedback information: the pitch change contour of the current user speech and the pitch sequence of the target pronunciation are superimposed on a unified time axis to form a pitch comparison graph, in which the segment intervals where the user's pitch is too low or too high are marked, together with the target adjustment direction and amplitude interval recommended by the system;
A third type of feedback information: according to the current context label and the emotion expression score difference, a preset emotion correction prompt template of the system is invoked, the strength and presentation mode of the prompt content are dynamically adjusted according to the user's actual performance, and the emotion prompt is output to the user display terminal in visual form as a speech-intonation combined animation composed of semantic label marking, a dynamic intonation change curve and intonation demonstration audio.
The multi-mode visual feedback module further comprises a real-time driving parameter synchronization mechanism, which dynamically updates the graphical content according to the real-time speech input stream in each round of the user's speech training, so that the pronunciation action animation and the pitch annotation curve change in real time as the user re-reads, achieving a more targeted and interactive correction guidance effect.
In the AI-driven personalized speech training and pronunciation correction system, the error correction feedback unit is provided with a multi-mode visual feedback module, which outputs multi-dimensional correction guidance information in graphical and animated form while pronunciation correction suggestions are being generated, so that the user can intuitively perceive the specific position, type and degree of his or her pronunciation deviation, improving training efficiency and error correction accuracy. The module receives multi-source data input from the difference analysis unit and the recognition construction unit, mainly comprising three parameters. The first is the pronunciation part deviation parameter generated by the user in the current training round, obtained by comparing the pronunciation feature vector with the target standard pronunciation vector; the specific pronunciation part and its offset angle, displacement direction and offset amplitude are marked by means of organ structure coding. The second is the pitch contour difference curve, obtained by time-normalizing the frame-level pitch of the user audio signal and aligning it with the pitch contour of the target speech through dynamic time warping to obtain a difference value sequence, which reflects the relative offset of the user's pitch from the target pitch over the whole pronunciation segment. The third is the emotion score difference value, obtained by analyzing, with an emotion recognition model, the emotion expression level of the current speech under the constraint of the context and comparing it with the emotion expression reference value of the preset target speech in the same context, quantitatively reflecting the degree of deviation of the user's speech in emotion intensity or emotion type.
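As a concrete illustration of the pitch contour difference described above, the sketch below aligns the user's frame-level pitch to the target contour with a plain dynamic time warping implementation and returns a signed difference sequence; the function names and the direct DTW implementation are assumptions for clarity, not the system's definitive algorithm.

```python
# Illustrative sketch: pitch contour difference via dynamic time warping.
import numpy as np

def dtw_path(x, y):
    """Return the DTW alignment path between two 1-D pitch sequences."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def pitch_contour_difference(user_f0, target_f0):
    """Signed pitch difference (user minus target, in Hz) along the DTW path."""
    return [user_f0[i] - target_f0[j] for i, j in dtw_path(user_f0, target_f0)]

print(pitch_contour_difference([200, 210, 215, 220], [200, 205, 220]))
```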
After receiving these data, the multi-mode visual feedback module generates the three types of graphical feedback information and outputs them synchronously to the user display terminal. The first type of feedback information is the dynamic graphical path of the pronunciation organs. The system calls a pre-constructed oral anatomical structure from the three-dimensional pronunciation organ modeling library, which contains standard motion paths of key pronunciation organs such as the tongue tip, tongue root, soft palate, pharyngeal wall and lips, locates the organ parts corresponding to the error type identified in the current training round, and maps the user's offset parameters into the model to generate an animation path compared against the standard action. This path shows the spatial displacement error of the user's current pronunciation and, through dynamic changes of color, speed or arrows, prompts the user which pronunciation action needs to be adjusted.
The second type of feedback information is the pitch comparison graph. The system aligns the pitch change curve of the user's speech with the target pitch curve on a common time axis and displays them superimposed, clearly marks the segments where the pitch is too low or too high, and additionally recommends an adjustment direction and amplitude interval. Color blocks, gradients or label indications may be used in the illustration to show the degree of deviation and how it should be corrected, for example indicating that the pitch should be raised by about 20 Hz in a certain speech segment, or that the pitch should be kept level at the end of the sentence instead of falling.
The third type of feedback information is the emotional expression cue. The system identifies the context label corresponding to the user's current speech, compares the user's emotion expression score with the target emotion reference value and, combined with the current deviation type, calls the emotion prompt template and dynamically generates the content intensity and output mode. The output forms include highlighting of semantically key word positions, dynamic lines showing the intonation variation trend, audio playback demonstrating the target intonation, and so on; the system combines these elements into a visual animation sequence presented in linkage with the user's speech segment, so that the user can further correct emotion expression deviations by watching and imitating.
In order to improve the response speed and interactivity of the feedback, the multi-mode visual feedback module further comprises a real-time driving parameter synchronization mechanism. This mechanism continuously monitors the real-time audio input stream while the user performs speech training; once a repeated pronunciation or re-reading action of the user is detected, the relevant difference parameters are immediately recalculated from the new real-time data and the three types of graphical content are refreshed synchronously. The system can dynamically update the pronunciation action trajectory, pitch curve or emotion prompt locally or as a whole, so that each pronunciation of the user corresponds to one real-time feedback update, thereby realizing a continuous, natural and personalized feedback experience and helping users identify and correct their pronunciation problems more efficiently.
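A minimal sketch of such a synchronization loop is given below; `detect_rereading`, `recompute_differences` and `display.refresh` are hypothetical stand-ins for the system's own modules, assumed only to show the control flow.

```python
# Illustrative sketch (assumed component interfaces): the real-time driving
# parameter synchronization loop described above.
import queue

def sync_loop(audio_frames: "queue.Queue", detect_rereading, recompute_differences, display):
    buffer = []
    while True:
        frame = audio_frames.get()
        if frame is None:            # sentinel: end of the training round
            break
        buffer.append(frame)
        if detect_rereading(buffer):
            # Recompute articulatory, pitch and emotion differences on the new audio
            deviation, pitch_diff, emotion_diff = recompute_differences(buffer)
            # Refresh the three kinds of graphical feedback in place
            display.refresh(articulation=deviation,
                            pitch_curve=pitch_diff,
                            emotion_prompt=emotion_diff)
            buffer.clear()           # start accumulating the next attempt
```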
Further, the error correction feedback unit comprises a user feedback analysis and training path scheduling module, which is used, after receiving multiple rounds of speech feedback samples produced by the user on the basis of pronunciation correction suggestions, for extracting and accumulating, for each training round, the pronunciation error types, the corresponding context labels, the prompt response time and a flag indicating whether the correction succeeded, constructing a structured error performance record set, and generating, based on this record set, an error type statistical matrix for training scheduling;
The error correction feedback unit further comprises a training response analysis unit, which calculates a training priority score for each error type according to its repetition frequency under a specific context, its post-prompt correction time interval, the number of consecutive rounds in which it occurs and its error amplitude change trend in the error type statistical matrix, and generates corresponding prompt adjustment weights, the prompt adjustment weights covering a recommended output frequency, prompt content length, visual guidance intensity and interactive tracking period length;
The user feedback analysis and training path scheduling module is further configured to construct a dynamic training response label mapping table based on the user's historical training state labels, divide the user's training response efficiency in different time periods or different contexts into a plurality of training state labels, select an adapted combination of training path modules according to the current label state, dynamically generate a personalized training prompt plan sequence, push it in real time, in the form of instructions, to a training task scheduling engine, and update the calling order, content composition and feedback rhythm of subsequent training tasks.
In the AI-driven personalized speech training and pronunciation correction system, in order to realize a dynamically adaptive personalized speech correction path, a user feedback analysis and training path scheduling module is arranged in the error correction feedback unit. This module is mainly used for structural analysis and path planning of the speech feedback samples generated by the user over multiple training rounds. Specifically, after the user completes one round of training based on a pronunciation correction suggestion, the system automatically records the feedback data of that round and extracts its core elements, including the pronunciation error type, the current context label, the response time after the system prompt, and whether the pronunciation in that round was successfully corrected. The error type is recognized from the pronunciation dimension deviation result output by the difference analysis unit; the context label is determined by combining semantic analysis with a scene classification module; the prompt response time is calculated as the time difference between the system prompt trigger and the start of the user's pronunciation; and the correction result is obtained by comparing the degree of difference between the samples before and after the prompt and making a binary judgment with the help of a confidence model score.
The system stores the extracted items of information in the user's individual training file, accumulates them round by round into an error performance record set, and classifies and counts the various pronunciation errors in the record set to form the error type statistical matrix. The matrix uses the row dimension to represent the error type, and the column dimension to record the occurrence frequency of the error in each context, the distribution of consecutive rounds, the post-prompt correction delay time and the error change curve. The training response analysis unit traverses and normalizes the matrix, calculates the performance stability, improvement speed and intervention difficulty of each error type at different training stages, and maps these dimensions into a comprehensive training priority score; the higher the score, the greater the influence of the error type on learning progress, or the more intensive the training it requires.
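The accumulation of records into per-error-type statistics and their mapping to a priority score might look like the following sketch; the record schema and the weighting used in `priority_score` are assumptions made only to illustrate the idea.

```python
# Illustrative sketch (assumed record schema and weighting): building the
# error type statistics and a comprehensive training priority score.
from collections import defaultdict

def build_error_matrix(records):
    """records: dicts with keys error_type, context, corrected (bool),
    response_delay (s), residual (float). Returns per-error-type statistics."""
    matrix = defaultdict(lambda: {"count": 0, "contexts": set(),
                                  "delays": [], "residuals": [], "corrected": 0})
    for r in records:
        row = matrix[r["error_type"]]
        row["count"] += 1
        row["contexts"].add(r["context"])
        row["delays"].append(r["response_delay"])
        row["residuals"].append(r["residual"])
        row["corrected"] += int(r["corrected"])
    return matrix

def priority_score(row):
    """Assumed combination: frequent, slow-to-correct, non-improving errors score higher."""
    freq = row["count"]
    avg_delay = sum(row["delays"]) / len(row["delays"])
    trend = row["residuals"][-1] - row["residuals"][0]   # > 0 means worsening
    correction_rate = row["corrected"] / row["count"]
    return freq * (1 + avg_delay) * (1 + max(trend, 0)) * (1 - correction_rate + 0.1)
```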
According to the training priority score, the system further assigns corresponding prompt adjustment weights to each error type, which are used to adjust the prompt output strategy for that error type in subsequent training. The prompt adjustment weights cover a number of specific dimensions, including the frequency of the prompt output (e.g., repeated every n rounds), the length of the prompt content (brief prompt or detailed explanation), the clarity of the visual guidance, and the response period of interactive tracking. These weights are used to control the combination of training content and the output cadence during the task scheduling phase.
In addition, the user feedback analysis and training path scheduling module builds a dynamic training response label mapping table from the response states recorded in the user's training history. By evaluating the user's response efficiency in different time periods or different contexts (for example, faster responses in customer service contexts but slow correction in emotional reading), the system assigns the user one of several training state labels, such as "quick adaptation", "repeatedly error-prone" or "rhythm sensitive". These labels are mapped to training content modules; in the actual training task generation stage, the most suitable group of training path modules is matched according to the current user's label, for example a rhythm-sensitive user preferentially receives rhythm-guidance prompts, while a repeatedly error-prone user triggers the multi-round feedback linkage module.
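A compact sketch of such a label-to-module mapping table is shown below; the label identifiers follow the examples in the text, while the module names and structure of the table are assumptions for illustration.

```python
# Illustrative sketch (assumed module names): the dynamic training response
# label mapping table and selection of a training path module combination.
LABEL_MODULE_MAP = {
    "quick_adaptation":      ["brief_text_prompt", "self_check_round"],
    "repeated_error_prone":  ["multi_round_feedback_linkage", "articulator_animation"],
    "rhythm_sensitive":      ["rhythm_guidance_prompt", "speech_rate_segmentation"],
}

def training_path_for(label: str) -> list:
    """Return the training path module combination matched to the current label."""
    return LABEL_MODULE_MAP[label]

print(training_path_for("rhythm_sensitive"))
```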
Finally, the system combines the label, the priority score and the prompt weights to generate a personalized training prompt plan sequence and sends it as instructions to the training task scheduling engine in real time. The scheduling engine adjusts the calling order, content composition and feedback interaction rhythm of subsequent training tasks according to the received prompt plan, achieving precise adaptation to the user's personal learning rhythm and error correction performance.
Through the links of structured extraction of training feedback, multi-dimensional priority calculation, construction of the state label mapping and dynamic plan pushing, this scheme realizes closed-loop optimization from data acquisition to training rhythm control.
Still further, the training response analysis unit includes a prompt adjustment parameter generation module for dynamically calculating, according to formula 1, a prompt output weight W for each error type in the current training task, wherein W is determined by the following quantities:

F denotes the historical cumulative prompt count of the error type under the current context label, reflecting how many times the system has already prompted this error;

R denotes the average residual error ratio of the error over the last three training rounds; specifically, the relative deviation between the user's pronunciation sample and the target pronunciation in the error dimension is calculated for each round, and the results are averaged and normalized;

G denotes the error improvement gradient of the error between two successive rounds, defined as the difference between the residual errors of the current round and the previous round; it measures the correction trend, its positive or negative sign indicating that the error is worsening or weakening, and it serves as the sensitivity input governing the direction of the response;

C denotes the number of context labels associated with the error in the historical training records, obtained by uniquely counting the context annotation information in those records;

N denotes the cumulative number of training rounds in which the error has occurred;

λ denotes a context complexity adjusting coefficient, set according to how strongly the training scenario emphasizes multi-context adaptation capability, and usually chosen from the range 0.2 to 1.0.
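Because the concrete form of formula 1 is not reproduced in the text, the sketch below shows only one plausible way a weight of this kind could be combined from F, R, G, C, N and λ; every functional form and coefficient in it is an assumption for illustration, not the system's actual formula.

```python
# Illustrative sketch only: one assumed combination of the quantities defined
# above into a prompt output weight W. Not the patented formula 1.
import math

def prompt_output_weight(F, R, G, C, N, lam):
    """F: cumulative prompt count, R: mean residual ratio (0..1),
    G: improvement gradient (+ worsening, - improving),
    C: number of context labels, N: rounds with this error,
    lam: context complexity coefficient (0.2..1.0)."""
    persistence = math.log1p(N)            # long-standing errors weigh more
    spread = 1.0 + lam * (C - 1)           # errors spanning contexts weigh more
    trend = 1.0 + max(G, 0.0)              # worsening errors weigh more
    fatigue = 1.0 / (1.0 + 0.1 * F)        # damp errors already prompted often
    return R * persistence * spread * trend * fatigue

print(round(prompt_output_weight(F=4, R=0.6, G=0.05, C=2, N=5, lam=0.5), 3))
```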
The prompt output weight W is used to construct the prompt output control strategy for this error type, covering the output frequency, the graphical display duration, the speech-rate prompt density and the interactive animation rhythm. According to the weight value, the system automatically adjusts the prompt priority and the presentation mode of this error in subsequent training tasks, so as to achieve balanced training load, precisely adapted prompts and an optimized feedback path, thereby improving the user's error correction efficiency and the system's intelligent intervention capability.
After completing the calculation of the prompt output weight W, the system feeds it as a core control factor into a prompt strategy parameter mapping model, which is a set of parameter mapping functions established from experimental experience and machine-learned rules. Each type of prompt strategy parameter is derived from W, and the final output effect is regulated through these parameterized mapping relations. The specific mappings are as follows:
1. Output frequency:

The system sets a base prompt frequency f0; as W increases, the prompt frequency increases. The output frequency f_out may be calculated, for example, as follows:

f_out = f0 × (1 + α · W);

wherein α is a frequency adjustment factor, empirically set in the interval 0.1 to 0.3; the base frequency f0 may, for example, be 2 prompts per minute and can be adjusted based on empirical data.
2. Graphical presentation time:

The system can dynamically adjust the playing time of each prompt animation or graphic, so as to avoid comprehension difficulties caused by prompts that are too short or fatigue caused by prompts that are too long. The presentation time T can be adjusted, for example, as follows:

T = T_min + β · min(W, W_max);

wherein T_min is a minimum presentation time (e.g., 2 seconds), β is a time extension factor (e.g., 1.5 seconds per unit weight), and W_max is the maximum weight corresponding to the upper time limit (e.g., 3.0), which prevents abnormally large weight values from producing excessively long prompts.
3. Animation rhythm control:

If W is high, the error is difficult or its state is complex, so the system slows down the animation rhythm to make it easier for the user to observe the action details. The rhythm scaling factor s can be determined, for example, by the following rule:

s = γ / √W;

wherein γ is a rhythm regulation factor set empirically. The resulting factor is multiplied by the original animation frame rate to obtain the final presentation cadence; for example, an original speed of 1.0× may be scaled down to about 0.71× for a high-weight error.
4. Speech-rate feedback density:

The system can increase or decrease the speech-rate feedback density, i.e., the number of "speech-rate segment markings", "slow demonstrations" and similar elements added to a prompt. The density D can be obtained from W through a nonlinear normalizing mapping, for example:

D = D_max · tanh(W);

wherein D_max is the maximum insertion density (e.g., 5 per prompt). The saturating mapping ensures that the density gradually levels off at high weights, avoiding an overloaded prompt.
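As a compact illustration of the mapping model described above, the sketch below implements the four parameter mappings using the example values given in the text; the exact functional forms are hedged assumptions consistent with the stated behavior, not the system's definitive formulas.

```python
# Illustrative sketch: the four prompt strategy parameter mappings driven by
# the prompt output weight W. Constants follow the example values in the text;
# the functional forms themselves are assumptions.
import math

def prompt_strategy_parameters(W,
                               f0=2.0, alpha=0.2,              # base frequency (per min), adjustment factor
                               T_min=2.0, beta=1.5, W_max=3.0, # seconds, s per unit weight, weight cap
                               gamma=1.0,                      # rhythm regulation factor
                               D_max=5.0):                     # max insertions per prompt
    return {
        "output_frequency_per_min": f0 * (1 + alpha * W),
        "display_time_s": T_min + beta * min(W, W_max),
        "animation_rhythm_scale": gamma / math.sqrt(W) if W > 0 else 1.0,
        "speech_rate_feedback_density": D_max * math.tanh(W),
    }

# Example: a moderately weighted error type.
print(prompt_strategy_parameters(W=2.0))
# animation_rhythm_scale ≈ 0.71, consistent with the example in the text.
```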
These mapping functions can be trained experimentally and continuously optimized with user test feedback, and their coefficients or output forms can be dynamically adjusted in combination with the user's training state labels in the system, so as to realize a personalized prompt output path that genuinely changes on demand.
The system finally packages these parameters into a prompt strategy parameter group, registers it with the training task scheduling module, and uses it to direct the organization and presentation of the corresponding prompt content in subsequent training rounds, thus forming a logical closed-loop path of weight driving, strategy generation and feedback output.
By introducing this calculation, the system can dynamically generate fine-grained intervention control factors from the user's historical training states and immediate behavioral responses in different contexts, markedly improving the efficiency and accuracy of prompt resource allocation and increasing the user's pronunciation correction success rate and training path convergence speed without adding to the training burden.
Further, the error correction feedback unit comprises a training state recognition and feedback adaptation module, which is used for extracting correction behavior parameters from the current speech sample after the user completes the speech feedback based on the previous round of pronunciation correction suggestions, the correction behavior parameters comprising the correction amplitude, the response delay and the residual pronunciation error amount, and for classifying the user's training state based on preset parameter thresholds to generate a training state label, the labels including an instant response type, a delayed response type and a structurally solidified offset type.

The training state label is used for selecting the pronunciation correction suggestion output strategy corresponding to it; the output strategies include a dynamic combination of text prompts and graphical feedback, rhythm and speech-rate auxiliary guidance, and a multi-modal linked error correction scheme. The training state recognition and feedback adaptation module is further provided with a label migration trigger mechanism, which detects the change trend of the user's correction behavior parameters over consecutive training rounds; when it recognizes that a label migration condition is satisfied, it automatically switches the current training state label and updates the subsequent feedback strategy, and once the label is recognized as having entered a stable-state interval, it restores the basic correction output mode, thereby achieving behavior-adaptive closed-loop control of the training feedback.
In the AI-driven personalized speech training and pronunciation correction system of the invention, the error correction feedback unit is provided with a training state recognition and feedback adaptation module, which extracts several indicators related to pronunciation correction behavior from the feedback speech sample submitted by the user after completing a speech training round and judges the user's current training state from these indicators, so as to dynamically match a suitable pronunciation correction suggestion output strategy and improve training efficiency and adaptability.
In the specific implementation, the module first receives the speech sample recorded by the user in response to the system's previous round of prompt content and invokes the difference analysis unit and a behavior recognition component to extract three key parameters from the sample. The first is the correction amplitude, i.e., the degree of difference between the user's current pronunciation sample and the mispronounced sample of the previous round in the multi-dimensional pronunciation features; it can be represented equivalently by the change in Euclidean distance between the standard pronunciation feature vector and the current user feature vector, the degree of overlap of the pitch curves, or the alignment offset on the speech-rate time axis. The second is the response delay, i.e., the time interval from the system outputting the pronunciation correction prompt to the user starting to pronounce; this parameter is obtained from the time-stamp difference between the prompt trigger moment and the starting point of the user's speech. The third is the residual pronunciation error, i.e., the number and degree of deviation of the pronunciation dimensions that have not been fully corrected in the user's current round of pronunciation; the system compares the current sample with the target pronunciation, identifies the residual error types and calculates their number and deviation amplitude.
The system takes these three parameters as input, compares them with preset numerical thresholds, and classifies the user's current behavior into a training state label according to combined judgment rules. If the correction amplitude is large, the response time is short and the residual error is low, the user is judged to be of the instant response type; if the correction amplitude is moderate, the response delay is long and the residual error is moderate, the user is classified as the delayed response type; and if the correction amplitude is low and the residual error persists and remains essentially unchanged, the user is classified as the structurally solidified offset type. The label classification criteria can be obtained statistically from a large amount of historical training data, or a state classification mapping table can be configured by manual rules.
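The combined judgment rule can be sketched as follows; all numerical thresholds are assumptions chosen only to make the classification logic concrete.

```python
# Illustrative sketch (assumed thresholds): mapping the three correction
# behavior parameters onto a training state label.
def classify_training_state(correction_amplitude, response_delay, residual_error,
                            amp_high=0.6, amp_low=0.2,
                            delay_short=1.5, residual_low=0.2):
    """Returns 'instant_response', 'delayed_response' or 'solidified_offset'."""
    if (correction_amplitude >= amp_high and response_delay <= delay_short
            and residual_error <= residual_low):
        return "instant_response"
    if correction_amplitude <= amp_low and residual_error > residual_low:
        return "solidified_offset"
    return "delayed_response"

print(classify_training_state(0.7, 1.0, 0.1))   # -> "instant_response"
```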
After the label is generated, the system selects the output strategy matching the user's current state. For an instant-response user, the system by default outputs relatively simple text prompts and static graphical information, emphasizing quick confirmation and self-correction. For a delayed-response user, the system adds speech-rate control, rhythm marks and speech-segment segmentation prompts to help the user improve feedback processing speed and perceived rhythm stability. For a user of the structurally solidified offset type, the system invokes a multi-modal linked prompt scheme, including pronunciation organ action animations, emotion and tone simulation audio, and context-adapted emotion adjustment prompts, forming an all-round intervention strategy to break the entrenched expression pattern.
To adapt to the dynamic nature of the user's behavioral state as training rounds progress, the module is further provided with a label migration trigger mechanism that continuously monitors the change trend of the correction behavior parameters over several consecutive training rounds. When the system recognizes that, under the current label, the correction amplitude has increased markedly, or the response delay keeps shortening and the residual error keeps decreasing, it judges that the user's training state is in a forward transition, triggers a state label migration accordingly, updates the current training state label and synchronously adjusts the correction suggestion output mode. The migration process can be realized by means of threshold windows, moving-average models or trend-based classifiers, ensuring that label changes are reasonable and stable.
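One of the mentioned realizations, a moving-average trend check, is sketched below; the window size and improvement thresholds are assumptions for illustration.

```python
# Illustrative sketch (assumed window and thresholds): a moving-average trend
# check for the label migration trigger described above.
def moving_average(values, window=3):
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def should_migrate(correction_amplitudes, response_delays, residual_errors,
                   min_rounds=4, improvement=0.1):
    """Trigger migration when the smoothed correction amplitude rises while
    delay and residual error fall by at least `improvement` over the window."""
    if len(correction_amplitudes) < min_rounds:
        return False
    amp, delay, resid = (moving_average(correction_amplitudes),
                         moving_average(response_delays),
                         moving_average(residual_errors))
    return (amp[-1] - amp[0] >= improvement
            and delay[0] - delay[-1] >= improvement
            and resid[0] - resid[-1] >= improvement)

print(should_migrate([0.2, 0.3, 0.4, 0.5], [2.0, 1.8, 1.5, 1.2], [0.5, 0.4, 0.3, 0.2]))
```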
Once the user's training state is identified as lying in the stable interval, the system no longer applies the enhanced feedback strategy but returns to the basic correction prompt mode, so as to reduce cognitive load and consolidate the user's existing correction results. Through the closed-loop path formed by training state recognition, label classification, feedback matching, state migration and output strategy updating, the system can generate flexible prompting schemes for different types of users, realize training content scheduling driven by behavioral responses, and effectively improve the adaptability and intervention efficiency of personalized speech training.
While the application has been described in terms of preferred embodiments, it is not intended to be limiting, but rather, it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the application as defined by the appended claims.