CN116945191A - Robot control method based on artificial intelligence - Google Patents
Robot control method based on artificial intelligence
- Publication number
- CN116945191A (application number CN202311164021.2A)
- Authority
- CN
- China
- Prior art keywords
- network
- text
- range
- super
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J11/00—Manipulators not otherwise provided for
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1679—Programme controls characterised by the tasks executed
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/06—Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
Landscapes
- Engineering & Computer Science (AREA)
- Mechanical Engineering (AREA)
- Robotics (AREA)
- Educational Administration (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Educational Technology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure relates to a robot control method based on artificial intelligence, wherein a parallel light generator is built into the robot. The method includes: in response to a voice instruction issued by a target object, determining a first instruction range on a target plane by means of a parallel light beam generated by the parallel light generator, and prompting the target object to make a gesture instruction within the first instruction range, the first instruction range being the incidence range of the parallel light beam on the target plane; in response to the gesture instruction made by the target object, determining a second instruction range on the target plane, and then determining, with a visual recognition model, whether text content to be recognized exists on the target plane within the second instruction range; and, when text content to be recognized exists within the second instruction range, acquiring a text image containing the text content to be recognized and recognizing the characters in the text image with a text recognition model to obtain a target text, so that the target text is read aloud.
Description
Technical Field
The present disclosure relates generally to the field of artificial intelligence technology, and more particularly, to an artificial intelligence-based robot control method.
Background
A point-and-read robot in the related art can only read out content, such as poetry or English text, that is stored in advance in its memory; it is essentially a player. Once the content in the memory is fixed, nothing outside the memory can be played, and even when content is added to the memory the user still has to select one item manually for playback. The content stored in such a robot is therefore very limited, and even if a huge amount of content were stored, the user would struggle to find suitable content among it because selection is manual. In addition, the interaction between the point-and-read robot and the user in the related art is limited to a single mode, which is far from real artificial intelligence.
Disclosure of Invention
The robot control method based on artificial intelligence provided herein does not depend on content, such as poetry or English text, stored inside the robot; instead, it can acquire content directly from the outside and read it aloud according to the user's interactive instructions.
In one general aspect, there is provided an artificial intelligence-based robot control method, the robot having a parallel light generator built therein, wherein the robot control method includes: responding to a voice command sent by a target object, determining a first command range on a target plane through a parallel light beam generated by the parallel light generator, and prompting the target object to make a gesture command within the first command range, wherein the first command range is an incidence range of the parallel light beam on the target plane, and an angle between a cross section of the parallel light beam and the target plane is smaller than a preset threshold value; determining a second instruction range on the target plane in response to the gesture instruction made by the target object, and then determining whether text content to be recognized exists on the target plane in the second instruction range by utilizing a visual recognition model, wherein the second instruction range does not exceed the first instruction range; and under the condition that the text content to be recognized exists in the second instruction range, acquiring a text image containing the text content to be recognized, and performing recognition operation on characters in the text image by using a text recognition model to obtain a target text, so as to perform reading operation on the target text.
Optionally, the action track of the gesture instruction includes at least one of a closed track, a linear track and a pointing track, wherein the determining the second instruction range on the target plane includes: when the action track is a closed track, taking a range wrapped by the closed track in the first instruction range as the second instruction range; when the motion track is a linear track, taking a range which is positioned above the linear track in the first instruction range as the second instruction range; and when the action track is a pointing track, taking a preset range above the pointed track in the first instruction range as the second instruction range.
Optionally, the text recognition model includes: an image correction unit, configured to correct the text image to obtain a corrected image; a feature extraction unit, configured to extract features of the corrected image to obtain text features; and a text output unit, configured to output a character sequence according to the text features to obtain the target text; the feature extraction unit comprises a plurality of convolutional network layers and a plurality of Transformer network layers, wherein the plurality of convolutional network layers are used for extracting spatial features of the corrected image, and the plurality of Transformer network layers are used for extracting sequence features of the corrected image.
Optionally, the feature extraction unit is obtained by: constructing a first search space corresponding to the convolutional network layers and a second search space corresponding to the Transformer network layers; searching, by neural network structure search, for the internal parameters of each convolutional network layer in the first search space and for the internal structure of each Transformer network layer in the second search space; and obtaining the feature extraction unit based on the searched internal parameters of each convolutional network layer and the searched internal structure of each Transformer network layer.
Optionally, the internal parameters include a convolution type parameter and a sampling path parameter, wherein the convolution type parameter includes a kernel size and a coefficient of expansion, the sampling path parameter includes a first sampling step in a vertical direction and a second sampling step in a horizontal direction, wherein the kernel size includes 3 or 5, the coefficient of expansion includes 1 or 6, and the first sampling step and the second sampling step include 1 or 2.
Optionally, the internal structure includes a multi-head self-attention unit and a feedforward neural network, wherein the feedforward neural network includes a multi-layer perceptron unit or a gated linear unit, and the attention score matrix of the multi-head self-attention unit is determined by an equation in which M is the attention score matrix, n is the layer index of the current Transformer network layer, W is the query matrix, Y is the key matrix, R is the relative distance embedding matrix, P is the scaling factor, x takes one of two candidate values (one of which is 1), and y and z take values of 0 or 1.
Optionally, the neural network structure search includes: constructing a super network applied to text recognition, wherein the super network includes a spatial network corresponding to the convolutional network layers and a sequence network corresponding to the Transformer network layers; and dividing the super network into H super network blocks in order and training each super network block in turn to obtain a trained super network, so that the feature extraction unit is obtained by searching from the trained super network; wherein the spatial network is divided, in order, into the first H-1 of the H super network blocks, and the sequence network is divided into the last of the H super network blocks.
Optionally, weights are shared among all candidate structures of the super network, wherein training each super network block in turn includes: when the h-th super network block is trained, sampling a first path from the h-th super network block and sampling a second path from the h-1 trained super network blocks before it; and connecting the first path with the second path for training so as to update the weights of the first path, thereby obtaining a trained h-th super network block.
Optionally, the h-th super network block is trained through multiple rounds of iterative training, wherein for each round, each super network block is trained in turn, further including: in any iteration of training the h-th super network block, randomly sampling a first path from the h-th super network block and randomly sampling a second path from the h-1 trained super network blocks before it; and connecting the first path with the second path for that iteration of training, so as to update the weights of the first path based on a stochastic gradient descent algorithm.
Optionally, the robot further has a built-in audio player, wherein the reading operation on the target text includes: converting each of the plurality of characters of the target text into audio data; and playing, through the audio player, the audio data of the corresponding characters in order from left to right and from top to bottom.
According to the artificial-intelligence-based robot control method provided by the embodiments of the application, a text recognition operation can be performed in a designated recognition area on the target plane in response to the voice instruction and the gesture instruction issued by the target object, which provides a convenient and accurate way for the target object to interact with the robot. Because the first instruction range is determined on the target plane by the parallel light beam generated by the parallel light generator, the first instruction range remains stable: since the beam is parallel, its incidence range does not change much with the distance to the target plane, which makes it easier for the target object to give a gesture instruction within the first instruction range. The visual recognition model is first used to determine whether text content to be recognized exists on the target plane within the second instruction range, and only then is the text recognition model used to recognize the characters in the text image containing that content; in this way a lightweight visual recognition model performs the preliminary check and the text recognition model is invoked for high-precision recognition only when the preliminary check passes, which avoids the waste of resources that would result from invoking the text recognition model every time and improves the efficiency and accuracy with which the robot reads text aloud.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Drawings
The foregoing and other objects and features of embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings in which the embodiments are shown, in which:
fig. 1 is a flowchart illustrating an artificial intelligence based robot control method according to an embodiment of the present disclosure.
Detailed Description
The following detailed description is provided to assist the reader in obtaining a thorough understanding of the methods, apparatus, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatus, and/or systems described herein will be apparent after an understanding of the present disclosure. For example, the order of operations described herein is merely an example and is not limited to those set forth herein, but may be altered as will be apparent after an understanding of the disclosure of the application, except for operations that must occur in a specific order. Furthermore, descriptions of features known in the art may be omitted for clarity and conciseness.
The features described herein may be embodied in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein have been provided to illustrate only some of the many possible ways to implement the methods, devices, and/or systems described herein that will be apparent after an understanding of the present disclosure.
Unless defined otherwise, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs after understanding this disclosure. Unless explicitly so defined herein, terms (such as those defined in a general dictionary) should be construed to have meanings consistent with their meanings in the context of the relevant art and the present disclosure, and should not be interpreted idealized or overly formal.
In addition, in the description of the examples, detailed descriptions of well-known related structures or functions are omitted when they would make the explanation of the present disclosure ambiguous.
An artificial intelligence-based robot control method according to an embodiment of the present disclosure will be described in detail with reference to fig. 1.
Fig. 1 is a flowchart illustrating an artificial intelligence based robot control method according to an embodiment of the present disclosure. Here, the robot according to the embodiment of the present disclosure may have a parallel light generator built in. The parallel light generator may be, for example, a collimator, i.e., an optical instrument that produces parallel light beams within a certain diameter range, but the present disclosure is not limited thereto.
Referring to fig. 1, in step S101, a first command range may be determined on a target plane by a parallel light beam generated by a parallel light generator in response to a voice command issued by a target object, so as to prompt the target object to make a gesture command within the first command range. Here, the target object may be an operator of the robot, and the voice command may be a spoken sentence uttered by the operator, for example, but not limited to, a sentence such as "robot start" or "robot read here".
According to embodiments of the present disclosure, text content is recorded on the target plane, which may be, but is not limited to, a book page, an electronic screen, a display screen, a wall surface, and the like. Here, the first instruction range is the incidence range of the parallel light beam on the target plane, that is, the range of the intersection area of the parallel light beam and the target plane; because the area is illuminated, the target object can easily recognize the first instruction range on the target plane by sight, so the region within which a further gesture instruction should be made is clear. Further, the angle between the cross section of the parallel light beam and the target plane is smaller than a preset threshold, and this angle limitation improves the accuracy of the subsequent gesture instruction recognition. The specific value of the preset threshold may be set by those skilled in the art according to the actual accuracy requirements; for example, the preset threshold may be, but is not limited to, 30°.
Next, in step S102, a second instruction range may be determined on the target plane in response to the gesture instruction made by the target object, and then the visual recognition model may be used to determine whether text content to be recognized exists on the target plane within the second instruction range. Here, the second instruction range does not exceed the first instruction range; that is, the second instruction range determined by the gesture instruction must lie within the area corresponding to the first instruction range, and if the range indicated by the gesture instruction exceeds the first instruction range, the part beyond the first instruction range is not used as an object of the subsequent recognition operation. Further, the visual recognition model can simply be built from a convolutional neural network and does not require high recognition accuracy: it only needs to recognize that text exists in a certain area, which saves resources. A person skilled in the art may of course construct a visual recognition model of an appropriate size according to the actual situation, which is not limited in this disclosure.
According to embodiments of the present disclosure, the motion trajectory of the gesture instruction may include at least one of a closed trajectory, a linear trajectory, and a pointing trajectory. Here, the closed trajectory may be a closed shape with no opening, such as a circle or a polygon, but the present disclosure is not limited thereto, and the closed trajectory may also be another shape, such as an irregular shape. The linear trajectory may be a non-closed trajectory such as a straight line or a curve, and the pointing trajectory may be the point to which the gesture is directed.
According to an embodiment of the present disclosure, determining the second instruction range on the target plane may include: when the motion trajectory is a closed trajectory, taking the range enclosed by the closed trajectory within the first instruction range as the second instruction range; when the motion trajectory is a linear trajectory, taking the range located above the linear trajectory within the first instruction range as the second instruction range; and when the motion trajectory is a pointing trajectory, taking a preset range above the pointed-to location within the first instruction range as the second instruction range. Here, the present disclosure does not limit the preset range above the pointed-to location; a person skilled in the art may determine the specific area of the preset range according to the actual accuracy requirements. For example, the preset range may be a square area of 2 square centimeters above the pointed-to location, a circular area of diameter 1, or the like.
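As a concrete illustration of this mapping, the following Python sketch shows how the three trajectory types could be turned into a second instruction range that never exceeds the first instruction range. It is a minimal example under assumed coordinate conventions (a simple axis-aligned bounding box stands in for the region enclosed by a closed trajectory); none of the helper names come from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Box:
    """Axis-aligned region on the target plane (units are illustrative, e.g. centimeters)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def clip_to(self, other: "Box") -> "Box":
        """Intersect with another box so the result never exceeds it."""
        return Box(max(self.x_min, other.x_min), max(self.y_min, other.y_min),
                   min(self.x_max, other.x_max), min(self.y_max, other.y_max))

def second_instruction_range(trajectory: List[Point], kind: str,
                             first_range: Box, pointing_extent: float = 1.0) -> Box:
    """Map a gesture trajectory to the second instruction range.

    kind: "closed"   -> bounding box of the enclosed trajectory (a simplification of
                        "the range wrapped by the closed track")
          "linear"   -> the band above the drawn line, up to the top of the first range
          "pointing" -> a preset region above the pointed-to location
    The result is always clipped to first_range, so it cannot exceed it.
    """
    xs = [p[0] for p in trajectory]
    ys = [p[1] for p in trajectory]
    if kind == "closed":
        region = Box(min(xs), min(ys), max(xs), max(ys))
    elif kind == "linear":
        region = Box(min(xs), max(ys), max(xs), first_range.y_max)
    elif kind == "pointing":
        x, y = trajectory[-1]
        region = Box(x - pointing_extent, y, x + pointing_extent, y + 2 * pointing_extent)
    else:
        raise ValueError(f"unknown trajectory kind: {kind}")
    return region.clip_to(first_range)

# Example: a pointing gesture inside a 20 x 15 first instruction range.
first = Box(0.0, 0.0, 20.0, 15.0)
print(second_instruction_range([(8.0, 5.0)], "pointing", first))
```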
Next, in step S103, in the case that the text content to be recognized exists in the second instruction range, a text image including the text content to be recognized may be obtained, and the character in the text image may be recognized by using the text recognition model, so as to obtain a target text, so that a reading operation is performed on the target text. Here, the text image including the text content to be recognized may be acquired by, but not limited to, a camera or the like provided on the robot, which is not limited by the present disclosure. Further, the text recognition model may include an image correction unit, a feature extraction unit, and a text output unit.
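To make the two-stage use of the models concrete, the lightweight visual recognition model from step S102 acts as a cheap gate in front of the heavier text recognition model invoked in step S103. The sketch below shows one way to organize that gating; the model and player interfaces are hypothetical placeholders assumed for illustration, not APIs from the disclosure.

```python
from typing import Any, Callable, Optional

def read_region_aloud(
    capture_text_image: Callable[[], Any],   # e.g. a camera crop of the second instruction range
    has_text: Callable[[Any], bool],         # lightweight visual recognition model (pre-check)
    recognize_text: Callable[[Any], str],    # full text recognition model (invoked only if needed)
    speak: Callable[[str], None],            # audio player / text-to-speech backend
) -> Optional[str]:
    """Only call the expensive text recognition model when the cheap visual
    pre-check reports that text is present within the second instruction range."""
    image = capture_text_image()
    if not has_text(image):
        return None                          # early exit saves a text-recognition call
    target_text = recognize_text(image)
    speak(target_text)
    return target_text

# Toy usage with stub models standing in for the real ones:
print(read_region_aloud(
    capture_text_image=lambda: "fake-image",
    has_text=lambda img: True,
    recognize_text=lambda img: "hello world",
    speak=lambda text: None,
))
```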
According to the embodiments of the present disclosure, the input to the text recognition model is a text image, and a text image captured by a device such as a camera usually exhibits some distortion. The image correction unit can therefore serve as a preprocessing unit that corrects the text image, converting it into a form that is easier to recognize and yielding a corrected image. The feature extraction unit is used to extract features of the corrected image to obtain text features. The text output unit may be a recognition head that outputs a character sequence according to the text features, thereby producing the target text.
The feature extraction unit of a text recognition model extracts features from the text image. A typical feature extraction unit generally combines a CNN (convolutional neural network) with an RNN (recurrent neural network): the CNN is the spatial network part of the feature extraction unit and extracts spatial features from the text image, while the RNN is the sequence network part and enhances the spatial features to generate a feature sequence with long-range sequence dependence.
The feature extraction unit according to an embodiment of the present disclosure may include a plurality of convolutional network layers and a plurality of Transformer network layers. The convolutional network layers may serve as the spatial network of the feature extraction unit for extracting spatial features of the corrected image, and the Transformer network layers may serve as the sequence network of the feature extraction unit for extracting sequence features of the corrected image. The sequence network part uses multiple Transformer network layers instead of the traditional RNN in order to increase the parallelism of model inference.
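A minimal PyTorch-style sketch of such a feature extraction unit is given below. The channel widths, layer counts, pooling step, and the way the height dimension is collapsed into a sequence are illustrative assumptions, not the searched architecture described next.

```python
import torch
from torch import nn

class FeatureExtractionUnit(nn.Module):
    """Spatial network (convolutional layers) followed by a sequence network
    (Transformer encoder layers). Sizes and depths here are assumptions."""

    def __init__(self, in_channels: int = 1, d_model: int = 128) -> None:
        super().__init__()
        # Spatial network: convolutional layers with down-sampling, so a text-line
        # image collapses toward a 1-pixel-high feature map.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # height -> 1, keep width as sequence length
        )
        # Sequence network: Transformer encoder layers instead of an RNN, so the
        # sequence positions can be processed in parallel.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.sequence = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.spatial(images)                 # (B, C, 1, W)
        seq = feats.squeeze(2).permute(0, 2, 1)      # (B, W, C): width becomes the sequence axis
        return self.sequence(seq)                    # (B, W, C) sequence features

# Example: a batch of two 32x128 grayscale corrected text images.
x = torch.randn(2, 1, 32, 128)
print(FeatureExtractionUnit()(x).shape)   # torch.Size([2, 64, 128])
```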
According to an embodiment of the present disclosure, the feature extraction unit may be obtained by: constructing a first search space corresponding to the convolutional network layers and a second search space corresponding to the Transformer network layers; searching, by neural architecture search (NAS), for the internal parameters of each convolutional network layer in the first search space and for the internal structure of each Transformer network layer in the second search space; and obtaining the feature extraction unit based on the searched internal parameters of each convolutional network layer and the searched internal structure of each Transformer network layer. Here, the first search space is the search space for the spatial network of the feature extraction unit, and the second search space is the search space for the sequence network of the feature extraction unit, which deepens the customization for the text recognition task.
According to embodiments of the present disclosure, the internal parameters of a convolutional network layer may include a convolution type parameter and a sampling path parameter. Here, the convolution type parameter may include a kernel size and an expansion factor, and the sampling path parameter may include a first sampling step in the vertical direction and a second sampling step in the horizontal direction. Accordingly, the first search space may include a plurality of candidate convolution type parameters and a plurality of candidate sampling path parameters, so that when the internal parameters of each convolutional network layer are searched in the first search space, the convolution type parameters and sampling path parameters of each layer can be selected from these candidates by traversal. Further, the kernel size may be 3 or 5, the expansion factor may be 1 or 6, and the first and second sampling steps may each be 1 or 2.
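For illustration, the first search space could be enumerated as below. The candidate values match those listed above, while the dictionary keys and the per-layer framing are assumptions.

```python
from itertools import product
from typing import Dict, List

def build_first_search_space() -> List[Dict[str, int]]:
    """Enumerate candidate internal parameters for one convolutional network layer:
    convolution type (kernel size, expansion factor) and sampling path
    (vertical / horizontal sampling steps)."""
    kernel_sizes = [3, 5]
    expansion_factors = [1, 6]
    strides = [1, 2]                  # candidates for both sampling directions
    space = []
    for k, e, sv, sh in product(kernel_sizes, expansion_factors, strides, strides):
        space.append({
            "kernel_size": k,
            "expansion_factor": e,
            "vertical_stride": sv,    # first sampling step
            "horizontal_stride": sh,  # second sampling step
        })
    return space

candidates = build_first_search_space()
print(len(candidates))   # 2 * 2 * 2 * 2 = 16 candidate configurations per layer
print(candidates[0])
```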
According to embodiments of the present disclosure, the internal structure of a Transformer network layer may include a multi-head self-attention (MHSA) unit and a feedforward network (FFN). Accordingly, the second search space may include a plurality of candidate multi-head self-attention operators and a plurality of candidate feedforward networks, so that when the internal structure of each Transformer network layer is searched in the second search space, the multi-head self-attention operator and feedforward network of each layer can be selected from these candidates by traversal. Here, the feedforward network may include a multilayer perceptron (MLP) unit or a gated linear unit (GLU). Further, the attention score matrix of the multi-head self-attention unit can be determined by the following equation (1):
where M is the attention score matrix, n is the layer index of the current Transformer network layer (so that M_n is the attention score matrix of the n-th Transformer network layer), W is the query matrix, W^T is the transpose of W, Y is the key matrix, R is the relative distance embedding matrix, P is the scaling factor, x takes one of two candidate values (one of which is 1), and y and z take values of 0 or 1; the specific values of these parameters can be determined by the traversal search.
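Equation (1) itself is not reproduced in the text above. Based solely on the variable definitions, one plausible reading, stated here as an assumption rather than the patent's literal equation, is an additively gated score of the form

M_n = \frac{x \, W Y^{T} + y \, W W^{T} + z \, R}{P}

in which the query-key term is weighted by x while the query-query and relative-distance terms are switched on or off by the binary coefficients y and z; the exact combination used in the patent may differ.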
It should be appreciated that the first search space and the second search space correspond to the spatial network part and the sequence network part of the feature extraction unit, respectively, but they are not treated separately: the two search spaces are connected in series and together form the search space of the feature extraction unit.
The neural network structure search according to embodiments of the present disclosure may include: constructing a super network applied to text recognition; and dividing the super network into H super network blocks in order and training each super network block in turn to obtain a trained super network, so that the feature extraction unit can be obtained by searching from the trained super network. Here, the super network can be constructed in a weight-sharing manner, so that it includes a plurality of candidate structures whose weights are shared; the super network then only needs to be trained once, and each candidate structure, as a sub-network, can simply inherit its weights from the trained super network, which avoids the waste of resources that training every candidate structure independently would cause. Further, the super network may include a spatial network corresponding to the convolutional network layers and a sequence network corresponding to the Transformer network layers.
According to the embodiments of the present disclosure, the spatial network can be divided, in order, into the first H-1 of the H super network blocks, and the sequence network can be divided into the last of the H super network blocks, which avoids an unreasonable partition of the super network caused by the fact that, in practice, the spatial model is far larger than the sequence model. Here, training each super network block in turn may include: when the h-th super network block is trained, sampling a first path from the h-th super network block and sampling a second path from the h-1 trained super network blocks before it; and connecting the first path with the second path for training so as to update the weights of the first path, thereby obtaining a trained h-th super network block. Training the super network blocks step by step in this progressive manner can greatly reduce the training difficulty.
According to an embodiment of the present disclosure, the h-th super network block may be trained through multiple rounds of iterative training, where for each round, training each super network block in turn may further include: in any iteration of training the h-th super network block, randomly sampling a first path from the h-th super network block and randomly sampling a second path from the h-1 trained super network blocks before it; and connecting the first path with the second path for that iteration so as to update the weights of the first path based on a stochastic gradient descent (SGD) algorithm. By randomly sampling paths over the multiple iterations for each super network block, the candidate structures in the super network can be fully trained, and the huge search space and low efficiency that retraining after uniform sampling would cause are avoided.
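The progressive, block-wise training with randomly sampled paths might be organized roughly as in the sketch below. The block and path representation, the loss, and the toy data are assumptions made for illustration, and each candidate sub-path here owns its weights rather than sharing them as a real weight-sharing super network would.

```python
import random
import torch
from torch import nn

def train_supernet_blockwise(blocks, train_batches, num_iters=100, lr=0.01):
    """Train H super network blocks in order. For block h, each iteration samples a
    fresh first path inside block h (its weights get updated) and a random second
    path through the h-1 already-trained blocks in front of it (weights frozen)."""
    for h, block in enumerate(blocks):
        trained_prefix = blocks[:h]
        for _ in range(num_iters):
            inputs, targets = random.choice(train_batches)
            prefix_paths = [random.choice(b) for b in trained_prefix]   # second path
            current_path = random.choice(block)                         # first path
            optimizer = torch.optim.SGD(current_path.parameters(), lr=lr)

            x = inputs
            with torch.no_grad():               # the prefix blocks are already trained
                for module in prefix_paths:
                    x = module(x)
            logits = current_path(x)            # connect the first path behind the second path
            loss = nn.functional.cross_entropy(logits, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                    # stochastic gradient descent update
    return blocks

# Toy usage: two blocks, each holding two candidate sub-paths.
blocks = [
    [nn.Linear(8, 8), nn.Linear(8, 8)],
    [nn.Linear(8, 3), nn.Linear(8, 3)],
]
data = [(torch.randn(4, 8), torch.randint(0, 3, (4,))) for _ in range(5)]
train_supernet_blockwise(blocks, data, num_iters=10)
```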
According to the embodiments of the present disclosure, the robot further has a built-in audio player. A person skilled in the art may configure the audio player according to the actual situation as long as it is able to play audio data, which is not limited by the present disclosure. Further, performing a reading operation on the target text may include: converting each of the plurality of characters of the target text into audio data; and playing, through the audio player, the audio data of the corresponding characters in order from left to right and from top to bottom.
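A minimal sketch of this reading order is shown below; the character-to-audio conversion and the playback call are stand-ins, since the disclosure does not fix a particular text-to-speech engine or audio API.

```python
from typing import Callable, List, Tuple

# (character, x position, y position) of each recognized character on the page;
# y is assumed to grow downward, as in image coordinates.
Char = Tuple[str, float, float]

def read_aloud(characters: List[Char],
               to_audio: Callable[[str], bytes],
               play: Callable[[bytes], None]) -> None:
    """Convert each character of the target text to audio and play the clips in
    reading order: line by line from top to bottom, left to right within a line.
    (A real layout would group characters into lines with a vertical tolerance.)"""
    for ch, _x, _y in sorted(characters, key=lambda c: (c[2], c[1])):
        play(to_audio(ch))

# Toy usage with stub conversion and playback: prints "robo".
chars = [("o", 1.0, 0.0), ("r", 0.0, 0.0), ("o", 1.0, 1.0), ("b", 0.0, 1.0)]
read_aloud(chars, to_audio=lambda c: c.encode(), play=lambda b: print(b.decode(), end=""))
print()
```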
According to the artificial-intelligence-based robot control method provided by the embodiments of the application, a text recognition operation can be performed in a designated recognition area on the target plane in response to the voice instruction and the gesture instruction issued by the target object, which provides a convenient and accurate way for the target object to interact with the robot. Because the first instruction range is determined on the target plane by the parallel light beam generated by the parallel light generator, the first instruction range remains stable: since the beam is parallel, its incidence range does not change much with the distance to the target plane, which makes it easier for the target object to give a gesture instruction within the first instruction range. The visual recognition model is first used to determine whether text content to be recognized exists on the target plane within the second instruction range, and only then is the text recognition model used to recognize the characters in the text image containing that content; in this way a lightweight visual recognition model performs the preliminary check and the text recognition model is invoked for high-precision recognition only when the preliminary check passes, which avoids the waste of resources that would result from invoking the text recognition model every time and improves the efficiency and accuracy with which the robot reads text aloud.
An artificial intelligence-based robot control method according to an embodiment of the present disclosure may be written as a computer program and stored on a computer-readable storage medium. The artificial intelligence based robot control method as described above may be implemented when the computer program is executed by a processor. Examples of the computer readable storage medium include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid-state drives (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or eXtreme Digital (XD) cards), magnetic tape, floppy disks, magneto-optical data storage devices, hard disks, solid-state disks, and any other device configured to store the computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. In one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner by one or more processors or computers.
Although a few embodiments of the present disclosure have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the disclosure, the scope of which is defined in the claims and their equivalents.
Claims (10)
1. The robot control method based on artificial intelligence is characterized in that a parallel light generator is arranged in the robot, and the robot control method comprises the following steps:
responding to a voice command sent by a target object, determining a first command range on a target plane through a parallel light beam generated by the parallel light generator, and prompting the target object to make a gesture command within the first command range, wherein the first command range is an incidence range of the parallel light beam on the target plane, and an angle between a cross section of the parallel light beam and the target plane is smaller than a preset threshold value;
determining a second instruction range on the target plane in response to the gesture instruction made by the target object, and then determining whether text content to be recognized exists on the target plane in the second instruction range by utilizing a visual recognition model, wherein the second instruction range does not exceed the first instruction range;
and under the condition that the text content to be recognized exists in the second instruction range, acquiring a text image containing the text content to be recognized, and performing recognition operation on characters in the text image by using a text recognition model to obtain a target text, so as to perform reading operation on the target text.
2. The robot control method of claim 1, wherein the motion trajectory of the gesture instruction comprises at least one of a closed track, a linear track, and a pointing track, wherein the determining the second instruction range on the target plane comprises:
when the action track is a closed track, taking a range wrapped by the closed track in the first instruction range as the second instruction range;
when the motion track is a linear track, taking a range which is positioned above the linear track in the first instruction range as the second instruction range;
and when the action track is a pointing track, taking a preset range above the pointed track in the first instruction range as the second instruction range.
3. The robot control method of claim 1, wherein the text recognition model comprises:
the image correction unit is used for correcting the text image to obtain a corrected image;
the feature extraction unit is used for extracting features of the corrected image to obtain text features;
the text output unit is used for outputting a character sequence according to the text characteristics to obtain the target text;
the feature extraction unit comprises a plurality of convolution network layers and a plurality of Transformer network layers, wherein the plurality of convolution network layers are used for extracting the spatial features of the corrected image, and the plurality of Transformer network layers are used for extracting the sequence features of the corrected image.
4. A robot control method according to claim 3, wherein the feature extraction unit is obtained by:
constructing a first search space corresponding to the convolution network layer and a second search space corresponding to the Transformer network layer;
searching, by neural network structure searching, for internal parameters of each convolutional network layer in the first search space, and for internal structures of each Transformer network layer in the second search space;
and obtaining the feature extraction unit based on the searched internal parameters of each convolution network layer and the internal structure of each Transformer network layer.
5. The robot control method of claim 4, wherein the internal parameters comprise a convolution type parameter and a sampling path parameter, wherein the convolution type parameter comprises a kernel size and a coefficient of expansion, the sampling path parameter comprises a first sampling step in a vertical direction and a second sampling step in a horizontal direction, wherein the kernel size comprises 3 or 5, the coefficient of expansion comprises 1 or 6, and the first sampling step and the second sampling step comprise 1 or 2.
6. The robot control method of claim 4, wherein the internal structure comprises a multi-head self-attention unit and a feedforward neural network, wherein the feedforward neural network comprises a multi-layer perceptron unit or a gated linear unit, and wherein an attention score matrix of the multi-head self-attention unit is determined by an equation in which M is the attention score matrix, n is the layer index of the current Transformer network layer, W is the query matrix, Y is the key matrix, R is the relative distance embedding matrix, P is the scale factor, x takes one of two candidate values (one of which is 1), and y and z take values of 0 or 1.
7. The robot control method of claim 4, wherein the neural network structure search comprises:
constructing a super network applied to text recognition, wherein the super network comprises a space network corresponding to the convolution network layer and a sequence network corresponding to the Transformer network layer;
orderly dividing the super network into H super network blocks, and training each super network block in turn to obtain a trained super network so as to search from the trained super network to obtain the feature extraction unit;
wherein the spatial network is divided, in order, into the first H-1 of the H super network blocks, and the sequence network is divided into the last of the H super network blocks.
8. The robot control method of claim 7, wherein weights are shared among all candidate structures of the super network, wherein the training each super network block in turn comprises:
when an h-th super network block is trained, sampling a first path from the h-th super network block, and sampling a second path from the h-1 trained super network blocks before the h-th super network block;
and connecting the first path with the second path for training so as to update the weight of the first path, thereby obtaining a trained h-th super network block.
9. The robot control method of claim 8, wherein the h-th super network block is trained through multiple rounds of iterative training, wherein for each round of iterative training, training each super network block in turn further comprises:
in any iterative training process of the h-th super network block, randomly sampling a first path from the h-th super network block and randomly sampling a second path from the h-1 trained super network blocks before the h-th super network block;
and connecting the first path with the second path for the iterative training so as to update the weight of the first path based on a stochastic gradient descent algorithm.
10. The robot control method of claim 1, wherein the robot further has a built-in audio player, wherein the reading operation on the target text comprises:
converting each character of the plurality of characters of the target text into audio data, respectively;
and playing the audio data of the corresponding characters in the plurality of characters sequentially from left to right and from top to bottom through the audio player.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311164021.2A CN116945191B (en) | 2023-09-11 | 2023-09-11 | Robot control method based on artificial intelligence |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311164021.2A CN116945191B (en) | 2023-09-11 | 2023-09-11 | Robot control method based on artificial intelligence |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116945191A true CN116945191A (en) | 2023-10-27 |
| CN116945191B CN116945191B (en) | 2024-08-02 |
Family
ID=88446484
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311164021.2A Active CN116945191B (en) | 2023-09-11 | 2023-09-11 | Robot control method based on artificial intelligence |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116945191B (en) |
Patent Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2010244480A (en) * | 2009-04-10 | 2010-10-28 | Toyota Motor Corp | Control device and control method based on gesture recognition |
| CN109522835A (en) * | 2018-11-13 | 2019-03-26 | 北京光年无限科技有限公司 | Children's book based on intelligent robot is read and exchange method and system |
| JP2021022367A (en) * | 2019-07-29 | 2021-02-18 | 富士通株式会社 | Image processing method and information processor |
| CN111291761A (en) * | 2020-02-17 | 2020-06-16 | 北京百度网讯科技有限公司 | Method and apparatus for recognizing text |
| KR20220023576A (en) * | 2020-08-21 | 2022-03-02 | 주식회사 엔씨소프트 | Learning device, learning method, device and method for text recognition |
| CN113656563A (en) * | 2021-07-15 | 2021-11-16 | 华为技术有限公司 | A kind of neural network search method and related equipment |
| CN114283411A (en) * | 2021-12-20 | 2022-04-05 | 北京百度网讯科技有限公司 | Text recognition method, and training method and device of text recognition model |
| CN114022882A (en) * | 2022-01-04 | 2022-02-08 | 北京世纪好未来教育科技有限公司 | Text recognition model training method, text recognition device, text recognition equipment and medium |
| CN115171122A (en) * | 2022-07-19 | 2022-10-11 | 北京有竹居网络技术有限公司 | Point reading processing method, device, equipment and medium |
| CN116092092A (en) * | 2022-12-21 | 2023-05-09 | 北京闪星科技有限公司 | Matching method, device, medium and electronic equipment |
| CN116645625A (en) * | 2023-03-16 | 2023-08-25 | 云景技术有限公司 | Target tracking method based on convolution transducer combination |
| CN116563681A (en) * | 2023-05-09 | 2023-08-08 | 安徽理工大学 | Gaze estimation detection algorithm based on attention crossing and two-way feature fusion network |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116945191B (en) | 2024-08-02 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2024-06-27 | TA01 | Transfer of patent application right | Applicant before: Chongqing beiruixing Technology Co.,Ltd., 15-2-4-2, Chongqing Advertising Industrial Park, No. 18, Food City Avenue, Yubei District, Chongqing, 401120, China. Applicant after: Beijing longyifeng Technology Co.,Ltd., Room 307, floor 3, block C, No. 8, malianwa North Road, Haidian District, Beijing 100089, China. |
| | GR01 | Patent grant | |