
CN114422743B - Video stream display method, device, computer equipment and storage medium - Google Patents

Video stream display method, device, computer equipment and storage medium

Info

Publication number
CN114422743B
CN114422743B
Authority
CN
China
Prior art keywords
video stream
displayed
target
area
sound source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111583153.XA
Other languages
Chinese (zh)
Other versions
CN114422743A (en)
Inventor
余力丛
于勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huizhou Shiwei New Technology Co Ltd
Original Assignee
Huizhou Shiwei New Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huizhou Shiwei New Technology Co Ltd filed Critical Huizhou Shiwei New Technology Co Ltd
Priority to CN202111583153.XA priority Critical patent/CN114422743B/en
Publication of CN114422743A publication Critical patent/CN114422743A/en
Application granted granted Critical
Publication of CN114422743B publication Critical patent/CN114422743B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract


The embodiments of the present application disclose a video stream display method, apparatus, computer device, and storage medium. The embodiments obtain multiple video streams of the current scene and a sound source position, where each video stream corresponds to an image acquisition area; determine a target video stream from the multiple video streams according to the sound source position and the image acquisition areas; identify a target object from the target video stream, the target object being an object with lip movement; determine a video stream to be displayed from the multiple video streams according to the identification result for the target object; and display the picture corresponding to the video stream to be displayed. Determining the target video stream used to identify the speaker from the sound source position improves the efficiency of identifying the speaker, and determining the video stream to be displayed from the identification result keeps the displayed picture focused on the speaker, presenting a better conference picture.

Description

Video stream display method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video stream display method, apparatus, computer device, and storage medium.
Background
With the development of video technology, more and more settings capture live footage in real time through cameras and play it back. However, in scenes with multiple participants, the pictures captured by the cameras usually cannot highlight what matters most in the current scene.
Especially in a multi-person conference, several different speakers take turns during the meeting, and how to keep the displayed picture focused on the current speaker so as to present a better conference picture is a problem that currently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a video stream display method, a video stream display device, computer equipment and a storage medium, which can enable a displayed picture to be focused on a speaker and present a better conference picture.
The embodiment of the application provides a video stream display method, which comprises the following steps: obtaining multiple video streams of a current scene and a sound source position, wherein each video stream corresponds to one image acquisition area; determining a target video stream from the multiple video streams according to the sound source position and the image acquisition areas; identifying a target object from the target video stream, the target object being an object with lip motion; determining a video stream to be displayed from the multiple video streams according to the identification result for the target object; and displaying the picture corresponding to the video stream to be displayed.
The embodiment of the application also provides a video stream display device, which comprises an acquisition unit, a first determination unit, a recognition unit, a second determination unit and a display unit. The acquisition unit is used for acquiring multiple video streams of a current scene and a sound source position, wherein each video stream corresponds to one image acquisition area; the first determination unit is used for determining a target video stream from the multiple video streams according to the sound source position and the image acquisition areas; the recognition unit is used for recognizing a target object from the target video stream, the target object being an object with lip motion; the second determination unit is used for determining a video stream to be displayed from the multiple video streams according to the recognition result for the target object; and the display unit is used for displaying the picture corresponding to the video stream to be displayed.
The embodiment of the application also provides computer equipment, which comprises a memory and a processor, wherein a plurality of instructions are stored in the memory, and the processor loads the instructions from the memory so as to execute the steps in any video stream display method provided by the embodiment of the application.
The embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any video stream display method provided by the embodiment of the application.
The embodiment of the application can acquire multiple video streams of a current scene and a sound source position, where each video stream corresponds to one image acquisition area; determine a target video stream from the multiple video streams according to the sound source position and the image acquisition areas; identify a target object from the target video stream, the target object being an object with lip motion; determine a video stream to be displayed from the multiple video streams according to the identification result for the target object; and display the picture corresponding to the video stream to be displayed. According to the application, the target video stream used for identifying the speaker is determined by the sound source position, which improves the efficiency of identifying the speaker; meanwhile, the video stream to be displayed is determined according to the identification result, so that the displayed picture is focused on the speaker and a better conference picture is presented.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a video stream display system according to an embodiment of the present application;
fig. 2 is a flow chart of a video stream display method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video stream display system according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a data processing module according to an embodiment of the present application;
fig. 5 is a flowchart of a video stream display method according to another embodiment of the present application;
Fig. 6 is a schematic structural diagram of a video stream display device according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a video stream display method, a video stream display device, computer equipment and a storage medium.
The video stream display device can be integrated in an electronic device, and the electronic device can be a terminal, a server and other devices. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer (Personal Computer, PC) or other devices, and the server can be a single server or a server cluster formed by a plurality of servers.
In some embodiments, the video stream display apparatus may also be integrated in a plurality of electronic devices, for example, the video stream display apparatus may be integrated in a plurality of servers, and the video stream display method of the present application is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to fig. 1, in some embodiments, a scene diagram of a video stream display system is provided, and the video stream display system may include a data acquisition module 1000, a server 2000, and a terminal 3000.
The data acquisition module can acquire multiple paths of video streams of the current scene and the sound source position, and each video stream corresponds to one image acquisition area.
The server can determine a target video stream from multiple paths of video streams according to the sound source position and the image acquisition area, identify a target object according to the target video stream, wherein the target object is an object with lip action, and determine the video stream to be displayed from the multiple paths of video streams according to the identification result of the target object.
The terminal may display a picture corresponding to the video stream to be displayed.
The following will describe in detail. The numbers of the following examples are not intended to limit the preferred order of the examples.
In this embodiment, a video stream display method is provided, as shown in fig. 2, and the specific flow of the video stream display method may be as follows:
110. and acquiring multiple paths of video streams and sound source positions of the current scene, wherein each video stream corresponds to one image acquisition area.
Where the sound source position refers to the position where the sound is emitted in the current scene, for example, the position where the speaking sound may be emitted in the conference scene. Sound can be collected by setting a microphone array in the current scene and calculating the sound source position according to a sound source localization algorithm.
In some embodiments, the multiple video streams include one panoramic video stream and at least one near-view video stream. The image acquisition area refers to the range of the current scene whose image can be captured by the image acquisition device corresponding to the video stream. The panoramic video stream is a video stream containing the panoramic picture of the current scene; its image acquisition area is the whole current scene, and it can be captured by a camera with a wide-angle lens. The near-view video stream is a video stream containing a local part of the current scene; its image acquisition area is that local part, and it can be captured by a camera with a telephoto (long-focus) lens.
In some embodiments, the method for obtaining the sound source position may include steps 1.1 to 1.2, as follows:
1.1, collecting sound information of a current scene;
And 1.2, processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.
Among them, the sound source localization algorithm may employ TDOA (Time Difference of Arrival) based methods such as GCC-PHAT (Generalized Cross-Correlation with Phase Transform), and the like.
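As an illustration of the GCC-PHAT approach mentioned above, the following is a minimal sketch, not from the patent (the function name and parameters are my own), that estimates the time delay between two microphone signals; a bearing angle can then be derived from the delay:

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size                    # FFT length covering the full correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                     # PHAT weighting: keep only the phase
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # lag in samples
    return shift / fs                          # delay in seconds
```

For a two-microphone pair with spacing d, the estimated delay τ maps to a direction via θ = arcsin(τ·c/d), with c the speed of sound; a full array combines several such pairs.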
120. And determining a target video stream from the multiple video streams according to the sound source position and the image acquisition area.
The target video stream is a video stream determined according to the sound source position and the association relation of the image acquisition areas. The association relationship may be that the sound source position is located in the image acquisition area, or that the distance between the sound source position and the center of the image acquisition area is smaller than a preset distance.
In some embodiments, step 120 may include: determining a target image acquisition area from the image acquisition areas corresponding to the multiple video streams according to the association relationship between the sound source position and the image acquisition areas, and determining the video stream corresponding to the target image acquisition area as the target video stream.
In some embodiments, due to interference such as reflection and noise, there may be an error in the sound source position determined by sound source localization. Therefore, a possible area where the sound source exists is determined from the sound source position, the target video stream corresponding to that area is determined, and the accuracy of the acquired image information is increased. Specifically, step 120 may include steps 2.1 to 2.4, as follows:
2.1, determining a sound source area according to the sound source position;
2.2, determining an overlapping area of a sound source area and an image acquisition area for each video stream;
2.3, determining an overlapping area meeting the preset first area size as a target area;
And 2.4, determining the video stream corresponding to the target area as a target video stream.
The sound source region refers to a region where the sound source position is located, and the region and the image acquisition region are located on the same plane. The sound source region may be determined according to a sound source position and a preset region parameter value, which may be set according to a current scene or experience, for example, a circular region with a circle center being the sound source position and a preset radius value being the radius is taken as the sound source region, and so on.
In some embodiments, the step 2.1 may include the steps of obtaining a reference point, determining an included angle satisfying a preset first angle by using the reference point as a vertex and a line connecting the reference point and the sound source position as an angular bisector, and determining an area corresponding to the included angle in the current scene as a sound source area. The reference point may be any one boundary point in the current scene. In some embodiments, the reference point may be a location point where sound information of the current scene is collected, for example, a point determined according to a microphone array for measuring a sound source location, and the point may be any location point on the microphone array or may be a midpoint. The reference point, the sound source position, the image acquisition area, the sound source area and the target area are all located on the same plane, and may be a horizontal plane, for example, the sound source position is a position point of the real sound source position calculated by the sound source positioning algorithm projected onto the horizontal plane.
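The sector construction in step 2.1 above — an included angle whose bisector is the line from the reference point to the sound source position — can be sketched as follows (a hypothetical helper; the patent does not prescribe a coordinate system):

```python
import math

def source_sector(source_xy, ref_xy=(0.0, 0.0), angle_deg=30.0):
    """Sector (start, end bearings in degrees) whose angular bisector is the
    line from the reference point to the sound source position."""
    dx = source_xy[0] - ref_xy[0]
    dy = source_xy[1] - ref_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))  # direction of the bisector
    return (bearing - angle_deg / 2, bearing + angle_deg / 2)
```

For example, a source at 45° from the reference point with a 30° preset first angle yields the sector from 30° to 60°.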
The preset first area size is an area size condition set according to the current scene or experience. The specific value may be, for example, one third or more of the size of the image acquisition area corresponding to any one video stream, or may be one half or more of the size of the area determined according to the sound source area, for example, the size of the sound source area.
In some embodiments, step 2.3 may include the step of determining overlapping regions where the sound source regions are the same size as the target region.
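Steps 2.1 to 2.4 can be sketched in one dimension of bearing angles, assuming each image acquisition area is described by an angular interval. This is an illustrative simplification; the stream names and the half-overlap threshold are assumptions, not from the patent:

```python
def angular_overlap(a, b):
    """Width (degrees) of the overlap of two angular intervals (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def select_target_streams(source_angle, capture_sectors,
                          sector_width=30.0, min_ratio=0.5):
    """Keep streams whose capture sector overlaps the sound source sector
    by at least `min_ratio` of the source sector's width (steps 2.2-2.4)."""
    source = (source_angle - sector_width / 2, source_angle + sector_width / 2)
    targets = []
    for stream_id, cap in capture_sectors.items():
        if angular_overlap(source, cap) >= min_ratio * sector_width:
            targets.append(stream_id)
    return targets
```

With a wide-angle sector of 0°-120° and a telephoto sector of 40°-70°, a source at 55° selects both streams, while a source at 10° selects only the wide-angle stream.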
130. And identifying a target object according to the target video stream, wherein the target object is an object with lip actions.
The target object is an object with lip motion identified from the image information of the target video stream. In general, a person whose lips are moving is speaking and can therefore be taken as the speaker in the current scene. The lip motion can be detected as the lip movement a person makes when speaking, using existing lip-movement detection techniques.
Because the image acquisition areas corresponding to different video streams are different, the target video stream for identifying the speaker is determined through the sound source position, so that the identification data volume can be reduced, and the efficiency of identifying the speaker is improved.
In some embodiments, due to interference such as reflection and noise, the sound source position determined by sound source localization may have errors. Therefore, the area used for identifying the speaker is determined from the sound source position so that the corresponding image information can be obtained from the target video stream, increasing the accuracy of recognition. Specifically, step 130 may include steps 3.1 to 3.3, as follows:
3.1, determining an identification area according to the sound source position;
3.2, acquiring target image information from the target video stream according to the identification area, wherein the target image information is the image information corresponding to the identification area;
And 3.3, identifying the target object according to the target image information.
The recognition area is an area which is determined according to the position of the sound source and used for recognizing the target object, and the area and the image acquisition area are positioned on the same plane. The sound source position is located in the identification area, the identification area can be determined according to the sound source position and a preset area parameter value, the preset area parameter value can be set according to the current scene or experience, for example, a circular area with the sound source position as a center and a preset radius value as a radius is used as the identification area, and the like. The identification area may also be a sound source area.
In some embodiments, the step 3.1 may include the steps of obtaining a reference point, determining an included angle satisfying a preset second angle by using the reference point as a vertex and a line connecting the reference point and the sound source position as an angular bisector, and determining an area corresponding to the included angle in the current scene as an identification area.
The target image information refers to image information of an identification area projected into an area of a picture acquired by a target video stream. Specifically, the coordinate position of the identification area can be obtained, the coordinate position is projected to a coordinate system where a picture acquired by the target video stream is located, so that a projected area is obtained, and image information in the area is taken as target image information.
And determining an identification area possibly containing the target object through the sound source position, acquiring image information corresponding to the identification area from the target video stream through the identification area, and identifying whether the target object exists in the image information.
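The projection of the identification area into the coordinate system of the target stream's picture might, under a simple scale-and-offset calibration, look like the sketch below. This calibration model is an assumption for illustration; a real system would use a full camera projection:

```python
def project_region(region, world_to_frame):
    """Map a world-plane rectangle (x0, y0, x1, y1) into frame pixel
    coordinates using a scale-and-offset calibration (sx, sy, ox, oy)."""
    sx, sy, ox, oy = world_to_frame
    x0, y0, x1, y1 = region
    return (x0 * sx + ox, y0 * sy + oy, x1 * sx + ox, y1 * sy + oy)
```

The image information inside the projected rectangle is then taken as the target image information for recognition.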
In some embodiments, in order to improve the recognition efficiency, step 3.3 may include steps 3.3.1 to 3.3.2, as follows:
3.3.1, when the object with the lip action is identified from the target image information, taking the object with the lip action as a target object;
and 3.3.2, when the object with the lip action is not identified from the target image information, expanding the identification area to a preset second area size so as to identify the target object.
For example, a sector area having an included angle of 30 ° is set first, and when no target object is recognized in the area, the identification area is enlarged to a sector area having an included angle of 40 °, recognition is performed again, and when no target object is recognized in the area, the identification area is enlarged to a sector area having an included angle of 50 °, and so on until the target object is recognized or the identification area is enlarged to an upper limit value.
Since there may be an error in the sound source position determined by the sound source positioning, when the lip motion recognition is performed, the preset recognition area may not recognize the target object, and at this time, by gradually expanding the size of the recognition area, the recognition range can be expanded, so as to correct the recognition result, and at the same time, by gradually expanding the size of the recognition area, the area to be recognized each time can be smaller than the next recognition, so that the recognition result can be obtained as small as possible in the smallest area, and the recognition efficiency is improved.
In some embodiments, in order to further improve recognition efficiency, step 3.3.2 may include the following steps: when no object with lip motion is recognized from the target image information, expanding the recognition area to the preset second area size to obtain an expanded area; taking the non-overlapping part of the recognition area and the expanded area as the target recognition area; acquiring target image information from the target video stream according to the target recognition area, the target image information being the image information corresponding to the target recognition area; and recognizing the target object according to this target image information.
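The progressive expansion of steps 3.3.1-3.3.2 — scanning only the newly added, non-overlapping band each round — can be sketched as follows. The detection callback is a stand-in for the actual lip-motion recognizer, and the angle defaults are illustrative:

```python
def find_speaker(detect_in_band, start_deg=30.0, step_deg=10.0, max_deg=60.0):
    """Widen the recognition sector step by step, scanning only the newly
    added band each round. `detect_in_band(inner, outer)` checks the band
    between the two half-angles and returns a speaker id or None."""
    prev_half = 0.0
    half = start_deg / 2
    while half <= max_deg / 2:
        speaker = detect_in_band(prev_half, half)
        if speaker is not None:
            return speaker
        prev_half, half = half, half + step_deg / 2
    return None                 # upper limit reached without a detection
```

Each call covers a strictly smaller band than a full rescan would, which is the efficiency gain the text describes.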
140. And determining the video stream to be displayed from the multiple paths of video streams according to the identification result of the target object.
The video stream to be displayed refers to a video stream used for displaying the current scene. The target object may be displayed in focus by the video stream to be displayed.
In some embodiments, in order to provide a better current scene display effect, a display policy determined according to a target object recognition result is provided, and step 140 may include steps 4.1 to 4.4 as follows:
4.1, when the target object is identified, determining a region to be displayed according to the target object;
4.2, when the target object is not identified, determining a region to be displayed according to all objects in the target image information;
4.3, acquiring an image acquisition area corresponding to each video stream;
and 4.4, determining the video stream to be displayed according to the region to be displayed and the image acquisition region.
The to-be-displayed area refers to an area to be displayed through the to-be-displayed video stream. When the target object is identified, an area where the target object is located, for example, a sound source area or an identification area, may be used as an area to be displayed, and when the target object is not identified, an area where all objects in the target image information are located is used as an area to be displayed. The area to be displayed can be in the same plane with the image acquisition area, or can be in the same plane with the image corresponding to the target video stream, and when comparing the area to be displayed with different plane areas such as the image acquisition area, the area to be displayed can be projected to the plane of the image acquisition area and the like and then compared.
In some embodiments, when a plurality of target objects are identified, a region to be displayed is determined from the plurality of target objects. At this time, the area to be displayed is an area where the plurality of target objects are located.
And determining a region to be displayed through the identification result of the target object, and comparing the region to be displayed with the image acquisition region to determine a video stream to be displayed. For example, the video stream corresponding to the image acquisition region with the largest repeated region can be used as the video stream to be displayed by determining the region to be displayed and the repeated region of each image acquisition region.
In some embodiments, in order to focus on the speaker and provide a better display of the current scene, step 4.4 may include: determining, for each image acquisition area, the ratio of the size of the to-be-displayed region to the size of that image acquisition area, and taking the video stream corresponding to the image acquisition area with the highest ratio as the video stream to be displayed. In some embodiments, to avoid displaying an incomplete picture of the speaker, the ratio must be less than a preset value, which may be 1.
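The ratio-based selection in step 4.4 can be sketched with axis-aligned rectangles (an illustrative simplification; region shapes are not fixed by the patent, and the stream names are assumptions):

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (x0, y0, x1, y1)."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def choose_display_stream(display_region, capture_regions, max_ratio=1.0):
    """Pick the stream whose capture region gives the highest ratio of
    to-be-displayed area to capture area, subject to ratio < max_ratio."""
    best, best_ratio = None, -1.0
    for stream_id, cap in capture_regions.items():
        ratio = rect_area(display_region) / rect_area(cap)
        if best_ratio < ratio < max_ratio:
            best, best_ratio = stream_id, ratio
    return best
```

A small to-be-displayed region therefore prefers the tighter (telephoto) capture region, which keeps the speaker as large as possible in the frame without being cut off.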
150. And displaying a picture corresponding to the video stream to be displayed.
In some embodiments, by clipping the display picture to focus on the speaker, a better display of the current scene is provided; step 150 may include steps 5.1 to 5.3, as follows:
5.1, obtaining a display picture of a video stream to be displayed;
5.2, cutting out a display picture of the video stream to be displayed according to the area to be displayed to obtain a cut display picture;
And 5.3, displaying the cut display picture.
The cut display picture is a picture corresponding to the area to be displayed in the display picture of the video stream to be displayed.
By clipping the display picture of the video stream to be displayed into a picture corresponding to the region to be displayed, the speaker can be further focused to provide a better current scene display effect.
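For frames held as arrays, the clipping in steps 5.1-5.3 reduces to a bounds-clamped slice. This is a sketch; the H x W x C frame layout and integer pixel region are assumptions:

```python
import numpy as np

def crop_frame(frame, region):
    """Crop an H x W x C frame to region (x0, y0, x1, y1), clamped to the
    frame bounds."""
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = region
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    return frame[y0:y1, x0:x1]
```

The cropped array is then what the terminal actually renders as the focused conference picture.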
The video stream display method provided by the embodiment of the application can be applied to various scenes in which multiple persons participate. For example, taking a multi-person conference as an example, acquiring multiple paths of video streams and sound source positions of a current scene, wherein each video stream corresponds to one image acquisition area, determining a target video stream from the multiple paths of video streams according to the sound source positions and the image acquisition areas, identifying a target object according to the target video stream, wherein the target object is an object with lip actions, determining a video stream to be displayed from the multiple paths of video streams according to the identification result of the target object, and displaying pictures corresponding to the video stream to be displayed. The proposal provided by the embodiment of the application can improve the efficiency of identifying the speaker by determining the target video stream for identifying the speaker through the sound source position, and simultaneously determine the video stream to be displayed according to the identification result, so that the displayed picture can be focused on the speaker and a better conference picture can be presented.
The method described in the above embodiments will be described in further detail below.
In this embodiment, a multi-person conference scenario will be taken as an example, and a method according to an embodiment of the present application will be described in detail.
As shown in fig. 3, a schematic structural diagram of a video stream display system is provided, and the system includes a data acquisition module, a data processing module, and a terminal.
The data acquisition module consists of a thermal infrared imager, an ultrasonic module, a dual-camera module and an array microphone; it acquires information and sends it to the data processing module. The components are as follows:
The data acquisition module comprises two cameras: a wide-angle lens and a telephoto (long-focus) lens. The wide-angle lens has a large field angle and a wide visual range but a blurred distant view; the telephoto lens has a small field angle and a narrow visual range but a clear long-range view. When the sound source falls within the overlapping field angle, the cameras switch to the telephoto lens; when it is outside the telephoto lens's range, the cameras switch to the wide-angle lens. The dual-camera switching method is as follows:
1. The dual-camera module comprises two types of cameras, a wide-angle lens and a telephoto lens. The wide-angle camera has a short focal length and a wide field of view, so the captured picture contains more content and each object occupies a smaller proportion of the picture; the telephoto camera has a long focal length and a narrow field of view, so the captured picture contains less content and each object occupies a larger proportion. Each camera outputs two video streams: one used for actual picture presentation, called the preview stream, and one used for AI lip-movement detection and face recognition, called the AI image stream.
2. The terminal can present only one preview stream from the two cameras at a time, but the AI image streams of both cameras can be provided simultaneously to the image AI thread for lip-movement detection and face recognition.
3. The image AI thread decides, according to the angle information from sound source localization, which of the two AI image streams to use for lip-movement identification and face recognition, then instructs the UVC thread which preview stream to switch to and how to crop it, finally presenting the face-focusing effect.
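The switching rule above can be sketched as a small decision function. The field-of-view bounds and names below are illustrative assumptions, since the patent does not give concrete angles:

```python
# Sketch of the dual-camera preview switching decision. The FOV bounds
# are assumed example values, not taken from the patent.
TELE_FOV = (-15.0, 15.0)   # assumed horizontal FOV of the telephoto lens, degrees
WIDE_FOV = (-60.0, 60.0)   # assumed horizontal FOV of the wide-angle lens, degrees

def select_preview_camera(sound_source_angle: float) -> str:
    """Prefer the telephoto lens whenever the sound source angle falls
    inside its field of view; otherwise fall back to the wide-angle lens."""
    lo, hi = TELE_FOV
    if lo <= sound_source_angle <= hi:
        return "tele"
    return "wide"
```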
The thermal infrared imager is used for measuring the temperature of the target object.
The ultrasonic module works with the thermal infrared imager to detect the distance of the target object. Because the thermal infrared imager is also a camera, its lens has a minimum imaging distance; for example, the measured object must be more than 25 cm from the lens to ensure a clear thermal image. The ultrasonic module therefore detects the target's distance and prompts the target when the distance requirement is not met.
And the array microphone module is used for positioning the sound source and determining the azimuth of the speaker.
The data processing module comprises a UVC thread, a UAC thread, an image AI thread and an audio AI thread, and acquires information acquired by the camera module and performs data processing. As shown in fig. 4, the workflow of the threads in the data processing module is as follows:
The UVC thread is used for collecting the video stream information of the two cameras. Each camera outputs two video streams: one is output to the terminal to present a real-time picture, and the other is provided to the image AI thread for lip-motion analysis and face recognition.
The UAC thread is used for collecting the audio stream information of the array microphone and outputting it in two forms: one directly outputs the PCM-format audio stream data of a single microphone to the terminal for audio playback, and the other combines the PCM-format audio stream data collected by all microphones and passes it to the audio AI thread for sound source localization.
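The combining step can be pictured as interleaving equal-length per-microphone PCM buffers into one multichannel frame. The 16-bit sample width and the function name are assumptions for illustration:

```python
def interleave_pcm(mic_buffers: list[bytes], sample_width: int = 2) -> bytes:
    """Interleave equal-length PCM buffers (one per microphone) into a
    single multichannel frame: ch0 s0, ch1 s0, ..., ch0 s1, ch1 s1, ...
    Assumes 16-bit (2-byte) samples by default."""
    n = len(mic_buffers[0]) // sample_width
    out = bytearray()
    for i in range(n):                       # for each sample index
        for buf in mic_buffers:              # take that sample from every mic
            out += buf[i * sample_width:(i + 1) * sample_width]
    return bytes(out)
```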
The image AI thread is used for analyzing the image information of the two cameras output by the UVC thread and outputting a decision to the UVC thread; the decision includes which camera's video stream to display and how to enlarge and crop that stream's image information so as to focus on the speaker. Specifically, the image AI thread obtains two kinds of information: the video stream information of the two cameras provided by the UVC thread, and the sound source angle information provided by the audio AI thread. After obtaining the sound source angle information, the image AI thread determines the current sound source angle of the speaker, decides according to the field-angle ranges of the two cameras which camera's video stream to analyze for lip motion, determines the recognition region corresponding to the lip motion, and recognizes the face information. Finally, it feeds back to the UVC thread to switch the displayed camera and to enlarge and crop the picture to focus on the speaker.
The audio AI thread is used for analyzing the PCM-format audio stream data of the array microphone provided by the UAC thread, performing sound source localization, and sending the resulting sound source angle information to the image AI thread for decision-making.
The data processing module further comprises a strategy management module, wherein the strategy management module is used for acquiring the data processed by the data processing module and making scene decisions so as to realize speaker tracking, speaker subtitle display and participant sign-in.
The terminal is used for displaying pictures, and the terminal can be a TV (television).
As shown in fig. 5, a specific flow of a video stream display method is as follows:
210. The array microphone collects ambient sound in real time.
220. And determining and outputting sound source angle information by the audio AI thread according to the collected environmental sound through a sound source positioning algorithm.
The method may further include, before the array microphone collects ambient sound, the policy management module controlling the thermal infrared imager and the ultrasonic module to detect the body temperature of the participants. The ultrasonic module starts its distance detection function; when a target participant's distance meets the imaging requirement of the thermal infrared imager, the imager detects the participant's body temperature, and if the temperature exceeds the threshold, the participant cannot join.
The sound source angle refers to the included angle between the sound source position and the array microphone. It may be defined by taking the midpoint of the line segment formed by the array microphones as the vertex: the angle formed by the sound source position, this midpoint, and either endpoint of the segment is the sound source angle.
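Under that definition, the angle can be computed with elementary planar geometry. This is an illustrative reading of the text, not the patent's own algorithm:

```python
import math

def sound_source_angle(source, mic_a, mic_b):
    """Angle in degrees at the midpoint of the microphone segment,
    between the sound source and one endpoint of the array.
    Points are (x, y) tuples in any consistent unit."""
    mid = ((mic_a[0] + mic_b[0]) / 2.0, (mic_a[1] + mic_b[1]) / 2.0)
    v1 = (source[0] - mid[0], source[1] - mid[1])   # midpoint -> source
    v2 = (mic_a[0] - mid[0], mic_a[1] - mid[1])     # midpoint -> array endpoint
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_t = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))
```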
The array microphone collects environmental sound in real time and sends it to the UAC thread; one path is processed by the UAC thread and sent to the terminal for playback, and the other is sent to the audio AI thread for sound source localization.
231. When the audio AI thread does not output sound source angle information, the image AI thread controls the terminal to display the picture shot by the wide-angle camera.
When no sound source angle is output, a listening mode is entered. In this mode the UVC thread outputs the wide-angle camera's picture by default; when the image AI thread detects a face in one of the two cameras, it notifies the UVC thread to switch to that camera's picture, and step 210 is executed to keep collecting environmental sound in real time. If faces are recognized in both AI image streams, the wide-angle camera's picture is output preferentially; if neither AI image stream recognizes a face, the wide-angle camera's picture is likewise output preferentially, without focusing.
232. When the audio AI thread outputs the sound source angle information, the image AI thread determines a sound source region according to the sound source angle information.
When a sound source angle is output, the image AI thread defines a sector spanning ±15° to ±30° around the sound source angle; this sector is the sound source region.
240. And the image AI thread determines a target video stream from the two paths of video streams according to the sound source region.
The image AI thread judges from the sound source region which camera's image to analyze: if the sound source region is completely covered by both cameras, it preferentially processes the image information of the telephoto camera; if the sound source region lies only within the wide-angle camera's view, it processes the image information of the wide-angle camera.
250. The image AI thread identifies a target object according to the target video stream, the target object being an object with lip motion.
The image AI thread performs lip-motion analysis on the image information captured by the camera determined in step 240, so as to identify the target object.
261. When the target object is identified, the image AI thread determines the region to be displayed according to the target object.
If no person with lip motion is identified in the region corresponding to ±15° of the sound source angle, the region corresponding to ±20° is used as the face recognition region, and the region keeps growing by 5° each time until a person with lip motion is identified; the region at that point is taken as the region to be displayed. If there are multiple speakers, the region to be displayed must cover all of them.
Finally, the UVC thread controls the output of the camera's image information and crops it, so that the user sees the final face-focusing effect.
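The incremental widening described above can be sketched as a loop. Here `detect_lip_motion` is a hypothetical callback standing in for the lip-motion detector, and the 30° cap reflects the ±15° to ±30° sector range mentioned earlier:

```python
def find_display_region(sound_angle, detect_lip_motion,
                        half_width=15.0, step=5.0, max_half_width=30.0):
    """Widen the sector around the sound source angle in 5-degree steps
    until at least one person with lip motion is found, then return a
    region covering all detected speakers.

    detect_lip_motion: hypothetical callback, (lo, hi) -> list of
    speaker angles with lip motion inside that sector.
    """
    while half_width <= max_half_width:
        sector = (sound_angle - half_width, sound_angle + half_width)
        speakers = detect_lip_motion(sector)
        if speakers:
            # The region to be displayed must cover every speaker.
            return (min(speakers), max(speakers))
        half_width += step
    return None  # no lip motion found: fall back to all objects in view
```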
262. When the target object is not identified, the image AI thread determines the area to be displayed according to all objects in the target image information.
270. And the image AI thread determines a video stream to be displayed according to the region to be displayed.
280. And cutting the display picture of the video stream to be displayed by the image AI thread according to the area to be displayed to obtain the cut display picture.
290. And the terminal displays the cut display picture.
When no person with lip motion is identified, the sector corresponding to ±15° of the sound source angle is used for face recognition; if no face is identified, the sector grows by 5° each time (±20°, ±25°, and so on) until a face is identified, and the sector at that point is taken as the region to be displayed. If there are multiple people, the region to be displayed must cover all of them.
Finally, the UVC thread controls the output of the camera's image information and crops it, so that the user sees the final face-focusing effect.
When no face is recognized, a listening mode is entered, and step 210 is performed to collect environmental sounds in real time.
As can be seen from the above, in the embodiment of the present application, by acquiring the sound source angle and performing dual-camera switching, focusing on the speaker is achieved, so that the displayed picture focuses on the speaker and a better conference picture is presented.
In order to better implement the method, the embodiment of the application also provides a video stream display device which can be integrated in electronic equipment, wherein the electronic equipment can be a terminal, a server and the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices, and the server can be a single server or a server cluster consisting of a plurality of servers.
For example, in this embodiment, a method according to an embodiment of the present application will be described in detail by taking a specific integration of a video stream display device in a terminal as an example.
For example, as shown in fig. 6, the video stream display apparatus may include an acquisition unit 310, a first determination unit 320, an identification unit 330, a second determination unit 340, and a display unit 350, as follows:
(One) acquisition unit 310
This unit is used for acquiring multiple video streams and the sound source position of the current scene; each video stream corresponds to one image acquisition area.
In some embodiments, the method for obtaining the sound source position may include steps 6.1 to 6.2, as follows:
6.1, collecting sound information of the current scene;
and 6.2, processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.
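The patent does not name a specific localization algorithm; a common minimal approach for one microphone pair is a far-field time-difference-of-arrival (TDOA) estimate, sketched here as a generic stand-in:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def tdoa_angle(delay_s: float, mic_spacing_m: float) -> float:
    """Far-field direction of arrival for one mic pair, in degrees from
    broadside. delay_s is the inter-microphone delay, e.g. taken from a
    cross-correlation peak. Generic TDOA sketch, not the patent's method."""
    s = SPEED_OF_SOUND * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp small numeric overshoot
    return math.degrees(math.asin(s))
```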
(Two) first determination unit 320
This unit is used for determining a target video stream from the multiple video streams according to the sound source position and the image acquisition area.
In some embodiments, the first determining unit 320 may be specifically configured to perform steps 7.1 to 7.4, as follows:
7.1, determining a sound source area according to the sound source position;
7.2, determining an overlapping area of a sound source area and an image acquisition area for each video stream;
7.3, determining an overlapping area meeting the preset first area size as a target area;
and 7.4, determining the video stream corresponding to the target area as a target video stream.
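Steps 7.1 to 7.4 can be sketched with angle intervals standing in for regions; the interval bounds and the preference order (telephoto first) are illustrative assumptions:

```python
def choose_target_stream(sound_region, streams, min_overlap):
    """Pick the first stream whose image acquisition area overlaps the
    sound source region by at least min_overlap (the preset first
    region size). Regions are (lo, hi) angle intervals in degrees;
    dict order encodes the preference (e.g. telephoto before wide)."""
    for name, capture in streams.items():
        lo = max(sound_region[0], capture[0])
        hi = min(sound_region[1], capture[1])
        if hi - lo >= min_overlap:   # overlap large enough -> target area
            return name
    return None
```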
(Three) identification unit 330
This unit is used for identifying a target object according to the target video stream, where the target object is an object with lip motion.
In some embodiments, the identification unit 330 may be specifically configured to perform steps 8.1 to 8.3, as follows:
8.1, determining an identification area according to the sound source position;
8.2, acquiring target image information from the target video stream according to the identification area, wherein the target image information is the image information corresponding to the identification area;
and 8.3, identifying the target object according to the target image information.
In some embodiments, step 8.3 may include steps 8.3.1-8.3.2, as follows:
8.3.1, when the object with the lip action is identified from the target image information, taking the object with the lip action as a target object;
8.3.2, when the object with the lip action is not identified from the target image information, expanding the identification area to a preset second area size so as to identify the target object.
(Four) second determination unit 340
This unit is used for determining a video stream to be displayed from the multiple video streams according to the identification result of the target object.
In some embodiments, the second determining unit 340 may be specifically configured to perform steps 9.1 to 9.4, as follows:
9.1, when the target object is identified, determining a region to be displayed according to the target object;
9.2, when the target object is not identified, determining a region to be displayed according to all objects in the target image information;
9.3, acquiring an image acquisition area corresponding to each video stream;
And 9.4, determining the video stream to be displayed according to the region to be displayed and the image acquisition region.
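Steps 9.3 and 9.4 correspond to the "largest repeated region" rule stated in the claims. A sketch with angle intervals as regions (all names illustrative):

```python
def choose_display_stream(display_region, capture_areas):
    """Return the stream whose image acquisition area has the largest
    overlap with the region to be displayed. Regions are (lo, hi)
    angle intervals; ties go to the first-listed stream."""
    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return max(capture_areas,
               key=lambda name: overlap(display_region, capture_areas[name]))
```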
(Five) display unit 350
For displaying a picture corresponding to the video stream to be displayed.
In some embodiments, the display unit 350 may be specifically configured to perform steps 10.1 to 10.3, as follows:
10.1, acquiring a display picture of a video stream to be displayed;
10.2, cutting out a display picture of the video stream to be displayed according to the area to be displayed to obtain a cut-out display picture;
and 10.3, displaying the cut display picture.
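Step 10.2's cropping can be sketched on a frame represented as rows of pixels; the (x, y, width, height) region format is an assumption:

```python
def crop_frame(frame, region):
    """Cut the display picture down to the region to be displayed.
    frame: list of pixel rows (row-major); region: (x, y, width, height)."""
    x, y, w, h = region
    return [row[x:x + w] for row in frame[y:y + h]]
```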
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
Therefore, the embodiment of the application can determine the target video stream for identifying the speaker through the sound source position, can improve the efficiency of identifying the speaker, and can simultaneously determine the video stream to be displayed according to the identification result, so that the displayed picture can be focused on the speaker and a better conference picture can be presented.
Correspondingly, the embodiment of the application also provides computer equipment, which can be a terminal or a server, where the terminal can be a terminal device such as a smartphone, a tablet personal computer, a notebook computer, a touch screen, a game console, a personal computer, a personal digital assistant (PDA), and the like.
As shown in fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device 400 includes a processor 410 with one or more processing cores, a memory 420 with one or more computer readable storage media, and a computer program stored in the memory 420 and executable on the processor. The processor 410 is electrically connected to the memory 420. It will be appreciated by those skilled in the art that the computer device structure shown in the figures is not limiting of the computer device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
Processor 410 is a control center of computer device 400, connects various portions of the entire computer device 400 using various interfaces and lines, and performs various functions of computer device 400 and processes data by running or loading software programs and/or modules stored in memory 420, and invoking data stored in memory 420, thereby performing overall monitoring of computer device 400.
In an embodiment of the present application, the processor 410 in the computer device 400 loads the instructions corresponding to the processes of one or more application programs into the memory 420 according to the following steps, and the processor 410 executes the application programs stored in the memory 420, so as to implement various functions:
The method comprises the steps of obtaining multiple paths of video streams of a current scene and a sound source position, determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition region, identifying a target object according to the target video stream, wherein the target object is an object with lip action, determining a video stream to be displayed from the multiple paths of video streams according to an identification result of the target object, and displaying a picture corresponding to the video stream to be displayed.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Optionally, as shown in FIG. 7, the computer device 400 further includes a touch display 430, a radio frequency circuit 440, an audio circuit 450, an input unit 460, and a power supply 470. The processor 410 is electrically connected to the touch display 430, the rf circuit 440, the audio circuit 450, the input unit 460 and the power supply 470, respectively. Those skilled in the art will appreciate that the computer device structure shown in FIG. 7 is not limiting of the computer device and may include more or fewer components than shown, or may be combined with certain components, or a different arrangement of components.
The touch display 430 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 430 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to a user as well as various graphical user interfaces of the computer device, which may be composed of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations by the user on or near it (such as operations performed with a finger, stylus, or any other suitable object or accessory) and generate the corresponding operation instructions, which in turn execute the corresponding programs. Alternatively, the touch panel may include two parts: a touch detection device and a touch controller. The touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 410, and can receive and execute commands sent by the processor 410. The touch panel may overlay the display panel; upon detecting a touch operation on or near it, the touch panel passes the operation to the processor 410 to determine the type of touch event, and the processor 410 then provides a corresponding visual output on the display panel based on that type. In an embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 430 to implement input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to implement the input and output functions; in that case, the touch display 430 may also implement an input function as part of the input unit 460.
The radio frequency circuit 440 may be used to transceive radio frequency signals to establish wireless communication with a network device or other computer device via wireless communication.
Audio circuitry 450 may be used to provide an audio interface between a user and the computer device through a speaker and microphone. The audio circuit 450 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 450 receives and converts into audio data. The audio data is then processed by the processor 410 and, for example, transmitted to another computer device via the radio frequency circuit 440, or output to the memory 420 for further processing. The audio circuit 450 may also include an earbud jack to provide communication between peripheral earbuds and the computer device.
The input unit 460 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 470 is used to power the various components of the computer device 400. Alternatively, the power supply 470 may be logically connected to the processor 410 through a power management system, so as to perform functions of managing charging, discharging, and power consumption management through the power management system. The power supply 470 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 7, the computer device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the computer device provided in this embodiment may determine, through the sound source position, the target video stream for identifying the speaker, so as to improve the efficiency of identifying the speaker, and determine, according to the identification result, the video stream to be displayed, so that the displayed picture is focused on the speaker, and a better conference picture is presented.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer readable storage medium having stored therein a plurality of computer programs that can be loaded by a processor to perform the steps in any of the video stream display methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
The method comprises the steps of obtaining multiple paths of video streams of a current scene and a sound source position, determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition region, identifying a target object according to the target video stream, wherein the target object is an object with lip action, determining a video stream to be displayed from the multiple paths of video streams according to an identification result of the target object, and displaying a picture corresponding to the video stream to be displayed.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The steps in any video stream display method provided by the embodiment of the present application can be executed by the computer program stored in the storage medium, so that the beneficial effects that any video stream display method provided by the embodiment of the present application can be achieved, and detailed descriptions of the previous embodiments are omitted.
The foregoing describes in detail a video stream display method, apparatus, storage medium and computer device provided in the embodiments of the present application, and specific examples are set forth herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is only for aiding in understanding of the method and core concept of the present application, and meanwhile, to those skilled in the art, according to the concept of the present application, there are variations in the specific embodiments and application ranges, so that the disclosure should not be interpreted as limiting the application.

Claims (9)

1. A video stream display method, comprising:
Acquiring multiple paths of video streams and sound source positions of a current scene, wherein each video stream corresponds to one image acquisition area;
Determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area;
identifying a target object according to the target video stream, wherein the target object is an object with lip actions;
determining a video stream to be displayed from the multiple paths of video streams according to the identification result of the target object;
And determining a video stream to be displayed from the multiple paths of video streams according to the identification result of the target object, wherein the method comprises the following steps:
When the target object is identified, determining a region to be displayed according to the target object;
when the target object is not identified, determining an area to be displayed according to all objects in target image information, wherein the target image information is the image information corresponding to the identification area determined according to the sound source position;
Acquiring an image acquisition area corresponding to each video stream;
determining a video stream to be displayed according to the region to be displayed and the image acquisition region;
The determining the video stream to be displayed according to the region to be displayed and the image acquisition region includes:
determining a region to be displayed and a repeated region of each image acquisition region, and taking a video stream corresponding to the image acquisition region with the largest repeated region as a video stream to be displayed;
and displaying the picture corresponding to the video stream to be displayed.
2. The video stream display method according to claim 1, wherein the determining a target video stream from the multiple video streams according to the sound source position and the image acquisition region comprises:
determining a sound source area according to the sound source position;
determining, for each of the video streams, an overlapping region of the sound source region and the image acquisition region;
Determining the overlapping area meeting the preset first area size as a target area;
And determining the video stream corresponding to the target area as a target video stream.
3. The video stream display method according to claim 1, wherein the identifying a target object from the target video stream comprises:
Determining an identification area according to the sound source position;
Acquiring target image information from the target video stream according to the identification area, wherein the target image information is image information corresponding to the identification area;
and identifying a target object according to the target image information.
4. The video stream display method according to claim 3, wherein the identifying a target object based on the target image information comprises:
when the object with the lip action is identified from the target image information, the object with the lip action is taken as a target object;
and when the object with the lip action is not identified from the target image information, expanding the identification area to a preset second area size so as to identify the target object.
5. The video stream display method according to claim 1, wherein the displaying the picture corresponding to the video stream to be displayed includes:
Acquiring a display picture of the video stream to be displayed;
cutting out the display picture of the video stream to be displayed according to the area to be displayed to obtain a cut-out display picture;
and displaying the cut display picture.
6. The video stream display method according to claim 1, wherein the sound source position acquisition method includes:
Collecting sound information of a current scene;
and processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.
7. A video stream display apparatus, comprising:
an acquisition unit configured to acquire multiple video streams and a sound source position of a current scene, each video stream corresponding to one image acquisition area;
a first determining unit configured to determine a target video stream from the multiple video streams according to the sound source position and the image acquisition areas;
an identification unit configured to identify a target object from the target video stream, the target object being an object with lip action;
a second determining unit configured to determine a video stream to be displayed from the multiple video streams according to a recognition result of the target object;
wherein determining the video stream to be displayed from the multiple video streams according to the recognition result of the target object comprises:
when the target object is identified, determining a region to be displayed according to the target object;
when the target object is not identified, determining the region to be displayed according to all objects in target image information, the target image information being the image information corresponding to a recognition area determined according to the sound source position;
acquiring the image acquisition area corresponding to each video stream; and
determining the video stream to be displayed according to the region to be displayed and the image acquisition areas;
wherein determining the video stream to be displayed according to the region to be displayed and the image acquisition areas comprises:
determining an overlap region between the region to be displayed and each image acquisition area, and taking the video stream corresponding to the image acquisition area with the largest overlap region as the video stream to be displayed; and
a display unit configured to display a picture corresponding to the video stream to be displayed.
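The stream-selection rule in claim 7 — pick the video stream whose image acquisition area overlaps the region to be displayed the most — can be sketched with axis-aligned rectangles. This is a minimal illustration under the assumption that all regions are expressed as pixel rectangles in a shared coordinate system; the names (`cam_a`, `pick_stream`, etc.) are invented for the example.

```python
from typing import Dict, Tuple

Rect = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in a shared pixel frame

def overlap_area(a: Rect, b: Rect) -> int:
    """Area of the intersection of rectangles a and b (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def pick_stream(display_region: Rect, capture_regions: Dict[str, Rect]) -> str:
    """Return the stream whose acquisition area overlaps the
    region to be displayed the most (claim 7's selection rule)."""
    return max(capture_regions,
               key=lambda s: overlap_area(display_region, capture_regions[s]))

# Two cameras with slightly overlapping fields of view.
capture = {
    "cam_a": (0, 0, 640, 480),
    "cam_b": (600, 0, 1240, 480),
}
to_display = (500, 100, 900, 400)   # region around the detected speaker
best = pick_stream(to_display, capture)   # cam_b covers more of the region
```

Ties (equal overlap) are broken arbitrarily here; a real implementation would need a deterministic tie-break, which the claim does not specify.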
8. A computer device, comprising a processor and a memory, the memory storing a plurality of instructions, and the processor loading the instructions from the memory to perform the steps in the video stream display method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the video stream display method according to any one of claims 1 to 6.
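The claims define the target object as "an object with lip action" but leave the detection method open. A minimal heuristic sketch, assuming a face-landmark detector already supplies a per-frame mouth-opening distance: treat a person as speaking when that distance fluctuates noticeably over recent frames. This is not the patent's method; the threshold and names are illustrative assumptions.

```python
import statistics

def has_lip_action(mouth_openings, threshold=2.0):
    """Heuristic lip-action test: the mouth-opening distance
    (upper-to-lower-lip landmark gap, in pixels) varies noticeably
    across recent frames when the person is talking."""
    if len(mouth_openings) < 2:
        return False
    return statistics.pstdev(mouth_openings) > threshold

speaking = has_lip_action([4.0, 12.0, 5.0, 14.0, 6.0])  # large fluctuation
silent = has_lip_action([5.0, 5.2, 5.1, 4.9, 5.0])      # mouth nearly still
```

In practice a trained lip-motion classifier would replace this variance test, but the control flow of claim 7 (target found vs. not found) is the same either way.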
CN202111583153.XA 2021-12-22 2021-12-22 Video stream display method, device, computer equipment and storage medium Active CN114422743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111583153.XA CN114422743B (en) 2021-12-22 2021-12-22 Video stream display method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111583153.XA CN114422743B (en) 2021-12-22 2021-12-22 Video stream display method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114422743A CN114422743A (en) 2022-04-29
CN114422743B true CN114422743B (en) 2025-05-06

Family

ID=81267552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111583153.XA Active CN114422743B (en) 2021-12-22 2021-12-22 Video stream display method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114422743B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926765A (en) * 2022-05-18 2022-08-19 上海庄生晓梦信息科技有限公司 Image processing method and device
CN115695936A (en) * 2022-09-27 2023-02-03 深圳市迪威泰实业有限公司 On-site video playing method and device, computer equipment and storage medium
CN119484754A (en) * 2023-08-11 2025-02-18 北京小米移动软件有限公司 Video image processing method, device, electronic device and storage medium
CN117591058B (en) * 2024-01-18 2024-05-28 浙江华创视讯科技有限公司 Display method, device and storage medium for multi-person speech

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111551921A (en) * 2020-05-19 2020-08-18 北京中电慧声科技有限公司 Sound source orientation system and method based on sound image linkage

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008113164A (en) * 2006-10-30 2008-05-15 Yamaha Corp Communication apparatus
US8289363B2 (en) * 2006-12-28 2012-10-16 Mark Buckler Video conferencing
CN103297743A (en) * 2012-03-05 2013-09-11 联想(北京)有限公司 Video conference display window adjusting method and video conference service equipment
US9693017B2 (en) * 2014-08-20 2017-06-27 Cisco Technology, Inc. Automatic switching between different cameras at a video conference endpoint based on audio
CN106231234B (en) * 2016-08-05 2019-07-05 广州小百合信息技术有限公司 The image pickup method and system of video conference
CN110545396A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Voice recognition method and device based on positioning and denoising
CN110691196A (en) * 2019-10-30 2020-01-14 歌尔股份有限公司 Sound source positioning method of audio equipment and audio equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111551921A (en) * 2020-05-19 2020-08-18 北京中电慧声科技有限公司 Sound source orientation system and method based on sound image linkage

Also Published As

Publication number Publication date
CN114422743A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN114422743B (en) Video stream display method, device, computer equipment and storage medium
US11412108B1 (en) Object recognition techniques
JP7154678B2 (en) Target position acquisition method, device, computer equipment and computer program
CN110097576B (en) Motion information determination method of image feature point, task execution method and equipment
CN113014983B (en) Video playing method and device, storage medium and electronic equipment
JP5456832B2 (en) Apparatus and method for determining relevance of an input utterance
US10241990B2 (en) Gesture based annotations
KR102565977B1 (en) Method for detecting region of interest based on line of sight and electronic device thereof
CN113676592B (en) Recording method, device, electronic device and computer readable medium
CN111696570A (en) Voice signal processing method, device, equipment and storage medium
CN113596240B (en) Recording method, apparatus, electronic device and computer readable medium
CN113573122B (en) Audio and video playing method and device
CN113676593B (en) Video recording method, video recording device, electronic equipment and storage medium
CN112749590A (en) Object detection method, device, computer equipment and computer readable storage medium
CN111370025A (en) Audio recognition method and device and computer storage medium
US20230254639A1 (en) Sound Pickup Method and Apparatus
CN115035187B (en) Sound source direction determination method, device, terminal, storage medium and product
WO2023066373A1 (en) Sample image determination method and apparatus, device, and storage medium
US20240422426A1 (en) Media apparatus and control method and device thereof, and target tracking method and device
US20240348912A1 (en) Video processing method and apparatus, computer device, and storage medium
CN112637495B (en) Shooting method, shooting device, electronic equipment and readable storage medium
CN113329138A (en) Video shooting method, video playing method and electronic equipment
CN116386639A (en) Voice interaction method, related device, equipment, system and storage medium
CN116489451A (en) Method for determining mirror information, method and device for displaying scene picture
CN111982293B (en) Body temperature measuring method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant