US20130321566A1

US20130321566A1 - Audio source positioning using a camera

Info

Publication number: US20130321566A1
Application number: US13/599,678
Authority: US
Inventors: Guillaume Simonnet
Original assignee: Microsoft Corp
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2012-05-31
Filing date: 2012-08-30
Publication date: 2013-12-05
Also published as: US20130321418A1; US8917270B2; US20130321589A1; US20130321396A1; US9846960B2; US20130321586A1; US20130321590A1; US20130321410A1; US9256980B2; US20130321575A1; US20130321593A1; US20130321413A1; US9251623B2

Abstract

Audio source positioning technique embodiments are presented that are employed in a video teleconference or telepresence session between a local site and one or more remote sites. Each of these sites has one participant, and a virtual scene is constructed and displayed at each site that depicts each of the participants from the other sites in the constructed scene. However, rather than simply playing audio captured at the other site or sites in the viewing participant's site, audio source positioning is used to make it seem to a participant viewing a rendering of the virtual scene that the voice of another participant is emanating from a location on the display device where the remote participant is depicted.

Description

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of and priority to provisional U.S. patent application Ser. No. 61/653,983 filed May 31, 2012.

BACKGROUND

A spatial audio teleconference between two or more geographically distant sites is typically achieved by processing audio signals captured with microphones at one site to produce spatial audio data. This spatial audio data is then transmitted to the other sites and processed at each of these sites to generate a plurality of output audio signals that are played through multiple audio speakers in a manner that spatializes the sound from a sending site to a distinct location in the receiving site. This process is repeated at all the sites resulting in the voices of participants at other sites seeming to a participant at the receiving site as if they are emanating from different locations in the receiving site. This spatializing of the voices of other participants in the receiving site is typically accomplished using only the spatial audio data received from the other sites.

SUMMARY

This Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Audio source positioning technique embodiments described herein are generally employed in a video teleconference or telepresence session between a local site and one or more remote sites. In one embodiment, each of these sites has one participant, and a virtual scene is constructed and displayed at each site that depicts each of the participants from the other sites. However, rather than simply playing audio captured at the other site or sites in the viewing participant's site, audio source positioning technique embodiments described herein are used to make it seem to a participant viewing a rendering of the virtual scene that the voice of each depicted participant is emanating from a location on the display device where that participant is depicted.
In general, this audio source positioning is accomplished at a site (referred to as the local site for convenience) by transmitting data to the other site or sites (referred to as remote sites for convenience), which is then used at those sites to construct the aforementioned virtual scene with spatialized audio. In addition, similar data is received from the other site or sites to construct a virtual scene with spatialized audio at the local site.
More particularly, in one general embodiment, streams of sensor data generated from an arrangement of sensors that capture participant data are input into a computing device or devices resident at the local site. This arrangement of sensors includes a plurality of video and audio devices. Each video capture device captures the participant from a different geometric perspective, and each audio capture device captures the voice of the participant. Scene proxies, which geometrically describe the local site including the participant on a frame by frame basis, are generated from the streams of sensor data. In addition, the streams of video sensor data and a face tracking technique are employed to identify a 3D point representing the location of the participant in the local site for each frame of the scene proxies. The scene proxies representing each frame are transmitted in the order generated over a data communication network to each remote site, along with two additional items: audio data representing the local site participant's voice captured, if any, during the time period between the frame currently being transmitted and the next frame of scene proxies to be transmitted, and the 3D point coordinates representing the location of the participant in the local site for the frame currently being transmitted.
Meanwhile, the local site's computing device or devices receive scene proxies representing successive scene proxy frames from each remote site. In addition, audio data representing the remote site participant's voice captured, if any, during the time period between the currently received frame and the next frame of scene proxies to be received from the remote site, and a 3D point representing the location of the participant in the remote site, are received from each remote site that is facilitating audio source positioning at the local site. For each frame of scene proxies received from a remote site if there is only one remote site sending frames, or for each group of frames of scene proxies contemporaneously received from remote sites if there are multiple remote sites sending frames, a frame of a virtual scene is rendered from the last-received frame or frames of scene proxies that includes a depiction of each of the remote site participants. The rendered frame is then displayed to the local site participant via a display device. In addition, for each remote site participant depicted in the last-rendered frame of the virtual scene that is resident at a remote site that sent the aforementioned audio data representing the remote site participant's voice and the 3D point representing the location of the participant in the remote site, a spatial audio technique is employed to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 is a flow diagram illustrating an exemplary embodiment, in simplified form, of a process for a local site to facilitate audio source positioning at a remote site.

FIG. 2 is a flow diagram illustrating an exemplary embodiment, in simplified form, of a process for audio source positioning in a local site.

FIG. 3 is a flow diagram illustrating an exemplary embodiment, in simplified form, of an implementation of the part of the process of FIG. 2 involving the rendering and displaying of the frames of a virtual scene.

FIG. 4 is a flow diagram illustrating an exemplary embodiment, in simplified form, of a process for audio source positioning in a local site that adds simulated reverberation.

FIG. 5 is a diagram illustrating an exemplary video conferencing or telepresence application that supports the generation, storage, distribution, and presentation of a virtual scene in which audio source positioning technique embodiments described herein can be implemented.

FIG. 6 is a diagram depicting a general purpose computing device constituting an exemplary system for implementing audio source positioning technique embodiments described herein.

DETAILED DESCRIPTION

In the following description of audio source positioning technique embodiments reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.
It is also noted that for the sake of clarity specific terminology will be resorted to in describing the audio source positioning technique embodiments described herein and it is not intended for these embodiments to be limited to the specific terms so chosen. Furthermore, it is to be understood that each specific term includes all its technical equivalents that operate in a broadly similar manner to achieve a similar purpose. Reference herein to “one embodiment”, or “another embodiment”, or an “exemplary embodiment”, or an “alternate embodiment”, or “one implementation”, or “another implementation”, or an “exemplary implementation”, or an “alternate implementation” means that a particular feature, a particular structure, or particular characteristics described in connection with the embodiment or implementation can be included in at least one embodiment of the audio source positioning technique. The appearances of the phrases “in one embodiment”, “in another embodiment”, “in an exemplary embodiment”, “in an alternate embodiment”, “in one implementation”, “in another implementation”, “in an exemplary implementation”, “in an alternate implementation” in various places in the specification are not necessarily all referring to the same embodiment or implementation, nor are separate or alternative embodiments/implementations mutually exclusive of other embodiments/implementations. Yet furthermore, the order of process flow representing one or more embodiments or implementations of the audio source positioning technique does not inherently indicate any particular order nor imply any limitations of the audio source positioning technique.
The term “sensor” is used herein to refer to any one of a variety of scene-sensing devices which can be used to generate a stream of sensor data that represents a given scene. Generally speaking and as will be described in more detail hereafter, the audio source positioning technique embodiments described herein employ one or more sensors which can be configured in various arrangements to capture a scene, thus allowing one or more streams of sensor data to be generated each of which represents the scene from a different geometric perspective. Each of the sensors can be any type of video capture device (e.g., any type of video camera), or any type of audio capture device, or any combination thereof. Each of the sensors can also be either static (i.e., the sensor has a fixed spatial location and a fixed rotational orientation which do not change over time), or moving (i.e., the spatial location and/or rotational orientation of the sensor change over time). The audio source positioning technique embodiments described herein can employ a combination of different types of sensors to capture a given scene.

1.0 Audio Source Positioning

Audio source positioning technique embodiments described herein are generally employed in a video teleconference or telepresence session between a local site and one or more remote sites. In one embodiment, each of these sites has one participant, and a virtual scene is constructed and displayed at each site that depicts each of the participants from the other sites in the constructed scene. Thus, it appears to a participant who is viewing the virtual scene that he or she is in a space with the participant or participants from the other site or sites. The construction of such a virtual scene is accomplished using conventional methods, with an exception. Rather than simply playing audio captured at the other site(s) in the viewing participant's site, audio source positioning technique embodiments described herein are used to co-locate the voice of each of the other participant(s) with the depiction of that person on a display. In other words, audio source positioning technique embodiments described herein make it seem to a participant viewing a rendering of the virtual scene that the voice of another participant is emanating from a location on the display device where the remote participant is depicted. This audio illusion enhances the video teleconference or telepresence session experience and makes it seem more like the viewing participant is actually present with the other participant(s) in the virtual scene.
It is noted that for convenience, the participant who is viewing the rendered virtual scene will be referred to as a local or first participant, and the site that this participant is viewing from will be referred to as the local or first site. Each of the other participants involved will be referred to as a remote or other participant, and the site associated with a remote participant will be referred to as a remote or other site. Given this, it will be evident that any of the sites participating in a video teleconference or telepresence session can be considered the local site with the others being the remote sites.
Referring to FIG. 1, one general embodiment of the audio source positioning technique involves, from the viewpoint of the local site, using a computing device to perform the following process actions. Streams of sensor data generated from an arrangement of sensors that capture participant data are input (block 100). This arrangement includes a plurality of video and audio devices which generate a plurality of streams of sensor data. Each video device captures the site participant from a different geometric perspective, and each audio device captures the voice of the participant at the site. Scene proxies are then generated from the streams of sensor data (block 102). In general, a scene proxy geometrically describes the local site including the participant on a frame by frame basis. A frame of scene proxies refers to the geometric and texture data needed to render a frame of the aforementioned virtual scene. Examples of a scene proxy include a stream of depth map images of the captured scene. A scene proxy can also include a stream of calibrated point cloud reconstructions of the captured scene. A scene proxy can further include one or more types of high order geometric models such as planes, billboards, and existing (i.e., previously created) generic object models (e.g., human body models) which can be either modified, or animated, or both. A scene proxy can also include other high fidelity proxies such as a stream of mesh models of the captured scene, and the like. Further, more than one type of scene proxy can be employed in a frame of scene proxies.
In addition to generating scene proxies, the streams of sensor data are used, along with a face tracking technique, to identify a 3D point representing the location of the participant in the local site for each frame of the scene proxies (block 104). In one embodiment, this 3D point representing the location of the participant in the local site is a 3D point representing the location of the participant's head in the local site. In another embodiment, the 3D point representing the location of the participant in the local site is a 3D point representing the location of the participant's mouth in the local site.
The scene proxies representing each frame are transmitted in the order generated over a data communication network to the remote site or sites, along with audio data representing the local site participant's voice captured, if any, during the time period between the frame currently being transmitted and the next frame of scene proxies to be transmitted, and the 3D point coordinates representing the location of the participant in the local site identified for the frame of scene proxies currently being transmitted (block 106). It is noted that the “if any” caveat refers to the fact that the local participant may not speak during the frame time period alluded to above.
The foregoing process actions provide the data used at a remote site to perform audio source positioning. Thus, the foregoing actions can be said to facilitate audio source positioning at a remote site. If audio source positioning is to be implemented at the local site as well, then the same type of data is provided from a remote site or sites. Referring to FIG. 2, the process actions generally employed for audio source positioning in a video teleconference or telepresence session at the local site will now be described. First, scene proxies representing successive scene proxy frames are received from each remote site participating in the conference over a data communication network (block 200). In addition, other data is received from at least one remote site. It is noted that while it will be assumed in the following description that each remote site sends this other data (which is used to implement the audio source positioning), that may not be the case. If a remote site does not send the other data, then any audio received from that site is played in the normal manner in the local site (in one embodiment, this may be using monophonic playback). For each remote site sending the other data, this data includes audio data representing the remote site participant's voice captured, if any, during the time period between the currently received frame and the next frame of scene proxies to be received from the remote site, and a 3D point representing the location of the participant in the remote site (block 202).
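By way of illustration only, the per-frame transmission of block 106 can be sketched as a simple data structure. The names FramePacket and build_packets, the use of opaque byte strings for the scene proxies and audio, and the dictionary keyed by frame index are all assumptions made for this sketch and are not prescribed by the technique.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class FramePacket:
    """Hypothetical container for what is sent with one frame of scene proxies."""
    frame_index: int
    scene_proxies: bytes           # opaque geometry and texture payload for this frame
    participant_point: Point3D     # head or mouth location in the sending site
    audio: Optional[bytes] = None  # None when the participant did not speak ("if any")


def build_packets(
    proxy_frames: Sequence[bytes],
    points: Sequence[Point3D],
    audio_chunks: Dict[int, bytes],
) -> List[FramePacket]:
    """Pair each frame with its 3D point and the audio captured before the next frame.

    Packets are returned in the order the frames were generated.
    """
    packets = []
    for index, (proxy, point) in enumerate(zip(proxy_frames, points)):
        # A missing key means no voice was captured during this frame interval.
        packets.append(FramePacket(index, proxy, point, audio_chunks.get(index)))
    return packets
```

In this sketch a frame with no captured voice simply carries no audio payload, which mirrors the “if any” caveat.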
For each frame of scene proxies received from a remote site if there is only one remote site, or for each group of frames of scene proxies contemporaneously received from remote sites if there are multiple remote sites, a frame of a virtual scene is rendered (block 204). As indicated previously, the virtual scene frame includes a depiction of each of the remote site participants from the last-received frame or frames of scene proxies. The rendered virtual frame is then displayed to the local site participant via a display device (block 206). It is noted that the term contemporaneously used above is not to be taken literally. For example, in one implementation, the frames of scene proxies coming from multiple remote sites are considered contemporaneous if they arrive before the next frame from any of the sites.
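One way to read the contemporaneity rule of the foregoing implementation is sketched below. The function name group_contemporaneous and the representation of arrivals as (site identifier, frame) pairs are assumptions for illustration; the sketch closes a group as soon as a site that already contributed to the group delivers its next frame.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def group_contemporaneous(
    arrivals: Iterable[Tuple[Hashable, object]],
) -> List[Dict[Hashable, object]]:
    """Group frames from several remote sites by arrival order.

    Frames belong to the same group if they arrive before the next frame
    from any site already represented in that group.
    """
    groups: List[Dict[Hashable, object]] = []
    current: Dict[Hashable, object] = {}
    for site, frame in arrivals:
        if site in current:
            # This site has moved on to its next frame, so the group is complete.
            groups.append(current)
            current = {}
        current[site] = frame
    if current:
        groups.append(current)
    return groups
```

A group that lacks a frame from some site would, under this reading, be rendered using the last-received frame from that site.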
In addition, for each remote site participant depicted in the last-rendered frame of the virtual scene that is resident at a remote site that sent audio data representing the remote site participant's voice and the 3D point representing the location of the participant in the remote site, a spatial audio technique is employed to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted (block 208). This is accomplished using conventional methods given the received audio data and 3D point representing the location of the participant in the remote site.
As mentioned previously, the 3D point representing the location of the participant in the remote site can be the person's head or mouth. More particularly, in one embodiment, the 3D point representing the location of a participant in a remote site is a 3D point representing the location of the participant's mouth in the remote site when the mouth of that participant is visible in the last-rendered frame of the virtual scene. In another embodiment, the 3D point representing the location of a participant in a remote site is a 3D point representing the location of the participant's head in the remote site when the mouth of that participant is not visible by the sensor used to determine that 3D point.

1.1 Rendering, Display and Spatial Audio

With regard to the foregoing action of rendering the frames of the virtual scene, it is noted that as part of this process, for each remote site, a first transform is computed that converts 3D locations in the remote site to points in the frame of the virtual scene. In addition, the action of displaying a rendered frame to the local site participant involves the use of a second transform that converts points in a frame of the virtual scene to screen coordinates on the local site's display device. These transforms are used in the aforementioned spatial audio technique. More particularly, referring to FIG. 3, in one embodiment, the first transform is used to convert the 3D point representing the location of the remote participant in the remote site to a point in the last-rendered frame of the virtual scene (block 300), and the second transform is employed to convert the point in the last-rendered frame of the virtual scene representing the remote participant location to screen coordinates on the local site's display device (block 302). A third transform is also computed that converts screen coordinates in the display device to 3D points in the local site (block 304). It is noted that this transform need only be computed once, unless the display device is moved, at which point it would be re-computed. The third transform is used to compute the 3D point in the local site of the screen coordinates representing the location of the remote participant depicted on the display device (block 306). The spatial audio technique and a plurality of audio speakers resident in the local site are then used to make it seem to the local site participant that the voice of each remote site participant is emanating from the computed 3D point in the local site of the screen coordinates representing the location of that remote participant depicted on the display device (block 308).
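The chain of three transforms in blocks 300 through 306 can be sketched as follows. The specific forms chosen here are assumptions for illustration: a 4x4 homogeneous matrix for the first transform, a pinhole projection for the second, and a planar display described by an origin, two unit axes and a pixel pitch for the third. The constant-power pan between two speakers is only a stand-in for whatever conventional spatial audio technique is used in block 308.

```python
import math


def apply_homogeneous(matrix, point):
    """First transform: apply a 4x4 row-major matrix to a 3D point in the remote site."""
    vec = (point[0], point[1], point[2], 1.0)
    out = [sum(row[i] * vec[i] for i in range(4)) for row in matrix]
    return (out[0] / out[3], out[1] / out[3], out[2] / out[3])


def scene_to_screen(point, focal_px, width_px, height_px):
    """Second transform: pinhole projection of a virtual-scene point to pixel coordinates.

    Assumes the virtual camera looks along +z with y pointing up.
    """
    x, y, z = point
    return (width_px / 2 + focal_px * x / z, height_px / 2 - focal_px * y / z)


def screen_to_local(pixel, display_origin, right_axis, down_axis, metres_per_px):
    """Third transform: map a pixel to a 3D point in the local site.

    display_origin is the 3D position of pixel (0, 0); the axes are unit vectors.
    """
    u, v = pixel
    return tuple(
        display_origin[i] + metres_per_px * (u * right_axis[i] + v * down_axis[i])
        for i in range(3)
    )


def constant_power_pan(x, left_x, right_x):
    """Stand-in for block 308: left and right speaker gains for a source at position x."""
    t = min(1.0, max(0.0, (x - left_x) / (right_x - left_x)))
    angle = t * math.pi / 2
    return (math.cos(angle), math.sin(angle))
```

Because the third transform depends only on the pose of the display, its parameters (display_origin, the two axes and metres_per_px in this sketch) would be measured once and reused until the display is moved.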

1.2 Parallax Effect

It is noted that the location of the local participant within the local site has an effect on how audio source positioning is accomplished. In general, a parallax effect results when the local site participant moves, and in one embodiment, the spatial audio technique compensates based on the current location of the local participant. Generally, the head of the local site participant is tracked and periodically a 3D point representative of the location of the local site participant's head in the local site is computed. The point is then used in the audio source positioning. More particularly, in one embodiment, each time a 3D point representative of the location of the local site participant's head in the local site is computed, the spatial audio technique is used to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted taking into consideration the last-computed 3D point representative of the location of the local site participant's head.
It is noted that to provide a more realistic experience for the local participant, the rate at which 3D points representative of the location of the local site participant's head in the local site are computed should be high. In one embodiment, this rate exceeds the rate at which frames of the virtual scene are calculated. For example, a typical virtual scene frame rate is 30 frames per second (fps). In one implementation, the rate at which 3D points representative of the location of the local site participant's head are computed is four times the virtual frame rate, namely 120 times per second. Thus, while the content of the scene may only be updated at 30 fps, the depiction of the scene from the point of view of the local participant is updated at 120 fps. In other words, the scene is calculated at 30 fps, but rendered at 120 fps.
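As a minimal sketch of how the last-computed head point could enter the spatial audio computation, the horizontal angle of the apparent voice location relative to the listener can be recomputed whenever a new head point arrives. The coordinate convention (x to the right, y up, display at smaller z than the listener, listener facing the display) and the function name source_azimuth_deg are assumptions for illustration only.

```python
import math


def source_azimuth_deg(head, source):
    """Horizontal angle of the source as seen from the listener's head, in degrees.

    Positive values mean the source lies to the listener's right. The listener
    is assumed to face the display, that is, to look along decreasing z.
    """
    dx = source[0] - head[0]
    dz = head[2] - source[2]
    return math.degrees(math.atan2(dx, dz))
```

For a fixed apparent voice location on the display, moving the head to the right makes this angle more negative, which is the change the spatial audio technique would need to track to keep the voice anchored to the depiction.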
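The relationship between the two rates in the foregoing example can be made concrete with a short sketch. The function name render_schedule is an assumption; it simply lists, for each head-tracking update, which virtual scene frame is current at that moment.

```python
def render_schedule(seconds, scene_fps=30, head_hz=120):
    """Return (head_update_index, scene_frame_index) pairs for the given duration.

    Integer arithmetic keeps the mapping exact: with the default rates each
    scene frame is shown from four successive head positions.
    """
    ticks = seconds * head_hz
    return [(k, k * scene_fps // head_hz) for k in range(ticks)]
```

With the default rates, one second yields 120 head updates spread over 30 scene frames.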

1.3 Simulating Reverberation

The audio source positioning technique embodiments described so far make it seem to a participant viewing a rendering of the virtual scene that the voice of another participant is emanating from a location on the display device where the remote participant is depicted. However, there is another enhancement that can be made to make the video teleconference or telepresence session experience even more like the viewing participant is actually present with the other participant(s) in the virtual scene. This enhancement involves simulating the reverberations a participant's voice would create in the virtual scene (e.g., reverberations of the sound against the virtual walls or other virtual objects in the scene) and playing these reverberations in the participant's site.
This reverberation enhancement can be accomplished at the local site, given, from each remote site, the 3D point representing the location of the remote participant in the remote site and a modified version of the audio data representing the remote site participant's voice. The modification to the audio data involves suppressing reverberations and noise in the audio captured at the remote site. While this modification can be performed at the local site given certain information about the remote site, the more efficient method would be for the reverberations and noise in the audio captured at the remote site to be suppressed in the audio data prior to the data being sent to the local site. In either case, conventional suppression techniques are employed to accomplish the modification.
Assuming the above-described modified audio data and the 3D point representing the location of the remote participant have been received from a remote site, one general embodiment of the audio source positioning technique that adds reverberation on a frame-by-frame basis involves, from the viewpoint of the local site, using the local site computing device to perform the following process actions. First, the previously-described first transform computed to convert 3D locations in the remote site to points in the last-rendered frame of the virtual scene is employed to convert the 3D point representing the location of the remote participant in the remote site to a point in the last-rendered frame of the virtual scene (block 400). In this embodiment, the 3D point representing the location of the remote participant in the remote site corresponds to a 3D point representing the location of the remote participant's mouth in the remote site. Next, the orientation of the remote site participant's face in the virtual scene, as depicted in the last-rendered virtual scene frame, is identified (block 402). Conventional methods are employed to accomplish this task. The direction that the remote participant's voice projects in the virtual space from the point in the last-rendered frame of the virtual scene that corresponds to the 3D point representing the location of the remote participant's mouth is then computed based on the orientation of the remote site participant's face in the virtual scene (block 404). In addition, the reverberation characteristics of the virtual scene, as depicted in the last-rendered virtual scene frame, are estimated (block 406).
Given the point representing the location of the remote participant's mouth in the virtual scene and the computed direction, reverberation audio data is then computed that when added to the received audio data simulates the reverberations of the remote participant's voice in the virtual space for the current frame (block 408). This computed reverberation audio data is then added into the audio played in the local site in conjunction with the display of the current virtual scene frame (block 410).
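The reverberation computation itself is left to conventional methods. Purely as an illustration of how the mouth point and voice direction could feed such a computation, the sketch below models a single first-order reflection off one virtual wall using a mirrored image source. The wall being the plane x = wall_x, the absorption coefficient, the cardioid-style directivity weight, and all function names are assumptions introduced here; a practical system would model many reflections.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, approximate value for air at room temperature


def directivity(facing, source, target):
    """Weight of 1 when the voice is aimed at the target, 0 when aimed directly away."""
    to_target = [t - s for s, t in zip(source, target)]
    norm = math.hypot(*to_target) * math.hypot(*facing)
    cos_theta = sum(f * d for f, d in zip(facing, to_target)) / norm
    return 0.5 * (1.0 + cos_theta)


def first_order_reflection(mouth, facing, listener, wall_x, absorption, sample_rate):
    """Return (delay_samples, gain) for one reflection off the virtual wall x = wall_x."""
    image = (2 * wall_x - mouth[0], mouth[1], mouth[2])  # mouth mirrored in the wall
    direct = math.dist(mouth, listener)
    reflected = math.dist(image, listener)
    # Point on the wall where the reflected path bounces.
    t = (wall_x - image[0]) / (listener[0] - image[0])
    bounce = tuple(i + t * (l - i) for i, l in zip(image, listener))
    delay = round((reflected - direct) / SPEED_OF_SOUND * sample_rate)
    gain = (1.0 - absorption) * (direct / reflected) * directivity(facing, mouth, bounce)
    return delay, gain


def add_reverberation(dry, delay, gain):
    """Add a delayed, attenuated copy of the dry samples to the dry samples (block 410)."""
    out = list(dry) + [0.0] * delay
    for n, sample in enumerate(dry):
        out[n + delay] += gain * sample
    return out
```

In this sketch, turning the participant's face away from the wall lowers the directivity weight and therefore the level of the simulated reflection, which is the role the computed voice direction plays in block 408.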

1.4 Exemplary Video Conferencing or Telepresence Application

The audio source positioning technique embodiments described herein can be employed in a variety of video conferencing or telepresence applications. Generally, any video conferencing or telepresence application that involves the generation and display of a virtual scene for each participant can be enhanced using the audio source positioning technique embodiments described herein.
One exemplary video conferencing or telepresence application supports the generation, storage, distribution, and presentation of a virtual scene (such as a virtual conference room). The exemplary video conferencing or telepresence application can support various types of traditional, single viewpoint virtual scene presentations in which the viewpoint of the scene is fixed when the video is recorded/captured and this viewpoint cannot be controlled or changed by a participant while they are viewing the virtual scene. In other words, in a single viewpoint virtual scene the viewpoint of the scene is fixed and cannot be modified when the scene is being rendered and displayed to a participant. However, the exemplary video conferencing or telepresence application can support various types of free viewpoint video in which the viewpoint of the virtual scene can be interactively controlled and changed by a participant at will while they are viewing the scene. In other words, in a free viewpoint video a participant can interactively generate different viewpoints of the scene on-the-fly when the virtual scene is being rendered and displayed.

1.4.1 Video Conferencing or Telepresence Application Processing Pipeline

FIG. 5 illustrates an exemplary video conferencing or telepresence application processing pipeline in which the audio source positioning technique embodiments described herein can be implemented. As exemplified in FIG. 5, the exemplary processing pipeline 500 starts with a generation stage 502 during which, and generally speaking, the aforementioned scene proxies of a site are generated. The generation stage 502 includes a capture sub-stage 504 and a processing sub-stage 506 whose operations will now be described in more detail.
Referring again to FIG. 5, the capture sub-stage 504 of the processing pipeline 500 generally captures the scene in a site including the participant 508 and generates one or more streams of sensor data that represent the scene. More particularly, during the capture sub-stage 504, an arrangement of sensors is used to capture the scene, where the arrangement includes a plurality of video capture devices 510 (as will be described shortly) and one or more audio capture devices 512 (such as a microphone or microphone array). The arrangement of sensors generates a plurality of streams of sensor data each of which represents the scene from a different geometric perspective. These streams of sensor data are input from the sensors and calibrated, and then output to the processing sub-stage 506.
Referring again to FIG. 5, the processing sub-stage 506 inputs the streams of sensor data from the capture sub-stage 504, and then generates scene proxies which geometrically describe the captured scene as a function of time from the streams of sensor data. These scene proxies also include texture data for rendering the virtual scene. The scene proxies are output to a storage and distribution stage 514, which stores them, along with the aforementioned audio data captured using the audio capture devices 512. Typically, the generation stage 502 is implemented on one or a collection of computing devices at a participant site (such as the local site shown) and a presentation stage 516 of the pipeline 500 is implemented on one or more computing devices resident at the other participant sites (such as the exemplary remote site shown in FIG. 5). The storage and distribution stage 514 distributes the scene proxies and audio data to the other participating sites by transmitting over whatever one or more data communication networks 518 to which the participant site computing devices are connected. It is noted that each participant site has a generation stage 502 and storage and distribution stage 514 (although only those associated with the aforementioned local site are shown in FIG. 5).
Referring again to FIG. 5 and generally speaking, a presentation stage 516 of the processing pipeline 500 is resident at each of the other participating sites (one of which is shown). The presentation stage 516 inputs the scene proxies and audio data that were transmitted from the storage and distribution stage 514 resident at each of the other sites (again one of which is shown in FIG. 5), and presents the participant at the receiving site with a rendering of the scene proxies in the form of the previously described virtual scene frames. The presentation stage 516 includes a rendering sub-stage 520 and a participant viewing experience sub-stage 522 whose operations will now be described in more detail.
The rendering sub-stage 520 of the processing pipeline 500 inputs the scene proxies from the storage and distribution stage 514, and then generates successive frames of the virtual scene (one of which 524 is shown in FIG. 5). If more than one other participant site is involved, then generating successive frames of the virtual scene entails the rendering sub-stage 520 inputting the scene proxies from the storage and distribution stage 514 operating at each of the other sites, and combining the proxy data using conventional methods to create an aggregate virtual scene (such as 524). Each virtual scene frame generated is then output to the participant viewing experience sub-stage 522 of the pipeline 500. The participant viewing experience sub-stage 522 inputs each frame from the rendering sub-stage 520, and then displays it on a display device 526 for viewing by the participant. In addition, the audio source positioning technique embodiments described herein are implemented as described previously to provide spatialized audio in association with each frame displayed using two or more audio speakers 528 located in the receiving site.
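By way of illustration only, the following Python sketch shows one way the spatialization step could be organized: a 3D point received from a remote site is carried through three transforms (remote site to virtual scene, virtual scene to screen coordinates, screen coordinates to the receiving site) and the resulting point is used to weight two audio speakers. The function names, the use of 4x4 homogeneous matrices, the treatment of screen coordinates as 3D points, and the constant-power panning law are all assumptions made for this example; they are not taken from the embodiments described herein, which may employ any suitable spatial audio technique.

```python
import math

def identity():
    """Return a 4x4 identity transform as nested lists (row-major)."""
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

def translation(tx, ty, tz):
    """Return a 4x4 transform that translates a point by (tx, ty, tz)."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m

def apply_transform(m, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    v = (point[0], point[1], point[2], 1.0)
    out = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
    w = out[3] if out[3] != 0 else 1.0  # guard against a degenerate w
    return (out[0] / w, out[1] / w, out[2] / w)

def locate_voice(remote_point, remote_to_scene, scene_to_screen, screen_to_local):
    """Chain the three (hypothetical) transforms to find where, in the
    receiving site, the depicted remote participant appears."""
    scene_point = apply_transform(remote_to_scene, remote_point)
    screen_point = apply_transform(scene_to_screen, scene_point)
    return apply_transform(screen_to_local, screen_point)

def stereo_gains(source_x, left_x, right_x):
    """Constant-power pan between two speakers placed at left_x and right_x.
    This panning law is an assumption chosen for simplicity."""
    if right_x == left_x:
        raise ValueError("speakers must be at different x positions")
    t = (source_x - left_x) / (right_x - left_x)
    t = min(1.0, max(0.0, t))        # clamp to the span between the speakers
    angle = t * math.pi / 2.0
    return math.cos(angle), math.sin(angle)  # (left gain, right gain)

# Example: the remote participant's mouth, shifted 0.2 m by the first transform.
local_point = locate_voice((0.1, 1.6, 0.0), translation(0.2, 0.0, 0.0),
                           identity(), identity())
left_gain, right_gain = stereo_gains(local_point[0], -1.0, 1.0)
```

In this sketch the gains would be recomputed whenever a new frame, and hence a new 3D point, is received, so that the perceived source follows the depiction of the participant on the display device.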
It is noted that in a video conferencing or telepresence application that can support various types of free viewpoint video in which the viewpoint of the virtual scene can be interactively controlled and changed by a participant at will while they are viewing the scene, in addition to the foregoing, the rendering sub-stage 520 inputs the scene proxies output from the storage and distribution stage 514 (or stages if multiple other sites are involved), and then generates a frame exhibiting a current synthetic viewpoint. The current synthetic viewpoint is either a default viewpoint, or if the participant has specified a viewpoint, is the last-specified viewpoint. The participant-specified viewpoint comes from the participant viewing experience sub-stage 522, which inputs it from the participant via a user interface.
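The viewpoint selection rule just described (use the default viewpoint until the participant specifies one, and thereafter use the last-specified viewpoint) can be sketched as follows. The class and field names, and the numeric default, are hypothetical and are introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Viewpoint:
    """A synthetic viewpoint; the two fields are illustrative assumptions."""
    position: Tuple[float, float, float]
    look_at: Tuple[float, float, float]

# Hypothetical default viewpoint used until the participant specifies one.
DEFAULT_VIEWPOINT = Viewpoint((0.0, 1.5, 3.0), (0.0, 1.5, 0.0))

class ViewpointSelector:
    """Tracks the current synthetic viewpoint for the rendering sub-stage."""

    def __init__(self, default: Viewpoint = DEFAULT_VIEWPOINT) -> None:
        self._default = default
        self._last_specified: Optional[Viewpoint] = None

    def specify(self, viewpoint: Viewpoint) -> None:
        """Record a viewpoint entered by the participant via the user interface."""
        self._last_specified = viewpoint

    def current(self) -> Viewpoint:
        """Return the last-specified viewpoint, or the default if none was given."""
        if self._last_specified is not None:
            return self._last_specified
        return self._default
```

Under these assumptions, the rendering sub-stage would call current() once per frame, while the participant viewing experience sub-stage would call specify() whenever the participant changes the viewpoint.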

1.4.2 Video Capture Devices

Referring again to FIG. 5, this section provides an overview description, in simplified form, of two implementations of the video capture devices 510 of the capture sub-stage 504. It will be appreciated that the implementations described in this section are merely exemplary. Many other implementations are also possible which use other types of sensor arrangements.
In one implementation, the video capture devices 510 include a circular arrangement of eight genlocked sensors used to capture a site which includes the participant, where each of the sensors has a combination of one infrared structured-light projector, two infrared video cameras, and one color camera. Accordingly, the sensors each generate a different stream of video data which includes both a stereo pair of infrared image streams and a color image stream. The pair of infrared image streams and the color image stream generated by each sensor are used to generate different depth map image streams. The different depth map image streams are then merged into a stream of calibrated point cloud reconstructions of the scene. These point cloud reconstructions can then be used to generate a stream of mesh models of the scene. A conventional view-dependent texture mapping method which accurately represents specular textures such as skin is then used to extract texture data from the color image stream generated by each sensor and map this texture data to the stream of mesh models of the scene. The combination of the mesh models and texture data, among other information, forms the scene proxies. Finally, these sensors and their data streams are also used in a face tracking process to identify the 3D location of the participant (which as described above can be the location of the participant's head or mouth).
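As a rough illustration of the merging step in this implementation, the following Python sketch places the points recovered from each sensor into a common site coordinate frame and concatenates them into a single point cloud. It assumes that each sensor's depth map has already been converted to 3D points in that sensor's own coordinates, and that calibration yields a rotation matrix and translation vector per sensor; these inputs and the function names are assumptions for the example and are not specified by the implementation described above.

```python
def to_site_frame(point, rotation, translation):
    """Map a 3D point from sensor coordinates into the common site frame
    using a 3x3 rotation matrix and a translation vector."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def merge_point_clouds(sensor_clouds):
    """Merge per-sensor point lists into one calibrated point cloud.

    sensor_clouds is a list of (points, rotation, translation) tuples,
    one per sensor, where the rotation and translation are that sensor's
    (assumed) calibration relative to the site frame.
    """
    merged = []
    for points, rotation, translation in sensor_clouds:
        merged.extend(to_site_frame(p, rotation, translation) for p in points)
    return merged

# Example with two sensors: one offset along x, one rotated 90 degrees about z.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
cloud = merge_point_clouds([
    ([(0.0, 0.0, 1.0)], IDENTITY, (1.0, 0.0, 0.0)),
    ([(1.0, 0.0, 0.0)], ROT_Z_90, (0.0, 0.0, 0.0)),
])
```

A practical system would typically also remove duplicate or inconsistent points where the sensors' fields of view overlap; that refinement is omitted here.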
In another implementation, the video capture devices 510 include four genlocked visible light video cameras used to capture a site which includes the participant, where the cameras are evenly placed around the site. Accordingly, the cameras each generate a different stream of video data which includes a color image stream. An existing 3D geometric model of a human body can be used in the scene proxies as follows. Conventional methods can be used to kinematically articulate the model over time in order to fit (i.e., match) the model to the streams of video data generated by the cameras. The kinematically articulated model can then be colored as follows. A conventional view-dependent texture mapping method can be used to extract texture data from the color image stream generated by each camera and map this texture data to the kinematically articulated model. The combination of the kinematically articulated model and texture data, among other information, forms the scene proxies. Here again, the cameras and their video data streams are also used in a face tracking process to identify the 3D location of the participant (which can be the location of the participant's head or mouth).

2.0 Exemplary Operating Environments

The audio source positioning technique embodiments described herein are operational within numerous types of general purpose or special purpose computing system environments or configurations. FIG. 6 illustrates a simplified example of a general-purpose computer system on which various embodiments and elements of the audio source positioning technique embodiments, as described herein, may be implemented. It should be noted that any boxes that are represented by broken or dashed lines in FIG. 6 represent alternate embodiments of the simplified computing device, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.
For example, FIG. 6 shows a general system diagram showing a simplified computing device 10. Such computing devices can typically be found in devices having at least some minimum computational capability, including, but not limited to, personal computers, server computers, hand-held computing devices, laptop or mobile computers, communications devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, etc.
To allow a device to implement the audio source positioning technique embodiments described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by FIG. 6, the computational capability is generally illustrated by one or more processing unit(s) 12, and may also include one or more GPUs 14, either or both in communication with system memory 16. Note that the processing unit(s) 12 of the general computing device may be specialized microprocessors, such as a DSP, a VLIW, or other micro-controller, or can be conventional CPUs having one or more processing cores, including specialized GPU-based cores in a multi-core CPU.
In addition, the simplified computing device of FIG. 6 may also include other components, such as, for example, a communications interface 18. The simplified computing device of FIG. 6 may also include one or more conventional computer input devices 20 (e.g., pointing devices, keyboards, audio input devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, etc.). The simplified computing device of FIG. 6 may also include other optional components, such as, for example, one or more conventional display device(s) 24 and other computer output devices 22 (e.g., audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, etc.). Note that typical communications interfaces 18, input devices 20, output devices 22, and storage devices 26 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.
The simplified computing device of FIG. 6 may also include a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 10 via storage devices 26 and includes both volatile and nonvolatile media that is either removable 28 and/or non-removable 30, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes, but is not limited to, computer or machine readable media or storage devices such as DVD's, CD's, floppy disks, tape drives, hard drives, optical drives, solid state memory devices, RAM, ROM, EEPROM, flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage, or other magnetic storage devices, or any other device which can be used to store the desired information and which can be accessed by one or more computing devices.
Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Further, software, programs, and/or computer program products embodying some or all of the various audio source positioning technique embodiments described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
Finally, the audio source positioning technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.

3.0 Other Embodiments

While the audio source positioning technique embodiments described so far involve only one participant at each site, in one embodiment it is possible to have any number of participants at a site, as long as a separate audio stream and separate location information are sent for each participant. In general, the operation is the same as described for a site sending audio source positioning data to another site or sites, except that audio data representing a remote site participant's voice and the 3D point representing the location of the participant in the remote site is sent for each participant at the site. At a site receiving this data, the virtual scene is rendered so as to include all the remote site participants as before (including each participant at a site having multiple participants). If the receiving site has one participant, then the spatial audio technique employed to spatialize the audio is accomplished in the same manner as described previously. However, if the receiving site has more than one participant, the sound is separately spatialized as described previously for each participant. This can be easily accomplished if the participants each wear audio earphones (i.e., the plurality of audio speakers at the site are sets of headphones) and a spatial audio technique designed for earphones is employed.
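A minimal sketch of the per-participant data implied by this embodiment is given below: each frame sent by a site carries, for every participant at that site, a separate 3D location and a separate audio stream (which may be absent if the participant did not speak), and the receiving site spatializes each voice separately for each of its own participants. The type names, field names, and the representation of audio and scene proxies as byte strings are hypothetical and are chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class ParticipantAudio:
    """Audio and location for one participant at the sending site."""
    participant_id: str
    location: Point3D          # e.g., the participant's head or mouth
    audio: Optional[bytes]     # None if no voice was captured this interval

@dataclass
class FramePayload:
    """What one site sends with one frame of scene proxies (illustrative)."""
    scene_proxies: bytes
    participants: List[ParticipantAudio] = field(default_factory=list)

def spatialization_jobs(payload: FramePayload, local_listener_ids: List[str]):
    """List one (listener, speaker, location) job per local listener and per
    remote participant whose voice was captured in this frame interval."""
    return [
        (listener, p.participant_id, p.location)
        for listener in local_listener_ids
        for p in payload.participants
        if p.audio
    ]

# Example: two remote participants (one silent) and two local listeners.
payload = FramePayload(
    scene_proxies=b"",
    participants=[
        ParticipantAudio("remote-a", (0.0, 1.6, 0.5), b"\x00\x01"),
        ParticipantAudio("remote-b", (0.8, 1.5, 0.6), None),
    ],
)
jobs = spatialization_jobs(payload, ["local-1", "local-2"])
```

Each job in the resulting list would then be handed to whatever spatial audio technique is in use, for example one designed for earphones when each local participant wears a headset.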
It is noted that any or all of the aforementioned embodiments throughout the description may be used in any combination desired to form additional hybrid embodiments. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

Wherefore, what is claimed is:

1. A computer-implemented process for audio source positioning in a video teleconference or telepresence session between a local site and one or more remote sites, each of said sites having one or more participants, comprising for the local site:

using a computing device to perform the following process actions:

receiving from each remote site, scene proxies representing successive scene proxy frames transmitted by a remote site over a data communication network;

receiving from at least one remote site, along with each frame of scene proxies received from the site,

audio data representing each remote site participant's voice captured, if any, during the time period between the currently received frame and the next frame of scene proxies to be received from the remote site, and

a 3D point representing the location of each participant in the remote site;

for each frame of scene proxies received from a remote site if there is only one remote site sending frames, or for each group of frames of scene proxies contemporaneously received from remote sites if there are multiple remote sites sending frames,

rendering a frame of a virtual scene comprising a depiction of each of the remote site participants from the last-received frame or frames of scene proxies, and

displaying the rendered frame to the local site participant or participants via a display device;

for each remote site participant depicted in the last-rendered frame of the virtual scene that is resident at a remote site that sent audio data representing the remote site participant's voice and the 3D point representing the location of the participant in the remote site, employing a spatial audio technique to make it seem to each local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted using the audio data and the 3D point representing the location of the participant in the remote site that was received from the remote site in conjunction with the last-received frame of scene proxies.

2. The process of claim 1, wherein said 3D point representing the location of a participant in a remote site is a 3D point representing the location of the participant's mouth in that remote site for each remote site participant depicted in the last-rendered frame of a virtual scene whose mouth is visible.

3. The process of claim 1, wherein said 3D point representing the location of a participant in a remote site is a 3D point representing the location of the participant's head in that remote site for each remote site participant depicted in the last-rendered frame of a virtual scene whose mouth is not visible.

4. The process of claim 1, wherein the process action of rendering a frame of the virtual scene, comprises, for each remote site, computing a first transform that converts 3D locations in the remote site to points in the frame of the virtual scene, and wherein the process action of displaying the rendered frame to the local site participant via a display device, comprises computing a second transform that converts points in a frame of the virtual scene to screen coordinates on the display device.

5. The process of claim 4, wherein the process action of employing a spatial audio technique to make it seem to the local site participant that the voice of a remote site participant is emanating from a location on the display device where the remote participant is depicted using the audio data and the 3D point representing the location of a remote participant in the remote site that was received in conjunction with the last-received frame of scene proxies from the remote site, comprises the actions of:

employing the first transform computed to convert 3D locations in the remote site to points in the last-rendered frame of the virtual scene, to convert the 3D point representing the location of the remote participant in the remote site to a point in the last-rendered frame of the virtual scene;

employing the second transform computed to convert points in a frame of the virtual scene to screen coordinates on the display device, to convert the point in the last-rendered frame of the virtual scene representing the remote participant location to screen coordinates on the display device;

employing a third transform that converts screen coordinates in the display device to 3D points in the local site to compute the 3D point in the local site of the screen coordinates representing the location of the remote participant depicted on the display device; and

employing said spatial audio technique and a plurality of audio speakers resident in the local site to make it seem to the local site participant that the voice of the remote site participant is emanating from the computed 3D point in the local site of the screen coordinates representing the location of the remote participant depicted on the display device.

6. The process action of claim 1, wherein the process action of employing a spatial audio technique to make it seem to a local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted, further comprises the actions of:

tracking the head of the local site participant and periodically computing a 3D point representative of the location of the local site participant's head in the local site; and

each time a 3D point representative of the location of the local site participant's head in the local site is computed, employing the spatial audio technique to make it seem to the local site participant that the voice of the remote site participant is emanating from a location on the display device where the remote participant is depicted taking into consideration the last-computed 3D point representative of the location of the local site participant's head.

7. The process of claim 6, wherein the process action of periodically computing a 3D point representative of the location of a local site participant's head in the local site, comprises computing a 3D point representative of the location of the local site participant's head in the local site at a rate that exceeds the rate at which frames of the virtual scene are computed.

8. The process of claim 1, wherein said audio data representing a remote site participant's voice received from a remote site has been modified so as to suppress reverberations and noise in the audio captured at that remote site.

9. The process of claim 8, wherein the process action of rendering a frame of the virtual scene, comprises for each remote site, computing a first transform that converts 3D locations in the remote site to points in the frame of the virtual scene.

10. The process of claim 9, further comprising the process actions of:

for each remote site participant depicted in the last-rendered frame of the virtual scene that is resident at a remote site that sent audio data representing the remote site participant's voice and the 3D point representing the location of the participant in the remote site,

employing the first transform computed to convert 3D locations in the remote site to points in the last-rendered frame of the virtual scene, to convert the 3D point representing the location of the remote participant in the remote site to a point in the last-rendered frame of the virtual scene, wherein said 3D point representing the location of the remote participant in the remote site corresponds to a 3D point representing the location of the remote participant's mouth in the remote site,

identifying the orientation of the remote site participant's face in the virtual scene as depicted in the last-rendered virtual scene frame,

computing the direction from the point in the last-rendered frame of the virtual scene that corresponds to the 3D point representing the location of the remote participant's mouth in the remote site that the remote participant's voice projects in the virtual space based on the orientation of the remote site participant's face in the virtual scene,

estimating the reverberation characteristics of the virtual scene as depicted in the last-rendered virtual scene frame,

computing reverberation audio data that when added to the received audio data simulates the reverberations of the remote participant's voice in the virtual scene as spoken from the point representing the location of the remote participant's mouth in the virtual scene in the computed direction, and

adding the computed reverberation audio data into audio played in the local site in conjunction with the display of the virtual scene frame.

11. A computer-implemented process for facilitating audio source positioning at a remote site in a video teleconference or telepresence session between a local site and the remote site, each of said sites having one or more participants, comprising for the local site:

using a computing device to perform the following process actions:

inputting streams of sensor data generated from an arrangement of sensors that capture participant data, said arrangement comprising a plurality of video and audio devices which generate a plurality of streams of sensor data, each video capture device of which captures the participant from a different geometric perspective, and each audio capture device of which captures the voice of the participant at the local site;

generating scene proxies from the streams of sensor data which geometrically describes the local site including the participant on a frame by frame basis;

employing the streams of sensor data and a face tracking technique to identify a 3D point representing the location of the participant in the local site for each frame of the scene proxies; and

transmitting the scene proxies representing each frame in the order generated over a data communication network to the remote site, along with,

audio data representing each local site participant's voice captured, if any, during the time period between the frame currently being transmitted and next frame of scene proxies to be transmitted, and

the 3D point coordinates representing the location of each participant in the local site identified for the frame currently being transmitted.

12. The process of claim 11, wherein said 3D point representing the location of a participant in the local site is a 3D point representing the location of the participant's head in the local site.

13. The process of claim 11, wherein said 3D point representing the location of a participant in the local site is a 3D point representing the location of the participant's mouth in the local site.

14. The process of claim 11, wherein prior to performing the process action of transmitting audio data representing a local site participant's voice, performing an action of suppressing reverberations and noise in the audio data.

15. A computer-implemented process for audio source positioning in a video teleconference or telepresence session between two non co-located sites, each of said sites having one participant, comprising for a first of the two sites:

using a computing device to perform the following process actions:

receiving from the other site, scene proxies representing successive scene proxy frames transmitted by the other site over a data communication network, along with for each scene proxy frame received,

audio data representing the other site participant's voice captured, if any, during the time period between the currently received frame and the next frame of scene proxies to be received from the other site, and

a 3D point representing the location of the participant in the other site;

for each frame of scene proxies received from the other site, rendering a frame of a virtual scene comprising a depiction of the other site's participant from the last-received frame of scene proxies and displaying the rendered frame to the first site participant via a display device; and

whenever audio data representing the other site participant's voice is received, employing a spatial audio technique to make it seem to the first site participant that the voice of the other site participant is emanating from a location on the display device where the other site participant is depicted using the audio data and the 3D point representing the location of the participant in the other site that was received from the other site in conjunction with the last-received frame of scene proxies.

16. The process of claim 15, wherein said 3D point representing the location of the participant in the other site is a 3D point representing the location of the participant's mouth in the other site whenever the other site participant's mouth is visible in the last-rendered frame of a virtual scene.

17. The process of claim 15, wherein said 3D point representing the location of the participant in the other site is a 3D point representing the location of the participant's head in the other site whenever the other site participant's mouth is not visible in the last-rendered frame of a virtual scene.

18. The process of claim 15, wherein the process action of rendering a frame of the virtual scene, comprises, computing a first transform that converts 3D locations in the other site to points in the frame of the virtual scene, and wherein the process action of displaying the rendered frame to the first site participant via a display device, comprises computing a second transform that converts points in a frame of the virtual scene to screen coordinates on the display device.

19. The process of claim 18, wherein the process action of employing a spatial audio technique to make it seem to the first site participant that the voice of the other site participant is emanating from a location on the display device where the other site participant is depicted using the audio data and the 3D point representing the location of the other participant in the other site that was received in conjunction with the last-received frame of scene proxies, comprises the actions of:

employing the first transform computed to convert 3D locations in the other site to points in the last-rendered frame of the virtual scene, to convert the 3D point representing the location of the other site participant in the other site to a point in the last-rendered frame of the virtual scene;

employing the second transform computed to convert points in a frame of the virtual scene to screen coordinates on the display device, to convert the point in the last-rendered frame of the virtual scene representing the other participant's location to screen coordinates on the display device;

employing a third transform that converts screen coordinates in the display device to 3D points in the first site to compute the 3D point in the first site of the screen coordinates representing the location of the other site participant depicted on the display device; and

employing said spatial audio technique and a plurality of audio speakers resident in the first site to make it seem to the first site participant that the voice of the other site participant is emanating from the computed 3D point in the first site of the screen coordinates representing the location of the other participant depicted on the display device.

20. The process action of claim 15, wherein the process action of employing a spatial audio technique to make it seem to the first site participant that the voice of the other site participant is emanating from a location on the display device where the other participant is depicted, further comprises the actions of:

tracking the head of the first site participant and periodically computing a 3D point representative of the location of the first site participant's head in the first site; and

each time a 3D point representative of the location of the first site participant's head in the first site is computed, employing the spatial audio technique to make it seem to the first site participant that the voice of the other site participant is emanating from a location on the display device where the other participant is depicted taking into consideration the last-computed 3D point representative of the location of the first site participant's head.