JP7587391B2

JP7587391B2 - Video encoding device and program

Info

Publication number: JP7587391B2
Application number: JP2020176680A
Authority: JP
Inventors: 菊文神田
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-10-21
Filing date: 2020-10-21
Publication date: 2024-11-20
Anticipated expiration: 2040-10-21
Also published as: JP2022067849A

Description

The present invention relates to a video encoding device and program that encode and transmit video characterized by ultra-high definition, a large screen, and a wide viewing angle.

Video-related content generation devices are known that analyze an input video signal, detect the position coordinates of video objects in a video scene, and generate video-related content by associating those coordinates with character string data and audio data, prepared in advance, that relate to the content of the video (see, for example, Patent Document 1). Therefore, before the video signal is encoded, it is possible to obtain the video signal that serves as the original signal, video metadata indicating the position coordinates of video objects, and audio data related to the content of the video.

Various techniques are also known for detecting the position coordinates and shape coordinates of a video object from the video frames of an input video signal (see, for example, Patent Document 2).

In recent years, immersive media (media offering a highly realistic sense of immersion) have been anticipated as the video media of the future. Beyond immersive media, "Diverse Vision" is being studied as a service that lets anyone view and experience a variety of content on a device of their choice, regardless of time or place (see, for example, Non-Patent Document 1).

Because these future video systems aim to provide a highly realistic viewing experience, the video they use is characterized by ultra-high definition, a large screen, and a wide viewing angle. The amount of data for such video is enormous, so realizing a service that encodes and transmits it requires an extremely efficient, high-performance video encoding device.

Video content, immersive or not, is accompanied by audio that is linked to the video. Audio data can be represented in various ways depending on the system. For future advanced content, systems are being studied, object-based audio systems among them, in which speakers and sound sources are defined as objects and various associated information, such as the playback position of each audio object, is attached as metadata (see, for example, Non-Patent Document 2).

For example, a real-time transmission device for audio metadata attaches audio metadata to an audio signal (see, for example, Non-Patent Document 3). ITU-R specifies the Audio Definition Model (ADM) as an international standard for audio metadata. The ADM consists of a content part, which describes the program content, and a format part, which describes the speaker layout, object playback positions, and the like. Metadata that changes over time, such as the playback position of an object, is called dynamic metadata and can be handled separately from static metadata, such as the content part, that does not change throughout the program.

Patent Document 1: JP 2004-15523 A
Patent Document 2: JP 2001-307091 A

Non-Patent Document 1: "Diverse Vision: Media Technology from 2030 to 2040", [online], NHK Science & Technology Research Laboratories, Open House 2019, held on May 30, 2019, [retrieved on September 18, 2020], Internet: https://www.nhk.or.jp/strl/open2019/tenji/e1.html
Non-Patent Document 2: "Object-based audio", [online], NHK Science & Technology Research Laboratories, Giken Dayori, serialized in the January 2019 issue, [retrieved on September 18, 2020], Internet: https://www.nhk.or.jp/strl/publica/giken_dayori/166/5.html
Non-Patent Document 3: Kubo and Oide, "Development of a real-time transmission device for audio metadata in object-based audio", 2019 Institute of Image Information and Television Engineers Winter Conference, 22A-3, presented on December 13, 2019

As described above, providing a service that encodes and transmits ultra-high-definition, large-screen, wide-viewing-angle video over a channel of limited capacity, such as broadcasting or distribution, requires video coding that compresses the data to fit the transmission capacity. Conventional video coding techniques, such as those for 8K video, either encode each frame uniformly regardless of its content, or divide the frame into several regions and allocate the code amount to each region mainly according to characteristics of the video signal components, such as the coding difficulty of the region.

Immersive media are expected to use ultra-high-definition, large-screen, wide-viewing-angle video that exceeds 8K. A viewer watching such video does not look at the whole screen uniformly; the dominant viewing pattern is to gaze at one part of the picture, depending on its content. To make effective use of limited transmission capacity, therefore, it is not always necessary to encode the entire screen uniformly. In many cases the quality of the video content stays high even if high quality is maintained only in the region of interest (ROI), that is, the gaze area the viewer is looking at. It is therefore effective, when encoding a video signal, to control image quality by raising or lowering the bit rate according to the viewer's gaze area, so that the enormous amount of video data is encoded efficiently.

With conventional technology, however, the area that the viewer is gazing at (the gaze area) must be identified and reported to the encoder in order to control the video encoding device. Moreover, the gaze area often differs from viewer to viewer, so the video encoding device must be operated under different control for each viewer's gaze area.

If, on the other hand, the gaze area could be guided automatically according to the video, there would be no need to notify the video encoding device of the gaze area as additional information. Beyond that, gaze guidance could be actively used as an element that effectively enhances the viewing experience of the video and audio content.

For example, if a viewer is called to from some direction or hears a loud sound while watching content, the natural reaction is to look toward the sound and shift the gaze area. Exploiting this, video objects in the input video can be linked to audio objects, the area around the video object lying in the direction from which an audio object's sound originates can be defined as the gaze area, and encoding can be controlled so that the coding quality of that gaze area is kept higher than that of other areas, thereby improving subjective quality.

In short, what is desired is a video encoding device that guides the gaze area of the video in accordance with the video and audio, and encodes the video while keeping the quality of the video content high.

However, although the object-based audio systems described above are being studied as systems in which various associated information, such as the playback position of audio objects, is attached as metadata, they do not link these audio objects to video objects in the two-dimensional video or use them for video encoding.

In view of the above problems, an object of the present invention is to provide a video encoding device and program that guide the gaze area of a video in accordance with the video and audio, and encode the video while keeping the quality of the video content high.

The video encoding device of the present invention is a video encoding device that encodes video so as to guide the gaze area of the video in accordance with the video and audio, and is characterized by comprising: a video object position/shape extraction unit that receives pre-encoding video and extracts, either from accompanying video metadata that includes information indicating the position coordinates of video objects in a video frame or by video analysis, the position coordinates and shape coordinates of the video objects in each video frame to be processed that constitutes the video; an audio position generation unit that, based on audio metadata accompanying the video, generates for each video frame to be processed the position coordinates in that frame of each audio object treated as a sound source and links the audio object to the nearest video object based on the position coordinates of the video objects, that detects the loudness of the audio object either by receiving the audio of the audio object in the video frame to be processed or by referring to the video metadata, and that associates the position coordinates of the video object linked to the audio object in the video frame to be processed with the loudness information; an induced gaze area generation unit that, based on the position coordinates of the video object linked to the audio object, the loudness information, and the shape coordinates of the video object, determines according to a predetermined criterion, for each video frame to be processed, which of the video objects linked as sound sources is to be the target of gaze guidance, and determines an induced gaze area so as to surround the shape of that object; an encoding control information generation unit that, based on information on the induced gaze area, controls the code amount used when encoding the video frame to be processed so that the coded image quality of the induced gaze area is higher than that of other areas; and a video encoding unit that encodes the video under that code amount control and generates an encoded stream including encoding parameters that indicate at least the code amount control information.

The video encoding device of the present invention is also characterized in that the viewing environment on the receiving side of the encoded stream is assumed and defined in advance, and the audio metadata includes, for each audio object in the video frame to be processed, predetermined sound source position information in a virtual space defined in advance for the assumed viewing position and the assumed position of the display device on the receiving side.

The video encoding device of the present invention is also characterized in that the video objects in the video metadata and the audio objects in the audio metadata are linked in advance, before the encoding process.

The video encoding device of the present invention is also characterized in that the audio metadata includes information indicating the loudness of each audio object.

The video encoding device of the present invention is further characterized in that it is connected online to a receiving system that receives the encoded stream so that bidirectional communication is possible, and further comprises means for acquiring, from the receiving system, information indicating the actual viewing position, the viewing direction from the actual viewing position, or both; and in that the induced gaze area generation unit further comprises means for determining the induced gaze area by correcting the assumed viewing position obtained from the audio metadata, the direction from the assumed viewing position, or both.

Furthermore, the program of the present invention is configured as a program that causes a computer to function as the video encoding device of the present invention.

According to the present invention, the amount of encoded data can be reduced without lowering subjective quality, and it also becomes possible to produce content that reflects the intended direction more strongly and leads to a more effective viewing experience.

FIG. 1 is a block diagram showing a schematic configuration of a video encoding device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the video encoding process in the video encoding device according to an embodiment of the present invention.
FIGS. 3(a) and 3(b) are diagrams showing examples of information obtained from video metadata and audio metadata, respectively, in the video encoding device according to an embodiment of the present invention.
FIG. 4 is a diagram explaining the operation of the video encoding device according to an embodiment of the present invention.

(Device configuration)
The configuration of a video encoding device 1 according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing a schematic configuration of the video encoding device 1. The video encoding device 1 shown in FIG. 1 links video objects in the input video to audio objects, defines the area around the video object lying in the direction from which an audio object's sound originates as the gaze area, and controls encoding so that the coding quality of that gaze area is kept higher than that of other areas. It comprises a video object position/shape extraction unit 2, an audio position generation unit 3, an induced gaze area generation unit 4, an encoding control information generation unit 5, and a video encoding unit 6.

The video object position/shape extraction unit 2 receives pre-encoding video (a pre-encoding video file handled in units of one video frame or one GOP (group of pictures)). It extracts the position coordinates and shape coordinates of the video objects in the video frame to be processed, either from accompanying video metadata that includes information indicating the position coordinates of video objects in the video frame or by video analysis, and outputs them to the audio position generation unit 3 and the induced gaze area generation unit 4, respectively. The position coordinates of a video object need only identify a representative position of the object; for example, the centroid of the video object is used.
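As a rough illustration of using a centroid as the representative position, the following sketch assumes the shape coordinates are available as a list of (h, v) sample points belonging to the object (for example, its pixels). The function name `centroid` and this data layout are hypothetical; the source does not prescribe how the centroid is computed.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (h, v) position inside the video frame


def centroid(shape_coords: List[Point]) -> Point:
    """Return the mean (h, v) of the sample points that make up a video object.

    Assumption: the object's shape is given as a non-empty list of points.
    """
    if not shape_coords:
        raise ValueError("shape_coords must not be empty")
    n = len(shape_coords)
    h = sum(p[0] for p in shape_coords) / n
    v = sum(p[1] for p in shape_coords) / n
    return (h, v)
```

Any other representative point (for instance, the center of a bounding box) would serve equally well, since the text only requires that the position identify the object.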

When accompanying video metadata is supplied, the video object position/shape extraction unit 2 obtains the position coordinates of each video object in the video frame to be processed from that metadata, and extracts the shape (shape coordinates) of the video object located at those position coordinates. When the position coordinates of a video object in the video frame to be processed cannot be obtained from the video metadata, the unit extracts both the position and the shape of the video object from the video frame itself. Any method may be used to extract the position and shape coordinates; for example, the techniques of Patent Documents 1 and 2 can be used.

The audio position generation unit 3 receives the audio metadata accompanying the video and, for each video frame to be processed, generates from the audio metadata the position coordinates in that frame of each audio object treated as a sound source. Referring to the video object position coordinates obtained from the video object position/shape extraction unit 2, it selects the video object nearest to the position coordinates of the audio object and links the audio object to that video object. The audio position generation unit 3 also receives the audio of each such audio object in the video frame to be processed, detects its loudness (for example, the average or maximum acoustic power), associates the position coordinates of the video object linked to the audio object with the loudness information, and outputs them to the induced gaze area generation unit 4. When the audio metadata already includes information indicating the loudness of the audio object, the audio input can be omitted.
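The linking and loudness steps can be sketched as follows. This is a minimal illustration under stated assumptions: audio and video object positions are already expressed in the same frame coordinates, distance is Euclidean, and loudness is taken as the mean of squared samples. The names `link_nearest` and `mean_power` and the dictionary layout are hypothetical and do not come from the source.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (h, v) position inside the video frame


def mean_power(samples: List[float]) -> float:
    """Average acoustic power of one frame's audio (mean of squared samples)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def link_nearest(audio_pos: Dict[str, Point],
                 video_pos: Dict[str, Point]) -> Dict[str, str]:
    """Map each audio object id to the id of the nearest video object."""
    links = {}
    for a_id, a_xy in audio_pos.items():
        # Pick the video object whose position is closest to this audio object.
        links[a_id] = min(video_pos,
                          key=lambda v_id: math.dist(a_xy, video_pos[v_id]))
    return links
```

Because each audio object is matched independently, several audio objects may end up linked to the same video object; whether that is acceptable depends on the content.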

This embodiment assumes that the video objects in the video metadata and the audio objects in the audio metadata have not been linked at the production stage. The audio metadata includes, for each audio object in the video frame to be processed, predetermined sound source position information in a virtual space defined in advance for the assumed viewing position and the assumed position of the display device on the receiving side. The audio position generation unit 3 can therefore generate, from the audio metadata, position coordinates in the video frame to be processed, relative to the assumed viewing position, for each audio object treated as a sound source, and can link the audio object to the nearest video object based on the video object position coordinates. On the other hand, when the video objects in the video metadata and the audio objects in the audio metadata have already been linked at the production stage, for example in video production using CG (computer graphics), as in the object-based audio systems described above, the audio position generation unit 3 omits this linking process. It then associates the position coordinates of the video object that was linked in advance to the audio object in the video frame to be processed with the loudness information, and outputs them to the induced gaze area generation unit 4.

The induced gaze area generation unit 4 receives, from the audio position generation unit 3, the position coordinates of the video object linked to each audio object in the video frame to be processed together with the loudness information, and, from the video object position/shape extraction unit 2, the shape coordinates of the video objects. For each video frame to be processed, it determines which of the video objects linked as sound sources is to be the target of gaze guidance, according to predetermined criteria such as the content of the audio (for example, a call to the viewer) and the position and size of the video object. It then determines an induced gaze area so as to surround the shape of that object and outputs information on the induced gaze area to the encoding control information generation unit 5. Here, "so as to surround the shape of the video object" may mean the area of the object's shape itself, an area that traces the shape with a margin of a few pixels around it, or a predetermined round or rectangular area that contains the entire shape.
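One of the options described above, a rectangular area containing the whole shape plus a small margin, might be sketched as below. The selection rule used here (pick the loudest linked object) is only an assumed stand-in for the "predetermined criteria", which the source leaves open; `pick_guided_object`, `guided_region`, and the `margin` parameter are hypothetical names.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]          # (h, v) pixel position
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def pick_guided_object(loudness: Dict[str, float]) -> str:
    """Assumed criterion: choose the linked video object with the loudest sound."""
    return max(loudness, key=loudness.get)


def guided_region(shape: List[Point], frame_size: Tuple[int, int],
                  margin: int = 0) -> Box:
    """Bounding box around the object's shape, widened by `margin` pixels
    and clipped to the frame of size (H, V)."""
    H, V = frame_size
    hs = [p[0] for p in shape]
    vs = [p[1] for p in shape]
    left = max(min(hs) - margin, 0)
    top = max(min(vs) - margin, 0)
    right = min(max(hs) + margin, H - 1)
    bottom = min(max(vs) + margin, V - 1)
    return (left, top, right, bottom)
```

A rectangle is convenient because most block-based encoders control quality per block; the shape-exact or circular variants mentioned in the text would need a per-pixel or per-block mask instead.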

Based on the induced gaze area information obtained from the induced gaze area generation unit 4, the encoding control information generation unit 5 generates encoding control information that makes the coded image quality of the induced gaze area higher than that of other areas, and thereby controls the code amount used by the video encoding unit 6 when encoding the video frame to be processed.
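The source does not say what form the encoding control information takes. One common way to shift code amount toward a region, assumed here purely for illustration, is a per-block quantization parameter (QP) offset map: blocks overlapping the induced gaze area get a negative offset (finer quantization), the rest a positive one. The function name, block size, and offset values below are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def qp_offset_map(frame_size: Tuple[int, int], roi: Box, block: int = 64,
                  roi_offset: int = -4, bg_offset: int = 2) -> List[List[int]]:
    """Return a rows x cols grid of QP offsets, one per block of the frame."""
    H, V = frame_size
    left, top, right, bottom = roi
    cols = (H + block - 1) // block  # number of blocks across
    rows = (V + block - 1) // block  # number of blocks down
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            b_left, b_top = c * block, r * block
            b_right, b_bottom = b_left + block - 1, b_top + block - 1
            # A block belongs to the gaze area if the rectangles overlap.
            overlaps = not (b_right < left or b_left > right or
                            b_bottom < top or b_top > bottom)
            row.append(roi_offset if overlaps else bg_offset)
        grid.append(row)
    return grid
```

How such a map is passed to an actual encoder depends on the encoder and is outside the scope of this sketch.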

Under the code amount control performed by the encoding control information generation unit 5, the video encoding unit 6 encodes the input video in units of one video frame or one GOP, generates an encoded stream that includes encoding parameters (including the encoding control information) indicating at least that code amount control, and transmits the stream to a video decoding device (not shown).

(Device operation)
The video encoding process and operation of the video encoding device 1 of this embodiment are described below based on FIG. 2, with reference to FIGS. 3 and 4. FIG. 2 is a flowchart showing the video encoding process in the video encoding device 1. FIGS. 3(a) and 3(b) show examples of the information obtained from the video metadata and the audio metadata, respectively. FIG. 4 is a diagram explaining the operation of the video encoding device 1.

Referring to FIG. 2, the video object position/shape extraction unit 2 of the video encoding device 1 first receives pre-encoding video in units of one video frame or one GOP (group of pictures) (step S1). It then extracts the position coordinates and shape coordinates of the video objects in the video frame to be processed, either from accompanying video metadata that includes information indicating the position coordinates of video objects in the video frame or by video analysis (step S2). The position coordinates of a video object need only identify a representative position of the object; for example, the centroid of the video object is used.

When the video objects in the video metadata and the audio objects in the audio metadata have been linked in advance, before the encoding process, there is no need to assume and define the viewing environment on the receiving side beforehand. In this embodiment, however, as shown in FIG. 4, the viewing environment on the receiving side of the encoded stream generated by the video encoding device 1 is assumed and defined in advance. For example, in world coordinates (X, Y, Z) whose origin is the assumed viewing position, the position of the display device relative to the assumed viewing position and the size (H, V) of the video frame Fn corresponding to that display device are defined. In other words, the position coordinates of the video frame Fn are also defined in the world coordinates (X, Y, Z). The example in FIG. 4 is only one example; the assumed viewing position and the world coordinates (X, Y, Z) may be defined arbitrarily.

Accordingly, when accompanying video metadata is supplied, the video object position/shape extraction unit 2 obtains the position coordinates of each video object in the video frame to be processed from that metadata, and extracts the shape (shape coordinates) of the video object located at those position coordinates. As shown in FIG. 3(a), the video metadata can be such that, for each predefined video object in a video frame Fn (where n is the frame number), the position coordinates of the object can be extracted within the range of the frame size (H, V); for example, the centroid coordinates of video object #1 are (h1, v1) and those of video object #2 are (h2, v2).
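The per-frame information described for FIG. 3(a) could be held in a small record like the one below. This is only an assumed in-memory layout for illustration; the class name, field names, and the `validate` helper are hypothetical, and the source does not define a concrete metadata format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class VideoFrameMetadata:
    """Centroid positions of the video objects in one frame Fn."""
    frame_number: int                        # n
    frame_size: Tuple[int, int]              # (H, V)
    centroids: Dict[int, Tuple[int, int]]    # object number -> (h, v)

    def validate(self) -> bool:
        """Check that every centroid lies within the frame size (H, V)."""
        H, V = self.frame_size
        return all(0 <= h < H and 0 <= v < V
                   for h, v in self.centroids.values())
```

For example, `VideoFrameMetadata(0, (7680, 4320), {1: (1000, 2000), 2: (7000, 500)})` would describe two objects in an 8K frame.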

When the position coordinates of a video object in the video frame to be processed cannot be obtained from the video metadata, the video object position/shape extraction unit 2 extracts both the position and the shape of the video object from the video frame itself. In this case, it provisionally assigns a video object number to identify each video object in the video frame Fn.

Next, the audio position generation unit 3 of the video encoding device 1 receives the audio metadata accompanying the video. For each video frame to be processed, it generates from the audio metadata the position coordinates in that frame of each audio object treated as a sound source, and links the audio object to the nearest video object based on the video object position coordinates obtained from the video object position/shape extraction unit 2. It also receives the audio of each such audio object in the video frame to be processed, detects its loudness (for example, the average or maximum acoustic power), and associates the position coordinates of the video object linked to the audio object with the loudness information (step S3). When the audio metadata already includes information indicating the loudness of the audio object, the audio input can be omitted.

This embodiment describes the case in which the video objects in the video metadata and the audio objects in the audio metadata have not been linked at the production stage. In this case the audio metadata contains, for each audio object in the video frame being processed, predetermined sound source position information in a virtual space defined in advance by the assumed viewing position on the receiving side and the assumed position of the display device. Therefore, as shown in FIG. 3(b), the sound source direction relative to the assumed viewing position can be identified for each audio object in the frame. For each predefined audio object in a video frame Fn, the audio metadata gives, for example, a sound source direction (x1, y1, z1) for audio object #1 and (x2, y2, z2) for audio object #2, both relative to the assumed viewing position. Because the position of the display relative to the assumed viewing position is defined in world coordinates (X, Y, Z), the position coordinates of each audio object within video frame Fn can be derived.
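One way to picture the mapping from a sound source direction to frame coordinates is a simple ray-plane projection. The sketch below assumes a geometry the specification does not spell out: the assumed viewing position is the origin, and the display is a flat rectangle perpendicular to the Z axis, centred on it at a given distance. The function name `project_to_frame` and all parameters are hypothetical.

```python
from typing import Optional, Tuple


def project_to_frame(
    direction: Tuple[float, float, float],
    display_distance: float,
    display_size: Tuple[float, float],
    frame_size: Tuple[int, int],
) -> Optional[Tuple[float, float]]:
    """Project a sound source direction (x, y, z) onto frame coordinates (h, v).

    Assumed geometry: viewer at the origin, display plane at z = display_distance,
    display_size = (width, height) in the same length unit, frame_size = (H, V)
    in pixels. Returns None when the source lies behind the viewer.
    """
    x, y, z = direction
    if z <= 0:
        return None
    t = display_distance / z            # scale the ray so it reaches the display plane
    width, height = display_size
    H, V = frame_size
    h = (x * t / width + 0.5) * H       # horizontal pixel, origin at the left edge
    v = (0.5 - y * t / height) * V      # vertical pixel, origin at the top edge
    # Clamp sources outside the display to the nearest frame edge.
    return (min(max(h, 0.0), float(H)), min(max(v, 0.0), float(V)))
```

For instance, a source straight ahead of the viewer lands at the frame centre. Clamping off-screen sources to the frame edge is a design choice of this sketch, not something the source prescribes.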

The audio position generation unit 3 then selects, for each audio object in video frame Fn, the video object whose position coordinates are nearest to those of the audio object, and links the audio object (the sound source) to that video object. For convenience of explanation, FIG. 4 shows video object #1 linked to audio object #1 and video object #2 linked to audio object #2. Because the unit always selects the video object nearest to each audio object, the linking works whether the video object numbers were provisionally assigned or predefined. The number of linked objects is not limited to one; there may be several.
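The nearest-position linking can be sketched as a plain nearest-neighbour search. Euclidean distance in frame coordinates is an assumption here, as are the function name `link_audio_to_video` and the dictionary inputs; the specification only says that the nearest video object is selected.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def link_audio_to_video(
    audio_pos: Dict[int, Point],
    video_pos: Dict[int, Point],
) -> Dict[int, int]:
    """Map each audio object number to the number of the nearest video object.

    Distance is Euclidean in frame coordinates (an assumption). Several audio
    objects may map to the same video object.
    """
    if not video_pos:
        return {}
    links: Dict[int, int] = {}
    for a_id, a_xy in audio_pos.items():
        links[a_id] = min(video_pos, key=lambda v_id: math.dist(a_xy, video_pos[v_id]))
    return links
```

This sketch links each audio object to exactly one video object; linking to several objects, which the text allows, would need a distance threshold or similar rule that the source does not define.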

Because the audio position generation unit 3 detects the loudness of each audio object (for example, average or maximum sound power) from the input audio, it can associate the position coordinates of the video object linked to the audio object in the frame being processed with that loudness information. Conversely, when the video objects in the video metadata and the audio objects in the audio metadata have already been linked at the production stage, as in the object-based audio system described above, the audio position generation unit 3 can skip the linking process and directly associate the position coordinates of the pre-linked video object with the loudness information.
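As a small illustration of the loudness measure, the sketch below computes average and maximum power over the audio samples that fall within one video frame. Treating power as the squared sample value, and the function name `loudness`, are assumptions; the specification names "average or maximum sound power" without defining the computation.

```python
from typing import Sequence, Tuple


def loudness(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (average_power, maximum_power) for one frame's worth of samples.

    Power is taken as the squared sample amplitude (an assumption).
    An empty input yields (0.0, 0.0).
    """
    if not samples:
        return (0.0, 0.0)
    powers = [s * s for s in samples]
    return (sum(powers) / len(powers), max(powers))


# Example: four samples belonging to one video frame.
avg_power, max_power = loudness([0.5, -0.5, 1.0, 0.0])
```

Either value could then be stored alongside the position coordinates of the linked video object.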

Next, in the video encoding device 1 of this embodiment, the induced gaze area generation unit 4 determines the induced gaze area for each video frame being processed (step S4). Its inputs are the position coordinates of the video objects linked to audio objects and the corresponding loudness information, both obtained from the audio position generation unit 3, and the shape coordinates of the video objects obtained from the video object position/shape extraction unit 2. Based on a predetermined criterion, it selects, from among the video objects linked as sound sources, the video object to which the gaze is to be guided, and sets the induced gaze area so as to enclose that object's shape.

For example, the induced gaze area generation unit 4 ranks the video objects by how likely each is to attract the viewer's gaze, using predetermined criteria such as the content of the audio (for example, someone calling out) and the position and size of the video object. It then sets induced gaze areas for the video objects within a predetermined rank from the top (for example, the top two, as in FIG. 4), each area enclosing the shape of its video object. In FIG. 4 the induced gaze area is a predetermined rectangular area containing the entire shape of the video object, but it may instead be the area of the object's shape itself, an area extending a few pixels around the outline of the shape, or a predetermined circular area containing the entire shape. In this way the induced gaze area generation unit 4 can decide which video objects the viewer's gaze should be guided to. Limiting the number of induced gaze areas to a predetermined number (one or more) is preferable for keeping the code amount down, and when multiple induced gaze areas overlap, redefining them as a single induced gaze area is preferable in terms of processing load and quality.
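The selection and merging of induced gaze areas might look like the following sketch. It ranks candidates by loudness alone, which is a stand-in for the "predetermined criteria" (audio content, object position and size) that the specification leaves open. The rectangular areas, the function names, and the default limit of two areas are likewise assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def bounding_box(shape: Sequence[Tuple[int, int]]) -> Box:
    """Smallest axis-aligned rectangle enclosing all shape coordinates."""
    hs = [p[0] for p in shape]
    vs = [p[1] for p in shape]
    return (min(hs), min(vs), max(hs), max(vs))


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_overlapping(boxes: Sequence[Box]) -> List[Box]:
    """Replace every group of overlapping boxes with their common bounding box."""
    merged: List[Box] = []
    for box in boxes:
        changed = True
        while changed:  # a grown box may start overlapping further boxes
            changed = False
            for other in merged:
                if overlaps(box, other):
                    merged.remove(other)
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
        merged.append(box)
    return merged


def induced_gaze_areas(
    candidates: Sequence[Tuple[float, Sequence[Tuple[int, int]]]],
    max_areas: int = 2,
) -> List[Box]:
    """candidates: (loudness, shape_coordinates) per linked video object.

    Keeps the max_areas loudest objects (loudness ranking is an assumption)
    and merges any of their rectangles that overlap.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)[:max_areas]
    return merge_overlapping([bounding_box(shape) for _, shape in ranked])
```

Merging overlapping rectangles means the final number of areas can be smaller than `max_areas`, consistent with the preference stated above for treating overlapping areas as one.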

Next, in the video encoding device 1 of this embodiment, the encoding control information generation unit 5 uses the induced gaze area information obtained from the induced gaze area generation unit 4 to control the code amount used when the video encoding unit 6 encodes the video frame being processed, so that the encoded image quality of the induced gaze area is higher than that of other areas. Under this code amount control, the video encoding unit 6 encodes the input video in units of one video frame or one GOP, generates an encoded stream containing encoding parameters (including the encoding control information) that indicate at least the code amount control information, and transmits it to a video decoding device (not shown) (step S5).

In this way, the encoding control information generation unit 5 controls the code amount so that the encoded image quality of the induced gaze area is higher than that of other areas. One specific method is to allocate a larger code amount to the induced gaze area.

Under the code amount control of the encoding control information generation unit 5, the video encoding unit 6 encodes the input video and outputs the encoded stream together with the encoding parameters. Any encoding scheme may be used, provided that the code amount (the value of the quantization parameter) can be varied according to position within the picture. The quantization parameter values can be transmitted as encoding parameters. The areas in which the value is varied are in units such as the blocks used by the encoding process.
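Block-wise control of the quantization parameter could be sketched as a per-block QP map. The sketch assumes a codec in which a lower QP means finer quantization and therefore more bits, as in H.264 or HEVC. The function name `qp_map`, the block size, and the `base_qp` and `delta` values are hypothetical; how such a map would be passed to a real encoder depends on that encoder and is not shown.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in integer pixels


def qp_map(
    frame_size: Tuple[int, int],
    block: int,
    areas: Sequence[Box],
    base_qp: int = 32,
    delta: int = 6,
) -> List[List[int]]:
    """Return a rows x cols grid of QP values, one per block.

    Blocks touched by any induced gaze area get base_qp - delta (finer
    quantization, higher quality); all other blocks keep base_qp.
    """
    H, V = frame_size
    cols = -(-H // block)  # ceiling division
    rows = -(-V // block)
    grid = [[base_qp] * cols for _ in range(rows)]
    for left, top, right, bottom in areas:
        for r in range(max(top // block, 0), min(bottom // block, rows - 1) + 1):
            for c in range(max(left // block, 0), min(right // block, cols - 1) + 1):
                grid[r][c] = max(base_qp - delta, 0)
    return grid
```

Rounding each area outward to whole blocks keeps the entire shape inside the higher-quality region, at the cost of spending some extra bits just outside it.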

Although not illustrated, the video decoding device that receives and decodes the encoded stream transmitted from the video encoding device 1 of this embodiment need only decode the stream based on the encoding parameters obtained from the video encoding device 1; a device equivalent to an existing video decoder can be used.

Thus, because the video encoding device 1 of this embodiment determines the induced gaze area for video encoding from audio information (audio metadata and audio input) and encodes accordingly, it can reduce the amount of encoded data while maintaining subjective quality, and can deliver an effective viewing experience on the receiving side that reflects the production intent more strongly.

Therefore, the video encoding device 1 according to the present invention not only compresses the amount of encoded data without lowering subjective quality, but also enables the production of content that reflects the production intent more strongly and leads to an effective viewing experience.

The video encoding device 1 according to the present invention can be implemented by a computer, and a program that causes the computer to function as each processing unit of the video encoding device 1 can suitably be used. Specifically, the control unit that controls each processing unit can be implemented by the central processing unit (CPU) of the computer, and the storage unit that stores the programs needed to operate each processing unit can be implemented by at least one memory. That is, the functions of each processing unit of the video encoding device 1 are realized by having the CPU of such a computer execute the program. The program can be stored in a predetermined area of the storage unit (memory). The storage unit may be RAM or ROM inside the device, or an external storage device (for example, a hard disk). The program may form part of software (stored in ROM or an external storage device) that runs on the OS used by the computer, and it can be recorded on a computer-readable recording medium. Each processing unit of the video encoding device 1 can also be implemented partly in hardware and partly in software, with the parts combined.

The present invention has been described above with reference to specific embodiments, but it is not limited to them and can be modified in various ways without departing from its technical concept. For example, the embodiment above mainly described detecting the sound level of each audio object from the audio input; however, if the audio metadata describes the sound level of each audio object, the audio input can be omitted and the video object treated as the sound source can still be detected.

The video encoding device 1 according to the present invention can also be connected online, with bidirectional communication, to a receiving system (including a video decoding device, a display device, and peripheral devices) that receives the encoded stream it generates. In this case, the video encoding device 1 further includes means for acquiring, over the uplink from the receiving system, information indicating either or both of the actual viewing position and the viewing direction from the actual viewing position. The induced gaze area generation unit 4 can then determine the induced gaze area after correcting either or both of the assumed viewing position and the direction from the assumed viewing position obtained from the audio metadata.

For example, in a configuration in which the video encoding device 1 according to the present invention and the receiving system (including a video decoding device, a display device, and peripheral devices) are connected online with bidirectional communication, VR (virtual reality) goggles may be used as a peripheral device of the receiving system. When a gaze direction is available from an inertial measurement unit (IMU), such as the accelerometer and gyroscope in the VR goggles, the induced gaze area generation unit 4 of the video encoding device 1 can correct the direction from the assumed viewing position obtained from the audio metadata and determine the induced gaze area accordingly.

Likewise, in a configuration in which the video encoding device 1 according to the present invention and the receiving system (including a video decoding device, a display device, and peripheral devices) are connected online with bidirectional communication, the receiving system may include, as a peripheral device, a reception environment sensor that measures the relative positions of the display device and the actual viewing position. In this case, the reception environment sensor measures the actual viewing position and information indicating that position is sent to the video encoding device 1 over the uplink, so that the induced gaze area generation unit 4 can correct the assumed viewing position obtained from the audio metadata and determine the induced gaze area.

The present invention not only compresses the amount of encoded data without lowering subjective quality, but also enables the production of content that reflects the production intent more strongly and leads to an effective viewing experience. It is therefore useful for encoding and transmitting video characterized by ultra-high definition, large screens, and wide viewing angles.

Reference Signs List

1 Video encoding device
2 Video object position/shape extraction unit
3 Audio position generation unit
4 Induced gaze area generation unit
5 Encoding control information generation unit
6 Video encoding unit

Claims

1. A video encoding device that encodes video so as to guide the gaze to an area of the video according to the video and its audio, the device comprising:
a video object position/shape extraction unit that receives the video before encoding and, for each video frame to be processed that constitutes the video, extracts the position coordinates and shape coordinates of a video object in the frame, either from accompanying video metadata that includes information indicating the position coordinates of the video object in the frame, or by video analysis;
an audio position generation unit that, for each video frame to be processed, generates position coordinates in the frame for an audio object treated as a sound source based on audio metadata accompanying the video, links the audio object to the nearest video object based on the position coordinates of the video object, detects the loudness of the audio object either by receiving the audio of the audio object for the frame or by referring to the audio metadata, and associates the position coordinates of the video object linked to the audio object in the frame with information on the loudness;
an induced gaze area generation unit that, for each video frame to be processed, based on the position coordinates of the video object linked to the audio object, the information on the loudness, and the shape coordinates of the video object, selects according to a predetermined criterion a video object to which the gaze is to be guided from among the video objects linked as sound sources, and determines an induced gaze area so as to enclose the shape of that video object;
an encoding control information generation unit that, based on information on the induced gaze area, controls the code amount used when encoding the video frame to be processed so that the encoded image quality of the induced gaze area is higher than that of other areas; and
a video encoding unit that encodes the video under the control of the code amount and generates an encoded stream including encoding parameters that indicate at least the control information of the code amount.

2. The video encoding device according to claim 1, wherein a viewing environment on the receiving side of the encoded stream is assumed and defined in advance, and the audio metadata includes, for each audio object in the video frame to be processed, predetermined sound source position information in a virtual space defined in advance by the assumed viewing position on the receiving side and the assumed position of the display device.

3. The video encoding device according to claim 1, wherein the video objects in the video metadata and the audio objects in the audio metadata are linked in advance, before the encoding process.

4. The video encoding device according to any one of claims 1 to 3, wherein the audio metadata includes information indicating the loudness of an audio object.

5. The video encoding device according to claim 1, which is connected online, with bidirectional communication, to a receiving system that receives the encoded stream, further comprising means for acquiring from the receiving system information indicating either or both of an actual viewing position and a viewing direction from the actual viewing position, wherein the induced gaze area generation unit determines the induced gaze area by correcting either or both of the assumed viewing position and the direction from the assumed viewing position obtained from the audio metadata.

6. A program for causing a computer to function as the video encoding device according to any one of claims 1 to 5.