
WO2018150774A1 - Audio signal processing device and audio signal processing system - Google Patents

Audio signal processing device and audio signal processing system

Info

Publication number
WO2018150774A1
WO2018150774A1 (PCT/JP2018/000736)
Authority
WO
WIPO (PCT)
Prior art keywords
rendering
audio
audio signal
unit
signal processing
Prior art date
Application number
PCT/JP2018/000736
Other languages
English (en)
Japanese (ja)
Inventor
健明 末永
永雄 服部
Original Assignee
シャープ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by シャープ株式会社 filed Critical シャープ株式会社
Publication of WO2018150774A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • The present disclosure relates to an audio signal processing device and an audio signal processing system.
  • As described in Non-Patent Document 1, techniques for reproducing multi-channel sound image localization using a small number of speakers have been studied.
  • Japanese Patent Publication JP 2013-055439 A (published March 21, 2013); Japanese Patent Publication JP H11-113098 A (published April 23, 1999).
  • Vector Base Amplitude Panning (VBAP) and the sound pressure panning shown in Non-Patent Document 1 control the sound pressure of, for example, a group of three speakers 1302, 1303, and 1304 as shown in (a) of FIG. 13, or a pair of speakers 1306 and 1307 as shown in (b) of FIG. 13, based on the positional relationship between the speakers and the sound image 1301 or 1305 to be reproduced, thereby reproducing a sound image at any position within the range surrounded by the speakers.
  • Since the technique can reproduce a sound image anywhere within this range even when there are a plurality of sound images, a multi-channel audio signal (for example, 22.2 ch or 5.1 ch) can be reproduced with a reduced number of speakers.
  • However, VBAP and sound pressure panning can reproduce a sound image only within the range surrounded by a set of speakers. Therefore, if a speaker cannot be installed in a required area of the user's viewing environment, for example at a position close to the ceiling, a sound image in the height direction cannot be reproduced.
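  • As a concrete illustration of the panning idea (a minimal sketch, not part of the disclosure; the speaker directions and sound image direction below are hypothetical values), VBAP gains for a three-speaker base can be computed as follows. A negative gain indicates that the sound image lies outside the range the triplet can reproduce, which is exactly the rendering processable range test used later in this description.

```python
import numpy as np

def unit_vector(azimuth_deg, elevation_deg):
    """Convert an (azimuth, elevation) direction in degrees to a unit vector."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.sin(az),   # x: right
                     np.cos(el) * np.cos(az),   # y: front
                     np.sin(el)])               # z: up

def vbap_gains(source_dir, speaker_dirs):
    """Solve p = L g for the gains of a 3-speaker base, then normalize power.

    Returns None when the source lies outside the triangle spanned by the
    speakers (some gain negative), i.e. outside the rendering processable range.
    """
    L = np.column_stack([unit_vector(*d) for d in speaker_dirs])  # 3x3 base
    g = np.linalg.solve(L, unit_vector(*source_dir))
    if np.any(g < -1e-9):
        return None                       # not renderable by this triplet
    return g / np.linalg.norm(g)          # constant-power normalization

# Hypothetical layout: three speakers and two sound image directions.
speakers = [(-30.0, 0.0), (30.0, 0.0), (0.0, 40.0)]
print(vbap_gains((10.0, 15.0), speakers))   # inside the triplet -> gains
print(vbap_gains((0.0, 80.0), speakers))    # above the triplet -> None
```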
  • On the other hand, if the transaural technique shown in Non-Patent Document 2 or Patent Document 2 is used, three-dimensional sound image control can be performed using as few as two speakers. Therefore, for example, sound image localization at an arbitrary position around the user can be reproduced using only two speakers installed in front of the user.
  • However, since this technique in principle assumes a specific listening area and obtains its effect only within that area, if the listener moves out of the listening area, the sound image may be localized at an unexpected position, or no localization may be perceived at all.
  • An object of one embodiment of the present disclosure is to realize an audio signal processing device capable of presenting to the user audio rendered by a rendering method suited to the user's viewing conditions, and an audio signal processing system including the device.
  • In order to achieve the above object, an audio signal processing device according to one embodiment of the present disclosure renders the audio signals of one or more audio tracks and outputs them to a plurality of audio output devices. The device includes: a reproduction position specifying unit for specifying the reproduction position of the audio signal of each audio track; a position information acquisition unit for acquiring position information of each audio output device; and a processing unit that selects one rendering method from a plurality of rendering methods based on the reproduction position and the position information, and renders the audio signal of the audio track corresponding to the reproduction position using the selected rendering method.
  • An audio track may include a plurality of audio channels; however, in the present disclosure, it is assumed for ease of understanding that each audio track includes one audio channel.
  • An audio signal processing system according to one embodiment of the present disclosure includes the audio signal processing device having the above-described configuration and the plurality of audio output devices.
  • FIG. 1 is a block diagram illustrating the main configuration of an audio signal processing system according to Embodiment 1 of the present disclosure. FIG. 2 is a diagram showing an example of the track information used in that system. FIG. 3 is a diagram showing the coordinate system used in the description of the present disclosure.
  • FIG. 4 is a block diagram illustrating the main configuration of a rendering switching signal generation unit according to Embodiment 1 of the present disclosure.
  • FIG. 5 is a diagram illustrating the processing flow of the rendering switching signal generation unit according to Embodiment 1. FIG. 6 is a diagram showing the relationship between speaker arrangement positions and sound image positions.
  • FIG. 8 is a diagram illustrating the processing flow of a rendering unit according to Embodiment 1. FIG. 9 is a block diagram showing the main configuration of an audio signal processing system according to Embodiment 2 of the present disclosure.
  • Embodiment 1: Hereinafter, an embodiment of the present disclosure will be described with reference to FIGS. 1 to 8.
  • FIG. 1 is a block diagram showing the main configuration of the audio signal processing system 1 according to the first embodiment.
  • The audio signal processing system 1 according to Embodiment 1 includes an audio signal processing unit 10 (audio signal processing device) and an audio output unit 20 (a plurality of audio output devices).
  • The audio signal processing unit 10 is an audio signal processing device that renders the audio signals of one or more audio tracks using two different rendering methods.
  • The rendered audio signal is output from the audio signal processing unit 10 to the audio output unit 20.
  • The audio signal processing unit 10 includes: a content analysis unit 101 (reproduction position specifying unit) that specifies the sound image position (reproduction position) of the audio signal of each audio track based on the input audio signal or on information accompanying it; a rendering switching signal generation unit 102 (position information acquisition unit, processing unit) that acquires position information of the audio output unit 20; and a rendering unit 103 (processing unit) that renders the audio signal of the audio track corresponding to the sound image position, using one rendering method selected from a plurality of rendering methods based on the sound image position (reproduction position) and the position information.
  • The audio signal processing unit 10 also includes a storage unit 104, as shown in FIG. 1.
  • The storage unit 104 stores various parameters required or generated by the rendering switching signal generation unit 102 and the rendering unit 103.
  • The content analysis unit 101 analyzes the audio tracks included in video or audio content recorded on a disc medium such as a DVD or BD, or on an HDD (Hard Disk Drive), together with any metadata (information) associated with them, and obtains sounding object position information. The sounding object position information is sent from the content analysis unit 101 to the rendering switching signal generation unit 102 and the rendering unit 103.
  • Here, the audio content received by the content analysis unit 101 is audio content including two or more audio tracks.
  • Each audio track may be a "channel-based" audio track as employed in stereo (2ch), 5.1ch, and the like, or an "object-based" audio track in which each sounding object is recorded as one track and accompanied by information (metadata) describing changes in its position and volume.
  • In an object-based audio track, each sounding object is recorded on its own track, that is, recorded without mixing, and the sounding objects are rendered appropriately on the player (playback device) side.
  • Each sounding object is associated with metadata indicating when, where, and at what volume the player should reproduce it.
  • A "channel-based" audio track, as employed in conventional surround sound (for example, 5.1ch surround), is a track recorded with the individual sounding objects already mixed, on the premise that it will be reproduced from a predetermined playback position (speaker placement position).
  • FIG. 2 conceptually shows the configuration of the track information 201 obtained by analysis by the content analysis unit 101.
  • The content analysis unit 101 analyzes all the audio tracks included in the content and constructs the track information 201 shown in FIG. 2. The track information 201 records the ID of each audio track and the type of the audio track.
  • If the audio track is an object-based track, one or more pieces of sounding object position information are attached as metadata.
  • Each piece of sounding object position information consists of a pair of a reproduction time and the sound image position (reproduction position) at that time.
  • If the audio track is a channel-based track, a pair of a playback time and the sound image position (playback position) at that time is likewise recorded; the playback time spans from the start to the end of the content, and the sound image position is the playback position defined in advance for that channel.
  • The sound image position (playback position) recorded as part of the sounding object position information is expressed in the coordinate system shown in FIG. 3. It is further assumed that the track information 201 is described in a markup language such as XML (Extensible Markup Language).
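  • The disclosure does not define a concrete schema for the track information 201, but as a rough illustration, XML of the following shape could carry it; the element and attribute names here are assumptions made for the example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for the track information 201; the tag and
# attribute names below are illustrative only, not defined by the disclosure.
TRACK_INFO_XML = """
<trackInformation>
  <track id="0" type="object">
    <position time="0.0" r="1.0" azimuth="30" elevation="0"/>
    <position time="2.5" r="1.0" azimuth="45" elevation="20"/>
  </track>
  <track id="1" type="channel">
    <position time="0.0" r="1.0" azimuth="-30" elevation="0"/>
  </track>
</trackInformation>
"""

def parse_track_info(xml_text):
    """Return {track_id: (type, [(time, (r, azimuth, elevation)), ...])}."""
    tracks = {}
    for track in ET.fromstring(xml_text).iter("track"):
        positions = [(float(p.get("time")),
                      (float(p.get("r")), float(p.get("azimuth")),
                       float(p.get("elevation"))))
                     for p in track.iter("position")]
        tracks[int(track.get("id"))] = (track.get("type"), positions)
    return tracks

print(parse_track_info(TRACK_INFO_XML))
```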
  • The rendering switching signal generation unit 102 generates a rendering method switching instruction signal based on information related to the viewing environment and the track information 201 (FIG. 2) obtained by the content analysis unit 101. Details of the rendering switching signal generation unit 102 are described with reference to FIG. 4.
  • FIG. 4 is a block diagram illustrating a configuration of the rendering switching signal generation unit 102.
  • The rendering switching signal generation unit 102 includes an environment information acquisition unit 10201 (position information acquisition unit) and a rendering switching instruction signal calculation unit 10202 (processing unit).
  • The environment information acquisition unit 10201 acquires information on the environment in which the user views the content (hereinafter referred to as environment information).
  • In Embodiment 1, the environment information is assumed to consist of the number of speakers connected to the audio signal processing unit 10 as the audio output unit 20, the positions of the speakers, and the types of the speakers.
  • The speaker type is information indicating which of the plurality of rendering methods used in this system each speaker can serve. When the audio signal processing unit 10 uses two rendering methods, as described in Embodiment 1, the speaker type indicates whether each speaker, at its arranged position, can be used for either or both of the methods.
  • The environment information is recorded in the storage unit 104 in advance, and the environment information acquisition unit 10201 reads it from the storage unit 104 as necessary.
  • The environment information recorded in the storage unit 104 may be recorded as metadata described in an arbitrary format, for example a format such as XML. In that case, the environment information acquisition unit 10201 decodes it as appropriate to extract the information.
  • The sound image position and the speaker position are expressed in the coordinate system shown in FIG. 3.
  • The coordinate system used here is centered on the origin O. As shown in the top view of (a) of FIG. 3, the distance from the origin O is the radius r, and the azimuth angle θ is 0° at the front of the origin O, 90° at the right, and −90° at the left. As shown in the side view of (b) of FIG. 3, the elevation angle φ is 0° at the front of the origin O and 90° directly above it.
  • The sound image position and the speaker position are each expressed as (r, θ, φ).
  • Hereinafter, the coordinate system of FIG. 3 is used for the sound image position and the speaker position.
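  • For reference, a position (r, θ, φ) in this coordinate system can be mapped to Cartesian coordinates as in the following sketch (the axis orientation chosen here, y to the front, x to the right, z up, is an assumption for the example).

```python
import numpy as np

def to_cartesian(r, azimuth_deg, elevation_deg):
    """Map (r, theta, phi) of FIG. 3 to (x, y, z): x right, y front, z up."""
    theta, phi = np.radians(azimuth_deg), np.radians(elevation_deg)
    x = r * np.cos(phi) * np.sin(theta)
    y = r * np.cos(phi) * np.cos(theta)
    z = r * np.sin(phi)
    return x, y, z

print(to_cartesian(1.0, 90.0, 0.0))   # directly to the right of the origin
print(to_cartesian(1.0, 0.0, 90.0))   # directly above the origin
```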
  • In Embodiment 1, the environment information is acquired in advance and recorded in the storage unit 104; however, the present disclosure is not limited to this.
  • For example, the information may be input in real time through an information input terminal (not shown in Embodiment 1) such as a tablet terminal.
  • Alternatively, the information may be obtained by image processing of an image taken by a camera installed at an arbitrary position in the viewing environment (for example, a marker is attached to the audio output unit 20 and recognized by a camera installed on the ceiling of the room).
  • Alternatively, a device that transmits position information may be attached to the audio output unit 20 itself to acquire the various information.
  • Based on the environment information obtained from the environment information acquisition unit 10201 and the sounding object position information of the track information 201 (FIG. 2) obtained by the content analysis unit 101, the rendering switching instruction signal calculation unit 10202 determines, for each audio track, which of the plurality of rendering methods should render its audio signal, and outputs this information to the rendering unit 103.
  • In Embodiment 1, to make the description easier to follow, the rendering unit 103 is assumed to drive two rendering methods (rendering algorithms) simultaneously, namely rendering method A and rendering method B.
  • FIG. 5 is a flowchart for explaining the operation of the rendering switching instruction signal calculation unit 10202.
  • When the rendering switching instruction signal calculation unit 10202 receives the above-described environment information and track information 201 (FIG. 2), it starts the rendering method selection process (step S101).
  • In step S102, it is confirmed whether the rendering method selection process has been performed for all audio tracks. If the process from step S103 onward has been completed for all audio tracks (YES in step S102), the rendering method selection process ends (step S106). On the other hand, if there is an audio track for which the rendering method selection process has not been performed (NO in step S102), the process proceeds to step S103.
  • In step S103, the sounding object position information corresponding to the unprocessed audio track is looked up in the acquired track information 201 (FIG. 2), and it is determined whether the sound image position recorded as part of that information is included in the rendering processable range of rendering method A.
  • The rendering processable range is the range in which a sound source can be placed by a given rendering method, and is determined, where necessary, from the information (position information) indicating the speaker positions obtained as part of the environment information.
  • Note that determining the rendering processable range does not necessarily require reference to the environment information (that is, information acquired about the current environment by some means). For example, when the speaker positions are predetermined by the system and the user places the speakers at those positions as instructed, the information need not be acquired. A rendering processable range can also be defined independently of the speaker positions (as described later, if the rendering process is a downmix to a monaural signal, the entire area can be defined as the processable range).
  • A more specific example will be described with reference to FIG. 6. Assume that a user (listener) 601 is at the origin O and that speakers (audio output devices) 602, 603, 604, and 605 are arranged around the user at the same height as the viewer's head.
  • (a) of FIG. 6 shows this layout viewed from above, and (b) of FIG. 6 shows it viewed from the side.
  • Reference numerals 606, 607, 608, and 609 denote the positions (sound image positions) at which the sound images based on the audio signals of the respective audio tracks should be localized.
  • The sound image positions 606, 607, and 608 are at the same height as the viewer's head, while the sound image position 609 is higher than the viewer's head.
  • Assume that rendering method A is VBAP (the first rendering method) and rendering method B is transaural (the second rendering method), that the speakers usable for VBAP are 602, 603, 604, and 605, and that the speakers usable for transaural are 602 and 603.
  • In this case, the rendering processable range of rendering method A (VBAP) consists of the ranges sandwiched between adjacent speakers: between 602 and 603, between 603 and 605, between 604 and 605, and between 602 and 604.
  • Audio signals to be localized at the sound image positions 606, 607, and 608, which fall within this range, can be processed by rendering method A (VBAP).
  • On the other hand, the sound image position 609 shown in FIG. 6 is higher than the speakers and is not included in the rendering processable range of rendering method A (VBAP) (NO in step S103 of FIG. 5).
  • The audio signal for the sound image position 609 is therefore rendered by rendering method B (transaural), which can localize a sound image regardless of the speaker positions.
  • If the sound image position of the unprocessed audio track is included in the rendering processable range of rendering method A (YES in step S103), the process proceeds to step S104; if it is not (NO in step S103), the process proceeds to step S105.
  • In step S104, an instruction signal (rendering switching signal) for rendering the audio signal of the unprocessed audio track by rendering method A is output to the rendering unit 103.
  • In step S105, an instruction signal (rendering switching signal) for rendering the audio signal of the unprocessed audio track by rendering method B is output to the rendering unit 103.
  • In the above description, the sound image positions of all the audio tracks were assumed to fall within the rendering processable range of either rendering method A or rendering method B. If this is not guaranteed, that is, if a sound image position may fall within the rendering processable range of neither method, the rendering method selection process may instead follow the flow shown in FIG. 7.
  • FIG. 7 shows a modification of the flow shown in FIG. 5.
  • When the rendering switching instruction signal calculation unit 10202 receives the environment information and the track information 201 (FIG. 2), the rendering method selection process starts (step S111).
  • In step S112, it is confirmed whether the rendering method selection process has been performed for all audio tracks. If the process from step S113 onward has been completed for all audio tracks (YES in step S112), the rendering method selection process ends (step S118). On the other hand, if there is an unprocessed audio track (NO in step S112), the sounding object position information corresponding to that track is looked up in the acquired track information 201 (FIG. 2), and, as in step S103 described above, it is determined whether the sound image position recorded as part of that information is included in the rendering processable range of rendering method A (step S113).
  • If the sound image position is within the rendering processable range of rendering method A (YES in step S113), the process proceeds to step S114, where an instruction signal for rendering the audio signal of the unprocessed audio track by rendering method A is output to the rendering unit 103.
  • If the sound image position is not included in the rendering processable range of rendering method A (NO in step S113), the process proceeds to step S115.
  • In step S115, it is determined whether the sound image position is included in the rendering processable range of rendering method B.
  • If the sound image position is within the rendering processable range of rendering method B (YES in step S115), the process proceeds to step S116; otherwise (NO in step S115), the process proceeds to step S117. That is, the process reaches step S117 when the sound image position is included in the rendering processable range of neither rendering method A nor rendering method B.
  • In step S116, an instruction signal for rendering the audio signal of the unprocessed audio track by rendering method B is output to the rendering unit 103.
  • In step S117, an instruction signal indicating that the audio signal of the unprocessed audio track is not to be rendered is output to the rendering unit 103.
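  • The following is a minimal sketch of this selection flow (FIG. 7); the range predicates in_range_a and in_range_b stand for the rendering processable range tests of methods A and B and are illustrative assumptions.

```python
# Minimal sketch of the rendering method selection flow of FIG. 7.
# `in_range_a` / `in_range_b` stand for the rendering processable range
# tests of methods A and B; their definitions below are assumptions.

def select_rendering_method(sound_image_pos, in_range_a, in_range_b):
    """Return 'A', 'B', or None (do not render) for one audio track."""
    if in_range_a(sound_image_pos):      # step S113
        return "A"                       # step S114
    if in_range_b(sound_image_pos):      # step S115
        return "B"                       # step S116
    return None                          # step S117: not rendered

# Hypothetical ranges: method A (e.g. VBAP) only handles ear-height images,
# method B (e.g. transaural) handles any direction.
in_range_a = lambda pos: abs(pos[2]) < 1e-6   # pos = (r, azimuth, elevation)
in_range_b = lambda pos: True

for pos in [(1.0, 30.0, 0.0), (1.0, 0.0, 45.0)]:
    print(pos, "->", select_rendering_method(pos, in_range_a, in_range_b))
```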
  • In Embodiment 1, two selectable rendering methods are described, but it goes without saying that three or more rendering methods may be used.
  • In the above description, the rendering switching instruction signal calculation unit 10202 is described as instructing a switch of the rendering method. The expression "instructing switching" here covers both instructing a change of the rendering method from A to B or from B to A, and instructing that rendering method A continue to be used for the track following a track that used rendering method A (and likewise for method B).
  • The rendering unit 103 constructs the audio signal to be output from the audio output unit 20 based on the input audio signal and the instruction signal output from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102.
  • The rendering unit 103 drives two rendering algorithms simultaneously, switches the rendering algorithm to be used based on the instruction signal output from the rendering switching instruction signal calculation unit 10202, and renders the audio signal.
  • Here, rendering means the processing that converts an audio signal (input audio signal) included in the content into a signal to be output from the audio output unit 20.
  • FIG. 8 is a flowchart showing the operation of the rendering unit 103.
  • When the rendering unit 103 receives the input audio signal and the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102, it starts the rendering process (step S201).
  • In step S202, it is confirmed whether the rendering process has been performed for all audio tracks.
  • If the process from step S203 onward has been completed for all audio tracks (YES in step S202), the rendering process ends (step S208).
  • If there is an unprocessed audio track (NO in step S202), it is rendered using the rendering method indicated by the instruction signal from the rendering switching instruction signal calculation unit 10202 of the rendering switching signal generation unit 102.
  • If the instruction signal indicates rendering method A (rendering method A in step S203), the parameters necessary for rendering the audio signal by rendering method A are read from the storage unit 104 (step S204), and rendering is performed based on them (step S205).
  • If the instruction signal indicates rendering method B (rendering method B in step S203), the parameters necessary for rendering the audio signal by rendering method B are read from the storage unit 104 (step S206), and rendering is performed based on them (step S207). If the instruction signal indicates no rendering, following the flow of FIG. 7 (no rendering in step S203), the corresponding track is not rendered and is not included in the output audio.
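  • The per-track dispatch of FIG. 8 can be sketched as follows; the renderer callables, parameter keys, and the storage dictionary standing in for the storage unit 104 are assumptions made for the example.

```python
# Sketch of the per-track dispatch in the rendering unit 103 (FIG. 8).
# `storage` plays the role of the storage unit 104; the renderer callables
# and parameter keys are illustrative assumptions.

def render_all(tracks, instructions, storage, render_a, render_b):
    """Render each track with the method named by its instruction signal."""
    outputs = []
    for track_id, signal in tracks.items():
        method = instructions.get(track_id)            # step S203
        if method == "A":
            params = storage["params_a"]               # step S204
            outputs.append(render_a(signal, params))   # step S205
        elif method == "B":
            params = storage["params_b"]               # step S206
            outputs.append(render_b(signal, params))   # step S207
        # method None: track is not rendered and not included in the output
    return outputs

# Toy usage with pass-through "renderers".
tracks = {0: [0.1, 0.2], 1: [0.3], 2: [0.5]}
instructions = {0: "A", 1: "B", 2: None}
storage = {"params_a": {}, "params_b": {}}
print(render_all(tracks, instructions, storage,
                 lambda s, p: ("A", s), lambda s, p: ("B", s)))
```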
  • The storage unit 104 is a secondary storage device for recording various data used by the rendering switching signal generation unit 102 and the rendering unit 103.
  • The storage unit 104 is composed of, for example, a magnetic disk, an optical disc, or flash memory; more specific examples include an HDD, an SSD (Solid State Drive), an SD memory card, a BD, and a DVD.
  • The rendering switching signal generation unit 102 and the rendering unit 103 read data from the storage unit 104 as necessary.
  • Various parameter data, including coefficients calculated by the rendering switching signal generation unit 102, can also be recorded in the storage unit 104.
  • The audio output unit 20 outputs the audio obtained by the rendering unit 103.
  • The audio output unit 20 includes a plurality of independent speakers, and each speaker includes a speaker unit and an amplifier that drives it.
  • As described above, the environment information acquisition unit 10201 acquires the position information of each speaker constituting the audio output unit 20, and the rendering switching instruction signal calculation unit 10202 selects a rendering method based on the plurality of pieces of position information acquired by the environment information acquisition unit 10201.
  • In this way, a rendering method suited to sound image localization is automatically selected according to the arrangement of the speakers placed by the user and the information obtained from the content, and audio reproduction is performed accordingly.
  • In Embodiment 1, content including a plurality of audio tracks is targeted for reproduction.
  • However, the present disclosure is not limited to this, and content including a single audio track may also be targeted for reproduction.
  • In that case, a rendering method suited to that one audio track is selected from the plurality of rendering methods.
  • (Rendering methods) In Embodiment 1, VBAP, transaural, and downmixing to a monaural signal have been described as rendering methods. However, the present disclosure is not limited to these rendering methods.
  • For example, a rendering method similar to VBAP, in which the audio signal is output from each audio output device at a sound pressure ratio corresponding to the sound image position (playback position), may be employed.
  • Likewise, a rendering method similar to transaural, in which an audio signal processed according to the sound image position (playback position) is output from each audio output device, may be employed.
  • When the sound image position is included in the range defined by the arrangement positions of the plurality of audio output devices, adopting a rendering method that outputs from each audio output device at a sound pressure ratio according to the sound image position realizes an audio environment that emphasizes sound quality.
  • On the other hand, with a rendering method such as transaural, in which the signal is processed according to the sound image position (playback position), the sound image can be localized without being restricted by the arrangement of the audio output devices.
  • Furthermore, downmixing to a stereo signal can also be adopted as one of the rendering methods.
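  • As a simple example of such a downmix (an illustrative constant-power stereo pan derived from the sound image azimuth, not a method defined by the disclosure):

```python
import numpy as np

def downmix_to_stereo(mono_signal, azimuth_deg):
    """Constant-power pan of a mono track to stereo from its azimuth.

    Illustrative only: azimuth -90 (left) .. +90 (right) is mapped onto a
    pan angle, and sin/cos gains keep the total power constant.
    """
    pan = (np.clip(azimuth_deg, -90.0, 90.0) + 90.0) / 180.0 * (np.pi / 2)
    left = np.cos(pan) * np.asarray(mono_signal)
    right = np.sin(pan) * np.asarray(mono_signal)
    return np.stack([left, right])

print(downmix_to_stereo([1.0, 0.5], azimuth_deg=-90.0))  # hard left
print(downmix_to_stereo([1.0, 0.5], azimuth_deg=0.0))    # centered
```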
  • FIG. 9 is a block diagram illustrating a main configuration of the audio signal processing system 1a according to the second embodiment of the present disclosure.
  • The audio signal processing system 1a according to Embodiment 2 differs from the audio signal processing system 1 described in Embodiment 1 only in the behavior of the rendering switching signal generation unit; the other processing units are the same, so their description is as given in Embodiment 1 unless stated otherwise below.
  • The audio signal processing unit 10a of the audio signal processing system 1a according to Embodiment 2 includes a rendering switching signal generation unit 102a (position information acquisition unit, processing unit) in place of the rendering switching signal generation unit 102 of the audio signal processing unit 10 described in Embodiment 1.
  • In addition to the track information and environment information (speaker position information) acquired by the rendering switching signal generation unit 102 of Embodiment 1, the rendering switching signal generation unit 102a further acquires viewing position information indicating the user's viewing position.
  • The rendering switching signal generation unit 102a selects one rendering method from the plurality of rendering methods based on the track information, the position information, and the viewing position information. Details are described below. In Embodiment 2 as well, for convenience of explanation, the selection is made between two rendering methods.
  • The rendering switching signal generation unit 102a generates a rendering method switching instruction signal based on the information related to the viewing environment, the track information 201 (FIG. 2) obtained by the content analysis unit 101, and the viewing position information. Details of the rendering switching signal generation unit 102a are described with reference to FIG. 10.
  • FIG. 10 is a block diagram showing a configuration of the rendering switching signal generation unit 102a.
  • The rendering switching signal generation unit 102a includes an environment information acquisition unit 10201a and a rendering switching instruction signal calculation unit 10202a.
  • The environment information acquisition unit 10201a acquires information on the environment in which the user views the content (hereinafter referred to as environment information).
  • In Embodiment 2, the environment information consists of the number, positions, and types of the speakers connected to the system as the audio output unit 20, as described in Embodiment 1, with information indicating the user's viewing position added.
  • In Embodiment 2, the viewing environment information is acquired and updated in real time: a camera (not shown) installed at an arbitrary position in the viewing environment and connected to the environment information acquisition unit 10201a photographs the user and the speakers (audio output unit 20), to which markers have been attached in advance, their three-dimensional positions are acquired, and the viewing environment information is updated.
  • Alternatively, the user's position may be acquired by applying face recognition to the images obtained from the installed camera.
  • Based on the environment information obtained from the environment information acquisition unit 10201a and the sounding object position information of the track information 201 (FIG. 2) obtained by the content analysis unit 101, the rendering switching instruction signal calculation unit 10202a determines, for each audio track, which of the plurality of rendering methods should render its audio signal, and outputs this information to the rendering unit 103.
  • Upon receiving the above-described environment information and track information 201 (FIG. 2), the rendering switching instruction signal calculation unit 10202a starts the rendering method selection process (step S301).
  • In step S302, it is confirmed whether the rendering method selection process has been performed for all audio tracks. If the process from step S303 onward has been completed for all audio tracks (YES in step S302), the rendering method selection process ends (step S310). On the other hand, if there is an audio track for which the rendering method selection process has not been performed (NO in step S302), the process proceeds to step S303.
  • In step S303, the sounding object position information corresponding to the unprocessed audio track is looked up in the acquired track information 201 (FIG. 2). If the sound image position recorded as part of that information is included in the rendering processable range of rendering method A (YES in step S303), and the user's current position, based on the viewing position information, is within the viewing effective range of rendering method A (YES in step S304), an instruction signal for rendering the audio signal of the audio track by rendering method A is output (step S305).
  • If the sound image position is not included in the rendering processable range of rendering method A (NO in step S303), or the user is outside the viewing effective range of rendering method A (NO in step S304), the process proceeds to step S306, where it is checked whether rendering by rendering method B is possible.
  • If the sound image position is included in the rendering processable range of rendering method B (YES in step S306), and the user's current position, based on the viewing position information, is within the viewing effective range of rendering method B (YES in step S307), an instruction signal for rendering the audio signal of the audio track by rendering method B is output (step S308).
  • If the sound image position is not included in the rendering processable range of rendering method B (NO in step S306), or the user's current position is outside the viewing effective range of rendering method B (NO in step S307), an instruction is issued not to render the audio signal of the audio track (step S310).
  • As described in Embodiment 1, the rendering processable range is the range in which sound sources can be placed by a given rendering method.
  • The viewing effective range is the recommended viewing area within which the effect of each rendering method can be enjoyed (for example, as shown in FIG. 12, the viewing effective range of rendering method A is represented as 1202 and that of rendering method B as 1203); the range recorded in advance in the storage unit 104 for each rendering method is read as appropriate.
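  • Extending the earlier selection sketch with the viewing effective range checks of steps S304 and S307 gives roughly the following; the circular effective regions and the range predicates are illustrative assumptions.

```python
import math

# Sketch of the Embodiment 2 selection: a method is chosen only when the
# sound image is in its processable range AND the user is in its viewing
# effective range. The circular effective regions are an assumption.

def in_circle(user_xy, center_xy, radius):
    return math.dist(user_xy, center_xy) <= radius

def select_method(sound_image_pos, user_xy, methods):
    """methods: list of (name, in_proc_range, effective_center, effective_radius)."""
    for name, in_proc_range, center, radius in methods:
        if in_proc_range(sound_image_pos) and in_circle(user_xy, center, radius):
            return name                 # steps S303-S305 / S306-S308
    return None                         # do not render

methods = [
    ("A", lambda pos: abs(pos[2]) < 1e-6, (0.0, 0.0), 1.5),  # e.g. VBAP
    ("B", lambda pos: True,               (0.0, 0.0), 0.5),  # e.g. transaural
]
print(select_method((1.0, 30.0, 0.0), (0.2, 0.0), methods))  # -> 'A'
print(select_method((1.0, 0.0, 45.0), (1.0, 0.0), methods))  # -> None
```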
  • As described above, in Embodiment 2 a rendering method suited to sound image localization is automatically selected according to the positions of the speakers placed by the user, the information obtained from the content, and the user's viewing position information, and audio reproduction is performed accordingly, so that sound with good localization can be delivered to the user.
  • In the embodiments described above, rendering method A is VBAP and rendering method B is transaural.
  • However, rendering method A may instead be transaural and rendering method B VBAP.
  • In that case, whether the track can be processed by transaural is determined first, in accordance with the operation flow shown in FIG. 5 or FIG. 7.
  • Transaural can localize a sound image without being limited to the range of the speaker arrangement, whereas with VBAP the sound image position depends on the speaker arrangement. Therefore, in the aspect of Embodiment 1, in which it is first determined whether an audio track can be processed by VBAP and another method is used when it cannot, the rendering method may change within the content, which may feel unnatural to the user.
  • In contrast, in this aspect it is first determined whether processing can be performed by transaural (rendering method A), which does not depend on the speaker arrangement.
  • As a result, rendering based on a method that can cover a wide range of sound image positions occupies most of the content, and the above-mentioned unnatural feeling is less likely to arise.
  • Compared with transaural, however, VBAP offers better sound quality because it localizes the sound image within the range of the speaker arrangement. The aspect of Embodiment 1, which first determines whether processing by VBAP is possible, can therefore be said to emphasize sound quality.
  • The audio signal processing device (audio signal processing unit 10, 10a) according to Aspect 1 of the present disclosure renders the audio signals of one or more audio tracks and outputs them to a plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)). It includes: a reproduction position specifying unit (content analysis unit 101) that specifies the playback position of the audio signal of each audio track based on the audio track or information associated with it; a position information acquisition unit (rendering switching signal generation unit 102, 102a) that acquires position information of each of the audio output devices; and a processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) that renders the audio signal of the audio track corresponding to the playback position using one rendering method selected from a plurality of rendering methods based on the playback position and the position information.
  • According to the above configuration, a suitable rendering method is selected from the plurality of rendering methods based on the position of each audio output device and the playback position (sound image position) of the audio signal of the audio track.
  • If the input audio signal includes a plurality of audio tracks, rendering is performed per audio track; if it includes one audio track, rendering is performed using a rendering method suited to that track.
  • In the audio signal processing device according to Aspect 2 of the present disclosure, in Aspect 1 above, the position information acquisition unit may further acquire viewing position information indicating the viewing position of the user, and the processing unit (rendering switching signal generation unit 102a, rendering unit 103) may select the one rendering method from the plurality of rendering methods based on the reproduction position, the position information, and the viewing position information, and render the audio signal of the audio track corresponding to the reproduction position using that rendering method.
  • According to the above configuration, the rendering method can be selected in consideration of the user's viewing position information, and sound image localization can be reproduced more suitably.
  • In the audio signal processing device (audio signal processing unit 10, 10a) according to Aspect 3 of the present disclosure, in Aspect 1 or 2 above, the reproduction position specifying unit (content analysis unit 101) may be configured to analyze the audio track or the information accompanying it and generate track information indicating the reproduction position.
  • According to the above configuration, the track information can be generated by the reproduction position specifying unit analyzing the audio track or its accompanying information.
  • The audio signal processing device (audio signal processing unit 10, 10a) according to Aspect 4 of the present disclosure, in any of Aspects 1 to 3 above, may further include a storage unit (104) for storing the parameters required by the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a).
  • In the audio signal processing device according to Aspect 5 of the present disclosure, in any of Aspects 1 to 4 above, the plurality of rendering methods may include a first rendering method in which the audio signal is output from each audio output device (audio output unit 20 (speakers 602, 603, 604, 605)) at a sound pressure ratio according to the reproduction position, and a second rendering method in which an audio signal processed according to the reproduction position is output from each audio output device.
  • In the audio signal processing device according to Aspect 6 of the present disclosure, in Aspect 5 above, the first rendering method may be VBAP and the second rendering method may be transaural.
  • In the audio signal processing device (audio signal processing unit 10, 10a) according to Aspect 7 of the present disclosure, in any of Aspects 1 to 6 above, the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) may determine whether the reproduction position is included in the range defined by the arrangement positions of the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)) and select the one rendering method according to the determination result.
  • In the audio signal processing device (audio signal processing unit 10, 10a) according to Aspect 8 of the present disclosure, in Aspect 2 above, the processing unit (rendering unit 103, rendering switching signal generation unit 102, 102a) may specify the viewing effective range of each rendering method, determine whether the reproduction position is included in the range defined by the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)), determine whether the viewing position of the user indicated by the viewing position information is included in the viewing effective range, and select the one rendering method according to the determination results.
  • The audio signal processing system (audio signal processing system 1, 1a) according to Aspect 9 of the present disclosure includes the audio signal processing device (audio signal processing unit 10, 10a) according to any of Aspects 1 to 8 and the plurality of audio output devices (audio output unit 20 (speakers 602, 603, 604, 605)).
  • Reference signs: 1, 1a audio signal processing system; 10, 10a audio signal processing unit (audio signal processing device); 20 audio output unit (plurality of audio output devices); 101 content analysis unit (reproduction position specifying unit); 102, 102a rendering switching signal generation unit (position information acquisition unit, processing unit); 103 rendering unit (processing unit); 104 storage unit; 201 track information; 602, 603, 604, 605 speaker (audio output device); 606, 607, 608, 609 sound image position (playback position); 10201, 10201a environment information acquisition unit (position information acquisition unit); 10202, 10202a rendering switching instruction signal calculation unit (processing unit)

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present invention addresses the problem of presenting to a user audio rendered by a rendering scheme that is suitable for the user's viewing situation. According to one embodiment, the present invention relates to an audio signal processing system (1) provided with an audio signal processing unit (10) that selects one rendering scheme from among a plurality of rendering schemes on the basis of position information of audio output devices and track information indicating a playback position of an input audio signal, and renders the input audio signal using the selected rendering scheme.
PCT/JP2018/000736 2017-02-17 2018-01-15 Audio signal processing device and audio signal processing system WO2018150774A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-028396 2017-02-17
JP2017028396 2017-02-17

Publications (1)

Publication Number Publication Date
WO2018150774A1 (fr) 2018-08-23

Family

ID=63170536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/000736 WO2018150774A1 (fr) 2017-02-17 2018-01-15 Audio signal processing device and audio signal processing system

Country Status (1)

Country Link
WO (1) WO2018150774A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020227140A1 (fr) * 2019-05-03 2020-11-12 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers
JP7470695B2 (ja) 2019-01-08 2024-04-18 Telefonaktiebolaget LM Ericsson (publ) Efficient spatially heterogeneous audio elements for virtual reality

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016525813A (ja) * 2014-01-02 2016-08-25 Koninklijke Philips N.V. Audio apparatus and method therefor
JP2016165117A (ja) * 2011-07-01 2016-09-08 Dolby Laboratories Licensing Corporation Audio signal processing system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016165117A (ja) * 2011-07-01 2016-09-08 Dolby Laboratories Licensing Corporation Audio signal processing system and method
JP2016525813A (ja) * 2014-01-02 2016-08-25 Koninklijke Philips N.V. Audio apparatus and method therefor

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7470695B2 (ja) 2019-01-08 2024-04-18 Telefonaktiebolaget LM Ericsson (publ) Efficient spatially heterogeneous audio elements for virtual reality
US11968520B2 2019-01-08 2024-04-23 Telefonaktiebolaget Lm Ericsson (Publ) Efficient spatially-heterogeneous audio elements for virtual reality
WO2020227140A1 (fr) * 2019-05-03 2020-11-12 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers
CN113767650A (zh) * 2019-05-03 2021-12-07 Dolby Laboratories Licensing Corporation Rendering audio objects using multiple types of renderers
JP2022530505A (ja) * 2019-05-03 2022-06-29 Dolby Laboratories Licensing Corporation Rendering of audio objects using multiple types of renderers
JP7157885B2 (ja) 2019-05-03 2022-10-20 Dolby Laboratories Licensing Corporation Rendering of audio objects using multiple types of renderers
JP2022173590A (ja) * 2019-05-03 2022-11-18 Dolby Laboratories Licensing Corporation Rendering of audio objects using multiple types of renderers
CN113767650B (zh) * 2019-05-03 2023-07-28 Dolby Laboratories Licensing Corporation Rendering audio objects using multiple types of renderers
EP4236378A3 (fr) * 2019-05-03 2023-09-13 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers
JP7443453B2 (ja) 2019-05-03 2024-03-05 Dolby Laboratories Licensing Corporation Rendering of audio objects using multiple types of renderers
US11943600B2 (en) 2019-05-03 2024-03-26 Dolby Laboratories Licensing Corporation Rendering audio objects with multiple types of renderers

Similar Documents

Publication Publication Date Title
Rumsey Spatial audio
US9299353B2 Method and apparatus for three-dimensional acoustic field encoding and optimal reconstruction
RU2617553C2 (ru) System and method for generating, encoding and presenting adaptive audio signal data
AU2008295723B2 A method and an apparatus of decoding an audio signal
KR101381396B1 (ko) Multi-viewer video and 3D stereophonic sound player system including a stereophonic sound controller, and method therefor
JP2016518067A (ja) Method for managing the reverberant sound field of immersive audio
US20200280815A1 Audio signal processing device and audio signal processing system
KR100739723B1 (ko) Audio playback method and apparatus supporting an audio thumbnail function
JP6868093B2 (ja) Audio signal processing device and audio signal processing system
KR20190109019A (ko) Method and apparatus for reproducing an audio signal according to the movement of a user in a virtual space
JP6663490B2 (ja) Speaker system, audio signal rendering device, and program
JPWO2017110882A1 (ja) Speaker arrangement position presentation device
JP5338053B2 (ja) Wavefront synthesis signal conversion device and wavefront synthesis signal conversion method
WO2018150774A1 (fr) Audio signal processing device and audio signal processing system
CN114915874B (zh) Audio processing method, apparatus, device, and medium
JP5743003B2 (ja) Wavefront synthesis signal conversion device and wavefront synthesis signal conversion method
JP5590169B2 (ja) Wavefront synthesis signal conversion device and wavefront synthesis signal conversion method
KR20070081735A (ko) Method and apparatus for encoding/decoding an audio signal
Ando Preface to the Special Issue on High-reality Audio: From High-fidelity Audio to High-reality Audio
RU2779295C2 (ru) Processing of a monophonic signal in a 3D audio decoder providing binaural content
Brandenburg et al. Audio Codecs: Listening pleasure from the digital world
JP2008147840A (ja) Audio signal generation device, sound field reproduction device, audio signal generation method, and computer program
JP2007180662 (ja) Video and audio playback device, method, and program
KR102058619B1 (ko) Rendering method for an exception channel signal
Stevenson Spatialisation, Method and Madness Learning from Commercial Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18754453

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18754453

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载