Brown et al., 2021 - Google Patents

Playing a part: Speaker verification at the movies

Document ID: 6735644142058147606
Author: Brown A; Huh J; Nagrani A; Chung J; Zisserman A
Publication year: 2021
Publication venue: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Snippet

The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel …

Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30799—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using low-level visual features of the video content
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scene-specific objects
- G06K9/00711—Recognising video content, e.g. extracting audiovisual features from movies, extracting representative key-frames, discriminating news vs. sport content
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier

Similar Documents

Publication	Publication Date	Title
Brown et al.	2021	Playing a part: Speaker verification at the movies
Nagrani et al.	2020	VoxSRC 2020: The second VoxCeleb speaker recognition challenge
Chung et al.	2020	Spot the conversation: speaker diarisation in the wild
Makino et al.	2019	Recurrent neural network transducer for audio-visual speech recognition
Lotfian et al.	2017	Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings
Korshunov et al.	2018	Speaker inconsistency detection in tampered video
Chung et al.	2018	VoxCeleb2: Deep speaker recognition
Sebastian et al.	2019	Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts
US20190043500A1 (en)	2019-02-07	Voice based realtime event logging
Chung et al.	2016	Out of time: automated lip sync in the wild
Li et al.	2020	Speaker-invariant affective representation learning via adversarial training
Lakomkin et al.	2018	On the robustness of speech emotion recognition for human-robot interaction with deep neural networks
Harwath et al.	2015	Deep multimodal semantic embeddings for speech and images
Povolny et al.	2016	Multimodal emotion recognition for AVEC 2016 challenge
Pan et al.	2023	Integrating deep facial priors into landmarks for privacy preserving multimodal depression recognition
Lingenfelser et al.	2016	Asynchronous and event-based fusion systems for affect recognition on naturalistic data in comparison to conventional approaches
Peri et al.	2020	An empirical analysis of information encoded in disentangled neural speaker representations
CN114495946A (en)	2022-05-13	Voiceprint clustering method, electronic device and storage medium
Sachidananda et al.	2022	Calm: Contrastive aligned audio-language multirate and multimodal representations
Ryumina et al.	2021	Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition
Zheng et al.	2021	Emotion recognition model based on multimodal decision fusion
Azab et al.	2018	Speaker naming in movies
Sarman et al.	2018	Audio based violent scene classification using ensemble learning
Suglia et al.	2022	Going for GOAL: A resource for grounded football commentaries
Hori et al.	2017	Early and late integration of audio features for automatic video description