Playing a part: Speaker verification at the movies
Brown et al., 2021
- Document ID
- 6735644142058147606
- Author
- Brown A
- Huh J
- Nagrani A
- Chung J
- Zisserman A
- Publication year
- 2021
- Publication venue
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Snippet
The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where actors often intentionally disguise their voices to play a character. We make the following three contributions: (i) We collect a novel …
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30799—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using low-level visual features of the video content
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scene-specific objects
- G06K9/00711—Recognising video content, e.g. extracting audiovisual features from movies, extracting representative key-frames, discriminating news vs. sport content
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Brown et al. | | Playing a part: Speaker verification at the movies |
| Nagrani et al. | | VoxSRC 2020: The second VoxCeleb speaker recognition challenge |
| Chung et al. | | Spot the conversation: speaker diarisation in the wild |
| Makino et al. | | Recurrent neural network transducer for audio-visual speech recognition |
| Lotfian et al. | | Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings |
| Korshunov et al. | | Speaker inconsistency detection in tampered video |
| Chung et al. | | VoxCeleb2: Deep speaker recognition |
| Sebastian et al. | | Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts |
| US20190043500A1 (en) | | Voice based realtime event logging |
| Chung et al. | | Out of time: automated lip sync in the wild |
| Li et al. | | Speaker-invariant affective representation learning via adversarial training |
| Lakomkin et al. | | On the robustness of speech emotion recognition for human-robot interaction with deep neural networks |
| Harwath et al. | | Deep multimodal semantic embeddings for speech and images |
| Povolny et al. | | Multimodal emotion recognition for AVEC 2016 challenge |
| Pan et al. | | Integrating deep facial priors into landmarks for privacy preserving multimodal depression recognition |
| Lingenfelser et al. | | Asynchronous and event-based fusion systems for affect recognition on naturalistic data in comparison to conventional approaches |
| Peri et al. | | An empirical analysis of information encoded in disentangled neural speaker representations |
| CN114495946A (en) | | Voiceprint clustering method, electronic device and storage medium |
| Sachidananda et al. | | CALM: Contrastive aligned audio-language multirate and multimodal representations |
| Ryumina et al. | | Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition |
| Zheng et al. | | Emotion recognition model based on multimodal decision fusion |
| Azab et al. | | Speaker naming in movies |
| Sarman et al. | | Audio based violent scene classification using ensemble learning |
| Suglia et al. | | Going for GOAL: A resource for grounded football commentaries |
| Hori et al. | | Early and late integration of audio features for automatic video description |