1 |
Speaking Style Variability in Speaker Discrimination by Humans and Machines
Abstract:
A speaker's voice varies constantly in everyday situations, such as when talking to a friend, reading aloud, talking to pets, or narrating a happy incident. These changes in speaking style affect the ability of both humans and machines to distinguish speakers by voice. This dissertation studies the effects of speaking style variability on speaker discrimination performance by humans and machines.
We compare human speaker discrimination performance for read speech versus casual conversations. Listeners perform better when stimuli are style-matched, particularly in trials where both stimuli are read speech, and they perform worst in style-mismatched conditions. Moderate style variability affects the "same speaker" task more than the "different speaker" task. The speakers who are "easy" or "hard" to "tell together" are not the same as those who are "easy" or "hard" to "tell apart." Analysis of acoustic variability suggests that listeners find it easier to "tell speakers together" when they rely on speaker-specific idiosyncrasies, and that they "tell speakers apart" based on the speakers' relative positions within a shared acoustic space.
The effects of style variability on automatic speaker verification (ASV) systems are systematically analyzed using the UCLA Speaker Variability database, which comprises multiple speaking styles per speaker. Performance is better when enrollment and test utterances are of the same style, and it degrades substantially when styles are mismatched. We hypothesize that between-frame entropy can capture style-related spectral and temporal variations. We propose an entropy-based variable frame rate (VFR) technique to address style variability through two different approaches: data augmentation and self-attentive conditioning. Both approaches improve performance in style-mismatch scenarios, and their performance is comparable.
Furthermore, humans and machines seem to employ different approaches to speaker discrimination. In an attempt to improve ASV performance in the presence of style variability, insights from the human speaker perception experiments are used to design a training loss function, referred to as "CllrCE loss". To train the ASV system, CllrCE loss focuses on both speaker-specific idiosyncrasies and the relative acoustic distances between speakers. This loss function improves ASV performance under style variability, especially for moderate style variations from conversational speech.
Keywords:
Acoustic space analysis; Computer engineering; Electrical engineering; Human speaker perception; Self-attention conditioning; Speaker verification; Speaking style; Variable frame rate
URL: https://escholarship.org/uc/item/3zh346jm
BASE
2 |
An Efficient Method for Biomedical Entity Linking Based on Inter- and Intra-Entity Attention
In: Applied Sciences; Volume 12; Issue 6; Pages: 3191 (2022)
3 |
ТРУДНОСТИ ПРЕПОДАВАНИЯ АНГЛИЙСКОГО ЯЗЫКА В НЕЯЗЫКОВОМ ВУЗЕ ... : CHALLENGES OF TEACHING ENGLISH IN A NON-LINGUISTIC HIGHER SCHOOL ...
4 |
A Neural N-Gram-Based Classifier for Chinese Clinical Named Entity Recognition
In: Applied Sciences ; Volume 11 ; Issue 18 (2021)
5 |
A Transformer-Based Neural Machine Translation Model for Arabic Dialects That Utilizes Subword Units
In: Sensors ; Volume 21 ; Issue 19 (2021)
6 |
High-intensity interval training upon cognitive and psychological outcomes in youth : a systematic review
7 |
When to Make the Sensory Social: Registering in Face-to-Face Openings
In: Faculty Publications (2020)
8 |
Extractive summarization using siamese hierarchical transformer encoders
9 |
Using Bidirectional Encoder Representations from Transformers for Conversational Machine Comprehension ; Användning av BERT-språkmodell för konversationsförståelse
10 |
Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
In: Applied Sciences ; Volume 9 ; Issue 21 (2019)
11 |
Exploring Efficient Neural Architectures for Linguistic–Acoustic Mapping in Text-To-Speech
In: Applied Sciences ; Volume 9 ; Issue 16 (2019)
12 |
When to make the sensory social: Registering in copresent openings
In: Communication Scholarship (2019)
13 |
Bridging the gap: attending to discontinuity in identification of multiword expressions
14 |
Deep neural networks for natural language processing and its acceleration
15 |
Light and heavy drinking in jurisdictions with different alcohol policy environments.
In: The International journal on drug policy, vol. 65, pp. 86-96 (2019)
16 |
Testing the Bilingual Advantage Hypothesis: Language Balance and Self-Regulation
17 |
Propuesta de intervención psicoeducativa en un caso de dislexia [Proposal for a psychoeducational intervention in a case of dyslexia]
18 |
Examining links between anxiety, reinvestment and walking when talking by older adults during adaptive gait
19 |
Orthographic learning during reading: the role of whole-word visual processing
In: Journal of Research in Reading, Wiley, 2015, 38, pp. 141-158 ; ISSN: 0141-0423 ; EISSN: 1467-9817 ; DOI: 10.1111/j.1467-9817.2012.01551.x ; https://hal.archives-ouvertes.fr/hal-01218316 (2015)
20 |
INDIVIDUAL DIFFERENCES IN PREDICTIVE PROCESSING: EVIDENCE FROM SUBJECT FILLED-GAP EFFECTS IN NATIVE AND NONNATIVE SPEAKERS OF ENGLISH