
Overview of Speaker Recognition


An introduction to automatic speaker recognition is presented in this chapter. The identifying characteristics of a person's voice that make it possible to automatically identify a speaker are discussed. Subtasks such as speaker identification, verification, and detection are described. An overview of the techniques used to build speaker models as well as issues related to system performance are presented. Finally, a few selected applications of speaker recognition are introduced to demonstrate the wide range of applications of speaker recognition technologies. Details of text-dependent and text-independent speaker recognition and their applications are covered in the following two chapters.


36. Overview of Speaker Recognition
A. E. Rosenberg, F. Bimbot, S. Parthasarathy

Contents:

36.1 Speaker Recognition
  36.1.1 Personal Identity Characteristics
  36.1.2 Speaker Recognition Definitions
  36.1.3 Bases for Speaker Recognition
  36.1.4 Extracting Speaker Characteristics from the Speech Signal
  36.1.5 Applications
36.2 Measuring Speaker Features
  36.2.1 Acoustic Measurements
  36.2.2 Linguistic Measurements
36.3 Constructing Speaker Models
  36.3.1 Nonparametric Approaches
  36.3.2 Parametric Approaches
36.4 Adaptation
36.5 Decision and Performance
  36.5.1 Decision Rules
  36.5.2 Threshold Setting and Score Normalization
  36.5.3 Errors and DET Curves
36.6 Selected Applications for Automatic Speaker Recognition
  36.6.1 Indexing Multispeaker Data
  36.6.2 Forensics
  36.6.3 Customization: SCANmail
36.7 Summary
References

36.1 Speaker Recognition

36.1.1 Personal Identity Characteristics

Human beings have many characteristics that make it possible to distinguish one individual from another. Some individuating characteristics can be perceived very readily, such as facial features and vocal qualities and behavior. Others, such as fingerprints, iris patterns, and DNA structure, are not readily perceived and require measurements, often quite complex measurements, to capture distinguishing characteristics. In recent years biometrics has emerged as an applied scientific discipline with the objective of automatically capturing personal identifying characteristics and using the measurements for security, surveillance, and forensic applications [36.1]. Typical applications using biometrics secure transactions, information, and premises to authorized individuals. In surveillance applications, the goal is to detect and track a target individual among a set of nontarget individuals. In forensic applications a sample of biometric measurements is obtained from an unknown individual, the perpetrator. The task is to compare this sample with a database of similar measurements from known individuals to find a match. Many personal identifying characteristics are based on physiological properties, others on behavior, and some combine physiological and behavioral properties.

From the point of view of using personal identity characteristics as a biometric for security, physiological characteristics may offer more intrinsic security, since they are not subject to the kinds of voluntary variations found in behavioral features. Voice is an example of a biometric that combines physiological and behavioral characteristics.
Voice is attractive as a biometric for many reasons. It can be captured nonintrusively and conveniently with simple transducers and recording devices. It is particularly useful for remote-access transactions over telecommunication networks. A drawback is that voice is subject to many sources of variability, including behavioral variability, both voluntary and involuntary. An example of involuntary variability is a speaker's inability to repeat utterances precisely the same way. Another example is the spectral changes that occur when speakers vary their vocal effort as background noise increases. Voluntary variability is an issue when speakers attempt to disguise their voices. Other sources of variability include physical voice variations due to respiratory infections and congestion. External sources of variability are especially problematic, including variations in background noise and in transmission and recording characteristics.

36.1.2 Speaker Recognition Definitions

Different tasks are defined under the general heading of speaker recognition. They differ mainly with respect to the kind of decision that is required for each task. In speaker identification a voice sample from an unknown speaker is compared with a set of labeled speaker models. When it is known that the set of speaker models includes all speakers of interest, the task is referred to as closed-set identification. The label of the best-matching speaker is taken to be the identified speaker. Most speaker identification applications are open-set, meaning that it is possible that the unknown speaker is not included in the set of speaker models. In this case, if no satisfactory match is obtained, a no-match decision is provided.

In a speaker verification trial an identity claim is provided or asserted along with the voice sample. In this case, the unknown voice sample is compared only with the speaker model whose label corresponds to the identity claim. If the quality of the comparison is satisfactory, the identity claim is accepted; otherwise the claim is rejected. Speaker verification is a special case of open-set speaker identification with a one-speaker target set. The speaker verification decision mode is intrinsic to most access control applications. In these applications, it is assumed that the claimant will respond to prompts cooperatively.

It can readily be seen that in the speaker identification task performance degrades as the number of speaker models and the number of comparisons increases. In a speaker verification trial only one comparison is required, so speaker verification performance is independent of the size of the speaker population.

A third speaker recognition task has been defined in recent years in National Institute of Standards and Technology (NIST) speaker recognition evaluations; it is generally referred to as speaker detection [36.2, 3]. The NIST task is an open-set identification decision associated exclusively with conversational speech. In this task an unknown voice sample is provided and the task is to determine whether or not one of a specified set of known speakers is present in the sample. A complicating factor for this task is that the unknown sample may contain speech from more than one speaker, such as in the summed two sides of a telephone conversation. In this case, an additional task called speaker tracking is defined, in which it is required to determine the intervals in the test sample during which the detected speaker is talking. In other applications where the speech samples are multispeaker, speaker tracking has also been referred to as speaker segmentation, speaker indexing, and speaker diarization [36.4–10]. It is possible to cast the speaker segmentation task as an acoustical change detection task without creating models. The time instants where a significant acoustic change occurs are assumed to be the boundaries between different speaker segments. In this case, in the absence of speaker models, speaker segmentation would not be considered a speaker recognition task. However, in most reported approaches to this task some sort of speaker modeling does take place. The task usually includes labeling the speaker segments. In this case the task falls unambiguously under the speaker recognition heading.

In addition to decision modes, speaker recognition tasks can be categorized by the kind of speech that is input. If the speaker is prompted or expected to provide a known text and if speaker models have been trained explicitly for this text, the input mode is said to be text dependent. If, on the other hand, the speaker cannot be expected to utter specified texts, the input mode is text independent. In this case speaker models are not trained on explicit texts.

36.1.3 Bases for Speaker Recognition

The principal function associated with the transmission of a speech signal is to convey a message. However, along with the message, additional kinds of information are transmitted, including information about the gender, identity, emotional state, and health of the speaker. The source of all these kinds of information lies in both physiological and behavioral characteristics.
The physiological features are illustrated in Fig. 36.1, which shows a cross-section of the human vocal tract. The shape of the vocal tract, determined by the position of the articulators (the tongue, jaw, lips, teeth, and velum), creates a set of acoustic resonances in response to periodic puffs of air generated by the glottis for voiced sounds, or aperiodic excitation caused by air passing through tight constrictions in the vocal tract. The spectral peaks associated with periodic resonances are referred to as speech formants. The locations in frequency and, to a lesser degree, the shapes of the resonances distinguish one speech sound from another. In addition, formant locations and bandwidths and spectral differences associated with the overall size of the vocal tract serve to distinguish the same sounds spoken by different speakers. The shape of the nasal tract, which determines the quality of nasal sounds, also varies significantly from speaker to speaker. The mass of the glottis is associated with the basic fundamental frequency for voiced speech sounds. The average basic fundamental frequency is approximately 100 Hz for adult males, 200 Hz for adult females, and 300 Hz for children. It also varies from individual to individual.

[Fig. 36.1 Physiology of the human vocal tract, showing the nasal, oral, and pharyngeal cavities, the larynx, trachea, and lungs, and the articulators (reproduced with permission from L. H. Jamieson [36.11])]

Speech signal events can be classified as segmental or suprasegmental. Generally, segmental refers to the features of individual sounds or segments, whereas suprasegmental refers to properties that extend over several speech sounds. Speaking behavior is associated with the individual's control of the articulators for individual speech sounds or segments, and also with suprasegmental characteristics governing how individual speech sounds are strung together to form words. Higher-level speaking behavior is associated with choices of words and syntactic units. Variations in fundamental frequency or pitch and rhythm are also higher-level features of the speech signal, along with such qualities as breathiness and strength of vocal effort. All of these vary significantly from speaker to speaker.

36.1.4 Extracting Speaker Characteristics from the Speech Signal

A perceptual view classifies speech as containing low-level and high-level kinds of information. Low-level features of speech are associated with the periphery in the brain's perception of speech and are relatively accessible from the speech signal. High-level features are associated with more-central locations in the perception mechanism. Generally speaking, low-level speaker features are easier to extract from the speech signal and model than high-level features. Many such features are associated with spectral correlates such as formant locations and bandwidths, pitch periodicity, and segmental timings. High-level features include the perception of words and their meaning, syntax, prosody, dialect, and idiolect.

It is not easy to extract stable and reliable formant features explicitly from the speech signal. In most instances it is easier to carry out short-term spectral amplitude measurements that capture low-level speaker characteristics implicitly. Short-term spectral measurements are typically carried out over 20–30 ms windows advanced every 10 ms. Short speech sounds have durations less than 100 ms, whereas stressed vowel sounds can last for 300 ms or more. Advancing the time window every 10 ms enables the temporal characteristics of individual speech sounds to be tracked, and the 30 ms analysis window is usually sufficient to provide good spectral resolution of these sounds while remaining short enough to resolve significant temporal characteristics.

There are two principal methods of short-term spectral analysis: filter bank analysis and linear predictive coding (LPC) analysis. In filter bank analysis the speech signal is passed through a bank of bandpass filters covering a range of frequencies consistent with the transmission characteristics of the signal. The spacing of the filters can be uniform or, more likely, nonuniform, consistent with perceptual criteria such as the mel or Bark scale [36.12], which provides linear spacing in frequency below 1000 Hz and logarithmic spacing above.
The output of each filter is typically implemented as a windowed, short-term Fourier transform using fast Fourier transform (FFT) techniques. This output is subjected to a nonlinearity and a low-pass filter to provide an energy measurement. LPC-derived features almost always include regression measurements that capture the temporal evolution of these features from one speech segment to another. It is no accident that short-term spectral measurements are also the basis for speech recognizers. This is because an analysis that captures the differences between one speech sound and another can also capture the difference between the same speech sound uttered by different speakers, often with resolutions surpassing human perception.

Other measurements that are often carried out are correlated with prosody, such as pitch and energy tracking. Pitch or periodicity measurements are relatively easy to make. However, periodicity measurement is meaningful only for voiced speech sounds, so it is also necessary to have a detector that can discriminate voiced from unvoiced sounds. This complication often makes it difficult to obtain reliable pitch tracks over long-duration utterances.

Long-term average spectral and fundamental frequency measurements have been used in the past for speaker recognition, but since these measurements provide feature averages over long durations they are not capable of resolving detailed individual differences.

Although computational ease is an important consideration for selecting speaker-sensitive feature measurements, equally important considerations are the stability of the measurements, including whether they are subject to variability, noise, and distortions from one measurement of a speaker's utterances to another. One source of variability is the speaker himself. Features that are correlated with behavior, such as pitch contours (pitch measured as a function of time over specified utterances), can be consciously varied from one token of an utterance to another. Conversely, cooperative speakers can control such variability. More difficult to deal with are the variability and distortion associated with recording environments, microphones, and transmission media. The most severe kinds of variability problems occur when utterances used to train models are recorded under one set of conditions and test utterances are recorded under another.

A block diagram of a speaker recognition system is shown in Fig. 36.2, illustrating the basic elements discussed in this section. A sample of speech from an unknown speaker is input to the system. If the system is a speaker verification system, an identity claim or assertion is also input. The speech sample is recorded, digitized, and analyzed. The analysis is typically some sort of short-term spectral analysis that captures speaker-sensitive features as described earlier in this section. These features are compared with prototype features compiled into the models of known speakers. A matching process is invoked to compare the sample features and the model features. In the case of closed-set speaker identification, the match is assigned to the model with the best matching score. In the case of speaker verification, the matching score is compared with a predetermined threshold to decide whether to accept or reject the identity claim. For open-set identification, if the matching score for the best-matching model does not pass a threshold test, a no-match decision is made.

[Fig. 36.2 Block diagram of a speaker recognition system: speech signal processing and feature extraction, pattern matching against stored speaker models, and a decision output, with an identity claim as an additional input for verification]
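The decision logic just described can be made concrete with a short sketch. The following illustrative Python fragment is not part of the original chapter: the toy score function (negative average distance to a prototype vector) merely stands in for the matching processes of Sect. 36.3, and the model dictionary is a hypothetical structure.

```python
import numpy as np

def score(features, model):
    # Toy matching score: negative mean distance to a prototype vector.
    # A real system would use the statistical models of Sect. 36.3.
    return -float(np.mean(np.linalg.norm(features - model, axis=1)))

def closed_set_identification(features, models):
    # Assign the sample to the best-matching enrolled speaker.
    return max(models, key=lambda spk: score(features, models[spk]))

def verification(features, models, claimed_id, threshold):
    # Compare only against the model of the claimed identity.
    return score(features, models[claimed_id]) >= threshold

def open_set_identification(features, models, threshold):
    # The best match must also pass a threshold test, else no match.
    best = closed_set_identification(features, models)
    return best if score(features, models[best]) >= threshold else None
```

Here `features` is a T x D matrix of analysis vectors and `models` maps speaker labels to prototypes; the three functions mirror the three decision modes of the block diagram.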
36.1.5 Applications

As mentioned, the most widespread applications for automatic speaker recognition are for security. These are typically speaker verification applications intended to control access to privileged transactions or information remotely over a telecommunication network. They are usually configured in a text-dependent mode in which customers are prompted to speak personalized verification phrases such as personal identification numbers (PINs) spoken as a string of digits. Typically, PIN utterances are decoded using a speaker-independent speech recognizer to provide an identity claim. The utterances are then processed in a speaker recognition mode and compared with speaker models associated with the identity claim. Speaker models are trained by recording and processing prompted verification phrases in an enrollment session.

In addition to security applications, speaker verification may be used to offer personalized services to users. For example, once a speaker verification phrase is authenticated, the user may be given access to a personalized phone book for voice repertory dialing.

A forensic application is likely to be an open-set identification or verification task. A sample of speech exists from an unknown perpetrator. A suspect is required to speak utterances contained in the suspect speech sample in order to train a model. The suspect speech sample is compared with both the suspect and nonsuspect models to decide whether to accept or reject the hypothesis that the suspect and perpetrator voices are the same.

In surveillance applications the input speech mode is most likely to be text independent. Since the speaker may be unaware that his voice is being monitored, he cannot be expected to speak specified texts. The decision task is open-set identification or verification.

Large amounts of multimedia data, including speech, are being recorded and stored on digital media. The existence of such large amounts of data has created a need for efficient, versatile, and accurate data mining tools for extracting useful information content from the data. A typical need is to search or browse through the data, scanning for specified topics, words, phrases, or speakers. Most of this data is multispeaker data, collected from broadcasts, recorded meetings, telephone conversations, etc. The process of obtaining a list of speaker segments from such data is referred to as speaker indexing, segmentation, or diarization. A more-general task of annotating audio data from various audio sources including speakers has been referred to as audio diarization [36.10].

Still another speaker recognition application is to improve automatic speech recognition by adapting speaker-independent speech models to specified speakers. Many commercial speech recognizers do adapt their speech models to individual users, but this cannot be regarded as a speaker recognition application unless speaker models are constructed and speaker recognition is a part of the process. Speaker recognition can also be used to improve speech recognition for multispeaker data. In this situation speaker indexing can provide a table of speech segments assigned to individual speakers. The speech data in these segments can then be used to adapt speech models to each speaker. Speech recognition of multispeaker speech samples can be improved in another way: errors and ambiguities in speech recognition transcripts can be corrected using the knowledge provided by speaker segmentation assigning the segments to the correct speakers.

36.2 Measuring Speaker Features

36.2.1 Acoustic Measurements

As mentioned in Sect. 36.1, low-level acoustic features such as short-time spectra are commonly used in speaker modeling. Such features are useful in authentication systems because speakers have less control over spectral details than over higher-level features such as pitch.

Short-Time Spectrum
There are many ways of representing the short-time spectrum. A popular representation is the mel-frequency cepstral coefficients (MFCCs), which were originally developed for speaker-independent speech recognition. The choice of center frequencies and bandwidths of the filter bank used in MFCC computation was motivated by the properties of the human auditory system. In particular, this representation provides limited spectral resolution above 2 kHz, which might be detrimental in speaker recognition. However, somewhat counterintuitively, MFCCs have been found to be quite effective in speaker recognition.

There are many minor variations in the definition of MFCCs, but the essential details are as follows. Let $\{S(k), 0 \le k < K\}$ be the discrete Fourier transform (DFT) coefficients of a windowed speech signal $s(t)$. A set of triangular filters is defined such that

$$ w_j(k) = \begin{cases} \dfrac{(k/K)\,f_s - f_{c_{j-1}}}{f_{c_j} - f_{c_{j-1}}} , & l_j \le k \le c_j ,\\[2mm] \dfrac{f_{c_{j+1}} - (k/K)\,f_s}{f_{c_{j+1}} - f_{c_j}} , & c_j < k \le u_j ,\\[1mm] 0 , & \text{elsewhere} , \end{cases} \qquad (36.1) $$
where $f_{c_{j-1}}$ and $f_{c_{j+1}}$ are the lower and upper limits of the pass band for filter $j$, with $f_{c_0} = 0$ and $f_{c_j} < f_s/2$ for all $j$, and $l_j$, $c_j$, and $u_j$ are the DFT indices corresponding to the lower, center, and upper limits of the pass band for filter $j$. The log-energies at the outputs of the $J$ filters are given by

$$ e(j) = \ln\!\left[ \frac{1}{\sum_{k=l_j}^{u_j} w_j(k)} \sum_{k=l_j}^{u_j} |S(k)|^2\, w_j(k) \right] , \qquad (36.2) $$

and the MFCC coefficients are the discrete cosine transform of the filter energies, computed as

$$ C(k) = \sum_{j=1}^{J} e(j) \cos\!\left[ k \left( j - \frac{1}{2} \right) \frac{\pi}{J} \right] , \quad k = 1, 2, \ldots, K . \qquad (36.3) $$

The zeroth coefficient $C(0)$ is set to be the average log-energy of the windowed speech signal. Typical values of the various parameters involved in the MFCC computation are as follows. A cepstrum vector is calculated using a window length of 20 ms and updated every 10 ms. The center frequencies $f_{c_j}$ are uniformly spaced from 0 to 1000 Hz and logarithmically spaced above 1000 Hz. The number of filter energies is typically 24 for telephone-band speech, and the number of cepstrum coefficients used in modeling varies from 12 to 18 [36.13].

Cepstral coefficients based on short-time spectra estimated using linear predictive analysis and perceptual linear prediction are other popular representations [36.14].
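As an illustration, the following Python sketch computes one MFCC vector along the lines of (36.1)–(36.3). It is a simplified reading of the text, not a reference implementation: the filter edges are spaced uniformly on the mel scale, a common approximation to the linear-below-1000 Hz, logarithmic-above spacing described above, and all names are local to this example.

```python
import numpy as np

def mel(f):
    # Mel scale: approximately linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=24, n_ceps=13):
    # One windowed frame (e.g., 20 ms of Hamming-windowed samples) in,
    # one cepstral vector out, following (36.1)-(36.3). C(0), the
    # average log-energy, is excluded here.
    K = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2            # |S(k)|^2
    freqs = np.fft.rfftfreq(K, d=1.0 / fs)            # (k/K) * fs
    # Filter edges f_{c_0} .. f_{c_{J+1}}, equally spaced in mel.
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    e = np.zeros(n_filters)
    for j in range(1, n_filters + 1):
        lo, c, hi = edges[j - 1], edges[j], edges[j + 1]
        w = np.minimum((freqs - lo) / (c - lo),       # rising slope
                       (hi - freqs) / (hi - c))       # falling slope
        w = np.clip(w, 0.0, None)                     # zero elsewhere
        e[j - 1] = np.log(np.dot(spec, w) / w.sum() + 1e-12)   # (36.2)
    jj = np.arange(1, n_filters + 1)
    return np.array([np.dot(e, np.cos(k * (jj - 0.5) * np.pi / n_filters))
                     for k in range(1, n_ceps + 1)])           # (36.3)
```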
Short-time spectral measurements are sensitive to channel and transducer variations. Cepstral mean subtraction (CMS) is a simple and effective method to compensate for convolutional distortions introduced by slowly varying channels. In this method, the cepstral vectors are transformed such that they have zero mean. The cepstral average over a sufficiently long speech signal approximates the estimate of a stationary channel [36.14]. Therefore, subtracting the mean from the original vectors is roughly equivalent to normalizing out the effects of the channel, if we assume that the average of the clean speech signal is zero. Cepstral variance normalization, which results in feature vectors with unit variance, has also been shown to improve performance in text-independent speaker recognition when there is more than a minute of speech for enrollment. Other feature normalization methods, such as feature warping [36.15] and Gaussianization [36.16], map the observed feature distribution to a normal distribution over a sliding window and have been shown to be useful in speaker recognition.

It has long been established that incorporating dynamic information is useful for speaker recognition and speech recognition [36.17]. The dynamic information is typically incorporated by extending the static cepstral vectors with their first and second derivatives, computed as

$$ \Delta C_k = \frac{\sum_{t=-l}^{l} t\, c_{t+k}}{\sum_{t=-l}^{l} |t|} , \qquad (36.4) $$

$$ \Delta\Delta C_k = \frac{\sum_{t=-l}^{l} t^2\, c_{t+k}}{\sum_{t=-l}^{l} t^2} . \qquad (36.5) $$

Pitch
Voiced sounds are produced by a quasiperiodic opening and closing of the vocal folds in the larynx at a fundamental frequency that depends on the speaker. Pitch is a complex auditory attribute of sound that is closely related to this fundamental frequency. In this chapter, the term pitch is used simply to refer to the measure of periodicity observed in voiced speech.

Prosodic information represented by pitch and energy contours has been used successfully to improve the performance of speaker recognition systems [36.18]. There are a number of techniques for estimating pitch from the speech signal [36.19], and the performance of even simple pitch-estimation techniques is adequate for speaker recognition. The major failure modes occur during speech segments that are at the boundaries of voiced and unvoiced sounds and can be ignored for speaker recognition. A more-significant problem with using pitch information for speaker recognition is that speakers have a fair amount of control over it, which results in large intraspeaker variations and mismatch between enrollment and test utterances.
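The front-end operations described in this subsection (mean subtraction, delta features, and a basic periodicity estimate) are summarized in the hedged sketch below. The function names are invented for the example; the delta formulas follow (36.4) and (36.5), and the pitch estimate is a bare autocorrelation peak-picker of the kind the text calls simple.

```python
import numpy as np

def cms(cepstra):
    # Cepstral mean subtraction: remove the per-utterance mean,
    # which approximates the stationary channel discussed above.
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def deltas(c, l=2):
    # First and second difference features per (36.4) and (36.5);
    # c is a T x D matrix of static cepstra, edge-padded in time.
    T = len(c)
    pad = np.pad(c, ((l, l), (0, 0)), mode="edge")
    d1 = np.zeros_like(c)
    d2 = np.zeros_like(c)
    for t in range(-l, l + 1):
        d1 += t * pad[l + t : l + t + T]
        d2 += t * t * pad[l + t : l + t + T]
    norm1 = sum(abs(t) for t in range(-l, l + 1))
    norm2 = sum(t * t for t in range(-l, l + 1))
    return d1 / norm1, d2 / norm2

def pitch_estimate(frame, fs, fmin=60.0, fmax=400.0):
    # Crude periodicity estimate: autocorrelation peak within the
    # plausible lag range; a real system adds a voicing detector.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + int(np.argmax(ac[lo:hi])))
```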
36.2.2 Linguistic Measurements

In traditional speaker authentication applications, the enrollment data is limited to a few repetitions of a password, and the same password is spoken to gain access to the system. In such cases, speaker models based on short-time spectra are very effective, and it is difficult to extract meaningful high-level or linguistic features. In applications such as indexing broadcasts by speaker and passive surveillance, a significant amount of enrollment data, perhaps several minutes, may be available. In such cases, the use of linguistic features has been shown to be beneficial [36.18].

Word Usage
Features such as vocabulary choices, function word frequencies, and part-of-speech frequencies have been shown to be useful in speaker recognition [36.20]. In addition to words, spontaneous speech contains fillers and hesitations that can be characterized by statistical models and used for identifying speakers [36.20, 21]. There are a number of issues with speaker recognition systems based on lexical features: they are susceptible to errors introduced by large-vocabulary speech recognizers, a significant amount of enrollment data is needed to build robust models, and the speaker models are likely to characterize the topic of conversation as well as the speaker.

Phone Sequences and Lattices
Models of phone sequences output by speech recognizers using phonotactic grammars, typically phone unigrams, can be used to represent speaker characteristics [36.22]. It is assumed that these models capture speaker-specific pronunciations of frequently occurring words and choices of words, as well as an implicit characterization of the acoustic space occupied by the speech signal from a given speaker. It turns out that there is an optimal tradeoff between the constraints used in the recognizer to produce the phone sequences and the robustness of the speaker models of phone sequences. For example, the use of lexical constraints in automatic speech recognition (ASR) reproduces phone sequences found in a predetermined dictionary and prevents phone sequences that may be characteristic of a speaker but are not represented in the dictionary.

The phone accuracy computed using one-best output phone strings generated by ASR systems without lexical constraints is typically not very high. On the other hand, the correct phone sequence can be found in a phone lattice output by an ASR with high probability. It has been shown that it is advantageous to construct speaker models based on phone-lattice output rather than the one-best phone sequence [36.22]. Systems based on one-best phone sequences use the counts of a term, such as a phone unigram or bigram, in the decoded sequence. In the case of lattice outputs, these raw counts are replaced by the expected counts given by

$$ E[C(\tau|X)] = \sum_{Q} p(Q|X)\, C(\tau|Q) , \qquad (36.6) $$

where $Q$ is a path through the phone lattice for the utterance $X$ with associated probability $p(Q|X)$, and $C(\tau|Q)$ is the count of the term $\tau$ in the path $Q$.
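To make (36.6) concrete, the sketch below accumulates expected phone bigram counts from a lattice that has been pruned to an explicit list of (path, posterior) pairs. The list representation is an assumption made for illustration; real lattice tools compute the expectation with forward-backward recursions rather than path enumeration.

```python
from collections import Counter

def expected_ngram_counts(paths, n=2):
    # paths: list of (phone_sequence, posterior) pairs standing in
    # for the lattice; posterior plays the role of p(Q|X) in (36.6).
    expected = Counter()
    for phones, posterior in paths:
        for i in range(len(phones) - n + 1):
            # C(tau|Q) occurrences, weighted by the path posterior.
            expected[tuple(phones[i:i + n])] += posterior
    return expected

# Example: two competing decodings of one utterance.
counts = expected_ngram_counts(
    [(["sil", "ah", "b", "aa"], 0.7), (["sil", "aa", "b", "aa"], 0.3)])
```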
Other Linguistic Features
A number of other features that have been found to be useful for speaker modeling are (a) pronunciation modeling of carefully chosen words, and (b) prosodic statistics such as pitch and energy contours as well as durations of phones and pauses [36.23].

36.3 Constructing Speaker Models

A speaker recognition system provides the ability to construct a model $\lambda_s$ for speaker $s$ using enrollment utterances from that speaker, and a method for comparing the quality of match of a test utterance to the speaker model. The choice of models is determined by the application constraints. In applications in which the user is expected to say a fixed password each time, it is beneficial to develop models for words or phrases to capture the temporal characteristics of speech. In passive surveillance applications, the test utterance may contain phonemes or words not seen in the enrollment data. In such cases, less-detailed models that model the overall acoustic space of the user's utterances tend to be effective. A survey of general techniques that have been used in speaker modeling follows. The methods can be broadly classified as nonparametric or parametric. Nonparametric models make few structural assumptions and are effective when there is sufficient enrollment data that is matched to the test data. Parametric models allow a parsimonious representation of the structural constraints and can make effective use of the enrollment data if the constraints are appropriately chosen.
36.3.1 Nonparametric Approaches

Templates
This is the simplest form of speaker modeling and is appropriate for fixed-password speaker verification systems [36.24]. The enrollment data consists of a small number of repetitions of the password spoken by the target speaker. Each enrollment utterance $X$ is a sequence of feature vectors $\{x_t\}_{t=0}^{T-1}$ generated as described in Sect. 36.2, and serves as a template for the password as spoken by the target speaker. A test utterance $Y$, consisting of vectors $\{y_t\}_{t=0}^{T-1}$, is compared to each of the enrollment utterances, and the identity claim is accepted if the distance between the test and enrollment utterances is below a decision threshold. The comparison is done as follows. Associated with each pair of vectors, $x_i$ and $y_j$, is a distance $d(x_i, y_j)$. The feature vectors of $X$ and $Y$ are aligned using an algorithm referred to as dynamic time warping so as to minimize an overall distance defined as the average intervector distance $d(x_i, y_j)$ between the aligned vectors [36.12]. This approach is effective in simple fixed-password applications in which robustness to channel and transducer differences is not an issue. The technique is described here mostly for historical reasons and is rarely used in real applications today.

Nearest-Neighbor Modeling
Nearest-neighbor models have been popular in nonparametric classification [36.25]. This approach is often thought of as estimating the local density of each class by a Parzen estimate and assigning the test vector to the class with the maximum local density. The local density of a class (speaker) with enrollment data $X$ at a test vector $y$ is defined as

$$ p_{nn}(y; X) = \frac{1}{V[d_{nn}(y, X)]} , \qquad (36.7) $$

where $d_{nn}(y, X) = \min_{x_j \in X} \|y - x_j\|$ is the nearest-neighbor distance and $V(r)$ is the volume of a sphere of radius $r$ in the $D$-dimensional feature space. Since $V(r)$ is proportional to $r^D$,

$$ \ln[p_{nn}(y; X)] \approx -D \ln[d_{nn}(y, X)] . \qquad (36.8) $$

The log-likelihood score of a test utterance $Y$ with respect to a speaker specified by enrollment data $X$ is given by

$$ s_{nn}(Y; X) \approx -\sum_{y_j \in Y} \ln[d_{nn}(y_j, X)] , \qquad (36.9) $$

and the speaker with the greatest $s_{nn}(Y; X)$ is identified.

A modified version of the nearest-neighbor model, motivated by the discussion above, has been successfully used in speaker identification [36.26]. It was found empirically that a score defined as

$$ s'_{nn}(Y; X) = \frac{1}{N_y} \sum_{y_j \in Y} \min_{x_i \in X} \|y_j - x_i\|^2 + \frac{1}{N_x} \sum_{x_j \in X} \min_{y_i \in Y} \|y_i - x_j\|^2 - \frac{1}{N_y} \sum_{y_j \in Y} \min_{\substack{y_i \in Y \\ i \ne j}} \|y_i - y_j\|^2 - \frac{1}{N_x} \sum_{x_j \in X} \min_{\substack{x_i \in X \\ i \ne j}} \|x_i - x_j\|^2 \qquad (36.10) $$

gives much better performance than $s_{nn}(Y; X)$.
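Read literally, (36.10) compares average cross-set nearest-neighbor distances with within-set ones. A direct, brute-force sketch is given below, with smaller values taken to indicate a better match, as the distance-like form of the score suggests; everything here is illustrative, with X and Y the enrollment and test feature matrices.

```python
import numpy as np

def min_sq_dists(A, B, exclude_self=False):
    # Squared distance from each row of A to its nearest row of B.
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    if exclude_self:
        np.fill_diagonal(d, np.inf)   # enforce i != j within a set
    return d.min(axis=1)

def snn_modified(Y, X):
    # The four terms of (36.10): two cross-set averages minus two
    # within-set averages.
    return (min_sq_dists(Y, X).mean() + min_sq_dists(X, Y).mean()
            - min_sq_dists(Y, Y, exclude_self=True).mean()
            - min_sq_dists(X, X, exclude_self=True).mean())
```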
36.3.2 Parametric Approaches

Vector Quantization Modeling
Vector quantization constructs a set of representative samples of the target speaker's enrollment utterances by clustering the feature vectors. Although a variety of clustering techniques exist, the most commonly used is k-means clustering [36.14]. This approach partitions the $N$ feature vectors into $K$ disjoint subsets $S_j$ so as to minimize an overall distance such as

$$ D = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - \mu_j\|^2 , \qquad (36.11) $$

where $\mu_j = (1/N_j) \sum_{x_i \in S_j} x_i$ is the centroid of the $N_j$ samples in the $j$-th cluster. The algorithm proceeds in two steps:

1. Compute the centroid of each cluster using an initial assignment of the feature vectors to the clusters.
2. Reassign each $x_i$ to the cluster whose centroid is closest to it.

These steps are iterated until successive steps do not reassign samples. This algorithm assumes that there exists an initial clustering of the samples into $K$ clusters. It is difficult to obtain a good initialization of $K$ clusters in one step. In fact, it may not even be possible to reliably estimate $K$ clusters because of data sparsity. The Linde–Buzo–Gray (LBG) algorithm [36.27] provides a good solution to this problem. Given $m$ centroids, the LBG algorithm produces additional centroids by perturbing one or more of the centroids using a heuristic. One common heuristic is to choose the centroid $\mu$ of the cluster with the largest variance and produce two centroids $\mu$ and $\mu + \epsilon$, where $\epsilon$ is a small perturbation. The enrollment feature vectors are assigned to the resulting $m + 1$ centroids, and the k-means algorithm described previously can then be applied to refine the centroid estimates. This process can be repeated until $m = M$ or the cluster sizes fall below a threshold. The LBG algorithm is usually initialized with $m = 1$, computing the centroid of all the enrollment data. There are many variations of this algorithm that differ in the heuristic used for perturbing the centroids, the termination criteria, and similar details. In general, this algorithm for generating VQ models has been shown to be quite effective. The choice of $K$ is a function of the size of the enrollment data set, the application, and other system considerations such as limits on computation and memory.
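The following sketch shows the k-means refinement and one simple LBG-style splitting schedule. It doubles every centroid rather than splitting only the highest-variance cluster, which is one of the heuristic variations mentioned above; the code is illustrative, not the chapter's algorithm verbatim.

```python
import numpy as np

def kmeans(X, centroids, n_iters=20):
    for _ in range(n_iters):
        # Step 2: reassign each vector to its nearest centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 1: re-estimate each centroid from its assigned vectors.
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def lbg_codebook(X, K, eps=1e-3):
    # Start from the global centroid (m = 1) and split until K codewords.
    centroids = X.mean(axis=0, keepdims=True)
    while len(centroids) < K:
        centroids = np.vstack([centroids - eps, centroids + eps])
        centroids = kmeans(X, centroids)
    return centroids[:K]
```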
Once the VQ models are established for a target speaker, scoring consists of evaluating $D$ in (36.11) for the feature vectors in the test utterance. This approach is general, can be used for text-dependent and text-independent speaker recognition, and has been shown to be quite effective [36.28]. Vector quantization models can also be constructed on sequences of feature vectors, which are effective at modeling the temporal structure of speech. If the distance functions and centroids are suitably redefined, the algorithms described in this section continue to be applicable.

Although VQ models are still useful in some situations, they have been superseded by models such as the Gaussian mixture model and the hidden Markov model, which are described in the following sections.

Gaussian Mixture Models
In the case of text-independent speaker recognition (the subject of Chap. 38), where the system has no prior knowledge of the text of the speaker's utterance, Gaussian mixture models (GMMs) have proven to be very effective. The GMM can be thought of as a refinement of the VQ model. Feature vectors of the enrollment utterances $X$ are assumed to be drawn from a probability density function that is a mixture of Gaussians given by

$$ p(x|\lambda) = \sum_{k=1}^{K} w_k\, p_k(x|\lambda_k) , \qquad (36.12) $$

where $0 \le w_k \le 1$ for $1 \le k \le K$, $\sum_{k=1}^{K} w_k = 1$, and

$$ p_k(x|\lambda_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) \right] ; \qquad (36.13) $$

$\lambda$ represents the parameters $\{(\mu_k, \Sigma_k, w_k)\}_{k=1}^{K}$ of the distribution. Since the size of the training data is often small, it is difficult to estimate full covariance matrices reliably. In practice, the $\{\Sigma_k\}_{k=1}^{K}$ are assumed to be diagonal.

Given the enrollment data $X$, the maximum-likelihood estimates of $\lambda$ can be obtained using the expectation-maximization (EM) algorithm [36.12]. The k-means algorithm can be used to initialize the parameters of the component densities. The posterior probability that $x_t$ is drawn from the component $p_m(x_t|\lambda_m)$ can be written

$$ P(m|x_t, \lambda) = \frac{w_m\, p_m(x_t|\lambda_m)}{p(x_t|\lambda)} . \qquad (36.14) $$

The maximum-likelihood estimates of the parameters of $\lambda$ in terms of $P(m|x_t, \lambda)$ are

$$ \mu_m = \frac{\sum_{t=1}^{T} P(m|x_t, \lambda)\, x_t}{\sum_{t=1}^{T} P(m|x_t, \lambda)} , \qquad (36.15) $$

$$ \Sigma_m = \frac{\sum_{t=1}^{T} P(m|x_t, \lambda)\, x_t x_t^{T}}{\sum_{t=1}^{T} P(m|x_t, \lambda)} - \mu_m \mu_m^{T} , \qquad (36.16) $$

$$ w_m = \frac{1}{T} \sum_{t=1}^{T} P(m|x_t, \lambda) . \qquad (36.17) $$

The two steps of the EM algorithm consist of computing $P(m|x_t, \lambda)$ given the current model and updating the model using the equations above. These two steps are iterated until a convergence criterion is satisfied.

Test utterance scores are obtained as the average log-likelihood given by

$$ s(Y|\lambda) = \frac{1}{T} \sum_{t=1}^{T} \log[p(y_t|\lambda)] . \qquad (36.18) $$

Speaker verification is often based on a likelihood-ratio test statistic of the form $p(Y|\lambda)/p(Y|\lambda_{bg})$, where $\lambda$ is the speaker model and $\lambda_{bg}$ represents a background model [36.29]. For such systems, speaker models can also be trained by adapting $\lambda_{bg}$, which is generally trained on a large independent speech database [36.30]. There are many motivations for this approach.
Generating a speaker model by adapting a well-trained background GMM may yield models that are more robust to channel differences and other kinds of mismatch between enrollment and test conditions than models estimated using only limited enrollment data. Details of this procedure can be found in Chap. 38.
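For reference, the basic GMM computations just described, namely the EM updates (36.14)–(36.17) and the scoring rule (36.18), fit in a few lines of numpy. The sketch below assumes diagonal covariances and is illustrative rather than a tuned implementation; background-model adaptation itself is left to Chap. 38.

```python
import numpy as np

def em_step(X, w, mu, var):
    # One EM iteration for a diagonal-covariance GMM ((36.14)-(36.17)).
    # X: T x D features; w: K weights; mu, var: K x D means/variances.
    diff2 = (X[:, None, :] - mu[None, :, :]) ** 2
    logp = -0.5 * (diff2 / var + np.log(2 * np.pi * var)).sum(axis=2)
    logp += np.log(w)
    logp -= logp.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)          # P(m|x_t,lambda), (36.14)
    n = post.sum(axis=0)                             # soft occupancy counts
    mu_new = (post.T @ X) / n[:, None]                         # (36.15)
    var_new = (post.T @ (X ** 2)) / n[:, None] - mu_new ** 2   # (36.16)
    return n / len(X), mu_new, np.maximum(var_new, 1e-6)       # (36.17)

def avg_loglik(Y, w, mu, var):
    # Average log-likelihood score of (36.18), computed stably.
    diff2 = (Y[:, None, :] - mu[None, :, :]) ** 2
    logp = -0.5 * (diff2 / var + np.log(2 * np.pi * var)).sum(axis=2)
    logp += np.log(w)
    m = logp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))))
```

A verification score can then be formed as the difference between avg_loglik under the claimant's model and under the background model, matching the likelihood-ratio statistic above.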
Speaker modeling using GMMs is attractive for text-independent speaker recognition because it is simple to implement and computationally inexpensive. The fact that this model does not capture temporal aspects of speech is a disadvantage. However, it has been difficult to exploit temporal structure to improve speaker recognition performance when the linguistic content of the test utterances does not overlap significantly with the linguistic content of the enrollment utterances.

Hidden Markov Models
In applications where the system has prior knowledge of the text and there is significant overlap between what is said during enrollment and testing, text-dependent statistical models are much more effective than GMMs. An example of such an application is access control to personal information or bank accounts using a voice password. Hidden Markov models (HMMs) [36.12] for phones, words, or phrases have been shown to be very effective [36.31, 32]. Passwords consisting of word sequences drawn from specialized vocabularies such as digits are commonly used. Each word can be characterized by an HMM with a small number of states, in which each state is represented by a Gaussian mixture density. The maximum-likelihood estimates of the parameters of the model can be obtained using a generalization of the EM algorithm [36.12].

ML training aims to approximate the underlying distribution of the enrollment data for a speaker. The estimates deviate from the true distribution due to the lack of sufficient training data and incorrect modeling assumptions, which leads to a suboptimal classifier design. Some limitations of ML training can be overcome using discriminative training of speaker models, in which an attempt is made to minimize an overall cost function that depends on misclassification or detection errors [36.33–35].

Discriminative training approaches require examples from competing speakers in addition to examples from the target speaker. In the case of closed-set speaker identification, it is possible to construct a misclassification measure to evaluate how likely it is that a test sample spoken by a target speaker is misclassified as any of the others. One example of such a measure is the minimum classification error (MCE), defined as follows. Consider the set of $S$ discriminant functions $\{g_s(x; \Lambda_s), 1 \le s \le S\}$, where $g_s(x; \Lambda_s)$ is the log-likelihood of observation $x$ given the model $\Lambda_s$ for speaker $s$. A set of misclassification measures for each speaker can be defined as

$$ d_s(x; \Lambda) = -g_s(x; \Lambda_s) + G_s(x; \Lambda) , \qquad (36.19) $$

where $\Lambda$ is the set of all speaker models and $G_s(x; \Lambda)$ is the antidiscriminant function for speaker $s$. $G_s(x; \Lambda)$ is defined so that $d_s(x; \Lambda)$ is positive only if $x$ is incorrectly classified. In speech recognition problems, $G_s(x; \Lambda)$ is usually defined as a collective representation of all competing classes. In the speaker identification task, it is often advantageous to construct pairwise misclassification measures such as

$$ d_{ss'}(x; \Lambda) = -g_s(x; \Lambda_s) + g_{s'}(x; \Lambda_{s'}) , \qquad (36.20) $$

with respect to a set of competing speakers $s'$, a subset of the $S$ speakers. Each misclassification measure is embedded into a smooth empirical loss function

$$ l_{ss'}(x; \Lambda) = \frac{1}{1 + \exp[-\alpha\, d_{ss'}(x; \Lambda)]} , \qquad (36.21) $$

which approximates a loss directly related to the number of classification errors, where $\alpha$ is a smoothness parameter. The loss functions can then be combined into an overall loss given by

$$ l(x; \Lambda) = \sum_{s} \sum_{s' \in S_c} l_{ss'}(x; \Lambda)\, \delta_s(x) , \qquad (36.22) $$

where $\delta_s(x)$ is an indicator function that is equal to 1 when $x$ is uttered by speaker $s$ and 0 otherwise, and $S_c$ is the set of competing speakers. The total loss, defined as the sum of $l(x; \Lambda)$ over all training data, can be optimized with respect to all the model parameters using a gradient-descent algorithm. A similar algorithm has been developed for speaker verification, in which samples from a large number of speakers in a development set are used to compute a minimum verification error measure [36.36].

The algorithm described above only illustrates the basic principles of discriminative training for speaker identification. Many other approaches that differ in their choice of the loss function or the optimization method have been developed and shown to be effective [36.35, 37].
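A schematic rendering of the smoothed MCE loss (36.19)–(36.22) for one labeled training token is given below. The loglik argument stands for the discriminant function g(x; Λ_s), an HMM or GMM log-likelihood in practice; it and the models dictionary are placeholders of this sketch, not components defined in the chapter.

```python
import math

def mce_loss(x, true_speaker, models, loglik, alpha=1.0):
    # Smoothed pairwise MCE loss for one token x known to be spoken
    # by `true_speaker`; `models` maps speaker ids to their models.
    g_true = loglik(x, models[true_speaker])
    loss = 0.0
    for competitor, model in models.items():
        if competitor == true_speaker:
            continue
        d = -g_true + loglik(x, model)               # pairwise measure (36.20)
        loss += 1.0 / (1.0 + math.exp(-alpha * d))   # sigmoid loss (36.21)
    return loss  # summing over all tokens gives the total loss to minimize
```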
The use of HMMs in text-dependent speaker verification is discussed in detail in Chap. 37.

Support Vector Modeling
Traditional discriminative training approaches such as those based on MCE have a tendency to overtrain on the training set. The complexity and generalization ability of the models are usually controlled by testing on a held-out development set. Support vector machines (SVMs) [36.38] provide a way of training classifiers using discriminative criteria in which the model complexity that provides good generalization to test data is determined automatically from the training data. SVMs have been found to be useful in many classification tasks, including speaker identification [36.39].

The original formulation of SVMs was for two-class problems. This seems appropriate for speaker verification, in which the positive samples consist of the enrollment data from a target user and the negative samples are drawn from a large set of impostor speakers. Many extensions of SVMs to multiclass classification have also been developed and are appropriate for speaker identification. There are many issues with SVM modeling for speaker recognition, including the appropriate choice of features and of the kernel. The use of SVMs for text-independent speaker recognition is the subject of Chap. 38.

Other Approaches
Most state-of-the-art speaker recognition systems use some combination of the modeling methods described in the previous sections. Many other interesting models have been proposed and shown to be useful in limited scenarios. Eigenvoice modeling is an approach in which the speaker models are confined to a low-dimensional linear subspace obtained using independent training data from a large set of speakers. This method has been shown to be effective for speaker modeling and speaker adaptation when the enrollment data is too limited for the effective use of other text-independent approaches such as GMMs [36.40]. Artificial neural networks [36.41] have also been shown to be useful in some situations, perhaps in combination with GMMs. When sufficient enrollment data is available, a method for speaker detection that involves comparing the test segment directly to similar segments in the enrollment data has been shown to be effective [36.42].

36.4 Adaptation

In most speaker recognition scenarios, the speech data available for enrollment is too limited to train models that adequately characterize the range of test conditions in which the system needs to operate. For example, in fixed-password speaker authentication systems used in telephony services, enrollment data is typically collected in a single call. The enrollment and test conditions may be mismatched in a number of ways: the telephone handset that is used, the location of the call, which determines the kinds of background noises, and the channel over which speech is transmitted, such as cellular or landline networks. In text-independent modeling, there are likely to be additional problems because of mismatch in the linguistic content. A very effective way to mitigate the effects of mismatch is model adaptation.

Models can be adapted in an unsupervised way using data from authenticated utterances. This is common in fixed-password systems and can reduce the error rate significantly. It is also necessary to update the decision thresholds when the models are adapted. Since the selection of data for model adaptation is not supervised, there is the possibility that models are adapted on impostor utterances, which can be disastrous. The details of unsupervised model and threshold adaptation and the various issues involved are explained in detail in Chap. 37.

Speaker recognition is often incorporated into other applications that involve a dialog with the user. Feedback from the dialog system can be used to supervise model adaptation.
In addition, meta-information available from a dialog system, such as the history of interactions, can be combined with speaker recognition to design a flexible and secure authentication system [36.43].
36.5 Decision and Performance

36.5.1 Decision Rules

Whether they are used for speaker identification or verification, the various models and approaches presented in Sect. 36.3 provide a score $s(Y|\lambda)$ measuring the match between a given test utterance $Y$ and a speaker model $\lambda$. Identification systems yield a set of such scores, one for each speaker in a target list. Verification systems output only one score, computed using the speaker model of the claimed speaker. An accept or reject decision has to be made using this score.

The decision in closed-set identification consists of choosing the identified speaker $\hat{S}$ as the one that corresponds to the maximum score:

$$ \hat{S} = \arg\max_{j} s(Y|\lambda_j) , \qquad (36.23) $$

where the index $j$ ranges over the whole set of target speakers.

The decision in verification is obtained by comparing the score computed using the model for the claimed speaker $S_i$, given by $s(Y|\lambda_i)$, to a predefined threshold $\theta$. The claim is accepted if $s(Y|\lambda_i) \ge \theta$ and rejected otherwise.

Open-set identification relies on a step of closed-set identification eliciting the most likely identity, followed by a verification step to determine whether the hypothesized identity match is good enough.

36.5.2 Threshold Setting and Score Normalization

Efficiency and robustness require that the score $s(Y|\lambda)$ be readily exploitable in a practical application. In particular, the threshold $\theta$ should be as insensitive as possible to users and application context.

When the score is obtained in a probabilistic framework or can be interpreted as a (log) likelihood ratio (LLR), Bayesian decision theory [36.44] states that an optimal threshold for verification can theoretically be set once the false acceptance cost $c_{fa}$, the false rejection cost $c_{fr}$, and the a priori probability $p_{imp}$ of an impostor trying to enter the system are specified. The optimal choice of threshold is given by

$$ \theta^{*} = \frac{c_{fa}}{c_{fr}} \cdot \frac{p_{imp}}{1 - p_{imp}} . \qquad (36.24) $$

In practice, however, the score $s(Y|\lambda)$ does not behave as theory would predict, since the statistical models are not ideal. Various normalization procedures have been proposed to alleviate this problem. Initial work by Li and Porter [36.45] has inspired a number of score normalization techniques that intend to make the statistical distribution of $s(Y|\lambda)$ as independent as possible of speakers, acoustic conditions, linguistic content, etc. This has led to a number of threshold normalization schemes, such as the Z-norm, H-norm, and T-norm, which use side information, the distance between models, and speech material from a development set to determine the normalization parameters. These normalization procedures are discussed in more detail in Chaps. 37 and 38 and in [36.46]. Even so, the optimal threshold for a given operating condition is generally estimated experimentally from development data that is appropriate for the given scenario.

36.5.3 Errors and DET Curves

The performance of an identification system is related to the probability of misclassification, which corresponds to cases when the identified speaker is not the actual one. Verification systems are evaluated on the basis of two types of errors: false acceptance, when an impostor speaker succeeds in being verified under an erroneous claimed identity, and false rejection, when a target user claiming his or her genuine identity is rejected. The a posteriori estimates of the probabilities $p_{fa}$ and $p_{fr}$ of these two types of errors vary in opposite directions as the decision threshold $\theta$ is varied. The tradeoff between $p_{fa}$ and $p_{fr}$ (sometimes mapped to the probability of detection $p_d$, defined as $1 - p_{fr}$) is often displayed in the form of a receiver operating characteristic (ROC), a term commonly used in detection theory [36.44]. In speaker recognition systems a different representation of the same data, referred to as the detection error tradeoff (DET) curve, has become popular.

The DET curve [36.47] is the standard way to depict system behavior in terms of hypothesis separability by plotting $p_{fa}$ as a function of $p_{fr}$. Rather than the probabilities themselves, the normal deviates corresponding to the probabilities are plotted. For a particular threshold value, the corresponding error rates $p_{fa}$ and $p_{fr}$ appear as a specific point on the DET curve. A popular point is the one where $p_{fa} = p_{fr}$, which is called the equal error rate (EER). Plotting DET curves is a good way to compare the potential of two methods in the laboratory, but it is not suited to accurately predicting the performance of a system deployed in real-life conditions.

The decision threshold $\theta$ is often chosen to optimize a cost that is a function of the probabilities of false acceptance and false rejection as well as the prior probability of an impostor attack. One such function is called the detection cost function (DCF), defined as [36.48]

$$ C = p_{imp}\, c_{fa}\, p_{fa} + (1 - p_{imp})\, c_{fr}\, p_{fr} . \qquad (36.25) $$

The DCF is indeed a way to evaluate a system under a particular operating condition and to summarize its estimated performance in a given application scenario in a single figure. It has been used as the primary figure of merit for the evaluation of systems participating in the yearly NIST speaker recognition evaluations [36.48].
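These quantities are easy to estimate from a scored trial list. The sketch below sweeps the threshold over the observed scores to trace the (p_fa, p_fr) tradeoff, then reads off an approximate EER and evaluates the DCF of (36.25). The score arrays are assumed inputs; plotting the normal deviates of the collected pairs would yield the DET curve itself.

```python
import numpy as np

def error_tradeoff(target_scores, impostor_scores):
    # Sweep the threshold over all observed scores and collect the
    # (p_fa, p_fr) pairs: the raw material for a DET curve.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    p_fr = np.array([(target_scores < t).mean() for t in thresholds])
    return p_fa, p_fr, thresholds

def equal_error_rate(target_scores, impostor_scores):
    p_fa, p_fr, _ = error_tradeoff(target_scores, impostor_scores)
    i = int(np.argmin(np.abs(p_fa - p_fr)))   # closest point to p_fa = p_fr
    return 0.5 * (p_fa[i] + p_fr[i])

def dcf(p_fa, p_fr, p_imp, c_fa, c_fr):
    # Detection cost function of (36.25).
    return p_imp * c_fa * p_fa + (1.0 - p_imp) * c_fr * p_fr
```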
36.6 Selected Applications for Automatic Speaker Recognition

Text-dependent and text-independent speaker recognition technology and their applications are discussed in detail in the following two chapters, Chaps. 37 and 38. A few interesting, but perhaps not primary, applications of speaker recognition technology are described in this section. These applications were chosen to demonstrate the wide range of applications of speaker recognition.

36.6.1 Indexing Multispeaker Data

Speaker indexing can be approached as either a supervised or an unsupervised task. Supervised means that prior speaker models exist for the speakers of interest included in the data. The data can then be scanned and processed to determine the segments associated with each of these speakers. Unsupervised means that prior speaker models do not exist. The type of approach taken depends on the type and amount of prior knowledge available for particular applications. There may be knowledge of the identities of the participating speakers, and there may even be independent labeled speech data available for constructing models for these speakers, such as in the case of some broadcast news applications [36.6, 49, 50]. In this situation the task is supervised, and the techniques for speaker segmentation or indexing are basically the same as those used for speaker detection [36.9, 50, 51].

A more-challenging task is unsupervised segmentation. An example application is the segmentation of the speakers in a two-person telephone conversation [36.4, 9, 52, 53]. The speaker identities may or may not be known, but independent labeled speech data for constructing speaker models is generally not available. The following is a possible approach to the unsupervised segmentation problem. The first task is to construct unlabeled single-speaker models from the current data. An initial segmentation of the data is carried out with an acoustic change detector using a criterion such as the generalized likelihood ratio (GLR) [36.4, 5] or the Bayesian information criterion (BIC) [36.8, 54, 55]. The hypothesis underlying this process is that each of the resulting segments will be a single-speaker segment. These segments are then clustered using an agglomerative clustering algorithm with a criterion for measuring the pairwise similarity between segments [36.56–58]. Since in the cited application the number of speakers is known to be two, the clustering terminates when two clusters are obtained. If the acoustic change criterion and the matching criterion for the clustering perform well, the two clusters of segments will each contain segments mostly from one speaker or the other. These segment clusters can then be used to construct protospeaker models, typically GMMs. Each of these models is then used to resegment the data to provide an improved segmentation which, in turn, will provide improved speaker models. The process can be iterated until no further significant improvement is obtained. It then remains to apply speaker labels to the models and segmentations. Some independent knowledge is required to accomplish this. As mentioned earlier, the speakers in the telephone conversation may be known, but some additional information is required to assign labels to the correct models and segmentations.
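As one concrete piece of this pipeline, the sketch below scores candidate change points with a delta-BIC statistic (one full-covariance Gaussian versus two) and scans a sliding window for boundaries. It is a minimal illustration of the BIC criterion cited above, with assumed frame counts and a tunable penalty weight; the clustering and resegmentation stages proceed as described in the text.

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    # Delta-BIC for a hypothesized change after frame i of feature
    # matrix X: one Gaussian for all of X versus one per side.
    # Positive values favor a speaker change.
    def half_logdet(Z):
        C = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        return 0.5 * np.linalg.slogdet(C)[1]
    n, d = X.shape
    gain = (n * half_logdet(X) - i * half_logdet(X[:i])
            - (n - i) * half_logdet(X[i:]))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty

def find_change_points(X, win=300, step=100, margin=50):
    # Scan a sliding window; keep the best in-window boundary when its
    # delta-BIC is positive. Window sizes are in frames and are assumed
    # values for this sketch, not settings from the chapter.
    changes = []
    for start in range(0, len(X) - win + 1, step):
        W = X[start:start + win]
        scores = [delta_bic(W, i) for i in range(margin, win - margin)]
        best = int(np.argmax(scores))
        if scores[best] > 0:
            changes.append(start + margin + best)
    return changes
```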
36.6.2 Forensics

The prospect of being able to identify a person on the basis of his or her voice has received significant interest in the context of law enforcement. In many situations, a voice recording is a key element, and sometimes the only one available, for proceeding with an investigation, identifying or clearing a suspect, and even supporting an accusation or defense in a court of law.

The public perception is that voice identification is a straightforward task, and that there exists a reliable voiceprint in much the same way as there are fingerprints or genetic (DNA) prints. This is not true in general because the voice of an individual has a strong behavioral component, and is only partly based on anatomical properties. Moreover, the conditions under which the test utterance is recorded are generally not known or controlled. The test voice sample might be from an anonymous call, wiretapping, etc. For these reasons, the use of voice recognition in the context of forensic applications must be approached with caution [36.59]. The four procedures that are generally followed in the forensic context are described below.

Nonexpert Speaker Recognition by Lay Listener(s)

This procedure is used in the context of a voice lineup, when a victim or a witness has had the opportunity of hearing a voice sample and is asked to say whether he or she recognizes this voice, or to determine if this voice sample matches one of a set of utterances. Since it is difficult to set up such a test in a controlled way and to calibrate the matching criteria an individual subject may use, such procedures can be used only to suggest a possible course of action during an investigation.
Expert Speaker Recognition

Expert study of a voice sample might include one or more of aural–perceptual approaches, linguistic analysis, and spectrogram examination. In this context, the expert takes into account several levels of speaker characterization such as pitch, timbre, diction, style, idiolect, and other idiosyncrasies, as well as a number of physical measurements including fundamental frequencies, segment durations, formants, and jitter. Experts provide a decision on a seven-level scale specified by the International Association for Identification (IAI) standard [36.60] on whether two voice samples (the disputed recording and a voice sample of the suspect) are more or less likely to have been produced by the same person. Subjective heterogeneous approaches coexist among forensic practitioners and, although the technical invalidity of some methods has been clearly established, they are still used by some. The expert-based approach is therefore generally used with extreme caution.

Semiautomatic Methods

This category refers to systems for which a supervised selection of speech segments is conducted prior to a computer-based analysis of the selected material. Whereas a calibrated metric can be used to evaluate the similarity of specific types of segments such as words or phrases, these systems tend to suffer from a lack of standardization.

Automatic Methods

Fully automated methods using state-of-the-art techniques offer an attractive paradigm for forensic speaker verification. In particular, these automatic approaches can be run without any (subjective) human intervention, they offer a reproducible procedure, and they lend themselves to large-scale evaluation. Technological improvements over the years, as well as progress in the presentation, reporting, and interpretation of the results, have made such methods attractive. However, levels of performance remain highly sensitive to a number of external factors, ranging from the quality and similarity of recording conditions to the cooperativeness of speakers and the potential use of technologies to fake or disguise a voice.

Thanks to a number of initiatives and workshops (in particular the series of ISCA and IEEE Odyssey workshops), the past decade has seen some convergence in terms of formalism, interpretation, and methodology between the forensic science and engineering communities. In particular, the interpretation of voice forensic evidence in terms of Bayesian decision theory and the growing awareness of the need for systematic evaluation have constituted significant contributions to these exchanges.
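In that Bayesian formulation (a standard statement of the framework, spelled out here for concreteness rather than taken from this chapter), the evidence E is summarized by a likelihood ratio that multiplies the prior odds to give the posterior odds:

    P(H_same | E) / P(H_diff | E) = [ p(E | H_same) / p(E | H_diff) ] x [ P(H_same) / P(H_diff) ] ,

where H_same and H_diff are the hypotheses that the disputed recording and the suspect's sample come from the same or from different speakers. The bracketed likelihood ratio is the quantity an automatic system or expert can report; the prior odds, and hence the final decision, remain the province of the court.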
36.6.3 Customization: SCANmail

Customization of services and applications to the user is another class of applications of speaker recognition technology. An example of a customized messaging system is one where members of a family share a voice mailbox. Once the family members are enrolled in a speaker recognition system, there is no need for them to identify themselves when accessing their voice mail. A command such as Get my messages spoken by a user can be used to identify and authenticate the user, and provide only those messages left for that user. There are many such applications of speaker recognition technology. An interesting and successful application of caller identification to a voicemail browser is described in this section.

SCANMail is a system developed for the purpose of providing useful tools for managing and searching through voicemail messages [36.61]. It employs ASR to provide text transcriptions, information retrieval on the transcriptions to provide a weighted set of search terms, information extraction to obtain key information such as telephone numbers from the transcription, as well as automatic speaker recognition to carry out caller identification by processing the incoming messages. A graphical user interface enables the user to exercise the features of the system. The caller identification function is described in more detail below.

Two types of processing requests are handled by the caller identification system (CIS). The first type of request is to assign a speaker label to an incoming message. When a new message arrives, ASR is used to produce a transcription. The transcription as well as the speech signal is transmitted to the CIS for caller identification. The CIS compares the processed speech signal with the model of each caller in the recipient's address book. The recipient's address book is populated with speaker models when the user adds a caller to the address book by providing a label to a received message. A matching score is obtained for each of the caller models and compared to a caller-dependent rejection threshold. If the matching score exceeds the threshold, the received message is assigned a speaker label. Otherwise, the CIS assigns an unknown label to the message.

The second type of request originates with the user action of adding a caller to an address book, as mentioned earlier. In the course of reviewing a received message, the user has the capability to supply a caller label to the message. The enrollment module in the CIS attempts to construct a speaker model for a new user using that message. The acoustic models are trained using text-independent speaker modeling. Acoustic models can be augmented with models based on meta-information, which may include personal information such as the caller's name or contact information left in the message, or the calling history.
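The label-assignment logic of the first request type reduces to an open-set identification rule. A minimal sketch follows; it is our own illustration and not the SCANMail implementation — the names, the structure of the address book, and the use of a GMM-style score() method are all assumptions.

    def label_message(message_feats, address_book):
        # Open-set caller identification for an incoming message.
        # address_book maps caller labels to (model, threshold) pairs;
        # the caller-dependent threshold implements the rejection rule
        # described above. model.score() stands in for text-independent
        # matching, e.g. an average frame log-likelihood under the
        # caller's GMM (sklearn.mixture.GaussianMixture.score has this
        # behavior). Error handling and the empty-book case are omitted.
        scored = [(model.score(message_feats), thr, label)
                  for label, (model, thr) in address_book.items()]
        best_score, thr, label = max(scored)
        return label if best_score >= thr else "unknown"

    # Usage sketch: address_book = {"alice": (alice_gmm, -48.0), ...}
    # message_feats is an array of per-frame acoustic features.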
36.7 Summary

Identifying speakers by voice was originally investigated for applications in speaker authentication. Over the last decade, the field of speaker recognition has become much more diverse and has found numerous applications. An overview of the technology and sample applications was presented in this chapter.

The modeling techniques that are applicable, and the nature of the problems, vary depending on the application scenario. An important dichotomy is based on whether the content (text) of the speech during training and testing overlaps significantly and is known to the system. These two important cases are the subject of the next two chapters.

References

36.1 J.S. Dunn, F. Podio: Biometrics Consortium website, http://www.biometrics.org (2007)
36.2 M.A. Przybocki, A.F. Martin: The 1999 NIST speaker recognition evaluation, using summed two-channel telephone data for speaker detection and speaker tracking, Proc. Eurospeech (1999) pp. 2215–2218, http://www.nist.gov/speech/publications/index.htm
36.3 M.A. Przybocki, A.F. Martin: NIST speaker recognition evaluation chronicles, Proc. Odyssey Workshop (2004) pp. 15–22
36.4 H. Gish, M.-H. Siu, R. Rohlicek: Segregation of speakers for speech recognition and speaker identification, Proc. ICASSP (1991) pp. 873–876
36.5 L. Wilcox, F. Chen, D. Kimber, V. Balasubramanian: Segmentation of speech using speaker identification, Proc. ICASSP (1994) pp. 161–164
36.6 J.-L. Gauvain, L. Lamel, G. Adda: Partitioning and transcription of broadcast news data, Proc. ICSLP (1998) pp. 1335–1338
36.7 S.E. Johnson: Who spoke when? – Automatic segmentation and clustering for determining speaker turns, Proc. Eurospeech (1999) pp. 2211–2214
36.8 P. Delacourt, C.J. Wellekens: DISTBIC: A speaker-based segmentation for audio data indexing, Speech Commun. 32, 111–126 (2000)
36.9 R.B. Dunn, D.A. Reynolds, T.F. Quatieri: Approaches to speaker detection and tracking in conversational speech, Digital Signal Process. 10, 93–112 (2000)
36.10 S.E. Tranter, D.A. Reynolds: An overview of automatic speaker diarization systems, IEEE Trans. Speech Audio Process. 14, 1557–1565 (2006)
36.11 L.H. Jamieson: Course notes for speech processing by computer, Chap. 1, http://cobweb.ecn.purdue.edu/ee649/notes/ (2007)
36.12 L.R. Rabiner, B.-H. Juang: Fundamentals of Speech Recognition (Prentice-Hall, Englewood Cliffs 1993)
36.13 S. Davis, P. Mermelstein: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. 28, 357–366 (1980)
36.14 X. Huang, A. Acero, H.-W. Hon: Spoken Language Processing: A Guide to Theory, Algorithm and System Development (Prentice-Hall, Englewood Cliffs 2001)
36.15 J. Pelecanos, S. Sridharan: Feature warping for robust speaker verification, Proc. ISCA Workshop on Speaker Recognition – 2001: A Speaker Odyssey (2001)
36.16 B. Xiang, U. Chaudhari, J. Navratil, G. Ramaswamy, R. Gopinath: Short-time Gaussianization for robust speaker verification, Proc. ICASSP, Vol. 1 (2002) pp. 681–684
36.17 S. Furui: Comparison of speaker recognition methods using static features and dynamic features, IEEE Trans. Acoust. Speech Signal Process. 29, 342–350 (1981)
36.18 J.P. Campbell, D.A. Reynolds, R.B. Dunn: Fusing high- and low-level features for speaker recognition, Proc. Eurospeech, Vol. 1 (2003)
36.19 W. Hess: Pitch Determination of Speech Signals (Springer, Berlin, Heidelberg 1983)
36.20 G. Doddington: Speaker recognition based on idiolectal differences between speakers, Proc. Eurospeech (2001) pp. 2521–2524
36.21 W.D. Andrews, M.A. Kohler, J.P. Campbell, J.J. Godfrey: Phonetic, idiolectal, and acoustic speaker recognition, Proc. Odyssey Workshop (2001)
36.22 A. Hatch, B. Peskin, A. Stolcke: Improved phonetic speaker recognition using lattice decoding, Proc. ICASSP, Vol. 1 (2005)
36.23 D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, B. Xiang: The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition, Proc. ICASSP (2003) pp. 784–787
36.24 A.E. Rosenberg: Automatic speaker verification: A review, Proc. IEEE 64, 475–487 (1976)
36.25 K. Fukunaga: Introduction to Statistical Pattern Recognition, 2nd edn. (Elsevier, New York 1990)
36.26 A.L. Higgins, L.G. Bahler, J.E. Porter: Voice identification using nearest-neighbor distance measure, Proc. ICASSP (1993) pp. 375–378
36.27 Y. Linde, A. Buzo, R.M. Gray: An algorithm for vector quantizer design, IEEE Trans. Commun. 28, 84–95 (1980)
36.28 F.K. Soong, A.E. Rosenberg, L.R. Rabiner, B.H. Juang: A vector quantization approach to speaker recognition, Proc. IEEE ICASSP (1985) pp. 387–390
36.29 D.A. Reynolds, R.C. Rose: Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Trans. Speech Audio Process. 3, 72–83 (1995)
36.30 D.A. Reynolds, T.F. Quatieri, R.B. Dunn: Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10, 19–41 (2000)
36.31 A.E. Rosenberg, S. Parthasarathy: Speaker background models for connected digit password speaker verification, Proc. ICASSP (1996) pp. 81–84
36.32 S. Parthasarathy, A.E. Rosenberg: General phrase speaker verification using sub-word background models and likelihood-ratio scoring, Proc. ICSLP (1996) pp. 2403–2406
36.33 O. Siohan, A.E. Rosenberg, S. Parthasarathy: Speaker identification using minimum classification error training, Proc. ICASSP (1998) pp. 109–112
36.34 A.E. Rosenberg, O. Siohan, S. Parthasarathy: Small group speaker identification with common password phrases, Speech Commun. 31, 131–140 (2000)
36.35 L. Heck, Y. Konig: Discriminative training of minimum cost speaker verification systems, Proc. RLA2C Speaker Recognition Workshop (1998) pp. 93–96
36.36 A. Rosenberg, O. Siohan, S. Parthasarathy: Speaker verification using minimum verification error training, Proc. ICASSP (1998) pp. 105–108
36.37 J. Navratil, G. Ramaswamy: DETAC – a discriminative criterion for speaker verification, Proc. ICSLP (2002)
36.38 V.N. Vapnik: The Nature of Statistical Learning Theory (Springer, New York 1995)
36.39 W.M. Campbell, D.A. Reynolds, J.P. Campbell: Fusing discriminative and generative methods for speaker recognition: Experiments on Switchboard and NFI/TNO field data, Proc. Odyssey 2004 – The Speaker and Language Recognition Workshop (2004) pp. 41–44
36.40 O. Thyes, R. Kuhn, P. Nguyen, J.-C. Junqua: Speaker identification and verification using eigenvoices, Proc. ICASSP (2000) pp. 242–245
36.41 K.R. Farrell, R. Mammone, K. Assaleh: Speaker recognition using neural networks and conventional classifiers, IEEE Trans. Speech Audio Process. 2, 194–205 (1994)
36.42 D. Gillick, S. Stafford, B. Peskin: Speaker detection without models, Proc. ICASSP (2005)
36.43 G.N. Ramaswamy, R.D. Zilca, O. Alecksandrovich: A programmable policy manager for conversational biometrics, Proc. Eurospeech (2003)
36.44 H.V. Poor: An Introduction to Signal Detection and Estimation (Springer, Berlin, Heidelberg 1994)
36.45 K.P. Li, J.E. Porter: Normalizations and selection of speech segments for speaker recognition scoring, Proc. IEEE ICASSP (1988) pp. 595–598
36.46 F. Bimbot: A tutorial on text-independent speaker verification, EURASIP J. Appl. Signal Process. 4, 430–451 (2004)
36.47 A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki: The DET curve in assessment of detection task performance, Proc. Eurospeech (1997) pp. 1895–1898
36.48 A. Martin, M. Przybocki: The NIST 1999 speaker recognition evaluation – an overview, Digital Signal Process. 10, 1–18 (2000)
36.49 M.A. Siegler, U. Jain, B. Raj, R.M. Stern: Automatic segmentation, classification, and clustering of broadcast news data, Proc. DARPA Speech Recognition Workshop (1997) pp. 97–99
36.50 A.E. Rosenberg, I. Magrin-Chagnolleau, S. Parthasarathy, Q. Huang: Speaker detection in broadcast news databases, Proc. ICSLP (1998) pp. 1339–1342
36.51 J.-F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens: A speaker tracking system based on speaker turn detection for NIST evaluation, Proc. ICASSP (2000) pp. 1177–1180
36.52 A.G. Adami, S.S. Kajarekar, H. Hermansky: A new speaker change detection method for two-speaker segmentation, Proc. ICASSP (2002) pp. 3908–3911
36.53 A.E. Rosenberg, A. Gorin, Z. Liu, S. Parthasarathy: Unsupervised segmentation of telephone conversations, Proc. ICSLP (2002) pp. 565–568
36.54 S.S. Chen, P.S. Gopalakrishnan: Speaker, environment and channel change detection and clustering via the Bayesian information criterion, Proc. DARPA Broadcast News Transcription and Understanding Workshop (1998), http://www.nist.gov/speech/publications/darpa98/index.htm
36.55 A. Tritschler, R. Gopinath: Improved speaker segmentation and segments clustering using the Bayesian information criterion, Proc. Eurospeech (1999)
36.56 A.D. Gordon: Classification: Methods for the Exploratory Analysis of Multivariate Data (Chapman Hall, Englewood Cliffs 1981)
36.57 F. Kubala, H. Jin, R. Schwartz: Automatic speaker clustering, Proc. DARPA Speech Recognition Workshop (1997) pp. 108–111
36.58 D. Liu, F. Kubala: Online speaker clustering, Proc. ICASSP (2003) pp. 572–575
36.59 J.-F. Bonastre, F. Bimbot, L.-J. Boë, J. Campbell, D. Reynolds, I. Magrin-Chagnolleau: Person authentication by voice: A need for caution, Proc. Eurospeech (2003) pp. 33–36
36.60 Voice Identification and Acoustic Analysis Subcommittee of the International Association for Identification: Voice comparison standards, J. Forensic Identif. 41, 373–392 (1991)
36.61 A.E. Rosenberg, S. Parthasarathy, J. Hirschberg, S. Whittaker: Foldering voicemail messages by caller using text independent speaker recognition, Proc. ICSLP (2000)
37. Text-Dependent Speaker Recognition

M. Hébert

Text-dependent speaker recognition characterizes a speaker recognition task, such as verification or identification, in which the set of words (or lexicon) used during the testing phase is a subset of the ones present during the enrollment phase. The restricted lexicon enables very short enrollment (or registration) and testing sessions to deliver an accurate solution but, at the same time, represents scientific and technical challenges. Because of the short enrollment and testing sessions, text-dependent speaker recognition technology is particularly well suited for deployment in large-scale commercial applications. These are the bases for presenting an overview of the state of the art in text-dependent speaker recognition as well as emerging research avenues. In this chapter, we will demonstrate the intrinsic dependence of the accuracy on the lexical content of the password phrase. Several research results will be presented and analyzed to show key techniques used in text-dependent speaker recognition systems from different sites. Among these, we mention multichannel speaker model synthesis and continuous adaptation of speaker models with threshold tracking. Since text-dependent speaker recognition is the most widely used voice biometric in commercial deployments, several results drawn from realistic deployment scenarios are also included.

37.1 Brief Overview ...................................... 743
     37.1.1 Features ..................................... 744
     37.1.2 Acoustic Modeling ............................ 744
     37.1.3 Likelihood Ratio Score ....................... 745
     37.1.4 Speaker Model Training ....................... 746
     37.1.5 Score Normalization and Fusion ............... 746
     37.1.6 Speaker Model Adaptation ..................... 747
37.2 Text-Dependent Challenges ........................... 747
     37.2.1 Technological Challenges ..................... 747
     37.2.2 Commercial Deployment Challenges ............. 748
37.3 Selected Results .................................... 750
     37.3.1 Feature Extraction ........................... 750
     37.3.2 Accuracy Dependence on Lexicon ............... 751
     37.3.3 Background Model Design ...................... 752
     37.3.4 T-Norm in the Context of Text-Dependent Speaker Recognition ... 753
     37.3.5 Adaptation of Speaker Models ................. 753
     37.3.6 Protection Against Recordings ................ 757
     37.3.7 Automatic Impostor Trials Generation ......... 759
37.4 Concluding Remarks .................................. 760
References .............................................. 760

37.1 Brief Overview

There exists significant overlap and fundamental differences between text-dependent and text-independent speaker recognition. The underlying technology and algorithms are very often similar. Advances in one field, frequently text-independent speaker recognition because of the NIST evaluations [37.1], can be applied with success in the other field with only minor modifications.
The main difference, as pointed out by the nomenclature, is the lexicon allowed by each. Although not restricted to a specific lexicon for enrollment, text-dependent speaker recognition assumes that the lexicon active during testing is a subset of the enrollment lexicon. This limitation does not exist for text-independent speaker recognition, where any word can be uttered during enrollment and testing. The known overlap between the enrollment and testing phases results in very good accuracy with a limited amount of enrollment material (typically less than 8 s of speech). In the case of unknown-text speaker recognition, much more enrollment material is required (typically more than 30 s) to achieve similar accuracy. The theme of the lexical content of the enrollment and testing sessions is central to text-dependent speaker recognition and will recur throughout this chapter.
Traditionally, text-independent speaker recognition was associated with speaker recognition on entire conversations. Lately, work by Sturim et al. [37.2] and others [37.3] has helped bridge the gap between text-dependent and text-independent speaker recognition by using the most frequent words in conversational speech and applying text-dependent speaker recognition techniques to these. They have shown the benefits of using text-dependent speaker recognition techniques on a text-independent speaker recognition task.

Table 37.1 illustrates the challenges encountered in text-dependent speaker recognition (adapted from [37.4]). It can be seen that the two main sources of degradation in the accuracy are channel and lexical mismatch. Channel mismatch is present in both text-dependent and text-independent speaker recognition, but mismatch in the lexical content of the enrollment and testing sessions is central to text-dependent speaker recognition.

Table 37.1 Effect of different mismatch types on the EER for a text-dependent speaker verification task (after [37.4]). The corpus is from a pilot with 120 participants (gender balanced) using a variety of handsets. Signal-to-noise ratio (SNR) mismatch is calculated using the difference between the SNR during enrollment and testing (verification); for the purposes of this table, an absolute value of this difference of more than 10 dB was considered mismatched. Channel mismatch is encountered when the enrollment and testing sessions are not on the same channel. Finally, lexical mismatch is introduced when the lexicon used during the testing session is different from the enrollment lexicon. In this case, the password phrase was always a three-digit string. LD0 stands for a lexical match such that the enrollment and testing were performed on the same digit string. In LD2, only two digits are common between the enrollment and testing; in LD4 there is only one common digit. For LD6 (complete lexical mismatch), the enrollment lexicon is disjoint from the testing lexicon. Note that, when considering a given type of mismatch, the conditions are matched for the other types. At EERs around 8%, the 90% confidence interval on the measures is 0.8%.

    Type of mismatch                    Accuracy (EER) (%)
    No mismatch                         7.02
    SNR mismatch                        7.47
    Channel mismatch                    9.76
    Lexical mismatch (LD2)              8.23
    Lexical mismatch (LD4)              13.4
    Complete lexical mismatch (LD6)     36.3

Throughout this chapter, we will try to quantify accuracy based on application data (from trial data collections, comparative studies, or live data). We will favor live data because of its richness and relevance. Special care will be taken to reference accuracy on publicly available data sources (some may be available for a fee), but in some other cases an explicit reference is impossible in order to preserve contractual agreements. Note that a comparative study of off-the-shelf commercial text-dependent speaker verification systems was presented at Odyssey 2006 [37.5].

This chapter is organized as follows. The rest of this section explains at a high level the main components of a speaker recognition system, with an emphasis on the particularities of text-dependent speaker recognition.
The reader is strongly encouraged, for the sake of completeness, to refer to the other chapters on speaker recognition. Section 37.2 presents the main technical and commercial deployment challenges. Section 37.3 is formed by a collection of selected results that illustrate the challenges of Sect. 37.2. Concluding remarks are found in Sect. 37.4.

37.1.1 Features

The first text-dependent speaker recognition system descriptions that incorporate the main features of the current state of the art date back to the early 1990s. In [37.6] and [37.7], systems have feature extraction, speaker models, and score normalization using a likelihood ratio scheme. Since then, several groups have explored different avenues. The work cited below is not restricted to the text-dependent speaker recognition field, nor is it intended as an exhaustive list. Feature sets usually come in two flavors: MEL [37.8] or LPC (linear predictive coding) [37.6, 9] cepstra. Cepstral mean subtraction and feature warping have proved effective on cellular data [37.10] and are generally accepted as an effective noise robustness technique. The positive role of dynamic features in text-dependent speaker recognition has recently been reported in [37.11]. Finally, a feature mapping approach [37.12] has been proposed as an equivalent to speaker model synthesis [37.13]; this is an effective channel robustness technique.
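As a rough illustration of such a front end, the following Python sketch computes MEL-cepstral features with cepstral mean subtraction and appended dynamic (delta) features. It is our own assumption-laden example, not the pipeline of any cited system: the use of the librosa library, the telephone-band sampling rate, and the frame sizes and coefficient counts are all illustrative choices.

    import numpy as np
    import librosa

    def front_end(wav_path):
        # MEL-cepstral front end with cepstral mean subtraction (CMS)
        # and first-order dynamic (delta) features appended.
        y, sr = librosa.load(wav_path, sr=8000)            # telephone-band audio
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=256, hop_length=80)  # ~32 ms / 10 ms
        mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)     # CMS, per utterance
        delta = librosa.feature.delta(mfcc)                # dynamic features
        return np.vstack([mfcc, delta]).T                  # frames x dims

Feature warping or feature mapping, when used, would be applied to the static cepstra before the dynamic features are computed.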
37.1.2 Acoustic Modeling

Several modeling techniques and their associated scoring schemes have been investigated over the years. By far the most common modeling scheme across


 
