Phonotactic language recognition using dynamic pronunciation and language branch discriminative information

Xianliang Wang, Yulong Wan, Lin Yang, Ruohua Zhou ⇑, Yonghong Yan

Speech Communication 75 (2015) 50–61

Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China

Received 22 June 2014; received in revised form 11 August 2015; accepted 5 October 2015
Available online 14 October 2015

⇑ Corresponding author: R. Zhou.
E-mail addresses: (X. Wang), wanyulong@ (Y. Wan), (L. Yang), zhouruohua@ (R. Zhou), (Y. Yan).

0167-6393/© 2015 Elsevier B.V. All rights reserved.

Abstract

This paper presents our study of a phonotactic language recognition system using dynamic pronunciation and language branch discriminative information. The theory of language branches in linguistics is introduced to language recognition, and a phonotactic language branch variability (PLBV) method based on factor analysis is proposed. In our work, the phoneme variability factor containing dynamic pronunciation information is investigated first. By concatenating low-dimensional phoneme variability factors in the language branch spaces, the phonotactic language branch variability factor is obtained. Language models are trained within and between language branches with support vector machines (SVM). The proposed method uses dynamic and discriminative phonotactic pronunciation characteristics while it does not involve fallible phoneme sequences. Results on the 2011 NIST Language Recognition Evaluation (LRE) 30 s data set show that the proposed method outperforms parallel phoneme recognizer followed by vector space models (PPRVSM) and ivector systems, obtaining relative improvements of 28.2–72.0% in EER, minDCF and language-pair performance metrics. © 2015 Elsevier B.V. All rights reserved.

Keywords: Phonotactic language branch variability; Factor analysis; Dynamic pronunciation information; Discriminative information; Language recognition

1. Introduction

Language recognition aims to determine the language identity given a segment of speech. Two representative approaches are widely used: one based on phonotactic features and one based on spectral features.

In acoustic feature systems, Gaussian mixture models (GMM) and support vector machines (SVM) have been the usual choices (Burget et al., 2006; Campbell et al., 2006), gradually outperforming the phonotactic systems (Dehak et al., 2011; Glembek et al., 2012). Recently, the ivector approach based on factor analysis has provided significant improvements to language recognition systems using spectral features (Martínez et al., 2011; Dehak et al., 2011; Li and Narayanan, 2013). It defines a low-dimensional ivector space modeling both speaker and channel variabilities. A low-dimensional ivector is obtained by mapping a sequence of speech frames onto the ivector space. Results from the 2011 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) (Greenberg et al., 2012) showed the superiority of the ivector system over phonotactic systems (Singer et al., 2012).

In phonotactic approaches, speech utterances are first tokenized into phoneme sequences or lattices, and then an n-gram lexicon model is applied to model the phoneme sequences or lattices. Phoneme recognizer followed by language model (PRLM) (Zissman and Singer, 1994) is a classic approach that uses a phoneme recognizer, containing an Artificial Neural Network (ANN) (Bourlard and Morgan, 1994) and a Viterbi decoder, together with an n-gram language model to model discriminative n-gram statistics.
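The PRLM idea described above can be illustrated with a minimal sketch. It assumes the phoneme recognizer has already produced 1-best phoneme sequences, and it replaces the ANN/Viterbi front end with toy hand-written sequences. The function names, the add-alpha smoothing, the bigram order, and the labels `lang_A` and `lang_B` are illustrative assumptions, not details taken from Zissman and Singer (1994).

```python
import math
from collections import Counter

def train_bigram_lm(sequences, phone_set, alpha=1.0):
    """Estimate an add-alpha smoothed bigram model from phoneme sequences."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # "<s>" marks the utterance start
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab_size = len(phone_set)

    def log_prob(prev, cur):
        num = pair_counts[(prev, cur)] + alpha
        den = context_counts[prev] + alpha * vocab_size
        return math.log(num / den)

    return log_prob

def score(seq, log_prob):
    """Average per-phoneme log-likelihood of a sequence under one model."""
    padded = ["<s>"] + list(seq)
    total = sum(log_prob(p, c) for p, c in zip(padded, padded[1:]))
    return total / max(len(seq), 1)

def recognize(seq, models):
    """Return the language whose model gives the highest score."""
    return max(models, key=lambda lang: score(seq, models[lang]))

# Toy data (hypothetical): one shared phone set, two languages.
PHONES = ["a", "b", "k", "s", "t"]
TRAIN = {
    "lang_A": [list("taka"), list("kata"), list("tata")],
    "lang_B": [list("bsbs"), list("sbsb"), list("bssb")],
}
MODELS = {lang: train_bigram_lm(seqs, PHONES) for lang, seqs in TRAIN.items()}
```

Calling `recognize(list("takata"), MODELS)` selects the model trained on similar phoneme transitions. A real system would decode lattices or 1-best strings from speech and typically use higher-order, better-smoothed models.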

Several significant improvements have been made to PRLM. In W. M. Campbell's work (Campbell et al., 2007), representative n-gram phone statistics were selected and an SVM was used to model the statistics. In H. Li's work (Li et al., 2007), parallel phoneme recognizer followed by vector space models (PPRVSM) was proposed for language recognition based on vector space modeling and obtained excellent performance. T. Mikolov proposed a PCA-based feature extraction method for phonotactic language recognition (Glembek et al., 2010), which reduced the dimension of trigram soft-counts using PCA. Mehdi Soufifar proposed an ivector approach to phonotactic language recognition (Soufifar et al., 2011) and obtained performance comparable to that of the PCA-based feature extraction approach. In D'Haro's work (DHaro et al., 2012), an ivector based on trigram counts was proposed using a multinomial subspace model. These phonotactic approaches proved effective for the language recognition task and achieved performance comparable to that of the spectral-level ivector. Ming Li (Li and Liu, 2014) also presented a generalized i-vector framework with phonotactic tokenizations and tandem features for speaker verification as well as language identification.

However, these methods model the n-grams derived from a phoneme recognizer and may suffer from several problems. For example, the model size grows exponentially with the model order n, and selecting representative n-grams inevitably loses some discriminative information. Moreover, these methods are vulnerable to the errors introduced by the phoneme recognizer.
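The exponential growth mentioned above follows from counting: with a phone inventory of size |P|, there are |P|^n distinct n-grams, so a vector-space representation of n-gram statistics has |P|^n dimensions. The sketch below, a simplified illustration rather than any of the cited systems, builds a relative-frequency n-gram vector from a 1-best phoneme sequence; the phone set and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def ngram_dimension(num_phones, n):
    """Number of distinct n-grams over an inventory of num_phones phones."""
    return num_phones ** n

def ngram_vector(seq, phone_set, n):
    """Map a phoneme sequence to relative frequencies over all possible n-grams."""
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    # One coordinate per possible n-gram, in lexicographic order of phone_set.
    return [counts[g] / total if total else 0.0
            for g in product(phone_set, repeat=n)]

PHONES = ["a", "k", "t"]  # toy inventory (assumption)
vec = ngram_vector(list("takata"), PHONES, 2)

# Dimension for a hypothetical 60-phone recognizer at orders 1 to 4.
growth = [ngram_dimension(60, n) for n in range(1, 5)]
```

For the hypothetical 60-phone inventory, the trigram space already has 216,000 dimensions, which is why the cited work resorts to n-gram selection or to dimension reduction such as PCA.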

In H. Wang's work (Wang et al., 2013), shift-delta multilayer perceptron (SDMLP) features based on phoneme posterior probabilities were introduced to a GMM-based language recognition system. The features did not depend on n-gram phone statistics and achieved good performance owing to the rich pronunciation information in phonotactic features (Zissman, 2001; Lei et al., 2014). Although this approach obtained significant improvement, the dimension of SDMLP features was usually high, which made the features impractical for well-performing factor analysis.
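To see why such features become high-dimensional, consider a minimal sketch of the generic shift-delta operation applied to per-frame vectors such as phoneme posteriors. The exact SDMLP configuration of Wang et al. (2013) is not given in this section, so the parameters `d` (delta spread), `p` (shift between blocks), and `k` (number of stacked blocks) follow the common shift-delta convention and their values are assumptions.

```python
def shift_delta(frames, d=1, p=3, k=7):
    """Stack k delta blocks, spaced p frames apart, for every valid frame.

    frames: list of equal-length feature vectors (e.g. phoneme posteriors).
    Block i at time t is frames[t + i*p + d] - frames[t + i*p - d].
    """
    last_offset = (k - 1) * p + d  # furthest look-ahead needed from frame t
    out = []
    for t in range(d, len(frames) - last_offset):
        stacked = []
        for i in range(k):
            hi = frames[t + i * p + d]
            lo = frames[t + i * p - d]
            stacked.extend(h - l for h, l in zip(hi, lo))
        out.append(stacked)
    return out

# Toy input: 30 frames of 2-dimensional features that grow linearly in time.
frames = [[float(t), 2.0 * t] for t in range(30)]
sd = shift_delta(frames, d=1, p=3, k=7)
```

Each output vector has `k` times the input dimension, so stacking seven blocks of posteriors over a phone set with dozens of units quickly yields vectors with several hundred coordinates.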

As language recognition has developed, more attention has been paid to discriminating between pairs of languages, as emphasized in the NIST 2011 LRE. In