Unsupervised language model adaptation using LDA-based mixture models and latent semantic marginals
by Md. Akmal Haidar, Douglas O'Shaughnessy

Computer Speech & Language



Computer Speech and Language 29 (2015) 20–31

Md. Akmal Haidar ∗, Douglas O’Shaughnessy

INRS-EMT, 800 de la Gauchetiere Ouest, Bureau 6900, H5A 1K6 Montreal, QC, Canada

Received 17 December 2013; received in revised form 16 June 2014; accepted 18 June 2014; available online 2 July 2014

This paper has been recommended for acceptance by R. De Mori.
∗ Corresponding author. Tel.: +1 5149951266.
E-mail addresses: haidar@emt.inrs.ca (Md.A. Haidar), dougo@emt.inrs.ca (D. O’Shaughnessy).
http://dx.doi.org/10.1016/j.csl.2014.06.002

Abstract

In this paper, we present unsupervised language model (LM) adaptation approaches using latent Dirichlet allocation (LDA) and latent semantic marginals (LSM). The LSM is the unigram probability distribution over words that is calculated using LDA-adapted unigram models. The LDA model is used to extract topic information from a training corpus in an unsupervised manner. The LDA model yields a document–topic matrix that describes the number of words assigned to topics for the documents. A hard-clustering method is applied on the document–topic matrix of the LDA model to form topics. An adapted model is created by using a weighted combination of the n-gram topic models. The stand-alone adapted model outperforms the background model. The interpolation of the background model and the adapted model gives further improvement. We modify the above models using the LSM. The LSM is used to form a new adapted model by using the minimum discriminant information (MDI) adaptation approach called unigram scaling, which minimizes the distance between the new adapted model and the other model. The unigram scaling of the adapted model using the LSM yields better results over a conventional unigram scaling approach. The unigram scaling of the interpolation of the background and the adapted model using the LSM outperforms the background model, the unigram scaling of the background model, the unigram scaling of the adapted model, and the interpolation of the background and the adapted models, respectively. We perform experiments using the ’87–89 Wall Street Journal (WSJ) corpus incorporating a multi-pass continuous speech recognition (CSR) system. In the first pass, we use the background n-gram language model for lattice generation, and then we apply the LM adaptation approaches for lattice rescoring in the second pass.
© 2014 Elsevier Ltd. All rights reserved.

Keywords: Language model; Topic model; Mixture model; Speech recognition; Minimum discriminant information

1. Introduction

LM adaptation plays a vital role in improving a speech recognition system’s performance. It is essential when the styles, topics, or domains of the recognition tasks are mismatched with the training set. To compensate for this mismatch, LM adaptation helps to exploit specific, albeit limited, knowledge about the recognition task (Bellegarda, 2004).

The idea of an unsupervised LM adaptation approach is to extract latent topics from the training set and then adapt topic-specific LMs with proper mixture weights, finally interpolated with a generic n-gram LM (Liu and Liu, 2007; Haidar and O’Shaughnessy, 2010).
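The adaptation scheme just described — cluster training documents by their dominant latent topic, train a topic LM per cluster, combine the topic LMs with mixture weights, and interpolate the result with the background LM — can be sketched with toy numbers. Everything below (the document–topic counts, the mixture weights, the probabilities, and the `adapted_prob` helper) is hypothetical and only illustrates the combination step, not the authors' implementation:

```python
# Hypothetical document–topic count matrix from an LDA run:
# rows = documents, columns = topics; entry = words assigned to that topic.
doc_topic = [
    [40, 5, 5],   # doc 0: mostly topic 0
    [3, 44, 3],   # doc 1: mostly topic 1
    [6, 4, 50],   # doc 2: mostly topic 2
    [30, 10, 2],  # doc 3: mostly topic 0
]

# Hard-clustering: assign each document to its dominant topic.
clusters = [row.index(max(row)) for row in doc_topic]
# clusters == [0, 1, 2, 0]; one n-gram topic LM would be trained per cluster.

def adapted_prob(p_topic, weights, p_bg, lam=0.5):
    """Weighted combination of topic-LM probabilities for one (history, word)
    pair, interpolated with the background LM; lam is tuned on held-out data."""
    p_mix = sum(w * p for w, p in zip(weights, p_topic))
    return lam * p_bg + (1.0 - lam) * p_mix

# Toy numbers: P(w|h) under the 3 topic LMs, mixture weights, background P(w|h).
p = adapted_prob([0.02, 0.001, 0.005], [0.7, 0.2, 0.1], 0.01)
```

With these toy values the interpolated probability lies between the background estimate and the topic-mixture estimate, which is the intended effect: the topic mixture sharpens in-domain words while the background model preserves general coverage.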

Statistical n-gram language models have been used successfully for speech recognition and other applications. They use local context information by modeling text as a Markovian sequence and capture only the local dependencies between words. They suffer from insufficiencies of the training data, which limit model generalization. Due to limitations of the amount of training data, statistical n-gram LMs encounter a data sparseness problem, which is handled by using backoff smoothing approaches with lower-order language models (Chen and Goodman, 1999). Moreover, n-gram models cannot capture the long-range information of natural language. Several methods have been investigated to overcome this weakness. A cache-based language model is an earlier approach that is based on the idea that if a word appeared previously in a document it is more likely to occur again. It helps to increase the probability of previously observed words in a document when predicting a future word (Kuhn and Mori, 1990). Recently, various techniques such as latent semantic analysis (LSA) (Deerwester et al., 1990; Bellegarda, 2000), probabilistic LSA (PLSA) (Gildea and Hofmann, 1999), and LDA (Blei et al., 2003) have been investigated to extract the latent topic information from a training corpus. All of these methods are based on a bag-of-words assumption, i.e., the word order in a document can be ignored. In LSA, a word–document matrix is used to extract the semantic information. In PLSA, each document is modeled by its own mixture weights and there is no generative model for these weights. So, the number of parameters grows linearly when increasing the number of documents, which leads to an overfitting problem. Also, there is no method to assign a probability to a document outside the training set. On the contrary, the LDA model was introduced where a Dirichlet distribution is applied on the topic mixture weights corresponding to the documents in the corpus. Therefore, the number of model parameters is dependent only on the number of topic mixtures and the vocabulary size. Thus, LDA is less prone to overfitting and can be used to compute the probabilities of unobserved test documents. However, the LDA model can be viewed as a set of unigram latent topic models. The LDA model has been used successfully in recent research work for LM adaptation (Tam and Schultz, 2005, 2006; Liu and Liu, 2007, 2008; Haidar and O’Shaughnessy, 2010, 2011, 2012b,a). In Tam and Schultz (2006), a unigram scaling approach is used for the LDA-adapted unigram model to minimize the distance between the adapted model and the background model (Tam and Schultz, 2006). The LDA model is also used as a clustering algorithm to cluster training data into topics (Ramabhadran et al., 2007; Heidel and Lee, 2007). The LDA model can be merged with n-gram models to achieve perplexity reduction (Sethy and Ramabhadran, 2008). A non-stationary version of LDA can be developed for LM adaptation in speech recognition (Chueh and Chien, 2009). A topic-dependent LM, called topic-dependent class (TDC) based n-gram LM, was proposed in Naptali et al. (2012), where the topic is decided in an unsupervised manner. Here, the LSA method was used to reveal latent topic information from noun–noun relations (Naptali et al., 2012).

A simple technique to form a topic from an unlabeled corpus is to assign one topic label to a document (Iyer and Ostendorf, 1996). This hard-clustering strategy is used, leveraging LDA and named-entity information, to form topics (Liu and Liu, 2007, 2008). Here, topic-specific n-gram language models are created and joined with proper mixture weights for adaptation. The adapted model is then interpolated with the background model to capture the local lexical regularities. The component weights of the n-gram topic models were created by using the word counts of the latent topic of the LDA model. However, these counts are best suited for the LDA unigram topic models. A unigram count weighting approach (Haidar and O’Shaughnessy, 2010) for the topics generated by hard-clustering has shown better performance over the weighting approach described in Liu and Liu (2007, 2008). An extension of the unigram weighting approach (Haidar and O’Shaughnessy, 2010) was proposed in Haidar and O’Shaughnessy (2011), where the weights of the n-gram topic models are computed by using the n-gram count of the topics generated by a hard-clustering method. The adapted n-gram model is scaled by using the LDA-adapted unigram model, called latent semantic marginals (LSM) (Tam and Schultz, 2006), and outperforms a traditional unigram scaling of the background model using the above marginals (Haidar and O’Shaughnessy, 2012a). Here, the unigram scaling technique (Kneser et al., 1997) is applied, where a new adapted model is formed by using a minimum discriminant information (MDI) approach that minimizes the KL divergence between the new adapted model and the adapted n-gram model, subject to a constraint that the marginalized unigram distribution of the new adapted model is equal to the LSM. In this paper, we present an extension to the previous works (Haidar and O’Shaughnessy, 2011, 2012a) where we apply the unigram scaling technique to the interpolation of the background and the adapted n-gram model and note better results over the previous works. In addition, we perform all the experiments using different corpus sizes (’87 WSJ corpus (17 million words) and ’87–89 WSJ corpus (37 million words)) instead of using only the 1 million words WSJ training transcription

topic sets i
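The unigram scaling idea above can be sketched in a few lines: each conditional probability is multiplied by a factor α(w) = (P_LSM(w) / P_bg(w))^β and the distribution for each history is renormalized, so words the LSM favors gain probability. This is a minimal toy illustration with made-up probabilities and a hypothetical `unigram_scale` helper, not the paper's experimental setup:

```python
# Toy background bigram model P(w | h) over a tiny vocabulary (numbers made up).
p_bg = {
    "the":   {"stock": 0.5, "market": 0.3, "the": 0.2},
    "stock": {"stock": 0.1, "market": 0.6, "the": 0.3},
}
# Background unigram marginal and LDA-adapted unigram (the latent semantic marginals).
bg_uni = {"stock": 0.3, "market": 0.3, "the": 0.4}
lsm    = {"stock": 0.5, "market": 0.2, "the": 0.3}   # adapted topic favors "stock"

def unigram_scale(p_model, lsm, bg_uni, beta=0.5):
    """MDI-style unigram scaling: P_a(w|h) = alpha(w) * P(w|h) / Z(h),
    with alpha(w) = (lsm(w) / bg_uni(w)) ** beta and Z(h) renormalizing
    each history's distribution; beta is a tuning parameter in (0, 1]."""
    alpha = {w: (lsm[w] / bg_uni[w]) ** beta for w in bg_uni}
    adapted = {}
    for h, dist in p_model.items():
        z = sum(alpha[w] * pw for w, pw in dist.items())   # normalizer Z(h)
        adapted[h] = {w: alpha[w] * pw / z for w, pw in dist.items()}
    return adapted

p_adapt = unigram_scale(p_bg, lsm, bg_uni)
# After history "the", "stock" gains probability because the LSM favors it.
```

Per-history renormalization keeps each conditional distribution a proper probability distribution, while the exponent β damps the scaling so the adapted model does not drift too far from the n-gram model being scaled.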