Building a Statistical Language Model

Update: I finished my script for creating custom language models. See here: https://github.com/umhau/vmc.

There's a summary at the end with what I figured out. Most of this is me thinking on paper.

The statistical language model is used for helping CMU Sphinx know what words exist, and what the order the words exist in (the grammar and syntax structure). The intro website to all this is here.

I'm trying to decide between the SRILM and the MITLM packages [subsequent edit: also the logios package and the quicklm pearl script - these are referenced in hard-to-find places on the CMU website; see here and here, respectively] [another subsequent edit: looks like I found a link to the official CMU Statistical Language Model toolkit - it was buried in the QuickLM script]. S- is easier to use, apparently, and the CMU site provides example commands. M-, however, seems more likely to stick around and be accessible on github for the long-term. Plus, I forked it.

[sorry, blogger's formatting broke and I had to convert everything to plaintext and start over...lost the links.]

Only downside is, the main contributor to MITLM stopped work on it about 6 mos ago, and started dealing with Kaldi instead. Guess he figured the newer tech was more worth his time. Still, dinosaurs have their place; just watch Space Cowboys to get the picture.

MITLM

Just to be sure that the software doesn't go anywhere, code is downloaded from my repository.

Update: Thanks to Qi Wang's comment below there's an extra dependency to install:

sudo apt-get install autoconf-archive

Installation of MITLM:

cd ~/tools
git clone https://github.com/umhau/mitlm.git
cd ./mitlm
./autogen.sh
./configure
make
make install

~~So, turns out that there's some weird problems with the installation. Something changed, or something isn't being installed properly. The compilation seems to fail with these errors:~~

./configure: line 19641: AX_CXX_HEADER_TR1_UNORDERED_MAP: command not found
./configure: line 19642: syntax error near unexpected token `noext,'
./configure: line 19642: `AX_CXX_COMPILE_STDCXX_11(noext, optional)'

~~g++ wasn't installed, but even after that was added it still wouldn't work.~~

Update: Unfortunately, I've lost track of other dependencies involved - at some point, I'll make a list of all the stuff I've installed while working on this project. Had to install libtool (or similar?) to get here. Mental note:

libtoolize:   error: Failed to create 'build-aux'

But, that's because I'm trying to do this on a different Mint installation from my usual - on my default workstation, that dependency is installed (no idea what it is, except that it's probably listed somewhere on this blog).

After installing the extra dependency, the installation works! So this is a viable avenue thus far to get the LM working. I've already made it past where I need the MITLM, though, so I'm going to let it be for now. Might have to come back for it.

SRILM

Ok, let's see what SRILM has to offer us. It's more inconvenient to install; ya have to go through a license agreement to download it, so I can't just stick a bash command here.

...unless I put the code on my github. In which case, it's easy to get a copy of. Too bad there's too many files to put up an extracted version, and too bad the compressed version is more than 25mb. Time to split up the tar.gz file again; for my own records, here's how I split it. All I need for getting and using it is the reconstruction bit.

The splitting part, given the archive file:

split -b 24m -d srilm-1.7.1.tar.gz srilm-1.7.1.tar.gz.part-

Alright. Once the file is on github, it's just more copy-pasting.

cd ~/tools
git clone https://github.com/umhau/srilm.git
cd ./srilm
cat srilm-1.7.1.tar.gz.part-* | tar -xz

By the way, WOW. The installation process for this software is not straightforward. See the install file for the instructions on installation - read for background, then copy-paste below as usual.

gedit ./INSTALL

Step 2 - swap out the SRILM variable for one delimiting the root directory of the package. Source.

sed -i '7s#.*#SRILM = ~/tools/srilm#' ./Makefile

For now, assuming that the variables are all good. I don't know if I want maximum entropy models, though it sounds useful...I'll see what happens if I don't prep them.

Installing John Ousterhout's TCL toolkit - we're past the required v7.3, and up to 8.6: hope this still works. I'm compiling from source rather than using the available binaries 'cause they come with some kind of non-commercial/education license, which I don't like being tied down by.

cd ~/tools
git clone https://github.com/umhau/tcl-tk.git
cd ./tcl-tk
gunzip < tcl8.6.6-src.tar.gz | tar xvf -
gunzip < tk8.6.6-src.tar.gz | tar xvf -

Install TCL:

cd tcl8.6.6/unix
# chmod +x configure
configure --enable-threads
make -j 3
make test 
sudo make -j 3 install

Let's try running the rest without the TK stuff...even though John says it's needed. Heh. Leeeroooy Jenkins!

cd ../../../srilm
make World

...aaaaaaaand, Fail.

This is going nowhere fast. We're in dependency hell. Let's try the perl script CMU uses (it's the backend to the online service they officially reference).

The Perl Script

Thankfully, Mint comes with perl installed. So, the question is how to use the script.

cd ~/tools
mkdir ./CMU_LMtool && cd ./CMU_LMtool
wget http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl

The only thing left here is to figure out how to use the script...having never used perl, this could be interesting. Dug this nugget out of the script:

usage: quick_lm -s <sentence_file> [-w <word_file>] [-d discount]

So, the idea with the LMtool is to process sentences that the decoder should recognize - it doesn't need to be an exhaustive list, however, because the decoder will allow fragments to recombine in the detection phase. As a corpus example (from the CMU website), here's the following:

THIS IS AN EXAMPLE SENTENCE
EACH LINE IS SOMETHING THAT YOU'D WANT YOUR SYSTEM TO RECOGNIZE
ACRONYMS PRONOUNCED AS LETTERS ARE BEST ENTERED AS A T_L_A
NUMBERS AND ABBREVIATIONS OUGHT TO BE SPELLED OUT FOR EXAMPLE
TWO HUNDRED SIXTY THREE ET CETERA
YOU CAN UPLOAD A FEW THOUSAND SENTENCES
BUT THERE IS A LIMIT

We'll use this sentence collection to test the perl script:

cd ~/tools/CMU_LMtool
wget https://raw.githubusercontent.com/umhau/misc-LMtools/master/ex-corpus.txt
perl quick_lm.pl -s ex-corpus.txt

Well, it did exactly nothing. No terminal output, no new files created in the directory, and no errors. Time to search the script for other possible output locations. How weird can it be?

...

Ok, solved the problem. Thank goodness for auto highlighting in Gedit. The authors used some kind of weird system for comments that I'm guessing was retired since this script was written. It seems to have been throwing the compiler for a loop:

=POD
/*
[some text wrapped by those comment markers]
*/
[more text, only wrapped by the '=' things]
=END

So, I re-commented all the introductory stuff, and put the fixed version in the github repo.

Summary of the Perl script

So, here's how it works: download the fixed script, give it a sentence list, and run the command. Simple. And, looking at the output, the function it performs is pretty simple too. Makes a list of all the 1, 2 and 3 - word groupings in the list.

Here's what to do:

mkdir ~/tools/CMU_LMtool && cd ~/tools/CMU_LMtool
wget https://raw.githubusercontent.com/umhau/misc-LMtools/master/ex-corpus.txt
wget https://raw.githubusercontent.com/umhau/misc-LMtools/master/quick_lm.pl
perl quick_lm.pl -s ex-corpus.txt

Still not sure what that does for me, but I have my LM!

Notes: I think the word list option in the command refers to the possibility of a limited vocabulary...not sure how that relates to words outside that list used in the sentence list. The discount in the command, however, is fixed at 0.5. Apparently Greg and Ben did some experiments to discover that's definitely the optimal setting.

Second Note: based on readings from the CMU website, this LM isn't good for much more than command-and-control - it can successfully detect short phrases accurately, but not long, drawn-out sentences. So it'll be good for most of what I want, but anything complex will need to be done with the CMULMTK package.

Hold on - the [-w <word_file>] option for a dictionary might be a request for output - not an extra input. And given that I do need an explicit dictionary for transcription, that's probably what it does. That would be wonderful. I can even use that sentence list for voice training - which would be a fabulous way to ensure accuracy.

Unfortunately, that's not the case. Oh, well.

The official CMU Statistical Language Model toolkit

Ok, maybe this'll do it for me. Here's the link to the source. The Perl script doesn't make all the different files I need - especially the pronunciation dictionary.

mkdir ./tools/CMUSLM
cd ./tools/CMUSLM
wget http://www.speech.cs.cmu.edu/SLM/CMU-Cam_Toolkit_v2.tar.gz
gunzip < CMU-Cam_Toolkit_v2.tar.gz | tar xv
cd ./CMU-Cam_Toolkit_v2

Wow, this is old. You have to uncomment something if your computer isn't running HP-UX, IRIX, SunOS, or Solaris. I'm pretty sure anything build in this decade needs uncomment, but if you're unsure the README mentions a script you can run to check for yourself:

bash endian.sh

Ok, uncomment:

sed -i '37s/#//' ./src/Makefile
cd src
make install

Hard to tell if this was successful. I get the impression watching this compile that it was written in the 80s, and updated for compatibility with something advertising a max capacity of 512 Mb of random access memory.

Time to dive into the html documentation, and figure out usage. The goal is to create the LM and DIC files - and a nice perk would be the other stuff produced by the online LM generator.

Turns out, there doesn't seem to be any kind of pronunciation dictionary produced by this tool. So it's no good.

The Logios Package

This seems to be the tool CMU claims was actually used in their website - and, indeed, some of their tools within the package are designed for use in a webform. So I might be on the right track. The only problem is, the input is not a list of sentences: it's a grammar file built by the Phoenix tool. No idea what that is or how it works.

CMU, get your act together! The website is nice, but I've got no recourse if it goes down. I want an independent system!

Here goes. Goal: LM and DIC files. Starting point: list of sentences.

Download the package. Even this isn't user-friendly - the folder structure is in html. I used wget recursively to download the webpages. See here for source on the command.

CMUDict

Actually, it seems like I could just use the dictionary directly. The whole problem is one of how to get the entries from this file into a subset file that holds just what I want - so I'll just write a small script to do just that. What a pain.

wget http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40

I'll post the script soon - it's being added to a larger package that should make the process of getting a personal language model pretty painless. That'd be nice.

the notebook

memories of bygone projects

categories