[shameless copy] Offline Language Model Creation for PocketSphinx
Normally, I'd be writing these myself. But this time, the explanation was so unusually good that I don't feel the need to simplify it. It's fantastic for my purposes as-is. Source.
The purpose here is to create the statistical language model that pocketsphinx uses to convert phonetics into words. The model is based entirely on what type of sentences it expects to encounter, as defined by the input reference text.
I need this running as a self-contained script in order to make language model generation a seamless part of my project. All the user should have to do is provide a ready-made reference text, and the script should generate the rest.
ARPA model training with CMUCLMTK
You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.
The process for creating a language model is as follows:
1) Prepare a reference text that will be used to generate the language model. The language model toolkit expects its input to be in the form of normalized text files, with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be a set of sentences bounded by the start and end sentence markers, <s> and </s>. Here's an example:
<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>
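If the raw text starts out as plain sentences, one per line, the markers can be added with standard tools. The following is only a rough sketch, not part of the toolkit: the function name wrap_utterances is made up for illustration, it assumes ASCII English input, and it lowercases everything on the assumption that the pronunciation dictionary is lowercase. It does not spell out digits or abbreviations, so those would still need handling by hand.

```shell
# wrap_utterances (hypothetical helper): read one sentence per line on stdin
# and write normalized lines wrapped in <s> ... </s> on stdout.
wrap_utterances() {
    # Lowercase first (assumption: the dictionary uses lowercase words).
    tr '[:upper:]' '[:lower:]' |
    sed -e "s/[^a-z0-9' ]/ /g" \
        -e 's/  */ /g' \
        -e 's/^ //' \
        -e 's/ $//' \
        -e '/^$/d' \
        -e 's/.*/<s> & <\/s>/'
    # The sed steps, in order: turn punctuation into spaces, collapse runs of
    # spaces, trim both ends, drop empty lines, then add the sentence markers.
}
```

Something like wrap_utterances < raw.txt > weather.txt would then produce a file in the shape shown above, where raw.txt is a placeholder name for the unmarked input.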
More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.
2) Generate the vocabulary file. This is a list of all the words in the file:
text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab
3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names). If you find misspellings, it is a good idea to fix them in the input transcript. Save the edited list as weather.vocab, which is the file name the later commands use.
4) If you want a closed vocabulary language model (a language model that has no provisions for unknown words), then you should remove sentences from your input transcript that contain words that are not in your vocabulary file. Save the filtered transcript as weather.closed.txt, which is the file the next step reads.
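Removing those sentences by hand gets tedious on a large transcript, so here is a hedged sketch of doing it with awk. The function name keep_closed is hypothetical. It assumes one utterance per line, skips vocabulary lines that start with # (treated here as comments), and always accepts the <s> and </s> markers even if they were edited out of the vocabulary file.

```shell
# keep_closed VOCAB (hypothetical helper): read utterances on stdin and print
# only those whose words all appear in the vocabulary file VOCAB.
keep_closed() {
    awk '
        NR == FNR {
            # First input (the vocabulary file): remember each word,
            # ignoring blank lines and lines starting with "#".
            if ($0 !~ /^#/ && NF > 0) vocab[$1] = 1
            next
        }
        {
            # Second input (stdin): drop the line on the first unknown word.
            for (i = 1; i <= NF; i++)
                if ($i != "<s>" && $i != "</s>" && !($i in vocab)) next
            print
        }
    ' "$1" -
}
```

With the file names used in this walkthrough, that would be run as keep_closed weather.vocab < weather.txt > weather.closed.txt.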
5) Generate the ARPA format language model with the commands:
text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt
idngram2lm -vocab_type 0 -idngram weather.idngram -vocab \
weather.vocab -arpa weather.lm
6) Generate the CMU binary form (BIN):
sphinx_lm_convert -i weather.lm -o weather.lm.bin
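Since the goal is a self-contained script, the commands above can be chained into one shell function. This is a minimal sketch under some assumptions: the name build_lm and its two arguments are my own, the reference text is assumed to be already cleaned up, and the manual vocabulary editing and sentence filtering steps are skipped, so the generated vocabulary is used exactly as it comes out. Only the commands and flags already shown above are used.

```shell
# build_lm REFERENCE_TEXT [OUTPUT_DIR] (hypothetical wrapper): run the
# vocabulary, ARPA, and binary conversion steps on one reference text.
build_lm() {
    ref=$1
    out_dir=${2:-.}
    # Name the output files after the input, e.g. weather.txt -> weather.lm
    base=$out_dir/$(basename "$ref" .txt)

    # Fail early if any of the required tools cannot be found.
    for tool in text2wfreq wfreq2vocab text2idngram idngram2lm sphinx_lm_convert; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 1
        }
    done

    # Each step runs only if the previous one succeeded.
    mkdir -p "$out_dir" &&
    text2wfreq < "$ref" | wfreq2vocab > "$base.vocab" &&
    text2idngram -vocab "$base.vocab" -idngram "$base.idngram" < "$ref" &&
    idngram2lm -vocab_type 0 -idngram "$base.idngram" \
        -vocab "$base.vocab" -arpa "$base.lm" &&
    sphinx_lm_convert -i "$base.lm" -o "$base.lm.bin"
}
```

Calling build_lm weather.txt models would leave weather.vocab, weather.idngram, weather.lm and weather.lm.bin in the models directory. Chaining with && means a failure in any tool stops the run instead of producing a half-built model.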
The CMUCLMTK tools and commands are documented at The CMU-Cambridge Language Modeling Toolkit page.