This is being installed on machines running Ubuntu Server 16.04.1 LTS.  It does not work on Linux Mint (the Torch install script doesn't detect that OS).

Most of the following installations have to be performed on each computer.  I didn't re-download everything, since it was all going to the same shared location, but I did cd in and re-run the installation procedure on each node.  That ensured the necessary files were added to all the right places elsewhere in the system.

Here, I'm walking through the process of running Torch on a cluster.  CPUs, not GPUs.  The performance benefit comes from the slave nodes being allowed greater latitude in searching for local optima to 'solve' the neural net.  Every so often, they 'touch base' with the master node and synchronize the result of their computations.  Read the abstract of Sixin Zhang's paper to get a more detailed idea of what's happening.  As far as the implementation goes, "the idea is to transform the torch data structure (tensor, table etc) into a storage (contiguous in memory) and then send/recv [sic] it." src.

Background Sources

These links keep track of where I found the info I used to figure this out.

https://bbs.archlinux.org/viewtopic.php?id=159999
http://torch.ch/docs/getting-started.html
https://groups.google.com/forum/#!topic/torch7/Xs814a5_xgI

Set up MPI (Beowulf cluster)

Follow the instructions in these two posts first.  They get you to the point of a working cluster, starting from a collection of unused PCs and the relevant hardware.

https://nixingaround.blogspot.com/2017/01/a-homebrew-beowulf-cluster-part-1.html
https://nixingaround.blogspot.com/2017/01/a-homemade-beowulf-cluster-part-2.html

Prevent SSH from losing connection

I had some trouble here: I was trying to use SSH over the same wires that were providing MPI communication in the cluster, and I kept losing the connection after initializing the computations.  The workaround below may not be necessary for you, so I wouldn't bother with it unless you run into trouble of that sort.

https://nixingaround.blogspot.com/2017/01/internet-via-ethernet-ssh-via-wireless.html

OK, that's not an optimal solution.  Better to initialize a virtual terminal and run the computations in that.  When the connection is inevitably dropped, just recover that terminal (a minimal screen example follows the link below).

http://unix.stackexchange.com/questions/22781/how-to-recover-a-shell-after-a-disconnection
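For example, a minimal sketch using GNU screen (tmux works just as well; the session name here is arbitrary):
sudo apt-get install screen
screen -S torchrun
Launch your mpirun/th job inside that session.  After a dropped connection, reattach with:
screen -d -r torchrun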

Install Torch

Note: it may be useful to install the MKL library ahead of Torch.  It accelerates the math routines that I assume will be present in the computations I'm going to perform.
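If you go that route, a rough sketch (assuming you've downloaded Intel's standalone MKL installer tarball; the exact file name varies by version, and install.sh is the command-line installer Intel shipped at the time, so check your download):
tar xzf l_mkl_*.tgz
cd l_mkl_*/
sudo ./install.sh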

Installing Torch also provides dependencies needed to install the mpiT package that lets Torch7 work with MPI.  Start in the breca home directory.  On the master node, run the following.
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
Then, on all nodes (master and slave), run the following from the breca account:
cd ~/torch; bash install-deps
./install.sh
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies.  Leaving this note here in case of future problems.]
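If that ever bites you, a quick diagnostic (not something I needed, just a sanity check) is to confirm which compiler the MPICH wrapper is actually using:
mpicc --version
gcc --version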

After the install script finished running, it told me that it had not updated my shell profile.  So we're adding a line to the ~/.profile script.  (We're using that, and not .bashrc, because bash isn't automatically run when logging on to the breca account.  If I ever forget and try to use Torch without bash, this avoids the problems I could otherwise run into.)

Do the following on all nodes:
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the code you added is executed.
source ~/.profile
Installing this way allows you to download the package only once, but use it to install the software to all nodes in the cluster.  (And as a side note, the install-deps script doesn't detect Linux Mint - that's one of the reasons this walk-through uses Ubuntu Server.)

Test that Torch has been installed:
th
Close the program
exit
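If you'd rather have a non-interactive check (assuming your trepl build supports the -e flag), something like this should print a 2x2 tensor of zeros and exit:
th -e "print(torch.Tensor(2,2):zero())"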

MPI compatibility with Torch

Source: https://github.com/sixin-zh/mpiT

Do this on the master node.  You'll be able to access the downloaded files from all the nodes - they're going in the /mirror directory.  Download from GitHub and install.
cd ~/
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
Now do the rest of the steps on all the nodes, master and slave.
cd ~/tools/mpiT
By default, MPI_PREFIX should be set to /usr.  See link.
export MPI_PREFIX="/usr"
echo "export MPI_PREFIX='/usr'" >> ~/.profile
Since I'm working with MPICH rather than OpenMPI (see the cluster installation notes above), run:
luarocks make mpit-mvapich-1.rockspec
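(If your cluster had been built with OpenMPI instead, the repository also ships an OpenMPI rockspec; the file name below is what it's called as of this writing, so check the repo if it fails.)
luarocks make mpit-openmpi-1.rockspec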

Tests

First, figure out how many processors you have.  You did already; it's the sum of the counts in your machinefile in the /mirror directory.  We'll say you have 12 cores.  Note that the -np flag used below takes the number of processes to launch; in these tests I pass 11, one fewer than the core count.  Adjust according to your actual situation.
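For reference, a machinefile for a 12-core cluster might look something like this; the hostnames are made up, and the number after each colon is how many processes MPICH may start on that host:
master:4
slave1:4
slave2:4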

Next, use a bunch of terminals and log into each of your nodes simultaneously.  Install:
sudo apt-get install htop 
And run
htop
on each machine and watch the CPU usage as you perform the following tests.  If only the master node shows activity, you have a problem.  
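As a quicker spot check from the master node, you can also poll load averages over SSH (hostnames hypothetical; use the ones from your machinefile):
for h in slave1 slave2; do ssh $h uptime; done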

Create ~/data/torch7/mnist10 under the home directory, and then download the test data to that location.  Ensure you're logged in as the MPI user (breca).
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
wget http://cs.nyu.edu/~zsx/mnist10/train_32x32.th7
wget http://cs.nyu.edu/~zsx/mnist10/test_32x32.th7
cd ~/tools/mpiT
Now run the tests.  Sanity check: did mpiT install successfully?  Note: I ran into an 'error 75' at this point, and the solution was to explicitly define the location of the files involved with absolute paths from the root directory.
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working.  Move down to the folder with the asynchronous algorithms.
cd asyncsgd
I think this test only needs to run on the master node - as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere.  I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd.  I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node.  As long as it works...?  It seems to be checking the bandwidth through the cluster.  There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua 
Try parallel mnist training - this is the one that should tell you what's up.  AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned.  If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua.  In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten character collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
And this is as far as I've actually made it without errors (up to this point, barring abnormalities in the PCs used, everything has worked perfectly for me).

Install Word RNN

Clone the software from GitHub.
mkdir ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.
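Given where we cloned it above, that's:
cd ~/projects/word-rnn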