Most of the following installations have to be performed on each computer. I didn't re-download everything, since it was going to be put in the same place, but I did cd in and re-run the installation procedure. That ensured the necessary files were added to all the right places elsewhere in the system.
Here, I'm walking through the process of running Torch on a cluster. CPUs, not GPUs. The performance benefit comes from the slave nodes being allowed greater latitude in searching for local optima to 'solve' the neural net. Every so often, they 'touch base' with the master node and synchronize the result of their computations. Read the abstract of Sixin Zhang's paper to get a more detailed idea of what's happening. As far as the implementation goes, "the idea is to transform the torch data structure (tensor, table etc) into a storage (contiguous in memory) and then send/recv [sic] it." src.
Background Sources
Keep track of where I found the info I used to figure this out.
Set up MPI (beowulf cluster)
Follow the instructions in these two posts first. They get you to the point of a working cluster, starting from a collection of unused PCs and the relevant hardware.
https://nixingaround.blogspot.com/2017/01/a-homebrew-beowulf-cluster-part-1.html
prevent SSH from losing connection
I ran into trouble here: I was using SSH over the same wires that carried the cluster's MPI traffic, and I kept losing the connection after starting long computations. This may not affect you, so don't bother with any of this unless you hit that sort of problem.
Rather than fighting the connection itself, the better fix is to start a virtual terminal and run the computations inside it. When the connection is inevitably dropped, just log back in and recover that terminal.
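A minimal sketch of that workflow, assuming `screen` is available through apt (`tmux` works just as well; the session name `torch-job` is my own made-up example):

```shell
# Install screen on the node you SSH into (assumption: Ubuntu/Debian).
sudo apt-get install -y screen

# Start a named session and launch the long-running job inside it.
screen -S torch-job
# ...run your computation here, then detach with Ctrl-A d...

# After the SSH connection drops, log back in and reattach.
screen -r torch-job
```

The job keeps running inside the detached session even while no SSH connection exists.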
Install Torch
Note: it may be useful to install the MKL library ahead of Torch. It accelerates the math routines that I assume will be present in the computations I'm going to perform.
This provides dependencies needed to install the mpiT package that lets Torch7 work with MPI. Start in the breca home directory. On the master node, run the following.
cd
Then, on all nodes (master and slave), run the following from the breca account:
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies. Leaving this note here in case of future problems.]
After the install script finished running, it reported that it had not updated my shell profile. So we're adding a line to the ~/.profile script ourselves. (We use that, and not the bashrc file, because bash isn't automatically run when logging on to the breca account. If I ever forget and try to use Torch without bash, this avoids the problems that would cause.)
Do the following on all nodes:
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the line you added is executed.
source ~/.profile
Installing this way allows you to download the package only once, but use it to install the software to all nodes in the cluster. (And as a side note, the install-deps script doesn't detect Linux Mint - that's one of the reasons this walk-through uses Ubuntu Server.)
Test that Torch has been installed:
th
Close the program.
MPI compatibility with Torch
Source: https://github.com/sixin-zh/mpiT
Do this on the master node. You'll be able to access the downloaded files from all the nodes - they're going in the /mirror directory. Download from github and install.
cd ~/
Now do the rest of the steps on all the nodes, master and slave.
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
By default, MPI_PREFIX should be set to /usr.
export MPI_PREFIX="/usr"
Make the setting permanent:
echo "export MPI_PREFIX='/usr'" >> ~/.profile
Since I'm working with MPICH rather than OpenMPI (see the cluster installation notes above), build the MVAPICH flavor of the rockspec:
luarocks make mpit-mvapich-1.rockspec
First, figure out how many processors you have. You did already; that's the sum of the numbers in your machinefile in the /mirror directory. We'll say you have 12 cores. Since our counting starts at 0, tell the computer you have 11. Adjust according to your actual situation.
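As a quick sketch of that arithmetic: an MPICH machinefile lists one `host:slots` entry per line, so summing the slot counts gives the total processor count. The file path and host names below are made-up examples - substitute your own machinefile in /mirror.

```shell
# Example machinefile in MPICH's host:slots format (hypothetical hosts).
cat > /tmp/machinefile.example <<'EOF'
master:4
slave1:4
slave2:4
EOF

# Sum the slot counts; a bare hostname with no :slots suffix counts as 1.
awk -F: '{ total += ($2 == "" ? 1 : $2) } END { print total }' /tmp/machinefile.example
# prints 12
```

With 12 slots total, the walkthrough above passes 11 to mpirun's -np flag.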
Next, use a bunch of terminals and log into each of your nodes simultaneously. Install:
sudo apt-get install htop
htop
Watch the CPU usage on each machine as you perform the following tests. If only the master node shows activity, you have a problem.
Create ~/data/torch7 under the home directory, then download the test data to that location. Make sure you're logged in as the MPI user.
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
Now run the tests. Sanity check: did mpiT install successfully? Note: I ran into an 'error 75' at this point; the solution was to spell out the locations of the files involved as absolute paths from the root directory. (And this is as far as I've actually made it without errors - up to this point, barring abnormalities in the PCs used, everything works perfectly for me.)
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working. Move down to the folder with the asynchronous algorithms.
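If that test misbehaves, it can help to first confirm that plain MPI process launch works across the nodes, independent of Torch. A common sanity check (the machinefile path follows this walkthrough's layout; adjust to yours) is to launch a trivial program everywhere:

```shell
# Each launched process prints the hostname of the node it landed on,
# so the output should name every machine in the machinefile.
mpirun -np 3 -f /mirror/machinefile hostname
```

If only the master's hostname appears, MPI isn't actually reaching the slave nodes, and no Torch-level test will work either.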
cd asyncsgd
I think this test only needs to run on the master node - as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere. I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd (presumably because the processes are paired up for send/receive). I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node. As long as it works...? It seems to be checking the bandwidth through the cluster. There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua
Try parallel MNIST training - this is the one that should tell you what's up. AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned. If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua. In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten character collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
Install Word RNN
Clone the software from github.
mkdir ~/projects && cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it. Now cd into the word-rnn directory to run the test stuff. Before the tests and tools, though, there's a fix that you have to perform.