Setting up Disco

This document helps you install Disco on either a single server or a cluster of servers. Installation requires several packages and may not be entirely straightforward. If you want to get something working quickly, consider trying Disco on Amazon EC2 (How to set up Disco on Amazon EC2), which requires no configuration on your side.

Shortcut for Debian / Ubuntu users: If you run Debian testing or a recent version of Ubuntu on the AMD64 architecture, you can try our experimental deb packages, which are available on the Disco download page. If you manage to install the packages, you can skip steps 0-3 below and go directly to 4. Configure authentication.

Background

Take a quick look at the Technical Overview before setting up the system, to get an idea of what should go where and why. To make a long story short, Disco works as follows:

  • Disco users start Disco jobs in Python scripts.
  • Job requests are sent over HTTP to the master.
  • The master is an Erlang process that receives requests over HTTP.
  • The master launches another Erlang process, the worker supervisor, on each node over SSH.
  • Worker supervisors run Disco jobs as Python processes.

In the following steps we set up SSH, Erlang, and Python to work with Disco.

0. Prerequisites

You need at least one Linux/Unix server. Any distribution should work, and so should Mac OS X.

On each server the following applications / libraries are required:

Optionally, DISCO_PROXY needs

1. Install Disco

Download the latest Disco package from discoproject.org. Alternatively you can download the latest development snapshot from GitHub.

Extract the package (if necessary) and cd into it. We will refer to this directory as DISCO_HOME.

If you want to install Disco locally, just run make:

make

This is often the easiest and the least intrusive way to get started with Disco.

Repeat the above command on every server that belongs to your Disco cluster. Note that Disco must be found at the same path on all the servers. Alternatively, you can use a shared home directory (for example over NFS) on all the nodes, which makes development much more straightforward.

To install system-wide, run make install as root:

make install

This will build and install Disco to your system (see Makefile for exact directories).

Note

make install installs a configuration file to /etc/disco/settings.py that is tuned for clusters, not a single machine.

By default, the settings assume that you have at least three nodes in your cluster, so DDFS can use three-way replication. If you have fewer nodes, you need to lower the number of replicas in /etc/disco/settings.py:

DDFS_TAG_MIN_REPLICAS=1
DDFS_TAG_REPLICAS=1
DDFS_BLOB_REPLICAS=1

See disco.settings for more information.
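As a rough illustration of the rule above, the replica counts can be capped at the number of nodes you actually have. The helper below is hypothetical and not part of Disco; the default of 3 is an assumption taken from the three-way replication mentioned above.

```python
# Hypothetical helper, not part of Disco: cap every replica count at the
# number of nodes in the cluster.
DEFAULT_REPLICAS = 3  # assumption: the three-way replication described above


def replica_settings(num_nodes, default=DEFAULT_REPLICAS):
    """Return replica settings that never exceed the cluster size."""
    if num_nodes < 1:
        raise ValueError("a Disco cluster needs at least one node")
    replicas = min(default, num_nodes)
    return {
        "DDFS_TAG_MIN_REPLICAS": replicas,
        "DDFS_TAG_REPLICAS": replicas,
        "DDFS_BLOB_REPLICAS": replicas,
    }


def as_settings_lines(settings):
    """Format the settings as NAME=value lines, sorted by name."""
    return ["%s=%d" % (name, value) for name, value in sorted(settings.items())]
```

Calling `as_settings_lines(replica_settings(1))` yields lines equivalent to the single-node values shown above, which you could then copy into /etc/disco/settings.py by hand.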

2. Prepare the runtime environment

Next we need to perform the following tasks on all servers that belong to the Disco cluster:

  • Create disco user (optional).
  • Check that the settings in settings.py are correct.

Often it is convenient to run Disco as a separate user. Amongst other reasons, this allows setting user-specific resource utilization limits for the Disco user (through limits.conf or similar mechanism). However, you can use any account for running Disco. In the following, we refer to the user that runs disco-master as the Disco user.

Open DISCO_HOME/conf/settings.py. This file sets a number of environment variables that define the runtime environment for Disco. Most likely you do not need to modify this file right away. You can change the paths if the defaults are not suitable for your system. See disco.settings for more information on the various settings and their default values.

3. Start Disco

Disco now uses a streamlined command-line interface (see discocli). On the master node, start the Disco master by executing disco start.

You can easily integrate Disco into your system’s startup sequence. For instance, you can see how debian/disco-master.init and debian/disco-node.init are implemented in Disco’s debian branch.

If Disco has started up properly, you should see beam.smp running on your master node.
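The check for beam.smp can be scripted. The following is a minimal sketch, assuming a POSIX `ps` is available; the function names are made up for illustration and are not part of Disco.

```python
import subprocess


def beam_running(ps_output):
    """Return True if any line of the given `ps` output mentions beam.smp."""
    return any("beam.smp" in line for line in ps_output.splitlines())


def master_looks_alive():
    """List the command names of all processes and look for beam.smp."""
    # `ps -A -o comm=` prints one command name per process, without a header.
    output = subprocess.check_output(["ps", "-A", "-o", "comm="])
    return beam_running(output.decode("utf-8", "replace"))
```

Note that this only shows that some Erlang VM is running on the machine; it does not prove that the process is the Disco master.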

An easy way to test if Disco is starting up properly is to run disco nodaemon instead of disco start. This will start the master node and bring you right to its Erlang shell, without redirecting the log to a file.

4. Configure authentication

Next we need to enable passwordless login via ssh to all servers in the Disco cluster. If you have only one machine, you need to enable passwordless login to localhost for the Disco user.

Run the following command as the Disco user, assuming that it does not already have valid SSH keys:

ssh-keygen -N '' -f ~/.ssh/id_dsa

If you have one server (or shared home directories), say:

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Otherwise, repeat the following command for all the servers nodeX in the cluster:

ssh-copy-id nodeX

Now try to log in to all servers in the cluster, or to localhost if you have only one machine. After the first login attempt, you should not need to give a password or answer any questions.
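Checking the login on many servers by hand gets tedious, so here is a small sketch that automates it. It relies on the OpenSSH option `BatchMode=yes`, which makes ssh fail instead of prompting. The function names and the host list are hypothetical.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh command that runs `true` on host without ever prompting."""
    # BatchMode=yes disables password and passphrase prompts, so a host that
    # still needs interaction makes ssh exit with a non-zero status.
    return ["ssh", "-o", "BatchMode=yes", host, "true"]


def can_login(host):
    """Return True if a non-interactive login to host succeeds."""
    return subprocess.call(ssh_check_command(host)) == 0


def failing_hosts(hosts, check=can_login):
    """Return the hosts for which the login check fails."""
    return [host for host in hosts if not check(host)]
```

For example, `failing_hosts(["localhost"])` should return an empty list on a single-machine setup once the key has been added to authorized_keys and the host key has been accepted.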

As the last step, if you run Disco on many machines, you need to make sure that all servers in the Disco cluster use the same Erlang cookie, which is used for authentication between Erlang nodes. Run the following command as the Disco user on the master server:

scp ~/.erlang.cookie nodeX:

Repeat the command for all the servers nodeX.
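If the cluster has many servers, the repeated commands can be generated rather than typed. This sketch only builds the command strings so that you can review them before running anything; the node names in the usage note are placeholders.

```python
def cookie_copy_commands(nodes, cookie="~/.erlang.cookie"):
    """Return one scp command per node, copying the cookie to its home directory."""
    return ["scp %s %s:" % (cookie, node) for node in nodes]
```

For instance, `cookie_copy_commands(["node1", "node2"])` returns the two commands `scp ~/.erlang.cookie node1:` and `scp ~/.erlang.cookie node2:`, which you would run as the Disco user on the master server.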

5. Add nodes to Disco

At this point you should have Disco up and running. The final step before testing the system is to specify which servers are available to Disco. This is done in Disco’s web interface.

Point your browser at http://master:<DISCO_PORT>, where master should be replaced with the actual hostname of your machine or localhost if you run Disco locally or through an SSH tunnel. The default port is 8989.
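Before opening a browser, you may want to confirm that the web interface answers at all. The sketch below builds the URL from a host and port and tries a plain HTTP request; the helper names are hypothetical, and 8989 is the default port mentioned above.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

DEFAULT_DISCO_PORT = 8989  # default port stated in this document


def master_url(host="localhost", port=DEFAULT_DISCO_PORT):
    """Return the address of the web interface for the given master host."""
    return "http://%s:%d" % (host, port)


def master_reachable(host="localhost", port=DEFAULT_DISCO_PORT, timeout=5):
    """Return True if an HTTP request to the master gets a response."""
    try:
        urlopen(master_url(host, port), timeout=timeout).close()
        return True
    except IOError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

A `False` result usually means that the master is not running, the hostname is wrong, or the port is blocked between you and the master.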

You should see the Disco main screen (see a screenshot). Click configure on the right side of the page. On the configuration page, click add row to add a new set of available nodes. Click the cells of the new empty row, and enter the hostname of an available server (or a range of hostnames, see below) in the left cell and the number of available cores (CPUs) on that server in the right cell. Once you have entered a value, click the cell again to save it.

You can add as many rows as needed to fully specify your cluster, which may have a varying number of cores on different nodes. Click save table when you are done.

If you have only a single machine, the resulting table should look like this, assuming that you have two cores available for Disco:

[Screenshot: configuration table for a single machine (../_images/config-localhost.png)]

If you run Disco in a cluster, you can specify multiple nodes on a single line, if the nodes are named with a common prefix, as here:

[Screenshot: configuration table for a cluster (../_images/config-cluster.png)]

This table specifies that there are 30 nodes available in the cluster, from nx01 to nx30, and that each node has 8 cores.
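To make the meaning of such a range concrete, the sketch below expands a common prefix and a numeric range into individual hostnames. It illustrates the naming scheme only and does not reproduce the syntax accepted by the web interface; `expand_nodes` is a hypothetical helper.

```python
def expand_nodes(prefix, first, last, width=2):
    """Return hostnames prefix + zero-padded number, from first to last inclusive."""
    return ["%s%0*d" % (prefix, width, number) for number in range(first, last + 1)]


# The cluster described above: nx01 to nx30 with 8 cores each.
nodes = expand_nodes("nx", 1, 30)
cores_per_node = 8
total_cores = len(nodes) * cores_per_node
```

Here `nodes` starts with `nx01` and ends with `nx30`, and `total_cores` is 240.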

6. Test the system

Now Disco should be ready for use.

We can use the following simple Disco script that computes word frequencies in a text file to see that the system works correctly.

import sys
from disco.core import Disco, result_iterator
from disco.settings import DiscoSettings

def map(entry, params):
    for word in entry.split():
        yield word, 1

def reduce(iter, out, params):
    s = {}
    for word, freq in iter:
        s[word] = s.get(word, 0) + int(freq)
    for word, freq in s.iteritems():
        out.add(word, freq)

disco = Disco(DiscoSettings()['DISCO_MASTER'])
print "Starting Disco job.."
print "Go to %s to see status of the job." % disco.master
results = disco.new_job(name="wordcount",
                   input=["http://discoproject.org/chekhov.txt"],
                   map=map,
                   reduce=reduce).wait()
print "Job done. Results:"
for word, freq in result_iterator(results):
    print word, freq

Run the script as follows from DISCO_HOME:

python examples/utils/count_words.py
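If you want to know what output to expect, the same map and reduce logic can be exercised locally without a cluster. This is only a sketch: `ListOutput` is a hypothetical stand-in that mimics nothing but the `add` method of the `out` object, and the functions are renamed to avoid shadowing Python built-ins.

```python
def wc_map(entry, params):
    """Emit (word, 1) for every whitespace-separated word in one input line."""
    for word in entry.split():
        yield word, 1


class ListOutput(object):
    """Hypothetical stand-in for Disco's output object; collects added pairs."""

    def __init__(self):
        self.pairs = []

    def add(self, key, value):
        self.pairs.append((key, value))


def wc_reduce(pairs, out, params):
    """Sum the counts per word and write one (word, total) pair per word."""
    totals = {}
    for word, freq in pairs:
        totals[word] = totals.get(word, 0) + int(freq)
    for word, freq in totals.items():
        out.add(word, freq)


def local_wordcount(lines):
    """Run the map step over all lines, then the reduce step, in one process."""
    pairs = (pair for line in lines for pair in wc_map(line, None))
    out = ListOutput()
    wc_reduce(pairs, out, None)
    return dict(out.pairs)
```

For example, `local_wordcount(["a b a", "b c"])` counts `a` twice, `b` twice, and `c` once. A real Disco job distributes the map calls across the nodes instead of running them in one loop.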

Disco attempts to use the current hostname as DISCO_MASTER_HOST if it is not defined in any settings file.
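The fallback described above can be pictured roughly as follows. This is not Disco's actual lookup code: it ignores settings files and treats DISCO_MASTER_HOST as an environment variable, which is an assumption made for this illustration.

```python
import os
import socket


def master_host(environ=None):
    """Return DISCO_MASTER_HOST if it is set, otherwise the current hostname."""
    environ = os.environ if environ is None else environ
    return environ.get("DISCO_MASTER_HOST") or socket.gethostname()
```

If the hostname reported by your machine is not reachable from where you run the script, setting DISCO_MASTER_HOST explicitly avoids the fallback.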

If you are running Disco on multiple machines, you must use the same version of Python for running Disco scripts as you use on the server side.

You can run the script on any machine that can access Disco on the configured DISCO_MASTER_HOST. The safest bet is to run the script on the master node itself.

If the machine where you run the script can access the master node but not the other nodes in the cluster, you need to set the environment variable DISCO_PROXY=http://master:8989. The proxy address should be the same as the master’s address above. This makes Disco fetch results through the master node instead of connecting to the nodes directly.
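One way to set the variable for a single run, without changing your shell configuration, is to pass a modified environment to the script. The helper below is a hypothetical sketch; the host name `master` and the port 8989 are the placeholder values used above.

```python
import os


def proxied_environment(master="master", port=8989, base=None):
    """Return a copy of the environment with DISCO_PROXY pointing at the master."""
    env = dict(os.environ if base is None else base)
    env["DISCO_PROXY"] = "http://%s:%d" % (master, port)
    return env


# Possible usage, run from DISCO_HOME:
#   import subprocess
#   subprocess.call(["python", "examples/utils/count_words.py"],
#                   env=proxied_environment())
```

Because the helper copies the environment, the variable is visible only to the child process and your own session stays unchanged.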

If the script produces some results, congratulations, you have a working Disco setup! If you are new to Disco, you might want to read Tutorial next.

If the script fails, see the section about Troubleshooting Disco installation.