What is Disco?
Disco is an open-source, large-scale data analysis platform. The platform includes an implementation of MapReduce, among other things. Like the original MapReduce framework, Disco supports parallel computations over massive data sets, running on an unreliable cluster of computers.
The Disco core is written in Erlang, a functional language designed for building robust, fault-tolerant, distributed applications. Users of Disco typically write jobs in Python, making it possible to express even complex algorithms in only tens of lines of code. This means that you can rapidly develop programs to process massive amounts of data.
Disco was started at Nokia Research Center as a lightweight framework for distributed data processing. Disco has been successfully used at Nokia and elsewhere for a variety of purposes: parsing, reformatting, log analysis, clustering, probabilistic modelling, data mining, full-text indexing, and machine learning. With Disco, all of these tasks can be performed as easily with terabytes of data as with only a few megabytes.
Efficient, data-locality-preserving IO, either over HTTP, using any POSIX-compatible filesystem, or the built-in Disco Distributed Filesystem.
Supports profiling and debugging of MapReduce jobs.
Random access to data and auxiliary results through out-of-band results.
Run jobs written in any language using the worker protocol.
Build and query indices with billions of keys and values, using Discodex.
...and more! See the documentation for details.
How to get started
Learn about Disco by reading the documentation. Once you are ready to give it a try, follow the setup instructions. You don't need a cluster to run it - any multi-core machine can benefit from Disco. Currently, Disco runs on Linux and Mac OS X.
from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    input = ["http://discoproject.org/media/text/chekhov.txt"]
    job = Job().run(input=input, map=map, reduce=reduce)
    for word, count in result_iterator(job.wait()):
        print word, count
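The reduce step above relies on disco.util.kvgroup, which takes an iterator of (key, value) pairs sorted by key and groups consecutive pairs that share the same key. As a rough sketch of that behavior (the real kvgroup implementation may differ), the same grouping can be expressed with the standard library's itertools.groupby:

```python
from itertools import groupby
from operator import itemgetter

def kvgroup(kviter):
    # Group consecutive (key, value) pairs that share a key, yielding
    # (key, iterator-of-values) pairs. The input must be sorted by key,
    # which is why the example calls sorted(iter) before grouping.
    for key, group in groupby(kviter, key=itemgetter(0)):
        yield key, (value for _, value in group)

# Word-count style usage: sum the grouped values per key.
pairs = sorted([("to", 1), ("be", 1), ("or", 1),
                ("not", 1), ("to", 1), ("be", 1)])
counts = {word: sum(vals) for word, vals in kvgroup(pairs)}
# counts == {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Sorting first matters: like itertools.groupby, this grouping only merges adjacent runs, so an unsorted input would yield the same key more than once.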