What is Disco?
Disco is an open-source, large-scale data analysis platform. Among other things, the platform includes an implementation of MapReduce. Like the original framework, Disco supports parallel computations over massive data sets, running on an unreliable cluster of computers.
The Disco core is written in Erlang, a functional language designed for building robust, fault-tolerant, distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms in only tens of lines of code. This means that you can rapidly develop programs to process massive amounts of data.
Disco was started at Nokia Research Center as a lightweight framework for distributed data processing. Disco has been successfully used at Nokia and elsewhere for a variety of purposes: parsing, reformatting, log analysis, clustering, probabilistic modelling, data mining, full-text indexing, and machine learning. With Disco, all of these tasks can be performed as easily with terabytes of data as with only a few megabytes.
Highlights
- Efficient data-locality-preserving IO over HTTP, any POSIX-compatible filesystem, or the built-in Disco Distributed Filesystem.
- Supports profiling and debugging of MapReduce jobs.
- Random access to data and auxiliary results through out-of-band results.
- Run jobs written in any language using the worker protocol.
- Build and query indices with billions of keys and values, using Discodex.
...and more! See the documentation for details.
How to get started
Learn about Disco by reading the documentation. Once you are ready to give it a try, follow the setup instructions. You don't need a cluster to run it: any multi-core machine can benefit from Disco. Disco currently runs on Linux and Mac OS X.
Need help with Disco? You can reach us on our IRC channel, #discoproject on Freenode, or on the Disco discussion group.
Get involved
Clone your own Disco repository on GitHub and join our mailing list and IRC channel. Even if you don't want to dive into Erlang, Python, or JavaScript code, you can help us by giving feedback!
from disco.core import Job, result_iterator

def map(line, params):
    # Emit (word, 1) for every whitespace-separated word on the line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    # Sort the pairs so kvgroup can group the counts of each word together.
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    input = ["http://discoproject.org/media/text/chekhov.txt"]
    job = Job().run(input=input, map=map, reduce=reduce)
    # Wait for the job to finish, then print each word and its count.
    for word, count in result_iterator(job.wait()):
        print(word, count)
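To see what the word-count job above computes without a running Disco installation, the same map and reduce logic can be sketched in plain Python. This is only an illustration of the data flow, not how Disco executes jobs: `run_local`, `map_words`, and `reduce_counts` are hypothetical names, the input lines are made up, and `itertools.groupby` is assumed here as a stand-in for `disco.util.kvgroup`.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line, params):
    # Emit a (word, 1) pair for every whitespace-separated word.
    for word in line.split():
        yield word, 1


def reduce_counts(pairs, params):
    # groupby (a stand-in for disco.util.kvgroup) only groups adjacent
    # items, so the pairs are sorted by word first.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines, map_fn, reduce_fn, params=None):
    # Hypothetical helper, not part of Disco: run the map function over
    # every input line, then feed all emitted pairs to one reduce call.
    mapped = (pair for line in lines for pair in map_fn(line, params))
    return list(reduce_fn(mapped, params))


if __name__ == '__main__':
    sample = ["the quick brown fox", "the lazy dog", "the end"]
    for word, count in run_local(sample, map_words, reduce_counts):
        print(word, count)
```

Everything in this sketch runs in one process on in-memory lines. In a real Disco job, the inputs are split across the cluster and the map and reduce phases run in parallel.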


