massive data - minimal code

Disco is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm.

Disco is powerful and easy to use, thanks to Python. Disco distributes and replicates your data, and schedules your jobs efficiently. Disco even includes the tools you need to index billions of data points and query them in real time.

Disco was born at Nokia Research Center in 2008 to solve real challenges in handling massive amounts of data. Since then, Disco has been actively developed by Nokia and many other companies that use it for a variety of purposes, such as log analysis, probabilistic modelling, data mining, and full-text indexing.



Disco in action


    from disco.core import Job, result_iterator

    def map(line, params):
        # Emit a (word, 1) pair for every word on the input line.
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        # kvgroup groups consecutive pairs by key, so sort the pairs first.
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        input = ["http://discoproject.org/media/text/chekhov.txt"]
        # Submit the job, wait for it to finish, then print the results.
        job = Job().run(input=input, map=map, reduce=reduce)
        for word, count in result_iterator(job.wait()):
            print("%s %s" % (word, count))

This is a fully working Disco script that computes word frequencies in a text corpus. Disco distributes the script automatically to a cluster, so it can utilize all available CPUs in parallel. For details, see the Disco tutorial.
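To see what the two functions compute without a cluster, the sketch below runs the same map and reduce logic in a single local process on a small in-memory list of lines. It is an illustration only, not how Disco executes jobs: `itertools.groupby` stands in for `disco.util.kvgroup`, and the names `word_map`, `word_reduce`, and `run_local` are hypothetical helpers that are not part of Disco.

```python
from itertools import groupby
from operator import itemgetter


def word_map(line, params):
    # Same logic as the map function in the script: one (word, 1) pair per word.
    for word in line.split():
        yield word, 1


def word_reduce(pairs, params):
    # Assumption for this sketch: groupby replaces disco.util.kvgroup.
    # It only groups consecutive pairs by key, so the pairs are sorted first.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    # Hypothetical single-process driver: no distribution, replication,
    # or scheduling happens here.
    pairs = (pair for line in lines for pair in word_map(line, None))
    return list(word_reduce(pairs, None))


if __name__ == "__main__":
    for word, count in run_local(["the cat sat", "the cat ran"]):
        print("%s %s" % (word, count))
```

On a cluster, Disco would instead run the map function over the input in parallel and feed the resulting pairs to the reduce function; the local driver here only mirrors the data flow.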

Highlights

...and more! See the documentation for details.

Need help with Disco? You can reach us on our IRC channel #discoproject on Freenode, on the Disco discussion group, or by opening an issue in the Disco repository on GitHub.

Who is using Disco?

Send a note to the Disco discussion group if you would like to list your company or project here.

Headquartered in Chicago, Illinois, Allston Trading, LLC is a premier high-frequency market maker on over 40 financial exchanges, in 20 countries, and in nearly every product class.

At Allston Trading, Disco is used for a wide variety of historical research and real-time initiatives in the field of modern finance.

Chango is a programmatic advertising platform. Chango connects marketers with their exact target audience in real time across Display, Social, Mobile & Video.

Chango helps marketers efficiently acquire new customers, retarget their existing site visitors, and build brand awareness with integrated and highly data-driven campaigns. Their ‘Universal Live Profile’ technology and exclusive data help deliver results for Fortune 500 companies like eBay, Lego and Bloomingdale's.

Disco is used by Chango as a core component for analyzing and bidding on the ad market.

Disco is developed mainly at Nokia Research Center (NRC), which is chartered with exploring new frontiers for mobility and solving challenges to transform the converging Internet and communications industries. NRC has been exploring and developing mobile technologies for over 20 years. Current research focuses on the areas of sensing and data intelligence, user interface, high-performance mobile platforms, and cognitive radio.

The largest data analysis cluster at Nokia runs Disco to perform daily analysis of Nokia's vast mobile data assets.

Zemanta is developing a content suggestion engine for bloggers and other content creators. Zemanta combs the web for the most relevant images, smart links, keywords and text, instantly serving these results to the user to enrich and inform their content.

At Zemanta, Disco is used to process contextual data about images on Wikipedia and Wikimedia Commons.