Release notes

Disco 0.3 (May 26th 2010)

New features

  • Disco Distributed Filesystem - distributed and replicated data storage for Disco.
  • Discodex - distributed indices for efficient querying of data.
  • discodb - lightning fast and scalable mapping data structure.
  • New internal data format, supporting compression and pickling of Python objects by default.
  • Clarified the partitioning logic in Disco, see Data Flow in Disco Jobs.
  • Integrated web server (Mochiweb) replaces Lighttpd, making installation easier and allowing more fine-grained data flow control.
  • Chunked data transfer and improved handling of network congestion.
  • Support for partial job functions (Thanks to Jarno Seppänen)
  • Unified interface for readers and input streams; writers are deprecated. See disco.core.Disco.new_job().
  • New save=True parameter for disco.core.Disco.new_job() which persists job results in DDFS.
  • New garbage collector deletes job data DISCO_GC_AFTER seconds after the job has finished (see disco.settings). Defaults to 100 years. Use save=True if you want to keep the results permanently.
  • Support for Out-of-band (OOB) results implemented using DDFS.
  • disco-worker checks that there is enough disk space before it starts up.
  • discocli - Command line interface for Disco
  • ddfscli - Command line interface for DDFS
  • Improved load balancing in scheduler.
  • Integrated Disco proxy based on Lighttpd.
  • Debian packaging: disco-master and disco-node do not conflict anymore, making it possible to run Disco locally from Debian packages.
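
The garbage collection rule described above (job data is deleted DISCO_GC_AFTER seconds after the job has finished) comes down to an age check. The following is a minimal sketch, not Disco's implementation: the function name is_collectable and the constant HUNDRED_YEARS are hypothetical, and the finish time is assumed to be available as a Unix timestamp.

```python
import time

# Hypothetical stand-in for the documented default of 100 years, in seconds.
# This constant is not part of Disco; it only illustrates the scale.
HUNDRED_YEARS = 100 * 365 * 24 * 60 * 60


def is_collectable(finished_at, gc_after=HUNDRED_YEARS, now=None):
    """Return True if more than gc_after seconds have passed since finished_at."""
    if now is None:
        now = time.time()
    return now - finished_at > gc_after
```

With the default, job data is effectively never collected, so lowering the setting is what makes the collector active in practice.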

Deprecated

These features will be removed in the coming releases:
  • object_reader and object_writer - Disco now supports pickling by default.
  • map_writer and reduce_writer (use output streams instead).
  • nr_reduces (use partitions)
  • fun_map and input_files (use map and input)

Backwards incompatible changes

  • Experimental support for GlusterFS removed
  • homedisco removed - use a local Disco instead
  • Deprecated chunked parameter removed from disco.core.Disco.new_job().
  • If you have been using a custom output stream with the default writer, you now need to specify the writer explicitly, or upgrade your output stream to support the .out(k, v) method, which replaces writers in 0.3.
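
As an illustration of the last item, an object that supports .out(k, v) could look roughly like the sketch below. Only the method name out and its two arguments come from these notes; the class name TabSeparatedOutput, its constructor, and the tab-separated format are assumptions made for the example, and the exact interface Disco expects from an output stream should be checked in disco.core.Disco.new_job().

```python
import io


class TabSeparatedOutput:
    """Hypothetical output object exposing the out(k, v) method."""

    def __init__(self, fd):
        # fd is any file-like object opened for writing text.
        self.fd = fd

    def out(self, k, v):
        # Write one key-value pair per line, separated by a tab.
        self.fd.write("%s\t%s\n" % (k, v))


# Usage with an in-memory buffer standing in for the real output file.
buf = io.StringIO()
stream = TabSeparatedOutput(buf)
stream.out("word", 3)
```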

Bugfixes

  • Jobs should disappear from the list immediately after being deleted (bug #43)
  • Running jobs with empty input gives “Jobs status dead” (bug #92)
  • Full disk may crash a job in _safe_fileop() (bug #120)
  • Eventmonitor shows each job multiple times when tracking multiple jobs (bug #94)
  • Change eventmonitor default output handle to sys.stderr (bug #83)
  • Tell user what the spawn command was if the task fails right away (bug #113)
  • Normalize pathnames on PYTHONPATH (bug #134)
  • Timeouts were handled incorrectly in wait() (bug #96)
  • Cast unicode urls to strings in comm_curl (bug #52)
  • External sort handles objects in values correctly. Thanks to Tomaž Šolc for the patch!
  • Scheduler didn’t handle node changes correctly - this solves the hanging jobs issue
  • Several bug fixes in comm_*.py
  • Duplicate nodes on the node config table crashed master
  • Handle timeout correctly in fair_scheduler_job (if system is under heavy load)

Disco 0.2.4 (February 8th 2010)

New features

  • New fair job scheduler which replaces the old FIFO queue. The scheduler is inspired by Hadoop’s Fair Scheduler. Running multiple jobs in parallel is now supported properly.
  • Scheduler option to control data locality and resource usage. See disco.core.Disco.new_job().
  • Support for custom input and output streams in tasks: See map_input_stream, map_output_stream, reduce_input_stream and reduce_output_stream in disco.core.Disco.new_job().
  • disco.core.Disco.blacklist() and disco.core.Disco.whitelist().
  • New test framework based on Python’s unittest module.
  • Improved exception handling.
  • Improved IO performance thanks to larger IO buffers.
  • Lots of internal changes.

Bugfixes

  • Set LC_ALL=C for disco worker to ensure that external sort produces consistent results (bug #36, 7635c9a)
  • Apply rate limit to all messages on stdout / stderr. (bug #21, db76c80)
  • Fixed flock error handling for OS X (b06757e4)
  • Documentation fixes (bugs #34, #42, 9cd9b6f1)
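
The LC_ALL=C fix matters because the ordering produced by an external sort depends on the active locale. The sketch below is not how Disco invokes its external sort; it only shows the effect, assuming a POSIX sort command is on the PATH. The function name sort_lines_c_locale is hypothetical.

```python
import os
import subprocess


def sort_lines_c_locale(lines):
    """Sort lines with the external sort command under LC_ALL=C."""
    # Copy the current environment and force the C locale, so that
    # lines are compared by byte value regardless of the user's locale.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        stdout=subprocess.PIPE,
        env=env,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Under the C locale, uppercase ASCII letters sort before lowercase ones, so the result is the same on every node even if the nodes are configured with different locales.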

Disco 0.2.3 (September 9th 2009)

New features

  • The disco.settings control script makes setting up and running Disco much easier than before.
  • Console output of job events (screenshot). You can now follow progress of a job on the console instead of the web UI by setting DISCO_EVENTS=1. See disco.core.Disco.events() and disco.core.Disco.wait().
  • Automatic inference and distribution of dependent modules. See disco.modutil.
  • required_files parameter added to disco.core.Disco.new_job().
  • Combining the previous two features, a new easier way to use external C libraries is provided, see Disco External Interface.
  • Support for Python 2.6 and 2.7.
  • Easier installation of a simple single-server cluster. Just run disco master start on the disco directory. The DISCO_MASTER_PORT setting is deprecated.
  • Improved support for OS X. The DISCO_SLAVE_OS setting is deprecated.
  • Debian packages upgraded to use Erlang 13B.
  • Several improvements related to fault-tolerance of the system
  • Serialize job parameters using more efficient and compact binary format.
  • Improved support for GlusterFS (2.0.6 and newer).
  • Support for the pre-0.1 disco module, disco.job call etc., removed.

Bugfixes

  • Critical: External sort didn’t work correctly with non-numeric keys (5ef88ad4)
  • External sort didn’t handle newlines correctly (61d6a597f)
  • Regression fixed in disco.core.Disco.jobspec(); the function now works again (e5c20bbfec4)
  • Filter fixed on the web UI (bug #4, e9c265b)
  • Tracebacks are now shown correctly on the web UI (bug #3, ea26802ce)
  • Fixed negative number of maps on the web UI (bug #28, 5b23327 and 3e079b7)
  • The comm_curl module might return an insufficient number of bytes (761c28c4a)
  • Temporary node failure (noconnection) shouldn’t be a fatal error (bug #22, ad95935)
  • nr_maps and nr_reduces limits were off by one (873d90a7)
  • Fixed a Javascript bug on the config table (11bb933)
  • Timeouts in starting a new worker shouldn’t be fatal (f8dfcb94)
  • The connection pool in comm_httplib didn’t work correctly (bug #30, 5c9d7a88e9)
  • Added timeouts to comm_curl to fix occasional issues with the connection getting stuck (2f79c698)
  • All IOErrors and CommExceptions are now non-fatal (f1d4a127c)

Disco 0.2.2 (July 26th 2009)

New features

  • Experimental support for POSIX-compatible distributed filesystems, in particular GlusterFS. Two modes are available: Disco can read input data from a distributed filesystem while preserving data locality (aka inputfs). Disco can also use a DFS for internal communication, replacing the need for node-specific web servers (aka resultfs).

Bugfixes

  • DISCO_PROXY now handles out-of-band results correctly (commit b1c0f9911)
  • make-lighttpd-proxyconf.py now ignores commented out lines in /etc/hosts (bug #14, commit a1a93045d)
  • Fixed missing PID file in the disco-master script. The /etc/init.d/disco-master script in Debian packages now works correctly (commit 223c2eb01)
  • Fixed a regression in Makefile. Config files were not copied to /etc/disco (bug #13, commit c058e5d6)
  • Increased server.max-write-idle setting in Lighttpd config. This prevents the HTTP connection from disconnecting during long-running, CPU-intensive reduce tasks (bug #12, commit 956617b0)

Disco 0.2.1 (May 26th 2009)

New features

  • Support for redundant inputs: You can now specify many redundant addresses for an input file. The scheduler chooses the address which points at the node with the lowest load. If the address fails, the other addresses are tried one by one until the task succeeds. See inputs in disco.core.Disco.new_job() for more information.
  • Task profiling: See How to profile programs in Disco?
  • Implemented an efficient way to poll for results of many concurrent jobs. See disco.core.Disco.results().
  • Support for the Curl HTTP client library added. Curl is used by default if the pycurl module is available.
  • Improved storing of intermediate results: Results are now spread to a directory hierarchy based on the md5 checksum of the job name.
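
The redundant-input behaviour described in the first item can be sketched as follows. This is an illustration rather than the scheduler's actual code: the function fetch_with_fallback and its load_of and fetch arguments are hypothetical, and trying the remaining addresses in order of increasing load is an assumption (the notes only say they are tried one by one).

```python
def fetch_with_fallback(addresses, load_of, fetch):
    """Try redundant addresses, least-loaded first, until one succeeds.

    addresses: iterable of redundant addresses for the same input
    load_of:   callable returning the current load for an address
    fetch:     callable that reads the input or raises IOError
    """
    last_error = None
    for address in sorted(addresses, key=load_of):
        try:
            return fetch(address)
        except IOError as error:
            # Remember the failure and move on to the next address.
            last_error = error
    raise IOError("all redundant addresses failed: %r" % (last_error,))
```

The input is only reported as failed once every redundant address has been tried, which matches the "until the task succeeds" wording above.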

Bugfixes

  • Check for ionice before using it. (commit dacbbbf785)
  • required_modules didn’t handle submodules (PIL.Image etc.) correctly (commit a5b9fcd970)
  • Missing file balls.png added. (bug #7, commit d5617a788)
  • Missing and crashed nodes don’t cause the job to fail (bug #2, commit 6a5e7f754b)
  • Default value for nr_reduces now never exceeds 100 (bug #9, commit 5b9e6924)
  • Fixed homedisco regression in 0.2. (bugs #5, #10, commit caf78f77356)

Disco 0.2 (April 7th 2009)

New features

Bugfixes

(NB: bug IDs in 0.2 refer to the old bug tracking system)

  • chunked = false mode produced incorrect input files for the reduce phase (commit db718eb6)
  • Shell enabled for the disco master process (bug #7, commit 7944e4c8)
  • Added warning about unknown parameters in new_job() (bug #8, commit db707e7d)
  • Fix for sending invalid configuration data (bug #1, commit bea70dd4)
  • Fixed missing msg, err and data_err functions (commit e99a406d)