mrjob v0.2.5 released


Jimmy Retzlaff

What is mrjob?

mrjob is a Python package that helps you write and run Hadoop Streaming jobs.

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which
allows you to buy time on a Hadoop cluster on an hourly basis. It also
works with your own Hadoop cluster.

Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next)
* Duplicate your production environment inside Hadoop
  * Upload your source tree and put it in your job's $PYTHONPATH
  * Run make and other setup scripts
  * Set environment variables (e.g. $TZ)
  * Easily install Python packages from tarballs (EMR only)
  * Setup handled transparently by mrjob.conf config file
* Automatically interpret error logs from EMR
* SSH tunnel to Hadoop job tracker on EMR
* Minimal setup
  * To run on your Hadoop cluster, install simplejson and make
    sure $HADOOP_HOME is set.
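
To make the "multi-step jobs" item concrete, here is a minimal in-memory sketch of two chained map-reduce steps. It deliberately does not use mrjob's API: run_step, most_common_word, and the mapper/reducer names are hypothetical helpers written for this illustration, and the sort stands in for Hadoop's shuffle.

```python
from itertools import groupby
from operator import itemgetter


def run_step(records, mapper, reducer):
    """Run one map-reduce step in memory: map, sort by key, then reduce."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


def count_mapper(_, line):
    # Step 1 map: emit (word, 1) for every word on the line.
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    # Step 1 reduce: total the occurrences of each word.
    yield word, sum(counts)


def max_mapper(word, count):
    # Step 2 map: route every pair to one key so a single reducer sees them all.
    yield "all", (count, word)


def max_reducer(_, pairs):
    # Step 2 reduce: keep the (count, word) pair with the highest count.
    yield max(pairs)


def most_common_word(lines):
    """Chain the two steps: the output of step 1 is the input of step 2."""
    records = [(None, line) for line in lines]
    steps = [(count_mapper, count_reducer), (max_mapper, max_reducer)]
    for mapper, reducer in steps:
        records = run_step(records, mapper, reducer)
    return records
```

Calling most_common_word(["a b a", "b a c"]) returns [(3, "a")]. On a real cluster each step would run as a separate Hadoop Streaming job, with the framework doing the sorting and grouping between the map and reduce phases.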

More info:

* Install mrjob: python setup.py install
* Documentation:
* PyPI:
* Discussion:
* Development is hosted at github:

What's new?

v0.2.5, 2011-04-29 -- Hadoop input and output formats
* Added hadoop_input/output_format options
* You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
* Extra args to hadoop now come before -mapper/-reducer on EMR, so
  that e.g. -libjars will work (worked in hadoop mode since v0.2.2)
* hadoop mode now supports s3n:// URIs (Issue #53)
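
The argument-ordering change matters because Hadoop Streaming expects generic options such as -libjars ahead of streaming options such as -mapper. The sketch below shows one way such a command line could be assembled. It is an illustration only, not mrjob's implementation: build_streaming_args and its parameters are hypothetical, as are the jar name and paths in the usage note.

```python
def build_streaming_args(streaming_jar, mapper, reducer, input_paths,
                         output_path, extra_args=(), input_format=None,
                         output_format=None):
    """Assemble a Hadoop Streaming command line as a list of arguments."""
    args = ["hadoop", "jar", streaming_jar]
    # Extra (generic) options go first so Hadoop parses them before
    # the streaming-specific options that follow.
    args.extend(extra_args)
    if input_format:
        args += ["-inputformat", input_format]
    if output_format:
        args += ["-outputformat", output_format]
    for path in input_paths:
        args += ["-input", path]
    args += ["-output", output_path]
    args += ["-mapper", mapper, "-reducer", reducer]
    return args
```

For example, build_streaming_args("streaming.jar", "python job.py --mapper", "python job.py --reducer", ["s3n://bucket/in"], "s3n://bucket/out", extra_args=["-libjars", "custom.jar"]) yields a list that begins with hadoop jar streaming.jar -libjars custom.jar and ends with the -mapper and -reducer options.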

