Suggestions for Python MapReduce?

Phillip B Oldham

I understand that there are a number of MapReduce frameworks/tools that play nicely with Python (Disco, Dumbo/Hadoop); however, these have "large" dependencies (Erlang/Java). Are there any MapReduce frameworks/tools which are either pure-Python, or play nicely with Python but don't require the Java/Erlang runtime environments?
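
For context, the single-machine version of the pattern fits in a few lines of standard-library Python; the following is just a toy word count using multiprocessing, not the distributed tooling I'm asking about:

    # Minimal single-machine map/shuffle/reduce using only the stdlib.
    from collections import defaultdict
    from multiprocessing import Pool

    def map_words(document):
        # Map phase: emit (key, value) pairs for one input record.
        return [(word, 1) for word in document.split()]

    def reduce_counts(item):
        # Reduce phase: fold all the values collected for one key.
        word, counts = item
        return word, sum(counts)

    def mapreduce(documents, processes=4):
        pool = Pool(processes)
        grouped = defaultdict(list)
        # Shuffle phase: group the mapped values by key.
        for pairs in pool.map(map_words, documents):
            for key, value in pairs:
                grouped[key].append(value)
        return dict(pool.map(reduce_counts, grouped.items()))

    if __name__ == '__main__':
        docs = ["the quick brown fox", "the lazy dog", "the fox"]
        print(mapreduce(docs))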
 
Casey Webster

Phillip B Oldham said:
I understand that there are a number of MapReduce frameworks/tools that play nicely with Python (Disco, Dumbo/Hadoop); however, these have "large" dependencies (Erlang/Java). Are there any MapReduce frameworks/tools which are either pure-Python, or play nicely with Python but don't require the Java/Erlang runtime environments?

I can't answer your question, but I would like to better understand the problem you are trying to solve. The Apache Hadoop/MapReduce Java application isn't really that "large" by modern standards, although it is generally run with large heap sizes for performance (-Xmx1024m or larger for the mapred.child.java.opts parameter).

MapReduce is designed to do extremely fast parallel data-set processing on terabytes of data over hundreds of physical nodes; what advantage would a pure-Python approach have here?
 
Phillip B Oldham

Casey Webster said:
I can't answer your question, but I would like to better understand the problem you are trying to solve. The Apache Hadoop/MapReduce Java application isn't really that "large" by modern standards, although it is generally run with large heap sizes for performance (-Xmx1024m or larger for the mapred.child.java.opts parameter).

MapReduce is designed to do extremely fast parallel data-set processing on terabytes of data over hundreds of physical nodes; what advantage would a pure-Python approach have here?

We're always taught that it is a good idea to reduce the number of dependencies for a project. So you could use Disco, or Dumbo, or even Skynet, but then you've introduced another system you have to manage. Each new system has a learning curve, which is lessened if the system is written in an environment/language you already work with and understand. And since a large part of the cost of environments like Erlang and Java is in understanding them, any issue that is out of the ordinary can be a killer; changing the heap size, as you mentioned above, is exactly the sort of issue a non-Java dev trying to use Hadoop might run into.

With the advent of cloud computing and the new semi-structured/document databases that are coming to the fore, sometimes you need to use MapReduce on smaller/fewer systems to the same effect. Also, you may only need to ensure that a job is done in a timely fashion without taking up too many resources, rather than being lightning-fast. Dumbo/Disco may be considered overkill in these situations.

Implementations like BashReduce <http://blog.last.fm/2009/04/06/mapreduce-bash-script> are perfect for such scenarios. I'm simply wondering if there's another simpler/smaller implementation of MapReduce that plays nicely with Python but doesn't require the setup/knowledge overhead of more "robust" implementations such as Hadoop and Disco... maybe something similar to Ruby's Skynet.
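
For reference, the whole BashReduce-style streaming pattern is just two stdin/stdout filters glued together with a Unix sort; a rough sketch (the file names mapper.py and reducer.py are only illustrative):

    # mapper.py: emit one tab-separated (key, value) pair per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('%s\t1' % word)

    # reducer.py: sum the values for each key; expects key-sorted
    # input, so the Unix sort between the two scripts acts as the
    # shuffle step.
    import sys
    from itertools import groupby

    for word, lines in groupby(sys.stdin, key=lambda l: l.split('\t', 1)[0]):
        total = sum(int(l.split('\t', 1)[1]) for l in lines)
        print('%s\t%d' % (word, total))

run as:

    cat input.txt | python mapper.py | sort | python reducer.py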
 
python

Phillip,

We've had great success writing simple, project-specific algorithms to split content into chunks appropriate for ETL-type, Python-based processing in a hosted cloud environment like Amazon EC2 or the recently launched Rackspace Cloud Servers. Since we're purchasing our cloud hosting time in 1-hour blocks, we divide our data into much larger chunks than a traditional map-reduce technique might use. For many of our projects, the data transfer time to and from the cloud takes the majority of the clock time.
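
The splitting step itself can stay very simple; something like the following rough sketch (the chunk size is purely illustrative and would be tuned against the billing window and transfer times):

    # Split an iterable of records into coarse chunks sized for
    # hour-billed cloud workers (chunk_size is illustrative only).
    from itertools import islice

    def chunked(records, chunk_size=500000):
        it = iter(records)
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            yield chunk

    # Each chunk would then be shipped to one EC2 / Cloud Servers
    # instance and processed there by an ordinary Python ETL script.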

Malcolm
 
Paul Rubin

Phillip B Oldham said:
Implementations like BashReduce <http://blog.last.fm/2009/04/06/mapreduce-bash-script> are perfect for such scenarios. I'm simply wondering if there's another simpler/smaller implementation of MapReduce that plays nicely with Python but doesn't require the setup/knowledge overhead of more "robust" implementations such as Hadoop and Disco... maybe something similar to Ruby's Skynet.

I usually just spew ssh tasks across whatever computing nodes I can get my hands on. It's less organized than something like MapReduce, but I tend to run one-off tasks that I have to keep an eye on anyway. I've done stuff like that across up to 100 or so machines, and I think it wouldn't be much worse if the number were a few times higher. I don't think it would scale to really large clusters (tens of thousands of nodes), though.
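
The pattern is roughly the following (the host list and worker command are placeholders; a real run would want passwordless ssh, retries, and some logging):

    # Fan shell commands out over ssh, round-robin across nodes,
    # then wait for everything; hosts and commands are placeholders.
    import subprocess

    hosts = ['node1', 'node2', 'node3']
    tasks = ['python worker.py --part %d' % i for i in range(12)]

    procs = [subprocess.Popen(['ssh', hosts[i % len(hosts)], task])
             for i, task in enumerate(tasks)]

    for proc in procs:
        proc.wait()   # crude join; no retries or failure handling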
 
Phillip B Oldham

Paul Rubin said:
I usually just spew ssh tasks across whatever computing nodes I can get my hands on. It's less organized than something like MapReduce, but I tend to run one-off tasks that I have to keep an eye on anyway.

I'm thinking of repeated tasks; things you run regularly, but that don't need to be blisteringly fast. I'll more than likely use Disco, but if I were to find something more lightweight I'd take a serious look.
 
