Haven’t I told you? Python is my primary programming language. I simply love it: its simplicity, its correctness, and the way it pushes you to write good, readable code.
Really, I love the Python principles. So, when I began to experiment with Hadoop clusters for big data processing, I asked myself: well, how can I do all this
using my favorite language? Hmm, let me search the wiki and voilà: Hadoop Streaming.
Hadoop Streaming lets you run “MapReduce jobs with any executable or script as the mapper and/or the reducer”. All the examples are based on Python! A good start. But I continued my search, and I found Dumbo, a Python module that allows you to easily write
and run Hadoop programs. It’s considered a good Python API for writing MapReduce programs. On Last.fm’s blog, they posted a short guide to getting your Hadoop
jobs working with Dumbo, highlighting two main things: simplicity and productivity. Now, of course, I have a piece of advice for you: if you are going to develop a really big, data-intensive
processing job, it is probably better to use Java, because Hadoop is written in it. Test it, improve it, compare it with the Python execution, and select the best option
for you. Then I found this amazing blog post
from Michael Noll, explaining in depth how to use Python with Hadoop. Please don’t forget to read it.
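To make the Streaming model concrete, here is a minimal sketch of a mapper and a reducer written as ordinary Python functions over text lines, exchanging tab-separated key/value pairs the way Hadoop Streaming expects. The “date,temperature” record format and the max-per-year job are my own illustration, not taken from any of the posts above:

```python
import sys

def mapper(lines):
    """Map step: emit one tab-separated (year, temperature) pair per record.

    Assumes each input line looks like "2012-06-01,30" (date, reading);
    this record format is just an illustration.
    """
    for line in lines:
        date, temp = line.strip().split(",")
        yield "%s\t%s" % (date[:4], temp)

def reducer(lines):
    """Reduce step: Hadoop Streaming delivers the mapper output sorted by
    key, so a single pass can track the maximum per year."""
    current_year, current_max = None, None
    for line in lines:
        year, temp = line.strip().split("\t")
        temp = int(temp)
        if year != current_year:
            if current_year is not None:
                yield "%s\t%d" % (current_year, current_max)
            current_year, current_max = year, temp
        else:
            current_max = max(current_max, temp)
    if current_year is not None:
        yield "%s\t%d" % (current_year, current_max)

if __name__ == "__main__":
    # In a real job these would be two separate scripts, passed to Hadoop as
    # the -mapper and -reducer arguments of the streaming jar; here a flag
    # selects which phase to run over stdin.
    step = sys.argv[1] if len(sys.argv) > 1 else "map"
    run = mapper if step == "map" else reducer
    for out in run(sys.stdin):
        print(out)
```

The only contract with Hadoop is stdin in, stdout out, which is exactly why any language works with Streaming.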
The other player is hadoopy. It’s built with the same purpose as Dumbo. Check it out and try it.
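Dumbo’s programming model is the same generator style: a mapper and a reducer that yield key/value pairs, wired together by its run() helper. A hedged sketch of the same max-temperature toy job (the record format is my own illustration, and this assumes Dumbo’s documented run() entry point):

```python
def mapper(key, value):
    # Dumbo hands the mapper (key, value) pairs; for plain text input the
    # value is the line. "date,temperature" records are just an illustration.
    date, temp = value.split(",")
    yield date[:4], int(temp)

def reducer(key, values):
    # Dumbo groups the mapper output by key before calling the reducer.
    yield key, max(values)

if __name__ == "__main__":
    # Submitted to a cluster with something like:
    #   dumbo start job.py -input data.txt -output out -hadoop /usr/lib/hadoop
    import dumbo
    dumbo.run(mapper, reducer)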
mrjob: another player, built by Yelp
The Yelp engineering team released its own Python framework for writing Hadoop Streaming jobs, called
mrjob. In a
great post on its engineering blog, they explained
why they developed mrjob and shared their work with the world. Thanks, guys! It would be a good project to work on for my open source contributions.
mrjob is available on the Python Package Index, so you can install it with pip (pip install mrjob).
The documentation is here.
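With mrjob you subclass MRJob and override mapper and reducer. A minimal sketch of a toy job that finds the longest line in the input (the job name is my own; the ImportError fallback is only there so the mapper/reducer logic can be exercised without mrjob installed):

```python
try:
    from mrjob.job import MRJob
except ImportError:
    # Fallback so the mapper/reducer logic below can still be read and
    # tested on a machine without mrjob installed.
    MRJob = object

class MRLongestLine(MRJob):
    """Toy job: report the length of the longest line in the input."""

    def mapper(self, _, line):
        # mrjob calls this once per input line; the key is unused for raw text.
        yield "longest", len(line)

    def reducer(self, key, lengths):
        # All lengths for a key arrive together; keep the maximum.
        yield key, max(lengths)

if __name__ == "__main__" and MRJob is not object:
    # Run locally:      python job.py input.txt
    # Run on Hadoop:    python job.py -r hadoop input.txt
    MRLongestLine.run()
```

The same script runs unchanged locally or on a cluster, which is a big part of mrjob’s appeal.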
Other players outside the Hadoop ecosystem, based on Python
But I continued my search, looking for a completely Pythonic solution for writing MapReduce applications, and yes, I found two projects:
Disco Project and mrjob.
The Disco Project
This project is sponsored by Nokia Research and the Disco Project development team, and it is a pure implementation of MapReduce for distributed processing. Disco supports parallel computations
over large data sets, stored on an unreliable cluster of computers, as in the original framework created by Google. This makes it
a perfect tool for analyzing and processing large data sets, without having to worry about difficult technicalities related
to distribution, such as communication protocols, load balancing, locking, job scheduling, and fault tolerance, which are all handled by Disco.
- Disco users start Disco jobs in Python scripts
- Job requests are sent over HTTP to the master
- The master is an Erlang process that receives requests over HTTP
- The master launches slaves on each node over SSH
- Slaves run Disco tasks in worker processes
- What are you waiting for? Try them!
- Have you considered contributing to them?
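The data flow in those steps — a map over the records, a shuffle that groups pairs by key, and a reduce per group — can be simulated locally in plain Python, with no Disco or cluster at all. This is only a single-process sketch of the model (the sales data is made up for illustration), but it is a handy way to prototype job logic before submitting it:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run the map function over every record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    """Group values by key: the step a real cluster performs over the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the reduce function to each key's group of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Toy job: total sales per city ("city,amount" records, made up for illustration).
records = ["madrid,10", "oslo,7", "madrid,3"]
pairs = map_phase(records, lambda r: [(r.split(",")[0], int(r.split(",")[1]))])
result = reduce_phase(shuffle_phase(pairs), lambda key, values: sum(values))
print(result)  # {'madrid': 13, 'oslo': 7}
```

Disco’s job is to run exactly this flow, but with the records spread across machines and the shuffle done between nodes, with fault tolerance on top.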
OK, the next step is to clone it from GitHub and test it. Try the tutorial published in the official documentation. As you can see, there are many Pythonic solutions for writing MapReduce applications, inside and outside the Hadoop ecosystem. In the second part of this topic, I will try some examples from all these projects, not the classic Hadoop “Hello world” word count example. I promise. Regards.
Tags: Data Analysis, Hadoop, Python, Dumbo, mrjob, DiscoProject, MapReduce