The Python User Group Organizers Survey

A message from Jesse Noller arrived in my mailbox today via the Python Announcements Mailing List, announcing the International Survey of Python User Group Organizers. If you don't know Jesse, he is a PSF board member and the PyCon chair.


The Camel, The Blue Bird and the Big Eye

At this moment I have in my mailbox a few messages from the vibrant Apache Software Foundation Announcements mailing list and from the Apache Cassandra users list.

Apache Camel 2.7.2 released

The first came from Hadrian, announcing the new release of Apache Camel, a powerful open source integration framework based on the well-known book "Enterprise Integration Patterns" by Gregor Hohpe and Bobby Woolf.

This new version of the project (2.7.2) is a patch release primarily focused on minor fixes; you can read them all here. A short selection:

  • Better usability in OSGi environments
  • Minor fixes for the camel-web console
  • A minor fix for camel-ehcache related to replication across nodes

Download from here and try it.

Apache Whirr 0.5.0-incubating released

The second message is from Tom White, author of "Hadoop: The Definitive Guide" (2nd Edition) and a member of the Hadoop Project PMC, describing the fifth incubating release of this useful project, a set of tools for running cloud services such as Apache Hadoop, HBase, ZooKeeper, and Cassandra.

You can download it from here and you can view the full change log here.

Apache Cassandra 0.8 released

The last piece of news today, but a very important one, at least for me, is the new release of Apache Cassandra, a distributed storage system for managing very large amounts of structured data spread across many commodity servers while providing a highly available service with no single point of failure.

The message from Eric Evans is very clear:

I am very pleased to announce the official release of Cassandra 0.8.0. If you haven't been paying attention to this release, this is your last chance, because by this time tomorrow all your friends are going to be raving, and you don't want to look silly.

So why am I resorting to hyperbole? Well, for one because this is the release that debuts the Cassandra Query Language (CQL). In one fell swoop Cassandra has become more than NoSQL, it's MoSQL. Cassandra also has distributed counters now. With counters, you can count stuff, and counting stuff rocks.

A kickass use-case for Cassandra is spanning data-centers for fault-tolerance and locality, but doing so has always meant sending data in the clear, or tunneling over a VPN. New for 0.8.0, encryption of intranode traffic.

If you're not motivated to go upgrade your clusters right now, you're either not easily impressed, or you're very lazy. If it's the latter, would it help knowing that rolling upgrades between releases is now supported? Yeah. You can upgrade your 0.7 cluster to 0.8 without shutting it down. You see what I mean?

Then go read the release notes[1] to learn about the full range of awesomeness, then grab a copy[2] and become a (fashionably) early adopter. Drivers for CQL are available in Python[3], Java[3], and Node.js[4]. As usual, a Debian package is available from the project's APT repository[5]. Enjoy!

[1]: http://goo.gl/CrJqJ (NEWS.txt)
[2]: http://cassandra.debian.org/download
[3]: http://www.apache.org/dist/cassandra/drivers
[4]: https://github.com/racker/node-cassandra-client
[5]: http://wiki.apache.org/cassandra/DebianPackaging
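Since CQL is the headline feature, here is a minimal, hedged sketch of using it from Python. It assumes the early "cql" driver listed at [3], which, as far as I recall, follows the usual Python DB-API shape (connect/cursor/execute); the keyspace and column family names are hypothetical, and the exact CQL 1.0 syntax may differ slightly from what ships in 0.8.0, so check the documentation before copying it.

# Minimal sketch, assuming the early "cql" Python driver from [3] with a
# DB-API style interface. The keyspace "demo" and column family "users"
# are made up; adjust them to your own schema, and double-check the
# CQL 1.0 syntax against the 0.8.0 docs.
import cql

# Cassandra 0.8 still speaks Thrift on port 9160.
connection = cql.connect('localhost', 9160, keyspace='demo')
cursor = connection.cursor()

# In CQL 1.0 the row key is addressed through the special KEY column.
cursor.execute("SELECT * FROM users WHERE KEY = 'jsmith'")
row = cursor.fetchone()
print(row)

cursor.close()
connection.close()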

The new KDE 4.7 version is coming

Linux (specifically Kubuntu) is my primary operating system, and KDE is my favorite desktop environment. I don't want to start a flame war here; I think everyone has the right to use the operating system, tools, and so on with which they are most productive, and at least for me, my productivity increases enormously when I'm working on Linux. For that reason, my daily work as a Data Scientist and Infrastructure Developer is done on this amazing operating system.

On May 25th, the KDE Project announced the availability of the new Beta for version 4.7, which is exciting news for me. Some of the new features that come with this Beta are:

  • KWin, Plasma's window manager, now supports OpenGL ES 2.0
  • Dolphin, the file manager, has many user interface improvements and a better experience when searching file metadata
  • KDM, the login manager, now supports and interfaces with the GRUB 2 bootloader
  • and many more improvements

You can download the complete source code for 4.7 Beta 1 from here, and the instructions for compiling and installing are here.

Of course, you are not obligated to use Kubuntu like me; you can choose your favorite OS to run KDE 4. Here are some resources about this:

I hope this is useful to everyone. Regards, and thanks for reading.

Many thanks to Charles M. Kozierok

Yes, I want to thank Charles for the amazing book he wrote about TCP/IP: The TCP/IP Guide. It really is a guide, a complete reference to topics that are genuinely hard to understand. One of the things I liked about this text is that he explains well both the "what" and the "why" of TCP/IP. That's very important, at least for me, because it lets you understand the needs and the reasons it came about.

If you want to understand TCP/IP, I recommend having this book in printed form, to use as an excellent everyday reference in your work. Well, I don't know about you, but I have a mixed role of Jr. Data Scientist and Unix/Linux systems and services administrator, and this text helps me every day to face real and hard networking problems like measuring network performance (I think that's one of the hardest).
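Speaking of measuring network performance, here is a tiny sketch, just to show the flavor of the problem: it times TCP connection setup using nothing but the standard library. The host and port are placeholders (pick a service you are allowed to probe), and connect time is only a crude proxy for round-trip latency.

# Rough sketch: time TCP connection setup to estimate round-trip latency.
# HOST and PORT are placeholders; use a service you are allowed to probe.
import socket
import time

HOST, PORT = "example.org", 80
SAMPLES = 5

times = []
for _ in range(SAMPLES):
    start = time.time()
    s = socket.create_connection((HOST, PORT), timeout=5)  # TCP three-way handshake
    times.append(time.time() - start)
    s.close()

print("min/avg/max connect time (ms): %.1f / %.1f / %.1f" % (
    min(times) * 1000, sum(times) / len(times) * 1000, max(times) * 1000))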

The role of the next years: The Data Scientist, Part I

The first time I heard about this role was while reading the book "Beautiful Data: The Stories Behind Elegant Solutions", edited by Toby Segaran and Jeff Hammerbacher. So I asked myself: "Why is this so important?" OK, let's go search for information about it.

I began my search and found an interview that Hal Varian, Chief Economist at Google, gave to McKinsey, in which he described how the statistician would become one of the most wanted roles of the next 10 years. In his own words: "The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that's going to be a hugely important skill."

Hmm, let me think for a moment. All of this is strongly related. A data analyst has to know many things, but what if we could teach or guide young professionals (like me) down this hard path? In this post, I will try to give my impressions about what is needed to be a good Data Scientist.

Use the right tool for the job: R and Apache Hadoop

R

Every single day, when I finish my work for the day, I realize that R and Hadoop are the right tools for me. R is a free software environment for statistical computing and graphics; it is an implementation of the S language. For data analysis, using a special-purpose language like S can be far more efficient than using a general-purpose language. In a really useful talk at OSCON 2009, "Open Source Analytics: Visualization and Predictive Modeling of Big Data with R", Dr. Michael E. Driscoll (http://www.dataspora.com/blog) gave several key points on why we should use R for data analysis, especially when focused on Big Data.

Some of these points are:

  • It's open source. Yes, that's really cool and important for me
  • We can manipulate data
  • We can build models based on statistics (the real wow with R)
  • We can visualize that data (with many packages: ggplot2, lattice, etc.)
  • and it's extensible via packages

"OK," you might say, "we can find these kinds of things in other languages." And he answered: "Yes, that's true, but I give you one language that already has all of these things." Honestly, we can't compete with that.

R is huge, and thanks to its extensibility, you can do a lot of things with the language. At the time of this writing, there are more than 1000 packages available for free on the CRAN site (http://cran.r-project.org).

My recommended packages for big data:

  • plyr (http://had.co.nz/plyr)
  • ggplot2: The Grammar of Graphics (http://had.co.nz/ggplot2)
  • biglm (http://cran.r-project.org/web/packages/biglm/index.html): bounded-memory regression for data that doesn't fit in RAM
  • glm (http://cran.r-project.org/web/packages/glm/index.html)
  • RApache: R for the Web (http://biostat.mc.vanderbilt.edu/rapache/)
  • REvoAnalytics, the set of routines (not free) developed by Revolution Analytics
  • Rcpp (http://cran.r-project.org/web/packages/Rcpp/index.html): the interface between R and C++
  • and other packages for parallel execution of code: Rmpi, papply, snow, multicore, etc.

Apache Hadoop

There is a lot of interest in this project from many companies and organizations. But the question is: why? I will try to answer it. Apache Hadoop is a creation of Doug Cutting and Mike Cafarella. If you don't know who Cutting is, you may remember him as the creator of Apache Lucene, the widely used search library.

From Tom White's well-known book "Hadoop: The Definitive Guide", 2nd Edition, published by O'Reilly in October 2010:


“This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by MapReduce. There are other
parts to Hadoop, but these capabilities are its kernel.”

Yes, it's cool, I know, and you might well say: "Wow, this is the solution to my big data analysis problem."

Hadoop was designed based on four axioms:

  • System Shall Manage and Heal Itself
  • Performance Shall Scale Linearly
  • Compute Should Move to Data
  • Simple Core, Modular and Extensible

But there is more:

  • It can operate on structured and unstructured data
  • It has a large community behind it and an active ecosystem
  • It has use cases for companies of every size
  • And it's open source, under the friendly Apache License 2.0

You can search on the wiki to see how many companies use Hadoop today.

Actually, my friend, give Hadoop a try. You can download it from here, or you can use the Cloudera Distribution for Hadoop (CDH). CDH is based on the most recent stable version of Hadoop plus several patches and updates. You can use it in many different ways:

  • a complete VMware image, ready to use
  • RPM packages for Red Hat-based distributions and SUSE/openSUSE
  • .deb packages for Debian and Ubuntu distributions
  • and of course, source and binary files

You can download it here, or you can use the public package repositories for Red Hat and Ubuntu too.

MapReduce

MapReduce is based on the principles of functional programming. In this programming model, data is explicitly passed between functions as parameters or return values, and can only be changed by the function that is active at that moment. It's a programming model for data processing in which parallelism is inherent. A job is organized as a "map" function, which transforms a piece of data into some number of key/value pairs; each of these pairs is then sorted by key and routed to the same node, where a "reduce" function merges all the values for a given key into a single result.
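To make the data flow concrete, here is a tiny, framework-free sketch in plain Python that imitates the three phases just described: map, shuffle by key, and reduce. The log records and the counting logic are invented purely for illustration.

# Toy illustration of the MapReduce model: map, group by key, reduce.
# No Hadoop involved; everything runs in one process just to show the data flow.
from collections import defaultdict

records = ["2011-06-02 ERROR disk", "2011-06-02 INFO boot", "2011-06-03 ERROR net"]

def map_fn(record):
    # Emit (key, value) pairs; here, one (log level, 1) pair per record.
    level = record.split()[1]
    yield level, 1

def reduce_fn(key, values):
    # Merge all the values that share the same key into a single result.
    return key, sum(values)

# "Shuffle": group the intermediate pairs by key, as the framework would do.
groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)

results = [reduce_fn(key, values) for key, values in groups.items()]
print(results)  # e.g. [('ERROR', 2), ('INFO', 1)]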

There are a lot of resources for studying the MapReduce programming model in depth. For example, Google Labs hosts the paper on the original implementation of the model, you can search the Hadoop wiki, or you can read the books by Tom White and Jason Venner ("Pro Hadoop: Build scalable, distributed applications in the cloud", from Apress).

Sorry, I forgot to give you this link: 10 MapReduce Tips, from Tom.

HDFS

This is the other gem of Hadoop: its distributed filesystem. The architecture of HDFS is described in "The Hadoop Distributed File System" by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, in the proceedings of MSST 2010 (May 2010).

Some of its features:

  • It can store very large files (on the order of gigabytes, terabytes, and petabytes)
  • It separates the filesystem metadata (on a node called the NameNode) from the application data (on one or more nodes called DataNodes)
  • It assumes that hardware can fail; for that reason, it replicates the data across multiple machines in the cluster (the default replication factor is 3)
  • Each file is broken into chunks (by default 64 MB blocks, although many users use 128 MB) and stored across multiple DataNodes as local OS files; see the quick arithmetic sketch after this list
  • It's based on the write-once, read-many-times pattern
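A quick back-of-the-envelope sketch of what those defaults mean in practice (the 500 GB file size is just an example I made up):

# Back-of-the-envelope HDFS arithmetic using the defaults mentioned above.
import math

file_size_gb = 500    # example file size, chosen only for illustration
block_size_mb = 64    # default HDFS block size at the time
replication = 3       # default replication factor

blocks = int(math.ceil(file_size_gb * 1024.0 / block_size_mb))
raw_storage_gb = file_size_gb * replication

print("Blocks: %d, raw storage with replication: %d GB" % (blocks, raw_storage_gb))
# A 500 GB file -> 8000 blocks of 64 MB, occupying 1500 GB of raw cluster storage.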

But these components are not the only pieces of the Hadoop ecosystem. There is more:

  • Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad-hoc querying, and analysis of large datasets stored in Hadoop files
  • Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records
  • Hadoop Streaming is a utility that comes with the Hadoop distribution; it allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
  • Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing
  • Sqoop is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing
  • Oozie is a tool developed by Yahoo! for writing workflows of interdependent Hadoop jobs
  • HUE is a user interface framework and SDK for visual Hadoop applications
  • ZooKeeper is a coordination service for distributed applications
  • HBase is the Hadoop database, for random read/write access
  • Cascading is a data processing API, process planner, and process scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, all without having to "think" in MapReduce

Many problems have already been solved using this piece of technology and its ecosystem.

It's time to enter the Cloud

A lot of companies are already using cloud services in many of their business processes; GitHub, The New York Times, Hopper.Travel, and RazorFish are examples of this. And now there are big players in this movement: Amazon, Google, and Microsoft.

Many companies rely on the remarkable services behind Amazon's Elastic MapReduce and S3. The first, as described on its page, "is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3)."

And the second is "a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and Amazon EC2 is free. This makes use of S3 attractive for Hadoop users who run clusters on EC2."

In the second part of this article, I will focus on three more skills that I think a Data Scientist should know:

  • Research, business, everything is based on numbers: Statistics
  • Mining, mining: Data Mining
  • Visualize it: Information Aesthetics

Python for developing MapReduce Data Analysis Applications: Part 1

Haven't I told you? Python is my primary programming language. I simply love it: its simplicity, its correctness, and the way it pushes you toward good, readable code. Really, I love the Python principles. So when I began to experiment with Hadoop clusters for big data processing, I asked myself: how can I do all of this using my favorite language? Let me search the wiki and, voilà: Hadoop Streaming.

Reading the docs on the wiki, I found: "Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer." All the examples are based on Python. Good start!
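To show what this looks like in practice, here is a hedged sketch of a single Python script used as both mapper and reducer for a streaming job. The input format (tab-separated lines whose first field is some category to count) is an assumption made for illustration; the important part is the stdin/stdout contract that streaming imposes.

#!/usr/bin/env python
# streaming_counts.py -- one file used as both mapper and reducer for Hadoop Streaming.
# Hypothetical input: tab-separated lines whose first field is a category to count.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            # Streaming expects "key<TAB>value" lines on stdout.
            print("%s\t%d" % (fields[0], 1))

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    # Hadoop sorts the mapper output by key before it reaches the reducer,
    # so the reducer can rely on identical keys arriving consecutively.
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()

A run would then look roughly like hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -input /data/in -output /data/out -mapper 'streaming_counts.py' -reducer 'streaming_counts.py reduce' -file streaming_counts.py, where the jar location and paths are assumptions about a Hadoop 0.20-era installation.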

But I continued my search and found Dumbo, a Python module that lets you easily write and run Hadoop programs. It's considered a good Python API for writing MapReduce programs. On the Last.fm blog, they posted a short guide to getting your Hadoop jobs working with Dumbo. Two things stand out: simplicity and productivity. Of course, I have a piece of advice for you: if you are going to develop a really data-intensive processing job, it may be better to use Java, because Hadoop is written in it. Test it, improve it, compare it with the Python version, and select the best option for you.
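For comparison, here is a minimal Dumbo-style sketch of the same kind of job, following the mapper/reducer/run pattern from the Dumbo documentation; the input layout (first tab-separated field as the key to count) is again my own assumption.

# count_keys.py -- minimal Dumbo sketch (mapper/reducer/run as in the Dumbo docs).
# The "first tab-separated field" layout is a made-up example.
def mapper(key, value):
    # For plain text inputs, "value" is the input line.
    fields = value.split("\t")
    if fields and fields[0]:
        yield fields[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

You would launch it with something along the lines of dumbo start count_keys.py -hadoop /usr/lib/hadoop -input /data/in -output /data/out, where the Hadoop path and data paths are assumptions about your installation.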

Then I found this amazing blog post by Michael Noll, explaining in depth how to use Python with Hadoop. Please don't forget to read it. The other player is hadoopy, which is built with the same purpose as Dumbo. Check it out and try it.

mrjob: another player, built by Yelp

The Yelp engineering team released its own Python framework for writing Hadoop Streaming jobs, called mrjob. In a great post on their engineering blog, they explained why they developed mrjob and shared their work with the world. Thanks, guys; it looks like a good project for my open source contributions.

mrjob can work with Amazon's Elastic MapReduce as well as with your own Hadoop cluster, and it is also available on the Python Package Index, so you can install it this way: easy_install mrjob. The documentation is here.
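As a taste of what an mrjob job looks like, here is a small sketch; the mapper/reducer/run structure follows mrjob's documented pattern, while the job itself (counting lines per category in a tab-separated input) is just an invented example.

# category_count.py -- minimal mrjob sketch; the input format is hypothetical.
from mrjob.job import MRJob

class MRCategoryCount(MRJob):

    def mapper(self, _, line):
        # Emit (category, 1) for the first tab-separated field of each line.
        fields = line.split("\t")
        if fields and fields[0]:
            yield fields[0], 1

    def reducer(self, category, counts):
        yield category, sum(counts)

if __name__ == "__main__":
    MRCategoryCount.run()

You can run it locally with python category_count.py input.txt, against your own cluster with the -r hadoop option, or on Amazon Elastic MapReduce with -r emr once your AWS credentials are configured, which is exactly the appeal the Yelp team describes.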

Other players outside the Hadoop ecosystem, based on Python

But I continued my search, looking for a completely Pythonic solution for writing MapReduce applications, and yes, I found two projects: the Disco Project and mrjob.

The Disco Project

This project is sponsored by Nokia Research and the Disco Project development team, and it is a pure implementation of MapReduce for distributed processing. Disco supports parallel computations over large data sets stored on an unreliable cluster of computers, as in the original framework created by Google. This makes it a perfect tool for analyzing and processing large data sets without having to worry about difficult technicalities related to distribution, such as communication protocols, load balancing, locking, job scheduling, and fault tolerance, which are all handled by Disco.

The basic workflow of the process is:

  • Disco users start Disco jobs in Python scripts
  • Job requests are sent over HTTP to the master
  • The master is an Erlang process that receives requests over HTTP
  • The master launches slaves on each node over SSH
  • Slaves run Disco tasks in worker processes
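To see that workflow from the user's side, here is a minimal sketch modeled on the word-count example in the Disco tutorial, just to show the shape of a job; the input URL is a placeholder, and Job and result_iterator come from disco.core in the releases of this era.

# disco_count.py -- minimal Disco sketch (API as in the Disco tutorial of this era).
from disco.core import Job, result_iterator

def fun_map(line, params):
    # Emit one (token, 1) pair per whitespace-separated token in the line.
    for token in line.split():
        yield token, 1

def fun_reduce(iter, params):
    # Imported inside the function because it is shipped to the worker nodes.
    from disco.util import kvgroup
    for key, counts in kvgroup(sorted(iter)):
        yield key, sum(counts)

if __name__ == "__main__":
    # The input is a placeholder; point it at data reachable from the Disco nodes.
    job = Job().run(input=["http://example.org/sample.txt"],
                    map=fun_map, reduce=fun_reduce)
    for key, count in result_iterator(job.wait(show=True)):
        print("%s\t%d" % (key, count))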
What are you waiting for to try them? Have you considered contributing to them? OK, the next step is to clone it from GitHub and test it, and to follow the tutorial published in the official documentation.

As you can see, there are many Pythonic solutions for writing MapReduce applications, inside and outside the Hadoop ecosystem. In the second part of this topic, I will work through examples for all of these projects, and not just the classic Hadoop "Hello World" word-count example. I promise. Regards.

Tags: Data Analysis, Hadoop, Python, Dumbo, mrjob, DiscoProject, MapReduce


--
Marcos Luis Ortiz Valmaseda
Software Engineer (Distributed Systems)
http://uncubanitolinuxero.blogspot.com