The first time I heard about this role was while reading the book “Beautiful Data: The Stories Behind Elegant Solutions”, edited by Toby Segaran and Jeff Hammerbacher. So, I asked myself: “Why is this so important?” OK, let’s go search for information about it. I began my search and found an interview that Hal Varian, Chief Economist at Google, gave to McKinsey, in which he predicted that statistician would become one of the most sought-after roles of the next ten years. In his own words: “The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” Umm, let me think for a moment. All of this is strongly related. Well, a Data Analyst has to know many things, but what if we could teach or guide young professionals (like me) along this hard path? In this post, I will try to give my impressions about what is needed to be a good Data Scientist.
Use the right tool for the job: R and Apache Hadoop
Every day when I finish my work, I realize that R and Hadoop are the right tools for me. R is a free software environment for statistical computing and graphics, an implementation of the S language. For data analysis, using a special-purpose language like S can be far more efficient than using a general-purpose one. In a really useful talk at OSCON 2009, called “Open Source Analytics: Visualization and Predictive Modeling of Big Data with R”, Michael E. Driscoll, Ph.D. (http://www.dataspora.com/blog) gave several key points on why we should use R for data analysis, especially focused on Big Data.
Some of these points are:
- It’s open source. Yes, that’s really cool and important for me
- We can manipulate data
- We can build models based on statistics (the real “wow” of R)
- We can visualize that data (with many packages: ggplot2, lattice, etc.)
- And it’s extensible via packages
“OK,” you may say, “we can find these kinds of things in other languages too.” And his answer was: “Yes, that’s true, but R gives you one language that already has all of these things.” Honestly, we can’t compete with that.
R is huge, and thanks to its extensibility, you can do a lot of things with the language. At the time of this writing, there are more than 1000 packages available for free on the <a href="http://cran.r-project.org">CRAN site</a>.
My recommended packages for big data:
- plyr: tools for the split-apply-combine pattern
- ggplot2: the grammar of graphics
- biglm: regression models for data too big to fit in memory
- RApache: R for the web
- REvoAnalytics, the amazing set of routines developed by Revolution Analytics
- Rcpp: the interface between R and C++
- and other packages for parallel execution of code: Rmpi, papply, snow, multicore, etc.
Now, Apache Hadoop. There is a lot of interest in this project from many companies and organizations. But the question is: why? I will try to answer it. Apache Hadoop is a creation of Doug Cutting and Mike Cafarella. If you don’t know who Cutting is, you may remember him as the creator of Apache Lucene, the widely used search library.
From Tom White’s well-known book “Hadoop: The Definitive Guide, 2nd Edition”, published by O’Reilly in October 2010:
“This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.”
Yes, it’s cool, I know, and you may be saying it too: “Wow, this is the solution to my big data analysis problem”.
Hadoop was designed around four axioms:
- System Shall Manage and Heal Itself
- Performance Shall Scale Linearly
- Compute Should Move to Data
- Simple Core, Modular and Extensible
Beyond these axioms, Hadoop:
- Can operate on structured and unstructured data
- Has a large community behind it, and an active ecosystem
- Has use cases for companies of every size
- And it’s open source, under the friendly Apache License 2.0
If you choose the Cloudera Distribution for Hadoop (CDH), it is available in several forms:
- a complete VMware image ready for use
- RPM packages for Red Hat-based distributions and SUSE/openSUSE
- .deb packages for Debian and Ubuntu distributions
- and, of course, source and binary files
HDFS, the Hadoop Distributed File System, has some notable features:
- Can store very large files (on the order of gigabytes, terabytes, and petabytes)
- Separates the filesystem metadata (in a node called the NameNode) from the application data (on one or more nodes called DataNodes)
- Assumes that hardware can fail; for that reason, it replicates the data across multiple machines in a cluster (the default replication factor is 3)
- Breaks each file into chunks (by default blocks of 64 MB, although many users use 128 MB) and stores them across multiple DataNodes as local OS files
- It’s based on the write-once, read-many-times pattern
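To make those numbers concrete, here is a tiny back-of-the-envelope sketch (my own illustration, not part of Hadoop) of how many blocks a file occupies and how much raw cluster storage it consumes, assuming the default 64 MB block size and replication factor of 3:

```python
# Back-of-the-envelope HDFS storage math (illustration only).
import math

BLOCK_SIZE = 64 * 1024**2   # default HDFS block size: 64 MB
REPLICATION = 3             # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (number of blocks, total raw bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

# A 1 GB file is split into 16 blocks of 64 MB, each stored on 3 DataNodes.
blocks, raw = hdfs_footprint(1024**3)
print(blocks, raw // 1024**2)  # 16 blocks, 3072 MB of raw storage
```

The takeaway: every gigabyte you load costs roughly three gigabytes of raw disk, which is the price HDFS pays for surviving hardware failures.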
But the core is not all; the Hadoop ecosystem includes many other projects:
- Hive: a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files
- Pig: a platform for analyzing large data sets. Pig’s language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records
- Hadoop Streaming: a utility that comes with the Hadoop distribution. It allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
- Flume: a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing
- Sqoop: an open-source tool that allows users to extract data from a relational database into Hadoop for further processing
- Oozie: a tool developed by Yahoo! for writing workflows of interdependent Hadoop jobs
- HUE: a user interface framework and SDK for visual Hadoop applications
- ZooKeeper: a coordination service for distributed applications
- HBase: the Hadoop database, for random read/write access
- Cascading: a data processing API, process planner, and process scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, all without having to “think” in MapReduce
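Hadoop Streaming deserves a tiny example. Here is a sketch of the classic word count as a mapper/reducer pair in Python (the function names and the local smoke test are my own choices; on a real cluster, the mapper and reducer would live in two separate scripts reading lines on stdin and writing tab-separated key/value pairs to stdout):

```python
#!/usr/bin/env python
# Word count in the Hadoop Streaming style, written as plain functions
# so the whole map -> sort -> reduce flow can be tried locally.
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Sum the counts of each word. Hadoop delivers pairs sorted by key,
    which is exactly what groupby needs (it groups consecutive runs)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local smoke test: sorting stands in for Hadoop's shuffle phase.
    pairs = sorted(mapper(["big data is big"]))
    print(dict(reducer(pairs)))  # {'big': 2, 'data': 1, 'is': 1}
```

A hypothetical invocation on a real cluster would look something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in/ -output out/` (jar location and paths are illustrative).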
- Research, business, everything is based on numbers: Statistics
- Mining, mining: Data Mining
- Visualize it: Information Aesthetics
But there is more: you can check on the Hadoop wiki how many companies use Hadoop today.
Actually, my friend, give Hadoop a try. You can download it from here, or you can use the Cloudera Distribution for Hadoop (CDH). CDH is based on the most recent stable version of Hadoop plus several patches and updates, and it comes in the several formats listed above.
MapReduce is based on the principles of functional programming. In this model, data is explicitly passed between functions as parameters or return values, and can only be changed by the function that currently holds it. It’s a programming model for data processing where parallelism is inherent: a “map” function transforms each piece of data into some number of key/value pairs; the pairs are then sorted by key and routed so that all pairs with the same key reach the same node, where a “reduce” function merges them into a single result.
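That flow can be simulated in a few lines of plain Python (my own sketch, no Hadoop involved), using a max-temperature-per-year task in the spirit of the weather example from Tom White’s book:

```python
# A tiny in-memory simulation of the map -> shuffle -> reduce flow.
from collections import defaultdict

records = [("1950", 0), ("1950", 22), ("1949", 111), ("1949", 78)]

# Map phase: each record becomes a (key, value) pair.
mapped = [(year, temp) for year, temp in records]

# Shuffle phase: group all values by key, as Hadoop does between phases.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: merge each group into a single result per key.
result = {year: max(temps) for year, temps in groups.items()}
print(result)  # {'1950': 22, '1949': 111}
```

The point is that the map and reduce steps never share state, which is why Hadoop can run thousands of them in parallel across a cluster.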
There are a lot of resources for studying the MapReduce programming model in depth. For example, Google Labs published the paper on the original implementation of the model, you can search the wiki, or you can read the books by Tom White and Jason Venner (“Pro Hadoop: Build scalable, distributed applications in the cloud”, from Apress).
Sorry, I forgot to give you this link: 10 MapReduce Tips from Tom.
This is the other jewel of Hadoop: its distributed filesystem. The architecture of HDFS is described in “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, in the proceedings of MSST 2010 (May 2010).
Some of its features were listed above.
But all these components are not the only pieces of the Hadoop ecosystem; there are many problems that have been solved using this piece of technology and the projects around it.
It’s time to enter the Cloud
A lot of companies today use cloud services in many of their business processes: GitHub, The New York Times, Hopper.Travel, and RazorFish are examples. And there are big players in this movement: Amazon, Google, and Microsoft.
Many companies build on the incredible offering behind Amazon’s Elastic MapReduce and S3. The first, as described on its page, “is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).”
And the second is “a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and Amazon EC2 is free. This makes use of S3 attractive for Hadoop users who run clusters on EC2.”
In the second part of this article, I will focus on three more skills that I think a Data Scientist should know.