Choosing a MPP database is incredibly hard


Like the title says, to choose an enterprise-level Massive Parallel Processing (MPP) database is actually a big headache for every Data Science Manager; basically because there are very good choices around the tech world.

Continue reading “Choosing a MPP database is incredibly hard”

Data Scientists: the world need us

Data Science

Some months ago, I wrote a post dedicated to new Data Scientists, giving my personal recommendation about several books that are pure gold, and great tools like Python, R, and Apache Hadoop. Right now, today is a new day for this kind of professional; yes, because, the Harvard Business Review (HBR) published a great article talking about the Data Scientist, written by Thomas H. Davenport and D.J. Patil; and I think that both did an incredible job in this writing, believe me, you should read it, you will not regreat. So, I want to dedicate these lines to the raising quantity of jobs with a shining title: “Data Scientist”. If you look today in any Job Board like Linkedin, AOL Careers , Indeed, SimplyHired, Technology Ladders or Dice, and you do a little search about this title, you will find more than 250 new open positions everyday, doing only the search in U.S. If you expand the search to more countries like UK, Germany, Ireland, India, China, Netherlands, the numbers grow like a completed madness. Continue reading “Data Scientists: the world need us”

Hadley Wickham and Michael E Driscoll are in the same team now

Hadley Wickham

Hadley Wickham and Michael E. Driscoll: What a duo !!!

Yes, these are great news for the Data Science movement: Hadley Wickham (@hadleywickham), PhD, the Assistant Professor of the Department of Statistics in the Rice University, well known for the creation of ggplot2, the amazing R package for Data Visualization, has a new title right now: Data Scientist in Residence at Metamarkets Group, the company co-founded and managed by Michael E. Driscoll (@medriscoll), PhD, another brilliant mind of the Data Science space.

If you read the blog post at the Metamarkets Group, you should had noted that Wickham is learning Javascript, and in particular, D3.js, the great data visualization library written by Mike Bostock, and after I saw the Wickham’s profile at GitHub, its seems that he is working in a combination of ggplot2 and D3.js called r2d3.

I don’t know what do you think about this; for me this is great, seeing two Big Data rockstars working together on interesting projects and driving the future of Big Data analytics. Personally, I admire to the whole team at Metamarkets Group because they built a Petabyte Data Platform called Druid on top of AWS, optimized for performance, with a peak of 26 billions of operations per second. Is not awesome?

So, stay tuned to the Metamarkets Group blog about the work of Hadley there, I think that it will reshape the data visualization field…again !!!

Data Science paradigms

Hilary Mason

Don´t you know some Data Scientists? Here I let you my paradigms

There are a lot of professionals which want to become on Data Scientists (like me), but many times, they don’t know the work that current Data Scientists do on their work. I want to share with you some of the most well known Data Scientists, which love to share their knowledge with the world. Continue reading “Data Science paradigms”

2013: A great time to be a Data-Driven person

Data Science

Searching the web, I found in KDnuggets, the great Data Mining portal, that 2013 is the Year of Statistics. Even, there is a video about the topic on its site.

More and more companies, are interested on the services that this kind of professionals can provide to them. It doesn’t matter if it’s a startup, a middle-size company or a large corporation; everybody is loooking the “Holy Grail” of Data Science. Continue reading “2013: A great time to be a Data-Driven person”

My little advices for young Data Scientist

Data Science

Data Scientist is the sexy role for the next 20 years

This phrased was said by Hal Varian, Chief Financial Officer(CFO) at Google, in a interview to Mckinsey Quaterly News. Varian, who together to his team has become to Google to one of most profitable companies of the world, arriving to amazing numbers: 29,5 billions of dolars in a year.

But these numbers of the company, it would be not possible without three main roles that Varian calls: “Data Analyst”, “Statistician” and the “Data Visualization Expert”, described by the executive like the “hot and sexy jobs”. Varian says: “These professionals are and will be the key of the success of the companies in the next years, specially in these difficult times that is very hard to become to a business in a profitable piece”.So, there are many companies looking for a new professional that could combine these three skills: the “Data Scientist”. If you do a simple search in or searching “Data Scientist”, you can see the raising interest by the companies for this unique kind of professionals.

I want to be a Data Scientist, How I can prepare for the role?

This is a question that many young professionals (like me) have in their minds: “I want to be a Data Scientist, but How I can obtain the required knowledge for acomplish this?”. There are a lot of whitepapers, books, articles, blog sites; a lot of techniques, tools, etc. For that reason, when a new professional is faced to this insane quantity of resources, arrises a new question: What? There a lot of books, tools, How I can begin to do this? This is the main topic of this post, to help from my modest experience in this field to address to new professionals to select good and useful content. Ok, first, my books’s list:

All these amazing books helps me everyday, because they are writting for practitioners that use everyday theirs techniques and tools described in these texts. Remember, this is my personal list, you can build your list, adding more books or removing some. I let you a start point, you can decide how you should follow it. I recommend the order that I let you here, because the first book (Head First Data Analysis) do a amazing job explaining to you the tricky and challenging problems that can face a Data Analyst, in a concise and clean way, addressed for the outstanding way of its writing. (Note: All Head First’s books are incredible useful)

OK, I have the content, What about the tools?

I love Open Source, so, all the tools that I will recommend to you are developed and improved everyday under these principles:

  • Python: It’s a amazing language with a concise and clean syntax, easy to learn, easily extensible, with a lot of useful modules used by Scientists like Numpy and Maptplotlib
  • R: this amazing platform for statictical computing and data visualization has become on the “Lingua Franca for Statictians” today. The reasons are many.
  • Apache Hadoop and its ecosystem:The popular Open Source implementation of the MapReduce’s paradigm, based on a research paper by Google engineers in 2004. This project has become in one of the major trends today, with “Big Data” and “NoSQL”. Many companies are using today this amazing platform for large data sets processing (MapReduce) and distributed storage (HDFS) like Yahoo! for Social Graphs Analysis, Rackspace for Cross Data Center Log Processing, The New York Times for converting 4 TB of images of its archives to PDF files, VISA for Large Scale Transaction Analysis,eHarmony for Match Making, JP Morgan Chase for Data Processing for Finalcial Services and many more examples that you can find on the Hadoop World 2009site and on the last edition of 2011. There are many companies offering commercial versions of Apache Hadoop like Greenplum, the division of EMC with its Greenplum HD, MapR Technologies with its MR3 and MR5 editions, IBM with its BigInSights project, but for me, the leader in commercial support, training and even certifications is Cloudera, the company founded in 2009 by Amr Awadallah former, VP of Engineering for Data Systems at Yahoo!, (now is the current Cloudera CTO), Jeff Hammerbacher, former Data Scientists Team Manager at Facebook (Vice President of Products and Chief Scientist at Cloudera), Christophe Bisciglia and Mike Olson (currently the CEO of the company), former the CEO of Sleepycat, makers of BerkeleyDB, the open source embedded database engine, and then spent two years at Oracle acting like the Vice President for Embedded Technologies after Oracle’s adquisition of Sleepycat in 2006.

Final Thoughts

The rise of the Data Scientist began with Jeff, when he lead and created the Data Team at Facebook. And now in these days, every company, organization or whatever, are looking for this unique kind of professionals to do three key things, like Michael E. Driscoll (Co-founder and CEO of Metamarkets) said in his “Open Source Analytics Visualization and Predictive Modeling of Big Data with R”, in the OSCON 2009:

“We need professionals that they can able to munge, model and visualize data”

. So, It’s a great moment to develop these skills, and in that way, to be able to work in challenging problems that could solve a lot of headaches to your current or future CEO. For that reason , I let you to decide how to use this information, and if you have any comment, please, just send me an email.

Happy Hacking !!!