DataStax Enterprise 3.0: A synonym for High Secure Real-Time Analytics

DataStax

A few days ago, I had the pleasure of talking with two Apache Cassandra experts. The first was Edward Capriolo, a Hadoop System Administrator at Media6Degrees, organizer of the NYC Cassandra User Group and NYC NoSQL Meetups, author of the incredible “Cassandra High Performance Cookbook” and one of DataStax’s MVPs.

The second was Jonathan Ellis himself, DataStax’s Chief Technology Officer and co-founder, who also leads the Apache Cassandra project.



Some upcoming features in HBase 0.96

HBase

HBase 0.96 is a synonym for speed, better compression and high performance

The HBase development team is doing a great job these days, adding some rock-solid features to this amazing data store. The next release will be 0.96, and it brings great things that I will discuss with you right now. I will present the best features here based on my own opinion; I’m open to discussion, so leave a comment to enrich the blog post if you want. OK, let’s start the engine.

The Rise of Column-based data stores Part 1

Column-based data stores are becoming an important trend today

If you read my post about Real-Time Analytics, you should be as excited as I am about this trend. Do you remember the phrase “Time is money”? Time is the main driver behind all innovations in the database world: we want faster solutions, quicker ways to gather huge quantities of data, faster ways to query billions of records, faster ways to adapt our infrastructure, and so on; and many have tried to give clean and useful solutions to this problem.


Chicago: the Wonderland of Data Science in November

City of Chicago

Three great analytics events in one place

Attention, my friends: this is great news for Data Scientists. In November, specifically on the 14th, 15th and 16th, Chicago will become the Wonderland of Analytics, because the IE Group will host three great events.

Innovation, hard work and constant learning at Hadapt

Hadapt

Innovation, research and constant learning at Hadapt

I have the pleasure of making a great announcement from the Hadapt team: they are looking for software development stars interested in building the next-generation database on top of the Hadoop platform. This information was posted in LinkedIn’s Hadoop Users group by Greg Clark, so if you are interested, you can contact him directly.

Why I love MongoDB

MongoDB

Why I love MongoDB?

First, a short story about when I met MongoDB

Did I tell you that my favorite NoSQL data storage system is MongoDB? Yes, I like this document-based data storage solution a lot; it has saved my job many times by now. Why? Easy: on one of my trips to Venezuela as an IT consultant, when I was working with the folks at PDVSA (great guys!), we had a huge MySQL 4.0 database (yes, note that this was a very old version of MySQL; my recommendation for MySQL users now is Percona Server) which was giving the DBAs a lot of headaches. This was the main database of an important application in the company, so all the bosses were running from one side to another, giving a lot of orders to everyone.


My little advice for young Data Scientists

Data Science

Data Scientist is the sexy role for the next 20 years

This phrase was said by Hal Varian, Chief Economist at Google, in an interview with the McKinsey Quarterly. Varian, together with his team, has helped turn Google into one of the most profitable companies in the world, reaching amazing numbers: 29.5 billion dollars in a year.

But these numbers would not be possible without three main roles that Varian calls the “Data Analyst”, the “Statistician” and the “Data Visualization Expert”, which he describes as the “hot and sexy jobs”. Varian says: “These professionals are and will be the key to the success of companies in the coming years, especially in these difficult times when it is very hard to make a business profitable”.

So, there are many companies looking for a new kind of professional who can combine these three skills: the “Data Scientist”. If you do a simple search for “Data Scientist” on Indeed.com or Simplyhired.com, you can see the rising interest of companies in this unique kind of professional.

I want to be a Data Scientist. How can I prepare for the role?

This is a question that many young professionals (like me) have in their minds: “I want to be a Data Scientist, but how can I obtain the knowledge required to accomplish this?”. There are a lot of whitepapers, books, articles and blogs, and a lot of techniques and tools. For that reason, when a new professional is faced with this insane quantity of resources, a new question arises: with so many books and tools, where do I begin? This is the main topic of this post: to help new professionals, from my modest experience in this field, to select good and useful content. OK, first, my book list:

All these amazing books help me every day, because they are written for practitioners who use the techniques and tools described in these texts daily. Remember, this is my personal list; you can build your own, adding more books or removing some. I am giving you a starting point, and you can decide how to follow it. I recommend the order I give here, because the first book (Head First Data Analysis) does an amazing job of explaining the tricky and challenging problems a Data Analyst can face, in a concise and clean way, thanks to its outstanding writing. (Note: all the Head First books are incredibly useful.)

OK, I have the content. What about the tools?

I love Open Source, so all the tools that I will recommend to you are developed and improved every day under these principles:

  • Python: It’s an amazing language with a concise and clean syntax, easy to learn, easily extensible, with a lot of useful modules used by scientists, like NumPy and Matplotlib.
  • R: this amazing platform for statistical computing and data visualization has become the “Lingua Franca for Statisticians” today. The reasons are many.
  • Apache Hadoop and its ecosystem: the popular Open Source implementation of the MapReduce paradigm, based on a research paper by Google engineers in 2004. This project has become one of the major trends today, along with “Big Data” and “NoSQL”. Many companies use this amazing platform today for processing large data sets (MapReduce) and for distributed storage (HDFS): Yahoo! for Social Graph Analysis, Rackspace for Cross Data Center Log Processing, The New York Times for converting 4 TB of images from its archives to PDF files, VISA for Large Scale Transaction Analysis, eHarmony for Match Making, JP Morgan Chase for Data Processing for Financial Services, and many more examples that you can find on the Hadoop World 2009 site and in the latest edition, from 2011. There are many companies offering commercial versions of Apache Hadoop, like Greenplum, the division of EMC, with its Greenplum HD; MapR Technologies with its M3 and M5 editions; and IBM with its BigInsights project. But for me, the leader in commercial support, training and even certifications is Cloudera, the company founded in 2008 by Amr Awadallah, former VP of Engineering for Data Systems at Yahoo! (now Cloudera’s CTO); Jeff Hammerbacher, former Data Science Team Manager at Facebook (now Vice President of Products and Chief Scientist at Cloudera); Christophe Bisciglia; and Mike Olson (currently the CEO of the company), the former CEO of Sleepycat, makers of BerkeleyDB, the open source embedded database engine, who then spent two years at Oracle as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006.
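To make the MapReduce idea from the last item a little more concrete, here is a minimal sketch of the classic word count written in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are my own assumptions for illustration. On a real cluster, the framework would run the map and reduce steps on many machines and handle the grouping between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this between map and reduce; here it is a dict.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, only for illustration.
    sample = ["big data is big", "data science is fun"]
    print(word_count(sample))
```

The point of splitting the work this way is that each map call and each reduce call depends only on its own input, which is what allows a platform like Hadoop to spread them across many nodes.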

Final Thoughts

The rise of the Data Scientist began with Jeff Hammerbacher, when he created and led the Data Team at Facebook. And these days, every company and organization is looking for this unique kind of professional to do three key things, as Michael E. Driscoll (Co-founder and CEO of Metamarkets) said in his talk “Open Source Analytics Visualization and Predictive Modeling of Big Data with R” at OSCON 2009:

“We need professionals who are able to munge, model and visualize data”

So, it’s a great moment to develop these skills and, in that way, to be able to work on challenging problems and save your current or future CEO a lot of headaches. For that reason, I leave it to you to decide how to use this information, and if you have any comments, please just send me an email.

Happy Hacking !!!