Moving the blog to Medium

Hey, friends. I'm moving my blog to Medium. I've been meaning to make this happen for a while, but I had been very busy these days. Last week I finally did it, so I won't write any more posts in my WordPress space; instead, I will write here:

I will be fixing everything at the new address, so if you see something wrong, please send me a message and I will be more than glad to hear it. Thanks again, and let me know what you think about this. All the best, and I hope to see you as a subscriber to my Data-Driven Guy publication.


Big Data Research Initiative in the Obama Administration

The Obama administration unveiled a Big Data Research and Development Initiative with $200 million in funding

I just received a message from Yanchang Zhao announcing this initiative from the Obama administration to invest in Big Data tools. Here is the complete message:

Obama administration unveiled a Big Data Research and Development Initiative with $200 million on March 29, 2012, to improve the ability to extract knowledge and insights from large and complex collections of digital data. Six Federal departments and agencies are involved in improving the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data, and transform the ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security.

The Big Data initiative aims to:

  • Advance state-of-the-art core technologies needed to collect, store, preserve, manage, analyze, and share huge quantities of data.
  • Harness these technologies to accelerate the pace of discovery in science and engineering, strengthen our national security, and transform teaching and learning; and
  • Expand the workforce needed to develop and use Big Data technologies.

It includes:

  • National Science Foundation and the National Institutes of Health – Core Techniques and Technologies for Advancing Big Data Science & Engineering;
  • Department of Defense – Data to Decisions;
  • National Institutes of Health – 1,000 Genomes Project Data Available on Cloud;
  • Department of Energy – Scientific Discovery Through Advanced Computing;
  • US Geological Survey – Big Data for Earth System Science.


More details are available at:


Final Thoughts

Every day, I talk with my colleagues about the key role that Big Data tools like Apache Hadoop, R, and NoSQL platforms will play, and we are convinced that we are on the right path to success, going deeper every day into these amazing pieces of technology and learning a lot in the process. So, take the advice from Michael Driscoll (@medriscoll), founder of DataSpora and current CEO of MetaMarkets Group, to new and young Data Scientists: “Learn to munge, model and visualize data now, and you will be rewarded later.”

Happy Hacking !!!



Cloudera announces HBaseCon

HBaseCon 2012

Cloudera, one of the leading companies today in Apache Hadoop and related technologies, announced that the first conference dedicated exclusively to HBase, the real-time column-oriented data store based on the research behind Google's BigTable, will be held on May 22 at the InterContinental San Francisco hotel, located at 888 Howard Street, San Francisco, CA 94103. Looking at the agenda and the speakers list, I think it could be an outstanding opportunity for networking: meeting Lars George, author of the book “HBase: The Definitive Guide”, in person, talking with HBase committers about the upcoming features in the platform, and more. Some of my favorite talks in the conference are:

Final Thoughts

So, what are you waiting for? Look here for early bird registration, and enjoy every minute of the event. If you have any questions about the event, please send an email to the Cloudera folks here. Cheers.

Happy Hacking !!!


Data 2.0 Summit is Tomorrow

If you are in San Francisco, CA, don't miss the chance to participate in the Data 2.0 Summit: a lot of networking events, amazing talks, executive panels, and a startup competition, all about Data Management. Here are some of the things you can taste for yourself there:

  • San Francisco Open Data Cloud: See a presentation on the recently launched San Francisco Open Data initiative from the founder of Socrata, the data company that hosts San Francisco's open data
  • How Jigsaw got acquired for $175 Million: See a keynote presentation from Jim Fowler, founder of Jigsaw, on how he founded and exited one of the largest data acquisitions of the decade
  • Participate in our Town Hall Talk: Attendees can submit mobile questions to our Town Hall Talk moderators from McKinsey, Trulia, and HP during this un-conference style session
  • Attend presentations directly from executives of Google, RadiumOne, Salesforce, DataSift, GNIP, Datastax, Neustar, Massive Health, CrowdControl Software, and Crowdflower
  • After-Party at 111 Minna Gallery on April 3rd for Data 2.0 Summit pass holders and their guests. We hope to see you there!

Final Thoughts

So, what are you waiting for? Register here and get 20% off your summit pass with the code “data2get2012”.

Happy Hacking !!!


Apache Sqoop is officially a Top-Level Project in the ASF

Apache Sqoop has graduated as a Top-Level Project of the ASF

I just received a message from Sally Khudairi, on behalf of the Apache Software Foundation, announcing that Apache Sqoop is officially a Top-Level Project:

Apache Sqoop is an Open Source big data tool used for efficient bulk data transfer between Apache Hadoop and structured datastores.

Forest Hill, MD –The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced that Apache Sqoop has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the Project’s community and products have been well-governed under the ASF’s meritocratic process and principles.

Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop allows the import of data from external datastores and enterprise data warehouses into the Hadoop Distributed File System (HDFS) or related systems like Apache Hive and HBase.
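As a sketch of what that workflow looks like in practice, here is a hypothetical Sqoop 1 session (the host, database, user, and table names are all made up for illustration); it imports a relational table into HDFS with four parallel map tasks, and then exports aggregated results back to the database:

```shell
# Import the "orders" table from MySQL into HDFS, using 4 map tasks;
# -P prompts for the database password interactively.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /warehouse/orders \
  -m 4

# Export aggregated results from HDFS back into the "order_totals" table.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table order_totals \
  --export-dir /warehouse/order_totals
```

The `-m` flag controls the number of parallel map tasks, which is how Sqoop spreads a single transfer across the cluster.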

“The Sqoop Project has demonstrated its maturity by graduating from the Apache Incubator,” explained Arvind Prabhakar, Vice President of Apache Sqoop. “With jobs transferring data on the order of billions of rows, Sqoop is proving its value as a critical component of production environments.”

Building on the Hadoop infrastructure, Sqoop parallelizes data transfer for fast performance and best utilization of system and network resources. In addition, Sqoop allows fast copying of data from external systems to Hadoop to make data analysis more efficient and mitigates the risk of excessive load to external systems.

“Connectivity to other databases and warehouses is a critical component for the evolution of Hadoop as an enterprise solution, and that’s where Sqoop plays a very important role,” said Deepak Reddy, Hadoop Manager at “We use Sqoop extensively to store and exchange data between Hadoop and other warehouses like Netezza. The power of Sqoop also comes in the ability to write free-form queries against structured databases and pull that data into Hadoop.”

“Sqoop has been an integral part of our production data pipeline” said Bohan Chen, Director of the Hadoop Development and Operations team at Apollo Group. “It provides a reliable and scalable way to import data from relational databases and export the aggregation results to relational databases.”

Since entering the Apache Incubator in June 2011, Sqoop was quickly embraced as an ideal SQL-to-Hadoop data transfer solution. The Project provides connectors for popular systems such as MySQL, PostgreSQL, Oracle, SQL Server and DB2, and also allows for the development of drop-in connectors that provide high speed connectivity with specialized systems like enterprise data warehouses.

Craig Ling, Director of Business Systems at Tsavo Media, said “We adopted the use of Sqoop to transfer data into and out of Hadoop with our other systems over a year ago. It is straightforward and easy to use, which has opened the door to allow team members to start consuming data autonomously, maximizing the analytical value of our data repositories.”

Availability and Oversight
Apache Sqoop software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. Apache Sqoop source code, documentation, mailing lists, and related resources are available at

Final Thoughts

So, if you are a Hadoop user, you should consider including Sqoop in your current or future infrastructure; given its features, it will become a critical part of the architecture. So, download it, install it, and test it.

Happy Hacking !!!


Address Bar Spoofing in iOS 5.1

iOS 5.1 is vulnerable to an Address Bar Spoofing attack

David Vieira-Kurz, of MajorSecurity, has discovered a new way to attack iOS 5.1 based devices, in which the address bar of Apple WebKit/534.46 can be changed through the use of the JavaScript function “”. This flaw could be used by a remote attacker to change the address bar and thereby trick the user into believing that the current page is at a URL different from the one actually visited: in short, to mount some realistic phishing attacks.

Vieira-Kurz has published a proof of concept demonstrating this flaw: any user who visits the URL from their device will see that the URL that actually appears in the Safari browser is

There is no patch available for this yet, so it's recommended that you don't visit important URLs with the Safari browser in iOS 5.1 through a link that is not trusted.

Happy Hacking !!!


Apache Software Foundation in the GSoC again this year

ASF in the GSoC

I just received a message from Sally Khudairi, on behalf of the Apache Software Foundation, announcing that they will be participating in the Google Summer of Code again this year. Here is the complete message:

The Apache Software Foundation will be participating in the Google Summer of Code again this year as a mentoring organization.

Google Summer of Code is a program where students work on open source projects backed by a stipend granted by Google. The Apache Software Foundation has been participating in the program since its inception in 2005.

Each year, 30-40 students are guided by volunteer mentors from various Apache communities. During the summer they learn what it means to participate in a diverse open source community and develop open source software “The Apache Way”. Many past students are now active contributors to our projects.

This year we hope to build on our previous successes and again build student interest and enthusiasm in The Apache Software Foundation. Our list of project ideas (at already contains over 100 ideas spanning more than 25 Apache projects. But that’s only what we have come up with. A lot of students have their very own itch to scratch and approach our projects with their own ideas.

If you are enrolled as a university student and have always wanted to get involved with Apache, here is your chance. Browse our ideas list, approach the projects you are most interested in, discuss your ideas, get involved, code the summer away, and at the end, get a nice paycheck from Google!

Final Thoughts

So, if you are a university student, you could consider participating in the GSoC mentored by brilliant ASF minds; I think it's a great opportunity to work on a real project used by hundreds of thousands of users, like the Apache HTTP Server, Camel, Hadoop and its ecosystem, Cassandra, HBase, TrafficServer, and many more. So, what are you waiting for? Best wishes and good luck.

Happy Hacking !!!


Post-Doctoral position related to Data Mining and Hadoop in France

Post Doctoral position for Data Mining practitioners

Yanchang Zhao, administrator of the R Data Mining Google Group, has just announced a Post-Doctoral position in Montpellier, France, related to Apache Hadoop and its ecosystem. Here is the complete information:

Title: Data Mining in the Cloud


Cloud platforms rely on technologies and architectures that handle massive distribution of data and computation. They are usually provided and maintained by major companies (Amazon, Google, Yahoo, Microsoft). Hadoop is a free platform written in Java that allows data management and processing in a cloud environment. Hadoop is maintained by the Apache Foundation and implements the Google MapReduce technology. Today, most solutions for data mining in the cloud are straightforward implementations of existing algorithms in the selected cloud programming language. A basic illustration is the implementation for MapReduce of the “aPriori” algorithm which performs successive counting steps that rely on the native cloud primitives.
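The counting step that makes “aPriori” such a natural fit for MapReduce is easy to sketch. Here is a toy, single-machine illustration in Python (the item names and the in-memory “shuffle” are my own, purely illustrative, and not code from the project): the map phase emits one (candidate itemset, 1) pair per transaction, and the reduce phase sums the counts to obtain each candidate's support.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transaction, k):
    """Emit (itemset, 1) for every k-item candidate in one transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reduce_phase(pairs):
    """Sum the counts for each candidate itemset (the reduce step)."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return dict(counts)

# Toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "beer"},
    {"bread", "milk", "beer"},
]

# Run the map phase over all transactions, then reduce the emitted pairs.
pairs = [p for t in transactions for p in map_phase(t, 2)]
support = reduce_phase(pairs)
```

On a real cluster, the framework's shuffle phase groups the pairs by key between the two functions; here a single dictionary plays that role.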

However, not all algorithms can have such straightforward implementations. This work aims at developing a set of data mining primitives optimized for a cloud environment. Such primitives have to be useful for different data mining tasks (e.g., finding frequent itemsets and sequential patterns, clustering, etc.).

Missions and activities

Your mission will consist in:

  • Proposing efficient algorithms for some primitives that are useful for different data mining tasks and require a specific adaptation in the cloud
  • Implementing the proposed algorithms on an experimental platform for large scale parallel and distributed systems
  • Performing experiments on real scientific data and evaluating performances of your implementation for the tackled data mining primitives and the associated data mining tasks


Skills and profiles

  • Strong knowledge of statistics
  • Good proficiency in English
  • Good programming skills in Java
  • A Ph.D. in computer science or mathematics

Duration, Location and Salary

Duration is 12 months and location is Montpellier.
The net salary is 2138 euros and includes social security.
A first round of selection will be organized in April for those who applied before March 16, 2012. In case some positions remain available after the first round, a second round will be organized late June. For this second round the deadline for applying is June 29, 2012. We strongly recommend the applicants to submit before the first deadline, i.e. before March 16, 2012.


The Zenith project-team of INRIA, headed by Patrick Valduriez, aims to propose new solutions related to scientific data and activities. Our research topics incorporate the management and analysis of massive and complex data, such as uncertain data, in highly distributed environments.
Our team is located in Montpellier and hosted by the LIRMM Laboratory. Montpellier is a very active town located in south of France. It gathers together major research Labs, that work on environment and health, such as INRA, CIRAD or IRD. Generally speaking, these scientific activities generate extremely large amounts of complex data that need to be managed and analyzed.


Zenith Web Page:
Application page:…

Happy Hacking !!!


Apache Hadoop release 1.0.1 announced as stable

Apache Hadoop 1.0.1 is a stable release

Today on the Hadoop general mailing list, Matt Foley, Release Manager for version 1.0 of the project, announced the new stable release of this version. This is the complete message:

 Hadoop Release 1.0.1 is now available. This is a bug-fix release for version 1.0. This release should now be considered STABLE, replacing the venerable 0.20.203 release!

Patches in this minor release include:

  • Added hadoop-client and hadoop-minicluster artifacts for ease of client install and testing
  • Support run-as-user in non-secure mode
  • Better compatibility with Ganglia, HBase, and Sqoop
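The first item in that list, the new hadoop-client artifact, can be pulled into a Maven build with something like the following POM fragment (a sketch; check the release notes for the exact coordinates of your distribution):

```xml
<!-- Depend on the new hadoop-client artifact from release 1.0.1 -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.0.1</version>
</dependency>
```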

For additional fixes, please see the release notes at
Best regards,
Release Manager for 1.0



Final Thoughts

So, you can start testing all the new features in this new version, and report any issues to the mailing lists. For instructions on how to install this version, you can visit the Apache Hadoop wiki.

Happy Hacking !!!

The PgCon 2012 schedule is announced

PgCon 2012: One of the main PostgreSQL events of the year

PgCon 2012 is one of the main PostgreSQL events of the year, and Dan Langille today announced the schedule of the event on the pgcon-announce mailing list. Here is the complete message from Dan:

The list of talks and speakers for PGCon 2012 has been released. For 2012, we once again have a strong collection of talks that will appeal to a wide range of attendees. Registration will open soon and we will send out an announcement when it does. Be sure to start making your travel plans now.

Our sponsors

For more information on sponsorship opportunities, please see:

Follow us:

Happy Hacking !!!