Thinking like a Data-Driven Guy for Dropbox


When I used Dropbox for the first time from my Linux box, it was a shining moment for me. At that time, I was looking for exactly this kind of solution for the files that I always used to leave behind on my USB drive. For every Linux user, and many of them love Open Source software, collaboration is an important issue. Dropbox has saved my work many times, because the platform itself is a synonym for collaboration, and this is one of the reasons why I love it.

The other reason why I love the platform is that they use my favorite programming language, Python, for the core development of the proprietary synchronization daemon. And in 2012, Guido, the creator of the language, joined Dropbox’s payroll: just awesome!!!

So, I want to make my little contribution to the platform by writing down some ideas on how to improve it and the business itself. I will divide this into some key points:

  • Improve blogging frequency on the Tech blog about Data Science at Dropbox
  • Improve user engagement on mobile devices using Localytics services
  • Hire Greg Nudelman as a consultant to improve Dropbox for Android, and work with the Mailbox team on an Android-based version
  • Build a high-class Data Science team to get more useful and better insights from Dropbox’s massive data sets
  • Improve marketing efforts using Inbound Marketing techniques focused on Facebook, Google Plus, LinkedIn, Twitter, blogging, ebooks, etc.


Data Scientists: the world need us

Data Science

Some months ago, I wrote a post dedicated to new Data Scientists, giving my personal recommendations about several books that are pure gold, and great tools like Python, R, and Apache Hadoop. Today is a new day for this kind of professional, because the Harvard Business Review (HBR) published a great article about the Data Scientist, written by Thomas H. Davenport and D.J. Patil. I think both did an incredible job with it; believe me, you should read it, you will not regret it. So, I want to dedicate these lines to the rising number of jobs with a shining title: “Data Scientist”. If you look today at any job board like LinkedIn, AOL Careers, Indeed, SimplyHired, Technology Ladders or Dice, and you do a little search for this title, you will find more than 250 new open positions every day, searching only in the U.S. If you expand the search to more countries like the UK, Germany, Ireland, India, China, or the Netherlands, the numbers grow like complete madness.

Data Science paradigms

Hilary Mason

Don’t you know any Data Scientists? Here I leave you my paradigms

There are a lot of professionals who want to become Data Scientists (like me), but many times they don’t know what current Data Scientists actually do in their jobs. I want to share with you some of the most well-known Data Scientists, who love to share their knowledge with the world.

My little advices for young Data Scientist

Data Science

Data Scientist is the sexy role for the next 20 years

This phrase was said by Hal Varian, Chief Economist at Google, in an interview with the McKinsey Quarterly. Varian, together with his team, has turned Google into one of the most profitable companies in the world, reaching amazing numbers: 29.5 billion dollars in a year.

But these numbers would not be possible without three main roles that Varian calls “Data Analyst”, “Statistician” and “Data Visualization Expert”, described by the executive as the “hot and sexy jobs”. Varian says: “These professionals are and will be the key to the success of companies in the next years, especially in these difficult times when it is very hard to turn a business into a profitable one”. So, there are many companies looking for a new professional who can combine these three skills: the “Data Scientist”. If you do a simple search for “Data Scientist”, you can see the rising interest of companies in this unique kind of professional.

I want to be a Data Scientist. How can I prepare for the role?

This is a question that many young professionals (like me) have in their minds: “I want to be a Data Scientist, but how can I obtain the knowledge required to accomplish this?”. There are a lot of whitepapers, books, articles, and blogs; a lot of techniques, tools, etc. For that reason, when a new professional is faced with this insane quantity of resources, a new question arises: with so many books and tools, where do I begin? This is the main topic of this post: to help new professionals, from my modest experience in this field, to select good and useful content. OK, first, my book list:

All these amazing books help me every day, because they are written for practitioners who use the techniques and tools described in these texts every day. Remember, this is my personal list; you can build your own, adding more books or removing some. I give you a starting point, and you can decide how to follow it. I recommend the order I give here, because the first book (Head First Data Analysis) does an amazing job of explaining, in a concise and clean way, the tricky and challenging problems that a Data Analyst can face. (Note: all the Head First books are incredibly useful.)

OK, I have the content. What about the tools?

I love Open Source, so all the tools that I will recommend to you are developed and improved every day under these principles:

  • Python: It’s an amazing language with a concise and clean syntax, easy to learn and easily extensible, with a lot of useful modules used by scientists, like NumPy and Matplotlib
  • R: this amazing platform for statistical computing and data visualization has become the “lingua franca of statisticians” today. The reasons are many.
  • Apache Hadoop and its ecosystem: the popular Open Source implementation of the MapReduce paradigm, based on a research paper published by Google engineers in 2004. This project has become one of the major trends today, along with “Big Data” and “NoSQL”. Many companies are using this amazing platform today for processing large data sets (MapReduce) and for distributed storage (HDFS): Yahoo! for social graph analysis, Rackspace for cross-data-center log processing, The New York Times for converting 4 TB of images from its archives to PDF files, VISA for large-scale transaction analysis, eHarmony for matchmaking, JP Morgan Chase for data processing for financial services, and many more examples that you can find on the Hadoop World 2009 site and in the latest edition of 2011. There are many companies offering commercial versions of Apache Hadoop, like Greenplum, the division of EMC, with its Greenplum HD; MapR Technologies with its M3 and M5 editions; and IBM with its BigInsights project. But for me, the leader in commercial support, training and even certifications is Cloudera, the company founded in 2008 by Amr Awadallah, former VP of Engineering for Data Systems at Yahoo! (now Cloudera’s CTO); Jeff Hammerbacher, former manager of the Data Scientists team at Facebook (now Vice President of Products and Chief Scientist at Cloudera); Christophe Bisciglia; and Mike Olson (currently the CEO of the company), formerly the CEO of Sleepycat, makers of BerkeleyDB, the open source embedded database engine, who then spent two years at Oracle as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006.
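If the MapReduce paradigm mentioned in the Hadoop bullet is new to you, here is a minimal sketch of the idea in plain Python, using the classic word count. This is only an illustration on a single machine: it does not use Hadoop at all, and the function names (map_phase, shuffle, reduce_phase, word_count) are my own assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big hadoop"]))
```

In a real cluster, the map and reduce steps run in parallel on many machines and the framework takes care of the shuffle; here everything happens in one process, just to show the shape of the computation.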

Final Thoughts

The rise of the Data Scientist began with Jeff Hammerbacher, when he created and led the Data Team at Facebook. Nowadays, every company and organization is looking for this unique kind of professional to do three key things, as Michael E. Driscoll (co-founder and CEO of Metamarkets) said in his talk “Open Source Analytics Visualization and Predictive Modeling of Big Data with R” at OSCON 2009:

“We need professionals who are able to munge, model and visualize data”

So, it’s a great moment to develop these skills and, in that way, to be able to work on challenging problems whose solutions could save your current or future CEO a lot of headaches. For that reason, I leave it to you to decide how to use this information, and if you have any comments, please just send me an email.

Happy Hacking !!!

Running MongoDB 2.0.1 in Fedora 15 from sources

MongoDB: the document-oriented database

MongoDB is one of the major trends today in the NoSQL world, with a great list of customers like Foursquare, Boxed Ice, Etsy, SourceForge, Justin.tv, GitHub, Disqus, and many more. The complete list can be found on the official MongoDB site.

For that reason, in May 2011, we began an excellent project that we called Naire (elephant dominator in Indian), focused on building a web platform for monitoring PostgreSQL servers. We opted for MongoDB to store all the metrics and logs, and it turned out to be a great choice.

The complete stack is: Python/Django, RabbitMQ and Celery for background task processing, Nginx as the HTTP server with uWSGI, and Varnish Cache as the HTTP accelerator for caching HTTP requests.

But at that time we didn’t have the amazing RPM packages built by the software engineers at 10gen (the company behind the commercial support for MongoDB), so we had to compile MongoDB for Fedora from sources, and that’s the objective of today’s post.

You have noticed that I’m a big fan of Fedora Linux and, of course, my laptop (a modest Acer Aspire 5251-1805) runs that operating system. It’s an amazing platform for developing Open Source applications, because there are a lot of packages and suites available for it. Pick one, and you will find it in the official repositories, or in the unofficial but also very useful ones:

  • PostgreSQL (my favorite relational database system)
  • The amazing Python programming language, one of the key reasons why I use Fedora
  • Django, the well known web framework for perfectionists
  • Ruby on Rails, one of the most used web frameworks today
  • PHP, one of the most used programming languages for web development today
  • Java, the platform par excellence for Enterprise Software Development
  • A lot of projects from the Apache Software Foundation, including HTTP Server, Hadoop and its related projects
  • Nginx, one of the most well known HTTP servers today for its amazing performance

All these packages are available on Fedora, in the official repositories or in 3rd party repositories. Remi and FreshRPMs are some of the most used 3rd party repositories.

MongoDB’s RPM packages are now built by 10gen engineers, so you can find all this information on the official site but, like I said, we are going to install MongoDB from sources.

Step 1: Create the necessary directories for MongoDB

In Fedora and in Red Hat Enterprise Linux (Red Hat is the main sponsor of the Fedora project), there are some conventions for database applications and servers, so we want to follow these rules:

  • The data directory should be /var/lib/service_name/data
  • The binary directory should be /usr/local/service_name/bin
  • Every service should have its own user

So, we should create the required group and user for MongoDB, along with its directories (run these commands as root):

# mkdir -p /usr/local/mongodb/bin
# mkdir -p /var/lib/mongodb/data
# mkdir -p /var/lib/mongodb/logs
# groupadd mongodb
# useradd mongodb -d /var/lib/mongodb -c "MongoDB Administration User" -g mongodb -s /bin/bash
# chown -R mongodb:mongodb /usr/local/mongodb/bin
# chown -R mongodb:mongodb /var/lib/mongodb/data
# chown -R mongodb:mongodb /var/lib/mongodb/logs

Step 2: Download MongoDB for your architecture

Then, we are going to download MongoDB from its official site. In my case, I use Fedora 15 for 64 bits (x86_64), so I’m going to download the mongodb-linux-x86_64-2.0.1.tgz file. Then you should decompress the .tgz file, move the contents of bin/ to /usr/local/mongodb/bin, and set the permissions on those binary files and on the data and logs directories:

# tar xvf mongodb-linux-x86_64-2.0.1.tgz -C /tmp
# cd /tmp/mongodb-linux-x86_64-2.0.1
# mv bin/* /usr/local/mongodb/bin
# chown -R mongodb:mongodb /usr/local/mongodb/bin
# chmod -R 700 /usr/local/mongodb/bin
# chmod -R 700 /var/lib/mongodb/data
# chmod -R 700 /var/lib/mongodb/logs
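If you want to double check the result of those chmod commands, a small Python sketch like this one can help. It only reads the permission bits of a path and compares them with 0700. The function name is_owner_only is my own invention, and the demonstration uses a temporary directory so that it runs on any machine; on the real box you would pass the MongoDB directories instead.

```python
import os
import stat
import tempfile


def is_owner_only(path):
    """Return True when the permission bits of path are exactly 0700."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o700


if __name__ == "__main__":
    # Demonstration on a throwaway directory so the sketch runs anywhere;
    # this is an assumption, not one of the real MongoDB directories.
    with tempfile.TemporaryDirectory() as demo_dir:
        os.chmod(demo_dir, 0o700)
        print(demo_dir, is_owner_only(demo_dir))
```

Keep in mind that this checks only the mode, not the owner, so the chown commands are still needed.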

Step 3: Run MongoDB

The last step is to run MongoDB using the data directory and the logs directory:

$ PATH=$PATH:/usr/local/mongodb/bin
$ export PATH
$ su mongodb
$ mongod --dbpath /var/lib/mongodb/data/ --logpath /var/lib/mongodb/logs/mongodb-2.0.1.log
all output going to: /var/lib/mongodb/logs/mongodb-2.0.1.log
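Once mongod is running, a quick way to confirm that it is really accepting connections is to try a plain TCP connection to its port. This is just a sketch with the standard socket module, not a MongoDB client: the function name is_listening is hypothetical, 127.0.0.1 assumes that mongod runs on the same machine, and 27017 is the default MongoDB port.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1 is an assumption (a local mongod); 27017 is the default port.
    print(is_listening("127.0.0.1", 27017))
```

A True here only means that something is listening on that port; to talk to the database itself you still need the mongo shell or a driver.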

Step 4: Connect to MongoDB with shell

Then you can run the shell command (mongo) to connect to the database:

$ mongo --host --port 27017

The output on my machine is:

MongoDB shell version: 2.0.1
connecting to:

And a short piece of the log output on my laptop is this:

Mon Feb 13 15:55:46 [initandlisten] MongoDB starting : pid=7016 port=27017 dbpath=/var/lib/mongodb/data 64-bit host=unknown
Mon Feb 13 15:55:46 [initandlisten] db version v2.0.1, pdfile version 4.5
Mon Feb 13 15:55:46 [initandlisten] git version: 3a5cf0e2134a830d38d2d1aae7e88cac31bdd684
Mon Feb 13 15:55:46 [initandlisten] build info: Linux #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOS
Mon Feb 13 15:55:46 [initandlisten] options: { dbpath: "/var/lib/mongodb/data/", logpath: "/var/lib/mongodb/logs/mongodb-2.0.1.log" }
Mon Feb 13 15:55:46 [initandlisten] journal dir=/var/lib/mongodb/data/journal
Mon Feb 13 15:55:46 [initandlisten] recover : no journal files present, no recovery needed
Mon Feb 13 15:55:46 [initandlisten] waiting for connections on port 27017
Mon Feb 13 15:55:46 [websvr] admin web console waiting for connections on port 28017
Mon Feb 13 15:56:46 [clientcursormon] mem (MB) res:17 virt:612 mapped:0
Mon Feb 13 15:57:47 [initandlisten] connection accepted from #1
Mon Feb 13 15:58:00 [conn1] end connection
Mon Feb 13 15:58:49 [initandlisten] connection accepted from #2
Mon Feb 13 16:01:46 [clientcursormon] mem (MB) res:17 virt:613 mapped:0
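Since we used MongoDB to store metrics, it is fun to pull some metrics out of MongoDB's own log too. The sketch below parses the [clientcursormon] lines shown above and extracts the memory figures. The regular expression is based only on the lines in this excerpt, so treat the format as an assumption for other MongoDB versions; the names MEM_LINE and parse_memory are mine.

```python
import re

# Pattern built from the [clientcursormon] lines in the log excerpt above.
MEM_LINE = re.compile(
    r"\[clientcursormon\] mem \(MB\) res:(\d+) virt:(\d+) mapped:(\d+)"
)


def parse_memory(line):
    """Return a dict of memory figures in MB, or None for other log lines."""
    match = MEM_LINE.search(line)
    if match is None:
        return None
    res, virt, mapped = (int(value) for value in match.groups())
    return {"res": res, "virt": virt, "mapped": mapped}


if __name__ == "__main__":
    sample = [
        "Mon Feb 13 15:55:46 [initandlisten] waiting for connections on port 27017",
        "Mon Feb 13 15:56:46 [clientcursormon] mem (MB) res:17 virt:612 mapped:0",
    ]
    for line in sample:
        print(parse_memory(line))
```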

Final Thoughts

Well, with these steps you can use MongoDB for development on your personal workstation. If you don’t want to complicate things, use the package manager of your distribution; but if you want to actually learn how MongoDB works, this is a useful way to find out. I leave you a challenge: write an init script for MongoDB that is compatible with chkconfig in Fedora. This will be my next blog post, but I want to let you think and find a good solution for it.

Happy Hacking !!!


Installing Python scikit-learn package for Machine Learning in Fedora 15

Yesterday, I was looking for some examples of Machine Learning algorithms in Python, and I found an amazing package called scikit-learn, which is described on its own site as follows:

“scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.”

But I found a minor issue: all the installation guides are for Debian/Ubuntu, MacPorts, or NetBSD, but not for my Linux distribution: Fedora. So I said: “OK, let me try to create a simple How-To for installing this package on Fedora”.

Step 1: Dependencies

First, we need to install all dependencies for it:

# yum update && yum install scipy.x86_64 numpy.x86_64 python-devel.x86_64 python-matplotlib.x86_64 python-pip.noarch gcc-c++.x86_64

Step 2: Installing scikit-learn using pip

Now, I will use the fastest way to install the package, using pip:

# pip-python install scikit-learn

This command should give an output very similar to this:

# warning: no files found matching ''
# warning: no files found matching '*.TXT' under directory 'sklearn/datasets'
# Installing /usr/lib64/python2.7/site-packages/scikit_learn-0.9-py2.7-
# Successfully installed scikit-learn
Cleaning up...
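Before running anything bigger, I like to check that the modules can actually be found by Python. Here is a minimal sketch for that. It is written for a modern Python 3 (importlib.util.find_spec does not exist in the Python 2.7 of this post, where a plain import statement does the same job), and the function name missing_modules is my own. Note that the import name of scikit-learn is sklearn, as the directory in the output above suggests.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # numpy, scipy and matplotlib come from Step 1; sklearn from Step 2.
    wanted = ["numpy", "scipy", "matplotlib", "sklearn"]
    missing = missing_modules(wanted)
    print("all modules found" if not missing else "missing: " + ", ".join(missing))
```

If the list comes back empty, the installation is at least visible to the interpreter you are running.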

Step 3: Running the examples

The final step is to run several examples provided here

Happy Hacking !!!