Choosing an MPP database is incredibly hard

As the title says, choosing an enterprise-level Massively Parallel Processing (MPP) database is a real headache for every Data Science Manager, mainly because there are so many very good choices around the tech world.

So here are my top reasons for choosing a good platform of this kind.

Fast Query processing

I don’t think I have to explain much here: fast query processing is critical for every data-driven business that wants to answer bigger questions and act on them more quickly. A platform that can query huge data sets in a matter of seconds or minutes is a huge advantage over your competitors. As a Product Manager focused on Big Data Analytics, I consider this critical for my company.

Integration with Apache Hadoop

Apache Hadoop has become the de facto analytics platform for Big Data processing, so a new business interested in Big Data has to build an integrated platform in which Hadoop can play a critical role. If your database can communicate easily with the yellow elephant, you will be able to adapt more quickly to future changes in your Business Analytics needs.

Cloud-ready solution

It doesn’t matter whether you are a big company or a startup: the scalability that Cloud Computing platforms offer is massive, and you should see it as a great advantage. Many companies run their core services on platforms like Amazon Web Services, Rackspace Cloud, HP Cloud Services, etc., so when you choose a database platform, make sure that it runs very well in Cloud environments.

Fast Data compression

Many Data engineers don’t see the importance of data compression until they have unmanageable clusters that consume a lot of storage and drain the company budget just to support a massive Data infrastructure. For me, this is a key feature: smart and fast compression algorithms save a lot of space and, of course, give you a better ROI on your Data infrastructure. Data grows by the second these days, and a platform that handles compression transparently for you also helps performance. That is why I think every good data store, not just MPP-based databases, has to include compression as one of its strong features.
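
To make the space savings concrete, here is a minimal sketch in Python that measures a compression ratio with the standard-library zlib module. The column contents and the helper name compression_ratio are made up for illustration, and general-purpose zlib is only a stand-in here: the MPP databases discussed in this post use their own column-aware encodings, which this sketch does not try to reproduce.

```python
import zlib


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return the original size divided by the compressed size."""
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)


# Hypothetical low-cardinality column (a "country" field), the kind of
# repetitive data that tends to compress very well.
column = "\n".join(["US", "US", "PE", "US", "BR"] * 20000).encode("utf-8")

ratio = compression_ratio(column)
print(f"raw bytes: {len(column)}, compression ratio: {ratio:.1f}x")
```

Real tables are far less regular than this toy column, so treat the ratio as an upper bound on what repetition alone can buy you, not as a benchmark.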

Advanced Built-in Analytics functions

Why use external applications for logistic regression studies, collaborative filtering, and all kinds of Data Mining and statistical modeling research? I think a better way to do all this is to use built-in functions inside the database, because that is where the data is.

So I always look for this when I’m evaluating a new data store.
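
A small sketch of the idea, assuming Python and its built-in sqlite3 module as a stand-in (SQLite is not an MPP database, and the orders table and both function names are hypothetical). The first function lets the database compute the aggregate and returns only the result; the second pulls every row out and does the same math in the application.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0)],
)


def avg_in_database(conn):
    """Compute per-region averages with a built-in SQL aggregate."""
    rows = conn.execute(
        "SELECT region, AVG(amount) FROM orders GROUP BY region"
    ).fetchall()
    return dict(rows)


def avg_in_application(conn):
    """Fetch every row and compute the same averages outside the database."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        running_sum, count = totals.get(region, (0.0, 0))
        totals[region] = (running_sum + amount, count + 1)
    return {region: s / n for region, (s, n) in totals.items()}


print(avg_in_database(conn))
print(avg_in_application(conn))
```

Both functions return the same numbers, but the in-database version only moves one row per region across the wire, which is the whole point once the table has billions of rows.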

The winners

HP Vertica Analytics Platform 6.1 “Bulldozer”

I’ve talked a lot about Vertica because I love this product, and the latest release of the platform left me with the same feeling: this is a truly advanced database. But, believe me, I’m not the only one; companies like Twitter, Zynga, and Convertro are big users of the platform.

If you want to know more about the Bulldozer release, just watch the webinar dedicated to this topic here, or you can download its Datasheet from here.

Greenplum Database 4.0

But Vertica’s team is not the only one innovating in this space. The amazing engineering team at Greenplum has added some outstanding and very useful features to its new release, a testament to the team’s hard work:

  • High Performance gNet™ for Hadoop
  • Out-of-the-Box Support for Big Data Analytics
  • Multi-level Partitioning with Dynamic Partition Elimination
  • Polymorphic Data Storage-MultiStorage/SSD Support
  • Fast Query processing with a new loading technology called Scatter/Gather Streaming, allowing automatic parallelization of data loading and queries
  • Analytics and Language Support, providing methods for advanced analytic functions like t-statistics, p-values, and Naïve Bayes inside the database, along with great integration with R
  • Dynamic Query Prioritization
  • and many more amazing features

You can see all the features on its official site here. If you want to see it in action, just read this amazing whitepaper explaining how Greenplum Database is being used for Advanced Cyber Analytics for U.S. government agencies.

Teradata Aster Data Database 5.0

This is another great team doing very well in this field, with an amazing platform that also brings highly technical research together in a single product. Some of its features are:

  • A patent-pending SQL-MapReduce framework that lets you combine the power of MapReduce with SQL
  • Hybrid row/column storage, depending on your needs
  • Two great features called “Always-On” and “Always-Parallel”, which use parallelism for data and analytics processing and provide world-class fault tolerance
  • A great group of ready-to-use analytic functions for rapid analytic platform development called Aster MapReduce Analytics Portfolio
  • Rich monitoring and easy management of data and analytic processing with the intuitive Aster Management Console
  • Great integration with several languages like Java, C, C#, Python, C++, and R
  • Dynamic mixed workload management that ensures scalable performance even with large numbers of concurrent users and workloads

You can download its Datasheet from here.

ParAccel Analytic Platform 4.0

This is another team which is doing a terrific job building this outstanding analytic platform.

Some of its features:

  • On-Demand integration with Hadoop
  • Column-oriented storage scheme
  • A powerful extensibility framework
  • Advanced query optimization, which improves complex operations like JOINs, sorting, and query planning
  • An advanced communication protocol for the cluster interconnect that improves data loading, backup and recovery, and parallel query execution
  • Advanced I/O optimization that improves scan performance, using high-performance algorithms to predict which data blocks will be needed for future operations
  • Adaptive compression encoding, depending on the data type involved
  • A great number of ready-to-use analytic functions for techniques like pattern matching, time series analysis, advertising attribution and optimization, sophisticated fraud detection, and event analysis, plus a lot of statistical methods: Univariate, Multivariate, Data Mining, Mathematical, Corporate Finance, Options/Derivatives, Portfolio Management, Fixed Income, and many more

You can read more about all this in its Datasheet here.

There is a new player right now from AWS called Amazon Redshift, which is based on a version of ParAccel and looks set to become the future platform for Data Warehousing applications thanks to its great features and a very low price for 2 TB per year. You can read the great post published by Dr. Werner Vogels, Amazon CTO, or watch a great introduction to the service in a talk given by Rahul Pathak (@rahulpathak), Sr. Product Manager for Amazon Redshift, at the recent AWS Summit in New York.

Conclusions

Well, my good fellows, you have great choices; you just have to analyze which one is best for your needs. Stay tuned for the next blog post, and thanks a lot for reading.


16 thoughts on “Choosing an MPP database is incredibly hard”

  4. Another columnar database that should be on this list is Infobright. We compared it against Redshift and Vertica. Each solution has its pluses and minuses, but when it came to ad hoc query performance, Infobright was the winner.

    1. That’s true, kishore2321. Infobright is a great solution. It would be nice to see the benchmarks that you ran between HP Vertica and Infobright, because query performance is one of the key advantages of HP Vertica. The two are direct competitors because they have similar architectures and similar features, but I think HP Vertica is better positioned in the Big Data market, with a stronger customer base, and with the investment from HP, Vertica’s revenue growth has exploded. Did you see the latest 6.1 release of the platform, codenamed “Bulldozer”?

      At the last HP Vertica Big Data Conference 2013, a lot of customers presented their use cases for the platform, and there are a lot of amazing things to see in how HP Vertica provides value through data-driven decisions.

      For example, see these posts:
      http://www.vertica.com/2013/08/22/doing-everything-with-big-data/
      http://www.vertica.com/2013/07/24/hp-vertica-boosts-performance-for-infinity-insurance/

  5. Can you comment on the price points? In this regard does anything come close to Redshift or is it by far the cheapest cloud ready MPP system available right now?

    1. Regards, Patrick, and thanks a lot for the comment. About the price of these solutions: unfortunately, not all prices are public, so Amazon Redshift is a very good choice for a Cloud-based solution. I will try to research this topic in depth and give you a more detailed answer.

  7. Redshift isn’t based on PADB; it’s based on PostgreSQL.
    Don’t believe the wiki; read it from Amazon. Amazon just bought some components from PADB that allow petabyte scaling.

    PS
    As a Vertica integrator, I have to say that I haven’t lost any POC because of performance, only on price.
