As the title says, choosing an enterprise-level Massively Parallel Processing (MPP) database is a real headache for every Data Science Manager, mainly because there are so many very good choices around the tech world.
Still, here are my top reasons for choosing a good platform of this kind.
Fast Query processing
I don’t think I have to explain much here: this feature is critical for every data-driven business that needs to answer bigger questions and act on them more quickly. A platform that can query huge data sets in a matter of seconds or minutes is a huge advantage over your competitors. As a Product Manager focused on Big Data Analytics, I consider this critical for my company.
Integration with Apache Hadoop
Apache Hadoop has become the de facto analytics platform for Big Data processing. A new business interested in Big Data has to build an integrated platform in which Hadoop plays a critical role, and a database that communicates easily with the yellow elephant will let you adapt more quickly to future changes in your Business Analytics needs.
It doesn’t matter whether you are a big company or a startup: the scalability that Cloud Computing platforms offer is massive, and you should see it as a great advantage. Many companies run their core services on platforms like Amazon Web Services, Rackspace Cloud, HP Cloud Services, etc., so when choosing a database platform, make sure that it runs very well in Cloud environments.
Fast Data compression
Many data engineers don’t see the importance of data compression until they have unmanageable clusters consuming huge amounts of storage and draining the company budget just to support a massive data infrastructure. For me this is a key feature, because smart and fast compression algorithms save a lot of space and, of course, give you a better ROI on your data infrastructure. Data grows at an enormous rate these days, and a platform that handles compression transparently for you also helps performance, so I think every good data store, not just MPP databases, has to include compression as one of its strong features.
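To get a feel for how much the layout of the data matters for compression, here is a minimal sketch in Python. It uses the standard `zlib` module as a stand-in; none of the platforms discussed here are claimed to use zlib, and the status-code column is made up for illustration. It compares the compression ratio of a low-cardinality column stored in arrival order against the same column stored sorted, which is roughly the situation a column-oriented store can exploit.

```python
import random
import zlib

random.seed(0)  # fixed seed so repeated runs use the same data

# Hypothetical column: 100,000 status codes drawn from only three values.
values = [random.choice(["OK", "WARN", "ERROR"]) for _ in range(100_000)]


def compression_ratio(column):
    """Return raw size divided by zlib-compressed size for a list of strings."""
    raw = "\n".join(column).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


unsorted_ratio = compression_ratio(values)
sorted_ratio = compression_ratio(sorted(values))

print(f"arrival order: {unsorted_ratio:.1f}x smaller")
print(f"sorted column: {sorted_ratio:.1f}x smaller")
```

The sorted column compresses far better because it turns into a few very long runs of identical values, so the same data costs much less disk when the engine controls how it is laid out.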
Advanced Built-in Analytics functions
Why use external applications for logistic regression studies, collaborative filtering, and all kinds of data mining and statistical modeling research? I think a better way to do all this is with built-in functions inside the database, because that is where the data is.
So, I always look for this when I’m evaluating a new data store.
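As a small illustration of the idea of pushing the analysis to where the data lives, here is a sketch that fits a simple least-squares line entirely inside the database using plain SQL aggregates. It uses Python's built-in `sqlite3` module only because it needs no setup; the `sales` table, its columns, and the numbers are hypothetical, and real MPP platforms ship far richer built-in functions than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
# Hypothetical data that follows revenue = 2 * ad_spend + 1 exactly.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9)],
)

# Ordinary least squares for one predictor, computed by the database itself:
# only two numbers travel back to the client, not the raw rows.
slope, intercept = conn.execute(
    """
    WITH s AS (
        SELECT COUNT(*)                 AS n,
               SUM(ad_spend)            AS sx,
               SUM(revenue)             AS sy,
               SUM(ad_spend * revenue)  AS sxy,
               SUM(ad_spend * ad_spend) AS sxx
        FROM sales
    )
    SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope,
           (sy - ((n * sxy - sx * sy) / (n * sxx - sx * sx)) * sx) / n AS intercept
    FROM s
    """
).fetchone()

print(slope, intercept)
conn.close()
```

The point is not the formula but the data movement: the table could hold billions of rows and the client would still receive a single result row.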
HP Vertica Analytics Platform 6.1 “Bulldozer”
I’ve talked a lot about Vertica because I love this product, and the latest release of the platform left me with the same feeling: this is a truly advanced database. But, believe me, I’m not the only one; companies like Twitter, Zynga, and Convertro are big users of the platform.
Greenplum Database 4.0
But Vertica’s team is not the only one innovating in this space. The amazing engineering team at Greenplum has added some outstanding features to its new release, which are very useful and a testament to the team's hard work:
- High Performance gNet™ for Hadoop
- Out-of-the-Box Support for Big Data Analytics
- Multi-level Partitioning with Dynamic Partitioning Elimination
- Polymorphic Data Storage-MultiStorage/SSD Support
- Fast Query processing with a new loading technology called Scatter/Gather Streaming, allowing automatic parallelization of data loading and queries
- Analytics and Language Support, providing methods for advanced analytic functions like t-statistics, p-values, and Naïve Bayes inside the database, plus great integration with R
- Dynamic Query Prioritization
- and many more amazing features
You can see all the features on its official site here. If you want to see it in action, just read this amazing whitepaper explaining how Greenplum Database is being used for Advanced Cyber Analytics for U.S. government agencies.
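One item in the feature list above, partition elimination, is easy to picture with a toy example. The sketch below is not Greenplum's implementation; it is a plain Python illustration with made-up rows, in which a table is split by month and a query that filters on month only scans the partitions that can contain matching rows.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, amount).
rows = [
    ("2013-01", 10),
    ("2013-01", 20),
    ("2013-02", 5),
    ("2013-03", 7),
    ("2013-03", 1),
]

# Partition the table by month, as a partitioned table would on load.
partitions = defaultdict(list)
for month, amount in rows:
    partitions[month].append(amount)


def total_for_months(partitions, wanted_months):
    """Sum amounts, scanning only partitions whose key satisfies the filter."""
    scanned = [month for month in partitions if month in wanted_months]
    total = sum(sum(partitions[month]) for month in scanned)
    return total, scanned


total, scanned = total_for_months(partitions, {"2013-03"})
print(total, scanned)
```

Only one of the three partitions is read to answer the query; on a table with years of history, skipping the rest is where most of the time is saved.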
Teradata Aster Data Database 5.0
This is another great team that is doing very well in this field, with an amazing platform that packs highly technical research into a single product. Some of its features are:
- A patent-pending SQL-MapReduce framework that lets you combine the power of MapReduce with SQL
- Hybrid row/column storage, depending on your needs
- Two great things called “Always-On” and “Always-Parallel”, which use parallelism for data and analytics processing and provide world-class fault tolerance
- A great group of ready-to-use analytic functions for rapid analytic platform development called Aster MapReduce Analytics Portfolio
- Rich monitoring and easy management of data and analytic processing with the intuitive Aster Management Console
- Great integration with several languages like Java, C, C#, Python, C++, and R
- Dynamic mixed workload management ensures scalable performance even with large numbers of concurrent users and workloads
You can download its Datasheet from here.
ParAccel Analytic Platform 4.0
This is another team which is doing a terrific job building this outstanding analytic platform.
Some of its features:
- On-Demand integration with Hadoop
- Columnar-Oriented storage scheme
- A powerful extensibility framework
- Advanced query optimization, giving better performance on complex operations like JOINs, sorting, and query planning
- An advanced communication protocol for the cluster interconnect that improves data loading, backup and recovery, and parallel query execution
- Advanced I/O optimization that improves scan performance, using high-performance algorithms to predict which data blocks will be needed for future operations
- Adaptive compression encoding, depending on the data type involved
- A great number of ready-to-use analytic functions for techniques like pattern matching, time series analysis, advertising attribution and optimization, sophisticated fraud detection, and event analysis, plus many statistical methods: Univariate, Multivariate, Data Mining, Mathematical, Corporate Finance, Options/Derivatives, Portfolio Management, Fixed Income, and many more
You can read more about all this in its Datasheet here.
There is also a new player from AWS called Amazon Redshift, which is based on a version of ParAccel and looks set to become the future platform for Data Warehousing applications, thanks to its great features and a very low price for 2 TB per year. You can read the great post published by Dr. Werner Vogels, Amazon CTO, or watch a great introduction to the service in a talk given by Rahul Pathak (@rahulpathak), Sr. Product Manager for Amazon Redshift, at the last AWS Summit in New York.
Well, my good fellows, you have great choices; you just have to analyze which one is best for your needs. Stay tuned for the next blog post, and thanks a lot for reading.