Parallel DBMSs excel at efficient querying of large data sets; MapReduce-style systems excel at complex analytics and ETL tasks. Neither is good at what the other does well. Hence, the two technologies are complementary.
The finding that Vertica is faster than Hadoop or DBMS-X would be more credible if the article's author were not CTO and co-founder of Vertica, a fact nowhere mentioned in the article or author listing.
It is interesting to see both articles appear together. Still, Google's MapReduce is not the same as Hadoop, which makes the comparison hard to interpret. I also think MapReduce was invented not as research but to solve Google's own problems, so it is not elegant in an academic sense.
This is an improvement over Stonebraker's other writings related to MapReduce and NoSQL, but still a very slanted view. The authors pit Hadoop, a specific (and imperfect) implementation of MapReduce, against the idealized conception of parallel DBMSs. Even their tests are slanted to show Vertica in a good light (an important fact to consider is Stonebraker's vested interest in Vertica coming out ahead). The article by Dean and Ghemawat nicely illustrates the fallacies in the comparison paper and shows just where Stonebraker went wrong (again).
The following letter was published in the Letters to the Editor in the April 2010 CACM (http://cacm.acm.org/magazines/2010/4/81506).
I applaud the debate on MapReduce between "MapReduce and Parallel DBMSs: Friends or Foes?" by Michael Stonebraker et al. and "MapReduce: A Flexible Data Processing Tool" by Jeffrey Dean and Sanjay Ghemawat (Jan. 2010). But I strongly object to the former's criticism of the MapReduce designers, saying "Engineers should stand on the shoulders of those who went before, rather than on their toes." Creating an alternate method is not stepping on anyone's toes. Such accusations, besides being unjust, impede science.
As we noted in the article, the Map phase of a MapReduce computation is essentially a filter and a group-by operation in SQL, while the Reduce phase is largely a target-list computation in SQL. When user-defined functions are included in SQL (as they are in many commercial implementations), the functionality provided by parallel SQL DBMSs and MapReduce implementations appears to be the same.
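The correspondence described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `visits` table, its `site` and `revenue` columns, the sample rows, and every function name are assumptions invented for the example, and an in-memory SQLite database stands in for a parallel DBMS only to show the shape of the query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (site, revenue) pairs.
ROWS = [("a.com", 10), ("b.com", 3), ("a.com", 7), ("c.com", 1), ("b.com", 9)]


def map_phase(records, min_revenue):
    """Map: filter records and emit (group key, value) pairs."""
    for site, revenue in records:
        if revenue >= min_revenue:      # plays the role of the WHERE clause
            yield site, revenue         # the emitted key plays the role of GROUP BY


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: compute one output value per key (the SQL target list)."""
    return {key: sum(values) for key, values in groups.items()}


def mapreduce_totals(records, min_revenue):
    return reduce_phase(shuffle(map_phase(records, min_revenue)))


def sql_totals(records, min_revenue):
    """The same computation expressed as a single SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE visits (site TEXT, revenue INTEGER)")
        conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
        cursor = conn.execute(
            "SELECT site, SUM(revenue) FROM visits "
            "WHERE revenue >= ? GROUP BY site",
            (min_revenue,),
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(mapreduce_totals(ROWS, 3))
    print(sql_totals(ROWS, 3))
```

Both functions compute per-site revenue totals over the rows that pass the filter. The sketch says nothing about performance, fault tolerance, or how either kind of system parallelizes the work, which is where the debate in this thread actually lies.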
The parallel DBMS literature, dating from the 1980s, includes hundreds of articles on implementation tactics. Our comment about "standing on the shoulders..." was meant to suggest that any new implementation effort should carefully review the prior literature to learn what past results are available, then add to the store of total knowledge.
The MapReduce team seemed not to have done this exercise. Hence the comment.
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin