Historically, Online Transaction Processing (OLTP) was performed by customers submitting traditional transactions (order something, withdraw money, cash a check, etc.) to a relational DBMS. Large enterprises might have dozens to hundreds of these systems. Invariably, enterprises wanted to consolidate the information in these OLTP systems for business analysis, cross selling, or some other purpose. Hence, Extract-Transform-and-Load (ETL) products were used to convert OLTP data to a common format and load it into a data warehouse. Data warehouse activity rarely shared machine resources with OLTP because of lock contention in the DBMS and because business intelligence (BI) queries were so resource-heavy that they got in the way of timely responses to transactions.
This combination of a collection of OLTP systems, connected to ETL, and connected to one or more data warehouses is the gold standard in enterprise computing. I will term it “Old OLTP.” By and large, this activity was supported by the traditional RDBMS vendors. In the past I have affectionately called them “the elephants”; in this posting I refer to them as “Old SQL.”
As noted by most pundits, “the Web changes everything,” and I have noticed a very different collection of OLTP requirements that are emerging for Web properties, which I will term “New OLTP.” These sites seem to be driven by two customer requirements:
The need for far more OLTP throughput. Consider new Web-based applications such as multi-player games, social networking sites, and online gambling networks. The aggregate number of interactions per second is skyrocketing for the successful Web properties in this category. In addition, the explosive growth of smartphones has created a market for applications that use the phone as a geographic sensor and provide location-based services. Again, successful applications are seeing explosive growth in transaction requirements. Hence, the Web and smartphones are driving the volume of interactions with a DBMS through the roof, and New OLTP developers need vastly better DBMS performance and enhanced scalability.
The need for real-time analytics. Intermixed with a tidal wave of updates is the need for a query capability. For example, a Web property wants to know the number of current users playing its game, or a smartphone user wants to know “What is around me?” These are not the typical BI requests to consolidated data, but rather real-time inquiries to current data. Hence, New OLTP requires a real-time query capability.
In my opinion, these two characteristics are shared by quite a number of enterprise non-Web applications. For example, electronic trading firms often trade securities in several locations around the world. The enterprise wants to keep track of the global position (short or long and by how much) for each security. To do so, all trading actions must be recorded, creating a fire hose of updates. Furthermore, there are occasional real-time queries. Some of these are triggered by risk exposure—i.e., alert the CEO if the aggregate risk for or against a particular security exceeds a certain monetary threshold. Others come from humans, e.g., “What is the current position of the firm with respect to security X?”
Hence, we expect New OLTP to be a substantial application area, driven by Web applications as the early adopters but followed by enterprise systems. Let’s look at the deployment options.
1) Traditional OLTP. This architecture is not ideal for New OLTP for two reasons. First, the OLTP workload experienced by New OLTP may exceed the capabilities of Old SQL solutions. In addition, data warehouses are typically stale by tens of minutes to hours. Hence, this technology is incapable of providing real-time analytics.
2) NoSQL. There have been a variety of startups in the past few years that call themselves NoSQL vendors. Most claim extreme scalability and high performance, achieved through relaxing or eliminating transaction support and moving back to a low-level DBMS interface, thereby eliminating SQL.
In my opinion, these vendors have a couple of issues when presented with New OLTP. First, most New OLTP applications want real ACID. Replacing real ACID with either no ACID or “ACID lite” just pushes consistency problems into the applications where they are far harder to solve. Second, the absence of SQL makes queries a lot of work. In summary, NoSQL will translate into “lots of work for the application”—i.e., this will be the full employment act for programmers for the indefinite future.
3) New SQL. Systems are starting to appear that preserve SQL and offer high performance and scalability, while preserving the traditional ACID notion for transactions. To distinguish these solutions from the traditional vendors, we term this class of systems “New SQL.” Such systems should be equally capable of high throughput as the NoSQL solutions, without the need for application-level consistency code. Moreover, they preserve the high-level language query capabilities of SQL. Such systems include Clustrix, NimbusDB, and VoltDB. (Disclosure: I am a founder of VoltDB.)
Hence, New SQL should be considered as an alternative to NoSQL or Old SQL for New OLTP applications. If New OLTP is as big a market as I foresee, I expect we will see many more New SQL engines employing a variety of architectures in the near future.
Disclosure: Michael Stonebraker is associated with four startups that are either producers or consumers of database technology.
RavenDB - a document database of the NoSql genre, has great ETL, OLTP support, and is backed by ACID storage. See http://t.co/fVxvmSV (and it evolved a lot since then).
Was hoping to know the characteristicsmof New SQL, Actuallynnot convinced with your arguments about noSQL databases as you have no argures presented. Problems with old SQL are known, promises of noSQL are also known, what is new SQL?
In the context of transaction processing, I would define a NewSQL DBMS as one having the following 5 characteristics:
1) SQL as the primary mechanism for application interaction
2) ACID support for transactions
3) A non-locking concurrency control mechanism so real-time reads will not conflict with writes, and thereby cause
them to stall.
4) An architecture providing much higher per-node performance than available from the traditional "elephants"
5) A scale-out, shared-nothing architecture, capable of running on a large number of nodes without bottlenecking
About point 3,
Isn't this achieved in most traditional databases by MVCC.
About point 4,
Most databases (Clustrix,Akiban,NimbusDB) which are NewSQL candidates talks only about better query performance using distributed query or kind of object storage. I am not sure if they have anything in better for DML performance.
VoltDB is an exception;
I am not sure if they are much better than TerraData,GreenPlum which are based on OLD RDBMS architecture.
About Point 5,
Yes, this is a new feature; if I understand correctly, it means scaling the performance by adding new nodes without interrupting existing users;
So does any traditional RDBMS implements following 2 points is candidate for NewSQL:
1. Better Query or DML Performance by distributing load.
2. Scale-out, shared nothing architecture.
My previous comment suggested 5 criteria that defined a NewSQL DBMS. I would like to stress three points that I made previously. First, my posting focused on DBMSs for new OLTP applications. Data warehouse vendors, such as Teradata and Greenplum, are focused on a completely different market, and are not designed to perform high velocity transactions. Hence, they are not considered as NewSQL vendors.
Second, most of the OldSQL vendors use standard two phase locking, although there are exceptions. Hence, there are OldSQL engines which satisfy some of my 5 criteria.
Third, one of the big problems with OldSQL engines is their mediocre per-node performance. One of my criteria for NewSQL vendors is much better per-node performance. The proof of this is via performance on standard benchmarks. Hence, whether any particular vendor satisfies this criteria would have to be determined experimentally. As such, the list of vendors who satisfy the 5 criteria may well change over time.
Incase any database has scale-out architecture, why it is most necessary that per-node performance also should be very high; anyway if performance is getting better by adding more nodes it will be achieved.
My main focus is to understand why can't some existing database like PostgreSQL be considered an option similar to NewSQL, if we enhance it to support scale-out architecture in it. Saying this doesn't mean I have a clear idea about how to achieve it.
It will have benefit to existing customers as well even though the performance is somewhat less than NewSQL engines. It can save lot of effort to change applications to suit to new engines.
In round numbers,NewSQL engines are a factor of 50 or more faster on New OLTP than OldSQL. That means that and OldSQL engine would require 500 nodes to do a task that can be accomplished by a NewSQL engine with 10 nodes.
The downside of 500 nodes is:
increased hardware cost
increased system administration complexity
increased data base administration complexity
high availability complexity (if a node fails once a month, then NewSQL fails every 3rd day, while OldSQL
fails once an hour.)
less flexibility (if load doubles, have to add 500 more nodes, rather than 10)
Name: Marius from Scimore
2PC doesn't scale, also in-doubt states almost impossible to handle correct, since, at any time the coordinator could die and no way to enlist the in-doubt transactions. For example, instead of 2PC, ScimoreDB from www.scimore.com uses dynamic 2PC, where nested sub-coordinators coordinate transactions in parallel, and, failures of sub-coordinators are tolerated. When combining D2PC with the tree execution, massive execution parallelization can be achieved, handling 100's of nodes with minimum penalty to compare with a master to nodes query distribution.
I am interested, if ScimoreDB can be considered as NewSQL, we still based on sub-set of 2PC, since, we believe, quorum based algorithms impose performance penalties on read transactions.
On July 5th you write, "The proof of this is via performance on standard benchmarks."
Would you point the curious to these results.
All that you say sounds good and theoretical. Some real numbers would make them something more than intellectual entertainment.
I admire your work and what you've achieved as well as admit that I've greatly benefited from the insights you and others (incl. Rick Cattell in ) have shared with the community. However, when I read the blog post at hand I have to ask myself: do we really need to come up with new marketing terms every year (hint: NewSQL) or wouldn't our time be better invested in trying to understand what the fundamental issues are and how to overcome them. For example, the MR vs. parallel DB discussion (is it still alive?) or, maybe, someone can help me finding answers to my question ?
KUTGW and hope to hear from you!