I know quite a few scientists who deal with the processing and storage of large amounts of data. All are unhappy with relational DBMSs. There are three common reasons that they cite:
Tables are not a good data model for their application. The most common requirement among earth scientists, oceanographers, high energy physicists, and astronomers is support for large arrays. It has been shown repeatedly that simulating arrays on top of tables is an unnatural act and gives very bad performance. Biologists and chemists seem equally unhappy with tables, although they want something other than arrays.
The operators in relational DBMSs do not meet scientific needs. For example, a scientist with remote sensing data often wants to regrid that data to match the co-ordinate system of some other data set. Regridding array data using SQL operations is nearly impossible. Hence, earth scientists want science-specific operations, such as regrid, as primitive operations.
Required features are not supported in current commercial DBMSs. For example, all scientific data has uncertainty; however, RDBMSs assume data is precise (as is typically true in business data processing). A scientist must deal with uncertainty in application logic, and the result is not pretty. An extra field with the imprecision must be added, and then all operations (such as filtering) must be recoded to take such uncertainty into account. In addition, scientists never want to update data in place, thereby losing the old value. If a data value is bad, they want to add the new value to the DBMS, preserving the old value. In this way, they can see historical states of the data and correction procedures. Such a “no overwrite” strategy is also universal in accounting systems that perform double-entry bookkeeping. However, current RDBMSs update data in place, and scientists must deal with history in painful application logic. Equally troubling in current RDBMSs is the lack of support for data lineage (how was the data derived) and named versions (so a scientist can make local changes to a data set without affecting other users).
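To make the third complaint concrete, here is a minimal sketch of the application logic a scientist ends up writing by hand. It uses Python's built-in sqlite3 module as a stand-in for any RDBMS. The table name `reading`, the columns `version` and `sigma`, and the helper functions are all hypothetical, chosen only for illustration. The sketch assumes uncertainty is a single standard deviation per value, which is far simpler than what real data usually needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reading (
        sensor_id INTEGER NOT NULL,
        version   INTEGER NOT NULL,  -- increases with each correction
        value     REAL    NOT NULL,
        sigma     REAL    NOT NULL,  -- the extra "imprecision" field
        PRIMARY KEY (sensor_id, version)
    )
""")


def record(sensor_id, value, sigma):
    """Append a new version of a reading; never UPDATE in place."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM reading WHERE sensor_id = ?",
        (sensor_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO reading VALUES (?, ?, ?, ?)",
        (sensor_id, latest + 1, value, sigma),
    )


def possibly_above(threshold, k=2.0):
    """Sensors whose latest value could exceed threshold within k sigma."""
    rows = conn.execute(
        """
        SELECT r.sensor_id
        FROM reading AS r
        WHERE r.version = (SELECT MAX(version) FROM reading
                           WHERE sensor_id = r.sensor_id)
          AND r.value + ? * r.sigma > ?
        ORDER BY r.sensor_id
        """,
        (k, threshold),
    ).fetchall()
    return [sensor_id for (sensor_id,) in rows]


def history(sensor_id):
    """All versions of a reading, oldest first, so corrections stay visible."""
    return conn.execute(
        "SELECT version, value, sigma FROM reading "
        "WHERE sensor_id = ? ORDER BY version",
        (sensor_id,),
    ).fetchall()


record(1, 9.5, 0.4)   # below 10, but within two sigma of it
record(2, 12.0, 0.1)  # a bad value ...
record(2, 8.0, 0.1)   # ... corrected by appending, not overwriting

print(possibly_above(10.0))
print(history(2))
```

Every query that touches the data has to repeat the "latest version" subquery and the sigma arithmetic, and nothing in the DBMS stops someone from issuing a plain UPDATE. That is the burden described above, and it falls on the application rather than the database.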
The net result of these problems is that scientists either don’t use commercial DBMSs or use them grudgingly.
Unfortunately, science applications are not currently a billion-dollar market. As such, science requirements have been largely ignored by the major commercial vendors. Moreover, there is no evidence that this state of affairs will change anytime soon. This leaves science users out in the cold, and often they must resort to “rolling their own” on top of the “bare metal.”
Personally, I believe that there is a collection of planet-threatening problems, such as climate change and ozone depletion, that only scientists are in a position to solve. Hence, the sorry state of DBMS support in particular (and system software support in general) for this class of users is very troubling.
Science users, of course, want a commercial-quality DBMS, i.e., one that is reliable, scalable and comes with good documentation and support. They also want something that is open source. There is no hope that such a software system can be built in a research lab or university. Such institutions are good at prototypes, but not production software. Hence, the obvious solution is a nonprofit foundation, along the lines of Apache or Mozilla, whose charter would be to build such a DBMS. It could not be financed by venture capital, because of market size issues. As such, support must come from governments and foundations.
It is high time that the United States got behind such an initiative.
Mike, why do you suppose no one has organized a major open source initiative around this idea? What is stopping the creation of a sci-mysql?
While I agree that RDBMS is not an optimal technology for scientific applications and that an open source initiative may lead to some good innovation, I'd be cautious in separating the data model from the query and management language.
There are proprietary tools (such as kx.com) that have done so successfully. The speed and capacity of such tools are phenomenal (as are the licensing fees one must pay).
I don't understand the problem with using tables for arrays. All I can think of is that a table with a schema is unnecessarily "complicated" for an array.
I agree traditional RA operators are too limited. But it is hard to come up with a unified set of general high-level primitives for science operations, the way the RA operators are for relations.
One approach is to build machine learning algorithms on top of the database. Since these operations are usually computation-intensive, and are used to handle uncertainty, I/O is no longer the only performance metric. However, a DBMS is for data storage and management; analysis functions should be left to the upper level.
This statement might surprise some people, but relational databases do require a relational approach to stored data. I'm all burned out after a recent few months' struggle trying to squeeze GIS data into a relational database, so I do not think I have any religious flair related to the issue. But before we even think of implementing anything in an RDBMS, we need to learn, or at least try to understand, such relational concepts as: relationships between data entities (known in RDBMS circles as analyzing the Universe of Discourse), the normalization process, transitive dependencies (!), Third Normal Form and Fifth Normal Form. Yes, RDBMS usage carries a learning overhead and the curve is steep. I'm not sure if it is possible to design a software tool that would free us from the initial research, which in this case means the intricacies of the relational world, but we may try :-)
Having worked in both commercial and scientific spheres, and as current curator for some 22 terabytes of climate science data, I would no more try to use an RDBMS to store vast amounts of data than try to jump over the moon. However, in commercial spheres, an RDBMS is just a tool; that is, an enormous amount of software development is required to produce customised applications which use the core facilities of an RDBMS to gain access to the required data, but it is still the application, per se, which provides the grunt work.
In relation to scientific data, an RDBMS can be used to hold the metadata, plus pointers to the actual data; where by "pointers", I mean paths to files, or URLs, etc. But you still need to customise the application to suit the project requirements, but that's life!