Database vs Computation Framework (ever asked the question: Druid vs Spark SQL?)

HBase, Cassandra, Druid, Hive, Spark SQL… are all these systems similar to each other? What are the conceptual differences between their designs?

Ever since the introduction of Apache Hive a long time back, the boundary between database systems and computation systems has been blurring. Nowadays, with systems like Spark SQL, you can achieve fairly high-speed data retrieval with very complex queries on a large dataset. So we may ask: is Spark SQL a database? Can it replace my HBase cluster?

Before answering this question, let's first review the problem a database system is trying to solve. A database is a system that helps organize data, and the purpose of organizing data is to make it easy to manage, whether for processing or for retrieval. Thus, in the past, having a database (in the traditional sense) was almost a must-have for companies that wanted to deal with large amounts of data.

A database works by organizing the data while providing an interface for users to insert and query it. Organizing the data is what makes query optimization possible. The introduction of MapReduce (or, to be more precise, its open source implementation, Hadoop) opened up a whole new way of accessing data. Leading the way was Hive, which gave us access to not-so-organized data with huge flexibility (at a definite sacrifice in query latency). Supporting a SQL-like query interface on top of a relatively arbitrary dataset requires good Scalability, because accessing unorganized, unoptimized data is inherently inefficient (or resource-intensive). MapReduce answered this challenge by providing a solid framework for scaling data processing systems, and thus gave birth to query engines like Hive.
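To make the schema-on-read idea concrete, here is a minimal sketch using Spark SQL (which inherits this model from Hive). The file path and column names are hypothetical placeholders, and a local Spark session is assumed:

```scala
import org.apache.spark.sql.SparkSession

// Schema-on-read sketch: the path and columns (userId, eventType)
// are hypothetical placeholders, not from any real deployment.
object SchemaOnRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-on-read")
      .master("local[*]")
      .getOrCreate()

    // The engine infers a schema at read time; nothing about the raw
    // JSON files was organized for querying ahead of time.
    spark.read.json("/data/raw/events.json")
      .createOrReplaceTempView("events")

    // A SQL-like query over unindexed, unoptimized data: every row is
    // scanned, so Scalability (not indexing) is what keeps this workable.
    spark.sql(
      """SELECT eventType, COUNT(*) AS cnt
        |FROM events
        |GROUP BY eventType""".stripMargin
    ).show()

    spark.stop()
  }
}
```

Note that the query itself looks like ordinary SQL; all the cost has moved from write time (organizing the data) to read time (scanning it).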

Technology evolves fast, and the gaps between different classes of systems are being filled by new systems very quickly. Beyond Hive, we now have systems like Spark SQL (and Spark itself) that further blur the difference between a computation system and a database. Essentially, we now have many frameworks that can make a computation system interactive, and thus overlap with some of the use cases that database systems were originally designed to serve.

This may sound even more confusing. Let me step up a level of abstraction and talk about the design differences between a computation framework and a database system.

The design of a computation framework focuses on providing Flexibility in accessing and processing the data. By being 'not so aware' of how the data is organized, a computation framework achieves better support for customization and complex data transformations. On top of that, adding a query engine (and potentially a columnar storage format) gives the computation framework a certain level of interactivity.
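As a quick illustration of that Flexibility, here is a sketch in which arbitrary user code (a UDF) is dropped straight into a query, something a schema-first database makes much harder. The data and parsing logic are made up for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Flexibility sketch: arbitrary user code applied inside a query.
// The sample rows and parsing logic are invented for illustration.
object FlexibleTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flexible-transform")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val logs = Seq("GET /home 200", "POST /login 500").toDF("line")

    // Any custom parsing or scoring logic can be dropped into the
    // pipeline; the framework does not need to understand the data's
    // structure up front.
    val statusOf = udf((line: String) => line.split(" ").last.toInt)

    logs.withColumn("status", statusOf($"line"))
      .filter($"status" >= 500)
      .show()

    spark.stop()
  }
}
```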

The design of a database system, on the other hand, focuses on providing great Interactivity for data retrieval. By being 'very aware' of the data it stores, the system can optimize the overall efficiency of retrieval. The most common technique for achieving this is Indexing. Recent systems combine indexing with columnar storage to further improve interactivity.
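Here is a toy sketch of why 'being aware' of the data pays off: a point lookup through an index touches only the matching bucket, while an index-less engine must scan every row. The rows and field names are invented for illustration:

```scala
// Index sketch: plain Scala collections standing in for storage.
object IndexSketch {
  case class Row(userId: String, amount: Long)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row("u1", 10), Row("u2", 25), Row("u1", 7), Row("u3", 4))

    // Full scan: what an index-less engine must do for every query.
    val scanned = rows.filter(_.userId == "u1")

    // Index built once at write time (the "organizing" step a database
    // does): maps each key directly to its rows.
    val index: Map[String, Seq[Row]] = rows.groupBy(_.userId)

    // Indexed lookup: one map access reaches the matching rows.
    val looked = index.getOrElse("u1", Seq.empty)

    assert(scanned == looked)
    println(looked)
  }
}
```

The trade-off is visible even in this toy: the index costs extra work and space at write time, which is exactly the organization step a computation framework skips.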

Thus, Spark SQL is not a database system in the strict sense, due to this difference in design focus. In practice, though, a highly optimized implementation of a computation framework can even exceed the interactivity of a less optimized database system for certain use cases. Either way, it is always a good idea to consciously check how Indexing is done and how data partitioning is handled when you evaluate a solution for your data retrieval needs.

Now, for the title question: Druid vs Spark SQL. Druid is considered a database system, while Spark SQL is considered a query engine on top of a flexible computation framework. Druid uses bitmap indexes, dictionary encoding, real-time nodes, and other mechanisms to provide real-time querying support for certain use cases. This means that if your query space is well defined and your query latency requirements are strict (e.g., powering a dashboard), Druid can be the better solution for you compared to Spark SQL.
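To give a flavor of how dictionary encoding and bitmap indexes work together, here is a toy sketch of the general idea (not Druid's actual implementation):

```scala
import scala.collection.immutable.BitSet

// Toy sketch of dictionary encoding plus bitmap indexes, the general
// idea behind Druid-style filtering (not Druid's real implementation).
object BitmapSketch {
  def main(args: Array[String]): Unit = {
    val country = Seq("US", "DE", "US", "FR", "DE", "US")

    // Dictionary encoding: store each distinct value once, keep small ids.
    val dict: Map[String, Int] = country.distinct.zipWithIndex.toMap
    val encoded: Seq[Int] = country.map(dict)

    // Bitmap index: for each dictionary id, a bitset of the row
    // positions holding that value.
    val bitmaps: Map[Int, BitSet] =
      encoded.zipWithIndex
        .groupBy(_._1)
        .map { case (id, hits) => id -> BitSet(hits.map(_._2): _*) }

    // Filtering "country = 'US' OR country = 'DE'" becomes a cheap
    // bitmap union; only the matching rows ever need to be read.
    val matches = bitmaps(dict("US")) | bitmaps(dict("DE"))
    println(matches) // row positions 0, 1, 2, 4, 5
  }
}
```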

All these optimizations also make Druid less flexible than Spark SQL. In Spark SQL, you can essentially run any query on any dataset (as Spark does not really care what the data is). To further optimize performance, you can write the data in a columnar storage format like Parquet, which gives you benefits such as compression and selective column reads. However, due to the lack of indexing, the amount of data that has to be deserialized and processed for a particular query will still, on average, be much larger than with Druid, which in turn hurts overall query performance. Keep in mind, this is not to say that row indexing is impossible in Spark SQL; it just requires much more work, as it is not supported by default.
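Here is a minimal sketch of that Parquet approach, including partitioning by a column, which is the closest thing to index-style row skipping that Spark supports out of the box. The paths and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Parquet sketch: columnar storage gives compression and column
// pruning; partitioning by a column acts as a crude, hand-rolled
// substitute for an index. Paths and columns are hypothetical.
object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      ("2024-01-01", "click", 3L),
      ("2024-01-02", "view", 9L)
    ).toDF("day", "eventType", "count")

    // Partitioning by `day` means a filter on `day` can skip whole
    // directories (partition pruning).
    events.write
      .mode("overwrite")
      .partitionBy("day")
      .parquet("/tmp/events_parquet")

    // Only the `count` column of the matching partition is read:
    // column pruning plus partition pruning, but still no per-row index.
    spark.read.parquet("/tmp/events_parquet")
      .filter($"day" === "2024-01-01")
      .select("count")
      .show()

    spark.stop()
  }
}
```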

Hopefully this summary and high-level discussion will be helpful to those who are debating between Spark SQL, Druid, and other similar systems.

 

Haijie Wu
