Database vs Computation Framework (ever asked the question: Druid vs Spark SQL?)

HBase, Cassandra, Druid, Hive, Spark SQL… are all of these similar to each other? What are the conceptual differences between their designs?

Ever since Apache Hive was introduced a long time ago, the boundary between database systems and computation systems has become more blurred than ever. Nowadays, with systems like Spark SQL, you can achieve fairly high-speed data retrieval with very complex queries on a large dataset. So we may ask: is Spark SQL a database? Can it replace my HBase cluster?

Before answering this question, let's first review the problem a database system is trying to solve. A database is a system that helps organize data, and the purpose of organizing the data is to make it easy to manage, process and retrieve. Thus, in the past, having a database (in the traditional sense) was almost a must-have for companies that wanted to deal with large amounts of data.

A database works by organizing the data while providing an interface for users to insert or query it; the organization exists to optimize queries. The introduction of MapReduce (more precisely, its open source implementation, Hadoop) opened up a whole new way of accessing data. Leading the way was Hive, which gave us ways to access not-so-organized data with huge flexibility (at a definite sacrifice in query latency). Supporting a SQL-like query interface on top of a relatively arbitrary dataset requires good scalability, because accessing unorganized, unoptimized data is relatively inefficient (or resource intensive). MapReduce answered this challenge by providing a solid framework for scaling data processing systems, and thus gave birth to query engines like Hive.

Technology evolves fast, and the gaps between different classes of systems are being filled by new systems very quickly. Beyond Hive, we now have systems like Spark SQL (and Spark itself) that further blur the difference between a computation system and a database. Essentially, we have many frameworks that can make a computation system interactive, and thus overlap with some use cases that database systems were originally designed to serve.

This may sound even more confusing. Let me jump up a level of abstraction and talk about the design differences between a computation framework and a database system.

The design of a computation framework focuses on flexibility in accessing and processing the data. By being 'not so aware' of how the data is organized, a computation framework achieves better support for customization and complex data transformations. On top of that, adding a query engine (and potentially a columnar storage format) gives the computation framework a certain level of interactivity.
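To make this concrete, here is a minimal sketch of the kind of ad-hoc access a computation framework makes easy (the HDFS path and column names are placeholders I made up for illustration): Spark SQL can infer a schema from raw JSON files and answer an arbitrary query over them, without the data having been organized for that query in advance.

```scala
import org.apache.spark.sql.SparkSession

// A computation framework does not require the data to be organized up front:
// Spark SQL infers a schema from raw JSON files and runs an ad-hoc query on them.
val spark = SparkSession.builder().appName("adhoc-query").getOrCreate()

val events = spark.read.json("hdfs:///data/raw/events/")   // placeholder path
events.createOrReplaceTempView("events")

spark.sql(
  """SELECT country, COUNT(*) AS cnt
    |FROM events
    |WHERE event_type = 'click'
    |GROUP BY country
    |ORDER BY cnt DESC""".stripMargin
).show()
```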

The design of a database system, on the other hand, focuses on interactivity for data retrieval. By being 'very aware' of the data it stores, it can optimize the overall efficiency of retrieval. The most common technique for achieving this is indexing; recent systems combine indexing with columnar storage to improve interactivity even further.

Thus, strictly speaking, Spark SQL is not a database system, because of this difference in design focus. In practice, however, a highly optimized computation framework can even exceed the interactivity of a less optimized database system for certain use cases. It is always a good idea to consciously check how indexing is done and how data partitioning is handled when you evaluate a solution for your data retrieval needs.

Back to the title question: Druid vs Spark SQL. Druid is considered a database system, while Spark SQL is considered a query engine on top of a flexible computation framework. Druid uses bitmap indexes, dictionary encoding, real-time nodes and other mechanisms to provide real-time querying for certain use cases. This means that if your query space is well defined and your latency requirements are tight (for example, powering a dashboard), Druid can be a better solution for you than Spark SQL.
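To give a feel for that use case, below is a rough sketch of a dashboard-style aggregation issued against Druid's SQL HTTP endpoint on the broker (the broker host, the `events` datasource and the columns are all made-up placeholders, not from a real deployment):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// A well-defined, low-latency aggregation of the kind a dashboard would issue,
// sent to the Druid broker's SQL endpoint. Host, datasource and columns are
// placeholders for illustration only.
val query =
  "SELECT country, COUNT(*) AS cnt FROM events " +
  "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR " +
  "GROUP BY country ORDER BY cnt DESC"

val request = HttpRequest.newBuilder()
  .uri(URI.create("http://druid-broker:8082/druid/v2/sql"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(s"""{"query": "$query"}"""))
  .build()

val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.body())   // JSON result rows
```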

Those same optimizations also make Druid less flexible than Spark SQL. In Spark SQL, you can essentially run any query on any dataset (Spark does not really care what the data is). To further improve performance, you can write the data in a columnar storage format like Parquet, which gives you benefits such as compression and selective column reads. However, due to the lack of indexing, the amount of data that needs to be deserialized and processed for a particular query is still, on average, much larger than in Druid, which in turn hurts overall query performance. Keep in mind that this is not to say row-level indexing is impossible in Spark SQL; it just requires much more work, since it is not supported out of the box.
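Here is a minimal sketch of that Parquet route (paths and column names are again placeholders): you get compression and column pruning, but a selective filter still scans every row group that the column statistics cannot rule out, because there is no row-level index.

```scala
import org.apache.spark.sql.SparkSession

// Re-write the raw data as Parquet, then query it selectively.
// Parquet gives compression and selective column reads, but no row-level index.
val spark = SparkSession.builder().appName("parquet-scan").getOrCreate()

spark.read.json("hdfs:///data/raw/events/")                 // placeholder path
  .write.mode("overwrite").parquet("hdfs:///data/parquet/events/")

spark.read.parquet("hdfs:///data/parquet/events/")
  .select("user_id", "event_type")      // only these columns are deserialized
  .filter("event_type = 'click'")       // pushed down to Parquet row-group stats
  .count()
```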

Hopefully this summary and high-level discussion is helpful to those who are debating between Spark SQL, Druid and other similar systems.


Haijie Wu

Apache Phoenix: the critical co-processor metadata cache

Apache Phoenix is a pretty powerful analytic query layer built on top of Apache HBase. In this post, I will go through one of the 'not so obvious' issues you can run into while using it.

Normally, Apache Phoenix is a relatively thin layer, so the majority of day-to-day maintenance issues (performance degradation, query failures and other reliability problems) originate from HBase. However, that is not always the case, and it is a good idea to check the Phoenix client and its co-processor whenever you cannot find the root cause in HBase.

The particular issue I am going to cover here caused such serious performance degradation that it rendered the entire Phoenix cluster essentially unusable. To make the situation even worse, there were no clear failure messages for this problem.

I will start with the general setup of the cluster and the symptoms of the issue. The cluster consists of around 25 nodes with a total capacity of around 100TB. One Phoenix table occupied the majority of that space, and it was both write and read intensive. Things ran fine for a long time after the table was launched, then suddenly went south one day. Below were the three main symptoms:

  1. The write/insertion speed suddenly dropped to around 10% of its original speed
  2. Reads against the largest table became essentially unresponsive (timing out every time)
  3. According to the HBase master page, requests per second dropped to 0 for all region servers except one (this turned out to be key to the problem)

The usual remedies, such as restarting HBase and restarting HDFS, were tried multiple times, but none of them resolved the problem. After many rounds of debugging, we narrowed things down to the fact that all of the smaller tables worked fine; only the largest table was affected. So we stopped writing to and reading from that table and created a new table instead. After this, the Phoenix cluster went back to normal for about three months, until the problem reoccurred.

This time, I decided to dig deeper and see what I could find. After a fair amount of log reading, code reading and, of course, guessing, the root cause was finally identified. Phoenix has something called a metadata cache. All Phoenix table information is stored in the SYSTEM.CATALOG table, which is served by a single region, so that region is critical to the normal operation of the system (it acts somewhat like a master). Every query needs to go through SYSTEM.CATALOG first to obtain the necessary table information (I did not dig deep enough to fully understand the details). Because of the critical role this region plays, Phoenix implements a cache in the co-processor (the client has one too) to help performance. Here is the problem: when you run out of this cache, performance degrades dramatically. One reason is that this region is not splittable, so all of your queries get congested onto one node. That is exactly why we saw only one region server receiving requests during the incident: all requests were blocked behind the region that serves the Phoenix metadata.
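For context, SYSTEM.CATALOG is itself just a Phoenix table, so you can peek at what it holds. A minimal sketch over JDBC (the ZooKeeper quorum is a placeholder, and it assumes the Phoenix client jar is on the classpath):

```scala
import java.sql.DriverManager

// Every Phoenix query resolves table metadata through SYSTEM.CATALOG
// (directly, or via the client and co-processor metadata caches).
// The ZooKeeper quorum below is a placeholder.
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host-1:2181")
val stmt = conn.createStatement()
val rs = stmt.executeQuery(
  "SELECT TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, COLUMN_FAMILY " +
  "FROM SYSTEM.CATALOG LIMIT 20")
while (rs.next()) {
  println(Seq(rs.getString(1), rs.getString(2),
              rs.getString(3), rs.getString(4)).mkString("\t"))
}
rs.close(); stmt.close(); conn.close()
```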

Long story short, below are the two critical configurations that need to be changed to allocate enough memory:

  1. phoenix.coprocessor.maxMetaDataCacheSize (needs to be changed in every HBase region server's hbase-site.xml)
  2. phoenix.client.maxMetaDataCacheSize (this is the client-side cache)

In our case, we raised both parameters to 100MB (from their respective defaults of 20MB and 10MB).
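As a sketch of where these settings live (both values are in bytes): the co-processor cache can only be changed in every region server's hbase-site.xml and picked up via a restart, while the client-side cache can be set in the client's hbase-site.xml or, as far as I understand, passed as JDBC connection properties. Treat the snippet below as an illustration rather than the exact change we shipped.

```scala
import java.sql.DriverManager
import java.util.Properties

// Client-side metadata cache, raised to ~100MB (the value is in bytes).
// phoenix.coprocessor.maxMetaDataCacheSize must instead be set in every
// region server's hbase-site.xml and picked up via a restart.
val props = new Properties()
props.setProperty("phoenix.client.maxMetaDataCacheSize",
  (100L * 1024 * 1024).toString)

// Placeholder ZooKeeper quorum for the target cluster.
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host-1:2181", props)
```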

Haijie Wu