Next-Generation Mainframers Community


Data Consolidation: Handling Data Storage at Scale for Multiple Use Cases

By Troy Curtsinger posted Aug 02, 2023 12:42 PM

  

Data at scale quickly becomes difficult to store or query efficiently. Hundreds of thousands of events are expected to arrive every minute, which requires a datastore that can handle billions of rows of data.

 

 

For our problem, we are trying to support two separate use cases, both of which work with similar data. One needs to see the historical record of values for a given column on a consistent basis, while the other needs to fire off multiple queries in succession, each using multiple where clauses, group bys, and generally “slicing and dicing” the data.
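
To make the two query shapes concrete, here is a sketch of what each might look like; the table and column names are hypothetical and only meant to illustrate the patterns.

```python
# Hypothetical examples of the two query shapes (table and column names are made up).

# Use case 1: read the historical record of values for a given column, over and
# over, for the same entity -- a classic time-series read.
HISTORY_QUERY = """
    SELECT event_ts, cpu_usage
    FROM events
    WHERE host_id = 'lpar01'
      AND event_ts BETWEEN 1690000000000 AND 1690086400000
    ORDER BY event_ts
"""

# Use case 2: ad hoc "slicing and dicing" -- multiple where clauses and group bys,
# fired in quick succession with different filters each time.
SLICE_AND_DICE_QUERY = """
    SELECT region, job_type, AVG(cpu_usage), MAX(memory_used)
    FROM events
    WHERE event_ts > 1690000000000
      AND status = 'FAILED'
    GROUP BY region, job_type
"""
```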

 

During my internship at Broadcom, my objective is to identify a robust database solution that meets the needs of both use cases.

 

For the first use case, a NoSQL columnar solution is ideal, but it provides no support for the second use case. For the second use case, a traditional row-based database or an OLAP database makes sense, but these do not often provide the speed at scale that the first use case requires. This points us toward either a time-series database or a real-time OLAP database that gives us the response time and scale we need for both use cases.

 

A time-series database stores data in rows keyed by a time or timestamp, alongside other values such as readings or labels, providing a datastore tailored to time-series data. Examples include TimescaleDB, InfluxDB, and Prometheus.
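
As a minimal illustration of the shape of that data (the field names here are assumptions, not tied to any particular product), a single time-series point can be thought of as a timestamp, a reading, and a set of identifying labels:

```python
from dataclasses import dataclass, field

# Illustrative shape of a time-series row: a timestamp, a measured value,
# and labels identifying which series the reading belongs to.
@dataclass
class TimeSeriesPoint:
    timestamp_ms: int                            # when the reading was taken
    value: float                                 # the reading itself
    labels: dict = field(default_factory=dict)   # e.g. {"host": "lpar01", "metric": "cpu_usage"}

point = TimeSeriesPoint(
    timestamp_ms=1690000000000,
    value=87.5,
    labels={"host": "lpar01", "metric": "cpu_usage"},
)
```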

 

OLAP, or OnLine Analytical Processing, databases are optimized to support analytical queries, often “slicing and dicing” the data with multiple where clauses, group bys, and aggregations. These databases are purpose-built for dashboarding and other analytical use cases. Examples include ClickHouse, Snowflake, and Apache Pinot.



Apache Pinot:

Source: Apache Pinot (https://pinot.apache.org/)

 

In this case, I researched Apache Pinot. It has a few features that make it attractive for our problem, including a variety of indexes and options for optimizing queries.

 

Each index is a trade-off, requiring more disk space but improving query performance. Pinot supports a variety of indexes based on data type, such as timestamp, text, JSON, and geospatial indexes, as well as the ability to create inverted or sorted indexes on columns.
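
As a rough sketch of how those indexes are declared, below is the index-related portion of a Pinot table config, written as a Python dict mirroring the JSON Pinot expects; the column names are made up, and the exact field names should be checked against the Pinot docs for your version.

```python
# Sketch of the index-related portion of a Pinot table config (column names are
# hypothetical; field names follow Pinot's table config JSON but should be verified).
table_index_config = {
    "tableIndexConfig": {
        "sortedColumn": ["host_id"],          # sorted index on the column segments are sorted by
        "invertedIndexColumns": ["status"],   # inverted index for fast equality filters
        "rangeIndexColumns": ["event_ts"],    # range index for time/range predicates
        "jsonIndexColumns": ["payload"],      # JSON index for semi-structured columns
    },
    "fieldConfigList": [
        {
            "name": "message",                # free-text column
            "encodingType": "RAW",
            "indexType": "TEXT",              # text index for term/phrase search
        }
    ],
}
```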

 

Apache Pinot also supports a unique index, the star-tree index. This index stores aggregations and row values for certain slices of the data, navigated like a tree. The root node has multiple nodes below it, each covering a range of values, or a single value, for a column. Each node contains aggregations such as the average, maximum, and minimum for each relevant column in the index. Below it are more nodes, split on other columns and providing the same aggregations for their slice of the data. This allows a query with multiple where clauses to navigate the star tree based on the clauses in the query and use the stored aggregations instead of calculating them at query time.
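
To make that concrete, here is a sketch of an explicit star-tree definition, again as a Python dict mirroring Pinot's table config JSON; the dimensions, metrics, and threshold are illustrative, and the test described below actually relies on the default star tree rather than a hand-tuned one.

```python
# Sketch of an explicit star-tree index definition (columns and thresholds are
# hypothetical; verify the exact field names against the Pinot docs).
star_tree_config = {
    "tableIndexConfig": {
        "starTreeIndexConfigs": [
            {
                # Order in which the tree splits the data, from the root down.
                "dimensionsSplitOrder": ["region", "job_type", "status"],
                "skipStarNodeCreationForDimensions": [],
                # Pre-computed aggregations stored at each node, as FUNCTION__column pairs.
                "functionColumnPairs": [
                    "AVG__cpu_usage",
                    "MAX__cpu_usage",
                    "MIN__cpu_usage",
                    "SUM__memory_used",
                ],
                # Stop splitting once a node covers this many raw rows or fewer.
                "maxLeafRecords": 10000,
            }
        ],
        # Alternatively, let Pinot derive a default star tree automatically.
        "enableDefaultStarTree": False,
    }
}
```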

 

To see these indexes in action, I set up an environment, loaded mock data, and used Gatling, a load-testing tool, to run some of the queries our solution needs to serve. The Pinot table uses the default star tree, as well as sorted, text, and timestamp indexes on various columns of the data. The test ran against an environment containing roughly 3 billion points of data, with 105 concurrent sessions, each firing one query at a time.
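
For reference, the actual test was written in Gatling's own DSL; the Python sketch below only approximates the load pattern (105 concurrent sessions, each firing one query at a time at the Pinot broker's SQL endpoint) and assumes a local broker on the default port and a made-up query.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Rough approximation of the load pattern used in the test: 105 concurrent
# sessions, each firing one query at a time and recording its latency.
# The broker URL and the query are assumptions for illustration.
BROKER_URL = "http://localhost:8099/query/sql"   # default Pinot broker port
QUERY = """
    SELECT region, job_type, AVG(cpu_usage)
    FROM events
    WHERE event_ts > 1690000000000
    GROUP BY region, job_type
"""
QUERIES_PER_SESSION = 20
SESSIONS = 105

def run_session(session_id: int) -> list:
    """Fire queries one at a time and return their latencies in seconds."""
    latencies = []
    for _ in range(QUERIES_PER_SESSION):
        start = time.perf_counter()
        resp = requests.post(BROKER_URL, json={"sql": QUERY}, timeout=120)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=SESSIONS) as pool:
        per_session = list(pool.map(run_session, range(SESSIONS)))
    latencies = sorted(t for session in per_session for t in session)
    p75 = latencies[int(len(latencies) * 0.75)]
    print(f"min={latencies[0] * 1000:.1f} ms, "
          f"p75={p75 * 1000:.1f} ms, "
          f"max={latencies[-1] * 1000:.1f} ms")
```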

 

The test shows that the database could complete all of the queries, but under load some performed poorly. In the best cases, minimum latencies show that some queries are served in only a couple of milliseconds, while 75th-percentile latencies show that it is not uncommon for queries to take over ten seconds, with some taking up to 55 seconds.

 

Summary

Data consolidation requires solving multiple distinct problems and combining their solutions into one option that supports all of their needs. In this case, I tested Apache Pinot, which has a promising set of features. In my test, it successfully handled most of the queries in just a couple of milliseconds, but others needed multiple seconds to be fulfilled, and under load those times ballooned. A more optimized or scaled system may be able to support these use cases.

 

Hopefully you found this post interesting, and feel free to leave a comment if any of the things mentioned here piqued your curiosity!
