Users are asking ever more of their data warehouses. The technology is advancing in response, and the art of the possible is shifting with it. As an example, this post looks at Real Time Data Warehousing (RTDW), a category of use cases that is becoming increasingly common among customers building on Cloudera.
Consider a large Asian telecommunications provider that is rolling out 5G. It built an RTDW on Cloudera to ensure a good customer experience and to keep maintenance costs under control. A live view of all aspects of its network lets it identify potentially faulty hardware in real time and avoid impact to customers' call and data service. It also gives the provider a chance to remediate issues earlier, when they are cheaper to resolve.
The capabilities that more and more customers are asking for are:
The factors driving this trend are part technical, part business, and part cultural. On the technical side, it is cheaper and easier than ever to instrument everything and send that data in real time through a messaging system. On the business side, companies and governments are digitizing and automating as many of their operations as possible so that decision making and asset management can be more effective. On the cultural side, people expect to have the answers they need at their fingertips, immediately, and without having to ask someone (thanks, Google and Wikipedia).
Another example highlights this trend. An AdTech company in the US provides processing, payment, and analytics services for digital advertisers. Because data processing and analytics drive its entire business, it needed a data warehouse that could keep up with the scale of modern big data systems while providing the semantics and query performance of a traditional relational database. It chose to build its RTDW on Cloudera.
The simplest way to describe an RTDW is that it looks and feels like a normal data warehouse, but everything is faster, even at massive scale. It is a type of data warehouse modernization that lets you have “small data” semantics and performance at “big data” scale.
While this may sound obvious, and to some perhaps even trivial, decades of data warehousing have shown otherwise. It is very hard to maintain interactive performance over large amounts of data that arrive very fast, some of which might need updates, while serving a large number of queries with varying patterns. Cloudera offers RTDW capabilities that tick all these boxes. As such, many customers are building RTDW applications as part of their overall strategy of using Cloudera to modernize their data warehouse practice.
The following table provides more specifics on the use case characteristics that make up an RTDW.
Ingest | |
Query | |
Data Model | |
Figure 1 below shows a standard architecture for a Real-Time Data Warehouse. It has the key elements of fast ingest, fast storage, and immediate querying for BI purposes.
As seen in Figure 2, this architecture is easy to extend to cover additional capabilities that might be required. These include stream processing/analytics, batch processing, tiered storage (e.g. for active archive or for joining live data with historical data), and machine learning.
In addition to understanding the attributes of an RTDW, it is useful to look at the types of applications that can be built within the RTDW category.
These are end-to-end, high-volume applications used for general purpose data processing, Business Intelligence, operational reporting, dashboarding, and ad hoc exploration. The important caveat is that ingest speed, semantic richness for developers, data freshness, and query latency are paramount. They can be scoped to a single purpose (e.g. a data mart) or built more comprehensively as an Enterprise Data Warehouse.
In addition to the characteristics described above in What is Real Time Data Warehousing?, a General Purpose RTDW has the following attributes:
Sometimes you know that there is always a time element to the data events and to the analysis, and you know in advance the types of queries your users will run. You can use this knowledge to build an RTDW that is specialized for Time Series and Event Analytics. Doing so can bring huge gains in ingest speed, query latency, and scalability. The tradeoff is a loss of generality in the supported query patterns. That is acceptable, because you selected the specialized approach precisely because it is ideal for your specific use case and you don’t need anything more general.
In addition to the RTDW characteristics described above in What is Real Time Data Warehousing?, Time Series and Event Analytics RTDWs have the following attributes:
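To make the tradeoff concrete, here is a minimal Python sketch of pre-aggregating events into fixed time buckets at ingest. This is an illustration under assumptions, not a description of how any Cloudera component is implemented: the event shape `(epoch_seconds, device_id, value)`, the function name `rollup`, and the 60-second bucket are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (epoch_seconds, device_id, value).
Event = Tuple[int, str, float]


def rollup(
    events: Iterable[Event], bucket_seconds: int = 60
) -> Dict[Tuple[int, str], Tuple[int, float]]:
    """Aggregate events into (bucket_start, device_id) -> (count, sum)."""
    buckets = defaultdict(lambda: [0, 0.0])
    for ts, device, value in events:
        # Truncate the timestamp down to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        agg = buckets[(bucket_start, device)]
        agg[0] += 1       # number of events in this bucket
        agg[1] += value   # running sum of values in this bucket
    return {key: (count, total) for key, (count, total) in buckets.items()}


if __name__ == "__main__":
    sample = [(0, "a", 1.0), (30, "a", 2.0), (61, "a", 4.0), (59, "b", 8.0)]
    for key, agg in sorted(rollup(sample).items()):
        print(key, agg)
```

Queries that group by time and device now read a handful of bucket rows instead of every raw event, which is where the speed comes from. The cost is the loss of generality described above: any question that needs individual events, or a grain finer than the bucket, can no longer be answered from the rolled-up data.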
In some cases you need to act on the data within the stream as it flows into the warehouse. This could be for stream processing (e.g. cleansing, feature engineering, CDC reconciliation) or for stream analytics (e.g. alerting when a threshold is exceeded over a rolling window of statistics on the data, or scoring event data against a predictive model to decide which action to take next). Operating on data in the stream lets you make better decisions in “machine-time”, which complements the ability to make better decisions in “human-time” once the data lands in the warehouse.
Cloudera offers a platform, Cloudera Data Platform (CDP), for building end-to-end data applications in both the public and private cloud. CDP contains a rich array of services to move, store, process, and query your data. Some of these are offered within the Cloudera Data Warehouse (CDW) service, and others are offered within the different templated cluster types in the Cloudera DataHub service. The table below summarizes the building blocks used to create an RTDW application within CDP.
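The rolling-window alert mentioned above can be sketched in a few lines of plain Python. This is a simplified, single-process illustration of the idea, not the API of any stream processing engine: the `Reading` type, the function name `rolling_mean_alerts`, and the choice of the mean as the statistic are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator, Tuple


@dataclass
class Reading:
    timestamp: float  # seconds; assumed to arrive in non-decreasing order
    value: float


def rolling_mean_alerts(
    readings: Iterable[Reading], window_seconds: float, threshold: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, mean) whenever the rolling mean exceeds threshold."""
    window: Deque[Reading] = deque()
    total = 0.0
    for reading in readings:
        window.append(reading)
        total += reading.value
        # Evict readings that have fallen out of the time window.
        cutoff = reading.timestamp - window_seconds
        while window[0].timestamp <= cutoff:
            total -= window.popleft().value
        mean = total / len(window)
        if mean > threshold:
            yield reading.timestamp, mean


if __name__ == "__main__":
    values = [1, 1, 10, 10, 1, 1]
    stream = [Reading(float(t), v) for t, v in enumerate(values)]
    for ts, mean in rolling_mean_alerts(stream, window_seconds=2.0, threshold=5.0):
        print(f"alert at t={ts}: rolling mean {mean}")
```

The decision is made per event as it arrives, without waiting for the data to land in the warehouse, which is the “machine-time” path. A production system would also need to handle out-of-order events and state recovery, which this sketch ignores.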
Use… | Containing… | Used For… |
CDW | Hive | |
CDW | Druid | |
CDW | Hue | |
CDW | DataViz (Tech Preview) | |
Data Hub - Real Time Data Mart Template | Kudu | |
Data Hub - Real Time Data Mart Template | Impala | |
Data Hub - Real Time Data Mart Template | Spark Streaming | |
Data Hub - Real Time Data Mart Template | Hue | |
Data Hub - Streams Messaging Template | Kafka | |
Data Hub - Flow Management Template | NiFi | |
Data Hub - Streaming Analytics Template | Flink | |
The following two diagrams show how these components work in conjunction for the different flavors of RTDW applications, as described above.
Please keep an eye out for the next posts in this series, where we will discuss these two flavors of RTDW in more depth. These will provide more details on how the technologies work together and how you can build your own RTDW applications.
In the meantime, if you want to learn more, please check out this video, which shows how to build an end-to-end Event Analytics application in CDP, using Apache Kafka, Apache Druid, Apache Hive, and Cloudera DataViz. In addition, we have a webinar and blog explaining how you can use Apache Kudu and Apache Impala to create a time series application within CDP. And finally, if you want to learn more about using CDP to do analytics, processing, and routing of data within the stream, please watch this video, which highlights Apache NiFi and Apache Kafka.