The Complete Customer Data Stack: Data Collection (Part 2) | RudderStack

By neub9

Relational Data and Beyond

In the first part, we discussed the importance of considering both data and infrastructure when building a data stack. We also emphasized the crucial role that data categories play in this holistic approach and looked in depth at event data, one of the two major sources of data.

In this post, we focus on the other major source: relational data. We’ll give an overview of how to collect relational data from both cloud applications and databases, and we’ll also touch on two additional, but still important, sources of data.

Relational Data

Relational data is one of the most common categories of data, and the need to work with it is what gave rise to ELT. Historically, it originates from relational database management systems (RDBMSs), but with the rise of the cloud, new sources that expose similarly structured data, particularly cloud applications, have also become prevalent. Let’s examine some characteristics of relational data and how it differs from streaming data:

  • It changes frequently.
  • It usually doesn’t come in large volumes, although it can.
  • It has high dimensionality.

Relational data, whether it originates from cloud apps or databases, changes frequently. It can also be deleted, which is a significant difference from streaming data and a crucial consideration when ensuring data consistency. For instance, in an eCommerce setting, the contents of a customer’s basket may change multiple times during a purchase, so the data infrastructure needs to track each of these changes.
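
To make this concrete, here is a minimal sketch of what change tracking might look like on the source side, assuming a hypothetical BasketItem model with updated_at and deleted_at columns. Real schemas will differ, but the idea is that updates and deletions leave a trace the data infrastructure can pick up later.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class BasketItem:
    """A mutable basket line item with change-tracking fields.

    `updated_at` lets a sync job find modified rows; a non-null
    `deleted_at` marks a soft delete instead of removing the row outright.
    """
    basket_id: str
    sku: str
    quantity: int
    updated_at: datetime
    deleted_at: Optional[datetime] = None

def remove_item(item: BasketItem) -> BasketItem:
    # Soft-delete: keep the row so downstream pipelines can see the deletion.
    now = datetime.now(timezone.utc)
    item.quantity = 0
    item.updated_at = now
    item.deleted_at = now
    return item
```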

Most of the time, relational data does not come in the same volume as clickstream data. This is especially true for cloud applications, where volumes are relatively small; RDBMSs can hold larger volumes, though typically still not as large as clickstream data. Relational data also has high dimensionality: CRMs and other cloud apps expose many tables, which adds complexity to both the data and the infrastructure required to handle it.

Collecting Data from Cloud Applications

Cloud applications are now an integral part of many companies’ operations, encompassing a wide spectrum of tools such as CRMs, marketing platforms, and customer success tools. Unlike an RDBMS, however, the backend infrastructure of these applications cannot be accessed directly to extract data. Instead, data from cloud applications is typically exposed through a REST API or bulk exports.
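
As a rough illustration, the sketch below pulls every record from a paginated REST endpoint. The /contacts path, the results and next_cursor fields, and the cursor-based pagination are all assumptions; each vendor’s API shapes these differently.

```python
import requests

def fetch_all_contacts(base_url: str, api_key: str) -> list[dict]:
    """Pull every record from a paginated REST endpoint, one page at a time."""
    url = f"{base_url}/contacts"                      # hypothetical endpoint
    params = {"limit": 100}
    headers = {"Authorization": f"Bearer {api_key}"}
    records: list[dict] = []
    while True:
        resp = requests.get(url, params=params, headers=headers, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload.get("results", []))
        cursor = payload.get("next_cursor")           # assumed cursor-style pagination
        if not cursor:
            break
        params["cursor"] = cursor
    return records
```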

Interacting with these APIs usually means pulling data in small batches and dealing with latency when extracting large amounts of data. Combined with the strict rate limits vendors impose, this makes accessing cloud app data a slow process. Network errors are common and API error documentation is often lacking, so a robust error-handling system is necessary. Finally, handling deletions from cloud apps can be challenging, especially when records are hard deleted.
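
A retry helper along these lines is usually the first building block of that error handling. This is a sketch that assumes the vendor signals rate limiting with HTTP 429 and an optional Retry-After header expressed in seconds; the status codes and backoff policy should be adapted to the API at hand.

```python
import time
import requests

def get_with_retries(url: str, headers: dict, max_attempts: int = 5) -> requests.Response:
    """GET with exponential backoff, honoring rate limits and retrying network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=30)
        except requests.RequestException:
            # Transient network error: back off and retry.
            time.sleep(2 ** attempt)
            continue
        if resp.status_code == 429:
            # Respect the vendor's rate limit; Retry-After assumed to be seconds.
            wait = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        if resp.status_code >= 500:
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
```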

Systems for collecting data from cloud applications are typically built around a scheduler that orchestrates collection jobs and connector components that extract data from each source. Maintaining global state to implement delta syncs, pulling only new and updated records, is crucial for collecting cloud app data efficiently.
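
The sketch below shows the delta-sync idea in its simplest form: a watermark per source, persisted between runs, used to ask the connector only for records updated since the last sync. The JSON file used as a state store and the fetch_updated_since / load_to_warehouse callables are placeholders; a production system would keep this state in a database and wire in real connectors.

```python
import json
from pathlib import Path

STATE_FILE = Path("sync_state.json")   # assumed: a simple local state store

def load_cursor(source: str) -> str:
    """Return the last-synced `updated_at` watermark for a source, if any."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get(source, "1970-01-01T00:00:00Z")
    return "1970-01-01T00:00:00Z"

def save_cursor(source: str, cursor: str) -> None:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[source] = cursor
    STATE_FILE.write_text(json.dumps(state))

def delta_sync(source: str, fetch_updated_since, load_to_warehouse) -> None:
    """Pull only records updated since the stored watermark, then advance it."""
    since = load_cursor(source)
    records = fetch_updated_since(since)       # connector-specific extraction
    if records:
        load_to_warehouse(records)
        # Watermarks assumed to be ISO-8601 strings, so max() compares correctly.
        save_cursor(source, max(r["updated_at"] for r in records))
```

Note that deletions never show up as “new or updated” records, which is why delta syncs are usually paired with soft deletes or periodic full refreshes.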

Collecting Data from Databases

Relational databases sit at the heart of most software systems and house valuable, often unique, data. Collecting data from them allows for tighter integration and more control, but executing extraction queries directly against a production database can put its performance and reliability at risk. Common mitigations include running collection jobs during periods of low database load and reading from a separate replica instead of the primary.
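
As an example of the replica approach, the sketch below streams rows from a PostgreSQL read replica with a server-side cursor, keeping memory use and load low. The connection string, the orders table, and the watermark value are assumptions; the point is to batch reads and point them at the replica rather than the primary.

```python
import psycopg2

def extract_orders(replica_dsn: str, since: str, batch_size: int = 10_000):
    """Stream rows from a read replica using a server-side (named) cursor."""
    conn = psycopg2.connect(replica_dsn)            # DSN points at the replica, not the primary
    try:
        with conn.cursor(name="orders_export") as cur:   # named cursor = server-side
            cur.itersize = batch_size                    # fetch rows in batches
            cur.execute(
                "SELECT * FROM orders WHERE updated_at >= %s",
                (since,),
            )
            for row in cur:
                yield row
    finally:
        conn.close()
```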

Deletions remain problematic when collecting data from databases, although soft deletes can be a viable workaround. Change Data Capture (CDC), which reads the database’s replication log and replays the changes onto the data infrastructure, is becoming increasingly popular because it puts minimal load on the database and handles deletions effectively.
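
A CDC pipeline ultimately reduces to applying a stream of change events to the warehouse copy of each table. The sketch below assumes Debezium-style events (with op, before, and after fields already unwrapped from the envelope) and a hypothetical warehouse loader with upsert and delete helpers.

```python
def apply_change_event(event: dict, warehouse) -> None:
    """Apply one Debezium-style change event to the warehouse copy of a table.

    `event["op"]` is "c" (create), "u" (update), or "d" (delete);
    `before`/`after` hold the row images. `warehouse` is a hypothetical
    loader object exposing upsert/delete helpers.
    """
    op = event["op"]
    if op in ("c", "u"):
        warehouse.upsert("orders", event["after"])
    elif op == "d":
        # Deletions arrive explicitly in the replication log, so no diffing is needed.
        warehouse.delete("orders", key=event["before"]["id"])
```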

In conclusion, understanding the complexities and unique aspects of relational data and cloud applications is essential for building a robust and efficient data infrastructure. Effective data collection methods and error handling systems are crucial for seamlessly integrating these data sources into the overall data stack.
