Introducing The Streaming Datalake – insideBIGDATA

By neub9

Evolution of Event Streaming

The storage and processing of data is not a new problem. Since the invention of the computer, users have needed to operate on data, save it, and recall it later to operate on it further. Over the past 50 years, an industry worth hundreds of billions of dollars has grown to meet this need, offering everything from general-purpose tools for home users (yes, Excel counts 😉) to huge distributed systems for working on unimaginably large datasets like those of Google, Facebook, and LinkedIn.

Event streaming is a modern take on this problem. Instead of storing data X, recalling it, transforming it to Y and storing it again, event streaming stores the “event” that X was created and the “event” that X was changed to Y. This creates a log of operations that can be read and responded to by the appropriate services in real time. The services that respond to events on these logs can typically be smaller and more self-contained than alternatives, leading to more flexible, resilient, and scalable solutions. 
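The shift from storing state to storing events can be made concrete with a minimal sketch (plain Python, not any particular platform's API): events are appended to an immutable log, and the current state is derived by replaying them in order.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    key: str       # which entity the event is about
    value: object  # the value carried by the event

def replay(log):
    """Derive the current state by applying events in order."""
    state = {}
    for event in log:
        state[event.key] = event.value  # the latest event per key wins
    return state

log = [
    Event("X", "created"),  # the event that X was created
    Event("X", "Y"),        # the event that X was changed to Y
]

print(replay(log))  # {'X': 'Y'}
```

Because the log itself is never overwritten, any number of services can read it independently and react to each change in real time.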

Event streaming platforms were designed with a very simple purpose in mind: to get event X from source A to sink B quickly and resiliently. Like the message queueing systems that preceded them, they also buffer data, so that if sink B is not yet ready for event X, the platform stores it until it is.
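That buffering behaviour can be sketched with a hypothetical in-memory broker (names and structure are illustrative, not a real platform API): the source publishes immediately, and the event is held until the sink is ready to read it.

```python
from collections import deque

class Broker:
    """A toy broker that buffers events until the sink polls for them."""

    def __init__(self):
        self.buffer = deque()

    def publish(self, event):
        self.buffer.append(event)  # source A writes without waiting for B

    def poll(self):
        """Sink B reads whenever it is ready; returns None if empty."""
        return self.buffer.popleft() if self.buffer else None

broker = Broker()
broker.publish("event X")  # sink B is not online yet; the broker holds it
# ... later, when sink B comes up ...
print(broker.poll())  # event X
print(broker.poll())  # None — the buffer is drained
```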

Stream processing is a natural extension of this, where both the source and the sink of a given process are event streams. Using stream processing, events from one or more topics can be consumed, manipulated, and produced to another event stream, ready for the next service.
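The consume–transform–produce loop can be sketched as follows, using plain lists as stand-in "topics" (the topic names and the enrichment step are illustrative assumptions):

```python
def process(input_topic, transform):
    """Consume every event, apply a transform, and emit to a new stream."""
    output_topic = []
    for event in input_topic:
        output_topic.append(transform(event))
    return output_topic

# An example input topic of raw order events.
orders = [{"item": "book", "price": 10}, {"item": "pen", "price": 2}]

# Enrich each order with a tax-inclusive total for a downstream service.
billed = process(orders, lambda e: {**e, "total": round(e["price"] * 1.2, 2)})

print(billed[0]["total"])  # 12.0
```

A real stream processor runs this loop continuously against an unbounded stream; the fixed list here just makes the shape of the pattern visible.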

Today we’re starting to flip the stream processing analogy inside out. The success of stream processing as an architecture means that reading data from an event streaming system and “materializing” it to a store has become commonplace.
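Materialization is just a fold of the event log into a queryable store. A minimal sketch, assuming change events arrive as key/value pairs and a `None` value acts as a tombstone that deletes the key (a common convention, assumed here for illustration):

```python
def materialize(stream, store=None):
    """Apply each change event from the stream to a key-value store."""
    store = {} if store is None else store
    for key, value in stream:
        if value is None:
            store.pop(key, None)  # tombstone: remove the key from the store
        else:
            store[key] = value
    return store

stream = [("user:1", "alice"), ("user:2", "bob"), ("user:1", None)]
print(materialize(stream))  # {'user:2': 'bob'}
```

A real materialized view would target a database table rather than a dict, but the ordering-sensitive apply loop is the same.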

The difference between a database and a datalake is that a database typically stores the processed, current data required to power applications, whereas a datalake stores historical and raw data, generally for the purpose of analytics.

Thankfully, event streaming platforms have gained new features aimed specifically at increasing the volume and variety of data they can store. With these changes, the event streaming platform becomes a viable single source of truth for all of the data in an organization.

The future is bright! Never before have there been…
