Data lakehouse platform

1/5/2024

Welcome to The Cloud Lakehouse blog, where we’ll re-examine warehouses and lakes as we rethink the future of data platform architecture. There’s been a 15-year battle between data lakes and warehouses. In 2005 I started my career as a database engineer in army intelligence - around the time Apache Hadoop was released to open source - allowing me to experience warehouses and lakes as an end-user and as a manager committed to timelines, budgets, and compliance.

Large enterprises used warehouses for a long time because they’re great for making data valuable. But warehouses can’t keep up with all data processing needs like streaming, machine learning, and exponential growth in volumes. Enter data lakes - though not the perfect alternative. It’s easy to store data cheaply in a lake (object storage), but it’s tough to make it valuable and easy to query. Whether you work on-premises or in the cloud, the coding and expertise required by the complex Hadoop/Spark stack turn the lake into a swamp.

In the last 3 years, data lakehouses entered the mix, a term that database and data lake vendors interpret in several ways. Some say data warehouses will cannibalize data lakes, and some say the exact opposite. These two paradigms have created a zero-sum game, but it doesn’t need to be that way. There’s a third option that combines the lake with multiple houses (from multiple vendors) to get the best of both worlds - the Open Data Lakehouse. Let’s explore all three of these paradigms and the next evolutionary step in data infrastructure.

Paradigm 1: The cloud data warehouse is extended to the lake

Early on, I had to stretch the limitations of the data warehouse. Intelligence work is an internet-scale problem with many different use cases and thousands of analytics users. At the time, I could recite every tuning technique for the Oracle query optimizer. The work wasn’t easy, but it was SQL-based, so analytics access was widespread. If I had offered to replace the data warehouse with the cumbersome, code-intensive Hadoop stack, I would have been laughed out of the room.

Fast forward to today: can the cloud data warehouse take over the entire value chain and “eat” the data lake? The main concern with this paradigm is creating a monolith. Oracle DB locks data behind a proprietary file format, so using an additional engine (for a use case like machine learning) requires replication. Not many would argue with the statement that Oracle DB is a monolith, but are cloud data warehouses any different in that sense? As a database engineer, I would bend over backward to implement my use case in Oracle - even when it wasn’t the right tool for the job - instead of creating a replication process that would lead to “fun” 2 AM conversations with operations.
