January 9, 2017

Drain the Swamp: Understanding Data Lake Architecture

Imagine a clear blue mountain lake with water streaming in from several surrounding peaks. Now imagine an octopus in that lake, and a fisherman standing on the shore, reeling in a trout. How did that trout get up the mountain? Or the octopus for that matter?

Data lakes are kind of like this visual. They pool together data from a lot of different streams, and due to the variety of streams that feed the lake, you may find surprises. These streams can include anything from structured tables pulled from your current data warehouse to unstructured data from your social media streams and everything in between. Proponents of Big Data love data lakes, because they represent an untapped resource for manipulation, analysis, and discovery.

Some experts suggest that data lakes end up much murkier than my example — namely because the streams that feed them are murky. Unstructured data presents many opportunities for manipulation and analysis, but a lack of careful planning can quickly turn your lake into a swamp.

[Image: a data lake becomes the Swamp of Sadness]

How you may feel trying to extract data from your swamp. If you recall the movie, this doesn’t end well.

That fisherman in our analogy? He’s your data scientist, analyst, programmer, or anyone you allow to access the lake. It’s a lonely undertaking: not everyone has the expertise or the patience to go fishing in a data lake, but I bet you’ll be happy to consume whatever he catches.

Enough with the analogy. Let’s get technical.

Data Lakes vs. Data Warehouses

The best outcome of building a data lake is a central repository where all your data from multiple sources is stored in its original form, available for search and analysis. This is where a data lake differs from a data warehouse. James Dixon, CTO of Pentaho, is credited with coining the term “data lake.” In his metaphor, data is water: the warehouse holds packaged bottled water in neat, easily searchable rows and columns, while the lake holds water in its natural state. A warehouse requires processed, identified, and sanitized data on input; a lake can store data in any form, including unstructured and unfiltered data.

Big Data specialists call this “schema on read” vs. “schema on write” structuring. Data warehouses require specialists to process and assign schema to your data when storing, which requires a lot of work, is expensive, and takes up a lot of server space. A data lake lets you store your data cheaply and without manipulation, and you assign schema when you access the data later.
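To make the distinction concrete, here’s a minimal Python sketch (the field names and helper functions are hypothetical, just for illustration). Schema-on-write validates and shapes every record before it’s stored, warehouse-style; schema-on-read stores the raw lines untouched and applies whatever schema today’s question needs at query time.

```python
import json

# Two raw event lines from different sources — note they don't share a shape.
RAW_LINES = [
    '{"user": "ana", "event": "page_view", "ts": "2017-01-09T10:00:00"}',
    '{"user": "bo", "clicked": "buy", "ts": "2017-01-09T10:01:00"}',
]

def write_to_warehouse(line, table):
    """Schema on write: enforce a fixed shape *before* storing.
    Records that don't fit the schema are rejected up front."""
    record = json.loads(line)
    if set(record) != {"user", "event", "ts"}:
        raise ValueError("record does not match warehouse schema")
    table.append(record)

def write_to_lake(line, lake):
    """Schema on read, step 1: store the raw line as-is. No parsing,
    no validation — storage is cheap and nothing is thrown away."""
    lake.append(line)

def read_from_lake(lake, wanted_fields):
    """Schema on read, step 2: interpret the raw data at query time,
    pulling out only the fields the current question cares about."""
    for line in lake:
        record = json.loads(line)
        yield {f: record.get(f) for f in wanted_fields}

lake = []
for line in RAW_LINES:
    write_to_lake(line, lake)

# Both records survive in the lake — even the one the warehouse would reject.
print(list(read_from_lake(lake, ["user", "ts"])))
```

The second record would fail schema-on-write validation and be lost (or require costly transformation), but the lake keeps it and lets a later query decide what to make of it.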

Store All the Things

A data lake’s main purpose is to provide access to all of an organization’s data that might be helpful in the future, even when we don’t anticipate it. This need has grown out of our increasingly digitized workplaces and home lives.

A company that manufactures and sells crock pots (yes, the kind you use to make chili) can use a data lake to store information from every part of its business:

  • Information from the manufacturing floor on production speeds, mistakes, or safety statistics
  • RFID and barcode input from warehouses including storage, shipping, and logistics matters
  • User engagement stats from the company website
  • Social media interactions with customers
  • Email, chat, and phone logs from support
  • Marketing campaign data
  • B2B and B2C sales input from the CRM
  • IoT data from WiFi-connected crock pots on usage

I’m surely missing a lot of other data sources, but you get the picture. Our lives are filled with data, and we don’t know all of the possible future questions that could be answered with that information, so it’s helpful to store it now for when we might need it later. We don’t currently know why there’s an octopus in our analogy lake, but maybe one day we’ll find a way to catch and eat that alongside our trout.

Data Lake Architecture

Building a data lake takes careful planning, because at some point, you’ll need to access all that information you’ve stored. Unsearchable data in your lake means you’ve actually built a swamp, and nobody wants that. At the most basic level, think of these things as you build your lake:

  1. Input: How does your data get into the lake in the first place? Are you going to use streaming methods or batch uploads? How often will you update the data, and how large will your batches be?
  2. Security: Data lakes contain potentially sensitive information, especially when you’re storing customer data, health and medical information, or even search histories. Build your lake with the mindset that your data needs to remain secure; add authorization levels and possibly encryption.
  3. Organization: Although data lakes include “raw” data, it should still be searchable so you can find what you need later on. This will require some basic structuring, such as dated batches.
  4. Access: Who will have access to the raw, unfiltered data, and what systems will they use to manipulate that data into intelligible forms? Some suggest overlaying a powerful search engine to parse the data, while others suggest internal organization systems like nodes to separate the data into accessible files.
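The “dated batches” idea from point 3 can be as simple as writing each raw file under a path that encodes its source and arrival date, so data stays findable without any up-front schema. A hypothetical sketch (the directory layout and function names are my own, not from any particular product):

```python
import os
from datetime import date

def lake_path(root, source, batch_date):
    """Build a partitioned path like lake/source=crm/date=2017-01-09/."""
    return os.path.join(root, f"source={source}",
                        f"date={batch_date.isoformat()}")

def store_raw(root, source, filename, payload, batch_date):
    """Store a raw payload untouched, but in a searchable location.
    The bytes are never parsed or modified — only the path adds structure."""
    directory = lake_path(root, source, batch_date)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:
        f.write(payload)
    return path

# A raw CRM export lands in a predictable, date-partitioned folder,
# e.g. lake/source=crm/date=2017-01-09/contacts.json
p = store_raw("lake", "crm", "contacts.json",
              b'{"name": "..."}', date(2017, 1, 9))
print(p)
```

Later, an analyst who needs “all CRM data from last week” can glob for those partitions instead of scanning the whole lake — minimal structure now in exchange for searchability later.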

Benefits of a Data Lake

Most experts suggest that you build a data lake alongside your existing data storage systems, as each has its benefits. While no data storage method is perfect, warehouses and lakes can work together to suit your needs.

  1. When you process data before storing it, you define its characteristics based on current questions, ultimately limiting your ability to manipulate the stored data. Outlining schema before storage also means that some raw data is lost in processing. Data lakes store all of your raw data.
  2. A data lake stores your data in its original form, giving you almost infinite power to manipulate later without disturbing or changing the raw input.
  3. Because they accept unstructured data, data lakes can hold much more information in cheaper repositories. This makes data storage more democratic across the business, and SMBs can build analytics models with lower overhead costs.

Problems with Data Lakes

Like any technology, data lakes are far from perfect. You’ll find that implementing a data lake alongside your current warehouse will improve your access to data, but it may complicate your analysts’ lives.

  1. Unstructured data requires specialized programming. Even though it comes after the data is stored, you’ll still need to build programs to access, sort, sanitize, and manipulate the data into a usable form.
  2. You still need to plan for potential use cases. Planning for possible future uses helps you clarify the kinds of data you currently have, and whether your current processes will work.
  3. Maintenance: Just because you import your data in raw formats doesn’t mean you should avoid cleaning it up. Make sure your input stays clean so your lake doesn’t turn into a swamp.
  4. Access to the lake is not democratic. At this point, data analysts should be the only ones with access, because only they will understand how to manipulate the data. Eventually we’ll see a time when business users can search the lake themselves for the data they need, but that’s still a hope for the future.
  5. Data hoarding: at some point, you have to wonder what your business is going to do with all of this data, and why you’re holding onto it. Many companies tout the importance of “storing all the things” to answer some future question. But these known unknowns mean we hold on to too much data, even when we don’t need all of it.

* * *

Building a data lake is no simple feat. It takes planning and forethought, and it’s not a set-it-and-forget-it solution. By implementing the data lake architecture, you open your company up to new discoveries and possibly new business models based on information that you used to throw away. That’s where the promise of data lakes lies: in the hope of eventual surprise outcomes.

Looking for software? Try our Product Selection Tool