From 023d88e781f7fe64c62b247d876441ee40921b1b Mon Sep 17 00:00:00 2001
From: Mike Wasson
Date: Tue, 27 Feb 2018 11:10:21 -0800
Subject: [PATCH] Add data lake topic to Data Guide (#435)

* Create Data Lake topic
---
 docs/data-guide/concepts/big-data.md  | 13 ----------
 docs/data-guide/concepts/data-lake.md | 36 +++++++++++++++++++++++++++
 docs/data-guide/toc.yml               |  2 ++
 3 files changed, 38 insertions(+), 13 deletions(-)
 create mode 100644 docs/data-guide/concepts/data-lake.md

diff --git a/docs/data-guide/concepts/big-data.md b/docs/data-guide/concepts/big-data.md
index 2233d8477a8..660aadcb6cb 100644
--- a/docs/data-guide/concepts/big-data.md
+++ b/docs/data-guide/concepts/big-data.md
@@ -52,19 +52,6 @@ Most big data architectures include some or all of the following components:
 
 * **Orchestration**. Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
 
-## Data lake
-
-If you have read anything about big data, it's likely you've seen the term _data lake_. You may have seen the word used for the name of a product, or perhaps a concept about storing large quantities of data.
-
-A data lake consists of both storage and processing. Data lake storage is built with several goals in mind: fault-tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes. Data lake processing involves one or more processing engines built with these goals in mind, and can operate on data stored in a data lake at scale.
-
-Data lake stores are often used in event streaming or IoT scenarios, because they can persist large amounts of relational and nonrelational data without transformation or schema definition. They are built to handle high volumes of small writes at low latency, and are optimized for massive throughput.
-
-Another term commonly used in data scenarios is _data mart_. Typically, a data mart is a store of data that is cleansed, packaged, and structured for easy consumption. Unlike a data mart, a data lake is designed to ingest raw data, leaving it in its original or least-processed form to allow questions to be asked in various ways and at various times. If the data is cleansed and structured in a specific way, like in a data mart, then it is difficult to adapt how the data is processed and analyzed when new questions or tools come about in the future. This is why a data lake is composed of both storage and processing as separate entities.
-
-Relevant Azure service:
-- [Azure Data Lake](https://azure.microsoft.com/scenarios/data-lake/)
-
 ## Lambda architecture
 
 When working with very large data sets, it can take a long time to run the sort of queries that clients need. These queries can't be performed in real time, and often require algorithms such as [MapReduce](https://en.wikipedia.org/wiki/MapReduce) that operate in parallel across the entire data set. The results are then stored separately from the raw data and used for querying.
diff --git a/docs/data-guide/concepts/data-lake.md b/docs/data-guide/concepts/data-lake.md
new file mode 100644
index 00000000000..bbd186c6c94
--- /dev/null
+++ b/docs/data-guide/concepts/data-lake.md
@@ -0,0 +1,36 @@
+# Data lakes
+
+A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured. The idea behind a data lake is to store everything in its original, untransformed state. This approach differs from a traditional [data warehouse](../scenarios/data-warehousing.md), which transforms and processes the data at the time of ingestion.
+
+Advantages of a data lake:
+
+- Data is never thrown away, because it is stored in its raw format. This is especially useful in a big data environment, where you may not know in advance what insights are available from the data.
+- Users can explore the data and create their own queries.
+- Ingestion may be faster than with traditional ETL tools, because the data is not transformed when it is loaded.
+- A data lake is more flexible than a data warehouse, because it can store unstructured and semi-structured data.
+
+A complete data lake solution consists of both storage and processing. Data lake storage is designed for fault-tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes. Data lake processing involves one or more processing engines, built with these goals in mind, that can operate on data stored in the lake at scale.
+
+## When to use a data lake
+
+Typical uses for a data lake include [data exploration](../scenarios/interactive-data-exploration.md), data analytics, and machine learning.
+
+A data lake can also act as the data source for a data warehouse. With this approach, the raw data is ingested into the data lake and then transformed into a structured queryable format. Typically, this transformation uses an [ELT](../scenarios/etl.md#extract-load-and-transform-elt) (extract-load-transform) pipeline, where the data is ingested and transformed in place. Source data that is already relational may skip the data lake and go directly into the data warehouse, using an ETL process.
+
+Data lake stores are often used in event streaming or IoT scenarios, because they can persist large amounts of relational and nonrelational data without transformation or schema definition. They are built to handle high volumes of small writes at low latency, and are optimized for massive throughput.
+
+## Challenges
+
+- Lack of a schema or descriptive metadata can make the data hard to consume or query.
+- Lack of semantic consistency across the data can make analysis challenging, unless users are highly skilled at data analytics.
+- It can be hard to guarantee the quality of the data going into the data lake.
+- Without proper governance, access control and privacy can become problems: what information is going into the data lake, who can access that data, and for what uses?
+- A data lake may not be the best way to integrate data that is already relational.
+- By itself, a data lake does not provide integrated or holistic views across the organization.
+- A data lake may become a dumping ground for data that is never actually analyzed or mined for insights.
+
+## Relevant Azure services
+
+- [Data Lake Store](/azure/data-lake-store/) is a hyper-scale, Hadoop-compatible repository.
+- [Data Lake Analytics](/azure/data-lake-analytics/) is an on-demand analytics job service to simplify big data analytics.
+
diff --git a/docs/data-guide/toc.yml b/docs/data-guide/toc.yml
index d987a6b8524..9b3e0d5a109 100644
--- a/docs/data-guide/toc.yml
+++ b/docs/data-guide/toc.yml
@@ -39,6 +39,8 @@
     href: concepts/non-relational-data.md
   - name: Working with CSV and JSON files
     href: concepts/csv-and-json.md
+  - name: Data lakes
+    href: concepts/data-lake.md
   - name: Big data architectures
     href: concepts/big-data.md
   - name: Advanced analytics
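
The ELT pattern described in the new data-lake.md topic (raw data lands in the lake untransformed, then is transformed in place into a structured, queryable format) can be sketched briefly. The following is a minimal, illustrative PySpark example that is not part of the patch; the adl:// store URI, folder layout, column names, and schema are hypothetical assumptions, and any engine that can read directly from the lake could fill the same role.

```python
# Minimal ELT sketch (illustrative only): raw JSON events already ingested
# into the lake are transformed in place into a structured, queryable copy.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-elt-sketch").getOrCreate()

# Extract + Load already happened: events were written to the lake as-is.
# The adl:// URI and folder layout below are hypothetical.
raw = spark.read.json("adl://mylake.azuredatalakestore.net/raw/device-events/")

# Transform: keep only the fields the downstream warehouse or reports need.
curated = (
    raw.filter(F.col("deviceId").isNotNull())
       .withColumn("eventDate", F.to_date(F.col("eventTime")))
       .select("deviceId", "eventDate", "temperature", "humidity")
)

# Write a structured, columnar copy back to the lake, partitioned for querying.
(curated.write
        .mode("overwrite")
        .partitionBy("eventDate")
        .parquet("adl://mylake.azuredatalakestore.net/curated/device-events/"))
```

Writing the curated copy in a columnar format such as Parquet is one common choice here, because it keeps the transformed data efficiently queryable at scale while leaving the raw events untouched for future reprocessing.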