Skip to content

Commit

Permalink
Update batch-processing.md (MicrosoftDocs#438)
Browse files Browse the repository at this point in the history
Fixed broken link.
  • Loading branch information
Ed Price - MSFT authored and Mike Wasson committed Mar 1, 2018
1 parent 9a724bb commit c96e017
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/data-guide/scenarios/batch-processing.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ For example, the logs from a web server might be copied to a folder and then pro

## When to use this solution

Batch processing is used in a variety of scenarios, from simple data transformations to a more complete ETL (extract-transform-load) pipeline. In a big data context, batch processing may operate over very large data sets, where the computation takes significant time. (For example, see [Lambda architecture](../concepts/big-data.md##lambda-architecture).) Batch processing typically leads to further interactive exploration, provides the modeling-ready data for machine learning, or writes the data to a data store that is optimized for analytics and visualization.
Batch processing is used in a variety of scenarios, from simple data transformations to a more complete ETL (extract-transform-load) pipeline. In a big data context, batch processing may operate over very large data sets, where the computation takes significant time. (For example, see [Lambda architecture](../concepts/big-data.md#lambda-architecture).) Batch processing typically leads to further interactive exploration, provides the modeling-ready data for machine learning, or writes the data to a data store that is optimized for analytics and visualization.

One example of batch processing is transforming a large set of flat, semi-structured CSV or JSON files into a schematized and structured format that is ready for further querying. Typically the data is converted from the raw formats used for ingestion (such as CSV) into binary formats that are more performant for querying because they store data in a columnar format, and often provide indexes and inline statistics about the data.

Expand Down Expand Up @@ -81,4 +81,4 @@ For more information, see [Analytics and reporting](../technology-choices/analys
- **Azure Data Factory**. Azure Data Factory pipelines can be used to define a sequence of activities, scheduled for recurring temporal windows. These activities can initiate data copy operations as well as Hive, Pig, MapReduce, or Spark jobs in on-demand HDInsight clusters; U-SQL jobs in Azure Date Lake Analytics; and stored procedures in Azure SQL Data Warehouse or Azure SQL Database.
- **Oozie** and **Sqoop**. Oozie is a job automation engine for the Apache Hadoop ecosystem and can be used to initiate data copy operations as well as Hive, Pig, and MapReduce jobs to process data and Sqoop jobs to copy data between HDFS and SQL databases.

For more information, see [Pipeline orchestration](../technology-choices/pipeline-orchestration-data-movement.md)
For more information, see [Pipeline orchestration](../technology-choices/pipeline-orchestration-data-movement.md)

0 comments on commit c96e017

Please sign in to comment.