| title | description | author | ms.date |
|---|---|---|---|
| Choosing a batch processing technology | | zoinerTejada | 02/12/2018 |
Big data solutions often use long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files from scalable storage (like HDFS, Azure Data Lake Store, and Azure Storage), processing them, and writing the output to new files in scalable storage.
The key requirement of such batch processing engines is the ability to scale out computations in order to handle a large volume of data. Unlike real-time processing, however, batch processing is expected to have latencies (the time between data ingestion and computing a result) measured in minutes to hours.
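As a concrete illustration of this pattern, the following minimal PySpark sketch (assuming HDInsight with Spark; the storage account paths and column names are hypothetical placeholders) reads raw CSV files from Azure Storage, aggregates them, and writes the results back to scalable storage as new Parquet files.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A batch job in its simplest form: read raw files from scalable storage,
# process them, and write new output files back to scalable storage.
spark = SparkSession.builder.appName("daily-sales-rollup").getOrCreate()

# Hypothetical input location in Azure Storage (an Azure Data Lake Store
# path such as adl://... would work the same way).
raw = spark.read.csv(
    "wasbs://raw@myaccount.blob.core.windows.net/sales/2018/02/*.csv",
    header=True,
    inferSchema=True,
)

# Aggregate: total sales per store per day (placeholder column names).
daily = (
    raw.groupBy("store_id", "sale_date")
       .agg(F.sum("amount").alias("total_sales"))
)

# Write the results as new files for downstream analysis.
daily.write.mode("overwrite").parquet(
    "wasbs://curated@myaccount.blob.core.windows.net/sales-daily/"
)
```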
In Azure, all of the following technologies will meet the core requirements for batch processing:
- Azure Data Lake Analytics
- Azure SQL Data Warehouse
- HDInsight with Spark
- HDInsight with Hive
- HDInsight with Hive LLAP
To narrow the choices, start by answering these questions:
- Do you want a managed service rather than managing your own servers?
- Do you want to author batch processing logic declaratively or imperatively? (A short sketch contrasting the two styles follows this list.)
- Will you perform batch processing in bursts? If yes, consider options that let you pause the cluster or whose pricing model is per batch job.
- Do you need to query relational data stores along with your batch processing, for example to look up reference data? If yes, consider the options that enable querying of external relational stores.
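To make the declarative-versus-imperative distinction concrete, the sketch below uses PySpark (chosen because, per the capability tables that follow, Spark supports a mixture of both styles) to express the same aggregation twice: once as a SQL statement and once as chained DataFrame transformations. The storage path, table name, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("declarative-vs-imperative").getOrCreate()

# Placeholder input: event records already landed in Azure Data Lake Store.
events = spark.read.parquet("adl://myadls.azuredatalakestore.net/events/")
events.createOrReplaceTempView("events")

# Declarative: state the result you want and let the engine plan the work
# (the same style as HiveQL on HDInsight or T-SQL in SQL Data Warehouse).
declarative = spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events
    WHERE event_date >= '2018-01-01'
    GROUP BY event_type
""")

# Imperative-leaning: build the same result by chaining explicit
# transformation steps in code.
imperative = (
    events.filter(F.col("event_date") >= "2018-01-01")
          .groupBy("event_type")
          .agg(F.count("*").alias("event_count"))
)
```

Per the capability table below, Hive and SQL Data Warehouse accept only the first, declarative form, while U-SQL and Spark let you mix both styles in a single job.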
The following tables summarize the key differences in capabilities.
| Capability | Azure Data Lake Analytics | Azure SQL Data Warehouse | HDInsight with Spark | HDInsight with Hive | HDInsight with Hive LLAP |
|---|---|---|---|---|---|
| Is managed service | Yes | Yes | Yes [1] | Yes [1] | Yes [1] |
| Supports pausing compute | No | Yes | No | No | No |
| Relational data store | Yes | Yes | No | No | No |
| Programmability | U-SQL | T-SQL | Python, Scala, Java, R | HiveQL | HiveQL |
| Programming paradigm | Mixture of declarative and imperative | Declarative | Mixture of declarative and imperative | Declarative | Declarative |
| Pricing model | Per batch job | By cluster hour | By cluster hour | By cluster hour | By cluster hour |
[1] With manual configuration and scaling.
| Capability | Azure Data Lake Analytics | SQL Data Warehouse | HDInsight with Spark | HDInsight with Hive | HDInsight with Hive LLAP |
|---|---|---|---|---|---|
| Access from Azure Data Lake Store | Yes | Yes | Yes | Yes | Yes |
| Query from Azure Storage | Yes | Yes | Yes | Yes | Yes |
| Query from external relational stores | Yes | No | Yes | No | No |
| Capability | Azure Data Lake Analytics | SQL Data Warehouse | HDInsight with Spark | HDInsight with Hive | HDInsight with Hive LLAP |
|---|---|---|---|---|---|
| Scale-out granularity | Per job | Per cluster | Per cluster | Per cluster | Per cluster |
| Fast scale out (less than 1 minute) | Yes | Yes | No | No | No |
| In-memory caching of data | No | Yes | Yes | No | Yes |
| Capability | Azure Data Lake Analytics | SQL Data Warehouse | HDInsight with Spark | Apache Hive on HDInsight | Hive LLAP on HDInsight |
|---|---|---|---|---|---|
| Authentication | Azure Active Directory (Azure AD) | SQL / Azure AD | No | local / Azure AD [1] | local / Azure AD [1] |
| Authorization | Yes | Yes | No | Yes [1] | Yes [1] |
| Auditing | Yes | Yes | No | Yes [1] | Yes [1] |
| Data encryption at rest | Yes | Yes [2] | Yes | Yes | Yes |
| Row-level security | No | Yes | No | Yes [1] | Yes [1] |
| Supports firewalls | Yes | Yes | Yes | Yes [3] | Yes [3] |
| Dynamic data masking | No | No | No | Yes [1] | Yes [1] |
[1] Requires using a domain-joined HDInsight cluster.
[2] Requires using Transparent Data Encryption (TDE) to encrypt and decrypt your data at rest.
[3] Supported when used within an Azure Virtual Network.