- https://github.com/2L-KnowledgeBase/FE-Rookie-101
- https://github.com/2L-KnowledgeBase/core-java
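Currently (these definitions can change within the developer community and are updated over time), Spark is described as a unified analytics engine for large-scale data processing ...
The "what is" definition of an open-source project is often adjusted as the project evolves and as its PMC members reach agreement (usually at the next major release). Put simply, Spark is a distributed computing framework (engine), and many developers like to compare it with MapReduce.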
It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs (DAG, Directed Acyclic Graph) for data analysis.
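To make the computation-graph idea concrete, here is a minimal toy sketch in plain Python. It is not Spark's API: `Node`, `source`, `square`, and `plan` are hypothetical names invented for illustration, and the graph is a simple linear chain (the simplest kind of DAG). It only mirrors the general pattern Spark follows, where transformations record steps in a graph and an action such as `collect` triggers the actual computation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Node:
    """One step in a toy lineage graph (hypothetical, not a Spark class)."""
    op: str                            # label such as "source", "map", "filter"
    fn: Optional[Callable] = None      # function recorded by this step
    parent: Optional["Node"] = None    # upstream node; None for the source
    data: Optional[tuple] = None       # only the source node holds data

    # Transformations only extend the graph; nothing is computed here.
    def map(self, fn: Callable) -> "Node":
        return Node("map", fn, self)

    def filter(self, fn: Callable) -> "Node":
        return Node("filter", fn, self)

    def lineage(self) -> List[str]:
        """Return the op labels from the source down to this node."""
        steps, node = [], self
        while node is not None:
            steps.append(node.op)
            node = node.parent
        return steps[::-1]

    # An action walks the graph and finally runs the recorded functions.
    def collect(self) -> list:
        if self.parent is None:
            return list(self.data)
        upstream = self.parent.collect()
        if self.op == "map":
            return [self.fn(x) for x in upstream]
        if self.op == "filter":
            return [x for x in upstream if self.fn(x)]
        raise ValueError(f"unknown op: {self.op}")


def source(items: Iterable) -> Node:
    """Create the root node of a toy graph from in-memory items."""
    return Node("source", data=tuple(items))


calls = []  # records when the user function actually runs


def square(x):
    calls.append(x)
    return x * x


plan = source(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
print(plan.lineage())  # ['source', 'map', 'filter']
print(len(calls))      # 0, because no action has run yet
print(plan.collect())  # [1, 9, 25]
```

Building `plan` costs almost nothing because it only links nodes together; `square` runs for the first time inside `collect()`. A real engine can use the same separation to inspect and optimize the whole graph before executing it.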
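The high-level APIs fall into two groups: RDD and Dataset (DataFrame; from Spark 2.0 onward the DataFrame API was merged with Dataset). The latter is a second-layer wrapper over the RDD API, for which the community drew on the API design style of the well-known Python package pandas (which is friendlier to analysts).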
It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
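Besides writing a Spark application directly against the RDD/Dataset APIs mentioned above, Spark also provides the following built-in libraries for specific scenarios.
Spark SQL is quite popular for building modern data warehouses. For the concept of the "modern data warehouse", see the talk given by Yucai Yu of eBay at QCon 2018. Because MPP databases (e.g. Teradata/Netezza/Greenplum/..) are expensive, more and more companies are moving offline analytics from traditional data warehouses to modern ones (HDFS + Hive + Spark SQL).
The eBay DSS (Data Services and Solutions) team has given many talks on completing this migration, starting with Spark + AI Summit Europe (2018/10) and continuing at QCon and Spark + AI Meetups in China. The recordings below are from the Databricks YouTube account (bring your own VPN if YouTube is blocked for you).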
- Moving eBay’s Data Warehouse Over to Apache Spark (Kimberly Curtis & Brian Knauss)
- Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang and Lipeng Zhu (eBay)
- Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing - Keith Sun eBay
- Experience Of Optimizing Spark SQL When Migrating from MPP Database - Yucai Yu and Yuming Wang eBay
Spark Streaming: since the Dataset API was unified in Spark 2.0, stream processing (Structured Streaming) uses the Spark SQL APIs.
Latest Spark documentation: http://spark.apache.org/docs/latest/. For historical versions, find here.
- Learning Spark (En / translated edition, chapters 2-9) @Notes
- Mastering Spark SQL
- Mastering Apache Spark
- Advanced Apache Spark Training - Sameer Farooqui (Databricks)
Spark usage scenarios @PayPal (publicly available material only)
- SCaaS: Spark Compute as a Service at Paypal - Prabhu Kasinathan
- Merchant Churn Prediction Using SparkML at PayPal (Chetan Nadgire and Aniket Kulkarni)
- Graph Representation Learning to Prevent Payment Collusion Fraud (Venkatesh Ramanathan)
- PayPal Merchant ecosystem using Spark, Hive, Druid, HBase & Elasticsearch
- Known issues
- Submit your PR