DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD

Map-reduce, streaming analysis, and external memory algorithms, and their implementation using Hadoop and its ecosystem: HBase, Hive, Pig, and Spark. The class includes assignments that analyze large existing databases.

Spark Installation (Python)

| Operating System | Blog Post | YouTube Video |
| --- | --- | --- |
| Mac | Install Spark on Mac | YouTube Video |
| Ubuntu | Install Spark on Ubuntu | YouTube Video |
| Windows | Install Spark on Windows | YouTube Video |

Section 1: Distributed computation using Map Reduce

Map-reduce
Word-count example: loading, processing, collecting.
The work environment: notebooks, Markdown, code cells, display cells, S3, passwords and Vault, GitHub.
The memory hierarchy: S3 files, SQL tables, DataFrames / RDDs, Parquet files.
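
The word-count example above can be sketched in plain Python to show the three stages (loading, processing, collecting) without a cluster. This is a minimal illustration, not the course's notebook code: the function names `map_phase`, `reduce_phase`, and `word_count` and the two sample lines are assumptions made for this sketch.

```python
import re
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def reduce_phase(counts, pair):
    """Reduce: fold one (word, 1) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


def word_count(lines):
    # Loading: `lines` stands in for a distributed collection of text lines.
    # Processing: apply the map step to every line and flatten the results.
    pairs = (pair for line in lines for pair in map_phase(line))
    # Collecting: combine all pairs that share a key into one total per word.
    return reduce(reduce_phase, pairs, {})


# Hypothetical input, used only to exercise the sketch.
lines = ["Call me Ishmael", "call me anything"]
counts = word_count(lines)
print(counts)
```

In PySpark the same shape is usually written as a chain of `flatMap`, `map`, and `reduceByKey` on an RDD, followed by `collect`; the plain-Python version only mirrors that structure on a single machine.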

Section 2: Analysis based on squared error:

Built-in PCA: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/pca_example.py
Built-in Regression
PCA with missing values
Mahalanobis Distance
K-means
Compressed representation and reconstruction
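
As a small illustration of the squared-error theme in this section, the sketch below runs K-means (Lloyd's algorithm) in plain Python and measures the total squared error left after each point is replaced by its nearest centroid, which is one simple form of compressed representation and reconstruction. It is not taken from the course material: the helper names, the four sample points, and the fixed starting centroids are assumptions chosen to keep the run deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the centroid with the smallest squared error.
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


def total_squared_error(points, centroids):
    """Reconstruction error when every point is replaced by its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 0.0)])
error = total_squared_error(points, centroids)
print(centroids, error)
```

Storing only a centroid index per point is the compressed representation; looking the centroid back up is the reconstruction, and `total_squared_error` is the price paid for it.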

Section 3: Classification:

Logistic regression
- https://github.com/apache/spark/blob/master/examples/src/main/python/ml/logistic_regression_with_elastic_net.py
Tree-based regression
- https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/decision_tree_regression_example.py
Ensemble methods for classification
- Random forests: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/random_forest_classification_example.py
- Gradient-boosted trees: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/gradient_boosting_classification_example.py

Section 4: Performance tuning: measuring and tuning Spark applications

Configuration: http://spark.apache.org/docs/latest/configuration.html
Monitoring: http://spark.apache.org/docs/latest/monitoring.html
Tuning: http://spark.apache.org/docs/latest/tuning.html

Section 5: Spark Streaming and stochastic gradient descent
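
To make the stochastic gradient descent half of this section concrete, here is a minimal plain-Python sketch that fits a line by updating the parameters after every single example as it arrives from a stream. It does not use the Spark Streaming API: the generator `example_stream`, the learning rate, and the synthetic data (points on the line y = 2x + 1) are all assumptions made for illustration.

```python
def sgd_step(w, b, x, y, lr):
    """One SGD update for the squared error 0.5 * (w*x + b - y)**2."""
    error = (w * x + b) - y
    # The gradient is error * x with respect to w and error with respect to b.
    return w - lr * error * x, b - lr * error


def fit_stream(stream, lr=0.05):
    """Consume (x, y) pairs one at a time, never holding the full data set."""
    w, b = 0.0, 0.0
    for x, y in stream:
        w, b = sgd_step(w, b, x, y, lr)
    return w, b


def example_stream(epochs):
    """Hypothetical stream: noiseless points on y = 2x + 1, replayed `epochs` times."""
    data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
    for _ in range(epochs):
        yield from data


w, b = fit_stream(example_stream(epochs=2000))
print(w, b)
```

Because each update touches only one example, the same loop works whether the data comes from a file, a socket, or a micro-batch; with noisy data a decaying learning rate is the usual refinement.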

Assignments (From Newest to Oldest)

[Homework 5 Part 2: Higgs Boson](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/HW5/2.Higgs.ipynb)

[Homework 5 Part 1: Cover Types](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/HW5/1.CoverType.ipynb)

[Homework 3 Part 2: Reconstruction of Plots](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/2.Reconstruction-HW-Copy.ipynb)

[Homework 3 Part 1: PCA analysis](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/1.PCA_analysis-HW-Copy.ipynb)

[Homework 2](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/HW-2.ipynb)

[Homework 1: Spark Moby Dick N Grams](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Submissions/HW-1_MichaelGalarnyk.py)

Notes

[Timing for Regex vs string.translate and string.replace](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Timing_Regex_Translate_Replace_Join.ipynb)
