Whether you are an aspiring or practising data scientist, data engineer, data analyst, machine learning engineer, or any other professional, newcomer, or student in this space, the points below give a useful snapshot of the field. They can serve as learning points or help fill gaps in your understanding. Credit goes to the authors of the Kaggle survey; most of the ideas come from that document.
- Important part of your role at work (if you work with data)
- Media sources
- Useful blogs to read
- Course providers
- Courses
- Primary tools to analyse data
- IDEs
- Hosted Notebook products
- Programming languages
- Data visualization libraries or tools
- Specialized hardware
- Machine Learning Algorithms
- Machine learning frameworks
- Machine learning products
- Big data / analytics products
- Cloud computing platforms
- Cloud computing products
- Automated pipelines
- Automated machine learning tools (or partial AutoML tools)
- Tools to help manage machine learning experiments
- Publicly share or deploy your data analysis or machine learning applications
- Relational database products
- Other Tools
- Contributing
- Analyze and understand data to influence product or business decisions
- Build and/or run the data infrastructure that the business uses for storing, analyzing, and operationalizing data
- Build prototypes to explore applying machine learning to new areas
- Build and/or run a machine learning service that operationally improves the product or workflows
- Experimentation and iteration to improve existing ML models
- Do research that advances the state of the art of machine learning
- Hacker News
- Twitter: data science influencers
- Reddit: r/machinelearning, r/datascience, etc
- Kaggle: forums, blog, social media, etc
- Course Forums (forums.fast.ai, Coursera forums, etc)
- Podcasts
- Blogs: Towards Data Science, Medium, Analytics Vidhya, KDnuggets, etc
- Slack Communities: ods.ai, kagglenoobs, etc
- Journal Publications: traditional publications, preprint journals, etc
- YouTube: Google Cloud AI Adventures, Siraj Raval, etc
- MWML Newsletter | MWML Site | lessons
- Data Phoenix Newsletter (previously known as Data Science Digest) | past issues
- Other email newsletters (Data Elixir, O'Reilly Data & AI, etc)
- What I Learned from Writing a Data Science Article Every Week for a Year
- Why you should be a Generalist first, Specialist later as a Data Scientist?
- Lessons from the deep end of data science slides | halfstackdatascience site
- DataQuest
- Coursera
- DataCamp
- LinkedIn Learning: [1] | [2] | [3]
- edX
- Fast.ai
- Udacity
- Udemy
- University Courses
- Cloud-certification programs (direct from AWS, Azure, GCP, or similar)
- Kaggle Courses (Kaggle Learn)
- Learning Centre H2O
See Courses
- Basic statistical software (Microsoft Excel, Google Sheets, etc.)
- Advanced statistical software (SPSS, SAS, etc.)
- Business intelligence software (Salesforce, Tableau, Spotfire, etc.)
- Local development environments (RStudio, JupyterLab, etc.)
- Cloud-based data software & APIs (AWS, GCP, Azure, etc.)
- Pandas profiling
- Bamboolib
- Dabl: Data Analysis Baseline Library (similar to Pandas profiling) | GitHub
- Also see resources under Data: [1] | [2]
- Vim
- Emacs
- Spacemacs: Emacs + Vim
- PyCharm
- Spyder
- RStudio
- Atom
- Jupyter (JupyterLab, Jupyter Notebooks, etc.)
- Sublime Text
- VSCode
- Visual Studio
- MATLAB
- Notepad++
- Also see Cheatsheets
- Microsoft Azure Notebooks
- FloydHub
- Paperspace / Gradient
- Code Ocean
- AWS Notebook Products (EMR Notebooks, SageMaker Notebooks, etc)
- Amazon EMR Notebooks
- Amazon SageMaker Studio
- Databricks Collaborative Notebooks
- Google Colab
- Google Cloud Notebook Products (AI Platform, Datalab, etc)
- Google Cloud Datalab Notebooks
- Binder / JupyterHub
- Kaggle Notebooks (Kernels)
- IBM Watson Studio
- Count
- Also see resources under Notebooks and Cheatsheets
- Swift
- MATLAB
- R
- Julia
- JavaScript
- Java
- C
- Bash
- SQL
- C++
- Python, also see Programming in Python
- TypeScript
- Also see Cheatsheets
- Matplotlib
- Bokeh
- Ggplot / ggplot2
- Shiny
- Geoplotlib
- Leaflet / Folium
- Plotly / Plotly Express - recommend learning
- Seaborn
- D3.js - recommend learning libraries built on top of D3.js
- Altair
- Pandas profiling
- Bamboolib
- See more resources under Visualisation and Cheatsheets
- CPUs
- GPUs
- See also NVIDIA's RAPIDS
- TPUs
- FPGAs
- IPUs
- See more resources under Cloud/DevOps/Infra
- Decision Trees or Random Forests
- Generative Adversarial Networks
- Convolutional Neural Networks
- Linear or Logistic Regression
- Gradient Boosting Machines (xgboost, lightgbm, etc)
- Dense Neural Networks (MLPs, etc)
- Bayesian Approaches
- Evolutionary Approaches: Approach 1 | Approach 2
- Recurrent Neural Networks
- Transformer Networks (BERT, GPT-2, etc)
- See more resources under Machine Learning resources: [1] | [2]
- RandomForest
- TensorFlow
- Swift for TensorFlow - next-generation platform for deep learning and differentiable programming
- LightGBM
- Keras
- Caret
- PyTorch | Also see PyTorch
- Fast.ai
- Spark MLlib
- Scikit-learn
- XGBoost
- CatBoost
- H2O 3
- Prophet
- Tidymodels
- MXNet
- JAX
- See more resources under Machine Learning resources: [1] | [2] and Cheatsheets
- Amazon SageMaker
- Google Cloud Speech-to-Text
- SAS
- Azure Machine Learning Studio
- Google Cloud Machine Learning Engine
- Google Cloud Natural Language
- Google Cloud Vision
- RapidMiner
- Google Cloud Translation
- Cloudera
- See more resources under Machine Learning resources: [1] | [2], Cloud/DevOps/Infra, Data > Programs and Tools and Cheatsheets
- Teradata
- Google Cloud Pub/Sub
- AWS Elastic MapReduce
- Google Cloud Dataflow
- AWS Redshift
- AWS Athena
- AWS Kinesis
- Microsoft Analysis Services
- Google BigQuery
- Databricks
- Microsoft Azure Data Lake Storage
- Apache Spark
- Apache Hive
- Apache Pig
- See more resources under Cloud/DevOps/Infra, Data > Programs and Tools, Databases and Cheatsheets
- IBM Cloud
- VMware Cloud
- Alibaba Cloud
- SAP Cloud
- Google Cloud Platform (GCP)
- Red Hat Cloud
- Oracle Cloud
- Amazon Web Services (AWS)
- Microsoft Azure
- Salesforce Cloud
- Tencent Cloud
- See more resources under Cloud/DevOps/Infra and Data > Programs and Tools
- Oracle Cloud Infrastructure
- Google Kubernetes Engine
- Azure Virtual Machines
- Google Compute Engine (GCE)
- AWS Elastic Beanstalk
- AWS Elastic Compute Cloud (EC2)
- Google App Engine
- AWS Lambda
- Google Cloud Functions
- Azure Container Service
- AWS Batch
- See more resources under Cloud/DevOps/Infra and Data > Programs and Tools
- Automated data augmentation (e.g. imgaug, albumentations)
- Automated feature engineering/selection (e.g. tpot, boruta_py)
- Automated model selection (e.g. auto-sklearn, xcessiv)
- Automated model architecture searches (e.g. darts, enas)
- Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)
- Automated Machine Learning Hyperparameter Tuning in Python | LinkedIn Post
- Automation of full ML pipelines (e.g. Google AutoML, H2O Driverless AI)
- Machine Learning incl. #AutoML overfits. On a mission to develop Causal AI
- See more resources under Automation, Cloud/DevOps/Infra, and Data > Programs and Tools
- Databricks AutoML
- DataRobot AutoML
- Tpot
- Google AutoML
- Auto_ml
- Auto-Keras
- Auto-Sklearn
- PyCaret | On awesome-ai-ml-dl
- Xcessiv
- MLbox
- H2O Driverless AI
- AutoGluon - AutoML for Image, Text, and Tabular Data.
- FLAML - A fast library for AutoML and tuning.
- Neural Network Intelligence - An open source AutoML toolkit for automating the machine learning lifecycle.
- The 3 Best Free Online Resources to Learn MLOps
- An awesome list of references for MLOps - Machine Learning Operations 👉 ml-ops.org
- Awesome Production Machine Learning
- See more resources under Automation, Cloud/DevOps/Infra, Data > Programs and Tools and Cheatsheets
- Valohai
- Neptune.ai
- Weights & Biases
- Comet.ml
- Sacred + Omniboard
- TensorBoard
- Polyaxon
- Guild AI
- Trains
- Domino Model Monitor
- Apache Airflow
- Flyte
- Oracle Database
- Microsoft SQL Server
- Azure SQL Database
- PostgreSQL
- SQLite
- AWS Relational Database Service
- Microsoft Access
- AWS DynamoDB
- MySQL
- Google Cloud SQL
- Google Cloud Firestore
- MongoDB
- Snowflake
- IBM Db2
- See other resources under Databases and Cheatsheets
- Amazon QuickSight
- Microsoft Power BI
- Google Data Studio
- Tableau
- Qlik
- Domo
- TIBCO Spotfire
- Alteryx
- Sisense
- SAP Analytics Cloud
- Snowplow Analytics
- Looker
- ChartIO: Cloud-based data analytics exploration for all
- Salesforce
- Einstein Analytics
- Count
Contributions are very welcome. Please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.
Back to main page (table of contents)