A collection of cheat sheets, coding resources, videos, and other references that I want to keep track of and that others may find helpful.
PySpark_RDD_Cheat_Sheet.pdf Source: https://www.datacamp.com/cheat-sheet/pyspark-cheat-sheet-spark-in-python
# Create a DataFrame from a CSV file:
df = spark.read.csv("/mnt/datasets/sample.csv", header=True, inferSchema=True)
# Display the first few rows of the DataFrame:
df.show()
# Select columns from a DataFrame:
df.select("column1", "column2").show()
Best Practice: Avoid using collect()
Avoid using collect() as it brings all data to the driver node, which can cause memory issues.
# Bad practice - using collect() to bring all data to driver:
data = df.collect()
# Better practice - use show() or take() instead:
df.show(5)
first_rows = df.take(5)
Repartitioning DataFrames:
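Repartition a DataFrame to control how its rows are spread across partitions, which can improve parallelism on large datasets. repartition() triggers a full shuffle, so use it deliberately rather than by default.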
# Repartition the DataFrame based on a column:
df = df.repartition("column_name")
Check if your df is pandas or pyspark
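One way to run this check without importing either library is to look at the module that defines the object's class. This is a minimal sketch, assuming pandas classes live under the `pandas` package and PySpark classes under `pyspark`; `dataframe_kind` and the two stand-in classes are hypothetical names used only for illustration.

```python
def dataframe_kind(obj):
    """Return "pandas", "pyspark", or "unknown" based on the defining module."""
    # Top-level package of the object's class, e.g. "pandas" for pandas.core.frame
    top_level = type(obj).__module__.split(".")[0]
    if top_level in ("pandas", "pyspark"):
        return top_level
    return "unknown"


# Stand-in classes so the sketch runs without pandas or PySpark installed.
class FakePandasFrame:
    __module__ = "pandas.core.frame"


class FakeSparkFrame:
    __module__ = "pyspark.sql.dataframe"


print(dataframe_kind(FakePandasFrame()))  # pandas
print(dataframe_kind(FakeSparkFrame()))   # pyspark
print(dataframe_kind([1, 2, 3]))          # unknown
```

With both libraries installed, an `isinstance` check against each library's `DataFrame` class is the more explicit option. A pandas-on-Spark DataFrame is defined under `pyspark.pandas`, so this sketch would report it as `pyspark`.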
Convert pandas to pyspark df
from pyspark.sql import SparkSession
# Initialize Spark session if not already done
spark = SparkSession.builder.getOrCreate()
# Convert Pandas DataFrames to PySpark
example_df = spark.createDataFrame(example_df)
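# Left join on Code = Der_Code (assumes `from pyspark.sql import functions as fn`; `df` is a placeholder for the left-hand DataFrame)
df.join(df1, fn.col("Code") == fn.col("Der_Code"), how="left")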
See the df in different ways
# Prints the columns & types
df.printSchema()
# Prints the list of column names
print(df.columns)
# Prints distinct values in a given column
df.select("column1").distinct().show()
Drop columns
df = df.drop("column1", "column2")
List of built-in functions: Beginner-friendly built-in function list
Specific functions:
Pandas library doc:
Some standard data investigation code: https://medium.com/analytics-vidhya/statistical-analysis-in-python-using-pandas-27c6a4209de2
Cheat Sheets
Source: https://www.datacamp.com/cheat-sheet/python-for-data-science-a-cheat-sheet-for-beginners
Source: https://www.datacamp.com/cheat-sheet/numpy-cheat-sheet-data-analysis-in-python
Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python
Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-data-wrangling-in-python
SQL Best Practices Cheat Sheet Source: https://aeshantechhub.co.uk/databricks-dbutils-cheat-sheet-and-pyspark-amp-sql-best-practice-cheat-sheet/
SQL is widely used in Databricks for data querying and transformation. Below are some best practices to keep your queries optimized.
Common SQL Operations:
# Create a temporary view from a DataFrame so it can be queried with SQL:
df.createOrReplaceTempView("temp_table")
# Running a SQL query on the DataFrame:
spark.sql("SELECT * FROM temp_table").show()
Best Practice: Use LIMIT when previewing data
Avoid fetching large datasets during development. Use LIMIT to preview data instead.
# Use LIMIT to preview data in SQL:
spark.sql("SELECT * FROM temp_table LIMIT 10").show()
Best Practice: Leverage Caching
Cache intermediate results in memory to optimise performance for iterative queries.
# Cache a DataFrame for future use:
df.cache()
Avoid using SELECT * in production
Using SELECT * can lead to unnecessary data transfer and slow performance, especially with large datasets.
# Bad practice - using SELECT *:
spark.sql("SELECT * FROM temp_table")
# Better practice - select only needed columns:
spark.sql("SELECT column1, column2 FROM temp_table")
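The two query habits in this section, naming the columns you need and limiting previews, can be combined in a small helper. This is a sketch only: `build_preview_query` is a hypothetical function, not part of Spark, and it inserts names as-is, so it assumes trusted table and column identifiers.

```python
def build_preview_query(table, columns, limit=10):
    """Build a SELECT statement with explicit columns and a LIMIT clause."""
    if not columns:
        raise ValueError("list the columns you need instead of using SELECT *")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Names are inserted as-is; only pass trusted identifiers.
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM {table} LIMIT {limit}"


query = build_preview_query("temp_table", ["column1", "column2"])
print(query)  # SELECT column1, column2 FROM temp_table LIMIT 10
```

The resulting string can then be passed to `spark.sql(query).show()`.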
Complete formatting cheat sheet: Markdown Cheatsheet
Some bits from the above that I use a lot:
| Style | Syntax | Example |
| --- | --- | --- |
| Bold | `** **` or `__ __` | `**This is bold text**` |
| Italic | `* *` or `_ _` | `_This text is italicized_` |
| Strikethrough | `~~ ~~` | `~~This was mistaken text~~` |
| Bold and nested italic | `** **` and `_ _` | `**This text is _extremely_ important**` |
| All bold and italic | `*** ***` | `***All this text is important***` |
| Subscript | `<sub> </sub>` | `This is a <sub>subscript</sub> text` |
| Superscript | `<sup> </sup>` | `This is a <sup>superscript</sup> text` |
| Underline | `<ins> </ins>` | `This is an <ins>underlined</ins> text` |
Horizontal rules
Three or more...
Insert a hyperlink
[Insert your text here](https://www.google.com)
You can also use code blocks to create diagrams in Markdown. GitHub supports Mermaid, GeoJSON, TopoJSON, and ASCII STL syntax. For more information, see Creating diagrams.
Source: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks
You can add an optional language identifier to enable syntax highlighting in your fenced code block.
Syntax highlighting changes the color and style of source code to make it easier to read.
For example, to syntax highlight Markdown code:
hello this is my markdown
You can put code in this block & it will show in a grey box. Change your language as required. More info can be found below on supported languages.
This will display the code block. If using python, the code block will be formatted with colours.
When you create a fenced code block that you also want to have syntax highlighting on a GitHub Pages site, use lower-case language identifiers. For more information, see About GitHub Pages and Jekyll.
We use Linguist to perform language detection and to select third-party grammars for syntax highlighting. You can find out which keywords are valid in the languages YAML file.
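The fenced-block syntax described above can also be produced from Python, for example when a script generates a README. This is a minimal sketch: `fenced_block` is a hypothetical helper that lower-cases the language identifier (as suggested above for GitHub Pages) and picks a fence longer than any run of backticks inside the code, so the block cannot close early.

```python
import re


def fenced_block(code, language=""):
    """Wrap code in a Markdown fenced code block with a language identifier."""
    # Find the longest run of backticks inside the code itself.
    runs = re.findall(r"`+", code)
    longest = max((len(run) for run in runs), default=0)
    # Use at least three backticks, and always more than the longest run.
    fence = "`" * max(3, longest + 1)
    return f"{fence}{language.lower()}\n{code}\n{fence}"


block = fenced_block('print("Hello World")', "Python")
print(block)
```

The first line of the printed block is the opening fence followed by `python`, and the last line is the matching closing fence.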
To create a table of contents and sections in your Markdown file, you can use the code below. Source: https://stackoverflow.com/questions/11948245/markdown-to-create-pages-and-table-of-contents
# Table of contents
1. [Introduction](#introduction)
2. [Some paragraph](#paragraph1)
    1. [Sub paragraph](#subparagraph1)
3. [Another paragraph](#paragraph2)
## This is the introduction <a name="introduction"></a>
Some introduction text, formatted in heading 2 style
## Some paragraph <a name="paragraph1"></a>
The first paragraph text
### Sub paragraph <a name="subparagraph1"></a>
This is a sub paragraph, formatted in heading 3 style
## Another paragraph <a name="paragraph2"></a>
The second paragraph text
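The table-of-contents lines in this example can also be generated from a list of sections. The sketch below assumes each section is given as a (heading level, title, anchor) tuple, that the anchor matches the `<a name="...">` tag on the heading, and that level-2 headings are the top of the list; `build_toc` is a hypothetical helper.

```python
def build_toc(sections, indent="    "):
    """Return Markdown ordered-list lines linking to named anchors."""
    lines = []
    counters = {}  # current item number at each nesting depth
    for level, title, anchor in sections:
        depth = level - 2  # assumes level-2 headings are the top of the list
        counters[depth] = counters.get(depth, 0) + 1
        # A new parent item restarts the numbering of any deeper levels.
        for deeper in [d for d in counters if d > depth]:
            del counters[deeper]
        lines.append(f"{indent * depth}{counters[depth]}. [{title}](#{anchor})")
    return lines


sections = [
    (2, "Introduction", "introduction"),
    (2, "Some paragraph", "paragraph1"),
    (3, "Sub paragraph", "subparagraph1"),
    (2, "Another paragraph", "paragraph2"),
]
print("\n".join(build_toc(sections)))
```

For these four sections the output matches the hand-written list in the example, with the sub paragraph indented under item 2.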
See full list here:
To create a task list, preface list items with a hyphen and space followed by [ ]. To mark a task as complete, use [x].
- [x] #739
- [ ] https://github.com/octo-org/octo-repo/issues/740
- [ ] Add delight to the experience when all tasks are complete :tada:
Markdown code:
<details>
<summary>This is a collapsed section</summary>

### You can add a header
You can add text within a collapsed section.
You can add an image or a code block, too.
print("Hello World")
</details>
Preview how it looks: