This project is designed to help you systematically explore and analyze datasets using Python and pandas, following the structure of the Real Python tutorial “Using pandas and Python to Explore Your Dataset.” It provides scripts, notebooks, and examples that demonstrate:
- 📥 Environment setup
  - Install Python 3, pandas, matplotlib (and optionally Jupyter/Anaconda).
  - Sample install commands using `pip` or `conda`.
- 🧰 Data ingestion
  - Use `pd.read_csv()` (or `read_json()`, `read_html()`, etc.) to load data.
  - Example uses real-world data: e.g. NBA results CSV.
- 🔍 Initial data inspection
  - Use `.head()`, `.tail()`, `.info()`, `.shape`, and `len()` to get a quick overview.
  - Adjust display settings (`display.max_columns`, `display.precision`) for better visibility.
- 🗂️ Data structure understanding
  - Explore `Series` and `DataFrame` basics.
  - Compare indexing methods: bracket `[]`, `.loc`, `.iloc`.
- ❓ Querying & filtering
  - Filter rows via conditions (e.g., `df[df["col"] > X]`).
  - Use `.loc`, `.iloc` to select specific rows/columns.
- 📊 Grouping & aggregating
  - Summarize data using `.groupby()`, `.sum()`, `.mean()`, `.count()`.
  - Combine datasets (`concat`, `merge`) when working with multiple sources.
- 🧼 Cleaning & casting
  - Detect and handle missing/inconsistent/invalid values.
  - Convert types (`df["col"] = df["col"].astype(...)`) as needed.
- 📈 Visualization
  - Use pandas' built-in `.plot()` (histograms, scatter, bar, etc.) to visualize distributions, trends, and categories.
  - Leverage matplotlib integration within Jupyter or standalone scripts.
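
As a rough illustration of the ingestion, inspection, indexing, and filtering steps listed above, here is a minimal sketch. The CSV text is a made-up stand-in for a results file, and its column names (`team_id`, `year_id`, `pts`, `game_result`) and values are assumptions for this example, not the contents of the project's `data/` folder.

```python
import io

import pandas as pd

# Hypothetical stand-in for a results CSV; the values are made up for illustration.
CSV_TEXT = """team_id,year_id,pts,game_result
BOS,2014,101,W
BOS,2015,95,L
LAL,2014,110,W
LAL,2015,,L
CHI,2015,88,W
"""

# Data ingestion: read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Initial inspection: row count, shape, first rows, and column dtypes.
print(len(df), df.shape)
print(df.head(2))
df.info()

# Display setting: show every column when printing wide frames.
pd.set_option("display.max_columns", None)

# Indexing: .loc is label-based, .iloc is position-based.
first_by_label = df.loc[0, "team_id"]
first_by_position = df.iloc[0, 0]

# Querying: a boolean mask keeps only the rows where the condition holds.
# The row with a missing score compares as False and is dropped.
high_scoring = df[df["pts"] > 100]

# .loc can combine a row condition with a column selection.
wins = df.loc[df["game_result"] == "W", ["team_id", "pts"]]
print(high_scoring)
print(wins)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.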
- `environment_setup/` – shell scripts and instructions to configure your Python environment.
- `data/` – sample datasets (e.g. NBA ELO, FiveThirtyEight, etc.).
- `notebooks/` – Jupyter notebooks illustrating each key step:
  - Overview & loading
  - Inspection & display settings
  - Indexing & selection
  - Querying & filtering
  - GroupBy and aggregation
  - Cleaning & typing
  - Merging datasets
  - Visualizing data
- `scripts/` – Python files that reproduce key tasks outside Jupyter.
- `requirements.txt` – minimal dependencies (pandas, matplotlib, Jupyter optional).
- `README.md` – (this file).
- Clone the repo

      git clone <repo-url>
      cd <repo-folder>

  - The virtual environment (`pandas_env/`) is included for convenience.
  - Outputs
- Set up your environment

      pip install -r requirements.txt
      # or
      conda install pandas matplotlib jupyter

- Run a notebook

      jupyter notebook

- Explore!
  - Follow the notebooks step-by-step to learn:
    - Inspecting your data with `.info()`, `.head()`, `.shape`, `.describe()`
    - Subsetting using `.loc`, `.iloc`, filtering expressions
    - Grouping and summarizing by category
    - Cleaning missing and inconsistent entries
    - Visualizing distributions and relationships with `.plot()`
- By the end of this project, you’ll be able to:
  - Load data from multiple formats into pandas
  - Understand core data structures: Series & DataFrame
  - Access and filter data efficiently
  - Aggregate and group information to extract insights
  - Clean data and prepare it for analysis
  - Create visualizations that highlight key patterns
  - Combine multiple datasets for comprehensive analysis
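
The cleaning, aggregating, and combining outcomes above might look roughly like the following sketch. All frames, column names (`team_id`, `pts`, `city`), and values here are hypothetical and built in memory, so they should be read as an illustration under those assumptions rather than as the notebooks' actual code.

```python
import pandas as pd

# Hypothetical game records; names and values are illustrative only.
# The missing score forces the "pts" column to float64.
games = pd.DataFrame(
    {
        "team_id": ["BOS", "BOS", "LAL", "LAL", "CHI"],
        "pts": [101, 95, 110, None, 88],
    }
)

# Cleaning: drop rows with a missing score, then cast floats to integers.
clean = games.dropna(subset=["pts"]).copy()
clean["pts"] = clean["pts"].astype("int64")

# Grouping and aggregating: total, mean, and count of points per team.
summary = clean.groupby("team_id")["pts"].agg(["sum", "mean", "count"])

# Combining: concat stacks frames that share the same columns...
extra = pd.DataFrame({"team_id": ["CHI"], "pts": [92]})
all_games = pd.concat([clean, extra], ignore_index=True)

# ...while merge joins two frames on a shared key column.
cities = pd.DataFrame(
    {"team_id": ["BOS", "LAL", "CHI"], "city": ["Boston", "Los Angeles", "Chicago"]}
)
with_city = all_games.merge(cities, on="team_id", how="left")

print(summary)
print(with_city)
```

Dropping incomplete rows is only one option for missing values; depending on the analysis, filling them with `fillna()` may be more appropriate.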
Reka Horvath (2020, January 6). Using pandas and Python to Explore Your Dataset. Real Python. https://realpython.com/pandas-python-explore-dataset/