I gained experience in handling missing data (NaN values), importing data from CSV files, removing unnecessary data, selecting specific columns, identifying maximum values within datasets, and locating corresponding row indices.
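A minimal sketch of those operations, using an invented DataFrame in place of a real CSV (the column names `language` and `posts` are assumptions for illustration):

```python
import pandas as pd

# Stand-in for pd.read_csv("file.csv"); the columns here are invented.
df = pd.DataFrame({
    "language": ["Python", "Java", "C", None],
    "posts": [20_000, 15_000, None, 5_000],
})

df = df.dropna()                        # remove rows containing NaN values
top = df["posts"].max()                 # maximum value in a column
top_idx = df["posts"].idxmax()          # index of the row holding that maximum
top_language = df.loc[top_idx, "language"]
```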
I acquired proficiency in using Pandas and Matplotlib to transform raw data into informative visualizations, producing a chart of the most widely used programming languages from 2008 to 2024.
I learnt how to:
- use HTML markup inside Markdown cells in Notebooks
- combine the groupby() and count() functions to aggregate data
- use the value_counts() function
- slice DataFrames using the square bracket notation
- use the agg() function to run an operation on a particular column
- rename() columns of DataFrames
- create a line chart with two separate axes to visualize data that have different scales
- create a scatter plot and a bar chart in Matplotlib
- work with tables in a relational database by using primary and foreign keys
- merge() DataFrames along a particular column.
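A compact sketch of those Pandas techniques, on invented Stack Overflow-style data (the `TAG`/`POSTS` names and the numbers are made up):

```python
import pandas as pd

# Invented tag counts, for illustration only.
posts = pd.DataFrame({
    "TAG": ["python", "python", "java"],
    "POSTS": [10, 20, 15],
})

tag_counts = posts["TAG"].value_counts()        # occurrences per tag
first_two = posts[0:2]                          # square-bracket slicing

# Aggregate one column per group with agg(), then rename it
per_tag = posts.groupby("TAG").agg({"POSTS": "sum"})
per_tag = per_tag.rename(columns={"POSTS": "total_posts"})

# Merge along a shared column, much like a foreign-key join
meta = pd.DataFrame({"TAG": ["python", "java"], "year": [1991, 1995]})
merged = pd.merge(per_tag.reset_index(), meta, on="TAG")
```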
I learnt how to:
- use .describe() to get a snapshot of the data, such as the average, highest and lowest values
- use .resample() to make one time series comparable to another by changing its periodicity
- work with matplotlib.dates Locators to better style a timeline (e.g., an axis on a chart).
- find the number of NaN values with .isna().values.sum()
- change the resolution of a chart using the figure's dpi
- create dashed '--', dotted ':' and dash-dot '-.' lines using linestyles
- use different kinds of markers (e.g., 'o' or '^') on charts.
- fine-tune the styling of Matplotlib charts by using limits, labels, linewidth and colours
- use .grid() to help visually identify seasonality in a time series.
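The time-series and styling points above can be sketched as follows, on a synthetic daily series (the dates and values are invented):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic daily series; the values are invented.
idx = pd.date_range("2020-01-01", periods=90, freq="D")
series = pd.Series(range(90), index=idx)

monthly = series.resample("M").mean()       # change the periodicity

fig, ax = plt.subplots(dpi=120)             # dpi sets the chart resolution
ax.plot(monthly.index, monthly.values, linestyle="--", marker="o",
        color="crimson", linewidth=2)
ax.xaxis.set_major_locator(mdates.MonthLocator())  # one tick per month
ax.set_xlim(monthly.index.min(), monthly.index.max())
ax.grid(True)                               # a grid helps spot seasonality
```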
I learnt how to:
- pull a random sample from a DataFrame using .sample()
- find duplicated entries with .duplicated() and .drop_duplicates()
- convert string and object data types into numbers with .to_numeric()
- use plotly to generate pie, donut and bar charts as well as box and scatter plots
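A minimal sketch of the cleaning steps, on invented data with a duplicate row and numbers stored as strings (the `show`/`votes` names are assumptions):

```python
import pandas as pd

# Invented data with a duplicate row and numbers stored as strings.
df = pd.DataFrame({
    "show": ["A", "B", "B", "C"],
    "votes": ["10", "20", "20", "7"],
})

sample = df.sample(n=2, random_state=1)         # pull random rows
n_dupes = df.duplicated().sum()                 # count duplicated entries
clean = df.drop_duplicates().copy()
clean["votes"] = pd.to_numeric(clean["votes"])  # strings -> numbers
# For the charts, plotly.express offers px.pie(), px.bar(), px.box(), ...
```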
I learnt how to:
- create arrays with np.array()
- generate arrays using np.arange(), np.random.random() and np.linspace()
- analyse the shape and dimensions of an ndarray
- slice and subset an ndarray based on its indices
- perform linear-algebra operations such as scalar arithmetic and matrix multiplication
- use NumPy's broadcasting to make ndarrays of different shapes compatible
- manipulate images in the form of ndarrays
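A short sketch of those NumPy operations (all the numbers are invented):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])            # 2x2 ndarray
evenly = np.linspace(0, 1, 5)             # 5 evenly spaced values
seq = np.arange(6).reshape(2, 3)          # shape: (2, 3), ndim: 2

product = a @ a                           # matrix multiplication
scaled = a * 10                           # operation with a scalar
shifted = seq + np.array([1, 2, 3])       # broadcasting a row across rows

# An image is just an ndarray, e.g. (height, width, 3) for RGB pixels;
# halving the values darkens it (the pixel values here are invented).
img = np.full((2, 2, 3), 200)
darker = img // 2
```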
I learnt how to:
- use nested loops to remove unwanted characters from multiple columns
- create bubble charts using the Seaborn library
- filter a Pandas DataFrame based on multiple conditions using both .loc[] and .query()
- style Seaborn charts using the pre-built styles and by modifying Matplotlib parameters
- use floor division to convert years to decades
- use Seaborn to superimpose a linear regression over our data
- run regressions with scikit-learn and calculate the coefficients
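A sketch of the filtering, decade and regression steps on invented film data (every name and number is illustrative; np.polyfit stands in for scikit-learn's LinearRegression to keep the example dependency-light):

```python
import numpy as np
import pandas as pd

# Invented film data, for illustration only.
films = pd.DataFrame({
    "year": [1999, 2004, 2012, 2019],
    "budget": [10, 40, 60, 90],
    "revenue": [30, 80, 130, 200],
})

films["decade"] = films["year"] // 10 * 10   # floor division -> decade

# Multi-condition filtering, two equivalent ways
recent_big = films.loc[(films["year"] >= 2010) & (films["budget"] > 50)]
same_rows = films.query("year >= 2010 and budget > 50")

# Regression slope and intercept; the course used scikit-learn's
# LinearRegression, np.polyfit stands in here.
slope, intercept = np.polyfit(films["budget"], films["revenue"], 1)
```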
I learnt how to:
- create a Choropleth to display data on a map
- create bar charts showing different segments of the data with plotly
- create Sunburst charts with plotly.
- use Seaborn's .lmplot() and show best-fit lines across multiple categories using the row, hue, and lowess parameters
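The .lmplot() point can be sketched on invented two-group data (the column names are assumptions; the choropleth and Sunburst charts would use plotly.express, e.g. px.choropleth() and px.sunburst()):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import pandas as pd
import seaborn as sns

# Invented two-group data; lmplot fits one line per hue category.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 1, 2, 3, 4],
    "y": [2, 4, 6, 8, 1, 2, 3, 4],
    "group": ["a"] * 4 + ["b"] * 4,
})

# hue gives one best-fit line per group; row would split the plot into
# subplot rows, and lowess=True would fit a locally weighted curve.
grid = sns.lmplot(data=df, x="x", y="y", hue="group")
```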
I learnt how to:
- use histograms to visualise distributions
- superimpose histograms on top of each other even when the data series have different lengths
- smooth out the kinks in a histogram and visualise a distribution with a Kernel Density Estimate (KDE)
- improve a KDE by specifying boundaries on the estimates
- use scipy and test for statistical significance by looking at p-values
- highlight different parts of a time series chart in Matplotlib
- add and configure a Legend in Matplotlib
- use NumPy's .where() function to process elements depending on a condition
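A minimal sketch of the KDE, p-value and .where() points, on two synthetic samples with deliberately shifted means (all values are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 500)   # two synthetic samples, purely
b = rng.normal(0.5, 1.0, 500)   # illustrative, with shifted means

# Test for statistical significance by looking at the p-value
t_stat, p_value = stats.ttest_ind(a, b)

# A KDE smooths a histogram's kinks into a continuous density curve
kde = stats.gaussian_kde(a)
density_at_zero = kde(0.0)[0]

# np.where() processes elements depending on a condition
labels = np.where(a > 0, "positive", "non-positive")
```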
I learnt how to:
- quickly spot relationships in a dataset using Seaborn's .pairplot()
- split the data into a training and testing dataset to better evaluate a model's performance
- run a multivariable regression
- evaluate that regression based on the signs of its coefficients
- analyse and look for patterns in a model's residuals
- improve a regression model using a (log) data transformation
- specify your own values for various features and use your model to make a prediction
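A sketch of that whole workflow on synthetic data (the features and target are invented; np.linalg.lstsq stands in for scikit-learn's LinearRegression, and the split is done by index rather than sklearn's train_test_split):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features and a log-linear target, invented for illustration.
X = rng.uniform(1, 10, size=(100, 2))
y = np.log(X[:, 0]) * 3 + X[:, 1] * 0.5 + rng.normal(0, 0.1, 100)

# Train/test split so the model is evaluated on unseen data
train, test = np.arange(80), np.arange(80, 100)

# Multivariable regression on a log-transformed feature; np.linalg.lstsq
# stands in for scikit-learn's LinearRegression here.
A = np.column_stack([np.log(X[train, 0]), X[train, 1], np.ones(80)])
coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Residuals on the held-out data should look like random noise
A_test = np.column_stack([np.log(X[test, 0]), X[test, 1], np.ones(20)])
residuals = y[test] - A_test @ coef

# Specify your own feature values and use the model to predict
prediction = np.array([np.log(5.0), 2.0, 1.0]) @ coef
```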