Pandas

Loading pandas

`Series`

One dimentional ndarray (N-dimentional array) with axis labels.

index: Pandas represents Series as a numerical array. index is stored seperately.

Create from `list`

years = [1776, 1867, 1821]
countries = ['USA', 'Canada', 'Mexico']
indep = pd.Series(data=years, index=countries)
indep

USA       1776
Canada    1867
Mexico    1821
dtype: int64

indep['USA'] # or indep[0]

Create from `dict`

pd.Series({ 'Sam': 5, 'Frank': 10, 'Spike': 7 })

Sam       5
Frank    10
Spike     7
dtype: int64

Properties of `Series`

q1 = { 'Japan': 80, 'China': 450, 'India': 200, 'USA': 250 }
q2 = { 'Brazil': 100,'China': 500, 'India': 210,'USA': 260 }
sales_Q1 = pd.Series(q1)
sales_Q2 = pd.Series(q2)

Index keys

sales_Q2.keys()

Index(['Brazil', 'China', 'India', 'USA'], dtype='object')

Broadcasting

Same as in numpy.

sales_Q1 * 3

Japan     240
China    1350
India     600
USA       750
dtype: int64

`+`, `-`, `/`, `*`

Pandas uses NaN for missing values.

sales_Q1 + sales_Q2

Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64

Use .add() (or other operator methods) to specify a fill_value.

sales_Q1.add(sales_Q2, fill_value=0)

Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64

`DataFrame`

A group of Series objects that share the same index.

Creating a DataFrame

d = np.random.randint(0, 100, (4, 3))
i = np.array(['CA', 'NY', 'AZ', 'TX'], dtype='U2')
c = np.array(['Jan', 'Feb', 'Mar'], dtype='U3')
df = pd.DataFrame(d, index=i, columns=c)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Jan     4 non-null      int64
 1   Feb     4 non-null      int64
 2   Mar     4 non-null      int64
dtypes: int64(3)
memory usage: 128.0+ bytes

Attributes and methods of a DataFrame

columns

df.columns

Index(['Jan', 'Feb', 'Mar'], dtype='object')

index

df.index

Index(['CA', 'NY', 'AZ', 'TX'], dtype='object')

head, tail First or last 5 rows. Count may be changed.
info Gives information about a dataframe.
describe Gives basic statistical information on numerical columns.

df.describe()

             Jan        Feb       Mar
count   4.000000   4.000000   4.00000
mean   14.000000  39.500000  44.25000
std     8.406347  31.171568  18.20943
min     5.000000   4.000000  23.00000
25%     9.500000  28.000000  35.75000
50%    13.000000  37.000000  43.50000
75%    17.500000  48.500000  52.00000
max    25.000000  80.000000  67.00000

transpose Transpose (or flip) columns and rows.

df.describe().transpose()

     count   mean        std   min    25%   50%   75%   max
Jan    4.0  14.00   8.406347   5.0   9.50  13.0  17.5  25.0
Feb    4.0  39.50  31.171568   4.0  28.00  37.0  48.5  80.0
Mar    4.0  44.25  18.209430  23.0  35.75  43.5  52.0  67.0

drop Drops a row or a column depending on the axis.

df.drop('Mar', axis=1) # axis=0 row, axis=1 column
# df.drop('Mar', axis=1, inplace=True)

    Jan  Feb
CA    5    4
NY   25   36
AZ   11   80
TX   15   38

inplace may be used to permenantly delete. However, there are signs it will be deprecated. Assigning the resulting dataframe to itself is more preferable.

Columns

Indexing

Grab a single column. Returns a pandas Series.

df['Jan']

CA     5
NY    25
AZ    11
TX    15
Name: Jan, dtype: int64

type(df['Jan'])

<class 'pandas.core.series.Series'>

Grab multiple columns.

df[['Jan', 'Feb']]

    Jan  Feb
CA    5    4
NY   25   36
AZ   11   80
TX   15   38

type(df[['Jan', 'Feb']])

<class 'pandas.core.frame.DataFrame'>

Assign new columns

df['Z Feb'] = np.round((df['Feb'] - df['Feb'].mean()) / df['Feb'].std(), 2)
df['Z Feb']

CA   -1.14
NY   -0.11
AZ    1.30
TX   -0.05
Name: Z Feb, dtype: float64

Rows

df = pd.read_csv('tips.csv')
df.head()

   total_bill   tip     sex smoker  day  ... size  price_per_person          Payer Name         CC Number  Payment ID
0       16.99  1.01  Female     No  Sun  ...    2              8.49  Christy Cunningham  3560325168603410     Sun2959
1       10.34  1.66    Male     No  Sun  ...    3              3.45      Douglas Tucker  4478071379779230     Sun4608
2       21.01  3.50    Male     No  Sun  ...    3              7.00      Travis Walters  6011812112971322     Sun4458
3       23.68  3.31    Male     No  Sun  ...    2             11.84    Nathaniel Harris  4676137647685994     Sun5260
4       24.59  3.61  Female     No  Sun  ...    4              6.15        Tonya Carter  4832732618637221     Sun2251

:

[5 rows x 11 columns]

Changing the index

Any primary key may be used as an index.

dfp = df.set_index('Payment ID')
dfp.head()

            total_bill   tip     sex smoker  ... size price_per_person          Payer Name         CC Number
Payment ID                                   ...
Sun2959          16.99  1.01  Female     No  ...    2             8.49  Christy Cunningham  3560325168603410
Sun4608          10.34  1.66    Male     No  ...    3             3.45      Douglas Tucker  4478071379779230
Sun4458          21.01  3.50    Male     No  ...    3             7.00      Travis Walters  6011812112971322
Sun5260          23.68  3.31    Male     No  ...    2            11.84    Nathaniel Harris  4676137647685994
Sun2251          24.59  3.61  Female     No  ...    4             6.15        Tonya Carter  4832732618637221

:

[5 rows x 10 columns]

To reset index.

dfp.reset_index().head()

  Payment ID  total_bill   tip     sex smoker  ...    time size  price_per_person          Payer Name         CC Number
0    Sun2959       16.99  1.01  Female     No  ...  Dinner    2              8.49  Christy Cunningham  3560325168603410
1    Sun4608       10.34  1.66    Male     No  ...  Dinner    3              3.45      Douglas Tucker  4478071379779230
2    Sun4458       21.01  3.50    Male     No  ...  Dinner    3              7.00      Travis Walters  6011812112971322
3    Sun5260       23.68  3.31    Male     No  ...  Dinner    2             11.84    Nathaniel Harris  4676137647685994
4    Sun2251       24.59  3.61  Female     No  ...  Dinner    4              6.15        Tonya Carter  4832732618637221

:

[5 rows x 11 columns]

Querying based on index

Integer based indexing.

df.iloc[0]

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
Payment ID                     Sun2959
Name: 0, dtype: object

Label based indexing.

dfp.loc["Sun2959"]

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
Name: Sun2959, dtype: object

Indexing multiple rows.

dfp.iloc[0:4] # pythonic slicing may be used
dfp.loc[['Sun2959', 'Sun5260']].transpose()

Payment ID                   Sun2959           Sun5260
total_bill                     16.99             23.68
tip                             1.01              3.31
sex                           Female              Male
smoker                            No                No
day                              Sun               Sun
time                          Dinner            Dinner
size                               2                 2
price_per_person                8.49             11.84
Payer Name        Christy Cunningham  Nathaniel Harris
CC Number           3560325168603410  4676137647685994

Appending rows

.append() method has been deprecated. Use .concat() instead.

row = df.iloc[0]
pd.concat([df, row.to_frame().T]).head()

  total_bill   tip     sex smoker  day  ... size price_per_person          Payer Name         CC Number Payment ID
0      16.99  1.01  Female     No  Sun  ...    2             8.49  Christy Cunningham  3560325168603410    Sun2959
1      10.34  1.66    Male     No  Sun  ...    3             3.45      Douglas Tucker  4478071379779230    Sun4608
2      21.01   3.5    Male     No  Sun  ...    3              7.0      Travis Walters  6011812112971322    Sun4458
3      23.68  3.31    Male     No  Sun  ...    2            11.84    Nathaniel Harris  4676137647685994    Sun5260
4      24.59  3.61  Female     No  Sun  ...    4             6.15        Tonya Carter  4832732618637221    Sun2251

:

[5 rows x 11 columns]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pandas.org

pandas.org

Pandas

Loading pandas

`Series`

Create from `list`

Create from `dict`

Properties of `Series`

Index keys

Broadcasting

`+`, `-`, `/`, `*`

`DataFrame`

Creating a DataFrame

Attributes and methods of a DataFrame

Columns

Indexing

Assign new columns

Rows

Changing the index

Querying based on index

Appending rows

Files

pandas.org

Latest commit

History

pandas.org

File metadata and controls

Pandas

Loading pandas

Series

Create from list

Create from dict

Properties of Series

Index keys

Broadcasting

+, -, /, *

DataFrame

Creating a DataFrame

Attributes and methods of a DataFrame

Columns

Indexing

Assign new columns

Rows

Changing the index

Querying based on index

Appending rows

`Series`

Create from `list`

Create from `dict`

Properties of `Series`

`+`, `-`, `/`, `*`

`DataFrame`