Skip to content

Latest commit

 

History

History
378 lines (325 loc) · 10.9 KB

pandas.org

File metadata and controls

378 lines (325 loc) · 10.9 KB

Pandas

Loading pandas

Series

One dimentional ndarray (N-dimentional array) with axis labels.

  • index: Pandas represents Series as a numerical array. index is stored seperately.

Create from list

years = [1776, 1867, 1821]
countries = ['USA', 'Canada', 'Mexico']
indep = pd.Series(data=years, index=countries)
indep
USA       1776
Canada    1867
Mexico    1821
dtype: int64
indep['USA'] # or indep[0]
1776

Create from dict

pd.Series({ 'Sam': 5, 'Frank': 10, 'Spike': 7 })
Sam       5
Frank    10
Spike     7
dtype: int64

Properties of Series

q1 = { 'Japan': 80, 'China': 450, 'India': 200, 'USA': 250 }
q2 = { 'Brazil': 100,'China': 500, 'India': 210,'USA': 260 }
sales_Q1 = pd.Series(q1)
sales_Q2 = pd.Series(q2)

Index keys

sales_Q2.keys()
Index(['Brazil', 'China', 'India', 'USA'], dtype='object')

Broadcasting

Same as in numpy.

sales_Q1 * 3
Japan     240
China    1350
India     600
USA       750
dtype: int64

+, -, /, *

Pandas uses NaN for missing values.

sales_Q1 + sales_Q2
Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64
  • Use .add() (or other operator methods) to specify a fill_value.
sales_Q1.add(sales_Q2, fill_value=0)
Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64

DataFrame

A group of Series objects that share the same index.

Creating a DataFrame

d = np.random.randint(0, 100, (4, 3))
i = np.array(['CA', 'NY', 'AZ', 'TX'], dtype='U2')
c = np.array(['Jan', 'Feb', 'Mar'], dtype='U3')
df = pd.DataFrame(d, index=i, columns=c)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Jan     4 non-null      int64
 1   Feb     4 non-null      int64
 2   Mar     4 non-null      int64
dtypes: int64(3)
memory usage: 128.0+ bytes

Attributes and methods of a DataFrame

  • columns
df.columns
Index(['Jan', 'Feb', 'Mar'], dtype='object')
  • index
df.index
Index(['CA', 'NY', 'AZ', 'TX'], dtype='object')
  • head, tail First or last 5 rows. Count may be changed.
  • info Gives information about a dataframe.
  • describe Gives basic statistical information on numerical columns.
df.describe()
             Jan        Feb       Mar
count   4.000000   4.000000   4.00000
mean   14.000000  39.500000  44.25000
std     8.406347  31.171568  18.20943
min     5.000000   4.000000  23.00000
25%     9.500000  28.000000  35.75000
50%    13.000000  37.000000  43.50000
75%    17.500000  48.500000  52.00000
max    25.000000  80.000000  67.00000
  • transpose Transpose (or flip) columns and rows.
df.describe().transpose()
     count   mean        std   min    25%   50%   75%   max
Jan    4.0  14.00   8.406347   5.0   9.50  13.0  17.5  25.0
Feb    4.0  39.50  31.171568   4.0  28.00  37.0  48.5  80.0
Mar    4.0  44.25  18.209430  23.0  35.75  43.5  52.0  67.0
  • drop Drops a row or a column depending on the axis.
df.drop('Mar', axis=1) # axis=0 row, axis=1 column
# df.drop('Mar', axis=1, inplace=True)
    Jan  Feb
CA    5    4
NY   25   36
AZ   11   80
TX   15   38

inplace may be used to permenantly delete. However, there are signs it will be deprecated. Assigning the resulting dataframe to itself is more preferable.

Columns

Indexing

  • Grab a single column. Returns a pandas Series.
df['Jan']
CA     5
NY    25
AZ    11
TX    15
Name: Jan, dtype: int64
type(df['Jan'])
<class 'pandas.core.series.Series'>
  • Grab multiple columns.
df[['Jan', 'Feb']]
    Jan  Feb
CA    5    4
NY   25   36
AZ   11   80
TX   15   38
type(df[['Jan', 'Feb']])
<class 'pandas.core.frame.DataFrame'>

Assign new columns

df['Z Feb'] = np.round((df['Feb'] - df['Feb'].mean()) / df['Feb'].std(), 2)
df['Z Feb']
CA   -1.14
NY   -0.11
AZ    1.30
TX   -0.05
Name: Z Feb, dtype: float64

Rows

df = pd.read_csv('tips.csv')
df.head()
   total_bill   tip     sex smoker  day  ... size  price_per_person          Payer Name         CC Number  Payment ID
0       16.99  1.01  Female     No  Sun  ...    2              8.49  Christy Cunningham  3560325168603410     Sun2959
1       10.34  1.66    Male     No  Sun  ...    3              3.45      Douglas Tucker  4478071379779230     Sun4608
2       21.01  3.50    Male     No  Sun  ...    3              7.00      Travis Walters  6011812112971322     Sun4458
3       23.68  3.31    Male     No  Sun  ...    2             11.84    Nathaniel Harris  4676137647685994     Sun5260
4       24.59  3.61  Female     No  Sun  ...    4              6.15        Tonya Carter  4832732618637221     Sun2251

:

[5 rows x 11 columns]

Changing the index

Any primary key may be used as an index.

dfp = df.set_index('Payment ID')
dfp.head()
            total_bill   tip     sex smoker  ... size price_per_person          Payer Name         CC Number
Payment ID                                   ...
Sun2959          16.99  1.01  Female     No  ...    2             8.49  Christy Cunningham  3560325168603410
Sun4608          10.34  1.66    Male     No  ...    3             3.45      Douglas Tucker  4478071379779230
Sun4458          21.01  3.50    Male     No  ...    3             7.00      Travis Walters  6011812112971322
Sun5260          23.68  3.31    Male     No  ...    2            11.84    Nathaniel Harris  4676137647685994
Sun2251          24.59  3.61  Female     No  ...    4             6.15        Tonya Carter  4832732618637221

:

[5 rows x 10 columns]

To reset index.

dfp.reset_index().head()
  Payment ID  total_bill   tip     sex smoker  ...    time size  price_per_person          Payer Name         CC Number
0    Sun2959       16.99  1.01  Female     No  ...  Dinner    2              8.49  Christy Cunningham  3560325168603410
1    Sun4608       10.34  1.66    Male     No  ...  Dinner    3              3.45      Douglas Tucker  4478071379779230
2    Sun4458       21.01  3.50    Male     No  ...  Dinner    3              7.00      Travis Walters  6011812112971322
3    Sun5260       23.68  3.31    Male     No  ...  Dinner    2             11.84    Nathaniel Harris  4676137647685994
4    Sun2251       24.59  3.61  Female     No  ...  Dinner    4              6.15        Tonya Carter  4832732618637221

:

[5 rows x 11 columns]

Querying based on index

  • Integer based indexing.
df.iloc[0]
total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
Payment ID                     Sun2959
Name: 0, dtype: object
  • Label based indexing.
dfp.loc["Sun2959"]
total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
Name: Sun2959, dtype: object
  • Indexing multiple rows.
dfp.iloc[0:4] # pythonic slicing may be used
dfp.loc[['Sun2959', 'Sun5260']].transpose()
Payment ID                   Sun2959           Sun5260
total_bill                     16.99             23.68
tip                             1.01              3.31
sex                           Female              Male
smoker                            No                No
day                              Sun               Sun
time                          Dinner            Dinner
size                               2                 2
price_per_person                8.49             11.84
Payer Name        Christy Cunningham  Nathaniel Harris
CC Number           3560325168603410  4676137647685994

Appending rows

  • .append() method has been deprecated. Use .concat() instead.
row = df.iloc[0]
pd.concat([df, row.to_frame().T]).head()
  total_bill   tip     sex smoker  day  ... size price_per_person          Payer Name         CC Number Payment ID
0      16.99  1.01  Female     No  Sun  ...    2             8.49  Christy Cunningham  3560325168603410    Sun2959
1      10.34  1.66    Male     No  Sun  ...    3             3.45      Douglas Tucker  4478071379779230    Sun4608
2      21.01   3.5    Male     No  Sun  ...    3              7.0      Travis Walters  6011812112971322    Sun4458
3      23.68  3.31    Male     No  Sun  ...    2            11.84    Nathaniel Harris  4676137647685994    Sun5260
4      24.59  3.61  Female     No  Sun  ...    4             6.15        Tonya Carter  4832732618637221    Sun2251

:

[5 rows x 11 columns]