Pandas

Pandas#

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labelled” data both easy and intuitive. Pandas can help you with cleaning, transforming, and analysing you data starting from importing data from files to removing missing values, filtering rows, calculating statistics of the data and much more.

!pip install pandas

Requirement already satisfied: pandas in d:\github\python-for-data-science\venv\lib\site-packages (2.1.0)
Requirement already satisfied: numpy>=1.23.2 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (1.25.2)
Requirement already satisfied: python-dateutil>=2.8.2 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in d:\github\python-for-data-science\venv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip

Series#

The first main data type we will learn about for pandas is the Series data type. Let’s import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object.

Let’s explore this concept through some examples:

import numpy as np
import pandas as pd

Creating a Series#

You can convert a list,numpy array, or dictionary to a Series:

labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using Lists

pd.Series(data=my_list)

  10
  20
  30
dtype: int64

pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

NumPy Arrays

pd.Series(arr)

  10
  20
  30
dtype: int32

pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

Dictionary

pd.Series(d)

a    10
b    20
c    30
dtype: int64

Data in a Series#

A pandas Series can hold a variety of object types:

pd.Series(data=labels)

  a
  b
  c
dtype: object

# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

    <built-in function sum>
  <built-in function print>
    <built-in function len>
dtype: object

Using an Index#

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).

Let’s see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2:

ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])  

ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])                                   

ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

ser1['USA']

Operations are then also done based off of index:

ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64

Let’s stop here for now and move on to DataFrames, which will expand on the concept of Series!

DataFrames#

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!

from numpy.random import randn
np.random.seed(101)

randn(5, 4)

array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])

'A B C D E'

'A B C D E'

'A B CD E'.split()

['A', 'B', 'CD', 'E']

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

df_test = pd.DataFrame({'A': [10, 10], 'B': [20, 25], 'C': [30, 35]}, index=['X', 'Y'])
df_test

	A	B	C
X	10	20	30
Y	10	25	35

Selection and Indexing#

Let’s learn the various methods to grab data from a DataFrame

# Pass a list of column names
df[['W','Z']]

	W	Z
A	0.302665	-1.159119
B	-0.134841	0.184502
C	0.807706	0.329646
D	-0.497104	0.484752
E	-0.116773	1.996652

# SQL Syntax (NOT RECOMMENDED!)
df.W

A    0.302665
B   -0.134841
C    0.807706
D   -0.497104
E   -0.116773
Name: W, dtype: float64

DataFrame Columns are just Series

type(df['W'])

pandas.core.series.Series

type(df)

pandas.core.frame.DataFrame

Creating a new column:

df['new'] = df['W'] + df['Y']

df

	W	X	Y	Z	new
A	0.302665	1.693723	-1.706086	-1.159119	-1.403420
B	-0.134841	0.390528	0.166905	0.184502	0.032064
C	0.807706	0.072960	0.638787	0.329646	1.446493
D	-0.497104	-0.754070	-0.943406	0.484752	-1.440510
E	-0.116773	1.901755	0.238127	1.996652	0.121354

Removing Columns

df.drop('new',axis=1)

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

df.drop('A', axis=0)

	W	X	Y	Z	new
B	-0.134841	0.390528	0.166905	0.184502	0.032064
C	0.807706	0.072960	0.638787	0.329646	1.446493
D	-0.497104	-0.754070	-0.943406	0.484752	-1.440510
E	-0.116773	1.901755	0.238127	1.996652	0.121354

# Not inplace, just a view unless specified!
df

	W	X	Y	Z	new
A	0.302665	1.693723	-1.706086	-1.159119	-1.403420
B	-0.134841	0.390528	0.166905	0.184502	0.032064
C	0.807706	0.072960	0.638787	0.329646	1.446493
D	-0.497104	-0.754070	-0.943406	0.484752	-1.440510
E	-0.116773	1.901755	0.238127	1.996652	0.121354

df.drop('new',axis=1,inplace=True)

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

Can also drop rows this way:

df.drop('E',axis=0)

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752

Selecting Rows

df.loc['A']

W    0.302665
X    1.693723
Y   -1.706086
Z   -1.159119
Name: A, dtype: float64

Or select based off of position instead of label

df.iloc[2]

W    0.807706
X    0.072960
Y    0.638787
Z    0.329646
Name: C, dtype: float64

Selecting subset of rows and columns

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

df.loc['B','Y']

0.16690463609281317

df.loc[['A','B'],['W','Y']]

	W	Y
A	0.302665	-1.706086
B	-0.134841	0.166905

Conditional Selection#

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

df>0

	W	X	Y	Z
A	True	True	False	False
B	False	True	True	True
C	True	True	True	True
D	False	False	False	True
E	False	True	True	True

df[df>0]

	W	X	Y	Z
A	0.302665	1.693723	NaN	NaN
B	NaN	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	NaN	NaN	NaN	0.484752
E	NaN	1.901755	0.238127	1.996652

df[df['W']>0]

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
C	0.807706	0.072960	0.638787	0.329646

df[df['W']>0]['Y']

A   -1.706086
C    0.638787
Name: Y, dtype: float64

df[df['W']>0][['Y','X']]

	Y	X
A	-1.706086	1.693723
C	0.638787	0.072960

For two conditions you can use | and & with parenthesis:

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

df[(df['W']>0) & (df['Y'] > 1)]

	W	X	Y	Z

df[(df['W']>0) | (df['Y'] > 1)]

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
C	0.807706	0.072960	0.638787	0.329646

More Index Details#

Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

# Reset to default 0,1...n index
df.reset_index()

	index	W	X	Y	Z
0	A	0.302665	1.693723	-1.706086	-1.159119
1	B	-0.134841	0.390528	0.166905	0.184502
2	C	0.807706	0.072960	0.638787	0.329646
3	D	-0.497104	-0.754070	-0.943406	0.484752
4	E	-0.116773	1.901755	0.238127	1.996652

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

newind = 'CA NY WY OR CO'.split()

newind

['CA', 'NY', 'WY', 'OR', 'CO']

df['States'] = newind

df

	W	X	Y	Z	States
A	0.302665	1.693723	-1.706086	-1.159119	CA
B	-0.134841	0.390528	0.166905	0.184502	NY
C	0.807706	0.072960	0.638787	0.329646	WY
D	-0.497104	-0.754070	-0.943406	0.484752	OR
E	-0.116773	1.901755	0.238127	1.996652	CO

df.set_index('States')

	W	X	Y	Z
States
CA	0.302665	1.693723	-1.706086	-1.159119
NY	-0.134841	0.390528	0.166905	0.184502
WY	0.807706	0.072960	0.638787	0.329646
OR	-0.497104	-0.754070	-0.943406	0.484752
CO	-0.116773	1.901755	0.238127	1.996652

df

	W	X	Y	Z	States
A	0.302665	1.693723	-1.706086	-1.159119	CA
B	-0.134841	0.390528	0.166905	0.184502	NY
C	0.807706	0.072960	0.638787	0.329646	WY
D	-0.497104	-0.754070	-0.943406	0.484752	OR
E	-0.116773	1.901755	0.238127	1.996652	CO

df.set_index('States',inplace=True)

df

	W	X	Y	Z
States
CA	0.302665	1.693723	-1.706086	-1.159119
NY	-0.134841	0.390528	0.166905	0.184502
WY	0.807706	0.072960	0.638787	0.329646
OR	-0.497104	-0.754070	-0.943406	0.484752
CO	-0.116773	1.901755	0.238127	1.996652

Multi-Index and Index Hierarchy#

Let us go over how to work with Multi-Index, first we’ll create a quick example of what a Multi-Indexed DataFrame would look like:

# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]

zip(outside, inside)

<zip at 0x18975d1c800>

for i in zip(outside, inside):
    print(i)

('G1', 1)
('G1', 2)
('G1', 3)
('G2', 1)
('G2', 2)
('G2', 3)

hier_index = list(zip(outside,inside))
print(hier_index)

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

hier_index = pd.MultiIndex.from_tuples(hier_index)

hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

np.random.randn(6,2)

array([[-9.93263500e-01,  1.96799505e-01],
       [-1.13664459e+00,  3.66479606e-04],
       [ 1.02598415e+00, -1.56597904e-01],
       [-3.15791439e-02,  6.49825833e-01],
       [ 2.15484644e+00, -6.10258856e-01],
       [-7.55325340e-01, -3.46418504e-01]])

df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

		A	B
G1	1	0.147027	-0.479448
	2	0.558769	1.024810
	3	-0.925874	1.862864
G2	1	-1.133817	0.610478
	2	0.386030	2.084019
	3	-0.376519	0.230336

Now let’s show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:

df.loc['G1']

	A	B
1	0.147027	-0.479448
2	0.558769	1.024810
3	-0.925874	1.862864

df.loc['G1'].loc[1]

A    0.147027
B   -0.479448
Name: 1, dtype: float64

df.index.names

FrozenList([None, None])

df.index.names = ['Gene','Experiment']

df

		A	B
Gene	Experiment
G1	1	0.147027	-0.479448
	2	0.558769	1.024810
	3	-0.925874	1.862864
G2	1	-1.133817	0.610478
	2	0.386030	2.084019
	3	-0.376519	0.230336

df.xs('G1')

	A	B
Experiment
1	0.147027	-0.479448
2	0.558769	1.024810
3	-0.925874	1.862864

df.xs(('G1',1))

A    0.147027
B   -0.479448
Name: (G1, 1), dtype: float64

df.xs(1,level='Experiment')

	A	B
Gene
G1	0.147027	-0.479448
G2	-1.133817	0.610478

Data Input and Output#

This notebook is the reference code for getting input and output, pandas can read a variety of file types using its pd.read_ methods. Let’s take a look at the most common data types:

Download the following files example and excel_sample to continue with the commands further.

CSV#

CSV Input#

df1 = pd.read_csv('example')
df1

	A	B
0	0.302665	1.693723
1	-1.706086	-1.159119
2	-0.134841	0.390528
3	0.166905	0.184502
4	0.807706	0.072960
5	0.638787	0.329646

CSV Output#

df.to_csv('example.csv',index=False)

Excel#

Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or images, having images or macros may cause this read_excel method to crash.

Excel Input#

pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

	Gene	Experiment	A	B
0	G1	1	0.147027	-0.479448
1	NaN	2	0.558769	1.024810
2	NaN	3	-0.925874	1.862864
3	G2	1	-1.133817	0.610478
4	NaN	2	0.386030	2.084019
5	NaN	3	-0.376519	0.230336

Excel Output#

df

		A	B
Gene	Experiment
G1	1	0.147027	-0.479448
	2	0.558769	1.024810
	3	-0.925874	1.862864
G2	1	-1.133817	0.610478
	2	0.386030	2.084019
	3	-0.376519	0.230336

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

Dealing with Missing Data#

Let’s show a few convenient methods to deal with Missing Data in pandas:

df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

df

	A	B	C
0	1.0	5.0	1
1	2.0	NaN	2
2	NaN	NaN	3

df.dropna()

	A	B	C
0	1.0	5.0	1

df.dropna(axis=1)

	C
0	1
1	2
2	3

df.dropna(thresh=2)

	A	B	C
0	1.0	5.0	1
1	2.0	NaN	2

df.dropna(thresh=2, axis=1)

	A	C
0	1.0	1
1	2.0	2
2	NaN	3

df

	A	B	C
0	1.0	5.0	1
1	2.0	NaN	2
2	NaN	NaN	3

df.fillna(value=2)

	A	B	C
0	1.0	5.0	1
1	2.0	2.0	2
2	2.0	2.0	3

df

	A	B	C
0	1.0	5.0	1
1	2.0	NaN	2
2	NaN	NaN	3

df['A'].mean()

1.5

df['B'].mean()

5.0

df['A'].fillna(value=df['A'].mean())

  1.0
  2.0
  1.5
Name: A, dtype: float64

Operations#

There are lots of operations with pandas that will be really useful to you, but don’t fall into any distinct category. Let’s show them here in this notebook:

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

	col1	col2	col3
0	1	444	abc
1	2	555	def
2	3	666	ghi
3	4	444	xyz

Info on Unique Values#

df['col2'].unique()

array([444, 555, 666], dtype=int64)

df['col2'].nunique()

df['col2'].value_counts()

col2
444    2
555    1
666    1
Name: count, dtype: int64

Selecting Data#

#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]

newdf

	col1	col2	col3
3	4	444	xyz

Applying Functions#

df['col1']

  1
  2
  3
  4
Name: col1, dtype: int64

df['col1'].sum()

def times2(x):
    return x*2

df['col1'].apply(times2)

  2
  4
  6
  8
Name: col1, dtype: int64

df['col3']

  abc
  def
  ghi
  xyz
Name: col3, dtype: object

df['col3'].apply(len)

  3
  3
  3
  3
Name: col3, dtype: int64

Permanently Removing a Column

del df['col1']

df

	col2	col3
0	444	abc
1	555	def
2	666	ghi
3	444	xyz

Get column and index names:

df.columns

Index(['col2', 'col3'], dtype='object')

df.index

RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame:

df

	col2	col3
0	444	abc
1	555	def
2	666	ghi
3	444	xyz

df.sort_values(by='col2') #inplace=False by default

	col2	col3
0	444	abc
3	444	xyz
1	555	def
2	666	ghi

Find Null Values or Check for Null Values

df.isnull()

	col2	col3
0	False	False
1	False	False
2	False	False
3	False	False

# Drop rows with NaN Values
df.dropna()

	col2	col3
0	444	abc
1	555	def
2	666	ghi
3	444	xyz

Filling in NaN values with something else:

import numpy as np

df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()

	col1	col2	col3
0	1.0	NaN	abc
1	2.0	555.0	def
2	3.0	666.0	ghi
3	NaN	444.0	xyz

df.fillna('FILL')

	col1	col2	col3
0	1.0	FILL	abc
1	2.0	555.0	def
2	3.0	666.0	ghi
3	FILL	444.0	xyz

data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

df

	A	B	C	D
0	foo	one	x	1
1	foo	one	y	3
2	foo	two	x	2
3	bar	two	y	5
4	bar	one	x	4
5	bar	one	y	1

df.pivot_table(values='D',index=['A', 'B'],columns=['C'])

	C	x	y
A	B
bar	one	4.0	1.0
bar	two	NaN	5.0
foo	one	1.0	3.0
foo	two	2.0	NaN

Groupby#

The groupby method allows you to group rows of data together and call aggregate functions

import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

df = pd.DataFrame(data)

df

	Company	Person	Sales
0	GOOG	Sam	200
1	GOOG	Charlie	120
2	MSFT	Amy	340
3	MSFT	Vanessa	124
4	FB	Carl	243
5	FB	Sarah	350

Now you can use the .groupby() method to group rows together based off of a column name. For instance let’s group based off of Company. This will create a DataFrameGroupBy object:

df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001897737D950>

You can save this object as a new variable:

df[['Company', 'Sales']]

	Company	Sales
0	GOOG	200
1	GOOG	120
2	MSFT	340
3	MSFT	124
4	FB	243
5	FB	350

by_comp = df[['Company', 'Sales']].groupby("Company")

And then call aggregate methods off the object:

by_comp.mean()

	Sales
Company
FB	296.5
GOOG	160.0
MSFT	232.0

More examples of aggregate methods:

by_comp.std()

	Sales
Company
FB	75.660426
GOOG	56.568542
MSFT	152.735065

by_comp.min()

	Sales
Company
FB	243
GOOG	120
MSFT	124

by_comp.max()

	Sales
Company
FB	350
GOOG	200
MSFT	340

by_comp.count()

	Sales
Company
FB	2
GOOG	2
MSFT	2

by_comp.describe()

	Sales
	count	mean	std	min	25%	50%	75%	max
Company
FB	2.0	296.5	75.660426	243.0	269.75	296.5	323.25	350.0
GOOG	2.0	160.0	56.568542	120.0	140.00	160.0	180.00	200.0
MSFT	2.0	232.0	152.735065	124.0	178.00	232.0	286.00	340.0

by_comp.describe().transpose()

	Company	FB	GOOG	MSFT
Sales	count	2.000000	2.000000	2.000000
	mean	296.500000	160.000000	232.000000
	std	75.660426	56.568542	152.735065
	min	243.000000	120.000000	124.000000
	25%	269.750000	140.000000	178.000000
	50%	296.500000	160.000000	232.000000
	75%	323.250000	180.000000	286.000000
	max	350.000000	200.000000	340.000000

by_comp.describe().transpose()['GOOG']

Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64

Merging, Joining, and Concatenating#

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.

Example DataFrames#

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

df1

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
2	A2	B2	C2	D2
3	A3	B3	C3	D3

df2

	A	B	C	D
4	A4	B4	C4	D4
5	A5	B5	C5	D5
6	A6	B6	C6	D6
7	A7	B7	C7	D7

df3

	A	B	C	D
8	A8	B8	C8	D8
9	A9	B9	C9	D9
10	A10	B10	C10	D10
11	A11	B11	C11	D11

Concatenation#

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:

pd.concat([df1,df2,df3])

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
2	A2	B2	C2	D2
3	A3	B3	C3	D3
4	A4	B4	C4	D4
5	A5	B5	C5	D5
6	A6	B6	C6	D6
7	A7	B7	C7	D7
8	A8	B8	C8	D8
9	A9	B9	C9	D9
10	A10	B10	C10	D10
11	A11	B11	C11	D11

pd.concat([df1,df2,df3],axis=1)

	A	B	C	D	A	B	C	D	A	B	C	D
0	A0	B0	C0	D0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	A1	B1	C1	D1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	A2	B2	C2	D2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	A3	B3	C3	D3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	A4	B4	C4	D4	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	A5	B5	C5	D5	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	A6	B6	C6	D6	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	A7	B7	C7	D7	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A8	B8	C8	D8
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A9	B9	C9	D9
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A10	B10	C10	D10
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A11	B11	C11	D11

Example DataFrames#

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']}) 

left

	key	A	B
0	K0	A0	B0
1	K1	A1	B1
2	K2	A2	B2
3	K3	A3	B3

right

	key	C	D
0	K0	C0	D0
1	K1	C1	D1
2	K2	C2	D2
3	K3	C3	D3

Merging#

The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:

pd.merge(left,right,how='inner',on='key')

	key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K1	A1	B1	C1	D1
2	K2	A2	B2	C2	D2
3	K3	A3	B3	C3	D3

Or to show a more complicated example:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                         'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3']})

left

	key1	key2	A	B
0	K0	K0	A0	B0
1	K0	K1	A1	B1
2	K1	K0	A2	B2
3	K2	K1	A3	B3

right

	key1	key2	C	D
0	K0	K0	C0	D0
1	K1	K0	C1	D1
2	K1	K0	C2	D2
3	K2	K0	C3	D3

pd.merge(left, right, on=['key1', 'key2'])

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K1	K0	A2	B2	C1	D1
2	K1	K0	A2	B2	C2	D2

pd.merge(left, right, how='outer', on=['key1', 'key2'])

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN
5	K2	K0	NaN	NaN	C3	D3

pd.merge(left, right, how='right', on=['key1', 'key2'])

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K1	K0	A2	B2	C1	D1
2	K1	K0	A2	B2	C2	D2
3	K2	K0	NaN	NaN	C3	D3

pd.merge(left, right, how='left', on=['key1', 'key2'])

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN

Joining#

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

left

	A	B
K0	A0	B0
K1	A1	B1
K2	A2	B2

right

	C	D
K0	C0	D0
K2	C2	D2
K3	C3	D3

left.join(right)

	A	B	C	D
K0	A0	B0	C0	D0
K1	A1	B1	NaN	NaN
K2	A2	B2	C2	D2

left.join(right, how='outer')

	A	B	C	D
K0	A0	B0	C0	D0
K1	A1	B1	NaN	NaN
K2	A2	B2	C2	D2
K3	NaN	NaN	C3	D3

left.join(right, how='inner')

	A	B	C	D
K0	A0	B0	C0	D0
K2	A2	B2	C2	D2

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
2	A2	B2	C2	D2
3	A3	B3	C3	D3
4	A4	B4	C4	D4
5	A5	B5	C5	D5
6	A6	B6	C6	D6
7	A7	B7	C7	D7
8	A8	B8	C8	D8
9	A9	B9	C9	D9
10	A10	B10	C10	D10
11	A11	B11	C11	D11

	A	B	C	D	A	B	C	D	A	B	C	D
0	A0	B0	C0	D0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	A1	B1	C1	D1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	A2	B2	C2	D2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	A3	B3	C3	D3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	A4	B4	C4	D4	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	A5	B5	C5	D5	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	A6	B6	C6	D6	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	A7	B7	C7	D7	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A8	B8	C8	D8
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A9	B9	C9	D9
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A10	B10	C10	D10
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A11	B11	C11	D11

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN
5	K2	K0	NaN	NaN	C3	D3

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
2	A2	B2	C2	D2
3	A3	B3	C3	D3
4	A4	B4	C4	D4
5	A5	B5	C5	D5
6	A6	B6	C6	D6
7	A7	B7	C7	D7
8	A8	B8	C8	D8
9	A9	B9	C9	D9
10	A10	B10	C10	D10
11	A11	B11	C11	D11

	A	B	C	D	A	B	C	D	A	B	C	D
0	A0	B0	C0	D0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	A1	B1	C1	D1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	A2	B2	C2	D2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	A3	B3	C3	D3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	A4	B4	C4	D4	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	A5	B5	C5	D5	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	A6	B6	C6	D6	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	A7	B7	C7	D7	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A8	B8	C8	D8
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A9	B9	C9	D9
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A10	B10	C10	D10
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A11	B11	C11	D11

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN
5	K2	K0	NaN	NaN	C3	D3

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN

Pandas

Contents

Pandas#

Series#

Creating a Series#

Data in a Series#

Using an Index#

DataFrames#

Selection and Indexing#

Conditional Selection#

More Index Details#

Multi-Index and Index Hierarchy#

Data Input and Output#

CSV#

CSV Input#

CSV Output#

Excel#

Excel Input#

Excel Output#

Dealing with Missing Data#

Operations#

Info on Unique Values#

Selecting Data#

Applying Functions#

Groupby#

Merging, Joining, and Concatenating#

Example DataFrames#

Concatenation#

Example DataFrames#

Merging#

Joining#

	A	B	C	D
0	A0	B0	C0	D0
1	A1	B1	C1	D1
2	A2	B2	C2	D2
3	A3	B3	C3	D3
4	A4	B4	C4	D4
5	A5	B5	C5	D5
6	A6	B6	C6	D6
7	A7	B7	C7	D7
8	A8	B8	C8	D8
9	A9	B9	C9	D9
10	A10	B10	C10	D10
11	A11	B11	C11	D11

	A	B	C	D	A	B	C	D	A	B	C	D
0	A0	B0	C0	D0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	A1	B1	C1	D1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	A2	B2	C2	D2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	A3	B3	C3	D3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	A4	B4	C4	D4	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	A5	B5	C5	D5	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	A6	B6	C6	D6	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	A7	B7	C7	D7	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A8	B8	C8	D8
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A9	B9	C9	D9
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A10	B10	C10	D10
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A11	B11	C11	D11

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN
5	K2	K0	NaN	NaN	C3	D3

	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN