Pandas#

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labelled” data both easy and intuitive. Pandas can help you with cleaning, transforming, and analysing you data starting from importing data from files to removing missing values, filtering rows, calculating statistics of the data and much more.

!pip install pandas
Requirement already satisfied: pandas in d:\github\python-for-data-science\venv\lib\site-packages (2.1.0)
Requirement already satisfied: numpy>=1.23.2 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (1.25.2)
Requirement already satisfied: python-dateutil>=2.8.2 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in d:\github\python-for-data-science\venv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip

Series#

The first main data type we will learn about for pandas is the Series data type. Let’s import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object.

Let’s explore this concept through some examples:

import numpy as np
import pandas as pd

Creating a Series#

You can convert a list,numpy array, or dictionary to a Series:

labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using Lists

pd.Series(data=my_list)
0    10
1    20
2    30
dtype: int64
pd.Series(data=my_list,index=labels)
a    10
b    20
c    30
dtype: int64
pd.Series(my_list,labels)
a    10
b    20
c    30
dtype: int64

NumPy Arrays

pd.Series(arr)
0    10
1    20
2    30
dtype: int32
pd.Series(arr,labels)
a    10
b    20
c    30
dtype: int32

Dictionary

pd.Series(d)
a    10
b    20
c    30
dtype: int64

Data in a Series#

A pandas Series can hold a variety of object types:

pd.Series(data=labels)
0    a
1    b
2    c
dtype: object
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

Using an Index#

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).

Let’s see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2:

ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])  
ser1
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])                                   
ser2
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
ser1['USA']
1

Operations are then also done based off of index:

ser1 + ser2
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64

Let’s stop here for now and move on to DataFrames, which will expand on the concept of Series!

DataFrames#

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!

from numpy.random import randn
np.random.seed(101)
randn(5, 4)
array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])
'A B C D E'
'A B C D E'
'A B CD E'.split()
['A', 'B', 'CD', 'E']
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
df_test = pd.DataFrame({'A': [10, 10], 'B': [20, 25], 'C': [30, 35]}, index=['X', 'Y'])
df_test
A B C
X 10 20 30
Y 10 25 35

Selection and Indexing#

Let’s learn the various methods to grab data from a DataFrame

# Pass a list of column names
df[['W','Z']]
W Z
A 0.302665 -1.159119
B -0.134841 0.184502
C 0.807706 0.329646
D -0.497104 0.484752
E -0.116773 1.996652
# SQL Syntax (NOT RECOMMENDED!)
df.W
A    0.302665
B   -0.134841
C    0.807706
D   -0.497104
E   -0.116773
Name: W, dtype: float64

DataFrame Columns are just Series

type(df['W'])
pandas.core.series.Series
type(df)
pandas.core.frame.DataFrame

Creating a new column:

df['new'] = df['W'] + df['Y']
df
W X Y Z new
A 0.302665 1.693723 -1.706086 -1.159119 -1.403420
B -0.134841 0.390528 0.166905 0.184502 0.032064
C 0.807706 0.072960 0.638787 0.329646 1.446493
D -0.497104 -0.754070 -0.943406 0.484752 -1.440510
E -0.116773 1.901755 0.238127 1.996652 0.121354

Removing Columns

df.drop('new',axis=1)
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
df.drop('A', axis=0)
W X Y Z new
B -0.134841 0.390528 0.166905 0.184502 0.032064
C 0.807706 0.072960 0.638787 0.329646 1.446493
D -0.497104 -0.754070 -0.943406 0.484752 -1.440510
E -0.116773 1.901755 0.238127 1.996652 0.121354
# Not inplace, just a view unless specified!
df
W X Y Z new
A 0.302665 1.693723 -1.706086 -1.159119 -1.403420
B -0.134841 0.390528 0.166905 0.184502 0.032064
C 0.807706 0.072960 0.638787 0.329646 1.446493
D -0.497104 -0.754070 -0.943406 0.484752 -1.440510
E -0.116773 1.901755 0.238127 1.996652 0.121354
df.drop('new',axis=1,inplace=True)
df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652

Can also drop rows this way:

df.drop('E',axis=0)
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752

Selecting Rows

df.loc['A']
W    0.302665
X    1.693723
Y   -1.706086
Z   -1.159119
Name: A, dtype: float64

Or select based off of position instead of label

df.iloc[2]
W    0.807706
X    0.072960
Y    0.638787
Z    0.329646
Name: C, dtype: float64

Selecting subset of rows and columns

df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
df.loc['B','Y']
0.16690463609281317
df.loc[['A','B'],['W','Y']]
W Y
A 0.302665 -1.706086
B -0.134841 0.166905

Conditional Selection#

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
df>0
W X Y Z
A True True False False
B False True True True
C True True True True
D False False False True
E False True True True
df[df>0]
W X Y Z
A 0.302665 1.693723 NaN NaN
B NaN 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D NaN NaN NaN 0.484752
E NaN 1.901755 0.238127 1.996652
df[df['W']>0]
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
C 0.807706 0.072960 0.638787 0.329646
df[df['W']>0]['Y']
A   -1.706086
C    0.638787
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y X
A -1.706086 1.693723
C 0.638787 0.072960

For two conditions you can use | and & with parenthesis:

df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
df[(df['W']>0) & (df['Y'] > 1)]
W X Y Z
df[(df['W']>0) | (df['Y'] > 1)]
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
C 0.807706 0.072960 0.638787 0.329646

More Index Details#

Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!

df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
# Reset to default 0,1...n index
df.reset_index()
index W X Y Z
0 A 0.302665 1.693723 -1.706086 -1.159119
1 B -0.134841 0.390528 0.166905 0.184502
2 C 0.807706 0.072960 0.638787 0.329646
3 D -0.497104 -0.754070 -0.943406 0.484752
4 E -0.116773 1.901755 0.238127 1.996652
df
W X Y Z
A 0.302665 1.693723 -1.706086 -1.159119
B -0.134841 0.390528 0.166905 0.184502
C 0.807706 0.072960 0.638787 0.329646
D -0.497104 -0.754070 -0.943406 0.484752
E -0.116773 1.901755 0.238127 1.996652
newind = 'CA NY WY OR CO'.split()
newind
['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = newind
df
W X Y Z States
A 0.302665 1.693723 -1.706086 -1.159119 CA
B -0.134841 0.390528 0.166905 0.184502 NY
C 0.807706 0.072960 0.638787 0.329646 WY
D -0.497104 -0.754070 -0.943406 0.484752 OR
E -0.116773 1.901755 0.238127 1.996652 CO
df.set_index('States')
W X Y Z
States
CA 0.302665 1.693723 -1.706086 -1.159119
NY -0.134841 0.390528 0.166905 0.184502
WY 0.807706 0.072960 0.638787 0.329646
OR -0.497104 -0.754070 -0.943406 0.484752
CO -0.116773 1.901755 0.238127 1.996652
df
W X Y Z States
A 0.302665 1.693723 -1.706086 -1.159119 CA
B -0.134841 0.390528 0.166905 0.184502 NY
C 0.807706 0.072960 0.638787 0.329646 WY
D -0.497104 -0.754070 -0.943406 0.484752 OR
E -0.116773 1.901755 0.238127 1.996652 CO
df.set_index('States',inplace=True)
df
W X Y Z
States
CA 0.302665 1.693723 -1.706086 -1.159119
NY -0.134841 0.390528 0.166905 0.184502
WY 0.807706 0.072960 0.638787 0.329646
OR -0.497104 -0.754070 -0.943406 0.484752
CO -0.116773 1.901755 0.238127 1.996652

Multi-Index and Index Hierarchy#

Let us go over how to work with Multi-Index, first we’ll create a quick example of what a Multi-Indexed DataFrame would look like:

# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
zip(outside, inside)
<zip at 0x18975d1c800>
for i in zip(outside, inside):
    print(i)
('G1', 1)
('G1', 2)
('G1', 3)
('G2', 1)
('G2', 2)
('G2', 3)
hier_index = list(zip(outside,inside))
print(hier_index)
[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
np.random.randn(6,2)
array([[-9.93263500e-01,  1.96799505e-01],
       [-1.13664459e+00,  3.66479606e-04],
       [ 1.02598415e+00, -1.56597904e-01],
       [-3.15791439e-02,  6.49825833e-01],
       [ 2.15484644e+00, -6.10258856e-01],
       [-7.55325340e-01, -3.46418504e-01]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
A B
G1 1 0.147027 -0.479448
2 0.558769 1.024810
3 -0.925874 1.862864
G2 1 -1.133817 0.610478
2 0.386030 2.084019
3 -0.376519 0.230336

Now let’s show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:

df.loc['G1']
A B
1 0.147027 -0.479448
2 0.558769 1.024810
3 -0.925874 1.862864
df.loc['G1'].loc[1]
A    0.147027
B   -0.479448
Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names = ['Gene','Experiment']
df
A B
Gene Experiment
G1 1 0.147027 -0.479448
2 0.558769 1.024810
3 -0.925874 1.862864
G2 1 -1.133817 0.610478
2 0.386030 2.084019
3 -0.376519 0.230336
df.xs('G1')
A B
Experiment
1 0.147027 -0.479448
2 0.558769 1.024810
3 -0.925874 1.862864
df.xs(('G1',1))
A    0.147027
B   -0.479448
Name: (G1, 1), dtype: float64
df.xs(1,level='Experiment')
A B
Gene
G1 0.147027 -0.479448
G2 -1.133817 0.610478

Data Input and Output#

This notebook is the reference code for getting input and output, pandas can read a variety of file types using its pd.read_ methods. Let’s take a look at the most common data types:

Download the following files example and excel_sample to continue with the commands further.

CSV#

CSV Input#

df1 = pd.read_csv('example')
df1
A B
0 0.302665 1.693723
1 -1.706086 -1.159119
2 -0.134841 0.390528
3 0.166905 0.184502
4 0.807706 0.072960
5 0.638787 0.329646

CSV Output#

df.to_csv('example.csv',index=False)

Excel#

Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or images, having images or macros may cause this read_excel method to crash.

Excel Input#

pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
Gene Experiment A B
0 G1 1 0.147027 -0.479448
1 NaN 2 0.558769 1.024810
2 NaN 3 -0.925874 1.862864
3 G2 1 -1.133817 0.610478
4 NaN 2 0.386030 2.084019
5 NaN 3 -0.376519 0.230336

Excel Output#

df
A B
Gene Experiment
G1 1 0.147027 -0.479448
2 0.558769 1.024810
3 -0.925874 1.862864
G2 1 -1.133817 0.610478
2 0.386030 2.084019
3 -0.376519 0.230336
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

Dealing with Missing Data#

Let’s show a few convenient methods to deal with Missing Data in pandas:

df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.dropna()
A B C
0 1.0 5.0 1
df.dropna(axis=1)
C
0 1
1 2
2 3
df.dropna(thresh=2)
A B C
0 1.0 5.0 1
1 2.0 NaN 2
df.dropna(thresh=2, axis=1)
A C
0 1.0 1
1 2.0 2
2 NaN 3
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.fillna(value=2)
A B C
0 1.0 5.0 1
1 2.0 2.0 2
2 2.0 2.0 3
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df['A'].mean()
1.5
df['B'].mean()
5.0
df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

Operations#

There are lots of operations with pandas that will be really useful to you, but don’t fall into any distinct category. Let’s show them here in this notebook:

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Info on Unique Values#

df['col2'].unique()
array([444, 555, 666], dtype=int64)
df['col2'].nunique()
3
df['col2'].value_counts()
col2
444    2
555    1
666    1
Name: count, dtype: int64

Selecting Data#

#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
col1 col2 col3
3 4 444 xyz

Applying Functions#

df['col1']
0    1
1    2
2    3
3    4
Name: col1, dtype: int64
df['col1'].sum()
10
def times2(x):
    return x*2
df['col1'].apply(times2)
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
df['col3']
0    abc
1    def
2    ghi
3    xyz
Name: col3, dtype: object
df['col3'].apply(len)
0    3
1    3
2    3
3    3
Name: col3, dtype: int64

Permanently Removing a Column

del df['col1']
df
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Get column and index names:

df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame:

df
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
df.sort_values(by='col2') #inplace=False by default
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi

Find Null Values or Check for Null Values

df.isnull()
col2 col3
0 False False
1 False False
2 False False
3 False False
# Drop rows with NaN Values
df.dropna()
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Filling in NaN values with something else:

import numpy as np
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()
col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
df.fillna('FILL')
col1 col2 col3
0 1.0 FILL abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 FILL 444.0 xyz
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
df
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN

Groupby#

The groupby method allows you to group rows of data together and call aggregate functions

import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350

Now you can use the .groupby() method to group rows together based off of a column name. For instance let’s group based off of Company. This will create a DataFrameGroupBy object:

df.groupby('Company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001897737D950>

You can save this object as a new variable:

df[['Company', 'Sales']]
Company Sales
0 GOOG 200
1 GOOG 120
2 MSFT 340
3 MSFT 124
4 FB 243
5 FB 350
by_comp = df[['Company', 'Sales']].groupby("Company")

And then call aggregate methods off the object:

by_comp.mean()
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0

More examples of aggregate methods:

by_comp.std()
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
by_comp.min()
Sales
Company
FB 243
GOOG 120
MSFT 124
by_comp.max()
Sales
Company
FB 350
GOOG 200
MSFT 340
by_comp.count()
Sales
Company
FB 2
GOOG 2
MSFT 2
by_comp.describe()
Sales
count mean std min 25% 50% 75% max
Company
FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
by_comp.describe().transpose()
Company FB GOOG MSFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
by_comp.describe().transpose()['GOOG']
Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64

Merging, Joining, and Concatenating#

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.

Example DataFrames#

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])
df1
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
df2
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
df3
A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

Concatenation#

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:

pd.concat([df1,df2,df3])
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
pd.concat([df1,df2,df3],axis=1)
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11

Example DataFrames#

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']}) 
left
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
right
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3

Merging#

The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:

pd.merge(left,right,how='inner',on='key')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

Or to show a more complicated example:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                         'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3']})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
pd.merge(left, right, how='outer', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN

Joining#

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])
left
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
right
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
left.join(right, how='outer')
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
left.join(right, how='inner')
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2