Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labelled” data both easy and intuitive. Pandas can help you with cleaning, transforming, and analysing you data starting from importing data from files to removing missing values, filtering rows, calculating statistics of the data and much more.
Requirement already satisfied: pandas in d:\github\python-for-data-science\venv\lib\site-packages (2.1.0)
Requirement already satisfied: numpy>=1.23.2 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (1.25.2)
Requirement already satisfied: python-dateutil>=2.8.2 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in d:\github\python-for-data-science\venv\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in d:\github\python-for-data-science\venv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip
Series
The first main data type we will learn about for pandas is the Series data type. Let’s import Pandas and explore the Series object.
A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object.
Let’s explore this concept through some examples:
Creating a Series
You can convert a list,numpy array, or dictionary to a Series:
Using Lists
0 10
1 20
2 30
dtype: int64
a 10
b 20
c 30
dtype: int64
a 10
b 20
c 30
dtype: int64
NumPy Arrays
0 10
1 20
2 30
dtype: int32
a 10
b 20
c 30
dtype: int32
Dictionary
a 10
b 20
c 30
dtype: int64
Data in a Series
A pandas Series can hold a variety of object types:
0 a
1 b
2 c
dtype: object
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
Using an Index
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).
Let’s see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2:
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
USA 1
Germany 2
Italy 5
Japan 4
dtype: int64
Operations are then also done based off of index:
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
Let’s stop here for now and move on to DataFrames, which will expand on the concept of Series!
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!
array([[ 2.70684984, 0.62813271, 0.90796945, 0.50382575],
[ 0.65111795, -0.31931804, -0.84807698, 0.60596535],
[-2.01816824, 0.74012206, 0.52881349, -0.58900053],
[ 0.18869531, -0.75887206, -0.93323722, 0.95505651],
[ 0.19079432, 1.97875732, 2.60596728, 0.68350889]])
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
A |
B |
C |
X |
10 |
20 |
30 |
Y |
10 |
25 |
35 |
Selection and Indexing
Let’s learn the various methods to grab data from a DataFrame
|
W |
Z |
A |
0.302665 |
-1.159119 |
B |
-0.134841 |
0.184502 |
C |
0.807706 |
0.329646 |
D |
-0.497104 |
0.484752 |
E |
-0.116773 |
1.996652 |
A 0.302665
B -0.134841
C 0.807706
D -0.497104
E -0.116773
Name: W, dtype: float64
DataFrame Columns are just Series
pandas.core.series.Series
pandas.core.frame.DataFrame
Creating a new column:
|
W |
X |
Y |
Z |
new |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
-1.403420 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
0.032064 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
1.446493 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
-1.440510 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
0.121354 |
Removing Columns
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
W |
X |
Y |
Z |
new |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
0.032064 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
1.446493 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
-1.440510 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
0.121354 |
|
W |
X |
Y |
Z |
new |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
-1.403420 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
0.032064 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
1.446493 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
-1.440510 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
0.121354 |
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
Can also drop rows this way:
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
Selecting Rows
W 0.302665
X 1.693723
Y -1.706086
Z -1.159119
Name: A, dtype: float64
Or select based off of position instead of label
W 0.807706
X 0.072960
Y 0.638787
Z 0.329646
Name: C, dtype: float64
Selecting subset of rows and columns
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
W |
Y |
A |
0.302665 |
-1.706086 |
B |
-0.134841 |
0.166905 |
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
W |
X |
Y |
Z |
A |
True |
True |
False |
False |
B |
False |
True |
True |
True |
C |
True |
True |
True |
True |
D |
False |
False |
False |
True |
E |
False |
True |
True |
True |
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
NaN |
NaN |
B |
NaN |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
NaN |
NaN |
NaN |
0.484752 |
E |
NaN |
1.901755 |
0.238127 |
1.996652 |
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
A -1.706086
C 0.638787
Name: Y, dtype: float64
|
Y |
X |
A |
-1.706086 |
1.693723 |
C |
0.638787 |
0.072960 |
For two conditions you can use | and & with parenthesis:
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
More Index Details
Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
index |
W |
X |
Y |
Z |
0 |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
1 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
2 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
3 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
4 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
W |
X |
Y |
Z |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
['CA', 'NY', 'WY', 'OR', 'CO']
|
W |
X |
Y |
Z |
States |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
CA |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
NY |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
WY |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
OR |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
CO |
|
W |
X |
Y |
Z |
States |
|
|
|
|
CA |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
NY |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
WY |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
OR |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
CO |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
|
W |
X |
Y |
Z |
States |
A |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
CA |
B |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
NY |
C |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
WY |
D |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
OR |
E |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
CO |
|
W |
X |
Y |
Z |
States |
|
|
|
|
CA |
0.302665 |
1.693723 |
-1.706086 |
-1.159119 |
NY |
-0.134841 |
0.390528 |
0.166905 |
0.184502 |
WY |
0.807706 |
0.072960 |
0.638787 |
0.329646 |
OR |
-0.497104 |
-0.754070 |
-0.943406 |
0.484752 |
CO |
-0.116773 |
1.901755 |
0.238127 |
1.996652 |
Multi-Index and Index Hierarchy
Let us go over how to work with Multi-Index, first we’ll create a quick example of what a Multi-Indexed DataFrame would look like:
('G1', 1)
('G1', 2)
('G1', 3)
('G2', 1)
('G2', 2)
('G2', 3)
[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
)
array([[-9.93263500e-01, 1.96799505e-01],
[-1.13664459e+00, 3.66479606e-04],
[ 1.02598415e+00, -1.56597904e-01],
[-3.15791439e-02, 6.49825833e-01],
[ 2.15484644e+00, -6.10258856e-01],
[-7.55325340e-01, -3.46418504e-01]])
|
|
A |
B |
G1 |
1 |
0.147027 |
-0.479448 |
2 |
0.558769 |
1.024810 |
3 |
-0.925874 |
1.862864 |
G2 |
1 |
-1.133817 |
0.610478 |
2 |
0.386030 |
2.084019 |
3 |
-0.376519 |
0.230336 |
Now let’s show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:
|
A |
B |
1 |
0.147027 |
-0.479448 |
2 |
0.558769 |
1.024810 |
3 |
-0.925874 |
1.862864 |
A 0.147027
B -0.479448
Name: 1, dtype: float64
|
|
A |
B |
Gene |
Experiment |
|
|
G1 |
1 |
0.147027 |
-0.479448 |
2 |
0.558769 |
1.024810 |
3 |
-0.925874 |
1.862864 |
G2 |
1 |
-1.133817 |
0.610478 |
2 |
0.386030 |
2.084019 |
3 |
-0.376519 |
0.230336 |
|
A |
B |
Experiment |
|
|
1 |
0.147027 |
-0.479448 |
2 |
0.558769 |
1.024810 |
3 |
-0.925874 |
1.862864 |
A 0.147027
B -0.479448
Name: (G1, 1), dtype: float64
|
A |
B |
Gene |
|
|
G1 |
0.147027 |
-0.479448 |
G2 |
-1.133817 |
0.610478 |
Dealing with Missing Data
Let’s show a few convenient methods to deal with Missing Data in pandas:
|
A |
B |
C |
0 |
1.0 |
5.0 |
1 |
1 |
2.0 |
NaN |
2 |
2 |
NaN |
NaN |
3 |
|
A |
B |
C |
0 |
1.0 |
5.0 |
1 |
1 |
2.0 |
NaN |
2 |
|
A |
C |
0 |
1.0 |
1 |
1 |
2.0 |
2 |
2 |
NaN |
3 |
|
A |
B |
C |
0 |
1.0 |
5.0 |
1 |
1 |
2.0 |
NaN |
2 |
2 |
NaN |
NaN |
3 |
|
A |
B |
C |
0 |
1.0 |
5.0 |
1 |
1 |
2.0 |
2.0 |
2 |
2 |
2.0 |
2.0 |
3 |
|
A |
B |
C |
0 |
1.0 |
5.0 |
1 |
1 |
2.0 |
NaN |
2 |
2 |
NaN |
NaN |
3 |
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
Operations
There are lots of operations with pandas that will be really useful to you, but don’t fall into any distinct category. Let’s show them here in this notebook:
|
col1 |
col2 |
col3 |
0 |
1 |
444 |
abc |
1 |
2 |
555 |
def |
2 |
3 |
666 |
ghi |
3 |
4 |
444 |
xyz |
Info on Unique Values
array([444, 555, 666], dtype=int64)
col2
444 2
555 1
666 1
Name: count, dtype: int64
Selecting Data
|
col1 |
col2 |
col3 |
3 |
4 |
444 |
xyz |
Applying Functions
0 1
1 2
2 3
3 4
Name: col1, dtype: int64
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
0 abc
1 def
2 ghi
3 xyz
Name: col3, dtype: object
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
Permanently Removing a Column
|
col2 |
col3 |
0 |
444 |
abc |
1 |
555 |
def |
2 |
666 |
ghi |
3 |
444 |
xyz |
Get column and index names:
Index(['col2', 'col3'], dtype='object')
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame:
|
col2 |
col3 |
0 |
444 |
abc |
1 |
555 |
def |
2 |
666 |
ghi |
3 |
444 |
xyz |
|
col2 |
col3 |
0 |
444 |
abc |
3 |
444 |
xyz |
1 |
555 |
def |
2 |
666 |
ghi |
Find Null Values or Check for Null Values
|
col2 |
col3 |
0 |
False |
False |
1 |
False |
False |
2 |
False |
False |
3 |
False |
False |
|
col2 |
col3 |
0 |
444 |
abc |
1 |
555 |
def |
2 |
666 |
ghi |
3 |
444 |
xyz |
Filling in NaN values with something else:
|
col1 |
col2 |
col3 |
0 |
1.0 |
NaN |
abc |
1 |
2.0 |
555.0 |
def |
2 |
3.0 |
666.0 |
ghi |
3 |
NaN |
444.0 |
xyz |
|
col1 |
col2 |
col3 |
0 |
1.0 |
FILL |
abc |
1 |
2.0 |
555.0 |
def |
2 |
3.0 |
666.0 |
ghi |
3 |
FILL |
444.0 |
xyz |
|
A |
B |
C |
D |
0 |
foo |
one |
x |
1 |
1 |
foo |
one |
y |
3 |
2 |
foo |
two |
x |
2 |
3 |
bar |
two |
y |
5 |
4 |
bar |
one |
x |
4 |
5 |
bar |
one |
y |
1 |
|
C |
x |
y |
A |
B |
|
|
bar |
one |
4.0 |
1.0 |
two |
NaN |
5.0 |
foo |
one |
1.0 |
3.0 |
two |
2.0 |
NaN |
Groupby
The groupby method allows you to group rows of data together and call aggregate functions
|
Company |
Person |
Sales |
0 |
GOOG |
Sam |
200 |
1 |
GOOG |
Charlie |
120 |
2 |
MSFT |
Amy |
340 |
3 |
MSFT |
Vanessa |
124 |
4 |
FB |
Carl |
243 |
5 |
FB |
Sarah |
350 |
Now you can use the .groupby() method to group rows together based off of a column name. For instance let’s group based off of Company. This will create a DataFrameGroupBy object:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001897737D950>
You can save this object as a new variable:
|
Company |
Sales |
0 |
GOOG |
200 |
1 |
GOOG |
120 |
2 |
MSFT |
340 |
3 |
MSFT |
124 |
4 |
FB |
243 |
5 |
FB |
350 |
And then call aggregate methods off the object:
|
Sales |
Company |
|
FB |
296.5 |
GOOG |
160.0 |
MSFT |
232.0 |
More examples of aggregate methods:
|
Sales |
Company |
|
FB |
75.660426 |
GOOG |
56.568542 |
MSFT |
152.735065 |
|
Sales |
Company |
|
FB |
243 |
GOOG |
120 |
MSFT |
124 |
|
Sales |
Company |
|
FB |
350 |
GOOG |
200 |
MSFT |
340 |
|
Sales |
Company |
|
FB |
2 |
GOOG |
2 |
MSFT |
2 |
|
Sales |
|
count |
mean |
std |
min |
25% |
50% |
75% |
max |
Company |
|
|
|
|
|
|
|
|
FB |
2.0 |
296.5 |
75.660426 |
243.0 |
269.75 |
296.5 |
323.25 |
350.0 |
GOOG |
2.0 |
160.0 |
56.568542 |
120.0 |
140.00 |
160.0 |
180.00 |
200.0 |
MSFT |
2.0 |
232.0 |
152.735065 |
124.0 |
178.00 |
232.0 |
286.00 |
340.0 |
|
Company |
FB |
GOOG |
MSFT |
Sales |
count |
2.000000 |
2.000000 |
2.000000 |
mean |
296.500000 |
160.000000 |
232.000000 |
std |
75.660426 |
56.568542 |
152.735065 |
min |
243.000000 |
120.000000 |
124.000000 |
25% |
269.750000 |
140.000000 |
178.000000 |
50% |
296.500000 |
160.000000 |
232.000000 |
75% |
323.250000 |
180.000000 |
286.000000 |
max |
350.000000 |
200.000000 |
340.000000 |
Sales count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOG, dtype: float64
Merging, Joining, and Concatenating
There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.
Example DataFrames
|
A |
B |
C |
D |
0 |
A0 |
B0 |
C0 |
D0 |
1 |
A1 |
B1 |
C1 |
D1 |
2 |
A2 |
B2 |
C2 |
D2 |
3 |
A3 |
B3 |
C3 |
D3 |
|
A |
B |
C |
D |
4 |
A4 |
B4 |
C4 |
D4 |
5 |
A5 |
B5 |
C5 |
D5 |
6 |
A6 |
B6 |
C6 |
D6 |
7 |
A7 |
B7 |
C7 |
D7 |
|
A |
B |
C |
D |
8 |
A8 |
B8 |
C8 |
D8 |
9 |
A9 |
B9 |
C9 |
D9 |
10 |
A10 |
B10 |
C10 |
D10 |
11 |
A11 |
B11 |
C11 |
D11 |
Concatenation
Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
|
A |
B |
C |
D |
0 |
A0 |
B0 |
C0 |
D0 |
1 |
A1 |
B1 |
C1 |
D1 |
2 |
A2 |
B2 |
C2 |
D2 |
3 |
A3 |
B3 |
C3 |
D3 |
4 |
A4 |
B4 |
C4 |
D4 |
5 |
A5 |
B5 |
C5 |
D5 |
6 |
A6 |
B6 |
C6 |
D6 |
7 |
A7 |
B7 |
C7 |
D7 |
8 |
A8 |
B8 |
C8 |
D8 |
9 |
A9 |
B9 |
C9 |
D9 |
10 |
A10 |
B10 |
C10 |
D10 |
11 |
A11 |
B11 |
C11 |
D11 |
|
A |
B |
C |
D |
A |
B |
C |
D |
A |
B |
C |
D |
0 |
A0 |
B0 |
C0 |
D0 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
1 |
A1 |
B1 |
C1 |
D1 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
2 |
A2 |
B2 |
C2 |
D2 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
3 |
A3 |
B3 |
C3 |
D3 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
4 |
NaN |
NaN |
NaN |
NaN |
A4 |
B4 |
C4 |
D4 |
NaN |
NaN |
NaN |
NaN |
5 |
NaN |
NaN |
NaN |
NaN |
A5 |
B5 |
C5 |
D5 |
NaN |
NaN |
NaN |
NaN |
6 |
NaN |
NaN |
NaN |
NaN |
A6 |
B6 |
C6 |
D6 |
NaN |
NaN |
NaN |
NaN |
7 |
NaN |
NaN |
NaN |
NaN |
A7 |
B7 |
C7 |
D7 |
NaN |
NaN |
NaN |
NaN |
8 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
A8 |
B8 |
C8 |
D8 |
9 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
A9 |
B9 |
C9 |
D9 |
10 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
A10 |
B10 |
C10 |
D10 |
11 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
A11 |
B11 |
C11 |
D11 |
Example DataFrames
|
key |
A |
B |
0 |
K0 |
A0 |
B0 |
1 |
K1 |
A1 |
B1 |
2 |
K2 |
A2 |
B2 |
3 |
K3 |
A3 |
B3 |
|
key |
C |
D |
0 |
K0 |
C0 |
D0 |
1 |
K1 |
C1 |
D1 |
2 |
K2 |
C2 |
D2 |
3 |
K3 |
C3 |
D3 |
Merging
The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:
|
key |
A |
B |
C |
D |
0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
A1 |
B1 |
C1 |
D1 |
2 |
K2 |
A2 |
B2 |
C2 |
D2 |
3 |
K3 |
A3 |
B3 |
C3 |
D3 |
Or to show a more complicated example:
|
key1 |
key2 |
A |
B |
0 |
K0 |
K0 |
A0 |
B0 |
1 |
K0 |
K1 |
A1 |
B1 |
2 |
K1 |
K0 |
A2 |
B2 |
3 |
K2 |
K1 |
A3 |
B3 |
|
key1 |
key2 |
C |
D |
0 |
K0 |
K0 |
C0 |
D0 |
1 |
K1 |
K0 |
C1 |
D1 |
2 |
K1 |
K0 |
C2 |
D2 |
3 |
K2 |
K0 |
C3 |
D3 |
|
key1 |
key2 |
A |
B |
C |
D |
0 |
K0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
K0 |
A2 |
B2 |
C1 |
D1 |
2 |
K1 |
K0 |
A2 |
B2 |
C2 |
D2 |
|
key1 |
key2 |
A |
B |
C |
D |
0 |
K0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K0 |
K1 |
A1 |
B1 |
NaN |
NaN |
2 |
K1 |
K0 |
A2 |
B2 |
C1 |
D1 |
3 |
K1 |
K0 |
A2 |
B2 |
C2 |
D2 |
4 |
K2 |
K1 |
A3 |
B3 |
NaN |
NaN |
5 |
K2 |
K0 |
NaN |
NaN |
C3 |
D3 |
|
key1 |
key2 |
A |
B |
C |
D |
0 |
K0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
K0 |
A2 |
B2 |
C1 |
D1 |
2 |
K1 |
K0 |
A2 |
B2 |
C2 |
D2 |
3 |
K2 |
K0 |
NaN |
NaN |
C3 |
D3 |
|
key1 |
key2 |
A |
B |
C |
D |
0 |
K0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K0 |
K1 |
A1 |
B1 |
NaN |
NaN |
2 |
K1 |
K0 |
A2 |
B2 |
C1 |
D1 |
3 |
K1 |
K0 |
A2 |
B2 |
C2 |
D2 |
4 |
K2 |
K1 |
A3 |
B3 |
NaN |
NaN |
Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
|
A |
B |
K0 |
A0 |
B0 |
K1 |
A1 |
B1 |
K2 |
A2 |
B2 |
|
C |
D |
K0 |
C0 |
D0 |
K2 |
C2 |
D2 |
K3 |
C3 |
D3 |
|
A |
B |
C |
D |
K0 |
A0 |
B0 |
C0 |
D0 |
K1 |
A1 |
B1 |
NaN |
NaN |
K2 |
A2 |
B2 |
C2 |
D2 |
|
A |
B |
C |
D |
K0 |
A0 |
B0 |
C0 |
D0 |
K1 |
A1 |
B1 |
NaN |
NaN |
K2 |
A2 |
B2 |
C2 |
D2 |
K3 |
NaN |
NaN |
C3 |
D3 |
|
A |
B |
C |
D |
K0 |
A0 |
B0 |
C0 |
D0 |
K2 |
A2 |
B2 |
C2 |
D2 |