Seaborn Demo on a complete dataset

Data visualization and exploration using the mtcars dataset

The mtcars dataset was extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

import warnings
warnings.simplefilter(action='ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
mtcars = pd.read_csv('mtcars.csv')
mtcars.head()
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

The data frame consists of the following columns, in addition to the model name:

  1. mpg: miles/gallon

  2. cyl: number of cylinders

  3. disp: displacement in cubic inches

  4. hp: gross horsepower

  5. drat: rear axle ratio

  6. wt: weight of the car in thousands of pounds (wt * 1000 pounds)

  7. qsec: 1/4 mile time in seconds

  8. vs: engine shape (0 - V-shaped, 1 - straight)

  9. am: transmission (1 - manual, 0 - automatic)

  10. gear: number of forward gears

  11. carb: number of carburettors
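
The coded columns are easier to read once the 0/1 values are translated into labels. The sketch below types in the first five rows shown above by hand, so that it runs without mtcars.csv. The column names am_label and vs_label are hypothetical names chosen for this example, and the vs coding (0 = V-shaped, 1 = straight) is taken from the standard mtcars documentation rather than from this page.

```python
import pandas as pd

# First five rows from the table above, typed in by hand so the
# example does not depend on mtcars.csv.
sample = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                  "Hornet 4 Drive", "Hornet Sportabout"],
        "vs": [0, 0, 1, 1, 0],
        "am": [1, 1, 1, 0, 0],
    }
)

# Hypothetical helper columns: translate the 0/1 codes into labels.
sample["am_label"] = sample["am"].map({0: "automatic", 1: "manual"})
sample["vs_label"] = sample["vs"].map({0: "V-shaped", 1: "straight"})

print(sample[["model", "am_label", "vs_label"]])
```

Labelled columns like these can be passed to seaborn in place of the raw codes, so that plot axes and legends show names rather than 0 and 1.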

We will first set the model column as the index for the dataset, so we can access rows using the model names if we want later on.

mtcars.set_index('model', inplace=True)
mtcars.head()
mpg cyl disp hp drat wt qsec vs am gear carb
model
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
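
With the model names as the index, rows can be looked up by label using .loc. The following is a minimal sketch on a hand-typed copy of the five rows shown above (only a few columns are included, and the variable names are illustrative):

```python
import pandas as pd

# Hand-typed subset of the five rows shown above.
sample = pd.DataFrame(
    {
        "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                  "Hornet 4 Drive", "Hornet Sportabout"],
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
        "cyl": [6, 6, 4, 6, 8],
        "wt": [2.620, 2.875, 2.320, 3.215, 3.440],
    }
).set_index("model")

# One full row, returned as a Series.
datsun = sample.loc["Datsun 710"]

# A single cell: row label first, then column label.
mpg_hornet = sample.loc["Hornet 4 Drive", "mpg"]

# Several rows and columns at once, returned as a DataFrame.
subset = sample.loc[["Mazda RX4", "Hornet Sportabout"], ["mpg", "wt"]]

print(datsun)
print(mpg_hornet)
print(subset)
```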

We use the info method to get more information about the columns in the dataset.

mtcars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 3.0+ KB
mtcars.shape
(32, 11)

With this many variables, there are many interesting relationships we can visualize and try to figure out.
First, we can use a basic countplot to see how the cars are distributed over a categorical variable, e.g., do most cars have 6 cylinders, or do most have 4 carburettors?

res = sns.countplot(x='cyl', data=mtcars)
_images/eff8b37e9a77ef75c99ff89bcc04bf32f468e5a90f4eb10561ff3f050de26f0c.png
sns.countplot(y='carb', data=mtcars)
<Axes: xlabel='count', ylabel='carb'>
_images/50ba170e4a5886ae8a83e274cf4f078583e9a20882e7ea8acbafcdb7d278192a.png
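
The bar heights in a countplot are simply the number of rows per category, which pandas can report directly with value_counts. A minimal sketch, using only the five rows displayed earlier (so the counts differ from the full 32-car plots above):

```python
import pandas as pd

# cyl and carb values from the five rows displayed earlier.
sample = pd.DataFrame(
    {
        "cyl": [6, 6, 4, 6, 8],
        "carb": [4, 4, 1, 1, 2],
    }
)

# Number of rows per category, ordered by category value.
cyl_counts = sample["cyl"].value_counts().sort_index()
carb_counts = sample["carb"].value_counts().sort_index()

print(cyl_counts)
print(carb_counts)
```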

We can also easily visualize the proportions of one variable in relation to another.
For example, are cars with 4 gears more likely to have 4 cylinders, and do cars with 3 gears generally have 6 cylinders?

sns.countplot(x='gear', hue='cyl', data=mtcars)
<Axes: xlabel='gear', ylabel='count'>
_images/009dc25e02cd662ba77cdd744acba5c324144352a43cc3d3b1face1ad3157a3b.png
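
The same gear-by-cylinder breakdown can be read as a table with pd.crosstab, which is a convenient way to check the numbers behind a grouped countplot. This sketch again uses only the five rows displayed earlier, so its counts are not those of the full dataset:

```python
import pandas as pd

# gear and cyl values from the five rows displayed earlier.
sample = pd.DataFrame(
    {
        "gear": [4, 4, 4, 3, 3],
        "cyl": [6, 6, 4, 6, 8],
    }
)

# Rows are gear values, columns are cylinder counts.
table = pd.crosstab(sample["gear"], sample["cyl"])

# The same table as proportions within each gear value.
shares = pd.crosstab(sample["gear"], sample["cyl"], normalize="index")

print(table)
print(shares.round(2))
```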

Another simple thing to look at is the distribution of a single variable. Mileage is arguably the most interesting variable for a car, so we will make a histogram of mpg:

sns.histplot(mtcars.mpg, bins=10, color='b')
<Axes: xlabel='mpg', ylabel='Count'>
_images/f103f17b50d0d7ff53534d0a55682809e665c58e8d1a8f48dd2e66f229750ee2.png

We see that most vehicles have a mileage around 17 mpg; however, a few cars provide very high mileage (around 35 mpg).
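
Readings taken from a histogram can be cross-checked numerically. The sketch below prints summary statistics with describe and computes equal-width bin counts with np.histogram, which performs the same kind of binning as a histogram with bins=10. It uses only the five mpg values displayed earlier, so the numbers are illustrative and do not reproduce the plot above.

```python
import numpy as np
import pandas as pd

# mpg values from the five rows displayed earlier.
mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7], name="mpg")

# Count, mean, standard deviation, quartiles, min and max.
print(mpg.describe())

# Ten equal-width bins between the smallest and largest value.
counts, edges = np.histogram(mpg, bins=10)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.2f} to {right:5.2f}: {n}")
```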

So far we have looked at one variable at a time; we can also look at multiple variables and their relationships. For example, suppose we want to figure out whether the number of cylinders affects the mileage. It's a reasonable guess, but how do we visualize it? We can use a scatterplot to plot both variables.

res = sns.scatterplot(data=mtcars, x='cyl', y='mpg')
_images/62b258004f97eef84005398a784c45759a6f8cb4ac8e66b6c345b40957cc7cf2.png

We can see that there is a decreasing trend in general: as the number of cylinders increases, the mileage goes down.
However, since our x axis (cyl) is discrete, the points in the plot above line up in vertical strips, and it's not easy to judge how the points are distributed within each group.
We can visualize this relationship better using a boxplot.

res = sns.catplot(data=mtcars, x='cyl', y='mpg', kind='box')
_images/06ac28afd18f18484d75014fca4fb1448b41b2a50a150d8923db9c3bf51dd02c.png
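
A box plot summarizes each group by its median, quartiles and range, and a groupby gives a tabular view of similar per-group statistics. A minimal sketch on the five rows displayed earlier (too few cars for meaningful conclusions, but enough to show the pattern):

```python
import pandas as pd

# cyl and mpg values from the five rows displayed earlier.
sample = pd.DataFrame(
    {
        "cyl": [6, 6, 4, 6, 8],
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    }
)

# One row per cylinder count, one column per statistic.
summary = sample.groupby("cyl")["mpg"].agg(["count", "median", "min", "max"])

print(summary)
```

Running the same groupby on the full mtcars data frame gives the per-group medians that the box plot draws as the line inside each box.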

The next question could be: does the type of transmission affect the mileage? Is an automatic transmission more efficient?

sns.catplot(x='am', y='mpg', data=mtcars, kind='box')
<seaborn.axisgrid.FacetGrid at 0x16d5ab0c1d0>
_images/6f9dd5aed45396d098ed69139c08e673a55ff48fe110c891163608536c7a5a6a.png

It looks like transmission also affects mileage, but it might not be a very good predictor, since both box plots show a lot of variance and overlap with each other (unlike the case for the number of cylinders).

However, what if we want to know which of the variables affect the mileage?
We can make individual plots of each variable vs the mileage, or we can use seaborn to simplify our lives and just make one pairplot:

sns.pairplot(mtcars)
<seaborn.axisgrid.PairGrid at 0x16d5ab026d0>
_images/138854a577206858208765d868888f216117f0c8295e22bbf3dcdaab247e7bf4.png

With this many variables, the pairplot is too crowded to read relationships from easily.
We can get around this by making a heatmap of the correlations between the variables.
We can then use scatter plots or joint plots to look at the pairs of variables we are interested in.

sns.heatmap(mtcars.corr(), cbar=True, linewidths=0.5, annot=True)
<Axes: >
_images/8c731b71db2635ed17833c5f7f83081aedd58be0a90984fadff9f6ec14c08ed6.png

It looks like mileage has a strong negative correlation with displacement (disp) and weight (wt), both of which are numeric variables. It also has a strong positive correlation with drat, which is again a numeric variable.

sns.scatterplot(x='drat', y='mpg', data=mtcars)
<Axes: xlabel='drat', ylabel='mpg'>
_images/2d6e9f3c06a84fcde8d47af21d0795dd1814a1f36d08334ce2f35db3d2675fd5.png
sns.scatterplot(x='disp', y='mpg', data=mtcars)
<Axes: xlabel='disp', ylabel='mpg'>
_images/f18c73a855d3c15eb104e15ad656398d09076af2e9ea64d19fa69461032f04a4.png
sns.scatterplot(x='wt', y='mpg', data=mtcars)
<Axes: xlabel='wt', ylabel='mpg'>
_images/b8e407a131ddce73294c2026cdbceb00848aea775fb0cc0e0bc66f5d2abccf13.png
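
Instead of reading correlations off the heatmap by eye, the mpg column of the correlation matrix can be sorted directly. The sketch below shows the pattern on the five rows displayed earlier, restricted to four columns; with so few rows, the values will not match the heatmap above, which is computed from all 32 cars.

```python
import pandas as pd

# Four columns from the five rows displayed earlier.
sample = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
        "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
        "drat": [3.90, 3.90, 3.85, 3.08, 3.15],
        "wt": [2.620, 2.875, 2.320, 3.215, 3.440],
    }
)

# Pearson correlation of every other column with mpg,
# sorted from most negative to most positive.
corr_with_mpg = sample.corr()["mpg"].drop("mpg").sort_values()

print(corr_with_mpg)
```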

Now, we can think of running regressions, or using these variables to predict the mileage of a car from its other features.
We will learn more about regression, modelling, and other data science techniques in the next module, when we cover scikit-learn.

As a teaser, given below is a simple visualization of a linear regression of mileage (mpg) on weight (wt).

sns.lmplot(x='wt', y='mpg', data=mtcars)
<seaborn.axisgrid.FacetGrid at 0x16d63698cd0>
_images/b2df1cc1ab8c857a98e0f7174d6ecb59c0068427dfb691476874b37f8fed4ca9.png
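
The plot shows a fitted line but not its coefficients. One way to obtain comparable numbers is an ordinary least-squares fit with np.polyfit, sketched below on the five rows displayed earlier. This is an illustration under that assumption, not necessarily the exact routine seaborn runs internally, and the coefficients from five cars will differ from a fit on the full dataset.

```python
import numpy as np

# wt and mpg values from the five rows displayed earlier.
wt = np.array([2.620, 2.875, 2.320, 3.215, 3.440])
mpg = np.array([21.0, 21.0, 22.8, 21.4, 18.7])

# Degree-1 least-squares fit: mpg is approximated by slope * wt + intercept.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(wt, mpg, deg=1)

# Predicted mileage for a hypothetical car with wt = 3.0.
predicted = slope * 3.0 + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}, predicted={predicted:.2f}")
```

A negative slope means that heavier cars are predicted to have lower mileage, in line with the downward trend in the scatter plot of wt against mpg.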