Skip to main content

Pandas Guide


What is pandas?#

If you're getting into machine learning and data science and you're using Python, you're going to use pandas.
Pandas is an open source library which helps you analyse and manipulate data.

Pandas provides a simple to use but very capable set of functions you can use on your data.

It's integrated with many other data science and machine learning tools which use Python so having an understanding of it will be helpful throughout your journey.

One of the main use cases you'll come across is using pandas to transform your data in a way which makes it usable with machine learning algorithms.

Importing pandas#

To get started using pandas, the first step is to import it.
The most common way (and method you should use) is to import pandas as the abbreviation pd.

import pandas as pd

Datatypes#

Pandas has two main datatypes, Series and DataFrame.

  • Series - a 1-dimensional column of data.
  • DataFrame (most common) - a 2-dimesional table of data with rows and columns.

You can create a Series using pd.Series() and passing it a Python list.

# Creating a series of car types
cars = pd.Series(["BMW", "Toyota", "Honda"])
cars
"""
0 BMW
1 Toyota
2 Honda
dtype: object
"""
# Creating a series of colours
colours = pd.Series(["Blue", "Red", "White"])
colours
"""
0 Blue
1 Red
2 White
dtype: object
"""

You can create a DataFrame by using pd.DataFrame() and passing it a Python dictionary.
Let's use our two Series as the values.

# Creating a DataFrame of cars and colours
car_data = pd.DataFrame({"Car type": cars,
"Colour": colours})
car_data
"""
0 BMW Blue
1 Toyota Red
2 Honda White
"""

Anatomy of a DataFrame#

Different functions use different labels for different things. This graphic sums up some of the main components of DataFrame's and their different names.
PandasDataframeAnatomy

Importing data#

Creating Series and DataFrame's from scratch is nice but what you'll usually be doing is importing your data in the form of a .csv (comma separated value) or spreadsheet file.

Pandas allows for easy importing of data like this through functions such as pd.read_csv() and pd.read_excel() (for Microsoft Excel files).

# Import the car sales data and save it to df
df = pd.read_csv("../data/car-sales.csv")
df
"""
Make Colour Odometer (KM) Doors Price
Toyota White 150043 4 $4,000.00
Toyota Blue 32549 3 $7,000.00
Nissan White 213095 4 $3,500.00
"""

Exporting data#

After you've made a few changes to your data, you might want to export it and save it so someone else can access the changes.
Pandas allows you to export DataFrame's to .csv format using .to_csv() or spreadsheet format using .to_excel().

# Export the car sales DataFrame to csv
car_sales.to_csv("../data/exported-car-sales.csv")

Describing data#

One of the first things you'll want to do after you import some data into a pandas DataFrame is to start exploring it.

Pandas has many built in functions which allow you to quickly get information about a DataFrame.

Let's explore some using the car_sales DataFrame.

# .dtypes shows us what datatype each column contains.
car_sales.dtypes
"""
Make object
Colour object
Odometer int64
Doors int64
Price object
dtype: object
"""
# .describe() gives you a quick statistical overview of the numerical columns.
car_sales.describe()
"""
Odometer Doors
count 10.000000 10.000000
mean 78601.400000 4.000000
std 61983.471735 0.471405
min 11179.000000 3.000000
25% 35836.250000 4.000000
50% 57369.000000 4.000000
75% 96384.500000 4.000000
max 213095.000000 5.000000
"""
# .info() shows a handful of useful information about a DataFrame such as:
# How many entries (rows) there are
# Whether there are missing values
# (if a columns non-null value is less than the number of entries, it has missing values)
# The datatypes of each column
car_sales.info()
"""
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
Make 10 non-null object
Colour 10 non-null object
Odometer 10 non-null int64
Doors 10 non-null int64
Price 10 non-null object
dtypes: int64(2), object(3)
"""
# Calling .mean() on a DataFrame
car_sales.mean()
"""
dometer 78601.4
Doors 4.0
dtype: float64
"""
# Calling .mean() on a Series
car_prices = pd.Series([3000, 3500, 11250])
car_prices.mean()
# 5916.666666666667
# Calling .sum() on a DataFrame
car_sales.sum()
"""
Make ToyotaHondaToyotaBMWNissanToyotaHondaHondaToyo...
Colour WhiteRedBlueBlackWhiteGreenBlueBlueWhiteWhite
Odometer 786014
Doors 40
Price $4,000$5,000$7,000$22,000$3,500$4,500$7,500$7,...
dtype: object
"""
# Calling .sum() on a Series
car_prices.sum()
# 17750
# Calling these on a whole DataFrame may not be as helpful as targeting an individual column.
# But it's helpful to know they're there.
# .columns will show you all the columns of a DataFrame.
car_sales.columns
# Index(['Make', 'Colour', 'Odometer', 'Doors', 'Price'], dtype='object')
# Convert df to lsit
list(car_sales.columns)
# ['Make', 'Colour', 'Odometer', 'Doors', 'Price']
# Show the length of a DataFrame
len(car_sales)
# 10

Deleting a column#

dataset.drop('GroupId', axis=1, inplace=True)

Viewing and selecting data#

  • head()
  • tail()
  • loc
  • iloc
  • columns - df['A']
  • boolean indexing - df[df['A'] > 5]
  • crosstab()
  • .plot()
  • hist()
# Show the first 5 rows of car_sales
car_sales.head()
"""
Make Colour Odometer Doors Price
0 Toyota White 150043 4 $4,000
2 Toyota Blue 32549 3 $7,000
4 Nissan White 213095 4 $3,500
"""
# Show the first 7 rows of car_sales
car_sales.head(7)
# Show bottom 5 rows of car_sales
car_sales.tail()
# You can use .loc[] and .iloc[] to select data from your Series and DataFrame's.
animals = pd.Series(["cat", "dog", "bird", "snake", "ox", "lion"],
index=[0, 3, 9, 8, 67, 3])
animals
"""
0 cat
3 dog
9 bird
8 snake
67 ox
3 lion
dtype: object
"""
# .loc[] takes an integer as input.
# And it chooses from your Series or DataFrame whichever index matches the number.
animals.loc[3]
"""
3 dog
3 lion
dtype: object
"""
# iloc[] does a similar thing but works with exact positions.
animals.iloc[3]
# snake
# Get all rows up to position 3
animals.iloc[:3]
"""
0 cat
3 dog
9 bird
dtype: object
"""
# Get all rows up to (and including) index 3
car_sales.loc[:3]
"""
Make Colour Odometer (KM) Doors Price
0 Toyota White 150043 4 $4,000.00
1 Honda Red 87899 4 $5,000.00
2 Toyota Blue 32549 3 $7,000.00
3 BMW Black 11179 5 $22,000.00
"""
# When should you use .loc[] or .iloc[]?
# Use .loc[] when you're referring to indexes.
# Use .iloc[] when you're referring to positions in the DataFrame (index is out of order).
# Select Make column
car_sales['Make']
"""
0 Toyota
1 Honda
2 Toyota
3 BMW
Name: Make, dtype: object
"""
# Select cars with over 100,000 on the Odometer
car_sales[car_sales["Odometer (KM)"] > 100000]
# Select cars which are made by Toyota
car_sales[car_sales["Make"] == "Toyota"]
# pd.crosstab() is a great way to view two different columns together and compare them.
# Compare car Make with number of Doors
pd.crosstab(car_sales["Make"], car_sales["Doors"])
"""
Doors 3 4 5
Make
BMW 0 0 1
Honda 0 3 0
Nissan 0 2 0
Toyota 1 3 0
"""
# If you want to compare more columns in the context of another column, you can use .groupby().
# Group by the Make column and find the mean of the other columns
car_sales.groupby(["Make"]).mean()

Plotting with Pandas#

Pandas even allows for quick plotting of columns so you can see your data visualling.
To plot, you'll have to import matplotlib. If your plots aren't showing, try running the two lines of code below.

%matplotlib inline is a special command which tells Jupyter to show your plots.
Commands with % at the front are called magic commands.

# Import matplotlib and tell Jupyter to show plots
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot') # Change the plot style
# You can visualize a column by calling .plot() on it.
car_sales["Odometer (KM)"].plot()

odometer

# You can see the distribution of a column by calling .hist() on you.
# The distribution of something is a way of describing the spread of different values.
car_sales["Odometer (KM)"].hist()

bar

# Change Price column to integers
car_sales["Price"] = car_sales["Price"].str.replace('[\$\,\.]', '').astype(int)
car_sales["Price"].plot()

price

Plot Types#

There are several plot types built-in to pandas, most of them statistical plots by nature:

  • df.plot.area
  • df.plot.barh
  • df.plot.density
  • df.plot.hist
  • df.plot.line
  • df.plot.scatter
  • df.plot.bar
  • df.plot.box
  • df.plot.hexbin
  • df.plot.kde
  • df.plot.pie

You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms shown in the list above (e.g. 'box','barh', etc..)

# Area Plot
df.plot.area(alpha=0.4)

area_plot1

# Bar Plot
df.plot.bar()

bar_plot1

# Bar Plot (stacked)
df.plot.bar(stacked=True)

bar_plot2

# Histograms Plot
df1['A'].plot.hist(bins=50)

hist

# Line Plot
df1.plot.line(x=df1.index,y='B',figsize=(12,3),lw=1)

line

# Scatter Plot
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm')

scatter

# Scatter Plot
df1.plot.scatter(x='A',y='B',s=df1['C']*200)

scatter

# Box Plot
df2.plot.box() # Can also pass a by= argument for groupby

box

Manipulating data#

You've seen an example of one way to manipulate data but pandas has many more. How many more? Put it this way, if you can imagine it, chances are, pandas can do it.

Let's start with string methods. Because pandas is based on Python, however you can manipulate strings in Python, you can do the same in pandas.

You can access the string value of a column using .str. Knowing this, how do you think you'd set a column to lowercase?

# Lower the Make column
car_sales["Make"].str.lower()
"""
0 toyota
1 honda
2 toyota
3 bmw
4 nissan
5 toyota
6 honda
7 honda
8 toyota
9 nissan
"""
# Set Make column to be lowered
car_sales["Make"] = car_sales["Make"].str.lower()
car_sales.head()
"""
Make Colour Odometer Doors Price
0 toyota White 150043 4 4000
1 honda Red 87899 4 5000
2 toyota Blue 32549 3 7000
3 bmw Black 11179 5 22000
4 nissan White 213095 4 3500
"""
# Fill the Odometer missing values with the mean and with inplace=True ( No need to reassign car_sales )
"""
Make Colour Odometer Doors Price
0 toyota White 150043 4 4000
1 honda Red 87899 4 5000
2 toyota Blue NaN 3 7000
"""
car_sales["Odometer"].fillna(car_sales["Odometer"].mean(),
inplace=True)
"""
Make Colour Odometer Doors Price
0 toyota White 150043 4 4000
1 honda Red 87899 4 5000
2 toyota Blue 92302 3 7000
"""
# Remove missing data
# The following two lines do the same thing
car_sales.dropna(inplace=True) # Operation happens inplace without reassignment
car_sales = car_sales.dropna() # car_sales gets reassigned to same DataFrame but with dropped values
# Adding new columns
# Create a column from a pandas Series
seats_column = pd.Series([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])
car_sales["Seats"] = seats_column
# Create a column from a Python list
engine_sizes = [1.3, 2.0, 3.0, 4.2, 1.6, 1, 2.0, 2.3, 2.0, 3.0]
car_sales["Engine Size"] = engine_sizes
# Column from other columns
car_sales["Price per KM"] = car_sales["Price"] / car_sales["Odometer"]
# You can also create a new column setting all values to a one standard value.
# Column to all 1 value (number of wheels)
car_sales["Number of wheels"] = 4
# Removing columns
# You can remove a column using .drop('COLUMN_NAME', axis=1).
# Drop the Price per KM column
car_sales = car_sales.drop("Price per KM", axis=1)
# Why axis=1? Because that's the axis columns live on. Rows live on axis=0.
# Let's say you wanted to shuffle the order of your DataFrame so
# you could split it into train, validation and test sets.
# And even though the order of your samples was random, you wanted to make sure.
# To do so you could use .sample(frac=1).
# .sample() randomly samples different rows from a DataFrame.
# The frac parameter dictates the fraction, where 1 = 100% of rows, 0.5 = 50% of rows, 0.01 = 1% of rows.
# Sample car_sales
car_sales_sampled = car_sales.sample(frac=1)
# What if you wanted to get the indexes back in order?
# You could do so using .reset_index().
# Reset the indexes of car_sales_sampled
car_sales_sampled.reset_index()
# Finally, what if you wanted to apply a function to a column.
# Such as, converting the Odometer column from kilometers to miles.
# You can do so using the .apply() function and passing it a lambda function.
# We know there's about 1.6 kilometers in a mile, so if you divide the value
# in the Odometer column by 1.6, it should convert it to miles.
# Change the Odometer values from kilometres to miles
car_sales["Odometer"] = car_sales["Odometer"].apply(lambda x: x / 1.6)
# If you've never seen a lambda function they can be tricky. What the line above is saying is "take the value in the Odometer (KM) column (x) and set it to be itself divided by 1.6".
Last updated on