If you're getting into machine learning and data science and you're using Python, you're going to use pandas.
Pandas is an open source library which helps you analyse and manipulate data.
Pandas provides a simple to use but very capable set of functions you can use on your data.
It's integrated with many other Python data science and machine learning tools, so understanding it will be helpful throughout your journey.
One of the main use cases you'll come across is using pandas to transform your data in a way which makes it usable with machine learning algorithms.
To get started using pandas, the first step is to import it.
The most common way (and the method you should use) is to import pandas under the abbreviation pd.
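As a minimal sketch, the import looks like this. Printing the version is only an optional check that the import worked.

```python
import pandas as pd

# Print the installed version to confirm the import worked
print(f"pandas version: {pd.__version__}")
```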
Pandas has two main datatypes, Series and DataFrame.
- Series - a 1-dimensional column of data.
- DataFrame (most common) - a 2-dimensional table of data with rows and columns.
You can create a Series using pd.Series() and passing it a Python list.
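For example, here is a small sketch that builds two Series from Python lists. The variable names and values are made up for illustration.

```python
import pandas as pd

# Two Series built from Python lists (illustrative values)
cars = pd.Series(["BMW", "Toyota", "Honda"])
colours = pd.Series(["Blue", "Red", "White"])

# Each value gets an integer index label starting at 0
print(cars)
print(colours)
```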
You can create a DataFrame by using pd.DataFrame() and passing it a Python dictionary.
Let's use our two Series as the values.
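A sketch of what this could look like, assuming two illustrative Series named cars and colours (the names and values are assumptions, recreated here so the example runs on its own):

```python
import pandas as pd

cars = pd.Series(["BMW", "Toyota", "Honda"])
colours = pd.Series(["Blue", "Red", "White"])

# Dictionary keys become column names, the Series become the column values
car_data = pd.DataFrame({"Car": cars, "Colour": colours})
print(car_data)
```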
Different functions use different labels for different things. This graphic sums up some of the main components of DataFrames and their different names.
Creating Series and DataFrames from scratch is nice but what you'll usually be doing is importing your data in the form of a .csv (comma-separated values) or spreadsheet file.
Pandas allows for easy importing of data like this through functions such as pd.read_csv() and pd.read_excel() (for Microsoft Excel files).
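Here is a hedged sketch of pd.read_csv(). To keep it self-contained, the CSV content is held in a string and wrapped in io.StringIO as a stand-in for a file on disk. The column names, values and the filename in the comment are all made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file you would pass its path,
# e.g. pd.read_csv("car-sales.csv") (hypothetical filename)
csv_text = """Make,Colour,Odometer (KM),Doors,Price
Toyota,White,150043,4,4000
Honda,Red,87899,4,5000
BMW,Black,11179,5,22000
"""

# The first row of the CSV is used as the column names by default
car_sales = pd.read_csv(io.StringIO(csv_text))
print(car_sales)
```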
After you've made a few changes to your data, you might want to export it and save it so someone else can access the changes.
Pandas allows you to export DataFrames to .csv format using .to_csv() or spreadsheet format using .to_excel().
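A minimal export sketch with .to_csv(), using a small made-up DataFrame and a temporary folder so nothing is left behind. The filename is hypothetical. Reading the file back in is just a way to check the export.

```python
import os
import tempfile
import pandas as pd

car_sales = pd.DataFrame({"Make": ["Toyota", "Honda"], "Price": [4000, 5000]})

# Write to a temporary folder so the sketch cleans up after itself
with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = os.path.join(tmp_dir, "exported-car-sales.csv")
    # index=False leaves the row index out of the file
    car_sales.to_csv(export_path, index=False)
    reloaded = pd.read_csv(export_path)

print(reloaded)
```

Without index=False, the index is written as an extra unnamed column, which then shows up when the file is read back in.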
One of the first things you'll want to do after you import some data into a pandas DataFrame is to start exploring it.
Pandas has many built-in functions which allow you to quickly get information about a DataFrame.
Let's explore some using the car_sales DataFrame.
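The sketch below shows a few of the common ones. The car_sales data here is a small made-up stand-in so the example runs on its own.

```python
import pandas as pd

# Illustrative stand-in for the car_sales DataFrame
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW", "Nissan"],
    "Odometer (KM)": [150043, 87899, 11179, 213095],
    "Doors": [4, 4, 5, 4],
})

print(car_sales.dtypes)      # data type of each column
print(car_sales.describe())  # summary statistics for the numeric columns
car_sales.info()             # index range, columns, non-null counts, memory usage
print(car_sales.head(2))     # first two rows
print(len(car_sales))        # number of rows
```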
- columns - df['A']
- boolean indexing - df[df['A'] > 5]
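As a small sketch of both selection styles, assuming a made-up DataFrame df with a numeric column A:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 6, 9, 4], "B": ["w", "x", "y", "z"]})

# Select a single column; the result is a Series
column_a = df["A"]

# df["A"] > 5 produces a Series of True/False values,
# and indexing with it keeps only the rows where it is True
over_five = df[df["A"] > 5]
print(over_five)
```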
Pandas even allows for quick plotting of columns so you can see your data visually.
To plot, you'll have to import matplotlib. If your plots aren't showing, try running the two lines of code below.
%matplotlib inline is a special command which tells Jupyter to show your plots.
Commands with % at the front are called magic commands.
There are several plot types built into pandas, most of them statistical plots by nature:
You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms shown in the list above (e.g. 'box', 'barh', etc.).
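Here is a hedged sketch of the kind argument on a made-up odometer column. The Agg backend is selected only so the example runs outside a notebook. In Jupyter you would skip that line and rely on %matplotlib inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data
car_sales = pd.DataFrame({"Odometer (KM)": [150043, 87899, 11179, 213095, 99213]})

# Same result as car_sales["Odometer (KM)"].plot.hist()
ax = car_sales["Odometer (KM)"].plot(kind="hist")
ax.set_xlabel("Odometer (KM)")
```

Swapping kind="hist" for kind="box" on the same column would draw a box plot instead.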
You've seen an example of one way to manipulate data but pandas has many more. How many more? Put it this way, if you can imagine it, chances are, pandas can do it.
Let's start with string methods. Because pandas is built on Python, most of the ways you can manipulate strings in Python have an equivalent in pandas.
You can access string methods for a column's values using the .str accessor. Knowing this, how do you think you'd set a column to lowercase?
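One possible answer, sketched on a made-up Make column:

```python
import pandas as pd

car_sales = pd.DataFrame({"Make": ["Toyota", "Honda", "BMW"]})

# .str applies a Python-style string method to every value in the column
lowered = car_sales["Make"].str.lower()
print(lowered)

# The call returns a new Series, so reassign it to keep the change
car_sales["Make"] = car_sales["Make"].str.lower()
```

Calling .str.lower() on its own leaves the original column untouched, which is why the last line assigns the result back.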