Pandas: Python Data Analysis Library

Introduction

Pandas is a Python library for data analysis. It provides high-performance, easy-to-use data structures and data analysis tools, all of which revolve around the DataFrame object: a tabular data structure with labeled axes (rows and columns), similar to a spreadsheet, a SQL table, or R's data.frame. Wikipedia describes it as follows:

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Series

Introduction

Data series in Pandas are similar to lists in Python: they are one-dimensional arrays of data, and they can be created from lists or dictionaries. Unlike a plain list, a Series always has an index, whether you specify one or not. Let's examine the code snippet below.

Creating a Series

import pandas as pd
import numpy as np
alist = [1,2,3,4]
s1 = pd.Series(alist)
s1

When you just create a Series object from a list, a default index is created for you, starting from 0 and counting up by 1. Below is the result of the above code snippet.

0    1
1    2
2    3
3    4
dtype: int64

You may want to specify the index yourself. To do this, you can either create a dictionary whose keys become the index values, or pass a list or np.array that is the same size as the data, in matching order. Modifying the above code snippet with a second list of alphabetical characters, you get the following result.

import pandas as pd
import numpy as np
alist = [1,2,3,4]
blist = ['a', 'b', 'c', 'd']
s1 = pd.Series(data=alist, index=blist)
s1
a    1
b    2
c    3
d    4
dtype: int64

You can also create a Series from a dictionary. The keys become the index values, and the values associated with the keys become the data in the series.

mydict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
s3 = pd.Series(mydict)
s3

And you get the same result as before with two lists.

Accessing an Element in a Series

To access a specific value in a Series object, you can use the index value the same way you'd access a value in a dictionary or list.

s1['a']
# Output: 1

Indexing s1 from before by the label 'a' returns 1, just like a dictionary lookup.

Slice a Series

Series are sliced a bit differently than lists. Because the index can hold arbitrary labels, you can't assume that the index values are sequential. So instead of using the : operator, you use two nested brackets, with the inner brackets forming a list of the index values you want to select.

s1[['a', 'c']]

Results in :

a    1
c    3
dtype: int64

Now, is the slice taken a copy or reference? Let's find out by modifying the end of the code from before.

s2 = s1[['a', 'c']]
s2['a'] = 10
s1['a']

The result is 1, so s1 is unaffected: selecting with a list of labels returns a copy of the data rather than a view.

Read & Write Files in Pandas

Pandas I/O API

The pandas I/O API is a set of top-level reader functions, accessed like pd.read_csv(), that generally return a pandas object. It is fairly modular, so over time it has accumulated a large number of specialized readers for various formats.

For example, suppose your manager asks you to analyze some data saved in HTML format. As a data scientist, you would then load the data in your Python program using the appropriate pandas function to read the HTML files and perform the analysis.
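A minimal sketch of that scenario ('report.html' is a hypothetical file name; read_html also requires an HTML parser such as lxml or beautifulsoup4 to be installed):

import pandas as pd

# read_html() parses every <table> element it finds and returns a
# list of DataFrames, one per table.
tables = pd.read_html('report.html')
df = tables[0]  # take the first table on the page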

Some Common File Formats with API References

Function      Description
read_json()   Read JSON files
read_csv()    Read CSV files
read_html()   Read HTML files
read_excel()  Read Excel files
read_sql()    Read from a SQL database
to_json()     Write JSON files
to_csv()      Write CSV files
to_html()     Write HTML files
to_excel()    Write Excel files
to_sql()      Write to a SQL database

(The read_* functions live on the top-level pd module; the to_* counterparts are methods of Series and DataFrame objects. See the pandas IO tools documentation.)

Read & Write JSON Files

Pandas can be used in conjunction with JSON files. JSON is a data format that is commonly used for storage and communication of data between systems, particularly on the internet.

Write Pandas Objects to JSON File

To make it easier to start, let's begin by creating a JSON string.

myseries = pd.Series([1,2,3,4], ['a', 'b', 'c', 'd'])
myseries.to_json(orient='index')
# Output: '{"a":1,"b":2,"c":3,"d":4}'

Using the to_json() method with orient='index', the index values become the keys in the JSON string and the data become the values.

To actually write the JSON string to a file, you can use the to_json() method again, but this time pass a file path as the first positional argument (the path_or_buf parameter).

myseries = pd.Series([1,2,3,4], ['a', 'b', 'c', 'd'])
myseries.to_json('myseries.json', orient='index')

Read JSON File into Pandas Object

Now that there's a file to read, it's possible to read it back into a pandas object. This is done with the pd.read_json() function.

myseries = pd.read_json('myseries.json', orient='index')
myseries

Which results in this nicely formatted table:

   0
a  1
b  2
c  3
d  4

Note that because there isn't an obvious indication of the index in the file, the orient='index' parameter is used again to indicate that the keys are the index. It's also important to note that, since there is no pandas object yet, read_json() is called from the top-level pd module rather than as a method of an existing Series or DataFrame. By default it returns a DataFrame, which is why the result above has a single column named 0.
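To get a Series back instead, matching the object that was written out, you can pass typ='series':

myseries = pd.read_json('myseries.json', typ='series', orient='index')
myseries
# a    1
# b    2
# c    3
# d    4
# dtype: int64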

Further Reading

DataFrame

Basics

A pandas DataFrame is a 2-dimensional labeled data structure where data is organized in rows and columns. DataFrames have four principal components: rows, columns, index, and data, as shown in the annotated screenshot below.

[Annotated screenshot: the components of a dataframe, using the drinks consumption dataset]

Given some data, each row represents an observation and each column a variable. In the example above, the first row contains the drinks consumption for Afghanistan, whereas the second row contains the drinks consumption for Albania.

In a dataframe, columns can be seen as a label for each measurement taken. In the image above, each of the columns describes what data is stored in them; for example, the column continent contains the continent of the country.

The index defines the location, or address, of each data point in the dataframe. Thus, the index can be used to access data in a dataframe. In the image from before, the first column contains the index of the dataframe. Columns also have an index: the column name. Therefore, the entry "Afghanistan" in the first row has row index 0 and column index "country".

Finally, the data is simply the information stored in the dataframe. One particular thing to consider about dataframes is that they can contain different types of data, but a single column can only have one data type. Observe the dataframe above: across the whole dataframe there are integers, floats, and strings, yet the data stored in any single column is always of the same type. For example, all the entries in the column "total_litres_of_pure_alcohol" are of type float.
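A quick way to check this one-type-per-column rule is the dtypes attribute. A minimal sketch, using a couple of made-up rows in the spirit of the drinks data:

import pandas as pd

df = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'beer_servings': [0, 89],
    'total_litres_of_pure_alcohol': [0.0, 4.9]
})
df.dtypes
# country                          object
# beer_servings                     int64
# total_litres_of_pure_alcohol    float64
# dtype: object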

Exploring a DataFrame

Once you have defined or imported your dataframe in your program, it is important that you become familiar with your data. Pandas offer a range of functions to facilitate this.

For example, suppose that you have read the dataframe from the image above and saved it in your code as df. Often, one of the first things you would like to do is visualize the first few rows of your dataframe to see what type of data it contains. This can be achieved by using the head() function, like this:

df.head()
   Unnamed: 0      country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
0           0  Afghanistan              0                0              0                           0.0      Asia
1           1      Albania             89              132             54                           4.9    Europe
2           2      Algeria             25                0             14                           0.7    Africa
3           3      Andorra            245              138            312                          12.4    Europe
4           4       Angola            217               57             45                           5.9    Africa

By default, the head() function returns the first five rows of the dataframe. This number can be changed by passing an integer to the function.

To know the dimensions of your dataframe, i.e. the number of rows and columns, you can use the shape attribute, like this:

df.shape

The output is a tuple with the number of rows as the first element and the number of columns as the second element. For the dataframe above, df.shape returns (193, 7): 193 rows and 7 columns.

To display the data types in each column of your dataframe, you can use the .info() function, like this:

df.info()

This function returns a list of all the columns in your data set and the type of data that each column contains. Here, you can see the data types int64, float64, and object. int64 and float64 describe integer and floating-point columns, respectively. The object data type is used for columns that pandas doesn't recognize as any other specific type; in practice it usually means the column contains strings (or a mix of types).

Create a DataFrame

Using Lists

Let's jump right into an example. It's possible to create a DataFrame object from random data via NumPy. Then to give the columns names, it's possible to pass a list of column names to the columns parameter.

import pandas as pd
import numpy as np
cols = ['A', 'B', 'C', 'D']
my_array = np.random.randint(0, 100, size=(5,4))
print(my_array)
# Output (yours will differ, since no random seed is set):
# [[46 45 46 96]
#  [86 40 98 47]
#  [79 11 92 77]
#  [28 24 49 61]
#  [98 78 42 35]]
df = pd.DataFrame(my_array, columns=cols)
df

The resulting data frame is this nice table:

    A   B   C   D
0  46  45  46  96
1  86  40  98  47
2  79  11  92  77
3  28  24  49  61
4  98  78  42  35

The first positional argument is the data, then any number of optional parameters can be passed, like the columns parameter.

You can even create date ranges using the pd.date_range function to define columns.

import pandas as pd
import numpy as np
cols = pd.date_range('2019-01-01', periods=4, freq='D')
my_array = np.random.randint(0, 100, size=(5,4))
print(my_array)
# Output (illustrative; the same sample values are reused from above):
# [[46 45 46 96]
#  [86 40 98 47]
#  [79 11 92 77]
#  [28 24 49 61]
#  [98 78 42 35]]
df = pd.DataFrame(my_array, columns=cols)
df

The resulting data frame is this nice table:

   2019-01-01  2019-01-02  2019-01-03  2019-01-04
0          46          45          46          96
1          86          40          98          47
2          79          11          92          77
3          28          24          49          61
4          98          78          42          35

Performing Operations on Columns

It's possible to perform operations on columns. For example, returning to the dataframe with columns A through D from before, let's say you want to add a new column that is the sum of two other columns.

df['A + B'] = df['A'] + df['B']
df

The resulting data frame is this nice table:

    A   B   C   D  A + B
0  46  45  46  96     91
1  86  40  98  47    126
2  79  11  92  77     90
3  28  24  49  61     52
4  98  78  42  35    176

Using Dictionaries

There are many situations where defining dataframes using dictionaries is more convenient.

import pandas as pd
contacts = {
    'Name': ['john', 'jane', 'joe'],
    'Email': ['john@mit.edu', 'jane@mit.edu', 'joe@mit.edu'],
    'Course': [1, 2, 3]
}
df = pd.DataFrame(contacts)
df

The keys become the column names, and the values become the columns of data. When each value is a list, the lists must all have the same length, and the i-th elements across the lists line up to form the i-th row.

The resulting dataframe is this nice table:

   Name         Email  Course
0  john  john@mit.edu       1
1  jane  jane@mit.edu       2
2   joe   joe@mit.edu       3

Note: in older versions of Python (before 3.7), the order of keys in a dictionary was not guaranteed, so neither was the column order of the resulting dataframe. In modern Python, insertion order is preserved. Either way, you can fix the column order explicitly by passing a list of column names to the columns parameter.
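For example, to force a specific ordering with the contacts dictionary from above:

df = pd.DataFrame(contacts, columns=['Course', 'Name', 'Email'])
df
#    Course  Name         Email
# 0       1  john  john@mit.edu
# 1       2  jane  jane@mit.edu
# 2       3   joe   joe@mit.edu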

Cleaning Data

Cleaning up NaNs

When dealing with datasets, it's common to have missing values. One way to represent a missing value is with NaN (not a number). Its exact meaning depends on context, but it is generally distinct from null-like values such as None.

Say we create some random data and insert some NaN values.

import numpy as np
import pandas as pd
cols = list('ABCD')
np.random.seed(seed=1)
myArray = np.random.random_sample(size=(5, 4))  # ranf() is a deprecated alias of random_sample()
myArray[2, 3] = np.nan
myArray[3, 2] = np.nan
df = pd.DataFrame(myArray, columns=cols)
df

This results in the following dataframe:

          A         B         C         D
0  0.417022  0.720324  0.000114  0.302333
1  0.146756  0.092339  0.186260  0.345561
2  0.396767  0.538817  0.419195       NaN
3  0.204452  0.878117       NaN  0.670468
4  0.417305  0.558690  0.140387  0.198101

There are several ways of dealing with these missing values, besides just leaving them in and dealing with the missing data later.

Dropping NaNs

The first is to simply drop any axis (row or column) that contains a NaN value.

df.dropna(axis=1, inplace=True)

This results in the following dataframe; the two columns that contained NaN values are gone.

          A         B
0  0.417022  0.720324
1  0.146756  0.092339
2  0.396767  0.538817
3  0.204452  0.878117
4  0.417305  0.558690

The axis parameter specifies whether to drop rows (0) or columns (1). The inplace parameter specifies whether to modify the dataframe in place or return a new dataframe.
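For comparison, here is the row-wise version (assuming df still holds the NaN values from above, i.e. the column drop was not done in place):

df.dropna(axis=0)
#           A         B         C         D
# 0  0.417022  0.720324  0.000114  0.302333
# 1  0.146756  0.092339  0.186260  0.345561
# 4  0.417305  0.558690  0.140387  0.198101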

Filling in Missing Values

An alternative to dropping the NaN values is to fill them in with some other value. This can be done with the fillna method (assume df again holds the original values with NaNs).

df.fillna(value=0)

The value parameter specifies what value to fill in the NaN values with. In this case it is 0, which is cast to a float to match the column dtype.

          A         B         C         D
0  0.417022  0.720324  0.000114  0.302333
1  0.146756  0.092339  0.186260  0.345561
2  0.396767  0.538817  0.419195  0.000000
3  0.204452  0.878117  0.000000  0.670468
4  0.417305  0.558690  0.140387  0.198101

Again, the inplace parameter specifies whether to modify the dataframe in place; without inplace=True, fillna returns a new dataframe and leaves the original unchanged.
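A common refinement, sketched here, is to fill each column's NaNs with a per-column statistic instead of a constant. fillna accepts a Series mapping column names to fill values, which is exactly what df.mean() produces:

# Fill each NaN with the mean of the non-missing values in its column.
df.fillna(value=df.mean())
# The NaN in C becomes ~0.186489 and the NaN in D becomes ~0.379116.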

Categorical Data

Another common type of data is categorical data (which, when the categories have a natural order, is called ordinal data). This is any type of data that isn't inherently numeric: think of a label or a category, like a kind of fruit (an apple, pear, or orange). What is the numeric value of an apple?

Encoding Categorical Data

Although there isn't an inherent numeric value to a category, it is possible to encode the categories as numbers and sometimes those numbers can be meaningful in numerical analysis.

One-hot encoding is a common way to encode categorical data. The phrase "one-hot" comes from the fact that only one of the categories is "hot" or "on" at a time. Conversely, one-cold encoding turns exactly one category "off" at a time. One-hot is used far more often than one-cold, because it tends to make more sense in most contexts.

Within pandas there's a get_dummies function that can be used to one-hot encode categorical data. Let's demonstrate this first by creating a dataframe about fruit types and their price and stock.

import pandas as pd
import numpy as np
df_dict = {
    'fruit': ['Apple', 'Pear', 'Orange', 'Banana'],
    'count': [3, 2, 4, 1],
    'price': [0.5, 0.75, 1.2, 0.65]
}
df = pd.DataFrame(df_dict)
df

Which gets us this dataframe:

    fruit  count  price
0   Apple      3   0.50
1    Pear      2   0.75
2  Orange      4   1.20
3  Banana      1   0.65

Now, if we want to perform numerical analysis on the fruit, the fruit types need to be one-hot encoded. The get_dummies function does this.

df = (pd.get_dummies(df, columns=['fruit'])
        .rename(columns=lambda x: x.replace('fruit_', '')))
df

Which gets us this dataframe:

   count  price  Apple  Banana  Orange  Pear
0      3   0.50      1       0       0     0
1      2   0.75      0       0       0     1
2      4   1.20      0       0       1     0
3      1   0.65      0       1       0     0

Now the fruit types are encoded as numbers and it should be easier to see how the encoding works. A column is made for each possible value of the category. And only one of the columns is "hot" or "on" at a time.

The rest of the code in the previous snippet is just about renaming the columns to make them more readable, using rename. By default, the get_dummies function prefixes each new column name with the original column name and an underscore (fruit_Apple, fruit_Pear, and so on).
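Alternatively, the prefix can be suppressed through get_dummies' own parameters, avoiding the rename step entirely. Note that recent pandas versions (2.0+) return boolean dummy columns by default; passing dtype=int reproduces the 0/1 output shown above:

# Starting again from the original dataframe with the 'fruit' column:
df = pd.get_dummies(df, columns=['fruit'], prefix='', prefix_sep='', dtype=int)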

Time & Date

Times and dates are a common type of data used in many contexts within pandas. Their nature means that special handling is required to make them useful.

Epoch Time

The first important concept to know is epoch time. Epoch time (also known as Unix time) is the number of seconds elapsed since January 1st, 1970, measured in UTC. This is the representation computers typically use to keep track of time: when you see a date printed on a computer, it has usually been calculated, as years, months, days, hours, and so on, from the seconds since the epoch.
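You can see this in pandas directly; to_datetime interprets integers as an offset from the epoch in the given unit:

import pandas as pd

pd.to_datetime(0, unit='s')
# Timestamp('1970-01-01 00:00:00')
pd.to_datetime(1_000_000_000, unit='s')
# Timestamp('2001-09-09 01:46:40')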

UTC

The epoch is defined in terms of UTC, or Coordinated Universal Time. UTC is the single standardized time reference used around the world; each country's time zones are expressed as offsets from UTC.

ISO 8601

The standard format for dates and times is ISO 8601, a standardized string representation of dates and times that computers can parse reliably.

To put it simply, the format is YYYY-MM-DDTHH:MM:SSZ, where:

YYYY is the four-digit year
MM and DD are the two-digit month and day
T separates the date from the time
HH:MM:SS is the time in hours, minutes, and seconds
Z indicates UTC (a numeric offset such as +05:00 may appear instead)
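pandas parses this format out of the box:

pd.to_datetime('2019-01-01T12:30:00Z')
# Timestamp('2019-01-01 12:30:00+0000', tz='UTC')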

Date & Time Index

To work with time-based data, we may need to create a DatetimeIndex object. This is a special type of index used to index data by time. It is special because it can be used to select data by a time range, not just by a single timestamp.

import pandas as pd
import numpy as np
myArray = np.array([1, 2, 3, 4, 5, 6, 7])
mySeries = pd.Series(myArray)
mySeries.index = pd.date_range(start='1/1/19', periods=7)
mySeries.index.dayofweek
# Output: Int64Index([1, 2, 3, 4, 5, 6, 0], dtype='int64')

Note that with dayofweek we can get the day of the week, zero-indexed from Monday: 0 is Monday and 6 is Sunday. Since January 1st, 2019 fell on a Tuesday, the sequence starts at 1.
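One of the conveniences of a DatetimeIndex is selection by date strings, including label slices (endpoints included):

mySeries['2019-01-03']
# 3
mySeries['2019-01-02':'2019-01-04']
# 2019-01-02    2
# 2019-01-03    3
# 2019-01-04    4
# Freq: D, dtype: int64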

PLACEHOLDER FOR ACCESS VIA LOC ILOC AND INDEXING

PLACEHOLDER FOR PIVOTS

Join Dataframes

Why Join Dataframes?

In many real world applications, data is presented to you in separate but related structures. Depending on your needs, pandas provides two ways of combining dataframes: unions & joins.

Union

With a union, you append the rows of one dataframe to the other. The union can be achieved with the concat function.

Suppose you have two dataframes: df1 and df2.

df1 = pd.DataFrame({'student_id': ['S1234', 'S4321'], 'name': ['Daniel', 'John']})
df2 = pd.DataFrame({'student_id': ['S3333', 'S4444'], 'name': ['Mary', 'Jane']})

Which produces these two tables:

  student_id    name
0      S1234  Daniel
1      S4321    John

  student_id  name
0      S3333  Mary
1      S4444  Jane

Since the columns of the two dataframes are the same, we can use the concat function to combine them into a union.

df = pd.concat([df1, df2]).reset_index(drop=True)

Which produces this table:

  student_id    name
0      S1234  Daniel
1      S4321    John
2      S3333    Mary
3      S4444    Jane

The reset_index method is used to reset the index of the combined dataframe. Without it, the result would keep the original indices from both dataframes (0, 1, 0, 1); reset_index(drop=True) replaces them with a fresh sequential index.
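For completeness, concat can also place dataframes side by side by passing axis=1, aligning rows on the index:

pd.concat([df1, df2], axis=1)
#   student_id    name student_id  name
# 0      S1234  Daniel      S3333  Mary
# 1      S4321    John      S4444  Jane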

Join

Sometimes, you may want to combine dataframes on columns that contain common values. This technique is called joining, and is done with the merge function.

When performing a join, the columns containing the common values are called join keys.

The join concept explored here is similar to how joins are done in SQL.

There are four different types of joins: inner, outer, left, and right. Here they will be explained in detail.

Inner Join

An inner join combines two dataframes by matching the join keys, keeping only the rows whose keys appear in both; it is the default join type for merge. Take for example the previous student dataframes. If you wanted to combine them with another dataframe containing last names, with student_id as the join key, you could use an inner join.

df3 = pd.DataFrame({'student_id': ['S1234', 'S4321'], 'last_name': ['Smith', 'Doe']})
df = pd.merge(df, df3, on='student_id')

which results in this table:

  student_id    name last_name
0      S1234  Daniel     Smith
1      S4321    John       Doe

Note that inner joins only keep rows whose join keys match. The students Mary and Jane are not included in the resulting dataframe because their student_ids do not appear in df3.

Outer Join

An outer join matches on the join keys but also keeps all rows from both dataframes. The resulting dataframe will therefore often contain NaN values wherever a key from one dataframe has no match in the other.

df4 = pd.merge(df2, df3, how='outer', on='student_id')
# Remember: df2 holds Mary & Jane; df3 holds only student_id and last_name (Smith, Doe)
df4

which results in this table:

  student_id  name last_name
0      S3333  Mary       NaN
1      S4444  Jane       NaN
2      S1234   NaN     Smith
3      S4321   NaN       Doe

Left Join

A left join matches on the join keys but keeps all rows from the left dataframe only. Columns coming from the right dataframe are filled with NaN wherever a left-hand key has no match.

df5 = pd.merge(df2, df3, how='left', on='student_id')
# Remember: df2 holds Mary & Jane; df3 holds only student_id and last_name (Smith, Doe)
df5

which results in this table:

  student_id  name last_name
0      S3333  Mary       NaN
1      S4444  Jane       NaN

Right Join

A right join matches on the join keys but keeps all rows from the right dataframe only. Columns coming from the left dataframe are filled with NaN wherever a right-hand key has no match.

df6 = pd.merge(df2, df3, how='right', on='student_id')
# Remember: df2 holds Mary & Jane; df3 holds only student_id and last_name (Smith, Doe)
df6

which results in this table:

  student_id name last_name
0      S1234  NaN     Smith
1      S4321  NaN       Doe

Reading and Writing Big Data

Parquet

To read and write large amounts of data, CSV files might not be efficient enough. This is where Apache's Parquet file format comes in. The main difference between traditional storage and the storage offered by the Parquet format lies in the way data is laid out: Parquet stores data by column, hence the term columnar storage format, rather than by row. This offers far more efficient access, because keeping each column's values together keeps data of the same type together, reducing query time.

Pandas has built-in support for reading and writing Parquet files through optimized engines such as pyarrow and fastparquet. For more information about how to use Parquet files with Pandas, read the parquet notes.
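A minimal sketch (assuming one of those engines, e.g. pyarrow, is installed; 'drinks.parquet' is a hypothetical file name):

# Write the dataframe, then read back only the columns needed.
df.to_parquet('drinks.parquet')
subset = pd.read_parquet('drinks.parquet', columns=['country', 'continent'])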

Feather

Another columnar file format supported by Pandas is Feather. Feather is a fast, lightweight, easy-to-use binary format designed for fast reads and writes of large data. Unlike Parquet, Feather is not designed for long-term archival storage and, in its original form, is not compressed. Because its on-disk layout mirrors Apache Arrow's in-memory layout, it can perform much better than Parquet for some tasks.

For more information about how to use Pandas with Feather files, read the notes about Apache Feather.
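A minimal sketch, again with a hypothetical file name (to_feather requires pyarrow and, in older pandas versions, a default integer index):

df.to_feather('drinks.feather')
df2 = pd.read_feather('drinks.feather')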

References

Web Links

Note Links