Pandas

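Pandas is usually used for loading, transforming, and dumping data. Pandas is built on NumPy {whose core is implemented in C}, so most operations run in compiled code over typed arrays instead of looping over individual Python objects. This improves speed and reduces the load on Python's garbage collector.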

There are a few key data structures to know when working with Pandas:

  1. Series: A one-dimensional labeled array containing a sequence of values

  2. DataFrame {containers for Series}: Represents a rectangular table of data and contains an ordered collection of columns, each of which can hold a different value type. It has both:
    1. Row Index

    2. Column Index

  3. Panel {containers for DataFrame}: A three-dimensional data structure. Panel was removed in pandas 0.25, so in current versions use a DataFrame with a MultiIndex instead.
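
To make the relationship between Series and DataFrame concrete, here is a minimal sketch. The names and values (ages, people, the city column) are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels.
ages = pd.Series([28, 34, 29], index=["tom", "jack", "steve"])

# A DataFrame: columns of possibly different types sharing one row index.
people = pd.DataFrame(
    {"age": ages, "city": pd.Series(["Oslo", "Lima", "Pune"], index=ages.index)}
)

print(people.index.tolist())    # row index labels
print(people.columns.tolist())  # column index labels
print(type(people["age"]))      # each column is itself a Series
```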

Peeking at data from a CSV file

import pandas as pd

active_cases_and_patents = pd.read_csv("./data/active_case_patents.csv")
active_cases_and_patents.head()

Counting rows in a frame

# shape is a (rows, columns) tuple
print(f"Total case sample: {active_cases_and_patents.shape[0]}")

Series from NumPy arrays

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)

Series from Dictionary

data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
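
When a Series is built from a dictionary, the keys become the index labels. As a small sketch of how that interacts with an explicit index (the label 'd' is chosen here only to show a missing key): passing index reorders the values and fills labels that are absent from the dictionary with NaN.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# Keys are matched against the requested index; 'd' has no key, so it is NaN.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```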

Retrieve multiple elements from Series

s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve multiple elements
print(s[['a','c','d']])

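Create frame from nested list

data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# Cast only 'Age': dtype=float on the constructor would also try to cast 'Name'
df['Age'] = df['Age'].astype(float)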

Create frame from dict of list

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)

Create frame from list of dict

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Both column labels match dictionary keys, so both columns are filled
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# 'b1' matches no dictionary key, so that column is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

Selecting column from frame

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

Appending columns to frame

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series

print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)
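
New columns are aligned on the row index rather than by position. A short sketch of what that means for the frame above, where row 'd' has no value in the new Series:

```python
import pandas as pd

df = pd.DataFrame({"one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
                   "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])})

# The new Series has no entry for row 'd', so that cell becomes NaN.
df["three"] = pd.Series([10, 20, 30], index=["a", "b", "c"])

# NaN propagates through arithmetic on the affected row.
df["four"] = df["one"] + df["three"]

print(df.loc["d"])
```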

Deleting columns

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# using the del statement
print("Deleting the first column using the del statement:")
del df['one']
print(df)

Selecting by row label

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])

Selecting by row integer

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

# for multiple rows
print(df[2:4])
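
The difference between label-based and position-based selection shows up most clearly with slices. A minimal sketch on a small made-up frame: loc includes both endpoints of a label slice, while iloc follows the usual Python rule and excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_label = df.loc["b":"c"]   # label slice: both endpoints included
by_position = df.iloc[1:3]   # position slice: end position excluded

print(by_label)
print(by_position)
```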

Appending rows

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, df2])

Deleting rows

# Drop rows with label 0 {both frames contributed a row labeled 0, so two rows go}
df = df.drop(0)
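
Stacked frames keep their own row labels unless told otherwise, which affects what drop removes. The sketch below contrasts the default with ignore_index=True; the variable names stacked and renumbered are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

stacked = pd.concat([df, df2])                        # index: 0, 1, 0, 1
renumbered = pd.concat([df, df2], ignore_index=True)  # index: 0, 1, 2, 3

print(stacked.drop(0))     # removes both rows labeled 0
print(renumbered.drop(0))  # removes only the first row
```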

Describing Data

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

# Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())

Applying functions to frames

There are three methods of doing this:

  1. Table-wise by using pipe

  2. Row/Column wise using apply

  3. Element wise using applymap {renamed to DataFrame.map in pandas 2.1; applymap is deprecated}

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# column wise (default, axis=0): one mean per column
print(df.apply(np.mean))

# row wise (axis=1): one mean per row
print(df.apply(np.mean,axis=1))

Note

Any time you want to do something row-wise, pass the axis=1 parameter; the default, axis=0, works column-wise.
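
As a rough side-by-side of the three methods listed above, here is a sketch on a tiny frame with fixed values. The helper add_constant is hypothetical, and the element-wise step picks DataFrame.map or applymap depending on what the installed pandas version provides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})

def add_constant(frame, amount):
    """Hypothetical helper: takes and returns a whole frame."""
    return frame + amount

# 1. Table-wise: pipe passes the whole frame as the first argument.
shifted = df.pipe(add_constant, 10)

# 2. Row/column-wise: apply calls the function once per column (or per row).
col_means = df.apply(np.mean)          # one value per column
row_means = df.apply(np.mean, axis=1)  # one value per row

# 3. Element-wise: applymap was renamed to DataFrame.map in pandas 2.1,
#    so use whichever this installation provides.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
doubled = elementwise(lambda x: x * 2)

print(shifted, col_means, row_means, doubled, sep="\n\n")
```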

Iterating

To iterate by key-value pairs {column label, column Series}

df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
# items() replaces iteritems(), which was removed in pandas 2.0
for key,value in df.items():
    print(key,value)

To iterate row-wise

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
    print(row_index,row)

Sorting

By label

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index()
print(sorted_df)

# for descending order use following
sorted_df = unsorted_df.sort_index(ascending=False)

By columns

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index(axis=1)

By value

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')

By multiple columns using by

sorted_df = unsorted_df.sort_values(by=['col1','col2'])
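
When sorting by several columns, each column can have its own direction by passing a list to ascending. A small sketch using the same frame, sorting col1 ascending and breaking ties by col2 descending:

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# One ascending flag per column named in `by`.
sorted_df = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(sorted_df)
```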

Plot a chart distribution

Here value_counts is roughly equivalent to SELECT year, COUNT(*) FROM something GROUP BY year, with the result sorted by count in descending order

%matplotlib inline
import matplotlib.pyplot as plt

ax = active_cases_and_patents['Year'].value_counts().plot.barh()
ax.set_title("Cases count by Year")
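
To see the GROUP BY analogy without the CSV file, the sketch below uses a small made-up frame; only a 'Year' column is assumed. value_counts and groupby(...).size() produce the same counts and differ only in ordering.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data.
cases = pd.DataFrame({"Year": [2015, 2016, 2016, 2017, 2016, 2015]})

counts = cases["Year"].value_counts()   # ordered by count, largest first
grouped = cases.groupby("Year").size()  # ordered by year

# Same numbers once both are ordered by year.
print(counts.sort_index().tolist() == grouped.tolist())
```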

Generating histogram

ax = active_cases_and_patents['Patent Count'].hist(bins=50, range=[1,10])
ax.set_xlabel("Patents per Case")
ax.set_ylabel("Number of Cases")
ax.set_title("Histogram of Patents and Cases")

Filtering data

# top_25_courts is assumed to be a list of court-name strings built earlier;
# keep only the names that do not contain a colon
top_25_courts = [name for name in top_25_courts if ":" not in name]
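
The same kind of filter can be applied to the rows of a frame with a boolean mask. This is only a sketch: the frame, the 'Court' and 'Cases' column names, and the values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame; column names and values are made up.
courts = pd.DataFrame({"Court": ["Delaware", "Texas: Eastern", "California"],
                       "Cases": [120, 95, 80]})

# Build a True/False mask per row, then keep rows where the name has no colon.
has_colon = courts["Court"].str.contains(":", regex=False)
clean = courts[~has_colon]
print(clean)
```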
