Pandas¶
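Pandas is usually used for loading, transforming, and dumping data. Pandas is built on NumPy (whose core is implemented in C). NumPy stores values in contiguous typed buffers instead of as individual Python objects, which improves speed and leaves far fewer objects for Python's memory management and garbage collector to track.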
There are some key data structures when working with Pandas:
- Series: A one-dimensional labeled array containing a sequence of values
- DataFrame (a container for Series): Represents a rectangular table of data and contains an ordered collection of columns, each of which can hold a different value type. It has both:
  - Row Index
  - Column Index
- Panel (a container for DataFrame): A three-dimensional data structure. Panel was removed in pandas 0.25; a DataFrame with a MultiIndex is the usual substitute.
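To make the "container for Series" idea concrete, here is a minimal sketch. The names `ages`, `cities`, and `people` and their values are made up for illustration.

```python
import pandas as pd

# Two Series sharing the same row labels (illustrative data)
ages = pd.Series([28, 34, 29], index=["tom", "jack", "steve"])
cities = pd.Series(["Oslo", "Lima", "Pune"], index=["tom", "jack", "steve"])

# A DataFrame built from them: each column is one Series
people = pd.DataFrame({"age": ages, "city": cities})

row_labels = list(people.index)       # the row index
column_labels = list(people.columns)  # the column index
age_column = people["age"]            # selecting a column returns a Series

print(row_labels)
print(column_labels)
print(type(age_column).__name__)
```

Selecting a single column returns a Series that carries the same row index as the frame.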
Peeking data from CSV file¶
import pandas as pd
active_cases_and_patents = pd.read_csv("./data/active_case_patents.csv")
active_cases_and_patents.head()
Counting rows in frame¶
print(f"Total case sample:{active_cases_and_patents.shape[0]}")
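The CSV file used above is not included here, so the following sketch reads from an in-memory string instead. The column names and values are assumptions chosen only to resemble that data set. `read_csv`, `head`, and `shape` behave the same way on a real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; column names and rows are assumptions
csv_text = """Case,Year,Patent Count
A-1,2015,2
A-2,2015,1
A-3,2016,4
"""

cases = pd.read_csv(io.StringIO(csv_text))

preview = cases.head(2)        # first two rows only
n_rows, n_cols = cases.shape   # shape is a (rows, columns) tuple

print(preview)
print(f"rows={n_rows}, columns={n_cols}")
```

`len(cases)` gives the same row count as `cases.shape[0]`.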
Series from NumPy arrays¶
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
Series from Dictionary¶
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
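When a Series is built from a dictionary, the keys become the index labels. As a small sketch, passing an explicit `index` selects and reorders values by key, and a label with no matching key becomes NaN:

```python
import pandas as pd

data = {"a": 0.0, "b": 1.0, "c": 2.0}

# Without an index, the dictionary keys become the labels
s_default = pd.Series(data)

# With an index, values are looked up by key; 'd' has no match and becomes NaN
s_reindexed = pd.Series(data, index=["b", "c", "d", "a"])

print(s_default)
print(s_reindexed)
```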
Retrieve multiple elements from Series¶
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve multiple elements
print(s[['a','c','d']])
Create frame from nested list¶
data = [['Alex',10],['Bob',12],['Clarke',13]]
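df = pd.DataFrame(data,columns=['Name','Age'])
# Convert only the numeric column; dtype=float on the whole frame fails on the 'Name' strings in recent pandas versions
df['Age'] = df['Age'].astype(float)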
Create frame from dict of list¶
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
Create frame from list of dict¶
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices, one of which ('b1') is not a dictionary key, so that column is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
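The following sketch shows how missing keys are handled when a frame is built from a list of dictionaries. The variable names `df_all` and `df_renamed` are illustrative.

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

# Columns are the union of all keys; the first row has no 'c', so it gets NaN
df_all = pd.DataFrame(data)

# Requesting a column that is not a key ('b1') yields a column of NaN
df_renamed = pd.DataFrame(data, columns=["a", "b1"])

print(df_all)
print(df_renamed)
```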
Selecting column from frame¶
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
Appending columns to frame¶
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object with column label by passing new series
print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)
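Assigning a Series as a new column aligns it on the row index rather than on position. A minimal sketch with illustrative values: rows the Series does not cover become NaN, and NaN carries through column arithmetic.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# The new Series has no entry for 'd', and its labels are in a different order
df["three"] = pd.Series([30, 10, 20], index=["c", "a", "b"])

# Arithmetic between columns works row by row; NaN propagates to the result
df["four"] = df["one"] + df["three"]

print(df)
```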
Deleting columns¶
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)
# using the del statement
print("Deleting the first column using the del statement:")
del df['one']
print(df)
Selecting by row label¶
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Selecting by row integer¶
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
# for multiple rows, slice by position (the end position is excluded)
print(df.iloc[2:4])
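Label-based and position-based slices treat the end point differently. The sketch below, using illustrative data, selects the same two rows both ways.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_position = df.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = df.loc["b":"c"]   # labels 'b' through 'c'; the end label is included
single_row = df.iloc[2]      # a single row comes back as a Series

print(by_position)
print(by_label)
print(single_row)
```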
Appending rows¶
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, df2])
Deleting rows¶
# Drop rows with label 0
df = df.drop(0)
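Combining two frames keeps each frame's original row labels, so labels can repeat, and `drop` removes every row with a matching label. The sketch below contrasts that with `ignore_index=True`, which renumbers the rows. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

kept_labels = pd.concat([df, df2])                      # labels 0, 1, 0, 1
fresh_labels = pd.concat([df, df2], ignore_index=True)  # labels 0, 1, 2, 3

dropped_dupes = kept_labels.drop(0)   # removes both rows labelled 0
dropped_one = fresh_labels.drop(0)    # removes only the first row

print(dropped_dupes)
print(dropped_one)
```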
Describing Data¶
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
# Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
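By default `describe` summarises only the numeric columns. A short sketch on a smaller, illustrative frame shows the statistics it reports and how `include="all"` brings in non-numeric columns as well:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky", "Vin"],
    "Age": [25, 26, 25, 23],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # also covers the Name column

print(numeric_summary)
print(full_summary)
```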
Applying functions to frames¶
There are three methods of doing this:
- Table-wise, using pipe
- Row- or column-wise, using apply
- Element-wise, using applymap (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated)
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# column wise
print(df.apply(np.mean))
# row wise
print(df.apply(np.mean,axis=1))
Note
Any time you want to do something row-wise, you need to pass the axis=1 parameter
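The three approaches can be compared side by side in a small sketch. `add_constant` is a hypothetical helper written for this example. The element-wise call picks `DataFrame.map` when it exists and falls back to `applymap` otherwise, which assumes one of the two is available in the installed pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})

def add_constant(frame, amount):
    """Hypothetical table-wise helper: receives the whole DataFrame."""
    return frame + amount

shifted = df.pipe(add_constant, 100)    # table-wise
column_means = df.apply(np.mean)        # column-wise: one value per column
row_means = df.apply(np.mean, axis=1)   # row-wise: one value per row

# Element-wise: DataFrame.map in pandas 2.1+, applymap in older versions
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
squared = elementwise(lambda x: x ** 2)

print(shifted)
print(column_means)
print(row_means)
print(squared)
```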
Iterating¶
To iterate by key-value pairs (column label, column)
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
# items() replaces iteritems(), which was removed in pandas 2.0
for key,value in df.items():
    print(key,value)
To iterate row-wise
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
    print(row_index,row)
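As a sketch of what each iterator yields, the example below totals a small illustrative frame both ways. For simple aggregations like this, a vectorised call such as `df.sum(axis=1)` gives the same totals without an explicit loop.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=["x", "y"])

# items() walks the columns: (column label, column as a Series)
column_totals = {label: int(column.sum()) for label, column in df.items()}

# iterrows() walks the rows: (row label, row as a Series)
row_totals = {label: int(row.sum()) for label, row in df.iterrows()}

# Vectorised equivalent of the row loop
vectorised_row_totals = df.sum(axis=1).to_dict()

print(column_totals)
print(row_totals)
```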
Sorting¶
By label
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
sorted_df=unsorted_df.sort_index()
print(sorted_df)
# for descending order use following
sorted_df = unsorted_df.sort_index(ascending=False)
By columns
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
sorted_df=unsorted_df.sort_index(axis=1)
By value
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
By multiple columns using by
sorted_df = unsorted_df.sort_values(by=['col1','col2'])
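When sorting by several columns, `ascending` can also take a list so that each column gets its own direction. A minimal sketch using the same small frame:

```python
import pandas as pd

unsorted_df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})

# col1 ascending, then col2 descending among rows with equal col1
sorted_df = unsorted_df.sort_values(by=["col1", "col2"], ascending=[True, False])

print(sorted_df)
```

The original row labels travel with the rows, so the index of the result shows where each row came from.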
Plot a chart distribution¶
Here value_counts is roughly equivalent to SELECT Year, COUNT(*) FROM something GROUP BY Year, with the result sorted by count in descending order
%matplotlib inline
import matplotlib.pyplot as plt
ax = active_cases_and_patents['Year'].value_counts().plot.barh()
ax.set_title("Cases count by Year")
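To illustrate the GROUP BY comparison without plotting, the sketch below counts a made-up `Year` column in two ways. The data is illustrative. The two results hold the same counts and differ only in ordering.

```python
import pandas as pd

# Illustrative data; the real frame is loaded from the CSV file
cases = pd.DataFrame({"Year": [2015, 2016, 2016, 2017, 2016, 2015]})

counts = cases["Year"].value_counts()    # ordered by count, largest first
grouped = cases.groupby("Year").size()   # ordered by Year

print(counts)
print(grouped)
```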
Generating histogram¶
ax = active_cases_and_patents['Patent Count'].hist(bins=50, range=[1,10])
ax.set_xlabel("Patents per Case")
ax.set_ylabel("Number of Cases")
ax.set_title("Histogram of Patents and Cases")
Filtering data frame¶
# top_25_courts is assumed to be a list of court-name strings built earlier
# Keep only names without a ":" (entries containing one are treated as spurious)
top_25_courts = [name for name in top_25_courts if ":" not in name]
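The same rule can be applied to a DataFrame directly with a boolean mask. This is a sketch under assumptions: the `Court` and `Cases` columns and their values are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical court names; entries containing ':' are treated as spurious
courts = pd.DataFrame({
    "Court": ["Texas Eastern", "Delaware", "Note: unknown", "California Northern"],
    "Cases": [120, 95, 3, 80],
})

# True where the name contains a literal ':' (regex=False treats it as plain text)
has_colon = courts["Court"].str.contains(":", regex=False)

# Keep only the rows where the mask is False
clean_courts = courts[~has_colon]

print(clean_courts)
```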