Pandas
======

Pandas is usually used for loading, transforming and dumping data. Pandas is built on `NumPy` (which is implemented in C). NumPy stores data in compact typed arrays instead of individual Python objects, which speeds up operations and reduces the load on the garbage collector.

There are some key data structures to know when working with Pandas:

1. `Series`: One-dimensional array containing a sequence of values
2. `DataFrame` (container for `Series`): Represents a rectangular table of data and contains an ordered collection of columns, each of which can hold a different value type. It has both:

   1. Row index
   2. Column index

3. `Panel` (container for `DataFrame`): A three-dimensional data structure. `Panel` was deprecated in pandas 0.20 and removed in 0.25; on current versions use a `DataFrame` with a `MultiIndex` instead.

Peeking data from CSV file
``````````````````````````

.. code-block:: python

    import pandas as pd

    active_cases_and_patents = pd.read_csv("./data/active_case_patents.csv")
    # head() returns the first five rows by default
    active_cases_and_patents.head()

Counting rows in frame
``````````````````````

.. code-block:: python

    # shape is a (rows, columns) tuple
    print(f"Total case sample: {active_cases_and_patents.shape[0]}")

Series from NumPy arrays
````````````````````````

.. code-block:: python

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)

Series from Dictionary
``````````````````````

.. code-block:: python

    # Dictionary keys become the index labels
    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data)

Retrieve multiple elements from Series
``````````````````````````````````````

.. code-block:: python

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # Retrieve multiple elements by passing a list of labels
    print(s[['a', 'c', 'd']])

Create frame from nested list
`````````````````````````````

.. code-block:: python

    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])

    # Passing dtype=float to the constructor fails on pandas 2.x because
    # 'Name' holds strings, so convert only the numeric column.
    df['Age'] = df['Age'].astype(float)

Create frame from dict of list
``````````````````````````````

.. code-block:: python

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data)

Create frame from list of dict
``````````````````````````````

.. code-block:: python

    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

    # Two column labels that both match dictionary keys
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

    # 'b1' matches no dictionary key, so that column is filled with NaN
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

Selecting column from frame
```````````````````````````

.. code-block:: python

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    print(df['one'])

Appending columns to frame
``````````````````````````

.. code-block:: python

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    # Add a new column by assigning a Series to a new column label
    print("Adding a new column by passing as Series:")
    df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    print(df)

    print("Adding a new column using the existing columns in DataFrame:")
    df['four'] = df['one'] + df['three']
    print(df)

Deleting columns
````````````````

.. code-block:: python

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)
    print("Our dataframe is:")
    print(df)

    # del is a statement, not a function; it removes the column in place
    print("Deleting the first column using the del statement:")
    del df['one']
    print(df)

Selecting by row label
``````````````````````

.. code-block:: python

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    print(df.loc['b'])

Selecting by row integer
````````````````````````

.. code-block:: python

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    print(df.iloc[2])

    # For multiple rows, slice by position
    print(df.iloc[2:4])

Appending rows
``````````````

.. code-block:: python

    df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
    df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, df2])

Deleting rows
`````````````

.. code-block:: python

    # Drop rows with label 0. After the concatenation above, two rows
    # carry label 0 (one from each frame), and both are dropped.
    df = df.drop(0)

Describing Data
```````````````

.. code-block:: python

    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # describe() summarizes the numeric columns (count, mean, std, quartiles)
    print(df.describe())

Applying functions to frames
````````````````````````````

There are three methods of doing this:

1. Table-wise using `pipe`
2. Row/column-wise using `apply`
3. Element-wise using `applymap` (renamed to `DataFrame.map` in pandas 2.1, where `applymap` is deprecated)

.. code-block:: python

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

    # Column-wise (default, axis=0): one mean per column
    print(df.apply(np.mean))

    # Row-wise (axis=1): one mean per row
    print(df.apply(np.mean, axis=1))

.. note::

    Any time you want to do something by row, you need to pass the `axis=1` parameter.

Iterating
`````````

To iterate over key-value pairs (column label and column `Series`):

.. code-block:: python

    df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])

    # iteritems() was removed in pandas 2.0; items() is the replacement
    for key, value in df.items():
        print(key, value)

To iterate row-wise:

.. code-block:: python

    df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])

    for row_index, row in df.iterrows():
        print(row_index, row)

Sorting
```````

By label

.. code-block:: python

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])

    sorted_df = unsorted_df.sort_index()
    print(sorted_df)

    # For descending order use the following
    sorted_df = unsorted_df.sort_index(ascending=False)

By columns

.. code-block:: python

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])

    # axis=1 sorts the column labels instead of the row labels
    sorted_df = unsorted_df.sort_index(axis=1)

By value

.. code-block:: python

    unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
    sorted_df = unsorted_df.sort_values(by='col1')

By multiple columns using `by`

.. code-block:: python

    # Sort by col1 first, then break ties using col2
    sorted_df = unsorted_df.sort_values(by=['col1', 'col2'])

Plot a chart distribution
`````````````````````````

Here `value_counts` is roughly equivalent to `SELECT year, COUNT(*) FROM something GROUP BY year`.

.. code-block:: python

    # Jupyter/IPython magic; omit this line in a plain Python script
    %matplotlib inline
    import matplotlib.pyplot as plt

    ax = active_cases_and_patents['Year'].value_counts().plot.barh()
    ax.set_title("Cases count by Year")

Generating histogram
````````````````````

.. code-block:: python

    ax = active_cases_and_patents['Patent Count'].hist(bins=50, range=[1, 10])
    ax.set_xlabel("Patents per Case")
    ax.set_ylabel("Number of Cases")
    ax.set_title("Histogram of Patents and Cases")

Filtering data frame
````````````````````

.. code-block:: python

    # top_25_courts is assumed to be a list of court-name strings built earlier.
    # Keep only the names that do not contain a colon.
    spurious_court_name_filter = lambda x: ":" not in x
    top_25_courts = list(filter(spurious_court_name_filter, top_25_courts))
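
The filtering example in this section works on a plain Python list of names. When the names live in a frame column, the same idea can be sketched with a boolean mask. The `courts` frame, its `Court` and `Cases` columns, and all values below are hypothetical and made up for illustration; they are not taken from the data set used elsewhere in these notes.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
courts = pd.DataFrame({
    "Court": ["Texas Eastern", "Delaware", "N/A: unknown", "California Northern"],
    "Cases": [120, 95, 3, 60],
})

# Boolean mask: True for rows whose court name contains no ":".
# regex=False makes str.contains treat ":" as a literal character.
mask = ~courts["Court"].str.contains(":", regex=False)

# Indexing with the mask keeps only the rows where it is True.
clean_courts = courts[mask]
print(clean_courts)
```

Indexing with a mask returns a new frame and leaves `courts` unchanged, so the filter can be adjusted and rerun without reloading the data.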