Pandas
======

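Pandas is typically used for loading, transforming, and exporting data. It is built on `NumPy`, whose core is implemented in C. Operating on whole arrays at once keeps most of the work out of the Python interpreter, which improves speed and reduces the per-object overhead (including garbage-collector pressure) of pure-Python loops.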

Pandas is organized around a few key data structures:

1. `Series`: a one-dimensional labeled array holding a sequence of values.
2. `DataFrame` (a container for `Series`): a rectangular table made of an ordered collection of columns, each of which can hold a different value type. It has both:

   1. a row index
   2. a column index

3. `Panel` (a container for `DataFrame`): a three-dimensional data structure. `Panel` was deprecated and then removed in pandas 0.25; current versions use a `DataFrame` with a `MultiIndex` instead.
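
The first two structures can be sketched with a small, made-up data set. The names, ages, and cities below are illustrative assumptions, not data from these notes:

.. code-block:: python

    import pandas as pd

    # A Series: one-dimensional values plus an index of labels.
    ages = pd.Series([28, 34, 29], index=["tom", "jack", "steve"])

    # A DataFrame: columns of possibly different dtypes sharing one row index.
    people = pd.DataFrame(
        {"age": ages, "city": pd.Series(["Oslo", "Pune", "Lima"], index=ages.index)}
    )

    print(people.index.tolist())    # row index labels
    print(people.columns.tolist())  # column index labels
    print(type(people["age"]))      # selecting one column gives back a Series

Selecting a single column returns a `Series` that shares the frame's row index.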

Peeking data from CSV file
``````````````````````````
.. code-block:: python

    import pandas as pd

    active_cases_and_patents = pd.read_csv("./data/active_case_patents.csv")
    active_cases_and_patents.head()  # first five rows by default
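
To try the same calls without the data file, `read_csv` also accepts a file-like object. A minimal sketch, assuming a tiny inline CSV whose rows and `Case` column are made up for illustration:

.. code-block:: python

    import io
    import pandas as pd

    # Hypothetical sample rows standing in for the real CSV file.
    csv_text = "Case,Year,Patent Count\nA-1,2015,2\nA-2,2016,1\nA-3,2016,4\n"

    sample = pd.read_csv(io.StringIO(csv_text))
    print(sample.head(2))   # pass a number to head() to change how many rows are shown
    print(sample.dtypes)    # column types inferred while parsing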

Counting rows in frame
``````````````````````

.. code-block:: python

    # shape is a (rows, columns) tuple, so shape[0] is the row count
    print(f"Total case sample: {active_cases_and_patents.shape[0]}")

Series from NumPy arrays
````````````````````````

.. code-block:: python

    import pandas as pd
    import numpy as np
    data = np.array(['a','b','c','d'])
    s = pd.Series(data)

Series from Dictionary
``````````````````````

.. code-block:: python

    data = {'a' : 0., 'b' : 1., 'c' : 2.}
    s = pd.Series(data)
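
When a dictionary is passed, its keys become the index. A short sketch of what happens when an explicit `index` is supplied as well (the label `'d'` is an assumption added to show a missing key):

.. code-block:: python

    import pandas as pd

    data = {"a": 0.0, "b": 1.0, "c": 2.0}

    # Values are looked up by key and reordered to match the requested index;
    # 'd' is not a key in the dictionary, so its value is NaN.
    s = pd.Series(data, index=["b", "c", "d", "a"])
    print(s)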

Retrieve multiple elements from Series
``````````````````````````````````````

.. code-block:: python

    s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    # Retrieve multiple elements by passing a list of index labels
    print(s[['a','c','d']])

Create frame from nested list
`````````````````````````````

.. code-block:: python

    data = [['Alex',10],['Bob',12],['Clarke',13]]
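    df = pd.DataFrame(data, columns=['Name', 'Age'])
    # Cast only the numeric column; a frame-wide dtype=float cannot convert 'Name'
    df['Age'] = df['Age'].astype(float)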

Create frame from dict of list
``````````````````````````````

.. code-block:: python

    data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
    df = pd.DataFrame(data)

Create frame from list of dict
``````````````````````````````

.. code-block:: python

    data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

    # Keep two columns whose names match dictionary keys
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

    # Request a column name ('b1') that is not a key: that column is filled with NaN
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
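
A small sketch of how missing keys are handled, using the same list of dictionaries:

.. code-block:: python

    import pandas as pd

    data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

    # Without a columns argument, every key seen becomes a column;
    # rows that lack a key get NaN in that column.
    df = pd.DataFrame(data)
    print(df)

    # A requested column that matches no key is entirely NaN.
    df2 = pd.DataFrame(data, columns=["a", "b1"])
    print(df2)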

Selecting column from frame
```````````````````````````

.. code-block:: python

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

    df = pd.DataFrame(d)
    print(df['one'])

Appending columns to frame
``````````````````````````

.. code-block:: python

    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

    df = pd.DataFrame(d)

    # Adding a new column to an existing DataFrame object with column label by passing new series

    print("Adding a new column by passing as Series:")
    df['three']=pd.Series([10,20,30],index=['a','b','c'])
    print(df)

    print("Adding a new column using the existing columns in DataFrame:")
    df['four']=df['one']+df['three']
    print(df)

Deleting columns
````````````````

.. code-block:: python

    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
    'three' : pd.Series([10,20,30], index=['a','b','c'])}

    df = pd.DataFrame(d)
    print ("Our dataframe is:")
    print(df)

    # del is a statement, not a function; it removes the column in place
    print("Deleting the first column using del:")
    del df['one']
    print(df)
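
Besides `del`, two other ways to remove a column are `drop` and `pop`. A minimal sketch with made-up values:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

    # drop returns a new frame and leaves the original untouched
    trimmed = df.drop(columns=["one"])

    # pop removes the column in place and returns it as a Series
    popped = df.pop("two")

    print(trimmed.columns.tolist())
    print(df.columns.tolist())

`drop` is the safer choice when the original frame is still needed afterwards.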

Selecting by row label
``````````````````````

.. code-block:: python

    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

    df = pd.DataFrame(d)
    print(df.loc['b'])

Selecting by row integer
````````````````````````

.. code-block:: python

    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

    df = pd.DataFrame(d)
    print(df.iloc[2])

    # for multiple rows, slice by position (the end position is excluded)
    print(df.iloc[2:4])
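
One difference between the two selectors is easy to miss: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. A short sketch with made-up values:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

    by_label = df.loc["b":"c"]    # label slice: both 'b' and 'c' are returned
    by_position = df.iloc[1:2]    # position slice: only position 1 is returned

    print(by_label)
    print(by_position)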

Appending rows
``````````````

.. code-block:: python

    df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
    df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df2])

Deleting rows
`````````````

.. code-block:: python

    # Drop rows with label 0
    df = df.drop(0)
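
Because `drop` removes rows by label, it removes every row carrying that label. A sketch of how this interacts with stacked frames, which keep their original labels unless `ignore_index=True` is passed:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
    df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

    stacked = pd.concat([df, df2])                        # labels 0, 1, 0, 1
    renumbered = pd.concat([df, df2], ignore_index=True)  # labels 0, 1, 2, 3

    print(stacked.drop(0))      # both rows labeled 0 are removed
    print(renumbered.drop(0))   # only the first row is removed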

Describing Data
```````````````
.. code-block:: python

    d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
    'Lee','David','Gasper','Betina','Andres']),
    'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
    }

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe())

Applying functions to frames
````````````````````````````

There are three methods of doing this:

1. Table-wise, using `pipe`
2. Row- or column-wise, using `apply`
3. Element-wise, using `applymap` (renamed to `DataFrame.map` in pandas 2.1, where `applymap` is deprecated)

.. code-block:: python

    df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
    
    # column wise (the default, axis=0): one mean per column
    print(df.apply(np.mean))

    # row wise (axis=1): one mean per row
    print(df.apply(np.mean, axis=1))

.. note:: Whenever you want to do something row by row, pass the `axis=1` parameter.
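
The three approaches can be compared side by side. This is a minimal sketch with made-up values. `add_offset` is a hypothetical helper, and the element-wise call picks `DataFrame.map` when it exists and falls back to `applymap` on older pandas versions:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 7.0]})

    def add_offset(frame, offset):
        """Hypothetical table-wise helper: shift every value by offset."""
        return frame + offset

    # Table-wise: the whole frame is passed as the first argument
    shifted = df.pipe(add_offset, 10)

    # Column-wise (default) and row-wise (axis=1): the function receives a Series
    col_range = df.apply(lambda col: col.max() - col.min())
    row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

    # Element-wise: the function receives one scalar at a time
    elementwise = df.map if hasattr(df, "map") else df.applymap
    doubled = elementwise(lambda x: x * 2)

    print(shifted, col_range, row_range, doubled, sep="\n\n")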

Iterating
`````````

To iterate over key-value pairs (column label and column `Series`)

.. code-block:: python

    df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
    # iteritems() was removed in pandas 2.0; items() yields the same (label, Series) pairs
    for key, value in df.items():
        print(key,value)

To iterate row wise

.. code-block:: python

    df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
    for row_index, row in df.iterrows():
        print(row_index, row)
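
`iterrows` hands back each row as a `Series`, so columns of different numeric types are upcast to a common dtype. Where that matters, `itertuples` is one alternative. A short sketch with made-up values:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"col1": [1, 2], "col2": [0.5, 1.5]})

    # Each row is a namedtuple; Index holds the row label
    for row in df.itertuples():
        print(row.Index, row.col1, row.col2)

    # With iterrows the integer column is upcast to float inside the row Series
    _, first_row = next(df.iterrows())
    print(first_row.dtype)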


Sorting
```````

By label

.. code-block:: python

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])

    sorted_df=unsorted_df.sort_index()
    print(sorted_df)

    # for descending order use following
    sorted_df = unsorted_df.sort_index(ascending=False)

By columns

.. code-block:: python

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    
    sorted_df=unsorted_df.sort_index(axis=1)

By value

.. code-block:: python

    unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
    sorted_df = unsorted_df.sort_values(by='col1')

By multiple columns using `by`

.. code-block:: python

    sorted_df = unsorted_df.sort_values(by=['col1','col2'])
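
When sorting by several columns, `ascending` can also be a list with one flag per column. A minimal sketch using the same values as the frame above:

.. code-block:: python

    import pandas as pd

    unsorted_df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})

    # col1 ascending, then col2 descending within equal col1 values
    sorted_df = unsorted_df.sort_values(by=["col1", "col2"], ascending=[True, False])
    print(sorted_df)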

Plot a chart distribution
`````````````````````````

Here `value_counts` behaves roughly like `SELECT Year, COUNT(*) FROM something GROUP BY Year`, with the result sorted by count in descending order.

.. code-block:: python

    # IPython/Jupyter magic; omit this line in a plain Python script
    %matplotlib inline
    import matplotlib.pyplot as plt

    ax = active_cases_and_patents['Year'].value_counts().plot.barh()
    ax.set_title("Cases count by Year")
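
The comparison with `GROUP BY` can be checked without plotting. A minimal sketch, assuming a made-up `Year` column:

.. code-block:: python

    import pandas as pd

    # Hypothetical sample values standing in for the real data
    cases = pd.DataFrame({"Year": [2015, 2016, 2016, 2017, 2016]})

    counts = cases["Year"].value_counts()      # sorted by count, largest first
    grouped = cases.groupby("Year").size()     # sorted by the group key

    print(counts)
    print(grouped)

Both hold the same counts; only the ordering differs.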

Generating histogram
````````````````````

.. code-block:: python

    ax = active_cases_and_patents['Patent Count'].hist(bins=50, range=[1,10])
    ax.set_xlabel("Patents per Case")
    ax.set_ylabel("Number of Cases")
    ax.set_title("Histogram of Patents and Cases")

Filtering data frame
````````````````````
.. code-block:: python

    # top_25_courts is assumed to be a list of court-name strings built earlier;
    # keep only the names that do not contain a colon
    top_25_courts = [name for name in top_25_courts if ":" not in name]
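
The snippet above filters a plain Python list. To filter the rows of a frame itself, a boolean mask is the usual approach. A minimal sketch, assuming hypothetical `Court` and `Cases` columns with made-up values:

.. code-block:: python

    import pandas as pd

    courts = pd.DataFrame({
        "Court": ["Texas Eastern", "Note: unknown", "Delaware"],
        "Cases": [120, 3, 95],
    })

    # True for rows whose court name contains no colon
    mask = ~courts["Court"].str.contains(":", regex=False)
    clean = courts[mask]
    print(clean)

Masks can be combined with `&` and `|`, with each condition wrapped in parentheses.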

References:

1. `Pandas Cheat Sheet <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>`_
2. `Python Pandas Tutorial <https://www.tutorialspoint.com/python_pandas/index.htm>`_
3. `Thinking like a Panda <https://www.youtube.com/watch?v=ObUcgEO4N8w>`_