Scikit Learn
============

Introduction
````````````

1. Common Notations in ML

   - `X` {collection of feature vectors `x`}
   - `y` {collection of labels/target variables}
   - `Theta` {model parameters}

2. Types of Machine Learning

   1. Based on Methods

      1. Supervised vs. Unsupervised
      2. Semi-supervised {e.g. Google Photos clusters images based on faces; you then only have to tag a single instance per person}
      3. Reinforcement Learning {e.g. AlphaGo - agent, environment, action, reward and policy}

   2. Based on Output

      1. Regression {continuous output}
      2. Classification
      3. Ranking

   3. Based on Requirement for Updates

      1. Batched {offline}
      2. Online

         - Algorithms receive data continuously {e.g. ticker data}
         - Out-of-core learning: the entire dataset cannot fit in memory
         - Bad data == gradual decline in performance
         - Hence they require continuous monitoring and, if possible, reverting to an older state of the model

3. A Few Examples of ML Algorithms

   1. Supervised

      - kNN
      - Linear Regression
      - Logistic Regression
      - SVM
      - Decision Trees & Random Forests
      - Neural Networks

   2. Unsupervised

      - k-Means
      - Hierarchical Cluster Analysis
      - Expectation Maximization
      - Principal Component Analysis
      - Kernel PCA
      - Locally-Linear Embedding
      - t-SNE: t-distributed stochastic neighbor embedding
      - Apriori
      - Eclat

4. Representation of Data in scikit-learn

   * All data has to be in the form of a NumPy matrix or array
   * Input data is a 2D matrix of size n_samples (number of rows) x n_features (number of columns)
   * The Iris dataset is a classic dataset for machine learning in classification settings
   * `sklearn.datasets` has a few datasets that we can use for testing/training
   * When you have a new dataset, scatter plots are a sensible way to study relations amongst attributes
   * `datasets.fetch_` → autocomplete will get you the list of all datasets available to scikit-learn for download

5. Basic Principles of Machine Learning

   * Every algorithm/model in scikit-learn is exposed via an `Estimator` object (each represented by a class)
   * Estimator parameters are set when it is instantiated, e.g. `model = LinearRegression(normalize=True)`
   * Scikit-learn separates model from data
   * `Convention`: `capital` letters for matrices and `lower-case` letters for arrays/vectors
   * `model.fit(X, y)` is used to train our model/algorithm (see the sketch after this list)

6. Supervised Learning: Classification and Regression

   * Classification - output is discrete
   * Regression - output is continuous
   * KNeighborsClassifier → `knn.fit`, `knn.predict`
   * `knn.predict_proba` → gives a probability distribution over the output targets
   * Putting `?` at the end of a name brings up its documentation
   * SVC - Support Vector Machine Classifier (based on LibSVM)
   * Different models will produce different predictions
   * How to choose which model? (This is best answered using model validation)
   * RandomForestRegressor → Random Forest model for regression tasks
   * Tip: in IPython, hitting Shift + Tab between round brackets shows the list of parameters

7. Unsupervised Learning: Dimensionality Reduction and Clustering

   * Find the combination of features that best allows us to classify
   * Dimensionality reduction, e.g. PCA: maps higher-dimensional data into a lower-dimensional space
   * Unsupervised estimators don't take the output `y` in `fit`, e.g. `pca.fit(X)`
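The points above can be tied together with a minimal sketch (not part of the original notes; the dataset and parameter choices are only illustrative): an `Estimator` is instantiated with its parameters, trained with `fit`, and queried with `predict`/`predict_proba`, while unsupervised estimators such as PCA take only `X`.

.. code-block:: python

    # Hedged sketch of the Estimator API on the bundled Iris dataset
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    # X: (n_samples, n_features) matrix, y: vector of target labels
    X, y = load_iris(return_X_y=True)

    # Supervised: instantiate, fit on (X, y), then predict for new samples
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X, y)
    print(knn.predict(X[:3]))        # predicted class labels
    print(knn.predict_proba(X[:3]))  # probability distribution over the targets

    # Unsupervised: fit takes only X (no y); PCA maps 4 features down to 2
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)
    print(X_2d.shape)                # (150, 2)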
Challenges of ML
````````````````

1. Insufficient Quantity of Training Data

   - `Unreasonable effectiveness of data `_ - basically this research shows that adding more data to crappy algorithms gives results comparable to sophisticated algorithms. `So the choice between spending time to collect more data vs. spending time to improve algorithms becomes obvious`

2. Non-representative Training Data

   - Black Swan effect

3. Poor-Quality Data
4. Irrelevant Features
5. Overfitting the Training Data

   - A noisy training dataset tunes the model to detect noise as pattern
   - Possible solutions to overfitting

     - Simplify the model (select one with fewer parameters)
     - Gather more training data
     - Reduce noise in the dataset {e.g. fix data errors, remove outliers}

   - Constraining a model to make it simpler and reduce the risk of overfitting is called `regularization`
   - The amount of regularization to apply during learning can be controlled by a `hyperparameter` {NOTE: **a hyperparameter is a parameter of the learning process, not of the model itself**}

6. Underfitting the Data

   - Occurs when your model is too simple to learn the underlying structure of the data
   - The tell-tale signal is that **predictions are inaccurate even on the training examples**
   - Possible solutions

     - Select a more powerful model
     - Feed better features to the learning algorithm
     - Reduce the constraints on the model {e.g. reduce the regularization hyperparameter}

.. note:: `Estimator` objects implement `fit` and `predict` methods.

ML Framework
````````````

`Approaching (Almost) Any Machine Learning Problem by Abhishek Thakur `_ is a good tutorial on building an abstract ML framework, as shown below:

.. image:: images/ml_framework.png

Linear Regression
`````````````````

Following is an example of linear regression from scikit-learn's official documentation

.. code-block:: python

    >>> import numpy as np
    >>> from sklearn.linear_model import LinearRegression
    >>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    >>> # y = 1 * x_0 + 2 * x_1 + 3
    >>> y = np.dot(X, np.array([1, 2])) + 3
    >>> reg = LinearRegression().fit(X, y)
    >>> reg.score(X, y)
    1.0
    >>> reg.coef_
    array([1., 2.])
    >>> reg.intercept_
    3.0000...
    >>> reg.predict(np.array([[3, 5]]))
    array([16.])
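The Challenges section notes that the amount of regularization is controlled by a hyperparameter. As a hedged follow-up to the example above (not from the original notes), the sketch below swaps `LinearRegression` for `Ridge`, whose `alpha` hyperparameter sets the strength of the L2 penalty on the same toy data.

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import Ridge

    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3

    # Larger alpha constrains (shrinks) the coefficients more strongly,
    # trading some training-set fit for a simpler, more regularized model.
    for alpha in (0.01, 1.0, 10.0):
        reg = Ridge(alpha=alpha).fit(X, y)
        print(alpha, reg.coef_, reg.score(X, y))

In practice `alpha` would be tuned with model validation, e.g. via cross-validation (`RidgeCV`), rather than fixed by hand.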
Ensemble Method
```````````````

Ensemble techniques allow you to create a strong model from a collection of weak models. Following is an example in the context of text classification:

.. code-block:: python

    import os
    import re
    import sys
    import json

    import numpy as np
    import pandas as pd

    from sklearn import metrics
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics import accuracy_score, log_loss, confusion_matrix
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import cross_val_score
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

    from gensim.models import KeyedVectors
    from gensim.test.utils import datapath

    print("Loading word2vec")
    word2vec = KeyedVectors.load_word2vec_format(
        datapath("/Users/sidharthshah/Code/machine_learning/news_scout/data/pre_trained_models/word2vec/GoogleNews-vectors-negative300.bin"),
        binary=True)

    stoplist = [word.strip() for word in open("stopwords.txt").readlines()]
    DIM = 300


    def cleanup_text(snippet):
        """
        this method is used to clean up text
        """
        snippet = snippet.lower().strip()
        snippet = re.sub(r'[^\w\s]', '', snippet)
        snippet = snippet.replace("\n", "")
        snippet = " ".join(snippet.split())
        return snippet


    def embeddings(doc):
        """
        this function is used to get an additive embedding over all words in a document
        """
        stop_word_filter = lambda x: x not in stoplist
        vector = [np.zeros(DIM)]
        for current in list(set(filter(stop_word_filter, doc.split()))):
            if current in word2vec:
                vector = np.add(vector, word2vec[current])
            else:
                vector = np.add(vector, [np.zeros(DIM)])
        return vector


    def pre_process(directory):
        """
        this method is used to pre-process data, it does
        1. Reads JSON files
        2. Returns a Pandas frame with text and label
        """
        results = []
        MAX_WORDS = 0
        for current in os.listdir(directory):
            file_to_read = os.path.join(directory, current)
            data = json.loads(open(file_to_read).read())
            for instance in data:
                row = {}
                row['title'] = instance['title']
                row['blurb'] = instance['blurb']
                row['text'] = cleanup_text(instance['title']) + " " + cleanup_text(instance['blurb'])
                if MAX_WORDS != 0:
                    row['text'] = " ".join(row['text'].split()[:MAX_WORDS])
                else:
                    row['text'] = " ".join(row['text'].split())
                row['target'] = file_to_read.split("/")[1].replace(".json", "")
                results.append(row)
        return pd.DataFrame(results)


    def gen_counting_features(dataset):
        """
        this is used to generate various counting features
        """
        results = []
        for _, row in dataset.iterrows():
            rec = {}
            rec['title_char_len'] = len(row['title'])
            rec['title_word_len'] = len(row['title'].split())
            rec['title_density'] = rec['title_char_len'] / float(rec['title_word_len'])
            rec['blurb_char_len'] = len(row['blurb'])
            rec['blurb_word_len'] = len(row['blurb'].split())
            rec['blurb_density'] = rec['blurb_char_len'] / float(rec['blurb_word_len'])
            results.append(rec)
        return pd.DataFrame(results)


    def gen_embedding_features(dataset):
        """
        this is used to generate a word2vec embedding vector per document
        """
        results = []
        for _, row in dataset.iterrows():
            vector = embeddings(row['text']).flatten()
            results.append(vector)
        return results


    train = pre_process("train")

    print("Vectorizing")
    # select between CountVectorizer or TfidfVectorizer
    # vectorizer = CountVectorizer(stop_words='english', min_df=5)
    vectorizer = TfidfVectorizer(stop_words='english', min_df=30)
    le = LabelEncoder()
    svd = TruncatedSVD(n_components=300, n_iter=7, random_state=42)

    print("Extracting features")
    X_train = svd.fit_transform(vectorizer.fit_transform(train['text']))
    # this is how you can stack hand-generated features
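    # NOTE (added explanation): np.hstack concatenates feature blocks column-wise,
    # so the dense SVD output can be combined with the commented-out counting
    # features and/or the word2vec embedding features generated below.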
    # X_train_counting_features = gen_counting_features(train)
    # X_train = np.hstack((X_train, X_train_counting_features.values))
    X_train_embedding_features = gen_embedding_features(train)
    X_train = np.hstack((X_train, X_train_embedding_features))
    print(X_train.shape)
    y_train = le.fit_transform(train['target'])

    print("Training model")
    rf_clf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=42)
    bnb_clf = BernoulliNB()
    svc_clf = SVC(gamma='scale', probability=True)
    # soft voting averages the predicted class probabilities of the three models
    clf = VotingClassifier(estimators=[('RandomForest', rf_clf),
                                       ('BernoulliNB', bnb_clf),
                                       ('SVC', svc_clf)],
                           voting='soft')
    # alternative classifiers
    # clf = BernoulliNB()
    # clf = SVC(gamma='scale', probability=True)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(scores)
    clf.fit(X_train, y_train)
    print("Completed training model")

    test = pre_process("test")
    print("Extracting features for test dataset")
    X_test = svd.transform(vectorizer.transform(test['text']))
    # X_test_counting_features = gen_counting_features(test)
    # X_test = np.hstack((X_test, X_test_counting_features.values))
    X_test_embedding_features = gen_embedding_features(test)
    X_test = np.hstack((X_test, X_test_embedding_features))
    y_test = le.transform(test['target'])

    print("Testing model")
    y_test_predicted = clf.predict(X_test)
    y_test_predicted_probab = clf.predict_proba(X_test)
    accuracy = accuracy_score(y_test, y_test_predicted)
    print(f"Accuracy Score: {accuracy}")
    print(metrics.classification_report(y_test, y_test_predicted, target_names=le.classes_))
    print(f"Log Loss: {log_loss(y_test, y_test_predicted_probab)}")
    for i, label in enumerate(le.classes_):
        print(i, label)
    print(confusion_matrix(y_test, y_test_predicted))

References:

1. `Scikit Learn Cheatsheet `_
2. `Introduction to Machine Learning with Python and Scikit-Learn `_
3. `Using scikit-learn Pipelines and FeatureUnions `_