Introduction

Machine Learning Practice

Dr. Ashish Tendulkar

IIT Madras

What is Machine Learning, and what do we set out to achieve in this course?

  • Machine learning tries to learn from data.
  • NO DATA NO ML.
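  • A loss function measures how well the model fits the data.
  • Optimization algorithms minimize the loss.
  • The result is a set of learned patterns, i.e. model parameters.
  • The learned model is used to make predictions.
  • The goal is to generalize well to unseen data.
  • Data is divided into train, validation and test sets.
  • Performance is evaluated via cross validation.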
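
The workflow in the list above can be sketched with scikit-learn. This is a minimal illustration, not the course's reference code: the iris dataset, the logistic regression model, the 80/20 split and the 5 folds are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Load a small built-in dataset (an assumption, used only for illustration).
X, y = load_iris(return_X_y=True)

# Hold out a test set that is never touched during training or model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fitting this model minimizes a loss (log loss) to learn its parameters.
model = LogisticRegression(max_iter=1000)

# 5-fold cross validation on the training portion estimates generalization.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training portion, then evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("CV accuracy per fold:", cv_scores)
print("Test accuracy:", test_accuracy)
```

The learned parameters are available afterwards as `model.coef_` and `model.intercept_`.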

Challenges faced by data scientists

  • What accuracy can we expect from the model?
  • What model would give the best performance for the given task?
  • How do we know that the model has learned sensible relationships or parameters?
  • How would the model perform in the wild - on unseen data?
  • What is the best way to divide the data into training, dev and test sets?
  • How do we set the hyper-parameters (HPTs) of the model?
  • What are some of the best practices in data exploration? What visualizations make sense?
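
One common way to approach the hyper-parameter question is a cross-validated grid search. The sketch below assumes an SVM classifier on the iris dataset and a small, arbitrary grid of candidate values; none of these choices come from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate hyper-parameter values (illustrative assumptions).
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# Every combination is scored with 5-fold cross validation on the training data.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best hyper-parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)

# The test set is used only once, after the hyper-parameters are chosen.
print("Test accuracy:", search.score(X_test, y_test))
```

Keeping the test set out of the search is what makes the final score an estimate of performance on unseen data.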

Important terms in ML

  • Model
  • Parameters
  • Training data
  • Training, dev, test division
  • Cross validation
  • Evaluation metrics
  • Loss functions
  • Optimization algorithms

Requirements of an ML Library

  • Data loading and manipulation
  • Preprocess the data, select and extract features. Also called feature engineering.
  • Model selection.
    • Cross validation.
  • Training model.
    • Loss functions.
    • Gradient descent variations.
    • Closed form solution.
  • Evaluation Metrics.

Scikit-learn supports all of these!
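
As an illustration of the training routes listed above, the sketch below fits the same linear model twice: once with `LinearRegression`, which solves the least-squares problem directly, and once with `SGDRegressor`, which uses stochastic gradient descent. The synthetic data, the coefficients and the noise level are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (hypothetical): y = X @ true_coef + 3 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=500)

# Direct least-squares solution.
closed_form = LinearRegression().fit(X, y)

# Iterative solution via stochastic gradient descent on the squared loss.
sgd = SGDRegressor(max_iter=1000, tol=1e-4, random_state=0).fit(X, y)

# Evaluation metric: mean squared error on the training data.
mse_closed = mean_squared_error(y, closed_form.predict(X))
mse_sgd = mean_squared_error(y, sgd.predict(X))

print("Closed form coefficients:", closed_form.coef_, "MSE:", mse_closed)
print("SGD coefficients:", sgd.coef_, "MSE:", mse_sgd)
```

Both estimators should recover coefficients near the ones used to generate the data; the gradient descent result is approximate and depends on the learning rate schedule and stopping tolerance.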

Sklearn modules

Function              Module
Dataset loading       sklearn.datasets
Preprocessing         sklearn.preprocessing
Feature imputation    sklearn.impute
Feature extraction    sklearn.feature_extraction
Feature selection     sklearn.feature_selection
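
A short sketch touching each of the modules above. The missing value is introduced artificially and the dictionary records are invented, purely so that every module has something to do.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# sklearn.datasets: load a built-in dataset.
X, y = load_iris(return_X_y=True)

# Blank out one value so that there is something to impute (illustration only).
X_missing = X.copy()
X_missing[0, 0] = np.nan

# sklearn.impute: replace missing values with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# sklearn.preprocessing: scale each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# sklearn.feature_selection: keep the 2 features with the highest ANOVA F-score.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X_scaled, y)

# sklearn.feature_extraction: turn dict records (hypothetical) into a numeric matrix.
records = [{"city": "Chennai", "temp": 33.0}, {"city": "Delhi", "temp": 25.0}]
vectorizer = DictVectorizer(sparse=False)
X_dict = vectorizer.fit_transform(records)

print(X_selected.shape)
print(vectorizer.feature_names_)
```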

Function          Module                    Model name
Model building    sklearn.linear_model      Supervised linear models
                  sklearn.svm               SVM
                  sklearn.tree              Trees
                  sklearn.neural_network    Artificial neural networks
                  sklearn.cluster           Clustering
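
The model-building modules above share the same interface, which the following sketch tries to show by fitting one model from each. The specific classes, the iris dataset and the settings (for example the hidden layer size) are assumptions picked for brevity.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One supervised model per module; all are trained with the same fit(X, y) call.
supervised_models = {
    "linear_model": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
    "neural_network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=0
    ),
}

train_accuracy = {}
for name, model in supervised_models.items():
    model.fit(X, y)
    train_accuracy[name] = model.score(X, y)

# Clustering is unsupervised: no labels are passed in.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(train_accuracy)
print(cluster_labels[:10])
```

Training accuracy is reported here only to show the shared API; it says little about generalization.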

Machine Learning Summary

Estimator Object

  • It learns from data.
  • It may solve regression, classification or clustering.

Transformers:

  • Implement fit(), transform() and fit_transform() methods.
  • Data preprocessing objects are transformers.
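
A minimal sketch of the transformer interface, assuming `StandardScaler` as the example transformer and a tiny made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): three training rows and one new row.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# fit() learns the per-feature mean and standard deviation from training data.
scaler.fit(X_train)

# transform() applies the learned statistics, also to data not seen during fit().
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)

# fit_transform() is fit() followed by transform() on the same data.
X_train_scaled_2 = StandardScaler().fit_transform(X_train)

print(scaler.mean_)
print(X_new_scaled)
```

Fitting on the training data only and then reusing the fitted object on new data keeps information about the new data out of the learned statistics.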

Predictors:

  • Implement fit() together with predict() and/or fit_predict() methods.
  • This encompasses classifiers, regressors, clusterers and outlier detectors.
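
A minimal sketch of the predictor interface, assuming `LinearRegression` as a regressor and `KMeans` as a clusterer; the toy arrays are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Toy regression data (hypothetical) that follows y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# fit() learns the parameters; predict() applies them to new inputs.
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(np.array([[4.0]]))  # close to 9.0

# fit_predict() fits and returns a label for each training sample in one call.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

print(y_pred)
print(labels)
```

The cluster label values themselves are arbitrary; only which points share a label is meaningful.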

Meta estimators

  • A meta estimator takes other estimators as input
  • Examples:
    • Pipeline
    • Ensemble methods
    • Model based feature selection
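
The three examples above can be combined in one sketch: a `Pipeline` that chains a scaler, a model-based feature selector (`SelectFromModel`) and an ensemble (`BaggingClassifier`). The choice of decision trees as the inner estimators and of the iris dataset is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipeline = Pipeline(
    steps=[
        # A plain transformer.
        ("scale", StandardScaler()),
        # Meta estimator: selects features using a fitted tree's importances.
        ("select", SelectFromModel(DecisionTreeClassifier(random_state=0))),
        # Meta estimator: an ensemble of trees trained on bootstrap samples.
        (
            "classify",
            BaggingClassifier(
                DecisionTreeClassifier(), n_estimators=10, random_state=0
            ),
        ),
    ]
)

# The pipeline is itself an estimator: fit() runs every step in order.
pipeline.fit(X, y)

n_selected = int(pipeline.named_steps["select"].get_support().sum())
train_accuracy = pipeline.score(X, y)

print("Features kept:", n_selected)
print("Training accuracy:", train_accuracy)
```

Because the pipeline behaves like a single estimator, it can be passed to cross validation or grid search in place of a bare model.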

Some resources: