Decision trees

Dr. Ashish Tendulkar

IIT Madras

Machine Learning Practice

Decision Trees

  • Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.

  • The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Decision trees are used for two types of tasks: classification and regression.

Decision tree classifier

Class: \(\colorbox{lightgrey}{sklearn.tree.DecisionTreeClassifier}\)

Some parameters:

  • criterion ("gini", "entropy") - The function to measure the quality of a split. The default is "gini".
  • splitter("best", "random") - The strategy used to choose the split at each node. The default splitter used is "best".
  • max_depth(int) - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split (int or float) - The minimum number of samples required to split an internal node. The default is 2.
  • min_samples_leaf (int or float) -The minimum number of samples required to be at a leaf node. The default is 1.
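
As a quick illustration, the sketch below fits a classifier with a few of these parameters; the iris dataset and the specific values chosen are assumptions for demonstration, not recommendations.

```python
# Illustrative sketch: DecisionTreeClassifier with the parameters listed above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="gini",        # impurity measure used to evaluate splits
    splitter="best",         # pick the best split at each node
    max_depth=3,             # assumed depth, just to keep the tree small
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:2]))
```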

Decision tree regressor

Class: \(\colorbox{lightgrey}{sklearn.tree.DecisionTreeRegressor}\)

Some parameters:

  • criterion("squared error", "friedman_mse", "absolute_error", "poisson") - The function to measure the quality of a split. Default is "squared error".

  • splitter("best", "random") - The strategy used to choose the split at each node. The default splitter used is "best".

  • max_depth(int) - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split(int or float) - The minimum number of samples required to split an internal node. The default is 2.
  • min_samples_leaf (int or float) -The minimum number of samples required to be at a leaf node. The default is 1.
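
For completeness, a parallel sketch for the regressor; the diabetes dataset and the parameter values are assumptions for illustration only.

```python
# Illustrative sketch: DecisionTreeRegressor with the parameters listed above.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

reg = DecisionTreeRegressor(
    criterion="squared_error",  # default split criterion
    splitter="best",
    max_depth=4,                # assumed depth, for illustration only
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
reg.fit(X, y)
print(reg.predict(X[:2]))
```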

Tree algorithms

  • ID3 (Iterative Dichotomiser 3) - The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets.

  • C4.5 - The successor to ID3; it removed the restriction that features must be categorical by dynamically defining a discrete attribute based on numerical variables.

  • C5.0 - Uses less memory and is more accurate than C4.5.

  • CART (Classification and Regression Trees) - Constructs binary trees using the feature and threshold that yield the largest information gain at each node.

Tips on practical use

  • Visualize a tree as you are training by using the export function. Use max_depth=3 as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth (a visualization sketch follows this list).
  • Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.

  • Performing dimensionality reduction (PCA or feature selection) on the data beforehand will give the decision tree a better chance of finding discriminative features.

  • Balance the dataset before training to prevent the tree from being biased toward the dominant classes.

  • All decision trees use np.float32 arrays internally. If the training data is not in this format, a copy of the dataset will be made.
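
A minimal sketch of the visualization tip above, assuming the iris dataset and scikit-learn's export_text and plot_tree helpers:

```python
# Train a shallow tree (max_depth=3) and inspect it, per the tip above.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Text view of the learned decision rules
print(export_text(clf, feature_names=list(iris.feature_names)))

# Graphical view of the same tree
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()
```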

Tips on practical use

  • If the samples are weighted, it will be easier to optimize the tree structure using a weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensures that leaf nodes contain at least a fraction of the overall sum of the sample weights (see the sketch after this list).

  • Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data.
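
A hedged sketch of these pre-pruning controls; the dataset, the per-sample weights, and the specific thresholds below are assumptions chosen only to illustrate the parameters:

```python
# Count-based and weight-based pre-pruning on a weighted training set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
sample_weight = np.where(y == 0, 2.0, 1.0)  # hypothetical per-sample weights

clf = DecisionTreeClassifier(
    min_samples_split=10,           # an internal node needs >= 10 samples to be split
    min_samples_leaf=5,             # every leaf must keep >= 5 samples
    min_weight_fraction_leaf=0.05,  # every leaf must hold >= 5% of the total sample weight
    random_state=0,
)
clf.fit(X, y, sample_weight=sample_weight)
print(clf.get_depth(), clf.get_n_leaves())
```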

What is the label vector for a set of training vectors?

  • Given training vectors \(\color{blue}{x_i \in R^n}, i = 1,\dots, l\), and a label vector \(\color{blue}{y \in R^l}\), a decision tree recursively partitions the feature space such that the samples with the same labels or similar target values are grouped together.
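
In scikit-learn terms, the l training vectors become an (l, n) array X and the label vector becomes a length-l array y; the toy data below is purely illustrative:

```python
# Toy example of the X / y convention described above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# l = 6 training vectors in R^n with n = 2, and one label per vector
X = np.array([[0., 0.], [1., 1.], [2., 0.],
              [0., 2.], [2., 2.], [1., 0.]])  # shape (l, n) = (6, 2)
y = np.array([0, 1, 1, 0, 1, 0])              # shape (l,)  = (6,)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1.5, 0.5]]))
```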

Appendix

Regression criteria

Common criteria for minimizing the error are the following:

  • Mean squared error:

\color{blue}{H(Q_m)=\frac{1}{N_m}\sum_{y\in Q_m}(y-\bar{y}_m)^2}
\color{blue}{\bar{y}_m=\frac{1}{N_m}\sum_{y\in Q_m}y}

Where \(\color{red}{\bar{y}_m}\) is the mean value at the node \(\color{red}{m}\)

  • Half Poisson deviance:

\color{blue}{H(Q_m)=\frac{1}{N_m}\sum_{y\in Q_m}\left(y\log\frac{y}{\bar{y}_m}-y+\bar{y}_m\right)}

  • Mean absolute error:

\color{blue}{H(Q_m)=\frac{1}{N_m}\sum_{y\in Q_m}|y-median(y)_m|}
\color{blue}{median(y)_m=median_{y\in Q_m}(y)}

Where \(\color{red}{median(y)_m}\) is the median value at the node \(\color{red}{m}\)
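
A small sketch (not scikit-learn's internal code) computing the three impurities above for a toy set of target values at a node:

```python
# Node impurities for a hypothetical set of targets Q_m at node m.
import numpy as np

Q_m = np.array([1.0, 2.0, 2.0, 4.0, 6.0])  # toy target values at node m
y_bar = Q_m.mean()                          # \bar{y}_m
y_med = np.median(Q_m)                      # median(y)_m

mse = np.mean((Q_m - y_bar) ** 2)                           # mean squared error
mae = np.mean(np.abs(Q_m - y_med))                          # mean absolute error
poisson = np.mean(Q_m * np.log(Q_m / y_bar) - Q_m + y_bar)  # half Poisson deviance
print(mse, mae, poisson)
```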

Complexity

  • In general, the run time cost to construct a balanced binary tree is \(\color{red}{O(n_{samples}\,n_{features}\log(n_{samples}))}\) and the query time is \(\color{red}{O(\log(n_{samples}))}\).

  • Although the tree construction algorithm attempts to generate balanced trees, they will not always be balanced.

  • Assuming that the subtrees remain approximately balanced, the cost at each node consists of searching through \(\color{red}{O(n_{features})}\) features to find the one that offers the largest reduction in entropy.
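
As a rough worked example with hypothetical sizes (not from the slides), take \(\color{red}{n_{samples}=1000}\) and \(\color{red}{n_{features}=10}\):

\color{blue}{O(1000 \times 10 \times \log_2(1000)) \approx 10^5 \text{ operations to build the tree}}

while a single query traverses roughly \(\color{blue}{\log_2(1000) \approx 10}\) nodes.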

Advantages of decision trees

  • Simple to understand and to interpret. Trees can be visualised.

  • Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values.

  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

  • Able to handle multi-output problems.

Continued...

Let the data at node \(\color{red}{m}\) be represented by \(\color{red}{Q_m}\) with \(\color{red}{N_m}\) samples. For each candidate split \(\color{red}{\theta = (j, t_m)}\) consisting of a feature \(\color{red}{j}\) and threshold \(\color{red}{t_m}\), partition the data into \(\color{blue}{Q_m^{left}(\theta)}\) and \(\color{blue}{Q_m^{right}(\theta)}\) subsets.

\color{blue}{Q_m^{left}(\theta)=\{(x,y)\mid x_j \leq t_m\}}
\color{blue}{Q_m^{right}(\theta)=Q_m \setminus Q_m^{left}(\theta)}

The quality of a candidate split of node \(\color{red}{m}\) is then computed using an impurity function or loss function \(\color{red}{H}\), the choice of which depends on the task being solved (classification or regression).

\color{blue}{G(Q_m,\theta)=\dfrac{N_m^{left}}{N_m}H(Q_m^{left}(\theta))+ \dfrac{N_m^{right}}{N_m}H(Q_m^{right}(\theta))}

Select the parameters that minimise the impurity:

Continued...

\color{blue}{\theta^*=\operatorname{argmin}_{\theta}\, G(Q_m,\theta)}

Recurse for subsets \(\color{blue}{Q_m^{left}(\theta^*)}\) and \(\color{blue}{Q_m^{right}(\theta^*)}\) until the maximum allowable depth is reached, \(\color{blue}{N_m < \text{min\_samples}}\) or \(\color{blue}{N_m = 1}\).
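
The sketch below runs this greedy search for a single node (an illustration of the recurrence above, not scikit-learn's CART implementation), using Gini impurity as H:

```python
# Greedy search for the split theta = (j, t_m) that minimizes
# G(Q_m, theta) = (N_left/N_m) H(Q_left) + (N_right/N_m) H(Q_right).
import numpy as np

def gini(labels):
    """Gini impurity H(Q) of the labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    n_samples, n_features = X.shape
    best = (None, None, np.inf)              # (feature j, threshold t_m, G)
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():    # skip splits that leave a side empty
                continue
            G = (left.sum() / n_samples) * gini(y[left]) \
                + (right.sum() / n_samples) * gini(y[right])
            if G < best[2]:
                best = (j, t, G)
    return best

X = np.array([[2., 3.], [1., 1.], [3., 1.], [4., 5.], [5., 4.]])
y = np.array([0, 0, 0, 1, 1])
print(best_split(X, y))  # (0, 3.0, 0.0) on this toy data: a pure split on feature 0
```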

Classification criteria

If the target is a classification outcome taking on values 0, 1, …, K-1, then for node \(\color{red}{m}\), let \(\color{red}{P_{mk}}\) be the proportion of class \(\color{red}{k}\) observations in node \(\color{red}{m}\):

\color{blue}{P_{mk}=\frac{1}{N_m}\sum_{y\in Q_m}I(y=k)}

Common measures of impurity are the following:

  • Gini:

\color{blue}{H(Q_m)=\sum_{k} P_{mk}(1-P_{mk})}

  • Entropy:

\color{blue}{H(Q_m)=-\sum_k P_{mk}\log(P_{mk})}

  • Misclassification error:

\color{blue}{H(Q_m)=1-\max_k(P_{mk})}
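
A small sketch computing these three measures from a node's class proportions (the proportions below are hypothetical):

```python
# Impurity measures for a node with toy class proportions P_mk.
import numpy as np

p = np.array([0.7, 0.2, 0.1])            # hypothetical P_mk for K = 3 classes

gini = np.sum(p * (1.0 - p))             # sum_k P_mk (1 - P_mk)
entropy = -np.sum(p * np.log(p))         # -sum_k P_mk log(P_mk)
misclassification = 1.0 - p.max()        # 1 - max_k P_mk
print(gini, entropy, misclassification)
```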

Disadvantages of decision trees

  • Creating over-complex trees that do not generalise the data well leads to overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This issue is mitigated by using decision trees within an ensemble.

  • Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.

  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Minimal cost complexity pruning

  • Minimal cost-complexity pruning is an algorithm used to prune a tree to avoid over-fitting. This algorithm is parameterized by \(\color{blue}{\alpha \geq 0}\), known as the complexity parameter. The complexity parameter is used to define the cost-complexity measure, \(\color{blue}{R_{\alpha}(T)}\), of a given tree \(\color{red}{T}\).
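
In scikit-learn, pruning is controlled through the ccp_alpha parameter; the sketch below (the dataset and the choice of alpha are assumptions) computes the effective alphas and refits a pruned tree:

```python
# Minimal cost-complexity pruning via ccp_alpha in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which subtrees would be pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas[:5])

# Refit with a non-zero alpha to obtain a smaller, pruned tree
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary choice for illustration
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
print(pruned.get_n_leaves())
```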