Decision trees

Dr. Ashish Tendulkar

IIT Madras

Machine Learning Practice

Decision Trees

  • Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.

  • The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Decision trees are used for two types of tasks: classification and regression.

Decision tree classifier

Class: \(\colorbox{lightgrey}{sklearn.tree.DecisionTreeClassifier}\)

Some parameters:

  • criterion ("gini", "entropy") - The function to measure the quality of a split. The default is "gini".
  • splitter("best", "random") - The strategy used to choose the split at each node. The default splitter used is "best".
  • max_depth(int) - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split (int or float) - The minimum number of samples required to split an internal node. The default is 2.
  • min_samples_leaf (int or float) -The minimum number of samples required to be at a leaf node. The default is 1.
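
As a quick illustration, the sketch below fits a classifier with a few of these parameters; the iris dataset and the specific values chosen are assumptions for demonstration, not recommendations.

```python
# Illustrative sketch: DecisionTreeClassifier with the parameters listed above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="gini",        # impurity measure used to evaluate splits
    splitter="best",         # pick the best split at each node
    max_depth=3,             # assumed depth, just to keep the tree small
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:2]))
```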

Decision tree regressor

Class: \(\colorbox{lightgrey}{sklearn.tree.DecisionTreeRegressor}\)

Some parameters:

  • criterion("squared error", "friedman_mse", "absolute_error", "poisson") - The function to measure the quality of a split. Default is "squared error".

  • splitter("best", "random") - The strategy used to choose the split at each node. The default splitter used is "best".

  • max_depth(int) - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split(int or float) - The minimum number of samples required to split an internal node. The default is 2.
  • min_samples_leaf (int or float) -The minimum number of samples required to be at a leaf node. The default is 1.
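
For completeness, a parallel sketch for the regressor; the diabetes dataset and the parameter values are assumptions for illustration only.

```python
# Illustrative sketch: DecisionTreeRegressor with the parameters listed above.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

reg = DecisionTreeRegressor(
    criterion="squared_error",  # default split criterion
    splitter="best",
    max_depth=4,                # assumed depth, for illustration only
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
reg.fit(X, y)
print(reg.predict(X[:2]))
```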

Tree algorithms

  • ID3 (Iterative Dichotomiser 3) - The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets.

  • C4.5 - The successor to ID3; it removed the restriction that features must be categorical by dynamically defining a discrete attribute based on numerical variables.

  • C5.0 - Uses less memory and is more accurate than C4.5.

  • CART (Classification and Regression Trees) - Constructs binary trees using the feature and threshold that yield the largest information gain at each node.

Tips on practical use

  • Visualize a tree as you are training by using the export function. Use max_depth=3 as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth (a visualization sketch follows this list).
  • Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.

  • Performing dimensionality reduction (PCA or feature selection) on the data beforehand will give the decision tree a better chance of finding discriminative features.

  • Balance the dataset before training to prevent the tree from being biased toward the dominant classes.

  • All decision trees use np.float32 arrays internally. If the training data is not in this format, a copy of the dataset will be made.
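
A minimal sketch of the visualization tip above, assuming the iris dataset and scikit-learn's export_text and plot_tree helpers:

```python
# Train a shallow tree (max_depth=3) and inspect it, per the tip above.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Text view of the learned decision rules
print(export_text(clf, feature_names=list(iris.feature_names)))

# Graphical view of the same tree
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()
```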

Tips on practical use

  • If the samples are weighted, it will be easier to optimize the tree structure using a weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensures that leaf nodes contain at least a fraction of the overall sum of the sample weights (see the sketch after this list).

  • Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data.
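
A hedged sketch of these pre-pruning controls; the dataset, the per-sample weights, and the specific thresholds below are assumptions chosen only to illustrate the parameters:

```python
# Count-based and weight-based pre-pruning on a weighted training set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
sample_weight = np.where(y == 0, 2.0, 1.0)  # hypothetical per-sample weights

clf = DecisionTreeClassifier(
    min_samples_split=10,           # an internal node needs >= 10 samples to be split
    min_samples_leaf=5,             # every leaf must keep >= 5 samples
    min_weight_fraction_leaf=0.05,  # every leaf must hold >= 5% of the total sample weight
    random_state=0,
)
clf.fit(X, y, sample_weight=sample_weight)
print(clf.get_depth(), clf.get_n_leaves())
```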

What is the label vector for a set of training vectors?

  • Given training vectors \(\color{blue}{x_i \in R^n}, i = 1,\dots, l\), and a label vector \(\color{blue}{y \in R^l}\), a decision tree recursively partitions the feature space such that the samples with the same labels or similar target values are grouped together.
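
In scikit-learn terms, the l training vectors become an (l, n) array X and the label vector becomes a length-l array y; the toy data below is purely illustrative:

```python
# Toy example of the X / y convention described above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# l = 6 training vectors in R^n with n = 2, and one label per vector
X = np.array([[0., 0.], [1., 1.], [2., 0.],
              [0., 2.], [2., 2.], [1., 0.]])  # shape (l, n) = (6, 2)
y = np.array([0, 1, 1, 0, 1, 0])              # shape (l,)  = (6,)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1.5, 0.5]]))
```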

Appendix

Regression criteria

Common criteria for minimizing the error are the following:

  • Mean squared error:

\color{blue}{H(Q_m)=\frac{1}{N_m}\sum_{y\in Q_m}(y-\bar{y}_m)^2}
\color{blue}{\bar{y}_m=\frac{1}{N_m}\sum_{y\in Q_m}y}

Where \(\color{red}{\bar{y}_m}\) is the mean value at the node \(\color{red}{m}\)

  • Half Poisson deviance:

\color{blue}{H(Q_m)=\frac{1}{N_m}\sum_{y\in Q_m}\left(y\log\frac{y}{\bar{y}_m}-y+\bar{y}_m\right)}

  • Mean absolute error:

\color{blue}{H(Q_m)=\frac{1}{N_m}\sum_{y\in Q_m}|y-median(y)_m|}
\color{blue}{median(y)_m=median_{y\in Q_m}(y)}

Where \(\color{red}{median(y)_m}\) is the median value at the node \(\color{red}{m}\)
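
A small sketch (not scikit-learn's internal code) computing the three impurities above for a toy set of target values at a node:

```python
# Node impurities for a hypothetical set of targets Q_m at node m.
import numpy as np

Q_m = np.array([1.0, 2.0, 2.0, 4.0, 6.0])  # toy target values at node m
y_bar = Q_m.mean()                          # \bar{y}_m
y_med = np.median(Q_m)                      # median(y)_m

mse = np.mean((Q_m - y_bar) ** 2)                           # mean squared error
mae = np.mean(np.abs(Q_m - y_med))                          # mean absolute error
poisson = np.mean(Q_m * np.log(Q_m / y_bar) - Q_m + y_bar)  # half Poisson deviance
print(mse, mae, poisson)
```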

Complexity

  • In general, the run time cost to construct a balanced binary tree is \(\color{red}{O(n_{samples}\,n_{features}\log(n_{samples}))}\) and the query time is \(\color{red}{O(\log(n_{samples}))}\).

  • Although the tree construction algorithm attempts to generate balanced trees, they will not always be balanced.

  • Assuming that the subtrees remain approximately balanced, the cost at each node consists of searching through \(\color{red}{O(n_{features})}\) features to find the one that offers the largest reduction in entropy.
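
As a rough worked example with hypothetical sizes (not from the slides), take \(\color{red}{n_{samples}=1000}\) and \(\color{red}{n_{features}=10}\):

\color{blue}{O(1000 \times 10 \times \log_2(1000)) \approx 10^5 \text{ operations to build the tree}}

while a single query traverses roughly \(\color{blue}{\log_2(1000) \approx 10}\) nodes.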

Advantages of decision trees

  • Simple to understand and to interpret. Trees can be visualised.

  • Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values.

  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

  • Able to handle multi-output problems.

Continued...

Let the data at node \(\color{red}{m}\) be represented by \(\color{red}{Q_m}\) with \(\color{red}{N_m}\) samples. For each candidate split \(\color{red}{\theta = (j, t_m)}\) consisting of a feature \(\color{red}{j}\) and threshold \(\color{red}{t_m}\), partition the data into \(\color{blue}{Q_m^{left}(\theta)}\) and \(\color{blue}{Q_m^{right}(\theta)}\) subsets.

\color{blue}{Q_m^{left}(\theta)=\{(x,y)\mid x_j \leq t_m\}}
\color{blue}{Q_m^{right}(\theta)=Q_m \setminus Q_m^{left}(\theta)}

The quality of a candidate split of node \(\color{red}{m}\) is then computed using an impurity function or loss function \(\color{red}{H}\), the choice of which depends on the task being solved (classification or regression).

\color{blue}{G(Q_m,\theta)=\dfrac{N_m^{left}}{N_m}H(Q_m^{left}(\theta))+ \dfrac{N_m^{right}}{N_m}H(Q_m^{right}(\theta))}

Select the parameters that minimise the impurity:

Continued...

\color{blue}{\theta^*=\operatorname{argmin}_{\theta}\, G(Q_m,\theta)}

Recurse for subsets \(\color{blue}{Q_m^{left}(\theta^*)}\) and \(\color{blue}{Q_m^{right}(\theta^*)}\) until the maximum allowable depth is reached, \(\color{blue}{N_m < \text{min\_samples}}\) or \(\color{blue}{N_m = 1}\).
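
The sketch below runs this greedy search for a single node (an illustration of the recurrence above, not scikit-learn's CART implementation), using Gini impurity as H:

```python
# Greedy search for the split theta = (j, t_m) that minimizes
# G(Q_m, theta) = (N_left/N_m) H(Q_left) + (N_right/N_m) H(Q_right).
import numpy as np

def gini(labels):
    """Gini impurity H(Q) of the labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    n_samples, n_features = X.shape
    best = (None, None, np.inf)              # (feature j, threshold t_m, G)
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():    # skip splits that leave a side empty
                continue
            G = (left.sum() / n_samples) * gini(y[left]) \
                + (right.sum() / n_samples) * gini(y[right])
            if G < best[2]:
                best = (j, t, G)
    return best

X = np.array([[2., 3.], [1., 1.], [3., 1.], [4., 5.], [5., 4.]])
y = np.array([0, 0, 0, 1, 1])
print(best_split(X, y))  # (0, 3.0, 0.0) on this toy data: a pure split on feature 0
```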

Classification criteria

If the target is a classification outcome taking on values 0, 1, …, K-1, then for node \(\color{red}{m}\), let \(\color{red}{P_{mk}}\) be the proportion of class \(\color{red}{k}\) observations in node \(\color{red}{m}\):

\color{blue}{P_{mk}=\frac{1}{N_m}\sum_{y\in Q_m}I(y=k)}

Common measures of impurity are the following:

  • Gini:

\color{blue}{H(Q_m)=\sum_{k} P_{mk}(1-P_{mk})}

  • Entropy:

\color{blue}{H(Q_m)=-\sum_k P_{mk}\log(P_{mk})}

  • Misclassification error:

\color{blue}{H(Q_m)=1-\max_k(P_{mk})}
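
A small sketch computing these three measures from a node's class proportions (the proportions below are hypothetical):

```python
# Impurity measures for a node with toy class proportions P_mk.
import numpy as np

p = np.array([0.7, 0.2, 0.1])            # hypothetical P_mk for K = 3 classes

gini = np.sum(p * (1.0 - p))             # sum_k P_mk (1 - P_mk)
entropy = -np.sum(p * np.log(p))         # -sum_k P_mk log(P_mk)
misclassification = 1.0 - p.max()        # 1 - max_k P_mk
print(gini, entropy, misclassification)
```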

Disadvantages of decision trees

  • Creating over-complex trees that do not generalise the data well leads to overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This issue is mitigated by using decision trees within an ensemble.

  • Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.

  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Minimal cost complexity pruning

  • Minimal cost-complexity pruning is an algorithm used to prune a tree to avoid over-fitting. This algorithm is parameterized by \(\color{blue}{\alpha \geq 0}\), known as the complexity parameter. The complexity parameter is used to define the cost-complexity measure, \(\color{blue}{R_{\alpha}(T)}\), of a given tree \(\color{red}{T}\).
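
In scikit-learn, pruning is controlled through the ccp_alpha parameter; the sketch below (the dataset and the choice of alpha are assumptions) computes the effective alphas and refits a pruned tree:

```python
# Minimal cost-complexity pruning via ccp_alpha in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which subtrees would be pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas[:5])

# Refit with a non-zero alpha to obtain a smaller, pruned tree
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary choice for illustration
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
print(pruned.get_n_leaves())
```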