Some parameters:
criterion ("squared_error", "friedman_mse", "absolute_error", "poisson") - The function to measure the quality of a split. Default is "squared_error".
splitter ("best", "random") - The strategy used to choose the split at each node. Default is "best".
max_depth (int) - The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
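A minimal sketch of how these parameters might be set on scikit-learn's DecisionTreeRegressor (the toy data below is purely illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative toy data.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# The parameters described above, set explicitly.
reg = DecisionTreeRegressor(
    criterion="squared_error",  # quality-of-split measure (the default)
    splitter="best",            # pick the best split at each node (the default)
    max_depth=3,                # cap tree depth; None grows until leaves are pure
    random_state=0,
)
reg.fit(X, y)
print(reg.predict([[2.5]]))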
ID3 (Iterative Dichotomiser 3) - The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets.
C4.5 - The successor to ID3; it removes the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute values into a discrete set of intervals.
C5.0 - The successor to C4.5; it uses less memory and is more accurate than C4.5.
CART (Classification and Regression Trees) - Constructs binary trees using the feature and threshold that yield the largest information gain at each node.
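As a rough illustration of the CART idea (a sketch only, not scikit-learn's actual implementation; the function name best_split is hypothetical), the code below greedily searches every feature/threshold pair for the split that most reduces the squared error:

import numpy as np

def best_split(X, y):
    # Return the (feature, threshold) pair with the largest reduction in
    # the sum of squared errors, or None if no split improves on the parent.
    n_samples, n_features = X.shape
    parent_sse = np.sum((y - y.mean()) ** 2)
    best, best_gain = None, 0.0
    for f in range(n_features):
        for t in np.unique(X[:, f])[:-1]:        # candidate thresholds
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            gain = parent_sse - sse              # impurity reduction of this split
            if gain > best_gain:
                best_gain, best = gain, (f, t)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2])
print(best_split(X, y))  # splits the two clusters of targets apart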
Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
Performing dimensionality reduction (PCA, or feature selection) on the data beforehand will give the decision tree a better chance of finding features that are discriminative.
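A minimal sketch of this tip, assuming scikit-learn's PCA, Pipeline, and DecisionTreeClassifier on synthetic data:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic high-dimensional data: many features, only a few of them informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Reduce dimensionality before handing the data to the tree.
model = make_pipeline(PCA(n_components=10), DecisionTreeClassifier(random_state=0))
model.fit(X, y)
print(model.score(X, y))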
Balance the dataset before training to prevent the tree from being biased toward the classes that are dominant.
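One way to approximate this in scikit-learn without resampling is the class_weight option (a sketch on imbalanced toy data):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: 90 samples of class 0, 10 of class 1.
X = np.random.RandomState(0).randn(100, 3)
y = np.array([0] * 90 + [1] * 10)

# class_weight="balanced" reweights samples inversely to class frequencies,
# which has a similar effect to balancing the dataset by resampling.
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)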
All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made.
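A sketch of avoiding that copy by converting the data up front (the array shapes here are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = np.asarray(rng.rand(100, 4), dtype=np.float32)  # already float32, so no internal copy
y = rng.randint(0, 2, size=100)

DecisionTreeClassifier(random_state=0).fit(X, y)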
If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criteria such as min_weight_fraction_leaf, which ensures that leaf nodes contain at least a fraction of the overall sum of the sample weights.
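A sketch of weight-based pre-pruning, assuming arbitrary per-sample weights:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.randint(0, 2, size=200)
sample_weight = rng.rand(200)  # arbitrary per-sample weights

# Require each leaf to hold at least 5% of the total sample weight.
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.05, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)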
Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data.
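A sketch contrasting the two ends of that trade-off on the same synthetic data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# min_samples_leaf=1 lets leaves memorise single samples (prone to overfitting);
# min_samples_leaf=50 forces each leaf to summarise many samples (may underfit).
for leaf in (1, 50):
    clf = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X, y)
    print(leaf, clf.get_depth(), clf.score(X, y))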
Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and a label vector $y \in \mathbb{R}^l$, a decision tree recursively partitions the feature space such that the samples with the same labels or similar target values are grouped together.
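A small sketch of this partitioning: after fitting a shallow regression tree on 1-D data, apply() reports which leaf each sample falls into, and samples with similar targets end up sharing a leaf.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])

tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
print(tree.apply(X))  # e.g. [1 1 1 2 2 2]: samples with similar targets share a leaf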
Simple to understand and to interpret. Trees can be visualised (see the sketch after this list).
Requires little data preparation. Other techniques often require data normalisation, dummy variables to be created, and blank values to be removed.
The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
Able to handle multi-output problems.
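The visualisation mentioned above can be done with scikit-learn's own helpers; a minimal sketch on the iris dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Text view of the learned rules.
print(export_text(clf, feature_names=list(iris.feature_names)))

# Graphical view of the tree.
plot_tree(clf, feature_names=iris.feature_names, class_names=list(iris.target_names))
plt.show()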
Decision tree learners can create over-complex trees that do not generalise the data well, which leads to overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This issue is mitigated by using decision trees within an ensemble.
Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
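For the instability point above, a common mitigation is to average many trees; a sketch comparing a single tree with a random forest (one possible ensemble) on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A single tree can change a lot when the data changes slightly;
# averaging many randomised trees smooths that variability out.
single_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(single_tree.score(X, y), forest.score(X, y))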