Decision Tree
A decision tree is a flowchart-like structure used to make decisions or predictions. It is a decision-support tool built by recursive partitioning: a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
The decision tree is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
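For a concrete starting point, here is a minimal sketch that trains a decision tree classifier with scikit-learn; the iris dataset and the `max_depth` value are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a tree; max_depth=3 is an illustrative choice, not a tuned value.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
```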
Structure
Nodes: represent decisions or tests on attributes.
- Root node: represents the entire dataset and the initial decision to be made.
- Internal node: represents a decision or test on an attribute. Each internal node has two or more branches.
- Leaf node: represents the final decision or prediction. No further splits occur at these nodes.
Branch: represents the outcome of a decision or test, leading to another node.
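To make this structure concrete, here is a minimal sketch of one way a binary decision tree could be represented in Python; the `Node` class and the `predict_one` helper are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a binary decision tree (illustrative sketch)."""
    feature: Optional[int] = None      # index of the attribute tested here
    threshold: Optional[float] = None  # go left if x[feature] <= threshold
    left: Optional["Node"] = None      # branch taken when the test is true
    right: Optional["Node"] = None     # branch taken when the test is false
    value: Optional[float] = None      # prediction stored at a leaf node

def predict_one(node: Node, x) -> float:
    """Follow branches from the root until a leaf node is reached."""
    while node.value is None:          # internal nodes keep testing attributes
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value                  # leaf node: final prediction
```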
Metrics
- Gini impurity
\[Gini = 1 - \sum_{i = 1}^{n} p_i^{2}\]
Measures the likelihood that a randomly chosen instance would be misclassified if it were labeled according to the class distribution of the dataset, where \(p_i\) is the probability of an instance being classified into a particular class.
- Entropy
\[Entropy = -\sum_{i = 1}^{n} p_i \log_2{p_i}\]
Measures the amount of uncertainty or impurity in the dataset, where \(p_i\) is the probability of an instance being classified into a particular class.
- Information gain
Measures the reduction in entropy or Gini impurity after a dataset is split on an attribute. For a parent node with \(N\) samples split into children with \(N_k\) samples each:
\[IG = Entropy(parent) - \sum_{k} \frac{N_k}{N}\, Entropy(child_k)\]
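The following is a minimal sketch of these metrics computed from per-class counts with NumPy; the function names are illustrative and simply implement the formulas above.

```python
import numpy as np

def gini(counts):
    """Gini impurity from per-class counts: 1 - sum(p_i^2)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy from per-class counts: -sum(p_i * log2(p_i))."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    """Entropy reduction achieved by splitting the parent into children."""
    n = sum(sum(c) for c in children_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# Example: an even 50/50 node has maximal impurity for two classes.
print(gini([5, 5]))                                # 0.5
print(entropy([5, 5]))                             # 1.0
print(information_gain([5, 5], [[5, 0], [0, 5]]))  # 1.0 (perfect split)
```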
Advantages
- Simplicity and Interpretability
  - Easy to understand and interpret
  - Visual representation mirrors human decision-making processes
  - White-box model: conditions and results can be explained using Boolean logic (see the sketch after this list)
- Minimal Data Preparation
  - No need for normalization or scaling
  - No need for dummy variables
  - Can handle missing values (depending on the algorithm)
- Versatility
  - Suitable for both classification and regression tasks
  - Can handle both numerical and categorical data
  - Supports multi-output problems
- Efficiency
  - Prediction cost is logarithmic in the number of training data points
- Robustness
  - Performs well even if its assumptions are somewhat violated
- Validation-Friendly
  - Models can be validated using statistical tests, making it possible to account for their reliability
- Non-linear Relationships
  - Capable of capturing complex, non-linear relationships between features and target variables
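To illustrate the white-box property noted above, here is a minimal sketch that prints a fitted tree's decision rules as nested if/else conditions using scikit-learn's `export_text`; the dataset and depth are arbitrary choices for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed rules stay short and readable.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders each split as a human-readable boolean condition.
print(export_text(clf, feature_names=list(iris.feature_names)))
```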
Disadvantages
- Overfitting
  - Trees can become overly complex and fail to generalize well
  - Mitigation: pruning, setting a minimum number of samples per leaf node, or limiting the maximum tree depth (see the sketch after this list)
- Instability
  - Small changes in the data can lead to completely different tree structures
  - Mitigation: use ensemble methods, such as Random Forests or Gradient Boosting
- Limited Extrapolation
  - Predictions are piecewise constant and not smooth
  - Poor performance when extrapolating beyond the range of the training data
- Suboptimal Solutions
  - Finding the optimal decision tree is NP-complete
  - Practical algorithms rely on heuristics, such as greedy locally optimal splits
  - Mitigation: train multiple trees using ensemble methods
- Difficulty with Certain Patterns
  - Struggles to learn patterns such as XOR, parity, or multiplexer problems
- Bias Toward Dominant Classes
  - Decision trees may become biased if certain classes dominate the dataset
  - Mitigation: balance the dataset before training
- Feature Bias
  - Features with more levels/categories may dominate the tree structure
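As a sketch of the mitigations mentioned above, the example below compares an unconstrained tree, a regularized tree (depth and leaf-size limits plus cost-complexity pruning via `ccp_alpha`), and a Random Forest ensemble under cross-validation; the hyperparameter values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Unconstrained tree: grows until leaves are pure, prone to overfitting.
    "full tree": DecisionTreeClassifier(random_state=0),
    # Regularized tree: depth/leaf-size limits and cost-complexity pruning.
    "pruned tree": DecisionTreeClassifier(
        max_depth=4, min_samples_leaf=5, ccp_alpha=0.01, random_state=0
    ),
    # Ensemble: averaging many randomized trees reduces variance/instability.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```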
Applications
- Business Decision Making
- Healthcare
- Finance
- Marketing