Shaputa: A Feature Selection Method That Combines SHAP and Boruta
Shaputa is a hybrid feature selection technique that combines SHAP (SHapley Additive exPlanations) with Boruta’s shadow-feature mechanism. By constructing a random baseline for every feature and comparing it with model-derived SHAP importance, Shaputa produces more robust and structure-aware feature selection results in high-dimensional, complex datasets. It is especially useful in nonlinear settings where traditional methods often struggle.
Traditional Feature Selection Paradigms
Classical feature selection methods generally fall into three categories: filter, wrapper, and embedded methods.
Filter methods score features independently based on their own statistical properties. Common examples include:
- Correlation-based Feature Selection (CFS): selects features that are highly correlated with the target but weakly correlated with one another.
- Chi-squared test: measures dependency between categorical features and the target.
- Information Gain: evaluates how much uncertainty is reduced after splitting on a feature.
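The filter scores above are easy to try in practice. The sketch below uses scikit-learn's `chi2` and `mutual_info_classif` (mutual information is a standard estimator of information gain); the dataset is just an illustrative stand-in.

```python
# Sketch: scoring features with two common filter methods in scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-squared requires non-negative features (the iris measurements are).
chi2_scores, p_values = chi2(X, y)

# Information gain, estimated here via mutual information.
mi_scores = mutual_info_classif(X, y, random_state=0)

for i, (c, m) in enumerate(zip(chi2_scores, mi_scores)):
    print(f"feature {i}: chi2={c:.2f}, mutual_info={m:.2f}")
```

Each feature is scored independently of the others, which is exactly the strength (speed) and the weakness (blindness to interactions) of the filter paradigm.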
Wrapper methods treat the learning algorithm as a black box and search for the optimal feature subset by observing performance changes. Stepwise regression is a typical greedy wrapper method. It can capture interactions that filter methods may miss, but the computational cost is high.
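A minimal wrapper example, assuming scikit-learn's `SequentialFeatureSelector`: like stepwise regression, it greedily adds the feature that most improves cross-validated performance, treating the model as a black box.

```python
# Sketch: greedy forward wrapper selection with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start empty, repeatedly add the feature whose
# inclusion gives the best cross-validated score.
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```

Note the cost: selecting 5 of 30 features already requires 140 candidate fits per CV fold, which is why wrappers scale poorly to high dimensions.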
Embedded methods perform feature selection during model training itself, including:
- Lasso: L1 regularization shrinks some coefficients to zero, producing sparse selection.
- Ridge regression: L2 regularization shrinks coefficients toward zero but rarely sets them exactly to zero, so it dampens weak features rather than removing them outright.
- Decision-tree feature importance: measures contribution through impurity reduction.
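Lasso's sparsity-inducing behavior is easy to see on synthetic data. This sketch (dataset parameters are illustrative) builds a regression problem where only 5 of 20 features carry signal and shows the L1 penalty zeroing out most of the rest.

```python
# Sketch: embedded selection via Lasso's L1 penalty.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives weak coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"{selected.size} of 20 features kept:", selected)
```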
However, when the number of features is large, relationships are complex, and nonlinearity is strong, these methods often become unstable or fail to capture deeper structure.
A Random Baseline Through Shadow Features
Shaputa borrows Boruta’s key idea: create a random control group for each original feature to form a natural baseline. The procedure includes:
- Generate shadow features: randomly permute the values of each original feature to destroy its relationship with the target and create a random baseline.
- Build an expanded feature space: concatenate the original features with their shadow counterparts so the model compares both under the same input space.
- Extract a dynamic threshold: use the highest importance among the shadow features as the significance threshold.
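The three steps above can be sketched in a few lines. Since the original text does not specify an implementation, this version uses a random forest's impurity importances as a simple stand-in for the importance measure; all names are illustrative.

```python
# Sketch of the shadow-feature mechanism: permute, concatenate, threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

# 1) Shadow features: permute each column to break its link with the target.
X_shadow = np.apply_along_axis(rng.permutation, 0, X)

# 2) Expanded feature space: originals and shadows side by side.
X_expanded = np.hstack([X, X_shadow])

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_expanded, y)
importances = model.feature_importances_

# 3) Dynamic threshold: the best-performing shadow feature.
threshold = importances[X.shape[1]:].max()
kept = np.flatnonzero(importances[:X.shape[1]] > threshold)
print("threshold:", round(threshold, 4), "kept features:", kept)
```

Any original feature that cannot beat the best random shadow is, by construction, indistinguishable from noise for this model.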
This mechanism:
- substantially reduces selection bias by comparing features against a random baseline;
- makes the threshold entirely data-driven rather than manually chosen;
- effectively distinguishes truly useful features from those that only appear informative but are actually noise.
The Role of SHAP in Shaputa
SHAP values quantify each feature’s marginal contribution to model predictions and provide a theoretically consistent, fine-grained basis for Shaputa’s importance estimates.
Compared with permutation importance, SHAP offers several advantages:
- Consistency and interpretability: grounded in cooperative game theory, SHAP satisfies local accuracy and consistency, making importance values more trustworthy.
- Captures feature interactions: by enumerating marginal contributions across feature subsets, SHAP naturally models complex interaction effects instead of relying on random perturbations.
- Lower estimation variance: importance estimates are more stable across repeated training or evaluation runs.
- Practical in high-dimensional settings: optimized algorithms such as TreeSHAP remain efficient even with many features.
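To make the "marginal contribution across feature subsets" idea concrete, here is a brute-force exact Shapley computation for one prediction of a toy model. Real pipelines use the shap library's optimized estimators (e.g. TreeSHAP); this didactic sketch, with an assumed baseline for absent features, only illustrates the definition.

```python
# Didactic sketch: exact Shapley values by enumerating all feature subsets.
import itertools
import math

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features absent from a coalition take baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        for r in range(n):
            for subset in itertools.combinations([j for j in range(n) if j != i], r):
                # Shapley weight for a coalition of size r (excluding feature i).
                weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Toy linear model: the Shapley value of feature j is exactly w_j * (x_j - baseline_j).
predict = lambda v: 2 * v[0] + 3 * v[1] - v[2]
phi = shapley_values(predict, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print([round(p, 6) for p in phi])  # → [2.0, 6.0, -3.0]
```

The values sum to the difference between the prediction and the baseline prediction, which is the "local accuracy" property mentioned above.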
Within Shaputa, SHAP values are used over multiple iterations to evaluate feature importance and compare it with the random baseline defined by the shadow features, enabling progressive convergence and screening.
Iterative Feature Selection Workflow
The core Shaputa workflow is as follows:
- Data expansion: create shadow features by randomly permuting the original features, then concatenate them with the originals into an expanded feature space.
- Model training: train a model, such as a Random Forest or XGBoost, on the expanded space.
- Feature importance evaluation: compute SHAP importance scores for all features, original and shadow alike.
- Multi-round iteration: repeat the steps above until the importance ranking stabilizes.
- Feature screening: keep the original features whose importance significantly exceeds the upper bound set by the shadow features.
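The whole loop can be sketched as follows. The original text gives no reference implementation, so the function name, the hit-count keep rule, and the use of impurity importances as a lightweight stand-in for SHAP scores (so the example runs without the shap package) are all illustrative choices.

```python
# Sketch of the iterative Shaputa-style loop (stand-in importance measure).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def shaputa_sketch(X, y, n_rounds=10, keep_ratio=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features)
    for r in range(n_rounds):
        # A fresh shadow permutation each round keeps the baseline honest.
        X_shadow = np.apply_along_axis(rng.permutation, 0, X)
        model = RandomForestClassifier(n_estimators=100, random_state=r)
        model.fit(np.hstack([X, X_shadow]), y)
        imp = model.feature_importances_
        threshold = imp[n_features:].max()   # dynamic shadow-based threshold
        hits += imp[:n_features] > threshold
    # Keep features that beat the shadow maximum in most rounds.
    return np.flatnonzero(hits >= keep_ratio * n_rounds)

# With shuffle=False, make_classification puts the informative columns first.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
kept = shaputa_sketch(X, y)
print("kept features:", kept)
```

Swapping the importance line for mean absolute SHAP values (via `shap.TreeExplainer`) recovers the SHAP-based variant described in the text.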
In essence, Shaputa uses a closed loop of “model learning → SHAP evaluation → comparison with a random baseline” so that feature selection gradually converges to a stable and reliable set of features.
Conclusion
Shaputa tightly integrates SHAP’s interpretability with Boruta’s random-baseline strategy, offering a more robust and more sensitive approach to feature selection for high-dimensional, nonlinear data with rich interaction effects. Compared with traditional methods, it not only improves selection quality but also has clear practical advantages in tree-model settings.
As modern datasets continue to grow in both dimensionality and complexity, hybrid, data-driven feature selection methods of this kind are likely to become an important part of future modeling workflows.