Simple Bootstrap and Simulation Approaches to Quantifying Reliability of High-Dimensional Feature Selection

big-data
bioinformatics
machine-learning
multiplicity
prediction
regression
validation
Feature selection in the large-p, non-large-n case is known to be unreliable, but most biomedical researchers are unaware of the magnitude of the problem. They assume, for example, that controlling the false discovery rate makes the results reliable, forgetting about the false negative rate and the decades of research showing that stepwise variable selection is unreliable even in the low-p case. A related problem is the unreliability of the estimated effect (e.g., an odds ratio) of a feature found by selecting ‘winners’. This talk will demonstrate some simple bootstrap and Monte Carlo simulation procedures for teaching biomedical researchers how to quantify these problems. One of the bootstrap examples exposes the difficulty of the task by computing confidence intervals for importance rankings of features.
Author
Published

July 31, 2018

ENAR-Sponsored Session, Joint Statistical Meetings Session 331: Statistical and Practical Issues for Reproducible Molecular Prediction in Biomedical Studies, Vancouver, BC
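
The abstract's last sentence mentions bootstrap confidence intervals for feature importance rankings. The following is a minimal sketch of that general idea, not the code used in the talk. Everything specific in it is an assumption made for illustration: the simulated data, the choice of the absolute marginal Pearson correlation as the importance measure, the simple percentile interval, and the hypothetical function names `ranks_by_importance` and `bootstrap_rank_intervals`.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def ranks_by_importance(X, y):
    """Rank features by |Pearson r| with y; rank 1 = apparently most important.

    The importance measure is an illustrative assumption; any other
    measure could be substituted here.
    """
    p = len(X[0])
    importance = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    order = sorted(range(p), key=lambda j: -importance[j])
    ranks = [0] * p
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def bootstrap_rank_intervals(X, y, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap intervals for each feature's importance rank."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    boot_ranks = [[] for _ in range(p)]
    for _ in range(n_boot):
        # Resample subjects (rows) with replacement, then re-rank all features.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for j, r in enumerate(ranks_by_importance(Xb, yb)):
            boot_ranks[j].append(r)
    alpha = 1 - level
    lo_idx = math.floor((alpha / 2) * (n_boot - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (n_boot - 1))
    intervals = []
    for j in range(p):
        s = sorted(boot_ranks[j])
        intervals.append((s[lo_idx], s[hi_idx]))
    return intervals


# Simulated data (an assumption): 60 subjects, 20 features,
# only the first three features have nonzero true effects.
rng = random.Random(1)
n, p = 60, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0.5 * row[0] + 0.4 * row[1] + 0.3 * row[2] + rng.gauss(0, 1) for row in X]

apparent = ranks_by_importance(X, y)
intervals = bootstrap_rank_intervals(X, y, n_boot=200)
for j in range(5):
    lo, hi = intervals[j]
    print(f"feature {j}: apparent rank {apparent[j]}, bootstrap interval [{lo}, {hi}]")
```

Comparing each feature's apparent rank with the width of its interval gives a direct picture of how much the ranking depends on the particular sample that was drawn. Re-running with a larger `n` or larger true effects shows how much information is needed before the intervals narrow.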