# On the Use of Random Forest for Two-Sample Testing

    @article{Hediger2019OnTU,
      title   = {On the Use of Random Forest for Two-Sample Testing},
      author  = {Simon Hediger and Loris Michel and Jeffrey Naf},
      journal = {arXiv: Methodology},
      year    = {2019}
    }

We follow the line of using classifiers for two-sample testing and propose several tests based on the Random Forest classifier. The developed tests are easy to use, require no tuning, and are applicable to any distribution on $\mathbb{R}^p$, even in high dimensions. We provide a comprehensive treatment of the use of classification for two-sample testing, derive the distribution of our tests under the null, and provide a power analysis, both in theory and with simulations. To simplify the use of…
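The core idea described in the abstract can be illustrated with a minimal sketch (not the authors' exact procedure; the function name and parameter choices here are ours): label the two samples 0 and 1, fit a random forest, and compare its out-of-bag accuracy against the chance level of 0.5. Under the null hypothesis that both samples come from the same distribution, no classifier should beat chance.

```python
# Minimal sketch of a classifier two-sample test with a random forest.
# Illustrative only: the paper derives the exact null distribution; here we
# use a simple one-sided binomial test on the out-of-bag (OOB) accuracy.
import numpy as np
from scipy.stats import binom
from sklearn.ensemble import RandomForestClassifier

def rf_two_sample_test(X, Y, seed=0):
    """Return (OOB accuracy, approximate one-sided p-value)."""
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]
    rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=seed)
    rf.fit(Z, labels)
    n = len(Z)
    correct = int(round(rf.oob_score_ * n))
    # P(Binomial(n, 0.5) >= correct): a small p-value suggests the
    # distributions differ, since the forest beats chance.
    p_value = binom.sf(correct - 1, n, 0.5)
    return rf.oob_score_, p_value

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(1.0, 1.0, size=(200, 5))  # mean-shifted alternative
acc, p = rf_two_sample_test(X, Y)
```

In this mean-shift example the forest separates the samples well above chance, so the p-value is small and the null is rejected.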


#### 10 Citations

PKLM: A flexible MCAR test using Classification

- Mathematics
- 2021

We develop a fully non-parametric, fast, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a data set. The test compares…

Global and local two-sample tests via regression

- Mathematics
- Electronic Journal of Statistics
- 2019

Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data.…

A Fast and Effective Large-Scale Two-Sample Test Based on Kernels

- Mathematics
- 2021

Kernel two-sample tests have been widely used, and the development of efficient methods for high-dimensional, large-scale data is gaining increasing attention as we enter the big-data era.…

Local Two-Sample Testing over Graphs and Point-Clouds by Random-Walk Distributions.

- Mathematics
- 2020

Two-sample testing is a fundamental tool for scientific discovery. Yet, aside from concluding that two samples do not come from the same probability distribution, it is often of interest to…

High Probability Lower Bounds for the Total Variation Distance

- Mathematics
- 2020

The statistics and machine learning communities have recently seen a growing interest in classification-based approaches to two-sample testing (e.g. Kim et al. [2016]; Rosenblatt et al. [2016];…

WMW-A: Rank-based two-sample independent test for small sample sizes through an auxiliary sample

- Biology
- 2021

Extensive simulation experiments and real applications on microarray gene expression data sets show that the WMW-A test can significantly improve test power for two-sample problems with small sample sizes, using either available unlabelled auxiliary data or generated auxiliary data.

Optimizing the synthesis of clinical trial data using sequential trees

- Medicine, Computer Science
- J. Am. Medical Informatics Assoc.
- 2021

The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets: it evaluates the variability in the utility of synthetic clinical trial data as the variable order is randomly shuffled, and searches for a good order when that variability is too high.

Evaluating the utility of synthetic COVID-19 case data

- Medicine
- JAMIA open
- 2021

A gradient boosted classification tree was built to predict death using Ontario's 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics, and the synthetic data could be used as a proxy for the real dataset.

Applying Kernel Change Point Detection to Financial Markets

- 2020


Test for non-negligible adverse shifts

- Mathematics, Computer Science
- ArXiv
- 2021

This work proposes a framework to detect adverse shifts based on outlier scores, D-SOS, which is uniquely tailored to serve as a robust metric for model monitoring and data validation.

#### References

Showing 1–10 of 36 references

Classification Accuracy as a Proxy for Two Sample Testing

- Mathematics, Computer Science
- ArXiv
- 2016

This work proves two results that hold for all classifiers in any dimension: if a classifier's true error remains $\epsilon$-better than chance for some $\epsilon > 0$ as $d, n \to \infty$, then (a) the permutation-based test is consistent (has power approaching one), and (b) a computationally efficient test based on a Gaussian approximation of the null distribution is also consistent.

Revisiting Classifier Two-Sample Tests

- Mathematics, Computer Science
- ICLR
- 2017

The properties, performance, and uses of classifier two-sample tests (C2ST) are established, their main theoretical properties are analyzed, and their use to evaluate the sample quality of generative models with intractable likelihoods, such as generative adversarial networks, is proposed.

Consistency of Random Forests and Other Averaging Classifiers

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2008

A number of theorems are given that establish the universal consistency of averaging rules, and it is shown that some popular classifiers, including one suggested by Breiman, are not universally consistent.

Optimal kernel choice for large-scale two-sample tests

- Computer Science, Mathematics
- NIPS
- 2012

The new kernel selection approach yields a more powerful test than earlier kernel selection heuristics, and makes the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory.

A Kernel Two-Sample Test

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2012

This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine whether two samples are drawn from different distributions, and presents two distribution-free tests based on large-deviation bounds for the maximum mean discrepancy (MMD).
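The MMD referenced here can be illustrated with a short sketch (a biased, quadratic-time estimate with a Gaussian kernel; the paper's unbiased estimators and test thresholds are more involved, and the function names below are ours):

```python
# Illustrative biased estimate of the squared maximum mean discrepancy
# (MMD^2) between two samples, using a Gaussian kernel.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise squared distances via broadcasting, then Gaussian kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    # Biased V-statistic: mean within-sample similarity minus
    # cross-sample similarity; near zero when the distributions match.
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(1)
same = mmd2_biased(rng.normal(size=(100, 3)), rng.normal(size=(100, 3)))
diff = mmd2_biased(rng.normal(size=(100, 3)),
                   rng.normal(2.0, 1.0, size=(100, 3)))
```

A shifted alternative yields a clearly larger statistic than two draws from the same distribution; turning this into a test requires a null threshold, e.g. from a permutation scheme or the large-deviation bounds the paper derives.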

Fast Two-Sample Testing with Analytic Representations of Probability Measures

- Computer Science, Mathematics
- NIPS
- 2015

A class of nonparametric two-sample tests with cost linear in the sample size, based on an ensemble of distances between analytic functions representing each distribution, that gives a better power/time tradeoff than competing approaches and, in some cases, better outright power than even the most expensive quadratic-time tests.

An Empirical Study of Learning from Imbalanced Data Using Random Forest

- Computer Science
- 2007

A comprehensive suite of experiments analyzing the performance of the random forest (RF) learner implemented in Weka is discussed, providing an extensive empirical evaluation of RF learners built from imbalanced data.

B-test: A Non-parametric, Low Variance Kernel Two-sample Test

- Mathematics, Computer Science
- NIPS
- 2013

The B-test uses a smaller-than-quadratic number of kernel evaluations and completely avoids the computational burden of complex null-hypothesis approximation, while maintaining consistency and probabilistically conservative thresholds on Type I error.

Random Forests

- Mathematics, Computer Science
- Machine Learning
- 2004

Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.

Do we need hundreds of classifiers to solve real world classification problems?

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2014

The random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks, and boosting ensembles (5 and 3 members in the top-20, respectively).