19 releases (7 stable)

1.1.4 Jan 8, 2024
1.1.1 Oct 3, 2023
1.0.1 Jun 1, 2023
0.7.2 May 9, 2022
0.6.0 Mar 17, 2022

#45 in Machine learning

BSD-3-Clause

355KB
2K SLoC

Random Forests for Change Point Detection

Change point detection aims to identify structural breaks in the probability distribution of a time series. Existing methods either assume a parametric model for within-segment distributions or are based on ranks or distances and thus fail in scenarios with a reasonably large dimensionality.

changeforest implements a classifier-based algorithm that consistently estimates change points without any parametric assumptions, even in high-dimensional scenarios. It uses the out-of-bag probability predictions of a random forest to construct a classifier log-likelihood ratio that gets optimized using a computationally feasible two-step method.

See [1] for details.

changeforest is available as rust crate, a Python package (on PyPI and conda-forge), and an R package (on conda-forge , linux and MacOS only). See below for their respective user guides.

Python

Installation

To install from conda-forge (recommended), run

conda install -c conda-forge changeforest

To install from PyPI, run

pip install changeforest

Example

In the following example, we perform random forest-based change point detection on a simulated dataset with n=600 observations and covariance shifts at t=200, 400.

In [1]: import numpy as np
   ...: 
   ...: Sigma = np.full((5, 5), 0.7)
   ...: np.fill_diagonal(Sigma, 1)
   ...: 
   ...: rng = np.random.default_rng(12)
   ...: X = np.concatenate(
   ...:     (
   ...:         rng.normal(0, 1, (200, 5)),
   ...:         rng.multivariate_normal(np.zeros(5), Sigma, 200, method="cholesky"),
   ...:         rng.normal(0, 1, (200, 5)),
   ...:     ),
   ...:     axis=0,
   ...: )

The simulated dataset X coincides with the change in covariance (CIC) setup described in [1]. Observations in the first and last segments are independently drawn from a standard multivariate Gaussian distribution. Observations in the second segment are i.i.d. normal with mean zero and unit variance, but with a covariance of ρ = 0.7 between coordinates. This is a challenging scenario.

In [2]: from changeforest import changeforest
   ...: 
   ...: result = changeforest(X, "random_forest", "bs")
   ...: result
Out[2]: 
                    best_split max_gain p_value
(0, 600]                   400   14.814   0.005
 ¦--(0, 400]               200   59.314   0.005
 ¦   ¦--(0, 200]             6    -1.95    0.67
 ¦   °--(200, 400]         393   -8.668    0.81
 °--(400, 600]             412   -9.047    0.66

In [3]: result.split_points()
Out[3]: [200, 400]

changeforest correctly identifies the change points at t=200 and t=400. The changeforest function returns a BinarySegmentationResult. We use its plot method to investigate the gain curves maximized by the change point estimates:

In [4]: result.plot().show()

Change point estimates are marked in red.

For method="random_forest" and method="knn", the changeforest algorithm uses a two-step approach to find an optimizer of the gain. This fits a classifier for three split candidates at the segment's 1/4, 1/2 and 3/4 quantiles, computes approximate gain curves using the resulting classifier log-likelihood ratios and selects the overall optimizer as a second guess. We can investigate the gain curves from the optimizer using the plot method of OptimizerResult. The initial guesses are marked in blue.

In [5]: result.optimizer_result.plot().show()

One can observe that the approximate gain curves are piecewise linear, with maxima around the true underlying change points.

The BinarySegmentationResult returned by changeforest is a tree-like object with attributes start, stop, best_split, max_gain, p_value, is_significant, optimizer_result, model_selection_result, left, right and segments. These can be interesting to investigate the output of the algorithm further.

The changeforest algorithm can be tuned with hyperparameters. See here for their descriptions and default values. In Python, the parameters can be specified with the Control class, which can be passed to changeforest. The following will build random forests with 50 trees:

In [6]: from changeforest import Control
   ...: changeforest(X, "random_forest", "bs", Control(random_forest_n_estimators=50))
Out[6]: 
                    best_split max_gain p_value
(0, 600]                   416    7.463    0.01
 ¦--(0, 416]               200   43.935   0.005
 ¦   ¦--(0, 200]           193  -14.993   0.945
 ¦   °--(200, 416]         217    -9.13   0.085
 °--(416, 600]             591   -12.07       1 

The changeforest algorithm still detects change points at t=200, but is slightly off with t=416.

Due to the nature of the change, method="change_in_mean" is unable to detect any change points at all:

In [7]: changeforest(X, "change_in_mean", "bs")
Out[7]: 
          best_split max_gain p_value
(0, 600]         589    8.625  

R

To install from conda-forge, run

conda install -c conda-forge r-changeforest

See here for a detailed description on installing the changeforest R package with conda.

Example

In the following example, we perform random forest-based change point detection on a simulated dataset with n=600 observations and covariance shifts at t=200, 400.

> library(MASS)

> set.seed(0)
> Sigma = matrix(0.7, nrow=5, ncol=5)
> diag(Sigma) = 1
> mu = rep(0, 5)
> X = rbind(
    mvrnorm(n=200, mu=mu, Sigma=diag(5)),
    mvrnorm(n=200, mu=mu, Sigma=Sigma),
    mvrnorm(n=200, mu=mu, Sigma=diag(5))
)

The simulated dataset X coincides with the change in covariance (CIC) setup described in [1]. Observations in the first and last segments are independently drawn from a standard multivariate Gaussian distribution. Observations in the second segment are i.i.d. normal with mean zero and unit variance, but with a covariance of ρ = 0.7 between coordinates. This is a challenging scenario.

> library(changeforest)

> result = changeforest(X, "random_forest", "bs")
> result
                 name best_split  max_gain p_value is_significant
1 (0, 600]                   410  13.49775   0.005           TRUE
2  ¦--(0, 410]               199  61.47201   0.005           TRUE
3  ¦    ¦--(0, 199]          192 -22.47364   0.955          FALSE
4  ¦    °--(199, 410]        396  11.50559   0.190          FALSE
5  °--(410, 600]             416 -23.52932   0.965          FALSE

> result$split_points()
[1] 199 410

changeforest correctly identifies the change point around t=200 but is slightly off at t=410. The changeforest function returns an object of class binary_segmentation_result. We use its plot method to investigate the gain curves maximized by the change point estimates:

> plot(result)

Change point estimates are marked in red.

For method="random_forest" and method="knn", the changeforest algorithm uses a two-step approach to find an optimizer of the gain. This fits a classifier for three split candidates at the segment's 1/4, 1/2 and 3/4 quantiles computes approximate gain curves using the resulting classifier log-likelihood ratios and selects the overall optimizer as a second guess. We can investigate the gain curves from the optimizer using the plot method of optimizer_result. The initial guesses are marked in blue.

> plot(result$optimizer_result)

One can observe that the approximate gain curves are piecewise linear, with maxima around the true underlying change points.

The binary_segmentation_result object returned by changeforest is a tree-like object with attributes start, stop, best_split, max_gain, p_value, is_significant, optimizer_result, model_selection_result, left, right and segments. These can be interesting to investigate the output of the algorithm further.

The changeforest algorithm can be tuned with hyperparameters. See here for their descriptions and default values. In R, the parameters can be specified with the Control class, which can be passed to changeforest. The following will build random forests with 20 trees:

> changeforest(X, "random_forest", "bs", Control$new(random_forest_n_estimators=20))
                         name best_split   max_gain p_value is_significant
1 (0, 600]                            15  -6.592136   0.010           TRUE
2  ¦--(0, 15]                          6 -18.186534   0.935          FALSE
3  °--(15, 600]                      561  -4.282799   0.005           TRUE
4      ¦--(15, 561]                  116  -8.084126   0.005           TRUE
5      ¦    ¦--(15, 116]              21 -17.780523   0.130          FALSE
6      ¦    °--(116, 561]            401  11.782002   0.005           TRUE
7      ¦        ¦--(116, 401]        196  22.792401   0.150          FALSE
8      ¦        °--(401, 561]        554 -16.338703   0.800          FALSE
9      °--(561, 600]                 568  -5.230075   0.120          FALSE    

The changeforest algorithm still detects the change point around t=200 but also returns false positives.

Due to the nature of the change, method="change_in_mean" is unable to detect any change points at all:

> changeforest(X, "change_in_mean", "bs")
      name best_split max_gain p_value is_significant
1 (0, 600]        498 17.29389      NA          FALSE

References

[1] M. Londschien, P. Bühlmann and S. Kovács (2023). "Random Forests for Change Point Detection" Journal of Machine Learning Research

Dependencies

~3MB
~58K SLoC