# Random Forests for Change Point Detection

Change point detection aims to identify structural breaks in the probability distribution of a time series. Existing methods either assume a parametric model for within-segment distributions or are based on ranks or distances and thus fail in scenarios with a reasonably large dimensionality.

`changeforest` implements a classifier-based algorithm that consistently estimates change points without any parametric assumptions, even in high-dimensional scenarios. It uses the out-of-bag probability predictions of a random forest to construct a classifier log-likelihood ratio that gets optimized using a computationally feasible two-step method.

See [1] for details.
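To make the classifier log-likelihood ratio concrete, here is a minimal sketch of the idea for a single candidate split, not the crate's implementation: label each observation by whether it falls after the split, fit a random forest, and score the split by how far the out-of-bag class probabilities deviate from the class proportions. The helper name `classifier_gain`, the use of scikit-learn, and the exact normalization are assumptions on our part; see [1] for the precise definition.

```python
# Minimal sketch of a classifier log-likelihood ratio gain -- an illustration
# under the assumptions stated above, not the changeforest implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def classifier_gain(X, split, seed=0):
    """Hypothetical helper: gain for splitting the segment X at `split`."""
    n = len(X)
    y = (np.arange(n) >= split).astype(int)  # 1 = observation lies after the split
    forest = RandomForestClassifier(
        n_estimators=500, oob_score=True, random_state=seed
    ).fit(X, y)
    # Out-of-bag probability that each observation lies after the split.
    p = np.clip(forest.oob_decision_function_[:, 1], 1e-6, 1 - 1e-6)
    pi = 1 - split / n  # class proportion of "after the split"
    # Compare predictions to the constant prediction `pi` that a classifier
    # would make if both sides of the split had the same distribution.
    return np.sum(np.log((1 - p[:split]) / (1 - pi))) + np.sum(np.log(p[split:] / pi))
```

If there is no change, the out-of-bag probabilities hover around `pi` and the gain stays near zero; a distributional change lets the forest separate the two sides, and the gain grows.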

`changeforest` is available as a Rust crate, a Python package (on `PyPI` and `conda-forge`), and an R package (on `conda-forge`, Linux and macOS only). See below for their respective user guides.

## Python

### Installation

To install from `conda-forge` (recommended), run

```
conda install -c conda-forge changeforest
```

To install from `PyPI`, run

```
pip install changeforest
```

### Example

In the following example, we perform random forest-based change point detection on a simulated dataset with `n = 600` observations and covariance shifts at `t = 200, 400`.

```python
In [1]: import numpy as np
   ...:
   ...: Sigma = np.full((5, 5), 0.7)
   ...: np.fill_diagonal(Sigma, 1)
   ...:
   ...: rng = np.random.default_rng(12)
   ...: X = np.concatenate(
   ...:     (
   ...:         rng.normal(0, 1, (200, 5)),
   ...:         rng.multivariate_normal(np.zeros(5), Sigma, 200, method="cholesky"),
   ...:         rng.normal(0, 1, (200, 5)),
   ...:     ),
   ...:     axis=0,
   ...: )
```

The simulated dataset `X` coincides with the *change in covariance* (CIC) setup described in [1]. Observations in the first and last segments are independently drawn from a standard multivariate Gaussian distribution. Observations in the second segment are i.i.d. normal with mean zero and unit variance, but with a covariance of ρ = 0.7 between coordinates. This is a challenging scenario.
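As a quick, optional sanity check of this setup (our addition, not part of the original example), the mean off-diagonal correlation should be close to 0 in the outer segments and close to 0.7 in the middle one. This snippet assumes the `X` from the block above:

```python
# Hypothetical sanity check: mean off-diagonal correlation per segment.
for name, segment in [("first", X[:200]), ("second", X[200:400]), ("third", X[400:])]:
    corr = np.corrcoef(segment, rowvar=False)
    print(name, round(corr[~np.eye(5, dtype=bool)].mean(), 2))
```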

```python
In [2]: from changeforest import changeforest
   ...:
   ...: result = changeforest(X, "random_forest", "bs")
   ...: result
Out[2]:
                    best_split  max_gain  p_value
(0, 600]                   400    14.814    0.005
 ¦--(0, 400]               200    59.314    0.005
 ¦   ¦--(0, 200]             6     -1.95     0.67
 ¦   °--(200, 400]         393    -8.668     0.81
 °--(400, 600]             412    -9.047     0.66

In [3]: result.split_points()
Out[3]: [200, 400]
```

`changeforest` correctly identifies the change points at `t = 200` and `t = 400`. The `changeforest` function returns a `BinarySegmentationResult`. We use its `plot` method to investigate the gain curves maximized by the change point estimates:

```python
In [4]: result.plot().show()
```

Change point estimates are marked in red.

For `method="random_forest"` and `method="knn"`, the `changeforest` algorithm uses a two-step approach to find an optimizer of the gain. This fits a classifier for three split candidates at the segment's 1/4, 1/2 and 3/4 quantiles, computes approximate gain curves using the resulting classifier log-likelihood ratios and selects the overall optimizer as a second guess. We can investigate the gain curves from the optimizer using the `plot` method of `OptimizerResult`. The initial guesses are marked in blue.

```python
In [5]: result.optimizer_result.plot().show()
```

One can observe that the approximate gain curves are piecewise linear, with maxima around the true underlying change points.
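Under the same assumptions as the `classifier_gain` sketch above, the two-step search could be sketched as follows. The key point is that a single classifier fit at an initial guess yields an approximate gain value for every split, so only a handful of fits are needed. The exact form of the curve is our reading of [1], so treat this as illustrative:

```python
# Rough sketch of the two-step search -- illustrative, not the crate's code.
def approximate_gain_curve(X, guess, seed=0):
    """Approximate gain for every split, from one classifier fit at `guess`."""
    n = len(X)
    y = (np.arange(n) >= guess).astype(int)
    forest = RandomForestClassifier(
        n_estimators=500, oob_score=True, random_state=seed
    ).fit(X, y)
    p = np.clip(forest.oob_decision_function_[:, 1], 1e-6, 1 - 1e-6)
    pi = 1 - guess / n
    # Moving the split right by one observation changes the log-likelihood
    # ratio by a per-observation term, so the whole curve is a cumulative sum.
    terms = np.log((1 - p) / (1 - pi)) - np.log(p / pi)
    return np.sum(np.log(p / pi)) + np.concatenate(([0.0], np.cumsum(terms)))


n = len(X)
# Step 1: gain curves at the 1/4, 1/2 and 3/4 quantile guesses; keep the
# overall maximizer as the second guess.
curves = [approximate_gain_curve(X, s0) for s0 in (n // 4, n // 2, 3 * n // 4)]
second_guess = int(max(curves, key=np.max).argmax())
# Step 2: refit at the second guess and take the maximizer of its curve.
estimate = int(approximate_gain_curve(X, second_guess).argmax())
```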

The `BinarySegmentationResult` returned by `changeforest` is a tree-like object with attributes `start`, `stop`, `best_split`, `max_gain`, `p_value`, `is_significant`, `optimizer_result`, `model_selection_result`, `left`, `right` and `segments`. These can be useful for investigating the output of the algorithm further.
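For example, one can walk the segmentation tree recursively via `left` and `right`; a small sketch (our assumption, not verified against a specific version: `left` and `right` are `None` on leaf segments):

```python
# Sketch: print the segmentation tree using the attributes listed above.
def walk(node, depth=0):
    print("  " * depth + f"({node.start}, {node.stop}] "
          f"best_split={node.best_split} significant={node.is_significant}")
    for child in (node.left, node.right):
        if child is not None:  # assumed leaf convention
            walk(child, depth + 1)


walk(result)
```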

The `changeforest` algorithm can be tuned with hyperparameters. See here for their descriptions and default values. In Python, the parameters can be specified with the `Control` class, which can be passed to `changeforest`. The following will build random forests with 50 trees:

```python
In [6]: from changeforest import Control
   ...: changeforest(X, "random_forest", "bs", Control(random_forest_n_estimators=50))
Out[6]:
                    best_split  max_gain  p_value
(0, 600]                   416     7.463     0.01
 ¦--(0, 416]               200    43.935    0.005
 ¦   ¦--(0, 200]           193   -14.993    0.945
 ¦   °--(200, 416]         217     -9.13    0.085
 °--(416, 600]             591    -12.07        1
```

The `changeforest` algorithm still detects the change point at `t = 200`, but is slightly off with `t = 416`.

Due to the nature of the change (the mean is zero in every segment), `method="change_in_mean"` is unable to detect any change points at all:

```python
In [7]: changeforest(X, "change_in_mean", "bs")
Out[7]:
          best_split  max_gain  p_value
(0, 600]         589     8.625
```

## R

To install from `conda-forge`, run

```
conda install -c conda-forge r-changeforest
```

See here for a detailed description of installing the `changeforest` R package with `conda`.

### Example

In the following example, we perform random forest-based change point detection on a simulated dataset with `n = 600` observations and covariance shifts at `t = 200, 400`.

```r
> library(MASS)
> set.seed(0)
> Sigma = matrix(0.7, nrow=5, ncol=5)
> diag(Sigma) = 1
> mu = rep(0, 5)
> X = rbind(
    mvrnorm(n=200, mu=mu, Sigma=diag(5)),
    mvrnorm(n=200, mu=mu, Sigma=Sigma),
    mvrnorm(n=200, mu=mu, Sigma=diag(5))
  )
```

The simulated dataset `X` coincides with the *change in covariance* (CIC) setup described in [1]. Observations in the first and last segments are independently drawn from a standard multivariate Gaussian distribution. Observations in the second segment are i.i.d. normal with mean zero and unit variance, but with a covariance of ρ = 0.7 between coordinates. This is a challenging scenario.

```r
> library(changeforest)
> result = changeforest(X, "random_forest", "bs")
> result
                name best_split  max_gain p_value is_significant
1 (0, 600]                  410  13.49775   0.005           TRUE
2  ¦--(0, 410]              199  61.47201   0.005           TRUE
3  ¦   ¦--(0, 199]          192 -22.47364   0.955          FALSE
4  ¦   °--(199, 410]        396  11.50559   0.190          FALSE
5  °--(410, 600]            416 -23.52932   0.965          FALSE
> result$split_points()
[1] 199 410
```

`changeforest` correctly identifies the change point around `t = 200` but is slightly off at `t = 410`. The `changeforest` function returns an object of class `binary_segmentation_result`. We use its `plot` method to investigate the gain curves maximized by the change point estimates:

```r
> plot(result)
```

Change point estimates are marked in red.

For `method="random_forest"` and `method="knn"`, the `changeforest` algorithm uses a two-step approach to find an optimizer of the gain. This fits a classifier for three split candidates at the segment's 1/4, 1/2 and 3/4 quantiles, computes approximate gain curves using the resulting classifier log-likelihood ratios and selects the overall optimizer as a second guess. We can investigate the gain curves from the optimizer using the `plot` method of `optimizer_result`. The initial guesses are marked in blue.

```r
> plot(result$optimizer_result)
```

One can observe that the approximate gain curves are piecewise linear, with maxima around the true underlying change points.

The `binary_segmentation_result` object returned by `changeforest` is a tree-like object with attributes `start`, `stop`, `best_split`, `max_gain`, `p_value`, `is_significant`, `optimizer_result`, `model_selection_result`, `left`, `right` and `segments`. These can be useful for investigating the output of the algorithm further.

The `changeforest` algorithm can be tuned with hyperparameters. See here for their descriptions and default values. In R, the parameters can be specified with the `Control` class, which can be passed to `changeforest`. The following will build random forests with 20 trees:

```r
> changeforest(X, "random_forest", "bs", Control$new(random_forest_n_estimators=20))
                      name best_split   max_gain p_value is_significant
1 (0, 600]                         15  -6.592136   0.010           TRUE
2  ¦--(0, 15]                       6 -18.186534   0.935          FALSE
3  °--(15, 600]                   561  -4.282799   0.005           TRUE
4      ¦--(15, 561]               116  -8.084126   0.005           TRUE
5      ¦   ¦--(15, 116]            21 -17.780523   0.130          FALSE
6      ¦   °--(116, 561]          401  11.782002   0.005           TRUE
7      ¦       ¦--(116, 401]      196  22.792401   0.150          FALSE
8      ¦       °--(401, 561]      554 -16.338703   0.800          FALSE
9      °--(561, 600]              568  -5.230075   0.120          FALSE
```

The `changeforest` algorithm still detects the change point around `t = 200` but also returns false positives.

Due to the nature of the change (the mean is zero in every segment), `method="change_in_mean"` is unable to detect any change points at all:

```r
> changeforest(X, "change_in_mean", "bs")
      name best_split max_gain p_value is_significant
1 (0, 600]        498 17.29389      NA          FALSE
```

## References

[1] M. Londschien, P. Bühlmann and S. Kovács (2023). "Random Forests for Change Point Detection." Journal of Machine Learning Research.
