correlation_threshold returns the variables to exclude so that no two of the remaining variables have a correlation greater than a specified threshold.

correlation_threshold(variables, sample, cutoff = 0.9, method = "pearson")

Arguments

variables

character vector specifying observation variables.

sample

tbl containing sample used to estimate parameters.

cutoff

threshold in [0, 1] that defines the maximum pairwise absolute correlation allowed among the retained variables. When a pair exceeds it, one variable of the pair is excluded.

method

optional character string specifying method for calculating correlation. This must be one of the strings "pearson" (default), "kendall", "spearman".

Value

character vector specifying observation variables to be excluded.

Details

correlation_threshold is a wrapper for caret::findCorrelation.
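
As a rough illustration of what such a wrapper might look like, the sketch below computes a correlation matrix with stats::cor and passes it to caret::findCorrelation, which returns the indexes of the columns it suggests removing. The name correlation_threshold_sketch and the simulated data are hypothetical and not part of the package; the actual implementation may differ.

```r
# Illustrative sketch only; not the package source.
# Assumes the caret package is installed.
correlation_threshold_sketch <- function(variables, sample,
                                         cutoff = 0.9, method = "pearson") {
  # Correlation matrix of the requested columns.
  correlations <- stats::cor(
    as.data.frame(sample)[, variables, drop = FALSE],
    method = method
  )
  # findCorrelation returns the column indexes suggested for removal.
  excluded_indexes <- caret::findCorrelation(correlations, cutoff = cutoff)
  # Map the indexes back to variable names.
  variables[excluded_indexes]
}

# Simulated data (hypothetical): z is a noisy copy of x, y is unrelated.
set.seed(42)
x <- rnorm(30)
sample <- data.frame(x = x, y = rnorm(30) / 1000, z = x + rnorm(30) / 10)
excluded <- correlation_threshold_sketch(c("x", "y", "z"), sample)
```

Here excluded should contain exactly one of "x" or "z"; which of the two is chosen depends on their mean absolute correlation with the other variables.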

Examples

suppressMessages(suppressWarnings(library(magrittr)))
sample <- tibble::tibble(
  x = rnorm(30),
  y = rnorm(30) / 1000
)
sample %<>% dplyr::mutate(z = x + rnorm(30) / 10)
variables <- c("x", "y", "z")
head(sample)
#> # A tibble: 6 x 3
#>          x         y       z
#>      <dbl>     <dbl>   <dbl>
#> 1 -1.40     0.000935 -1.29
#> 2  0.255    0.000176  0.189
#> 3 -2.44     0.000244 -2.33
#> 4 -0.00557  0.00162  -0.0302
#> 5  0.622    0.000112  0.504
#> 6  1.15    -0.000134  1.05
cor(sample)
#>            x          y          z
#> x 1.00000000 0.05080466 0.99554738
#> y 0.05080466 1.00000000 0.06550465
#> z 0.99554738 0.06550465 1.00000000
# `x` and `z` are highly correlated; one of them will be removed
correlation_threshold(variables, sample)
#> [1] "z"
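
Because the returned vector names the variables to exclude rather than those to keep, the retained set has to be derived from it. A minimal sketch using base R's setdiff, with the result of the example above hard-coded for illustration (the names excluded and retained are hypothetical):

```r
variables <- c("x", "y", "z")
# Value returned by correlation_threshold in the example above.
excluded <- "z"
# Keep every variable that was not flagged for exclusion.
retained <- setdiff(variables, excluded)
retained
#> [1] "x" "y"
```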