+
Skip to content

Conversation

mb706
Copy link
Contributor

@mb706 mb706 commented Feb 11, 2020

See #61: Filter that emulates caret::findFilterCorrelation(exact = FALSE). Only trouble is: findFilterCorrelation(cutoff = 0.3) excludes more features than findFilterCorrelation(cutoff = 0.7), whereas for filter scores, lower cutoff values mean fewer features get excluded. Therefore this filter produces negative scores.

> task = tsk("sonar")
> #what are the features names dropped by findCorrelation(cutoff = 0.9)?
> task$feature_names[caret::findCorrelation(cor(task$data(cols = task$feature_names)), cutoff = 0.9, exact = FALSE)]
[1] "V18" "V15" "V20"
>
> # what are the features with scores < -0.9?
> which(flt("findcorrelation")$calculate(task)$scores < -0.9)
V20 V15 V18 
 58  59  60 
>
> # what are the features dropped by PipeOpFilter(filter.cutoff = -0.9)?
> filtered_task = po("filter", flt("findcorrelation"), filter.cutoff = -0.9)$train(list(task))[[1]]
> setdiff(task$feature_names, filtered_task$feature_names)
[1] "V15" "V18" "V20"

This fixes #61 (at least all that can be fixed).

Copy link
Member

@pat-s pat-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not necessarily need to match the cutoff numbers of {caret} with the cutoff used in mlr3.

Usually we assume that higher filter values are better in mlr3.
I would like to keep this behavior consistent.
See my proposed change to simply change to "1 - max(x)".

In this case a mlr3 cutoff value of 0.1 would correspond to {carets}'s 0.9 value: A feature with a high value is "better" and means that it has a low correlation value, according to the findcorrelation filter.
Imo documenting this should suffice.

This way users can trust the "1 to x" ranking of filters in mlr3 which means that the "best" feature of a filter comes in at rank 1.

Reprex

library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3db
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox

task = tsk("sonar")

filtered_task = po("filter", flt("findcorrelation"), 
                   filter.cutoff = 0.1)$train(list(task))[[1]]
setdiff(task$feature_names, filtered_task$feature_names)
#> [1] "V15" "V18" "V20"

which(flt("findcorrelation")$calculate(task)$scores < 0.1)
#> V20 V15 V18 
#>  58  59  60

Created on 2020-02-19 by the reprex package (v0.3.0)

public = list(
initialize = function() {
super$initialize(
id = "correlation",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a filter with id = "correlation".

Can we find a more matching name that deviates clearly from the existing one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is obviously a typo, it should be "findcorrelation", because it imitates the caret::findCorrelation function. What other name would you suggest / what naming scheme do you use in mlr3filters?

DESCRIPTION Outdated
rpart,
testthat
testthat,
caret
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this already implemented in {tidymodels}? I am worried about adding this to suggest because {caret}'s lifecycle is coming to its end in the foreseeable future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caret is only used for comparison in tests and not actually loaded by the filter, but I can have a look if they copied the behaviour to somewhere in another package.

@mb706
Copy link
Contributor Author

mb706 commented Feb 20, 2020

Usually we assume that higher filter values are better in mlr3.
I would like to keep this behavior consistent.
See my proposed change to simply change to "1 - max(x)".

The code before had essentially -max(x), so your change just adds 1. Do you prefer to have positive scores? Currently a findCorrelation cutoff value of 0.9 corresponds to a filter value of -0.9 .

@pat-s
Copy link
Member

pat-s commented Feb 21, 2020

The code before had essentially -max(x), so your change just adds 1. Do you prefer to have positive scores?

Yes, because positive scores / a ranking with the "best" features in front aligns with the way cutoff can be tuned. I don't care about the sign so much but having it positive stays in line with all other filters.

We just need a clear statement in the help page that features with high values actually mean low correlation with others and hence they are ranked best.

@pat-s pat-s merged commit 8c9bc21 into master Feb 24, 2020
@pat-s pat-s deleted the findcorrelation branch February 24, 2020 06:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filters to remove correlated features and linear combinations

2 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载