-
-
Notifications
You must be signed in to change notification settings - Fork 9
FilterFindCorrelation #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not necessarily need to match the cutoff numbers of {caret} with the cutoff
used in mlr3.
Usually we assume that higher filter values are better in mlr3.
I would like to keep this behavior consistent.
See my proposed change to simply change to "1 - max(x)".
In this case a mlr3 cutoff value of 0.1 would correspond to {carets}'s 0.9 value: A feature with a high value is "better" and means that it has a low correlation value, according to the findcorrelation
filter.
Imo documenting this should suffice.
This way users can trust the "1 to x" ranking of filters in mlr3 which means that the "best" feature of a filter comes in at rank 1.
Reprex
library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3db
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox
task = tsk("sonar")
filtered_task = po("filter", flt("findcorrelation"),
filter.cutoff = 0.1)$train(list(task))[[1]]
setdiff(task$feature_names, filtered_task$feature_names)
#> [1] "V15" "V18" "V20"
which(flt("findcorrelation")$calculate(task)$scores < 0.1)
#> V20 V15 V18
#> 58 59 60
Created on 2020-02-19 by the reprex package (v0.3.0)
R/FilterFindCorrelation.R
Outdated
public = list( | ||
initialize = function() { | ||
super$initialize( | ||
id = "correlation", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a filter with id = "correlation"
.
Can we find a more matching name that deviates clearly from the existing one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is obviously a typo, it should be "findcorrelation"
, because it imitates the caret::findCorrelation
function. What other name would you suggest / what naming scheme do you use in mlr3filters?
DESCRIPTION
Outdated
rpart, | ||
testthat | ||
testthat, | ||
caret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this already implemented in {tidymodels}? I am worried about adding this to suggest because {caret}'s lifecycle is coming to its end in the foreseeable future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
caret is only used for comparison in tests and not actually loaded by the filter, but I can have a look if they copied the behaviour to somewhere in another package.
The code before had essentially |
Yes, because positive scores / a ranking with the "best" features in front aligns with the way We just need a clear statement in the help page that features with high values actually mean low correlation with others and hence they are ranked best. |
See #61: Filter that emulates
caret::findFilterCorrelation(exact = FALSE)
. Only trouble is:findFilterCorrelation(cutoff = 0.3)
excludes more features thanfindFilterCorrelation(cutoff = 0.7)
, whereas for filter scores, lower cutoff values mean fewer features get excluded. Therefore this filter produces negative scores.This fixes #61 (at least all that can be fixed).