FilterFindCorrelation #62

mb706 · 2020-02-11T20:03:21Z

See #61: a filter that emulates caret::findCorrelation(exact = FALSE). The only catch: findCorrelation(cutoff = 0.3) excludes more features than findCorrelation(cutoff = 0.7), whereas for filter scores a lower cutoff value means fewer features get excluded. This filter therefore produces negative scores.

> task = tsk("sonar")
> # what are the feature names dropped by findCorrelation(cutoff = 0.9)?
> task$feature_names[caret::findCorrelation(cor(task$data(cols = task$feature_names)), cutoff = 0.9, exact = FALSE)]
[1] "V18" "V15" "V20"
>
> # what are the features with scores < -0.9?
> which(flt("findcorrelation")$calculate(task)$scores < -0.9)
V20 V15 V18 
 58  59  60 
>
> # what are the features dropped by PipeOpFilter(filter.cutoff = -0.9)?
> filtered_task = po("filter", flt("findcorrelation"), filter.cutoff = -0.9)$train(list(task))[[1]]
> setdiff(task$feature_names, filtered_task$feature_names)
[1] "V15" "V18" "V20"

This fixes #61 (at least all that can be fixed).
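
The direction mismatch described above can be sketched in base R. This is a simplified illustration only: the per-feature values are made up (not taken from sonar), both helper functions are hypothetical, and the caret-style rule ignores how caret decides which member of a correlated pair to drop.

```r
# Hypothetical per-feature values: the highest absolute correlation each
# feature has with any other feature (made-up numbers).
max_abs_cor = c(f1 = 0.95, f2 = 0.80, f3 = 0.50, f4 = 0.20)

# caret-style rule (simplified): drop a feature when its correlation exceeds
# the cutoff, so a LOWER cutoff drops MORE features.
dropped_caret_style = function(cutoff) {
  names(max_abs_cor)[max_abs_cor > cutoff]
}

# Filter-style rule: drop a feature when its score falls below the cutoff,
# so a LOWER cutoff drops FEWER features.
dropped_filter_style = function(scores, cutoff) {
  names(scores)[scores < cutoff]
}

# Negating the correlation reconciles the two directions.
scores = -max_abs_cor

drop_03 = dropped_caret_style(0.3)            # three features dropped
drop_07 = dropped_caret_style(0.7)            # two features dropped
drop_neg07 = dropped_filter_style(scores, -0.7)
```

With negated scores, a filter cutoff of -0.7 selects the same toy features as the caret-style cutoff of 0.7.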

pat-s

We do not necessarily need to match the cutoff numbers of {caret} with the cutoff used in mlr3.

Usually we assume that higher filter values are better in mlr3.
I would like to keep this behavior consistent.
See my proposed change to simply change to "1 - max(x)".

In this case an mlr3 cutoff value of 0.1 would correspond to {caret}'s 0.9 value: a feature with a high value is "better", meaning that it has a low correlation value according to the findcorrelation filter.
Imo documenting this should suffice.

This way users can trust the "1 to x" ranking of filters in mlr3 which means that the "best" feature of a filter comes in at rank 1.
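
A small base R sketch of the correspondence, assuming a made-up vector of per-feature maximum absolute correlations (the variable names are hypothetical and this is not the filter's actual code): shifting the scores by 1 leaves the ranking untouched and only moves the cutoff.

```r
# Hypothetical "max absolute correlation" per feature (made-up values).
max_abs_cor = c(f1 = 0.95, f2 = 0.92, f3 = 0.50, f4 = 0.20)

score_negative = -max_abs_cor     # essentially the scores before the change
score_shifted = 1 - max_abs_cor   # scores after the suggested "1 - max(x)"

# Both versions rank the features identically (least correlated first).
ranking_negative = names(sort(score_negative, decreasing = TRUE))
ranking_shifted = names(sort(score_shifted, decreasing = TRUE))

# Only the cutoff moves: 0.9 in caret terms becomes -0.9 or 0.1.
below_negative = names(score_negative)[score_negative < -0.9]
below_shifted = names(score_shifted)[score_shifted < 0.1]
```

In this toy case both cutoffs flag the same two features, and the best-ranked feature is the one with the lowest correlation.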

Reprex

library(mlr3verse)
#> Loading required package: mlr3
#> Loading required package: mlr3db
#> Loading required package: mlr3filters
#> Loading required package: mlr3learners
#> Loading required package: mlr3pipelines
#> Loading required package: mlr3tuning
#> Loading required package: mlr3viz
#> Loading required package: paradox

task = tsk("sonar")

filtered_task = po("filter", flt("findcorrelation"), 
                   filter.cutoff = 0.1)$train(list(task))[[1]]
setdiff(task$feature_names, filtered_task$feature_names)
#> [1] "V15" "V18" "V20"

which(flt("findcorrelation")$calculate(task)$scores < 0.1)
#> V20 V15 V18 
#>  58  59  60

Created on 2020-02-19 by the reprex package (v0.3.0)

pat-s · 2020-02-19T22:08:19Z

R/FilterFindCorrelation.R

+  public = list(
+    initialize = function() {
+      super$initialize(
+        id = "correlation",


We already have a filter with id = "correlation".

Can we find a more matching name that deviates clearly from the existing one?

This is obviously a typo; it should be "findcorrelation", because the filter imitates the caret::findCorrelation function. What other name would you suggest, and what naming scheme do you use in mlr3filters?

pat-s · 2020-02-19T22:09:55Z

DESCRIPTION

    rpart,
-    testthat
+    testthat,
+    caret


Is this already implemented in {tidymodels}? I am worried about adding this to Suggests because {caret}'s lifecycle is coming to its end in the foreseeable future.

caret is only used for comparison in tests and not actually loaded by the filter, but I can have a look if they copied the behaviour to somewhere in another package.
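
One way a test-only dependency can be guarded, sketched in base R. The helper name `caret_dropped` and the toy data are hypothetical and not taken from the PR; the sketch only assumes that `caret::findCorrelation()` accepts a correlation matrix plus `cutoff` and `exact` arguments, as in the transcript at the top of this thread.

```r
# Hypothetical test helper: returns the names caret would drop, or NULL when
# caret is not installed (it is only a suggested package).
caret_dropped = function(data, cutoff = 0.9) {
  if (!requireNamespace("caret", quietly = TRUE)) {
    return(NULL)
  }
  idx = caret::findCorrelation(stats::cor(data), cutoff = cutoff, exact = FALSE)
  colnames(data)[idx]
}

# Toy data: x2 is almost a copy of x1, x3 is independent noise.
set.seed(1)
x1 = rnorm(50)
toy = data.frame(x1 = x1, x2 = x1 + rnorm(50, sd = 0.01), x3 = rnorm(50))

res = caret_dropped(toy)
```

A comparison test would then be skipped (or trivially pass) when `res` is `NULL`, so the filter itself never needs caret at run time.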

mb706 · 2020-02-20T18:13:42Z

> Usually we assume that higher filter values are better in mlr3.
> I would like to keep this behavior consistent.
> See my proposed change to simply change to "1 - max(x)".

The code before had essentially -max(x), so your change just adds 1. Do you prefer to have positive scores? Currently a findCorrelation cutoff value of 0.9 corresponds to a filter value of -0.9.

pat-s · 2020-02-21T14:19:23Z

> The code before had essentially -max(x), so your change just adds 1. Do you prefer to have positive scores?

Yes, because positive scores / a ranking with the "best" features in front aligns with the way cutoff can be tuned. I don't care about the sign so much but having it positive stays in line with all other filters.

We just need a clear statement in the help page that features with high values actually mean low correlation with others and hence they are ranked best.

mb706 added 3 commits February 11, 2020 20:51

FilterFindCorrelation

788bfc1

suggest caret for tests

d6702a9

comment correction

6c29e87

mb706 mentioned this pull request Feb 12, 2020

Preproces to remove correlated features mlr-org/mlr3pipelines#313

Closed

pat-s reviewed Feb 19, 2020

pat-s added Priority: Medium Status: Revision Needed Type: Enhancement labels Feb 19, 2020

Merge branch 'master' into findcorrelation

bed758b

github-actions bot had a problem deploying to production February 19, 2020 23:07 Failure

Merge branch 'master' into findcorrelation

13d8a4a

github-actions bot deployed to production February 19, 2020 23:16 Active

github-actions bot had a problem deploying to production February 19, 2020 23:19 Failure

github-actions bot deployed to production February 19, 2020 23:23 Active

Merge branch 'master' into findcorrelation

0a824ad

github-actions bot deployed to production February 20, 2020 10:10 Active

github-actions bot deployed to production February 20, 2020 10:12 Active

github-actions bot deployed to production February 20, 2020 10:15 Active

github-actions bot deployed to production February 20, 2020 10:16 Active

Merge branch 'master' into findcorrelation

c62f78f

mlr-org deleted a comment from codecov-io Feb 20, 2020

github-actions bot requested a deployment to production February 20, 2020 10:24 Pending

github-actions bot requested a deployment to production February 20, 2020 10:25 Pending

github-actions bot requested a deployment to production February 20, 2020 10:29 Pending

github-actions bot requested a deployment to production February 20, 2020 10:30 Pending

Adding 1 to filter scores to avoid negative scores

1f8fed6

Adjusting tests to +1 shift of filter scores

5dfe5d0

github-actions bot requested a deployment to production February 21, 2020 15:59 Pending

github-actions bot requested a deployment to production February 21, 2020 16:02 Pending

github-actions bot requested a deployment to production February 21, 2020 16:08 Pending

pat-s added 4 commits February 23, 2020 22:25

Merge branch 'master' into findcorrelation

07bdeda

Merge branch 'findcorrelation' of github.com:mlr-org/mlr3filters into…

fdf713c

… findcorrelation

polish

dd11050

add martin as auth

e01ba9f

pat-s merged commit 8c9bc21 into master Feb 24, 2020

pat-s deleted the findcorrelation branch February 24, 2020 06:54
