-
Notifications
You must be signed in to change notification settings - Fork 72
Ryan multiclass #35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ryan multiclass #35
Conversation
importance scores. Might as well do this as soon as possible.
scoring_utils.py. Inserted commented out code to fix the distance array calculation when missing data is present. Not implemented yet.
calculation is properly normalized by the number of uniquely missing features in comparing instance pair distances. This way distance is computed agnostically with respect to missing values. The previous implementation would make instances with more missing values appear closer to one another.
distance calculation within get_row_missing(). Added 'cmins' variable passed forward for subtraction of minimum value. Some modifications still needed to complete this fix.
It appears it it only set up to handle binary endpoints correctly, not multiclass or continuous valued endpoints. This problem does not exist for the other 4 core algorithms (SURF, SURF*, MultiSURF, and MultiSURF*)
binary class endpoints.
skrebate/surf.py
Outdated
| from .relieff import ReliefF | ||
| from .scoring_utils import SURF_compute_scores | ||
| from relieff import ReliefF | ||
| from scoring_utils import SURF_compute_scores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick look. Please edit two lines as below for passing unit tests.
from .relieff import ReliefF
from .scoring_utils import SURF_compute_scores
You need the . for importing.
endpoint nearest neighbor determination. Also got rid of mmdiff in scoring_utils for discrete endpoints (only needed for continuous endpoints.) Identified a problem in 'compute_score'. For #far score contributions, continuous valued features should not check for equality (this will lead to many scores being left out.)
original data array only for datasets with all continuous values) now prenormalization is run on 'xc' for such data. Also Fixed scoring update for any continuous features (no prior feature equivalence check. This causes definitely problems for data with continuous features particularly in MultiSURF*. So far only fixed for binary endpoints. *fixed normalization for binary endpoint scoring. Added normalization by 'n' (number of training instances) this doesn't appear to be in here anywhere for any of the Relief methods.
issues as well as normalization fixes and update to be more relatable to the rebate papers.
and mmdiff. I also changed the use of abs value continuous feature difference and mmdiff normalization so it's only called when a continuous feature is present rather than for any feature regardless.
count_miss happens to be zero.
weixuanfu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 failed unit testss about pipeline with mixed attributes maybe from the correction of estimation of mmdiff for continuous features in categorical endpoints. But I am still not sure why it only happened when using cross_eval_score to estimate cv_scores. Maybe it is related to CV
| *'k','h','m' normalization dividing by the respective number of hits and misses in NN (after ignoring missing values), also helps account for class imbalance within nearest neighbor radius)""" | ||
| if count_hit == 0.0 or count_miss == 0.0: #Special case, avoid division error | ||
| if count_hit == 0.0: | ||
| diff = (diff_miss / count_miss) / datalen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need add an exception when both count_hit == 0.0 and count_miss == 0.0, this will fix a failed unit test
| headers = list(genetic_data.drop("class", axis=1)) | ||
|
|
||
| clf = make_pipeline(RFE(ReliefF(), n_features_to_select=2), | ||
| clf = make_pipeline(TuRF(core_algorithm="MultiSURF", n_features_to_select=2, step=0.1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace all step=0.1 to step=0.4 in tests.py (not this docs_sources/using.md) to speed up unit test or using small dataset instead
happen when there are very few features and instances.
008fdf9 to
9429e58
Compare
9429e58 to
d0778c4
Compare
weixuanfu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ramp_function looks great!
skrebate/surfstar.py
Outdated
| NN_far_list = [i[1] for i in NNlist] | ||
|
|
||
| if self.n_jobs != 1: | ||
| if self.n_jobs != 1: #Parallelization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could remove if self.n_jobs != 1 in all rebate-based algorithms for reducing redundant codes
| @@ -0,0 +1 @@ | |||
| run_test.py | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a personal test codes. Maybe you need clean up it.
and multisurfstar for parallelization check.
No description provided.