
Conversation

@ryanurbs
Member

No description provided.

sauravbose and others added 17 commits March 3, 2018 13:53
importance scores. Might as well do this as soon as possible.

scoring_utils.py. Inserted commented-out code to fix the distance array calculation when missing data is present. Not implemented yet.

calculation is properly normalized by the number of uniquely missing features when comparing instance-pair distances, so that distance is computed agnostically with respect to missing values. The previous implementation would make instances with more missing values appear closer to one another (see the sketch after this commit list).

distance calculation within get_row_missing(). Added a 'cmins' variable passed forward for subtraction of the minimum value. Some modifications are still needed to complete this fix.

It appears it is only set up to handle binary endpoints correctly, not multiclass or continuous-valued endpoints. This problem does not exist for the other 4 core algorithms (SURF, SURF*, MultiSURF, and MultiSURF*).
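A minimal sketch of the missing-data distance idea described above, as referenced in the commit list. This is only an illustration of the behavior, not skrebate's actual get_row_missing() implementation; the function and variable names are hypothetical:

```python
import numpy as np

def pairwise_distance_with_missing(a, b):
    """Distance between two instances that ignores features missing in either
    one and normalizes by the number of features actually compared, so that
    instances with many missing values do not look artificially close."""
    observed = ~(np.isnan(a) | np.isnan(b))   # features present in both instances
    n_compared = observed.sum()
    if n_compared == 0:
        return 0.0                            # assumed: nothing to compare
    return np.abs(a[observed] - b[observed]).sum() / n_compared
```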
skrebate/surf.py Outdated
- from .relieff import ReliefF
- from .scoring_utils import SURF_compute_scores
+ from relieff import ReliefF
+ from scoring_utils import SURF_compute_scores

Just a quick look. Please edit the two lines below so the unit tests pass:

from .relieff import ReliefF
from .scoring_utils import SURF_compute_scores

You need the leading . for a relative import.
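For context, a minimal sketch of why the leading dot matters (module names match the diff above; the package layout is assumed):

```python
# Inside skrebate/surf.py these are *relative* imports: they resolve to the
# sibling modules skrebate.relieff and skrebate.scoring_utils, which is what
# the installed package needs.
from .relieff import ReliefF
from .scoring_utils import SURF_compute_scores

# Without the dot, Python 3 treats them as absolute imports and only finds
# relieff/scoring_utils if that directory happens to be on sys.path
# (e.g. when running a script from inside skrebate/), so the unit tests fail.
# from relieff import ReliefF
# from scoring_utils import SURF_compute_scores
```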

endpoint nearest neighbor determination. Also got rid of mmdiff in scoring_utils for discrete endpoints (it is only needed for continuous endpoints). Identified a problem in 'compute_score': for 'far' score contributions, continuous-valued features should not check for equality (this will lead to many scores being left out).

original data array only for datasets with all continuous values); now prenormalization is run on 'xc' for such data. Also fixed the scoring update for any continuous features (no prior feature-equivalence check). This definitely causes problems for data with continuous features, particularly in MultiSURF*. So far only fixed for binary endpoints.

*Fixed normalization for binary endpoint scoring. Added normalization by 'n' (the number of training instances); this doesn't appear to be anywhere in the code for any of the Relief methods.

issues as well as normalization fixes, and updates to be more relatable to the ReBATE papers.

and mmdiff. I also changed the use of the absolute-value continuous feature difference and mmdiff normalization so they are only applied when a continuous feature is present, rather than for every feature regardless of type (see the sketch after this commit list).
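A minimal sketch of the per-feature difference described in the last commit above. The names are illustrative rather than skrebate's actual code; mmdiff stands for the feature's max-min range used to normalize continuous differences:

```python
def feature_diff(value_a, value_b, is_continuous, mmdiff):
    """Per-feature 'diff' used in Relief-style scoring: continuous features
    contribute the normalized absolute difference, discrete features a 0/1
    mismatch. The abs()/mmdiff path is only taken for continuous features."""
    if is_continuous:
        return abs(value_a - value_b) / mmdiff
    return 0.0 if value_a == value_b else 1.0
```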

@weixuanfu weixuanfu left a comment


Three unit tests about the pipeline with mixed attributes failed, maybe because of the corrected estimation of mmdiff for continuous features with categorical endpoints. But I am still not sure why it only happens when using cross_val_score to estimate cv_scores. Maybe it is related to CV.

*'k','h','m' normalization dividing by the respective number of hits and misses in NN (after ignoring missing values); also helps account for class imbalance within the nearest neighbor radius)"""
if count_hit == 0.0 or count_miss == 0.0:  # Special case: avoid division error
    if count_hit == 0.0:
        diff = (diff_miss / count_miss) / datalen

You need to add a case for when both count_hit == 0.0 and count_miss == 0.0; this will fix a failed unit test.
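A minimal sketch of how that branch could look, reusing the names from the snippet above. The zero-contribution fallback and the final two branches are assumptions, not the merged implementation:

```python
# Guard against division by zero when an instance has no hits or no misses
# among its nearest neighbors.
if count_hit == 0.0 and count_miss == 0.0:
    diff = 0.0  # assumed: no usable neighbors, so no score contribution
elif count_hit == 0.0:
    diff = (diff_miss / count_miss) / datalen
elif count_miss == 0.0:
    diff = -(diff_hit / count_hit) / datalen
else:
    diff = ((diff_miss / count_miss) - (diff_hit / count_hit)) / datalen
```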

headers = list(genetic_data.drop("class", axis=1))

- clf = make_pipeline(RFE(ReliefF(), n_features_to_select=2),
+ clf = make_pipeline(TuRF(core_algorithm="MultiSURF", n_features_to_select=2, step=0.1),

@weixuanfu weixuanfu Mar 30, 2018


Please replace every step=0.1 with step=0.4 in tests.py (not in this docs_sources/using.md) to speed up the unit tests, or use a smaller dataset instead.
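A sketch of what the suggested tests.py line might then look like. The TuRF parameters come from the diff above; the RandomForestClassifier estimator and the turf__headers fit parameter follow the usual skrebate pipeline pattern and are assumptions here, not the actual tests.py contents:

```python
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from skrebate import TuRF

# Coarser TuRF step so each feature-elimination round removes more features
# and the unit test finishes faster.
clf = make_pipeline(TuRF(core_algorithm="MultiSURF", n_features_to_select=2, step=0.4),
                    RandomForestClassifier(n_estimators=100, n_jobs=-1))

# TuRF needs the feature names at fit time; the pipeline forwards them via
# the step-name prefix.
clf.fit(features, labels, **{'turf__headers': headers})
```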

@coveralls

coveralls commented Mar 31, 2018

Coverage Status

Coverage increased (+6.5%) to 76.667% when pulling 2e2e214 on sauravbose:ryan_multiclass into 386ea28 on EpistasisLab:development.


@weixuanfu weixuanfu left a comment


ramp_function looks great!

NN_far_list = [i[1] for i in NNlist]

- if self.n_jobs != 1:
+ if self.n_jobs != 1: #Parallelization

I think we could remove the 'if self.n_jobs != 1' check in all ReBATE-based algorithms to reduce redundant code.
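A minimal sketch of what collapsing that branch could look like, assuming the parallel path already goes through joblib (Parallel with n_jobs=1 simply runs sequentially, so a separate serial branch adds nothing). The compute_scores callable, NN_near_list, and the function name below are placeholders, not skrebate's exact names:

```python
import numpy as np
from joblib import Parallel, delayed

def compute_all_scores(compute_scores, NN_near_list, NN_far_list, n_jobs=1):
    """One code path for every value of n_jobs: joblib runs the jobs
    sequentially when n_jobs == 1, so a separate `if n_jobs != 1` branch
    (with its duplicated serial loop) is unnecessary."""
    per_instance = Parallel(n_jobs=n_jobs)(
        delayed(compute_scores)(inst, NN_near_list[inst], NN_far_list[inst])
        for inst in range(len(NN_near_list))
    )
    return np.sum(per_instance, axis=0)  # aggregate per-instance score vectors
```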

@@ -0,0 +1 @@
+run_test.py

I think this is a personal test script. Maybe you should clean it up.

@ryanurbs ryanurbs merged commit 0f8801b into EpistasisLab:development Apr 3, 2018