Ryan multiclass #35

ryanurbs · 2018-03-29T16:31:12Z

No description provided.

calculation is properly normalized by the number of uniquely missing features in comparing instance pair distances. This way distance is computed agnostically with respect to missing values. The previous implementation would make instances with more missing values appear closer to one another.
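The effect described in this note can be illustrated with a minimal sketch. It is not the skrebate implementation: the function names are hypothetical, missing values are assumed to be encoded as NaN, and the normalization is assumed to divide by the number of features observed in both instances.

```python
import math

def raw_distance(a, b):
    """Sum of absolute differences, silently skipping missing (NaN) values."""
    return sum(abs(x - y) for x, y in zip(a, b)
               if not (math.isnan(x) or math.isnan(y)))

def missing_aware_distance(a, b):
    """Same sum, divided by the number of features observed in both instances."""
    compared = sum(1 for x, y in zip(a, b)
                   if not (math.isnan(x) or math.isnan(y)))
    if compared == 0:
        return float("nan")  # nothing to compare
    return raw_distance(a, b) / compared

nan = float("nan")
# Two pairs that differ on every feature they can be compared on.
full_a, full_b = [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]
gap_a, gap_b = [nan, nan, 0.0, 1.0], [nan, 0.0, 1.0, 0.0]
```

Without normalization the pair with missing values looks closer simply because fewer terms enter the sum. After dividing by the number of compared features, both pairs get the same distance.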

weixuanfu · 2018-03-29T18:47:42Z

skrebate/surf.py

-from .relieff import ReliefF
-from .scoring_utils import SURF_compute_scores
+from relieff import ReliefF
+from scoring_utils import SURF_compute_scores


Just a quick look. Please change these two lines as shown below so the unit tests pass:

from .relieff import ReliefF
from .scoring_utils import SURF_compute_scores

You need the leading . because these are relative imports within the package.

endpoint nearest neighbor determination. Also removed mmdiff from scoring_utils for discrete endpoints (it is only needed for continuous endpoints). Identified a problem in 'compute_score': for #far score contributions, continuous valued features should not be checked for equality (doing so leaves many score contributions out).

original data array only for datasets with all continuous values); now prenormalization is run on 'xc' for such data. Also fixed the scoring update for continuous features (no prior feature equivalence check). That check definitely causes problems for data with continuous features, particularly in MultiSURF*. So far this is only fixed for binary endpoints. Fixed normalization for binary endpoint scoring. Added normalization by 'n' (number of training instances); this doesn't appear to be in here anywhere for any of the Relief methods.
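A minimal sketch of the per-feature difference these notes describe, not the code in scoring_utils. The function name and signature are hypothetical, and mmdiff is assumed to be the feature's max minus min range.

```python
def feature_diff(x, y, is_continuous, mmdiff=1.0):
    """Difference between two values of one feature, scaled to [0, 1].

    Continuous features: absolute difference divided by the feature range
    (mmdiff, assumed here to be max - min). No equality check is applied.
    Discrete features: 0 if the values match, 1 otherwise; mmdiff is unused.
    """
    if is_continuous:
        if mmdiff <= 0:
            return 0.0  # constant feature carries no information
        return abs(x - y) / mmdiff
    return 0.0 if x == y else 1.0

# Two nearly identical continuous values on a feature with range 2.0.
small = feature_diff(0.50, 0.52, is_continuous=True, mmdiff=2.0)
```

Continuous values are rarely exactly equal, so gating the update on equality would drop almost every contribution. Using the scaled difference directly gives a small but nonzero contribution instead.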

weixuanfu

3 unit tests failed, all involving pipelines with mixed attributes. This may come from the correction to the mmdiff estimation for continuous features with categorical endpoints. But I am still not sure why it only happens when using cross_eval_score to estimate cv_scores. Maybe it is related to CV.

weixuanfu · 2018-03-30T19:38:07Z

skrebate/scoring_utils.py

+        *'k','h','m' normalization dividing by the respective number of hits and misses in NN (after ignoring missing values), also helps account for class imbalance within nearest neighbor radius)"""
+        if count_hit == 0.0 or count_miss == 0.0: #Special case, avoid division error
+            if count_hit == 0.0:
+                diff =  (diff_miss / count_miss) / datalen


Need to add an exception for when both count_hit == 0.0 and count_miss == 0.0; this will fix a failed unit test.
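One way to cover all three special cases at once is sketched below. It is not the code that was merged. The function name is hypothetical, and the sign convention (miss differences add, hit differences subtract) is assumed from standard Relief scoring.

```python
def normalized_update(diff_hit, diff_miss, count_hit, count_miss, datalen):
    """Score update for one feature from one target instance.

    Each term is averaged over its own neighbor count, and a term with no
    neighbors contributes 0 instead of triggering a division by zero.
    """
    hit_term = diff_hit / count_hit if count_hit > 0 else 0.0
    miss_term = diff_miss / count_miss if count_miss > 0 else 0.0
    return (miss_term - hit_term) / datalen

# Neither hits nor misses among the neighbors: no update, no error.
no_neighbors = normalized_update(0.0, 0.0, 0, 0, datalen=10)
# Only misses present: the miss term alone drives the update.
only_misses = normalized_update(0.0, 2.0, 0, 4, datalen=10)
```

Guarding each term separately avoids nested branching on count_hit and count_miss, and the case where both are zero falls out as an update of 0.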

weixuanfu · 2018-03-30T19:40:05Z

docs_sources/using.md

+headers = list(genetic_data.drop("class", axis=1))

-clf = make_pipeline(RFE(ReliefF(), n_features_to_select=2),
+clf = make_pipeline(TuRF(core_algorithm="MultiSURF", n_features_to_select=2, step=0.1),


Please replace all step=0.1 with step=0.4 in tests.py (not in this docs_sources/using.md) to speed up the unit tests, or use a smaller dataset instead.

coveralls · 2018-03-31T01:23:03Z

Coverage increased (+6.5%) to 76.667% when pulling 2e2e214 on sauravbose:ryan_multiclass into 386ea28 on EpistasisLab:development.

weixuanfu

ramp_function looks great!

weixuanfu · 2018-04-02T13:43:20Z

skrebate/surfstar.py

        NN_far_list = [i[1] for i in NNlist]

-        if self.n_jobs != 1:
+        if self.n_jobs != 1: #Parallelization


I think we could remove if self.n_jobs != 1 in all rebate-based algorithms to reduce redundant code.

weixuanfu · 2018-04-02T13:46:25Z

skrebate/.gitignore

@@ -0,0 +1 @@
+run_test.py


I think this is personal test code. Maybe you need to clean it up.

sauravbose and others added 17 commits March 3, 2018 13:53

TuRF detailed prints with header settings

3f33d0c

All core algos removed comments

bc0057f

Modified docs

f903f5a

Updated docstrings

b072185

multiclass

ae380ff

multiclass

5c3e72b

made attr a class variable and added multiclass capability for all core algos

52debc4

Updated far scoring in multiclass case

f02739f

Added clarifying comments to fit() in relieff

600ea71

Moved deletion of distance array to just after calculation of feature importance scores. Might as well do this as soon as possible.

660132d

Added clarifying comments to the rest of relieff.py and scoring_utils.py. Inserted commented out code to fix the distance array calculation when missing data is present. Not implemented yet.

fe4f332

Partial fix to proper normalization of continuous value range in distance calculation within get_row_missing(). Added 'cmins' variable passed forward for subtraction of minimum value. Some modifications still needed to complete this fix.

66fc318

Added more comments. Identified an issue in relieff, find_neighbors(). It appears it is only set up to handle binary endpoints correctly, not multiclass or continuous valued endpoints. This problem does not exist for the other 4 core algorithms (SURF, SURF*, MultiSURF, and MultiSURF*)

6d5b150

Minor comment and code rearrangement

3c137a5

added further comments to the code for clarification and to identify possible changes

ea92361

Added Saurav's fix for nearest neighbor selection of multiclass and binary class endpoints.

e8742e9

weixuanfu reviewed Mar 29, 2018

ryanurbs added 7 commits March 29, 2018 22:55

Completed fixes to 'compute_score' to fix continuous feature scoring issues as well as normalization fixes and update to be more relatable to the rebate papers.

3e52bef

Added remaining clarifying comments to individual RBA module files.

594a92e

minor comment addition

24343f5

Manually added weixuan fu's bug fixes for scoring_utils about data_len and mmdiff. I also changed the use of abs value continuous feature difference and mmdiff normalization so it's only called when a continuous feature is present rather than for any feature regardless.

2b95aa0

Made fix pointed out by Weixuan regarding special case when count_hit or count_miss happens to be zero.

27177fa

weixuanfu reviewed Mar 30, 2018

ryanurbs added 3 commits March 30, 2018 18:22

analysis example dataset file path (csv changed to tsv)

0902bc5

Fixed special case where both count hit and miss are zero. This could happen when there are very few features and instances.

77dbeef

in tests.py, changed all TuRF step sizes from 0.1 to 0.4 to speed up unit testing

008fdf9

ryanurbs force-pushed the ryan_multiclass branch from 008fdf9 to 9429e58 Compare March 31, 2018 00:27

Fixed some errors in my ramp function implementation causing additional test errors.

d0778c4

ryanurbs force-pushed the ryan_multiclass branch from 9429e58 to d0778c4 Compare March 31, 2018 00:57

Minor fix to ramp function implementation to fix error in integrated testing.

3551877

weixuanfu reviewed Apr 2, 2018

Included TuRF to init import, and removed unnecessary code from surfstar and multisurfstar for parallelization check.

2e2e214

ryanurbs merged commit 0f8801b into EpistasisLab:development Apr 3, 2018
