Add p-value adjustment based on local fdr #9335
base: main
Conversation
Hello @jake-soloff! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-08-28 20:41:25 UTC
Thanks, looks interesting. I need to read the article. It looks like the unit tests are not run. We don't have sklearn as a dependency. I made an inline comment. Otherwise, this looks good and should be easy to review and merge.
statsmodels/stats/multitest.py (outdated)

    -----

    Assumes p-values are uniformly distributed under the null and have a
    decreasing density under the alternative.
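To make those assumptions concrete, here is a small simulated example (my own illustration, not part of the PR): null p-values are uniform on [0, 1], while Beta(0.5, 1) draws give the alternatives a decreasing density on (0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
m, pi0 = 1000, 0.8
m0 = int(pi0 * m)                        # number of true nulls
null_p = rng.uniform(size=m0)            # uniform under the null
alt_p = rng.beta(0.5, 1.0, size=m - m0)  # Beta(0.5, 1): decreasing density on (0, 1)
pvals = np.concatenate([null_p, alt_p])  # the mixture the method sees
```

With these settings the alternative p-values concentrate near zero, which is exactly the shape the Grenander-based estimate exploits.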
AFAIU, it should also mention that the p-values are i.i.d.; namely, p-values are assumed to be independent (in contrast to other methods that allow for positively or negatively correlated p-values).
I added independence as an assumption in the docstring since the main theory in the article assumes independence. Proposition 8 in the article allows for some dependence but it is a weaker result. Identically distributed is not necessary, however. Here is a result, mentioned in the discussion of the paper, to support this.
Suppose $p_1, \ldots, p_m$ are independent random variables on $[0,1]$ such that, if the $i$-th null $H_i$ is true, the corresponding $p$-value is uniformly distributed, $p_i \sim \text{Uniform}(0,1)$. For the SL$(\alpha)$ procedure, the probability that the last rejection is null is exactly $\frac{m_0}{m}\alpha$, where $m_0$ denotes the number of nulls.
Interestingly, this result doesn't require decreasing alternative densities, but it is only justified to control the probability that the last rejection is null if the alternative densities are decreasing.
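For readers who want a concrete picture of the procedure being discussed, here is a rough sketch under my reading of the paper: the Grenander density estimate at the sorted p-values is the set of left slopes of the least concave majorant (LCM) of the empirical CDF, and SL(alpha) rejects wherever the implied lfdr estimate falls at or below alpha. The function names and details here are mine, not the PR's code, and distinct p-values are assumed.

```python
import numpy as np

def grenander_slopes(pvals):
    """Left slopes of the least concave majorant of the empirical CDF at the
    sorted p-values: the Grenander estimator of a decreasing density.
    Illustrative sketch only; assumes distinct p-values."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    x = np.concatenate(([0.0], p))   # knots: 0, p_(1), ..., p_(m)
    y = np.arange(m + 1) / m         # ECDF values at the knots
    hull = [0]                       # indices of the LCM vertices
    for i in range(1, m + 1):
        # pop vertices that would make the majorant non-concave
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            if (y[b] - y[a]) * (x[i] - x[b]) <= (y[i] - y[b]) * (x[b] - x[a]):
                hull.pop()
            else:
                break
        hull.append(i)
    slopes = np.empty(m)
    for a, b in zip(hull[:-1], hull[1:]):
        # each LCM segment gives the slope at p_(a+1), ..., p_(b)
        slopes[a:b] = (y[b] - y[a]) / (x[b] - x[a])
    return p, slopes

def sl_procedure(pvals, alpha=0.2, pi_0_est=1.0):
    """SL(alpha): reject where the estimated lfdr = pi_0 / f_hat(p) <= alpha."""
    p_sorted, slopes = grenander_slopes(pvals)
    lfdr = np.minimum(1.0, pi_0_est / slopes)
    return p_sorted, lfdr, lfdr <= alpha
```

Because the Grenander slopes are non-increasing in the sorted p-values, the estimated lfdr is non-decreasing, so the rejection set is always an initial segment of the sorted p-values.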
statsmodels/stats/multitest.py
Outdated
slopes = slopes[pvals_sortind.argsort()] | ||
ell_values = np.minimum(1, pi_0_est/slopes) | ||
|
||
return (ell_values <= alpha), ell_values |
I think we should return a Holder or HolderTuple instance (preferred in new code).
I removed the rejection decision (ell_values <= alpha), but I added the tail FDR estimates, so I changed the return statement to return Holder(fdr=fdr, lfdr=lfdr).
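For context, Holder in statsmodels is essentially a plain attribute container. A minimal stand-in (illustration only, not the actual statsmodels class) shows the return style being suggested, with results accessed by name rather than by position:

```python
class Holder:
    """Minimal stand-in for statsmodels' Holder: a plain attribute bag.
    Illustration only; the real class ships with statsmodels."""

    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __repr__(self):
        attrs = ", ".join(f"{k}={v!r}" for k, v in vars(self).items())
        return f"Holder({attrs})"

# results are accessed by attribute, e.g. res.fdr and res.lfdr
res = Holder(fdr=[0.012, 0.034], lfdr=[0.20, 0.45])
```

Attribute access makes it cheap to add or remove returned statistics later without breaking positional unpacking in user code.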
statsmodels/stats/multitest.py (outdated)

    pvals : array_like, 1d
        List of p-values of the individual tests.
    pi_0_est : float, optional
        Estimate of the null proportion. Defaults to conservative choice ``1.``.
local_fdr uses null_proportion as the name of this argument. Is this the same type of parameter? We should try to have names consistent across methods.
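As background on that parameter: the conservative default fixes the null proportion at 1, and a common data-driven alternative is Storey's estimator, sketched here as a hypothetical helper (not part of this PR):

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey's estimator of the null proportion pi_0: p-values above the
    threshold `lam` are treated as (mostly) coming from nulls, which are
    uniform, so their count is scaled up by 1 / (1 - lam).
    Hypothetical helper for illustration, clipped at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return min(1.0, np.mean(pvals > lam) / (1.0 - lam))
```

A smaller estimated null proportion makes the lfdr estimate, and hence the procedure, less conservative than the default of 1.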
I'm not trying to understand the method (it's an Annals article and not easy to read). The docstring should be informative enough to give a rough idea of what it does and how it (local fdr, max-lfdr) differs from standard FDR (BH).

To understand a bit of the background: why is the default alpha = 0.2, much larger than the usual 0.05? (edit): the end of section 1.2 sounds like alpha corresponds to controlling max-lfdr and not controlling fdr as in BH.

Aside: the plan is to eventually remove the rejection decision from the return.

Extra question (not in this PR):
IIUC now, then multitest would have three different statistics that are controlled. I guess local_fdr would also control marginal FDR but uses z-values instead of p-values as argument.

In that case, we should make it more explicit in the multitest docstring, especially for the alpha parameter (as long as it is still included) and for the meaning of the returned corrected pvalues. Fig 4 on page 594 looks informative (once the definitions are understood).

update: more on the difference between BH FDR (tail-area average local fdr) and local fdr: Efron, Bradley. 2007. "Size, Power and False Discovery Rates." The Annals of Statistics 35 (4): 1351-77. https://doi.org/10.1214/009053606000001460.
Looking a bit more at the background, including the fdrtool package in R: is it possible to return the tail-FDR (corrected p-values) implied by the local fdr estimate?

The current

Related: The
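One way to answer that question (a sketch, not the PR's API): in Efron's framework the tail-area FDR over a rejection region is the average of the local fdr values inside it, so a running mean of the lfdr values in p-value order gives the implied tail-FDR. The helper name here is mine.

```python
import numpy as np

def tail_fdr_from_lfdr(pvals, lfdr):
    """Tail-area FDR implied by local fdr estimates: the running average of
    lfdr over the p-values sorted in increasing order (cf. Efron 2007).
    Hypothetical helper; returns values aligned with the input order."""
    pvals = np.asarray(pvals, dtype=float)
    lfdr = np.asarray(lfdr, dtype=float)
    order = np.argsort(pvals)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, pvals.size + 1)
    fdr = np.empty_like(running_mean)
    fdr[order] = running_mean            # scatter back to the original order
    return fdr
```

Since the tail-FDR at each p-value averages the (typically smaller) lfdr values below it, it is never larger than the lfdr there, matching the usual ordering of the two quantities.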
This PR adds a function lfdrcorrection for estimating the local false discovery rate (lfdr) based on a sequence of p-values. The method estimates the lfdr using the Grenander estimator of a monotone density. Rejecting the null whenever the estimated lfdr falls below some user-specified tolerance $\alpha$ yields the support line (SL) procedure. The SL procedure controls the lfdr -- see this journal paper or the corresponding arxiv preprint. This result is analogous to the fdr control guarantee for the Benjamini-Hochberg procedure, implemented in fdrcorrection.

Notes:
It is essential that you add a test when making code changes. Tests are not needed for doc changes.

When fixing a bug, you must add a test that fails without your change, and then show that it is fixed with the new code.

Verify your changes are well formatted by running flake8, assuming flake8 is installed. This command is also available on Windows using the Windows Subsystem for Linux once flake8 is installed in the local Linux environment. While passing this test is not required, it is good practice and it helps improve code quality in statsmodels.