
Add p-value adjustment based on local fdr #9335


Open

jake-soloff wants to merge 9 commits into main

Conversation

jake-soloff

  • tests added / passed.
  • code/documentation is well formatted.
  • properly formatted commit message. See
    NumPy's guide.

This PR adds a function lfdrcorrection for estimating the local false discovery rate (lfdr) based on a sequence of p-values. The method estimates the lfdr using the Grenander estimator of a monotone density. Rejecting the null whenever the estimated lfdr falls below some user-specified tolerance $\alpha$ yields the support line (SL) procedure. The SL procedure controls the lfdr; see the journal paper or the corresponding arXiv preprint. This result is analogous to the FDR control guarantee for the Benjamini-Hochberg procedure, implemented in fdrcorrection.
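As a rough illustration (not the PR's actual code), a minimal sketch of a Grenander-based lfdr estimate follows; the function name grenander_lfdr is made up, it uses scipy.optimize.isotonic_regression (available in SciPy 1.12+), and it assumes distinct, nonzero p-values:

import numpy as np
from scipy.optimize import isotonic_regression  # SciPy >= 1.12

def grenander_lfdr(pvals, pi0=1.0):
    # Grenander estimator of a decreasing p-value density: the left
    # derivative of the least concave majorant of the empirical CDF,
    # computed here by weighted antitonic regression on the raw slopes.
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.shape[0]
    order = np.argsort(pvals)
    gaps = np.diff(np.concatenate(([0.0], pvals[order])))
    raw = (1.0 / m) / gaps  # raw slopes of the empirical CDF
    dens = isotonic_regression(raw, weights=gaps, increasing=False).x
    lfdr = np.empty(m)
    # lfdr = pi0 * f0 / f with uniform null density f0 = 1, capped at 1
    lfdr[order] = np.minimum(1.0, pi0 / dens)
    return lfdr

# SL procedure at tolerance alpha: reject = grenander_lfdr(pvals) <= alpha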

Notes:

  • It is essential that you add a test when making code changes. Tests are not
    needed for doc changes.
  • When adding a new function, test values should usually be verified in another package (e.g., R/SAS/Stata).
  • When fixing a bug, you must add a test that would produce the bug in main and
    then show that it is fixed with the new code.
  • New code additions must be well formatted. Changes should pass flake8. If on Linux or OSX, you can
    verify your changes are well formatted by running
    git diff upstream/main -u -- "*.py" | flake8 --diff --isolated
    
    assuming flake8 is installed. This command is also available on Windows
    using the Windows Subsystem for Linux once flake8 is installed in the
    local Linux environment. While passing this test is not required, it is good practice and it helps
    improve code quality in statsmodels.
  • Docstring additions must render correctly, including escapes and LaTeX.

@pep8speaks

pep8speaks commented Aug 26, 2024

Hello @jake-soloff! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 383:77: W291 trailing whitespace

Comment last updated at 2024-08-28 20:41:25 UTC

@josef-pkt
Member

josef-pkt commented Aug 27, 2024

Thanks, looks interesting. I need to read the article.

It looks like the unit tests are not run.
I guess the new method needs to be added to some @pytest.mark.parametrize list.
(update) Isotonic regression was added in SciPy version 1.12.0.

We don't have sklearn as a dependency. I made an inline comment.

Otherwise, this looks good and should be easy to review and merge.

-----

Assumes p-values are uniformly distributed under the null and have a
decreasing density under the alternative.
Member

AFAIU, it should also mention that the p-values are i.i.d., mainly that the p-values are assumed to be independent
(in contrast to other methods that allow for positively or negatively correlated p-values).

Author

I added independence as an assumption in the docstring since the main theory in the article assumes independence. Proposition 8 in the article allows for some dependence but it is a weaker result. Identically distributed is not necessary, however. Here is a result, mentioned in the discussion of the paper, to support this.

Suppose $p_1,\ldots,p_m$ are independent random variables on $[0,1]$ such that, if the $i$-th null $H_i$ is true, the corresponding $p$-value is uniformly distributed $p_i\sim \text{Uniform}(0,1)$. For the SL $(\alpha)$ procedure, the probability that the last rejection is null is exactly $\frac{m_0}{m}\alpha$, where $m_0$ denotes the number of nulls.

Interestingly, this result doesn't require decreasing alternative densities; however, controlling the probability that the last rejection is null is only a justified goal when the alternative densities are decreasing.
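A rough Monte Carlo check of this claim (illustrative only; it reuses the hypothetical grenander_lfdr sketch from the PR description above, which may differ in details from the exact SL($\alpha$) procedure in the paper):

import numpy as np

rng = np.random.default_rng(0)
m, m0, alpha, n_rep = 200, 150, 0.2, 2000
hits = 0
for _ in range(n_rep):
    # first m0 hypotheses are null (uniform p-values); the remaining
    # alternatives get a decreasing density, here Beta(0.25, 1)
    pvals = np.concatenate([rng.uniform(size=m0),
                            rng.beta(0.25, 1.0, size=m - m0)])
    reject = grenander_lfdr(pvals) <= alpha
    if reject.any():
        last = np.argmax(np.where(reject, pvals, -np.inf))
        hits += last < m0  # was the largest rejected p-value a null?
print(hits / n_rep, "vs", alpha * m0 / m)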

# restore the original (unsorted) order of the p-values
slopes = slopes[pvals_sortind.argsort()]
# local fdr estimate: estimated null proportion over the Grenander
# density estimate, capped at 1
ell_values = np.minimum(1, pi_0_est/slopes)

return (ell_values <= alpha), ell_values
Member

I think we should return a Holder or HolderTuple instance.
(preferred in new code)

Author

I removed the rejection decision (ell_values <= alpha) but I added the tail FDR estimates, so I changed the return statement to return Holder(fdr=fdr, lfdr=lfdr).
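For illustration, a hypothetical call against the updated return value (the import path and attribute names follow this discussion, not a released API):

import numpy as np
from statsmodels.stats.multitest import lfdrcorrection  # added by this PR

pvals = np.array([0.001, 0.01, 0.04, 0.3, 0.7])
res = lfdrcorrection(pvals)   # returns a Holder
reject = res.lfdr <= 0.2      # SL procedure at tolerance alpha = 0.2
fdr_est = res.fdr             # tail-FDR estimates implied by the lfdr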

pvals : array_like, 1d
List of p-values of the individual tests.
pi_0_est : float, optional
Estimate of the null proportion. Defaults to the conservative choice ``1.``.
Member

local_fdr uses null_proportion as the name of this argument.
Is it the same type of parameter?
We should try to keep names consistent across methods.

@josef-pkt
Member

josef-pkt commented Aug 27, 2024

I'm not going to try to fully understand the method (it's an Annals article and not easy to read).

The docstring should be informative enough to have a rough idea about what it does and how it (local fdr, max-lfdr) differs from standard FDR (BH).
(But I will not go over the details of the algorithm.)

To understand a bit of the background:
I never looked at the details of local FDR.

Why is the default alpha = 0.2? That is much larger than the usual 0.05.
Do the returned p-values (ell_values) correspond to the usual corrected p-values, or do they have a different (local?) meaning than BH-FDR?

(edit) The end of section 1.2 sounds like alpha corresponds to controlling the max-lfdr, not controlling the FDR as in BH.

On the code:
alpha is only used for the rejection decision and is not part of the algorithm.
So I think the function should only return the corrected p-values (ell_values) and not the rejection decision.
For consistency, multipletests can compute the rejection decision.
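A rough sketch of what that could look like (the method name "lfdr" is hypothetical here, not an existing multipletests option):

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.01, 0.04, 0.3, 0.7])
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.2, method="lfdr")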

Aside: The plan is to eventually remove the rejection decision from the return of multipletests. It was added mainly for historical reasons, when multiple testing articles were written in terms of rejection decisions and not in terms of corrected p-values.

Extra question (not in this PR):
Is there a two- or multi-stage version of this (similar to fdrcorrection_twostage "bky")?
For example, Theorem 4 and the simulations in the article use $\pi_0^\lambda$.

@josef-pkt
Member

josef-pkt commented Aug 27, 2024

IIUC, multitest would then have three different statistics that are controlled:

  • FWER
  • FDR
  • marginal FDR or max FDR

I guess local_fdr would also control marginal FDR but uses z-values instead of p-values as argument.
(add See Also)

In that case, we should make it more explicit in the multipletests docstring, especially for the alpha parameter (as long as it is still included) and for the meaning of the returned corrected p-values.

Fig. 4 on page 594 looks informative (once the definitions are understood).
(IIUC, I would prefer "marginal fdr" for intuition.)

Update

More on the difference between BH FDR (tail-area average of the local fdr) and the local fdr:

Efron, Bradley. 2007. “Size, Power and False Discovery Rates.” The Annals of Statistics 35 (4): 1351–77. https://doi.org/10.1214/009053606000001460.

@josef-pkt
Member

josef-pkt commented Aug 28, 2024

Looking a bit more at the background, including the fdrtool package in R:
https://cran.r-project.org/web/packages/fdrtool/index.html

Is it possible to return the tail-FDR (corrected p-values) implied by the local fdr estimate?

The current multipletests only needs the lfdr/pvalues_corrected.
But it might be useful to expand the standalone function lfdrcorrection with other interesting returns.
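One way this could look, as a minimal sketch: the tail-area FDR at each p-value is the running average of the lfdr over all p-values at or below it (cf. Efron 2007). The helper name is made up and not part of this PR:

import numpy as np

def tail_fdr_from_lfdr(pvals, lfdr):
    # average the lfdr values over the left tail p <= p_(k)
    order = np.argsort(pvals)
    fdr_sorted = np.cumsum(lfdr[order]) / np.arange(1, len(pvals) + 1)
    fdr = np.empty_like(fdr_sorted)
    fdr[order] = fdr_sorted  # restore the original order
    return fdr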

Related:
Using the normality assumption for the z-scores, I think we can add the current z-score based local_fdr as a method to multipletests (using norm.cdf and norm.ppf to convert between z-scores and p-values).
(edit) But we don't have the signs of the z-scores; p-values are positive.

The local_fdr function allows for a null_pdf option different from the default Gaussian, but in those cases the user can call local_fdr directly.
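A short sketch of the conversion mentioned above; with one-sided p-values the sign of the original z-scores cannot be recovered:

import numpy as np
from scipy.stats import norm

pvals = np.array([0.001, 0.01, 0.04, 0.3, 0.7])
zscores = norm.isf(pvals)      # p-value -> z-score, i.e. norm.ppf(1 - p)
pvals_back = norm.sf(zscores)  # z-score -> p-value, i.e. 1 - norm.cdf(z)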
