Abstract
This protocol describes how to perform basic statistical analysis in a population-based genetic association case-control study. The steps described involve (i) the appropriate selection of measures of association and the relevance of disease models; (ii) the appropriate selection of tests of association; (iii) the visualization and interpretation of results; (iv) the consideration of appropriate methods to control for multiple testing; and (v) replication strategies. We assume no previous experience with software such as PLINK, R or Haploview, and describe how to use these popular tools to handle single-nucleotide polymorphism data, carry out tests of association, and visualize and interpret the results. This protocol assumes that data quality assessment and control have already been performed, as described in a previous protocol, so that samples and markers with the potential to introduce bias into the study have been identified and removed. Study design, marker selection and quality control of case-control studies have also been discussed in earlier protocols. The protocol should take ∼1 h to complete.
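As a rough illustration of steps (i), (ii) and (iv), the sketch below computes an allelic odds ratio and a 1-degree-of-freedom chi-squared test for a single SNP, and then a Bonferroni-adjusted significance threshold. It is not taken from the protocol, which works with PLINK, R and Haploview. The function names, the genotype counts and the number of tests are hypothetical, and the allelic test is only one of several possible tests of association.

```python
import math


def allele_counts(n_aa, n_ab, n_bb):
    """Convert genotype counts (AA, AB, BB) into allele counts (A, B)."""
    return 2 * n_aa + n_ab, n_ab + 2 * n_bb


def allelic_test(case_genotypes, control_genotypes):
    """Allelic odds ratio and Pearson chi-squared test on the 2x2 allele table.

    Returns (odds_ratio, chi2, p_value). Assumes every cell is non-zero.
    """
    a, b = allele_counts(*case_genotypes)      # A and B alleles in cases
    c, d = allele_counts(*control_genotypes)   # A and B alleles in controls
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-squared statistic for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


# Hypothetical genotype counts (AA, AB, BB) for one SNP.
cases = (30, 50, 20)
controls = (20, 50, 30)
odds_ratio, chi2, p_value = allelic_test(cases, controls)

# Hypothetical number of SNPs tested in the scan.
threshold = bonferroni_threshold(0.05, 500_000)
print(f"OR={odds_ratio:.3f} chi2={chi2:.3f} p={p_value:.4g} threshold={threshold:.1e}")
print("significant after Bonferroni:", p_value < threshold)
```

The Bonferroni threshold treats all tests as independent, so it tends to be conservative when SNPs are in linkage disequilibrium with one another.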
Acknowledgements
G.M.C. is funded by the Wellcome Trust. F.H.P. is funded by the Wellcome Trust. C.A.A. is funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. is supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. is supported by a Wellcome Trust Research Career Development Fellowship.
Author information
Contributions
G.M.C. wrote the first draft of the manuscript, wrote scripts and performed analyses. G.M.C., C.A.A., A.P.M. and K.T.Z. revised the manuscript and designed the protocol. L.R.C. conceived the protocol.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Data 1
Example genome-wide association (GWA) data. (ZIP 201229 kb)
Supplementary Data 2
Example candidate gene (CG) data. (ZIP 66 kb)
About this article
Cite this article
Clarke, G., Anderson, C., Pettersson, F. et al. Basic statistical analysis in genetic case-control studies. Nat Protoc 6, 121–133 (2011). https://doi.org/10.1038/nprot.2010.182
DOI: https://doi.org/10.1038/nprot.2010.182