Small examples of common filetypes for unit-testing bioinformatics frameworks.
All mutation data is either simulated, completely artificial, or subsampled somatic calls from publicly available cell lines.
Most files describe a single sample. Files describing a cohort are prefixed with 'cohort'.
tumor_normal.2sample.purple.pave.hg38.vcf
- Example somatic mutations in a typical tumor-normal VCF, as produced by oncoanalyser (PURPLE-enriched VCF calls). The filter status of some PASS variants was manually changed to 'readStrandBias' or '.'. Variants have been annotated with PAVE.
tumor_normal.2sample.purple.minimal.hg38.vcf
- Removed INFO and FORMAT fields except for GT using bcftools annotate -x "INFO,FORMAT"
tumor_normal.2sample.purple.minimal.vep.hg38.vcf
- VEP-annotated (with the 'identify canonical transcripts' option on).
tumor.1sample.purple.pave.hg38.vcf
- Single-sample version of tumor_normal.2sample.purple.pave.hg38.vcf. 'Normal' sample dropped using bcftools view -s tumor
tumor.1sample.purple.minimal.hg38.vcf
- More minimal version of tumor_normal.2sample.purple.hg38.vcf with INFO and FORMAT fields dropped using bcftools annotate -x 'INFO,FORMAT'. Only the GT field remains.
tumor.0sample.purple.minimal.hg38.vcf
- Minimal VCF with no sample information. Mutations describe a single sample whose ID is not described anywhere in the file.
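A test suite may want to confirm how many samples a VCF declares before parsing it. The following is a minimal sketch, assuming a standard VCF header in which sample names are the columns after FORMAT on the #CHROM line. The temporary file and its single record are hypothetical stand-ins, not the contents of tumor.0sample.purple.minimal.hg38.vcf.

```shell
# Hypothetical sample-free VCF: 8 fixed columns, no FORMAT, no sample columns.
vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf 'chr1\t12345\t.\tA\tG\t.\tPASS\t.\n' >> "$vcf"

# Samples are whatever follows the 9th column (FORMAT) on the #CHROM line.
# A header with only 8 or 9 columns therefore declares zero samples.
n_samples="$(awk -F'\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }' "$vcf")"
echo "samples declared: $n_samples"
rm -f "$vcf"
```

The same check applied to a 1sample or 2sample file should report the count given in the filename.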
tumor.1sample.purple.vep.hg38.vcf
- Annotated with VEP (CSQ INFO field). See the VCF header for the command.
tumor.1sample.purple.vep_and_pave.hg38.vcf
- Annotated with VEP (CSQ INFO field). PAVE annotations remain present. See tumor.singlesample.purple.vep.hg38.vcf for a VEP-only version.
- Options: GRCh38.p14; GENCODE 48; Cache Version 114_GRCh38
annovar.hg38.txt & annovar.hg38.csv
- ANNOVAR annotation files generated by running tumor.singlesample.purple.hg38.vcf through wANNOVAR (TSV and CSV versions).
chromposrefalt.1based.hg38.tsv
- Minimal Tabular Variant Format (pass only).
bcftools view -f PASS -H tumor.singlesample.purple.minimal.hg38.vcf | cut -f1,2,4,5 | awk -v OFS="\t" 'BEGIN{print "Chromosome","Position","Ref","Alt"}{print $0}' | head > chromposrefalt.1based.hg38.tsv
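A quick layout check for a file of this shape can be sketched with awk alone. This assumes a tab-separated header of Chromosome, Position, Ref, Alt followed by four-column rows with 1-based integer positions. The temporary file and its rows are hypothetical stand-ins, not the real contents of chromposrefalt.1based.hg38.tsv.

```shell
# Hypothetical stand-in for the tabular variant file (rows are made up).
tsv="$(mktemp)"
printf 'Chromosome\tPosition\tRef\tAlt\n' > "$tsv"
printf 'chr1\t12345\tA\tG\n' >> "$tsv"
printf 'chr2\t67890\tC\tT\n' >> "$tsv"

# Count lines that break the expected layout: a wrong header, a wrong
# column count, or a position that is not a positive (1-based) integer.
bad_lines="$(awk -F'\t' '
  NR == 1 { if ($0 != "Chromosome\tPosition\tRef\tAlt") bad++; next }
  NF != 4 || $2 !~ /^[0-9]+$/ || $2 < 1 { bad++ }
  END { print bad + 0 }
' "$tsv")"
echo "malformed lines: $bad_lines"
rm -f "$tsv"
```

Comparing the header as a tab-joined string also catches a header that was written with spaces instead of tabs.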
tumor_normal.2sample.purple.sv.hg38.vcf
- PURPLE somatic SVs (PASS & INFERRED). Oncoanalyser output.
tumor.1sample.purple.sv.hg38.vcf
- Somatic SVs (PASS & INFERRED) with only one sample (the tumor sample) described.
tumor.0sample.purple.sv.hg38.vcf
- Somatic SVs (PASS & INFERRED) describing a single sample, with no sample ID in the VCF.
purple.sv.breakpoints.hg38.bedpe
- Somatic breakpoints from tumor_normal.2sample.purple.sv.hg38.vcf. Does not include single breakends, where the second breakpoint could not be found. See scripts/sv_vcf_to_tabular.R for code to reproduce.
purple.sv.breakends.hg38.bed
- Somatic single breakends from tumor_normal.2sample.purple.sv.hg38.vcf. Does not include SVs where both ends of the breakpoint are found. Scores of breakends inferred from copy number change are set to zero. See scripts/sv_vcf_to_tabular.R for code to reproduce.
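Because inferred breakends carry a score of zero, a test can separate them from the rest by looking at the score column. The sketch below assumes the conventional BED column order (chrom, start, end, name, score), which the description above does not confirm for this file. The records and the names bnd_a, bnd_b, and bnd_c are hypothetical.

```shell
# Hypothetical stand-in for a breakend BED file: chrom, start, end, name, score.
bed="$(mktemp)"
printf 'chr1\t999\t1000\tbnd_a\t0\n' > "$bed"
printf 'chr3\t4999\t5000\tbnd_b\t250\n' >> "$bed"
printf 'chr7\t7999\t8000\tbnd_c\t0\n' >> "$bed"

# Count records whose score (assumed to be column 5) equals zero.
n_zero="$(awk -F'\t' '$5 == 0 { n++ } END { print n + 0 }' "$bed")"
echo "breakends with score 0: $n_zero"
rm -f "$bed"
```

If the real file places the score elsewhere, only the column index in the awk condition needs to change.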
purple.cnv.somatic.hg38.tsv
- Copy number profile of all (contiguous) segments of a tumor sample
cohort.3sample.purple.hg38.tsv
- Cohort segment file. Contains three samples with identical copy number profiles (purple.cnv.somatic.hg38.tsv triplicated).
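Since the three samples share one profile, every sample should contribute the same number of segments. A minimal sketch of that check follows, assuming the sample identifier sits in the first column. The column name 'sample', the IDs S1 to S3, and the segment rows are all hypothetical, so adjust them to the real header.

```shell
# Hypothetical cohort file: one sample column followed by segment columns.
cohort="$(mktemp)"
printf 'sample\tchromosome\tstart\tend\tcopyNumber\n' > "$cohort"
for s in S1 S2 S3; do
  printf '%s\tchr1\t1\t1000\t2.0\n' "$s" >> "$cohort"
  printf '%s\tchr1\t1001\t5000\t3.1\n' "$s" >> "$cohort"
done

# Segments per sample, one "<sample><TAB><count>" line each, sorted by sample.
seg_counts="$(awk -F'\t' 'NR > 1 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' "$cohort" | sort)"

# Number of distinct per-sample counts; 1 means every sample has the same count.
n_distinct="$(printf '%s\n' "$seg_counts" | cut -f2 | sort -u | wc -l | tr -d ' ')"
echo "distinct segment counts across samples: $n_distinct"
rm -f "$cohort"
```

Equal counts are a necessary condition only. A stricter test would also compare the segment rows themselves across samples.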