Compare two VCF/BCF files reporting various statistics
Usage
vcfcomp(
test,
truth,
formats = c("DS", "GT"),
stats = "r2",
by.sample = FALSE,
by.variant = FALSE,
flip = FALSE,
names = NULL,
bins = NULL,
af = NULL,
out = NULL,
choose_random_start = FALSE,
return_pse_sites = FALSE,
...
)
Arguments
- test
path to the comparison file (test), which can be a VCF/BCF file, vcftable object or saved RDS file.
- truth
path to the baseline file (truth), which can be a VCF/BCF file, vcftable object or saved RDS file.
- formats
character vector. the FORMAT tags to extract for the test and truth respectively. default c("DS", "GT") extracts 'DS' of the test and 'GT' of the truth.
- stats
character. the statistics to be calculated. Supports the following options:
- "r2"
the Pearson correlation coefficient squared (default)
- "f1"
the F1 score, the harmonic mean of sensitivity and precision
- "nrc"
the Non-Reference Concordance rate
- "pse"
the Phasing Switch Error rate
- "all"
calculate r2, f1, and nrc together
- "gtgq"
genotype quality-based concordance analysis
- "gtdp"
depth-based concordance analysis
- by.sample
logical. calculate sample-wise concordance, which can be stratified by MAF bin.
- by.variant
logical. calculate variant-wise concordance, which can be stratified by MAF bin. If both by.sample and by.variant are FALSE, the statistics are calculated over all samples and variants pooled within each bin.
- flip
logical. flip the ref and alt variants
- names
character vector. reset samples' names in the test VCF.
- bins
numeric vector. break statistics into allele frequency bins. If NULL (default), bins are automatically generated with fine resolution for rare variants and coarser resolution for common variants (ranging from 0 to 0.5).
- af
path to a text file with allele frequencies, or to an RDS file with a saved object for af. Format of the text file: space-separated, with five columns and a header line naming them 'chr' 'pos' 'ref' 'alt' 'af'. If NULL, allele frequencies are calculated from the truth genotypes.
- out
output prefix for saving objects into RDS files. If provided, creates three files: .af.rds, .test.rds, and .truth.rds
- choose_random_start
logical. choose random start for stats="pse". Defaults to FALSE.
- return_pse_sites
logical. return phasing switch error sites when stats="pse". Defaults to FALSE.
- ...
additional options passed to vcftable, such as 'samples', 'region', or 'pass'.
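The text format described for the af argument can be sketched in base R as follows. This is only an illustration: the variant records, the frequencies, and the temporary file name are invented, and the code does not call vcfppR itself.

```r
# Hypothetical allele-frequency table with the five documented columns.
# All values below are made up for illustration.
af_df <- data.frame(
  chr = c("chr1", "chr1", "chr1"),
  pos = c(10001L, 10250L, 11873L),
  ref = c("A", "G", "T"),
  alt = c("G", "C", "C"),
  af  = c(0.012, 0.250, 0.431)
)

# Write a space-separated file with a header line and no row names or quotes.
af_file <- tempfile(fileext = ".txt")
write.table(af_df, af_file, sep = " ", quote = FALSE, row.names = FALSE)

# Read it back to check the layout; explicit column classes keep a
# ref/alt value of "T" from being parsed as a logical.
af_back <- read.table(
  af_file, header = TRUE,
  colClasses = c("character", "integer", "character", "character", "numeric")
)
print(af_back)
```

A file written this way could then be supplied as af = af_file in a vcfcomp call.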
Value
a list object of class "vcfcomp" containing:
- samples
character vector of sample names
- stats
the calculated statistics, named according to the 'stats' parameter. For stats="all", returns r2, f1, and nrc components.
Details
vcfcomp implements several statistics for comparing two VCF/BCF files,
such as genotype concordance and correlation stratified by allele frequency.
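To make the idea of a correlation stratified by allele frequency concrete, here is a minimal base R sketch of a squared Pearson correlation between test dosages and truth genotypes, computed per frequency bin. It is not the package's implementation: the toy matrices, the bin edges, the use of minor allele frequency, and the choice to pool all genotypes within a bin are assumptions made for illustration.

```r
# Toy data (invented): 6 variants (rows) x 4 samples (columns).
truth_gt <- matrix(c(0, 1, 0, 0,
                     0, 0, 1, 0,
                     1, 1, 0, 2,
                     2, 1, 1, 0,
                     1, 2, 1, 1,
                     0, 1, 2, 1), nrow = 6, byrow = TRUE)
test_ds <- matrix(c(0.1, 0.8, 0.0, 0.2,
                    0.0, 0.1, 0.7, 0.1,
                    1.1, 0.9, 0.2, 1.8,
                    1.9, 1.2, 0.8, 0.1,
                    1.0, 1.7, 1.1, 0.9,
                    0.2, 1.0, 1.9, 1.2), nrow = 6, byrow = TRUE)

# Allele frequency per variant from the truth genotypes (diploid),
# folded to a minor allele frequency in [0, 0.5].
af  <- rowMeans(truth_gt) / 2
maf <- pmin(af, 1 - af)

# Assign each variant to a bin; these edges are illustrative only.
bins <- c(0, 0.2, 0.5)
bin  <- cut(maf, breaks = bins, include.lowest = TRUE)

# Squared Pearson correlation over all genotypes pooled within each bin.
r2_by_bin <- sapply(split(seq_len(nrow(truth_gt)), bin), function(idx) {
  cor(as.vector(test_ds[idx, , drop = FALSE]),
      as.vector(truth_gt[idx, , drop = FALSE]))^2
})
print(r2_by_bin)
```

Pooling within a bin mirrors the case where both by.sample and by.variant are FALSE; a per-sample or per-variant variant of the sketch would apply cor to the columns or rows instead.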
Author
Zilong Li zilong.dk@gmail.com
Examples
library('vcfppR')
# site-wise comparison stratified by allele frequency
test <- system.file("extdata", "imputed.gt.vcf.gz", package="vcfppR")
truth <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
samples <- "HG00673,NA10840"
res <- vcfcomp(test, truth, stats="r2", bins=c(0,1), samples=samples, setid=TRUE)
str(res)
# sample-wise comparison stratified by a sample-level metric, e.g. GQ
test <- system.file("extdata", "svupp.call.vcf.gz", package="vcfppR")
truth <- system.file("extdata", "platinum.sv.vcf.gz", package="vcfppR")
res <- vcfcomp(test, truth, stats = "gtgq", region = "chr1")
str(res)