Compare two VCF/BCF files and report various statistics

Usage

vcfcomp(
  test,
  truth,
  formats = c("DS", "GT"),
  stats = "r2",
  by.sample = FALSE,
  by.variant = FALSE,
  flip = FALSE,
  names = NULL,
  bins = NULL,
  af = NULL,
  out = NULL,
  choose_random_start = FALSE,
  return_pse_sites = FALSE,
  ...
)

Arguments

test

path to the comparison file (test), which can be a VCF/BCF file, vcftable object or saved RDS file.

truth

path to the baseline file (truth), which can be a VCF/BCF file, vcftable object or saved RDS file.

formats

character vector. the FORMAT tags to extract from the test and truth files, respectively. The default c("DS", "GT") extracts 'DS' from the test file and 'GT' from the truth file.

stats

character. the statistics to be calculated. Supports the following options:

"r2"

the Pearson correlation coefficient squared (default)

"f1"

the F1-score, the harmonic mean of sensitivity and precision

"nrc"

the Non-Reference Concordance rate

"pse"

the Phasing Switch Error rate

"all"

calculate r2, f1, and nrc together

"gtgq"

genotype quality-based concordance analysis

"gtdp"

depth-based concordance analysis

by.sample

logical. calculate sample-wise concordance, which can be stratified by MAF bin.

by.variant

logical. calculate variant-wise concordance, which can be stratified by MAF bin. If both by.sample and by.variant are FALSE, then do calculations for all samples and variants together in a bin.

flip

logical. flip the REF and ALT alleles

names

character vector. reset samples' names in the test VCF.

bins

numeric vector. break statistics into allele frequency bins. If NULL (default), bins are automatically generated with fine resolution for rare variants and coarser resolution for common variants (ranging from 0 to 0.5).

af

path to a text file with allele frequencies, or an RDS file with a saved object for af. The text file must be space-separated, with a header and five columns named 'chr', 'pos', 'ref', 'alt', and 'af'. If NULL, allele frequencies are calculated from the truth genotypes.

out

output prefix for saving objects into RDS files. If provided, three files are created: .af.rds, .test.rds, and .truth.rds

choose_random_start

logical. choose random start for stats="pse". Defaults to FALSE.

return_pse_sites

logical. return phasing switch error sites when stats="pse". Defaults to FALSE.

...

additional options passed to vcftable, such as 'samples', 'region', or 'pass'.
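
As a rough illustration of the af and bins arguments, the following base-R sketch writes a small allele-frequency table in the five-column layout described above and assigns each site to a frequency bin. The table contents, the bin edges, and the use of cut() are assumptions made for illustration only. They are not the package defaults, and vcfcomp may treat bin boundaries differently.

```r
# Hypothetical allele-frequency table in the five-column layout
# described for the `af` argument.
af_tab <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr1"),
  pos = c(1000L, 2000L, 3000L, 4000L),
  ref = c("A", "C", "G", "T"),
  alt = c("G", "T", "A", "C"),
  af  = c(0.001, 0.02, 0.2, 0.45)
)

# Write it as a space-separated text file with a header line.
af_file <- tempfile(fileext = ".txt")
write.table(af_tab, af_file, sep = " ", quote = FALSE, row.names = FALSE)

# Read it back with explicit column types to confirm the layout.
af_in <- read.table(
  af_file, header = TRUE,
  colClasses = c("character", "integer", "character", "character", "numeric")
)

# Illustrative bin edges (an assumption, not the automatic default bins).
bins <- c(0, 0.01, 0.05, 0.5)

# Assign each site to a bin; cut() is used here only to show the idea.
af_bin <- cut(af_in$af, breaks = bins, include.lowest = TRUE)
bin_counts <- table(af_bin)
print(bin_counts)
```

A file such as af_file could then be supplied through the af argument, and a vector such as bins through the bins argument.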

Value

a list object of class "vcfcomp" containing:

samples

character vector of sample names

stats

the calculated statistics, named according to the 'stats' parameter. For stats="all", returns r2, f1, and nrc components.

Details

vcfcomp implements various statistics for comparing two VCF/BCF files, e.g. genotype concordance and correlation stratified by allele frequency.
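
To make the "r2", "f1", and "nrc" options more concrete, here is a minimal base-R sketch on toy vectors. It assumes commonly used definitions: r2 as the squared Pearson correlation between test dosages and truth genotypes, F1 computed by treating "carries at least one ALT allele" as the positive class, and NRC as the share of concordant calls after excluding sites where both files are homozygous reference. All data and variable names are hypothetical, and the exact formulas used inside vcfcomp may differ.

```r
# Toy data (hypothetical): truth genotypes as ALT-allele counts (0/1/2)
# and test dosages for the same eight genotype calls.
truth_gt <- c(0, 0, 1, 1, 2, 0, 1, 2)
test_ds  <- c(0.1, 0.0, 0.9, 1.2, 1.8, 0.6, 0.4, 2.0)

# r2: squared Pearson correlation between test dosage and truth genotype.
r2 <- cor(test_ds, truth_gt)^2

# Hard-call the dosages so genotypes can be compared directly.
test_gt <- round(test_ds)

# F1 (assumed definition): positive = carries at least one ALT allele.
tp <- sum(test_gt > 0 & truth_gt > 0)
fp <- sum(test_gt > 0 & truth_gt == 0)
fn <- sum(test_gt == 0 & truth_gt > 0)
precision <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# NRC (assumed definition): ignore calls that are hom-ref in both files,
# then take the fraction of the remaining calls that match.
both_homref <- test_gt == 0 & truth_gt == 0
nrc <- sum(test_gt[!both_homref] == truth_gt[!both_homref]) / sum(!both_homref)

print(c(r2 = r2, f1 = f1, nrc = nrc))
```

In this toy case four of the six calls that are not homozygous reference in both files agree, so nrc is 2/3, and precision and sensitivity are both 4/5, so f1 is 0.8.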

Author

Zilong Li zilong.dk@gmail.com

Examples

library('vcfppR')
# site-wise comparison stratified by allele frequency
test <- system.file("extdata", "imputed.gt.vcf.gz", package="vcfppR")
truth <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
samples <- "HG00673,NA10840"
res <- vcfcomp(test, truth, stats="r2", bins=c(0,1), samples=samples, setid=TRUE)
str(res)

# sample-wise comparison stratified by sample-level metrics, e.g. GQ
test <- system.file("extdata", "svupp.call.vcf.gz", package="vcfppR")
truth <- system.file("extdata", "platinum.sv.vcf.gz", package="vcfppR")
res <- vcfcomp(test, truth, stats = "gtgq", region = "chr1")
str(res)
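
The examples above do not exercise stats="pse". As a rough sketch of what a phasing switch error rate measures, the following base-R snippet counts switches between consecutive heterozygous sites for a single sample. The haplotype vectors are hypothetical, and the counting rule shown is one standard definition rather than a description of how vcfcomp computes it; in particular, the effect of choose_random_start is not modelled here.

```r
# Hypothetical phased alleles on the first haplotype at five sites that
# are heterozygous in both the truth and the test file.
truth_h1 <- c(0, 1, 1, 0, 1)
test_h1  <- c(0, 1, 0, 1, 0)

# TRUE where the test phase agrees with the truth phase at a site.
same_phase <- truth_h1 == test_h1

# A switch occurs whenever agreement flips between consecutive het sites;
# the index reported is the second site of each flipped pair.
switch_sites <- which(diff(same_phase) != 0) + 1L

# Switch error rate: switches divided by the number of consecutive pairs.
pse <- length(switch_sites) / (length(truth_h1) - 1)

print(switch_sites)
print(pse)
```

Here the phase agrees at the first two sites and disagrees at the last three, so there is a single switch at the third site and the rate is 1/4.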