
The Swiss Army knife for reading VCF/BCF files into R data types rapidly and easily.

Usage

vcftable(
  vcffile,
  region = "",
  samples = "-",
  vartype = "all",
  format = "GT",
  ids = NULL,
  qual = 0,
  pass = FALSE,
  info = TRUE,
  collapse = TRUE,
  setid = FALSE,
  mac = 0
)

Arguments

vcffile

path to the VCF/BCF file

region

region to subset in bcftools-like style: "chr1", "chr1:1-10000000"

samples

samples to subset in bcftools-like style: a comma-separated list of samples to include (or to exclude with the "^" prefix), e.g. "id01,id02", "^id01,id02".

vartype

restrict to specific types of variants. supports "snps", "indels", "sv", "multisnps", "multiallelics". default "all" applies no restriction.

format

the FORMAT tag to extract. default is "GT".

ids

character vector. restrict to sites with ID in the given vector. default NULL won't filter any sites.

qual

numeric. restrict to variants with QUAL > qual.

pass

logical. restrict to variants with FILTER = "PASS".

info

logical. drop the INFO column from the returned list if FALSE.

collapse

logical. It acts on the FORMAT tag. If the tag to extract is "GT", the raw genotype matrix of diploid samples has dimensions (M, 2 * N), where M is the number of markers and N is the number of samples. The default TRUE collapses the genotypes of each sample so that the matrix is (M, N). Set this to FALSE to preserve the phasing order, e.g. "1|0" is parsed as c(1, 0) with collapse=FALSE. If the tag to extract is not "GT", collapse=TRUE tries to turn the list of extracted vectors into a matrix. However, this causes problems when a variant is multiallelic and therefore carries more values than the others.

setid

logical. reset the ID column to CHR_POS_REF_ALT.

mac

integer. restrict to variants with minor allele count higher than the value.
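The effect of the collapse argument on "GT" can be illustrated without reading any file. The following is a minimal base-R sketch on a toy matrix, not the package's internal code. It assumes that collapsing amounts to adding the two allele columns of each diploid sample and that a missing allele is stored as NA; the names raw, first, second and collapsed are made up for the illustration.

```r
# Toy haplotype-level genotypes: M = 3 markers, N = 2 diploid samples,
# so the raw matrix has 2 * N = 4 columns (two alleles per sample).
raw <- matrix(
  c(0L, 1L, 1L, 1L,
    1L, 0L, 0L, 0L,
    0L, 0L, NA, 1L),
  nrow = 3, byrow = TRUE
)

# Assumed collapse rule: add the two allele columns of each sample.
first  <- seq(1, ncol(raw), by = 2)   # columns holding the first allele
second <- first + 1                   # columns holding the second allele
collapsed <- raw[, first, drop = FALSE] + raw[, second, drop = FALSE]

dim(raw)        # (M, 2 * N)
dim(collapsed)  # (M, N)
collapsed
```

Under this assumption "0|1" and "1|0" both collapse to 1, which is why collapse=FALSE is needed when the phasing order matters.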

Value

Return a list containing the following components:

samples

: character vector;
the sample IDs in the VCF file after subsetting

chr

: character vector;
the CHR column in the VCF file

pos

: character vector;
the POS column in the VCF file

id

: character vector;
the ID column in the VCF file

ref

: character vector;
the REF column in the VCF file

alt

: character vector;
the ALT column in the VCF file

qual

: character vector;
the QUAL column in the VCF file

filter

: character vector;
the FILTER column in the VCF file

info

: character vector;
the INFO column in the VCF file

format

: matrix of either integer or numeric values depending on the tag to extract;
the specific tag in the FORMAT column to be extracted

Details

vcftable uses the C++ API of vcfpp, a wrapper of htslib, to read VCF/BCF files. It therefore has the full functionality of htslib, such as restricting to specific variant types, samples and regions. For memory efficiency, vcftable is designed to parse only one tag at a time in the FORMAT column of the VCF. By default, only the genotype matrix, i.e. the "GT" tag, is returned by vcftable, but many other tags are supported through the format option.
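As a sketch of the sample and region restriction mentioned here, the call below reads the example file shipped with the package twice: once to discover the sample IDs, and once restricted to a subset. It assumes that the file contains at least two samples; the variable names all_samples, keep and sub are only illustrative.

```r
library(vcfppR)

# Example file shipped with vcfppR (the same file used in the Examples section).
vcffile <- system.file("extdata", "raw.gt.vcf.gz", package = "vcfppR")

# Read once to discover the sample IDs stored in the file.
all_samples <- vcftable(vcffile, region = "chr21:1-5050000")$samples

# Keep only the first two samples, using the comma-separated syntax
# of the samples argument (assumes the file has at least two samples).
keep <- paste(head(all_samples, 2), collapse = ",")
sub <- vcftable(vcffile, region = "chr21:1-5050000", samples = keep,
                vartype = "snps")

sub$samples
```

Prefixing the same string with "^" would exclude those samples instead, following the description of the samples argument.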

Author

Zilong Li zilong.dk@gmail.com

Examples

library('vcfppR')
vcffile <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
res <- vcftable(vcffile, "chr21:1-5050000", vartype = "snps")
str(res)
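A further sketch compares the genotype matrix with and without collapsing on the same file and region. It assumes that the extracted tag is the last component of the returned list, following the order given in the Value section, and that the samples are diploid; g_collapsed and g_phased are illustrative names.

```r
library(vcfppR)

vcffile <- system.file("extdata", "raw.gt.vcf.gz", package = "vcfppR")

# Default: genotypes collapsed to one column per sample.
res_collapsed <- vcftable(vcffile, "chr21:1-5050000", vartype = "snps")

# collapse = FALSE keeps the two alleles of each sample in separate columns.
res_phased <- vcftable(vcffile, "chr21:1-5050000", vartype = "snps",
                       collapse = FALSE)

# Assumption: the extracted tag is the last component of the list.
g_collapsed <- res_collapsed[[length(res_collapsed)]]
g_phased    <- res_phased[[length(res_phased)]]

dim(g_collapsed)  # (M, N)
dim(g_phased)     # (M, 2 * N) for diploid samples
```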