The Swiss Army knife for reading VCF/BCF files into R data types rapidly and easily.
Usage
vcftable(
  vcffile,
  region = "",
  samples = "-",
  vartype = "all",
  format = "GT",
  ids = NULL,
  qual = 0,
  pass = FALSE,
  info = TRUE,
  collapse = TRUE,
  setid = FALSE,
  mac = 0
)
Arguments
- vcffile
path to the VCF/BCF file
- region
region to subset in bcftools-like style: "chr1", "chr1:1-10000000"
- samples
samples to subset in bcftools-like style. comma separated list of samples to include (or exclude with "^" prefix). e.g. "id01,id02", "^id01,id02".
- vartype
restrict to a specific type of variants. supports "snps", "indels", "sv", "multisnps", "multiallelics". default "all" applies no restriction.
- format
the FORMAT tag to extract. defaults to "GT".
- ids
character vector. restrict to sites with ID in the given vector. default NULL won't filter any sites.
- qual
numeric. restrict to variants with QUAL > qual.
- pass
logical. restrict to variants with FILTER = "PASS".
- info
logical. whether to keep the INFO column in the returned list. set to FALSE to drop it.
- collapse
logical. acts on the extracted FORMAT tag. If the tag is "GT", the raw genotype matrix for diploid samples has dimensions (M, 2 * N), where M is the number of markers and N is the number of samples. The default TRUE collapses the genotypes of each sample so that the matrix is (M, N). Set this to FALSE to preserve the phasing order, e.g. "1|0" is parsed as c(1, 0) with collapse=FALSE. If the tag is not "GT", collapse=TRUE tries to turn the list of extracted vectors into a matrix. However, this causes problems when a variant is multiallelic and therefore yields more values than the others.
- setid
logical. reset ID column as CHR_POS_REF_ALT.
- mac
integer. restrict to variants with minor allele count higher than the value.
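The effect of collapse and mac can be pictured with a small base R sketch. This is not the vcfppR implementation. It assumes that collapsing sums the two allele columns of each diploid sample into one dosage, and that the minor allele count of a biallelic site is the smaller of the alternate and reference allele counts. The matrix values and the helper collapse_gt are invented for illustration.

```r
# Raw diploid genotypes for M = 3 markers and N = 2 samples, laid out as
# (M, 2 * N): columns 1-2 hold the alleles of sample 1, columns 3-4 of sample 2.
# The values are made up for illustration.
raw <- matrix(
  c(1L, 0L, 0L, 0L,   # marker 1: "1|0" and "0|0"
    1L, 1L, 0L, 1L,   # marker 2: "1|1" and "0|1"
    0L, 0L, 1L, 1L),  # marker 3: "0|0" and "1|1"
  nrow = 3, byrow = TRUE
)

# Hypothetical helper (assumption): add each sample's two allele columns
# to obtain one dosage column per sample, giving an (M, N) matrix.
collapse_gt <- function(raw) {
  first  <- raw[, seq(1, ncol(raw), by = 2), drop = FALSE]
  second <- raw[, seq(2, ncol(raw), by = 2), drop = FALSE]
  first + second
}

collapsed <- collapse_gt(raw)
print(dim(raw))        # (M, 2 * N)
print(dim(collapsed))  # (M, N)

# Minor allele count per biallelic marker (assumed definition):
# the smaller of the alternate and reference allele counts.
alt_count <- rowSums(collapsed)
allele_total <- 2L * ncol(collapsed)
minor_count <- pmin(alt_count, allele_total - alt_count)

# Keep markers whose minor allele count is strictly higher than the threshold.
mac_threshold <- 1
keep <- minor_count > mac_threshold
print(keep)
```

With collapse=FALSE the per-allele columns stay separate, so the phase information in the raw matrix is not lost to the sum.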
Value
Returns a list containing the following components:
- samples
: character vector;
the sample IDs in the VCF file after subsetting
- chr
: character vector;
the CHR column in the VCF file
- pos
: character vector;
the POS column in the VCF file
- id
: character vector;
the ID column in the VCF file
- ref
: character vector;
the REF column in the VCF file
- alt
: character vector;
the ALT column in the VCF file
- qual
: character vector;
the QUAL column in the VCF file
- filter
: character vector;
the FILTER column in the VCF file
- info
: character vector;
the INFO column in the VCF file
- format
: matrix of either integer or numeric values depending on the tag to extract;
the specified tag in the FORMAT column to be extracted
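As a rough sketch of how the site-level components might be used together, the following base R code works on a mock list shaped like the components listed above. The list and all its values are invented stand-ins, not output from vcftable, and the identifier construction only mirrors the CHR_POS_REF_ALT pattern described for setid.

```r
# A mock list with the same site-level components as documented above.
# All values are invented; a real result would come from vcftable().
res <- list(
  chr = c("chr21", "chr21"),
  pos = c("5030082", "5030088"),
  id  = c(".", "."),
  ref = c("G", "C"),
  alt = c("A", "T")
)

# Build CHR_POS_REF_ALT identifiers, the pattern described for setid = TRUE.
new_id <- paste(res$chr, res$pos, res$ref, res$alt, sep = "_")

# Assemble the site columns into a data frame for downstream filtering.
# as.integer() makes positions sortable as numbers.
sites <- data.frame(
  chr = res$chr,
  pos = as.integer(res$pos),
  id  = new_id,
  ref = res$ref,
  alt = res$alt,
  stringsAsFactors = FALSE
)
print(sites)
```

Because every site-level component has one entry per variant, the components line up row by row and can be bound into a single table.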
Details
vcftable uses the C++ API of vcfpp, which is a wrapper of htslib, to read VCF/BCF files.
Thus, it has the full functionality of htslib, such as restricting to specific variant types,
samples and regions. For memory efficiency, vcftable is designed
to parse only one tag at a time in the FORMAT column of the VCF. By default, only the matrix of genotypes,
i.e. the "GT" tag, is returned by vcftable, but many other tags are supported through the format option.
Author
Zilong Li zilong.dk@gmail.com
Examples
library('vcfppR')
vcffile <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
res <- vcftable(vcffile, "chr21:1-5050000", vartype = "snps")
str(res)