Genotypes and sample-aware queries¶
New in varcode 2.3.0 (#267).
When you load a multi-sample VCF, varcode captures each sample's genotype
(GT, AD, DP, GQ, PS) automatically. As of 2.3.0 this data is available
as structured Genotype objects and via sample-aware filtering helpers
on VariantCollection.
The basics¶
from varcode import load_vcf
vc = load_vcf("tumor_normal.vcf", genome="GRCh38")
vc.samples
# ['normal', 'tumor']
vc.has_sample_data()
# True
One variant, one sample¶
variant = vc[0]
gt = vc.genotype(variant, "tumor")
# Genotype(raw_gt='0/1', alleles=(0, 1), phased=False, phase_set=None,
# allele_depths=(10, 5), total_depth=15, genotype_quality=99)
gt.is_called # True
gt.alleles # (0, 1)
gt.phased # False
gt.allele_depths # (10, 5) — (ref_depth, alt_depth)
gt.total_depth # 15
gt.depth_for_alt(1) # 5
Zygosity is relative to this variant's alt¶
from varcode import Zygosity
vc.zygosity(variant, "tumor")
# <Zygosity.HETEROZYGOUS: 'het'>
vc.zygosity(variant, "normal")
# <Zygosity.ABSENT: 'absent'> — normal is ref/ref at this site
Four states, relative to the alt of the Variant you query:
| State | Meaning |
|---|---|
HETEROZYGOUS |
Sample has at least one copy of this alt, and not all copies are this alt |
HOMOZYGOUS |
All called copies are this alt |
ABSENT |
Sample was called, but doesn't carry this alt (ref/ref, or a different alt at a multi-allelic site) |
MISSING |
Sample's call was ./. or the sample wasn't called |
Filtering¶
These return filtered VariantCollections:
# Variants where this sample carries the alt (het or hom).
vc.for_sample("tumor")
# Finer distinctions:
vc.heterozygous_in("tumor")
vc.homozygous_alt_in("tumor")
Typoed sample names fail fast:
vc.for_sample("typo")
# SampleNotFoundError: Sample 'typo' not found. Available samples: ['normal', 'tumor']
Cross-sample queries¶
Tumor/normal, trio, and cohort queries fall out of set operations on the primitives:
# Somatic candidates: in tumor, not in normal.
tumor = set(vc.for_sample("tumor"))
normal = set(vc.for_sample("normal"))
somatic = tumor - normal
# De novo candidates in a trio.
de_novo = (set(vc.for_sample("child"))
- set(vc.for_sample("mom"))
- set(vc.for_sample("dad")))
Multi-allelic sites¶
VCF rows like REF=A ALT=T,G are split into two Variant objects
(one per alt). zygosity is computed relative to the specific alt of
the variant you query, so a sample with GT=1/2 is reported as
heterozygous for both A→T and A→G (one copy of each, with the other
copy being a different alt). A sample with GT=0/1 (one ref, one T)
is heterozygous for A→T and ABSENT for A→G.
This means vc.heterozygous_in("tumor") on a row where tumor is 1/2
yields two entries, one per alt — the correct biological
interpretation.
Beyond zygosity: phased and germline-aware effects¶
The Genotype API gives you the data; downstream effect prediction uses it via separate features that landed later:
- Phased effects of cis variants — when two variants share a
phase set, varcode builds a joint
HaplotypeEffectviaeffects(phase_resolver=...). See #269. - Germline-aware somatic annotation — pass a
GermlineContexttoeffects(germline=...)and somatic variants are classified against the patient's germline-applied transcript. See Germline-aware annotation.
Dedicated joint-analysis helpers like vc.de_novo_in(...) or
vc.somatic(...) aren't shipped — the set-operation pattern above
covers them, and shortcuts get added only when a clear use case
shows up.