Effect annotation
How varcode turns a variant into one or more MutationEffect objects.
How it composes
A single DNA event can produce one or more plausible mutant proteins.
varcode represents each concrete mutant as a MutantTranscript. The
annotator turns a variant into one or more of these; the classifier
turns each MutantTranscript into a typed MutationEffect. When the
DNA alone admits multiple plausible outcomes — splice ambiguity, SV
breakpoint resolution, unphased germline-overlapping codons — the
results are packaged in a MultiOutcomeEffect whose outcomes
property exposes the set. An optional RNA-evidence resolver narrows
the set to observed isoforms or appends observed-only outcomes.
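The annotate-then-classify flow described above can be sketched with toy stand-ins. Everything here (`ToyMutantTranscript`, `annotate`, `classify`, the five-codon table) is hypothetical and only mirrors the shape of the pipeline; it is not varcode's implementation and handles SNVs only.

```python
from dataclasses import dataclass

# Tiny codon table covering only the codons used in this illustration.
CODONS = {"ATG": "M", "GCT": "A", "GCC": "A", "GAT": "D", "AAA": "K", "TAA": "*"}


def translate(cdna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        amino_acid = CODONS[cdna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


@dataclass(frozen=True)
class ToyMutantTranscript:
    """Hypothetical stand-in: one concrete mutant plus its provenance."""
    reference_cdna: str
    mutant_cdna: str
    annotator: str


def annotate(reference_cdna: str, offset: int, ref: str, alt: str) -> ToyMutantTranscript:
    """Annotator step: apply the edit to the reference cDNA."""
    assert reference_cdna[offset:offset + len(ref)] == ref, "ref allele mismatch"
    mutant = reference_cdna[:offset] + alt + reference_cdna[offset + len(ref):]
    return ToyMutantTranscript(reference_cdna, mutant, annotator="toy")


def classify(mt: ToyMutantTranscript) -> str:
    """Classifier step: translate both sequences and compare the proteins."""
    ref_protein = translate(mt.reference_cdna)
    mut_protein = translate(mt.mutant_cdna)
    if mut_protein == ref_protein:
        return "Silent"
    if len(mut_protein) < len(ref_protein):
        return "PrematureStop"
    return "Substitution"


REFERENCE = "ATGGCTGATAAATAA"  # M A D K stop
```

In this toy, `classify(annotate(REFERENCE, 5, "T", "C"))` yields a silent change, while editing offset 9 from `A` to `T` introduces a stop codon.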
The four primitives
| Primitive | What it represents | Module |
|---|---|---|
| MutationEffect (and subclasses) | One deterministic consequence: Substitution, Silent, FrameShift, PrematureStop, ... | varcode.effects.effect_classes |
| MutantTranscript | One concrete mutant protein with edit provenance and the annotator that produced it | varcode.mutant_transcript |
| MultiOutcomeEffect | Possibility set: a sequence of candidate outcomes ordered by a prior (most likely first) | varcode.effects.effect_classes |
| EffectAnnotator | How a variant becomes effects or mutant transcripts | varcode.annotators |
Everything in effect annotation is an implementation or consumer of one of those four.
Basic usage
import varcode
variants = varcode.load_maf("my_variants.maf")
# Simplest path: get an EffectCollection.
effects = variants.effects()
effects.top_priority_effect()
# Filter by transcript consequence.
nonsilent = effects.drop_silent_and_noncoding()
variants.effects() runs the current default annotator
(protein_diff for point variants; see Annotator selection)
on every (variant, transcript) pair and returns
an EffectCollection. Each element is a MutationEffect
subclass: Substitution, Silent, PrematureStop, and so
on.
Splice-disrupting variants: two representations
When a variant sits in the canonical splice window (last 3 exonic bases, first 3–6 intronic, canonical donor/acceptor), varcode recognizes it as splice-disrupting. The effect can be expressed in two ways, at different levels of detail.
Default: lightweight 2-outcome form
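from varcode import Variant
# `transcript` is assumed to be a transcript object overlapping the variant.
variant = Variant("17", 43082575 - 5, "C", "T", "GRCh38")
effect = variant.effect_on_transcript(transcript)
# ExonicSpliceSite(...)
# .alternate_effect -> Substitution(...) # if splicing proceeds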
ExonicSpliceSite carries alternate_effect: the coding
consequence that applies if splicing still works. Exactly
two outcomes, represented as a primary effect + one
alternate field. Cheap. Ships unconditionally.
SpliceDonor, SpliceAcceptor, and IntronicSpliceSite
don't expose alternate_effect today because the variant
is intronic — there's no coding consequence to attach.
Opt-in: full possibility set
effects = variant.effects(splice_outcomes=True)
# SpliceOutcomeSet(...) replaces the splice effect
# .candidates ordered most-plausible-first:
# SpliceCandidate(EXON_SKIPPING, plausibility=0.5,
# coding_effect=Deletion(...))
# SpliceCandidate(INTRON_RETENTION, plausibility=0.3)
# SpliceCandidate(NORMAL_SPLICING, plausibility=0.1,
# coding_effect=Substitution(...))
# SpliceCandidate(CRYPTIC_DONOR, plausibility=0.1)
SpliceOutcomeSet replaces the splice effect with a set of
candidate outcomes, each carrying a plausibility score
(hand-tuned heuristic, not a probability) and — where
computable from cDNA — a concrete coding_effect. The
NORMAL_SPLICING candidate carries the same information as
alternate_effect in the default form.
When you opt in, SpliceDonor / SpliceAcceptor /
IntronicSpliceSite also get wrapped, so every splice-
disrupting variant produces a SpliceOutcomeSet.
Relationship between the two
| # candidates | Class |
|---|---|
| 1 | plain Substitution / Silent / etc. — not wrapped |
| 2 | ExonicSpliceSite with alternate_effect |
| N | SpliceOutcomeSet (opt-in via splice_outcomes=True) |
Both ExonicSpliceSite and SpliceOutcomeSet are MultiOutcomeEffect
subclasses, so consumers iterate .outcomes uniformly without caring
about which form they're holding. alternate_effect works on both:
on ExonicSpliceSite it's the splicing-proceeds outcome directly; on
SpliceOutcomeSet it resolves to the NORMAL_SPLICING candidate's
coding_effect. The element types inside .outcomes differ
(MutationEffect vs SpliceCandidate), but outcome.effect.short_description
is uniform.
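The uniform-iteration contract can be illustrated with toy classes. The names below (`ToyEffect`, `ToyOutcome`, `ToyCandidate`, `ToyExonicSpliceSite`, `ToySpliceOutcomeSet`) are hypothetical stand-ins, and the fallback label for candidates without a coding effect is an assumption of this sketch, not documented varcode behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToyEffect:
    short_description: str


@dataclass(frozen=True)
class ToyOutcome:
    effect: ToyEffect


@dataclass(frozen=True)
class ToyCandidate:
    kind: str
    plausibility: float
    coding_effect: Optional[ToyEffect] = None

    @property
    def effect(self) -> ToyEffect:
        # Assumption: fall back to a label-only effect when no coding
        # consequence is computable.
        if self.coding_effect is not None:
            return self.coding_effect
        return ToyEffect(self.kind.lower())


@dataclass(frozen=True)
class ToyExonicSpliceSite:
    """Two-outcome form: a primary effect plus one alternate."""
    primary: ToyEffect
    alternate_effect: ToyEffect

    @property
    def outcomes(self) -> Tuple[ToyOutcome, ...]:
        return (ToyOutcome(self.primary), ToyOutcome(self.alternate_effect))


@dataclass(frozen=True)
class ToySpliceOutcomeSet:
    """N-outcome form: candidates ordered most-plausible-first."""
    candidates: Tuple[ToyCandidate, ...]

    @property
    def outcomes(self) -> Tuple[ToyCandidate, ...]:
        return tuple(sorted(self.candidates, key=lambda c: -c.plausibility))

    @property
    def alternate_effect(self) -> Optional[ToyEffect]:
        # Resolves to the NORMAL_SPLICING candidate's coding effect.
        for candidate in self.candidates:
            if candidate.kind == "NORMAL_SPLICING":
                return candidate.coding_effect
        return None


def describe(effect) -> List[str]:
    """Consumer code: identical for both forms."""
    return [outcome.effect.short_description for outcome in effect.outcomes]
```

A consumer written against `describe` does not need to know which of the two forms it received; only the element types inside `.outcomes` differ.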
Limitations
The splice classifier is position-based — it fires on the canonical window (last 3 exonic, first 3-6 intronic, donor/acceptor) and nothing else. Sequence-based signals are not flagged: exonic splicing enhancer/silencer disruption mid-exon (~6-10nt SR-protein motifs), branch points (~20-50nt upstream of the acceptor), deep intronic cryptic sites. Detecting these needs ML predictors (SpliceAI, Pangolin, MMSplice, SpliceTransformer) or direct RNA evidence; tracked in #297.
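To make "position-based" concrete, here is a minimal sketch of a donor-side window check. The function name, the signed-distance convention, and the exact mapping of positions to classes (donor as intronic bases 1-2, intronic splice site as bases 3-6) are assumptions made for illustration; the acceptor side is omitted, and varcode's real boundaries may differ.

```python
from typing import Optional


def classify_donor_side(distance: int) -> Optional[str]:
    """Classify a position by its signed distance from an exon/intron boundary.

    Convention assumed here: -1 is the last exonic base, -3 the third from
    last; +1 is the first intronic base. Zero is not a valid distance.
    """
    if distance == 0:
        raise ValueError("distance is signed and never zero")
    if -3 <= distance <= -1:
        return "ExonicSpliceSite"    # last 3 exonic bases
    if 1 <= distance <= 2:
        return "SpliceDonor"         # canonical donor dinucleotide
    if 3 <= distance <= 6:
        return "IntronicSpliceSite"  # intronic bases 3 through 6
    return None                      # outside the window: not flagged
```

A variant 40 bases inside the exon returns `None` here regardless of whether it disrupts an enhancer motif, which is exactly the blind spot described above.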
Annotator selection
Three annotators ship behind the EffectAnnotator protocol:
| Annotator | Algorithm | Used for |
|---|---|---|
| ProteinDiffEffectAnnotator | Builds a MutantTranscript, translates, diffs against the reference protein | Default for SNVs / indels / MNVs |
| FastEffectAnnotator | Offset arithmetic against the reference CDS | Opt-in for byte-for-byte 2.x parity or perf-sensitive paths |
| StructuralVariantAnnotator | Reassembles SV outcomes (deletions, duplications, inversions, fusions, translocations) | Routed automatically when the variant is a StructuralVariant |
All three emit the same MutationEffect hierarchy. protein_diff
catches boundary-codon and frameshift-realignment cases that
offset-arithmetic can miss; for trivial SNVs the two produce
identical output. The SV annotator dispatches on variant.is_structural
and isn't user-selectable for point variants.
# Default (protein_diff for point variants, structural_variant for SVs):
effects = variant.effects()
# Opt into the legacy fast path:
effects = variant.effects(annotator="fast")
# Scoped swap:
with varcode.use_annotator("fast"):
effects = variant_collection.effects()
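For readers curious how a scoped swap of this kind can work, the following is one possible pattern built on `contextvars`. It is a generic sketch, not varcode's implementation; `use_annotator_sketch`, `current_annotator`, and the module-level variable are hypothetical names.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical process-wide default; "protein_diff" mirrors the default
# described in the table above.
_default_annotator: ContextVar[str] = ContextVar(
    "default_annotator", default="protein_diff"
)


def current_annotator() -> str:
    """Return the annotator name that is active in the current context."""
    return _default_annotator.get()


@contextmanager
def use_annotator_sketch(name: str):
    """Temporarily switch the default, restoring it even if the body raises."""
    token = _default_annotator.set(name)
    try:
        yield
    finally:
        _default_annotator.reset(token)
```

Resetting via the token in a `finally` clause is what keeps the swap scoped: the previous default comes back whether the block exits normally or through an exception.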
Third-party annotators (isovar, Exacto) register via the registry.
Any object exposing name / supports / version /
annotate_on_transcript satisfies the protocol.
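A structural check for those four members might look like the sketch below. Only the member names come from the text above; the `AnnotatorLike` protocol, the argument lists, and the return types are assumptions, and the registry call itself is not shown because its signature is not given here.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class AnnotatorLike(Protocol):
    """Hypothetical mirror of the four members named above."""
    name: str
    version: str

    def supports(self, variant: Any) -> bool: ...

    def annotate_on_transcript(self, variant: Any, transcript: Any) -> Any: ...


class NullAnnotator:
    """Example third-party annotator that declines every variant."""
    name = "null_example"
    version = "0.0.1"

    def supports(self, variant: Any) -> bool:
        return False

    def annotate_on_transcript(self, variant: Any, transcript: Any) -> Any:
        raise NotImplementedError("NullAnnotator supports no variants")


class NotAnAnnotator:
    """Has a name but none of the other members."""
    name = "incomplete"
```

Because the check is structural, `NullAnnotator` needs no base class: `isinstance(NullAnnotator(), AnnotatorLike)` holds purely because the four members are present.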
Provenance
Every EffectCollection produced by predict_variant_effects
records:
- annotator: name of the annotator that ran ("fast", "protein_diff", etc.)
- annotator_version: version string
- annotated_at: ISO-8601 UTC timestamp
Fields are preserved through clone_with_new_elements
(so filter / groupby keep them), written to CSV headers
(# annotator=fast, etc.), and recovered by from_csv
verbatim — restored collections remember when they were
originally produced.
A mismatch between the CSV's annotator and the current default
raises a warning on load; wrap from_csv in
use_annotator(<csv's annotator>) if you need the original
annotator's output specifically.
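The header round-trip can be pictured with plain `csv` and `io`. This is a stand-alone sketch of the `# key=value` convention mentioned above, not the code behind `to_csv` / `from_csv`; the two helper names are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple


def write_with_provenance(
    rows: List[Dict[str, str]], fieldnames: List[str], provenance: Dict[str, str]
) -> str:
    """Write '# key=value' comment lines, then an ordinary CSV body."""
    buffer = io.StringIO()
    for key, value in provenance.items():
        buffer.write(f"# {key}={value}\n")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_with_provenance(text: str) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Split comment lines from data lines and recover both verbatim."""
    provenance: Dict[str, str] = {}
    data_lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("# "):
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line[2:].partition("=")
            provenance[key] = value
        else:
            data_lines.append(line)
    return list(csv.DictReader(data_lines)), provenance
```

Reading the timestamp back as an opaque string, rather than regenerating it, is what lets a restored collection remember when it was originally produced.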
Structural variants
StructuralVariant (a Variant subclass) carries SV-specific fields:
sv_type (one of DEL, DUP, INV, INS, CNV, BND), end,
breakend mate fields, confidence intervals, and an open-ended info
dict. Pass parse_structural_variants=True to load_vcf to load
symbolic ALTs (<DEL>, <INS:ME:ALU>, <CN0>, breakends) as
StructuralVariant objects rather than dropping them.
from varcode import load_vcf
vc = load_vcf("manta.vcf", parse_structural_variants=True)
sv_effects = [
e for e in vc.effects()
if e.variant.__class__.__name__ == "StructuralVariant"
]
SV effects (LargeDeletion, LargeDuplication, Inversion,
GeneFusion, TranslocationToIntergenic) are MultiOutcomeEffect
subclasses — e.outcomes exposes the candidate ORFs / cryptic-splice
outcomes the annotator generated, ordered by a per-class prior.
External scorers (RNA evidence, long-read assembly) plug in via
apply_rna_evidence_to_effects to narrow the set or append observed
outcomes; see Germline-aware annotation for the same
composition pattern applied to germline.
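The narrow-or-append behaviour can be sketched on plain labels. This toy function is hypothetical and is not `apply_rna_evidence_to_effects`; it combines narrowing and appending in one pass, and it assumes that an empty evidence list leaves the prior-ordered set untouched.

```python
from typing import List, Sequence


def resolve_with_observed(
    candidates: Sequence[str], observed: Sequence[str]
) -> List[str]:
    """Keep candidates supported by evidence, then append observed-only outcomes.

    `candidates` is assumed to be ordered by prior, most likely first;
    that order is preserved among the survivors.
    """
    if not observed:
        # Assumption: no evidence means no narrowing.
        return list(candidates)
    observed_set = set(observed)
    candidate_set = set(candidates)
    narrowed = [c for c in candidates if c in observed_set]
    observed_only = [o for o in observed if o not in candidate_set]
    return narrowed + observed_only
```

With candidates `exon_skipping`, `intron_retention`, `normal_splicing` and evidence for `normal_splicing` plus a novel cryptic acceptor, the result keeps `normal_splicing` and appends the novel outcome.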
Limitations:
- Mate breakend pairing (joining two BND rows that are halves of one translocation) is deferred. Each BND row produces its own StructuralVariant; consumers can match MATEID themselves.
- parse_structural_variants=False is the default. Without the flag, symbolic ALTs are dropped with a warning that names the flag.
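Until mate pairing lands, a consumer could join breakends by MATEID along these lines. The sketch works on plain dicts with hypothetical `id` and `info` keys and assumes MATEID is a single string; how those values are read off a StructuralVariant is not shown here.

```python
from typing import Dict, List, Tuple


def pair_breakends(records: List[Dict]) -> List[Tuple[str, str]]:
    """Pair BND records whose MATEID fields point at each other.

    Each record is assumed to look like
    {"id": "bnd_a", "info": {"MATEID": "bnd_b"}}.
    Records with a missing or non-reciprocal mate are left unpaired.
    """
    by_id = {record["id"]: record for record in records}
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for record in records:
        record_id = record["id"]
        mate_id = record["info"].get("MATEID")
        if record_id in seen or mate_id not in by_id:
            continue
        # Require the mate to point back, so one-sided links are ignored.
        if by_id[mate_id]["info"].get("MATEID") != record_id:
            continue
        pairs.append((record_id, mate_id))
        seen.update((record_id, mate_id))
    return pairs
```

Requiring a reciprocal link is a conservative choice: a breakend whose mate was filtered out of the VCF stays unpaired instead of being matched to the wrong row.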
Downstream consumers
MutantTranscript is the prediction-boundary type for downstream
neoantigen pipelines: topiary reads mt.mutant_protein_sequence, and
vaxrank consumes the EffectCollection + protein pair to score
neoantigens. RNA-evidence callers (isovar, Exacto) plug in either as
registered annotators or via the RNAEvidenceResolver protocol.
See Germline-aware annotation for the resolver pattern; the same
evidence shape is used across germline, phase, and RNA.