API reference¶
Auto-generated from in-source docstrings via mkdocstrings.
Variants¶
varcode.Variant¶
varcode.Variant(contig, start, ref, alt, genome=None, ensembl=None, allow_extended_nucleotides=False, normalize_contig_names=True, convert_ucsc_contig_names=None)
¶
Bases: Serializable
Construct a Variant object.
| PARAMETER | DESCRIPTION |
|---|---|
contig
|
Chromosome that this variant is on
TYPE:
|
start
|
1-based position on the chromosome of first reference nucleotide
TYPE:
|
ref
|
Reference nucleotide(s)
TYPE:
|
alt
|
Alternate nucleotide(s)
TYPE:
|
genome
|
Name of reference genome, Ensembl release number, or object derived from pyensembl.Genome. Default to latest available release of GRCh38
TYPE:
|
ensembl
|
Previous name used instead of 'genome', the two arguments should be mutually exclusive.
TYPE:
|
allow_extended_nucleotides
|
Extended nucleotides include 'Y' for pyrimidines or 'N' for any base
TYPE:
|
normalize_contig_names
|
By default the contig name will be normalized by converting integers to strings (e.g. 1 -> "1"), and converting any letters after "chr" to uppercase (e.g. "chrx" -> "chrX"). If you don't want this behavior then pass normalize_contig_names=False.
TYPE:
|
convert_ucsc_contig_names
|
Setting this argument to True causes UCSC chromosome names to be converted, such as "chr1" to "1". If the default value (None) is used, conversion is enabled only when a UCSC genome was passed in for the 'genome' argument.
TYPE:
|
Source code in varcode/variant.py
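The contig-name rules described for normalize_contig_names and convert_ucsc_contig_names can be sketched in a few lines. This is an illustration of the documented behaviour only, not varcode's implementation; the function name and the convert_ucsc flag are hypothetical.

```python
def normalize_contig_name(contig, convert_ucsc=False):
    """Hypothetical helper mirroring the documented rules (not varcode code)."""
    name = str(contig)  # integers become strings: 1 -> "1"
    if name.lower().startswith("chr"):
        # Upper-case whatever follows the "chr" prefix: "chrx" -> "chrX"
        name = "chr" + name[3:].upper()
        if convert_ucsc:
            # Drop the UCSC-style prefix: "chr1" -> "1"
            name = name[3:]
    return name


print(normalize_contig_name(1))                          # -> 1
print(normalize_contig_name("chrx"))                     # -> chrX
print(normalize_contig_name("chr1", convert_ucsc=True))  # -> 1
```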
ensembl
property
¶
Deprecated alias for Variant.genome
| RETURNS | DESCRIPTION |
|---|---|
Genome
|
|
trimmed_ref
property
¶
Eventually the field Variant.ref will store the reference nucleotides as given in a VCF or MAF and trimming of any shared prefix/suffix between ref and alt will be done via the properties trimmed_ref and trimmed_alt.
trimmed_alt
property
¶
Eventually the field Variant.ref will store the reference nucleotides as given in a VCF or MAF and trimming of any shared prefix/suffix between ref and alt will be done via the properties trimmed_ref and trimmed_alt.
trimmed_base1_start
property
¶
Currently the field Variant.start carries the base-1 starting position adjusted by trimming any shared prefix between Variant.ref and Variant.alt. Eventually this trimming should be done more explicitly via trimmed_* properties.
trimmed_base1_end
property
¶
Currently the field Variant.end carries the base-1 "last" position of this variant, adjusted by trimming any shared suffix between Variant.ref and Variant.alt. Eventually this trimming should be done more explicitly via trimmed_* properties.
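To make the trimming concrete, here is a minimal sketch of removing a shared prefix and then a shared suffix from ref and alt while advancing the 1-based start. The function name and the prefix-before-suffix order are assumptions for illustration; how varcode positions pure insertions is not covered here.

```python
def trim_shared_affixes(ref, alt, start):
    """Hypothetical helper: returns (trimmed_ref, trimmed_alt, trimmed_start)."""
    # Strip the shared prefix first, advancing the 1-based start position.
    prefix = 0
    while prefix < min(len(ref), len(alt)) and ref[prefix] == alt[prefix]:
        prefix += 1
    ref, alt, start = ref[prefix:], alt[prefix:], start + prefix

    # Then strip whatever suffix the remaining strings still share.
    suffix = 0
    while suffix < min(len(ref), len(alt)) and ref[-1 - suffix] == alt[-1 - suffix]:
        suffix += 1
    if suffix:
        ref, alt = ref[:-suffix], alt[:-suffix]
    return ref, alt, start


print(trim_shared_affixes("TCA", "TGA", 100))  # -> ('C', 'G', 101)
print(trim_shared_affixes("ACGT", "AT", 10))   # -> ('CG', '', 11)
```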
short_description
property
¶
HGVS nomenclature for genomic variants. More info: http://www.hgvs.org/mutnomen/
coding_transcripts
property
¶
Protein coding transcripts
genes
property
¶
Return Gene objects for all genes which overlap this variant.
gene_ids
property
¶
Return IDs of all genes which overlap this variant. Reading
this property is significantly cheaper than reading Variant.genes,
which has to issue many more queries to construct each Gene object.
gene_names
property
¶
Return names of all genes which overlap this variant. Reading
this property is significantly cheaper than reading Variant.genes,
which has to issue many more queries to construct each Gene object.
coding_genes
property
¶
Protein coding genes
is_insertion
property
¶
Does this variant represent the insertion of nucleotides into the reference genome?
is_deletion
property
¶
Does this variant represent the deletion of nucleotides from the reference genome?
is_indel
property
¶
Is this variant either an insertion or deletion?
is_snv
property
¶
Is this variant a single nucleotide variant?
is_transition
property
¶
Is this variant a pyrimidine to pyrimidine change or a purine to purine change?
is_transversion
property
¶
Is this variant a pyrimidine to purine change or vice versa?
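The classification predicates above can be illustrated on plain ref/alt strings. This is a sketch under stated assumptions (alleles already trimmed, insertions and deletions decided by length alone), not the library's code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_insertion(ref, alt):
    return len(ref) < len(alt)  # assumption: compare trimmed allele lengths


def is_deletion(ref, alt):
    return len(ref) > len(alt)


def is_indel(ref, alt):
    return is_insertion(ref, alt) or is_deletion(ref, alt)


def is_snv(ref, alt):
    return len(ref) == 1 and len(alt) == 1 and ref != alt


def is_transition(ref, alt):
    # purine <-> purine or pyrimidine <-> pyrimidine
    return is_snv(ref, alt) and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def is_transversion(ref, alt):
    # purine <-> pyrimidine, in either direction
    return is_snv(ref, alt) and (
        (ref in PURINES and alt in PYRIMIDINES)
        or (ref in PYRIMIDINES and alt in PURINES)
    )


print(is_transition("A", "G"), is_transversion("A", "C"))  # -> True True
```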
__lt__(other)
¶
Variants are ordered by locus.
to_dict()
¶
We want the original (un-normalized) field values while serializing, since normalization will happen in __init__.
Source code in varcode/variant.py
effects(raise_on_error=True, splice_outcomes=False, annotator=None, phase_resolver=None, rna_resolver=None, germline=None)
¶
Predict the variant's effects on overlapping transcripts.
| PARAMETER | DESCRIPTION |
|---|---|
raise_on_error
|
If True, raise on annotation errors; if False, capture them as Failure effects.
TYPE:
|
splice_outcomes
|
If True, splice-disrupting effects are wrapped in a
:class:
TYPE:
|
annotator
|
Per-call annotator override.
TYPE:
|
phase_resolver
|
Optional phase-evidence source (typically an
:class:
TYPE:
|
rna_resolver
|
Optional RNA-observed-outcome source. When provided, any
:class:
TYPE:
|
germline
|
Optional patient-germline context. When non-empty, every
per-transcript effect is computed against the patient's
germline-applied transcript instead of the reference.
Codons / splice signals where germline overlaps the
somatic and phase is unknown produce a
:class:
TYPE:
|
Source code in varcode/variant.py
clone_without_ucsc_data()
¶
Clone this variant, discarding the original format of its genome and contig. Useful if we want to mix hg19 and GRCh37 variants.
| RETURNS | DESCRIPTION |
|---|---|
Variant
|
|
Source code in varcode/variant.py
varcode.VariantCollection¶
varcode.VariantCollection(variants, distinct=True, sort_key=variant_ascending_position_sort_key, sources=None, source_to_metadata_dict={})
¶
Bases: Collection
Construct a VariantCollection from a list of Variant records.
| PARAMETER | DESCRIPTION |
|---|---|
variants
|
Variant objects contained in this VariantCollection
TYPE:
|
distinct
|
Don't keep repeated variants
TYPE:
|
sort_key
|
TYPE:
|
sources
|
Optional set of source names, may be larger than those for which we have metadata dictionaries.
TYPE:
|
source_to_metadata_dict
|
Dictionary mapping each source name (e.g. VCF path) to a dictionary from metadata attributes to values.
TYPE:
|
Source code in varcode/variant_collection.py
metadata
property
¶
The most common usage of a VariantCollection is loading a single VCF, in which case it's annoying to have to always specify that path when accessing metadata fields. This property is meant to both maintain backward compatibility with old versions of Varcode and make the common case easier.
samples
property
¶
Sorted list of sample names present in the collection's
sample_info metadata (empty if no VCFs with sample columns
were loaded).
to_dict()
¶
Since Collection.to_dict() returns a state dictionary with an 'elements' field we have to rename it to 'variants'.
Source code in varcode/variant_collection.py
clone_with_new_elements(new_elements)
¶
Create another VariantCollection of the same class and with same state (including metadata) but possibly different entries.
Warning: metadata is a dictionary keyed by variants. This method leaves that dictionary as-is, which may result in extraneous entries or missing entries.
Source code in varcode/variant_collection.py
effects(raise_on_error=True, splice_outcomes=False, annotator=None, phase_resolver=None, rna_resolver=None, germline=None, validate_reference=True)
¶
| PARAMETER | DESCRIPTION |
|---|---|
raise_on_error
|
If an exception is raised while determining the effect of a variant on a transcript, should it be re-raised? The default is True, meaning errors result in raised exceptions; otherwise they are only logged.
TYPE:
|
splice_outcomes
|
If True, splice-disrupting effects are wrapped in a
:class:
TYPE:
|
annotator
|
Per-call annotator override applied to every variant in
the collection. See :meth:
TYPE:
|
phase_resolver
|
Optional phase-evidence source (e.g.
:class:
TYPE:
|
rna_resolver
|
Optional RNA-observed-outcome source. When provided, any
:class:
TYPE:
|
germline
|
Optional patient-germline context. When non-empty, every
per-transcript effect is computed against the patient's
germline-applied transcript instead of the reference.
See :meth:
TYPE:
|
validate_reference
|
Cross-check that the germline context's reference build
matches this collection's reference build before running
annotation. Hard error on mismatch. Set to False if
you've already lifted over and know the builds agree.
Ignored when
TYPE:
|
Source code in varcode/variant_collection.py
reference_names()
¶
All distinct reference names used by Variants in this collection.
| RETURNS | DESCRIPTION |
|---|---|
set of str
|
|
original_reference_names()
¶
Similar to reference_names but preserves UCSC references,
so that a variant collection derived from an hg19 VCF would
return {"hg19"} instead of {"GRCh37"}.
| RETURNS | DESCRIPTION |
|---|---|
set of str
|
|
Source code in varcode/variant_collection.py
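The difference between the two methods comes down to whether UCSC names are mapped to their Ensembl equivalents. A rough sketch, assuming a hypothetical lookup table and plain strings in place of genome objects:

```python
# Hypothetical lookup table for illustration; hg19 corresponds to GRCh37
# and hg38 to GRCh38.
UCSC_TO_ENSEMBL = {"hg19": "GRCh37", "hg38": "GRCh38"}


def original_reference_names(names_as_loaded):
    # Keep whatever name the source file used.
    return set(names_as_loaded)


def reference_names(names_as_loaded):
    # Collapse UCSC names onto their Ensembl equivalents.
    return {UCSC_TO_ENSEMBL.get(name, name) for name in names_as_loaded}


loaded = ["hg19", "hg19"]
print(original_reference_names(loaded))  # -> {'hg19'}
print(reference_names(loaded))           # -> {'GRCh37'}
```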
groupby_gene_name()
¶
Group variants by the gene names they overlap, which may put each variant in multiple groups.
gene_counts()
¶
Returns number of elements overlapping each gene name. Expects the derived class (VariantCollection or EffectCollection) to have an implementation of groupby_gene_name.
Source code in varcode/variant_collection.py
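Because a variant can overlap more than one gene, grouping by gene name may list the same variant under several keys. The sketch below illustrates that with plain tuples and made-up gene names; the input mapping is a hypothetical stand-in for what the collection looks up itself.

```python
from collections import defaultdict


def groupby_gene_name(variant_to_gene_names):
    """Hypothetical stand-in: maps gene name -> list of overlapping variants."""
    groups = defaultdict(list)
    for variant, gene_names in variant_to_gene_names.items():
        for name in gene_names:
            groups[name].append(variant)
    return dict(groups)


def gene_counts(variant_to_gene_names):
    groups = groupby_gene_name(variant_to_gene_names)
    return {name: len(variants) for name, variants in groups.items()}


# Made-up variants and gene names, for illustration only.
overlaps = {
    ("1", 100, "A", "G"): ["GENE_A"],
    ("1", 150, "C", "T"): ["GENE_A", "GENE_B"],  # overlaps two genes
}
print(gene_counts(overlaps))  # -> {'GENE_A': 2, 'GENE_B': 1}
```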
filter_by_transcript_expression(transcript_expression_dict, min_expression_value=0.0)
¶
Filters variants down to those which overlap a transcript whose expression value in the transcript_expression_dict argument is greater than min_expression_value.
| PARAMETER | DESCRIPTION |
|---|---|
transcript_expression_dict
|
Dictionary mapping Ensembl transcript IDs to expression estimates (either FPKM or TPM)
TYPE:
|
min_expression_value
|
Threshold above which we'll keep a variant in the result collection
TYPE:
|
Source code in varcode/variant_collection.py
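A minimal sketch of the filtering rule: a variant is kept when at least one overlapping transcript has expression strictly greater than the threshold. The input mapping, the transcript IDs, and the choice to treat transcripts missing from the dictionary as 0.0 are assumptions made for this example.

```python
def filter_by_transcript_expression(
        variant_to_transcript_ids,
        transcript_expression_dict,
        min_expression_value=0.0):
    """Hypothetical stand-in operating on plain dictionaries."""
    kept = []
    for variant, transcript_ids in variant_to_transcript_ids.items():
        # Assumption: a transcript absent from the dict counts as 0.0.
        if any(transcript_expression_dict.get(t, 0.0) > min_expression_value
               for t in transcript_ids):
            kept.append(variant)
    return kept


# Made-up transcript IDs and expression values.
overlaps = {
    ("1", 100, "A", "G"): ["ENST_X"],
    ("2", 200, "C", "T"): ["ENST_Y", "ENST_Z"],
}
expression = {"ENST_X": 0.0, "ENST_Y": 0.5, "ENST_Z": 12.0}
print(filter_by_transcript_expression(overlaps, expression, 1.0))
# -> [('2', 200, 'C', 'T')]
```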
filter_by_gene_expression(gene_expression_dict, min_expression_value=0.0)
¶
Filters variants down to those which overlap a gene whose expression value in the gene_expression_dict argument is greater than min_expression_value.
| PARAMETER | DESCRIPTION |
|---|---|
gene_expression_dict
|
Dictionary mapping Ensembl gene IDs to expression estimates (either FPKM or TPM)
TYPE:
|
min_expression_value
|
Threshold above which we'll keep a variant in the result collection
TYPE:
|
Source code in varcode/variant_collection.py
exactly_equal(other)
¶
Comparison between VariantCollection instances that takes into account the info field of Variant instances.
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if the variants in this collection equal the variants in the other collection. The Variant.info fields are included in the comparison. |
Source code in varcode/variant_collection.py
union(*others, **kwargs)
¶
Returns the union of variants in several VariantCollection objects.
Source code in varcode/variant_collection.py
intersection(*others, **kwargs)
¶
Returns the intersection of variants in several VariantCollection objects.
Source code in varcode/variant_collection.py
difference(*others, **kwargs)
¶
Returns variants present in this collection but not in any of the others.
Source code in varcode/variant_collection.py
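The three operations follow ordinary set semantics. As an analogy only, using Python sets of (contig, start, ref, alt) tuples rather than VariantCollection objects, and ignoring whatever options the **kwargs accept:

```python
# Made-up variants represented as plain tuples.
a = {("1", 100, "A", "G"), ("2", 200, "C", "T")}
b = {("2", 200, "C", "T"), ("3", 300, "G", "A")}
c = {("2", 200, "C", "T")}

union = set().union(a, b, c)      # present in any collection
intersection = a.intersection(b, c)  # present in every collection
difference = a.difference(b, c)   # present in a but in none of the others

print(sorted(union))
print(intersection)  # -> {('2', 200, 'C', 'T')}
print(difference)    # -> {('1', 100, 'A', 'G')}
```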
to_dataframe()
¶
Build a DataFrame from this variant collection.
Source code in varcode/variant_collection.py
to_csv(path, include_header=True)
¶
Write this collection to CSV.
| PARAMETER | DESCRIPTION |
|---|---|
path
|
Output path.
TYPE:
|
include_header
|
If True (default), prepend
TYPE:
|
Source code in varcode/variant_collection.py
from_csv(path, genome=None, distinct=True, sort_key=variant_ascending_position_sort_key)
classmethod
¶
Rebuild a VariantCollection from a CSV previously written by
VariantCollection.to_csv().
The CSV round-trip is human-readable and easy to inspect. For
byte-for-byte round-trip or for faster loading of large
collections (≳10k variants), prefer from_json — CSV parsing
plus per-row Variant construction is significantly slower.
| PARAMETER | DESCRIPTION |
|---|---|
path
|
Path to the CSV file. Lines starting with '#' are treated as
comments and parsed as
TYPE:
|
genome
|
Reference genome to associate with the loaded variants. If
TYPE:
|
distinct
|
Drop duplicate variants (same as the constructor).
TYPE:
|
sort_key
|
Sort key for the resulting collection.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
VariantCollection
|
|
Source code in varcode/variant_collection.py
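The comment-handling rule for path can be pictured with the standard library alone. In this sketch the column names and the comment text are hypothetical, and the comment lines are only collected, not interpreted.

```python
import csv


def split_comments_and_rows(text):
    """Hypothetical helper: separate '#' comment lines from CSV data rows."""
    comments, data_lines = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            comments.append(line[1:].strip())
        elif line.strip():
            data_lines.append(line)
    # DictReader takes the first remaining line as the column header.
    rows = list(csv.DictReader(data_lines))
    return comments, rows


# Made-up file contents; real column names may differ.
text = "# hypothetical comment\ncontig,start,ref,alt\n7,117559590,A,G\n"
comments, rows = split_comments_and_rows(text)
print(comments)         # -> ['hypothetical comment']
print(rows[0]["start"])  # -> 117559590
```

Note that every field comes back as a string, so positions need an explicit int() before a variant could be rebuilt from a row.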
has_sample_data()
¶
genotype(variant, sample)
¶
Return the Genotype for sample at variant.
| PARAMETER | DESCRIPTION |
|---|---|
variant
|
TYPE:
|
sample
|
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Genotype or None
|
|
| RAISES | DESCRIPTION |
|---|---|
SampleNotFoundError
|
If the variant's metadata exists but doesn't include the
requested sample. Subclass of |
Source code in varcode/variant_collection.py
zygosity(variant, sample)
¶
Zygosity of the given sample at the given variant.
Multi-allelic aware: at a site split into multiple Variants, each asks "does this sample carry this alt?".
Source code in varcode/variant_collection.py
for_sample(sample)
¶
Return a VariantCollection restricted to variants where
sample carries the alt (heterozygous or homozygous). Useful
for multi-sample VCFs where not every row is called in every
sample.
Source code in varcode/variant_collection.py
heterozygous_in(sample)
¶
Variants where sample is heterozygous for this variant's alt.
homozygous_alt_in(sample)
¶
Variants where sample is homozygous for this variant's alt.
varcode.StructuralVariant¶
varcode.StructuralVariant(contig: str, start: int, sv_type: str, end: Optional[int] = None, alt: Optional[str] = None, ref: str = 'N', mate_contig: Optional[str] = None, mate_start: Optional[int] = None, mate_orientation: Optional[str] = None, ci_start: Optional[Tuple[int, int]] = None, ci_end: Optional[Tuple[int, int]] = None, alt_assembly: Optional[str] = None, info: Optional[Mapping[str, Any]] = None, genome=None, ensembl=None, normalize_contig_names: bool = True, convert_ucsc_contig_names=None)
¶
Bases: Variant
A structural variant — deletion, duplication, inversion, insertion, CNV, or breakend — too large or too complex to represent as a simple ref/alt nucleotide pair.
Subclasses :class:Variant so isinstance(v, Variant) still
works; downstream code that handles variant kinds generically
(effect collections, serialization) sees a :class:Variant and
the shared contract still applies. The SV-specific fields
(:attr:sv_type, :attr:end, breakend mate fields) live here
and are consulted by SV-aware code.
The SV position model:
- :attr:start is the 1-based start of the affected region (matches VCF POS).
- :attr:end is the 1-based inclusive end. For a DEL/DUP/INV/CNV this is the SV endpoint on the same contig. For an INS it equals start (insertions are zero-width in reference coords). For a BND, end == start and the other breakpoint lives in :attr:mate_contig / :attr:mate_start.
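The position model can be expressed as a small consistency check. This is a sketch under assumptions: the function name is hypothetical, rejecting end < start is a choice made here, and reporting 0 spanned bases for a BND is an interpretation rather than documented behaviour.

```python
INTERVAL_TYPES = {"DEL", "DUP", "INV", "CNV"}
POINT_TYPES = {"INS", "BND"}


def reference_bases_spanned(sv_type, start, end):
    """Hypothetical check of the documented start/end conventions."""
    if start < 1:
        raise ValueError("start is 1-based and must be >= 1")
    if sv_type in POINT_TYPES:
        if end != start:
            raise ValueError(f"{sv_type} expects end == start")
        return 0  # zero-width in reference coordinates (assumed for BND)
    if sv_type in INTERVAL_TYPES:
        if end < start:
            raise ValueError("end must not precede start")
        return end - start + 1  # both ends inclusive
    raise ValueError(f"unknown sv_type: {sv_type!r}")


print(reference_bases_spanned("DEL", 100, 200))  # -> 101
print(reference_bases_spanned("INS", 100, 100))  # -> 0
```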
| PARAMETER | DESCRIPTION |
|---|---|
contig
|
Chromosome of the (first) breakpoint.
TYPE:
|
start
|
1-based start position.
TYPE:
|
sv_type
|
One of :data:
TYPE:
|
end
|
1-based inclusive end position. Defaults to
TYPE:
|
alt
|
Original ALT field from the VCF —
TYPE:
|
ref
|
Original REF base (usually one nucleotide, the anchor).
Defaults to
TYPE:
|
mate_contig
|
For BND: the mate breakpoint's chromosome.
TYPE:
|
mate_start
|
For BND: the mate breakpoint's position.
TYPE:
|
mate_orientation
|
For BND: one of
TYPE:
|
ci_start
|
Confidence interval around
TYPE:
|
ci_end
|
Confidence interval around
TYPE:
|
alt_assembly
|
Caller-supplied assembled sequence of the rearranged allele. Hook for long-read / targeted-assembly pipelines. The SV annotator can prefer this over inferring from breakpoints.
TYPE:
|
info
|
Open-ended bag for extra VCF INFO fields the core class doesn't model (HOMLEN, SVMETHOD, MATEID, etc.). Kept as a Mapping so callers can pass whatever shape their caller produces.
TYPE:
|
genome
|
Same meaning as :class:
DEFAULT:
|
ensembl
|
Same meaning as :class:
DEFAULT:
|
normalize_contig_names
|
Same meaning as :class:
DEFAULT:
|
convert_ucsc_contig_names
|
Same meaning as :class:
DEFAULT:
|
Source code in varcode/structural_variant.py
varcode.parse_symbolic_alt¶
varcode.parse_symbolic_alt(contig: str, start: int, ref: str, alt: str, info=None, genome=None) -> Optional[StructuralVariant]
¶
Parse a single symbolic or breakend ALT into a
:class:StructuralVariant. Returns None if the ALT is not
symbolic (the caller keeps handling it as a simple variant).
info is an optional mapping (e.g. a pyvcf INFO dict) that
may carry END, SVTYPE, CIPOS, CIEND, MATEID,
etc. The parser reads those when present but doesn't require
them — the ALT shape alone is enough to distinguish symbolic
from breakend from inline.
Source code in varcode/sv_allele_parser.py
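The claim that the ALT shape alone separates the three cases can be illustrated with the VCF conventions: symbolic alleles are wrapped in angle brackets and breakend alleles contain square brackets. The classifier below is a simplified, hypothetical sketch; it is not the parser in varcode/sv_allele_parser.py and ignores edge cases such as single breakends.

```python
def classify_alt(alt):
    """Hypothetical classifier: 'symbolic', 'breakend', or 'inline'."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"   # e.g. <DEL>, <DUP>
    if "[" in alt or "]" in alt:
        return "breakend"   # e.g. G]17:198982]
    return "inline"         # plain nucleotides, handled as a simple variant


print(classify_alt("<DEL>"))         # -> symbolic
print(classify_alt("G]17:198982]"))  # -> breakend
print(classify_alt("ACT"))           # -> inline
```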
varcode.SV_TYPES¶
varcode.SV_TYPES = frozenset({'DEL', 'DUP', 'INV', 'INS', 'CNV', 'BND'})
module-attribute
¶
Genotypes¶
varcode.Genotype¶
varcode.Genotype(raw_gt: str, alleles: Tuple[Optional[int], ...], phased: bool = False, phase_set: Optional[int] = None, allele_depths: Optional[Tuple[int, ...]] = None, total_depth: Optional[int] = None, genotype_quality: Optional[int] = None)
dataclass
¶
Bases: DataclassSerializable
One sample's genotype at one variant locus.
The alleles tuple encodes the observed alleles using VCF GT
semantics: 0 is the reference allele, 1 is the first ALT
listed on the VCF row, 2 is the second, and so on. None
indicates a no-call on that haplotype.
For varcode's variant-level API, note that Variant.alt is a
specific alt (a multi-allelic VCF row is split into one Variant
per alt). When querying zygosity relative to a Variant, use the
variant's alt_allele_index from the collection's metadata and
add 1 to get the GT-encoded index, then call
:meth:zygosity_for_alt or :meth:carries_alt.
is_called: bool
property
¶
True if at least one allele is non-None.
ploidy: int
property
¶
Number of alleles in the call (including missing).
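As a rough illustration of the GT encoding behind alleles, is_called, and ploidy, the sketch below parses a raw GT string by hand. The helper names are hypothetical and the real class is built via from_sample_info instead.

```python
import re


def parse_gt(raw_gt):
    """Hypothetical parser: returns (alleles, phased) for a VCF GT string."""
    phased = "|" in raw_gt  # '|' separates phased alleles, '/' unphased
    alleles = tuple(
        None if token == "." else int(token)  # '.' is a no-call
        for token in re.split(r"[/|]", raw_gt)
    )
    return alleles, phased


def is_called(alleles):
    return any(a is not None for a in alleles)


def ploidy(alleles):
    return len(alleles)  # missing alleles still count


print(parse_gt("0|1"))  # -> ((0, 1), True)
print(parse_gt("./."))  # -> ((None, None), False)
```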
from_sample_info(sample_info)
classmethod
¶
Build a Genotype from pyvcf's call.data._asdict() output.
Handles the keys varcode normally sees: GT, AD, DP,
GQ, PS. Missing keys default to None.
Source code in varcode/genotype.py
carries_alt(alt_index: int) -> bool
¶
True if this sample's genotype contains the given alt.
alt_index uses VCF GT encoding: 1 is the first alt on
the row, 2 is the second, etc. (i.e. one more than
alt_allele_index from the VariantCollection metadata).
Source code in varcode/genotype.py
copies_of_alt(alt_index: int) -> int
¶
zygosity_for_alt(alt_index: int) -> Zygosity
¶
Classify the sample's zygosity relative to one alt allele.
Multi-allelic aware: GT=1/2 queried for alt 1 returns
HETEROZYGOUS (one copy of this alt, one of a different
alt); queried for alt 3 it returns ABSENT.
Source code in varcode/genotype.py
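The multi-allelic behaviour described for carries_alt, copies_of_alt, and zygosity_for_alt can be sketched on a bare alleles tuple. The enum below is a local stand-in: the HOMOZYGOUS member name, the enum values, and the handling of partially missing calls are assumptions.

```python
from enum import Enum


class Zygosity(Enum):  # local stand-in, not varcode.Zygosity
    ABSENT = "absent"
    HETEROZYGOUS = "heterozygous"
    HOMOZYGOUS = "homozygous"  # member name assumed
    MISSING = "missing"


def copies_of_alt(alleles, alt_index):
    return sum(1 for a in alleles if a == alt_index)


def zygosity_for_alt(alleles, alt_index):
    if all(a is None for a in alleles):
        return Zygosity.MISSING
    copies = copies_of_alt(alleles, alt_index)
    if copies == 0:
        return Zygosity.ABSENT
    if copies == len(alleles):
        return Zygosity.HOMOZYGOUS
    return Zygosity.HETEROZYGOUS


alt_allele_index = 0                  # 0-based, as in the collection metadata
gt_index = alt_allele_index + 1       # GT encoding is 1-based for alts
print(zygosity_for_alt((1, 2), gt_index))  # -> Zygosity.HETEROZYGOUS
print(zygosity_for_alt((1, 2), 3))         # -> Zygosity.ABSENT
```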
depth_for_alt(alt_index: int) -> Optional[int]
¶
Per-allele read depth for a given alt, from the AD field.
AD is indexed with ref at position 0 and alt #1 at
position 1, etc., so alt_index should use GT encoding
(1 = first alt).
Source code in varcode/genotype.py
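Since AD places the reference depth at position 0, a GT-encoded alt index can be used directly as the lookup position. A small sketch, with a hypothetical function that returns None when the field is absent or too short:

```python
def depth_for_alt(allele_depths, alt_index):
    """Hypothetical lookup into an AD tuple (ref at 0, first alt at 1)."""
    if allele_depths is None or not 0 <= alt_index < len(allele_depths):
        return None
    return allele_depths[alt_index]


ad = (12, 7, 3)  # made-up depths: ref=12, alt #1=7, alt #2=3
print(depth_for_alt(ad, 1))  # -> 7
print(depth_for_alt(ad, 3))  # -> None
```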
varcode.Zygosity¶
varcode.Zygosity
¶
Bases: Enum
Zygosity of a sample's genotype relative to a specific alt allele.
ABSENT is distinct from MISSING: ABSENT means the call
exists but doesn't include the alt in question (e.g. the sample
is ref-ref, or carries a different alt at a multi-allelic
site). MISSING means the call itself is ./. or the sample
wasn't called.
Effects¶
varcode.MutationEffect¶
varcode.MutationEffect(variant)
¶
Bases: Serializable
Base class for mutation effects.
Source code in varcode/effects/effect_classes.py
short_description
property
¶
A short but human-readable description of the effect. Defaults to class name for most of the non-coding effects, but is more informative for coding ones.
original_protein_sequence
property
¶
Amino acid sequence of a coding transcript (without the nucleotide variant/mutation)
__lt__(other)
¶
Effects are ordered by their associated variants, which have comparison implemented in terms of their chromosomal locations.
varcode.NonsilentCodingMutation¶
varcode.NonsilentCodingMutation(variant, transcript, aa_mutation_start_offset, aa_mutation_end_offset, aa_ref)
¶
Bases: CodingMutation
All coding mutations other than silent codon substitutions
variant : Variant
transcript : Transcript
aa_mutation_start_offset : int Offset of first modified amino acid in protein (starting from 0)
aa_mutation_end_offset : int Offset after last mutated amino acid (half-open coordinates)
aa_ref : str Amino acid string of what used to be at aa_mutation_start_offset in the wildtype (unmutated) protein.
Source code in varcode/effects/effect_classes.py
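The half-open, 0-based offsets behave like Python slice bounds. The sketch below uses a made-up protein sequence and assumes aa_ref is the wildtype slice between the two offsets.

```python
# Made-up wildtype protein; offsets are 0-based and half-open.
wildtype_protein = "MKTAYIAK"
aa_mutation_start_offset = 3
aa_mutation_end_offset = 5

# Residues at offsets 3 and 4 are affected; offset 5 is excluded.
aa_ref = wildtype_protein[aa_mutation_start_offset:aa_mutation_end_offset]
print(aa_ref)  # -> AY

# Replacing that span with a hypothetical alternate residue string.
aa_alt = "W"
mutant_protein = (
    wildtype_protein[:aa_mutation_start_offset]
    + aa_alt
    + wildtype_protein[aa_mutation_end_offset:]
)
print(mutant_protein)  # -> MKTWIAK
```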
varcode.MultiOutcomeEffect¶
varcode.MultiOutcomeEffect(variant)
¶
Bases: MutationEffect
Marker base class for effects that represent a set of plausible outcomes rather than a single deterministic effect.
Subclasses must expose:
- candidates: sequence of :class:MutationEffect instances, sorted most-plausible-first. (Kept for back-compat with 2.x callers.)
- most_likely: the top candidate (i.e. candidates[0]).
- priority_class: effect class whose priority this set adopts (read by :func:varcode.effects.effect_priority).
Harmonized interface (#299): new code should read
:attr:outcomes instead of candidates. Each entry is an
:class:~varcode.outcomes.Outcome carrying the effect plus
provenance (probability, source, evidence dict). The default
implementation wraps candidates with source="varcode"
and no probability — external scorers (SpliceAI, Pangolin),
RNA-evidence callers (Isovar), and long-read assembly tools
override to attach their own scores without subclassing.
Downstream consumers filter for multi-outcome results with
isinstance(effect, MultiOutcomeEffect), so new wrappers (RNA
evidence #259, germline-aware #268, SV-at-breakpoint) implement
the same protocol without downstream code churn.
Source code in varcode/effects/effect_classes.py
outcomes
property
¶
Tuple of :class:~varcode.outcomes.Outcome objects,
most-plausible-first. Default implementation wraps
:attr:candidates under source="varcode"; subclasses
(or external integrations) override to attach probabilities
and evidence.
External integrations (RNA evidence, SpliceAI scoring, etc.)
attach extra outcomes via :meth:_with_extra_outcomes —
subclasses overriding this property must call that helper on
their derived tuple so the plug-in path remains uniform.
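The protocol can be pictured with stand-in classes. Everything below is hypothetical scaffolding: Outcome here is a simplified local dataclass, CandidateSet does not subclass anything from varcode, and deriving priority_class from the top candidate's type is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Outcome:  # simplified local stand-in for varcode.Outcome
    effect: Any
    probability: Optional[float] = None
    source: str = "varcode"
    evidence: Mapping[str, Any] = field(default_factory=dict)


class CandidateSet:  # hypothetical wrapper exposing the described attributes
    def __init__(self, candidates):
        self.candidates = tuple(candidates)  # most-plausible-first

    @property
    def most_likely(self):
        return self.candidates[0]

    @property
    def priority_class(self):
        return type(self.most_likely)  # assumption: adopt the top candidate's class

    @property
    def outcomes(self):
        # Default wrapping: source="varcode", no probability attached.
        return tuple(Outcome(effect=c) for c in self.candidates)


effects = CandidateSet(["exon skipping", "intron retention"])  # strings as placeholders
print(effects.most_likely)         # -> exon skipping
print(effects.outcomes[0].source)  # -> varcode
```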
varcode.EffectCollection¶
varcode.EffectCollection(effects, distinct=False, sort_key=None, sources=set([]), annotator=None, annotator_version=None, annotated_at=None)
¶
Bases: Collection
Collection of MutationEffect objects and helpers for grouping or filtering them.
| PARAMETER | DESCRIPTION |
|---|---|
effects
|
Collection of any class which is compatible with the sort key
TYPE:
|
distinct
|
Only keep distinct entries or allow duplicates.
TYPE:
|
sort_key
|
Function which maps each element to a sorting criterion.
If None (the default), effects are sorted by priority with
the most severe effects first. Pass an explicit sort_key to
override this behaviour, or
TYPE:
|
sources
|
Set of files from which this collection was generated.
TYPE:
|
annotator
|
Name of the :class:
TYPE:
|
annotator_version
|
Version string of the annotator (typically the varcode
version for built-in annotators).
TYPE:
|
annotated_at
|
ISO-8601 UTC timestamp recording when the annotation ran.
Populated by :func:
TYPE:
|
Source code in varcode/effects/effect_collection.py
gene_counts()
¶
Returns number of elements overlapping each gene name. Expects the derived class (VariantCollection or EffectCollection) to have an implementation of groupby_gene_name.
Source code in varcode/effects/effect_collection.py
filter_by_transcript_expression(transcript_expression_dict, min_expression_value=0.0)
¶
Filters effects to those which have an associated transcript whose expression value in the transcript_expression_dict argument is greater than min_expression_value.
| PARAMETER | DESCRIPTION |
|---|---|
transcript_expression_dict
|
Dictionary mapping Ensembl transcript IDs to expression estimates (either FPKM or TPM)
TYPE:
|
min_expression_value
|
Threshold above which we'll keep an effect in the result collection
TYPE:
|
Source code in varcode/effects/effect_collection.py
filter_by_gene_expression(gene_expression_dict, min_expression_value=0.0)
¶
Filters effects to those which have an associated gene whose expression value in the gene_expression_dict argument is greater than min_expression_value.
| PARAMETER | DESCRIPTION |
|---|---|
gene_expression_dict
|
Dictionary mapping Ensembl gene IDs to expression estimates (either FPKM or TPM)
TYPE:
|
min_expression_value
|
Threshold above which we'll keep an effect in the result collection
TYPE:
|
Source code in varcode/effects/effect_collection.py
filter_by_effect_priority(min_priority_class)
¶
Create a new EffectCollection containing only effects whose priority is at least that of the given class.
Source code in varcode/effects/effect_collection.py
drop_silent_and_noncoding()
¶
Create a new EffectCollection containing only non-silent coding effects
detailed_string()
¶
Create a long string with all transcript effects for each mutation, grouped by gene (if a mutation affects multiple genes).
Source code in varcode/effects/effect_collection.py
top_priority_effect()
¶
Highest priority MutationEffect of all genes/transcripts overlapped by this variant. If this variant doesn't overlap anything, then this method will return an Intergenic effect.
If multiple effects have the same priority, then return the one which is associated with the longest transcript.
Source code in varcode/effects/effect_collection.py
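The tie-breaking rule amounts to a lexicographic comparison on (priority, transcript length). A sketch with a hypothetical record type and made-up numbers; the real selection applies additional heuristics.

```python
from collections import namedtuple

# Hypothetical record; real effects carry far more state.
Effect = namedtuple("Effect", ["name", "priority", "transcript_length"])


def top_priority_effect(effects):
    # Higher priority wins; among equals, the longer transcript wins.
    return max(effects, key=lambda e: (e.priority, e.transcript_length))


effects = [
    Effect("silent on T1", priority=1, transcript_length=5000),
    Effect("substitution on T2", priority=2, transcript_length=1200),
    Effect("substitution on T3", priority=2, transcript_length=3400),
]
print(top_priority_effect(effects).name)  # -> substitution on T3
```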
top_priority_effect_per_variant()
¶
Highest priority effect for each unique variant
Source code in varcode/effects/effect_collection.py
top_priority_effect_per_transcript_id()
¶
Highest priority effect for each unique transcript ID
Source code in varcode/effects/effect_collection.py
top_priority_effect_per_gene_id()
¶
Highest priority effect for each unique gene ID
Source code in varcode/effects/effect_collection.py
effect_expression(expression_levels)
¶
| PARAMETER | DESCRIPTION |
|---|---|
expression_levels
|
Dictionary mapping transcript IDs to length-normalized expression levels (either FPKM or TPM)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
OrderedDict
|
Mapping from each transcript effect to an expression quantity. Effects that don't have an associated transcript (e.g. Intergenic) are excluded. |
Source code in varcode/effects/effect_collection.py
top_expression_effect(expression_levels)
¶
Return effect whose transcript has the highest expression level. If none of the effects are expressed or have associated transcripts, then return None. In case of ties, add lexicographical sorting by effect priority and transcript length.
Source code in varcode/effects/effect_collection.py
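A sketch of the selection described above, assuming that "expressed" means an expression level strictly greater than zero and using a hypothetical record type with made-up transcript IDs:

```python
from collections import namedtuple

# Hypothetical record; transcript_id is None for effects without a transcript.
Effect = namedtuple("Effect", ["transcript_id", "priority", "transcript_length"])


def top_expression_effect(effects, expression_levels):
    expressed = [
        e for e in effects
        if e.transcript_id is not None
        and expression_levels.get(e.transcript_id, 0.0) > 0.0  # assumption
    ]
    if not expressed:
        return None
    # Ties on expression fall back to priority, then transcript length.
    return max(
        expressed,
        key=lambda e: (expression_levels[e.transcript_id],
                       e.priority, e.transcript_length),
    )


effects = [
    Effect("T1", priority=2, transcript_length=1000),
    Effect("T2", priority=3, transcript_length=900),
    Effect(None, priority=0, transcript_length=0),
]
levels = {"T1": 8.0, "T2": 8.0}
print(top_expression_effect(effects, levels).transcript_id)  # -> T2
print(top_expression_effect(effects, {}))                    # -> None
```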
to_dataframe()
¶
Build a dataframe from the effect collection.
Source code in varcode/effects/effect_collection.py
to_csv(path, include_header=True)
¶
Write this collection to CSV.
| PARAMETER | DESCRIPTION |
|---|---|
path
|
Output path.
TYPE:
|
include_header
|
If True (default), prepend
TYPE:
|
Source code in varcode/effects/effect_collection.py
from_csv(path, genome=None)
classmethod
¶
Rebuild an EffectCollection from a CSV previously written by
EffectCollection.to_csv().
The current CSV format records (contig, start, ref, alt, transcript_id) but not enough per-effect state to reconstruct effects byte-for-byte. This method takes the pragmatic semantic round-trip path: rebuild each Variant, re-annotate against the recorded transcript, and emit the resulting effect. The resulting collection should match the original whenever annotation is deterministic for a given (variant, transcript) pair.
Prefer from_json for byte-for-byte round-trip or for
larger collections (≳10k effects); per-row re-annotation makes
CSV loading significantly slower than JSON. Emits a warning
when the CSV header reports a different major varcode version
than the one currently installed — annotation logic can change
across major versions and the reconstructed effects may differ
from the ones that were written.
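The version check described above compares only the major component. A minimal sketch, with a hypothetical function name, made-up version strings, and a plain dotted-version split rather than whatever parsing varcode performs:

```python
import warnings


def warn_on_major_version_mismatch(written_version, installed_version):
    """Hypothetical helper: warn and return True when major versions differ."""
    written_major = written_version.split(".")[0]
    installed_major = installed_version.split(".")[0]
    if written_major != installed_major:
        warnings.warn(
            f"CSV was written by version {written_version} but version "
            f"{installed_version} is installed; re-annotated effects may differ."
        )
        return True
    return False


print(warn_on_major_version_mismatch("2.1.0", "2.4.3"))  # -> False
print(warn_on_major_version_mismatch("2.1.0", "3.0.0"))  # -> True (also warns)
```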
| PARAMETER | DESCRIPTION |
|---|---|
path
|
Path to the CSV file. Lines starting with '#' are treated as
comments and parsed as
TYPE:
|
genome
|
Reference genome to associate with the loaded variants and
to look up transcripts by ID. If
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
EffectCollection
|
|
Source code in varcode/effects/effect_collection.py
varcode.Outcome¶
varcode.Outcome(effect: Any, probability: Optional[float] = None, source: str = 'varcode', evidence: Mapping[str, Any] = dict(), description: Optional[str] = None)
dataclass
¶
Bases: DataclassSerializable
One plausible consequence of a variant.
| PARAMETER | DESCRIPTION |
|---|---|
effect
|
The effect this outcome represents. Guaranteed to be a
:class:
TYPE:
|
probability
|
Estimated likelihood this outcome actually happens, in
TYPE:
|
source
|
Name of the tool or annotator that produced this outcome.
Defaults to
TYPE:
|
evidence
|
Open-ended provenance dict. Shape is source-specific; the
convention is that keys match the source's native field names
(e.g. SpliceAI scores under
TYPE:
|
description
|
Optional human-readable sentence describing this specific
outcome ("Exon 7 is skipped (in-frame, 15 aa removed)").
Distinct from
TYPE:
|
short_description: str
property
¶
Convenience passthrough to the wrapped effect's
short_description. Lets callers build tables of outcomes
without unpacking outcome.effect.short_description
everywhere.
Priority ordering¶
varcode.effect_priority(effect)
¶
Returns the integer priority for a given transcript effect.
Effects may opt out of class-based priority lookup by exposing a
priority_class attribute — used by wrapper classes like
:class:varcode.splice_outcomes.SpliceOutcomeSet to delegate to
the wrapped effect's class.
Source code in varcode/effects/effect_ordering.py
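As a rough illustration of the priority_class opt-out, the sketch below resolves the lookup class from the attribute when it is present and from the object's own type otherwise. The class names, the priority numbers, and the "higher means more severe" convention are all assumptions made for the example, not varcode's actual table.

```python
class Silent:
    """Hypothetical effect class."""


class Missense:
    """Hypothetical effect class."""


class WrappedSet:
    """Hypothetical wrapper that delegates priority to the wrapped effect."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        # Expose the wrapped effect's class for priority lookup.
        self.priority_class = type(wrapped)


# Assumed priority values; higher is treated as more severe here.
SKETCH_PRIORITY = {Silent: 1, Missense: 2}


def sketch_effect_priority(effect) -> int:
    # Prefer the opt-out attribute, fall back to the object's own class.
    lookup_class = getattr(effect, "priority_class", type(effect))
    return SKETCH_PRIORITY[lookup_class]
```

A wrapper around a Missense then sorts exactly like a bare Missense, without the wrapper class needing its own entry in the table.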
varcode.top_priority_effect(effects)
¶
Given a collection of variant transcript effects, return the top priority object. ExonicSpliceSite variants require special treatment since they actually represent two effects -- the splicing modification and whatever else would happen to the exonic sequence if nothing else gets changed. In cases where multiple transcripts give rise to multiple effects, use a variety of filtering and sorting heuristics to pick the canonical transcript.
Source code in varcode/effects/effect_ordering.py
Mutant transcripts¶
varcode.MutantTranscript¶
varcode.MutantTranscript(reference_transcript: Optional[object] = None, edits: Tuple[TranscriptEdit, ...] = tuple(), reference_segments: Optional[Tuple[ReferenceSegment, ...]] = None, cdna_sequence: Optional[str] = None, mutant_protein_sequence: Optional[str] = None, annotator_name: str = 'unknown', evidence: Optional[dict] = None)
dataclass
¶
Bases: DataclassSerializable
A reference transcript (or assembled set of reference segments) with zero or more variant-derived edits applied, optionally carrying the mutated cDNA and protein sequences.
Producers (the protein-diff annotator, RNA-evidence importers,
the splice-outcomes rewrite, germline-aware annotation,
structural-variant annotators) construct this once per
(transcript-or-segments, variant-set, context) and hand it to
downstream consumers. Each consumer reads the fields it cares
about — edits for provenance, cdna_sequence /
mutant_protein_sequence for protein-level analysis.
Two shapes:
- Point-variant shape (reference_transcript is set, reference_segments is None): the mutant is derived from a single reference transcript by applying zero or more :class:TranscriptEdit objects. This is the shape used by the protein-diff annotator for SNVs, MNVs, and simple indels.
- Structural-variant shape (reference_segments is set, reference_transcript may be None or the primary / 5'-partner transcript): the mutant is assembled by concatenating :class:ReferenceSegment slices in order. A gene fusion is two segments from two transcripts; a translocation to intergenic is one transcript segment plus a genomic-interval segment; an inversion is three forward/reverse/forward segments. edits may still be populated for point-variant edits layered on top of the assembled segments, but typically an SV carries no extra edits.
Sequence fields are Optional[str] because not every producer
computes them eagerly. Callers that require the protein must check
for it or compute it themselves; the protein-diff annotator
guarantees it for point variants.
Forward-looking hooks (not implemented here; documented so new integrations know where to plug in):
- Personalized / full-genome reference: pass a :class:ReferenceSegment whose source is a patient-specific contig object. varcode's translation logic reads source.sequence; it doesn't care whether that's GRCh38 or a custom assembly.
- Long-read resolution: when an SV has an :attr:alt_assembly on the :class:StructuralVariant, the SV annotator can wrap that sequence as a single synthetic segment.
- SV outcomes ambiguity: a translocation producing many candidate ORFs that only RNA can resolve should return List[MutantTranscript] or wrap it in a :class:MultiOutcomeEffect per #299, each with its own evidence dict capturing the disambiguator.
reference_transcript: Optional[object] = None
class-attribute
instance-attribute
¶
The :class:pyensembl.Transcript this mutant is derived from
(point-variant shape), or the primary / 5'-partner transcript
(SV shape). None when the SV has no canonical primary
transcript (e.g. intergenic-to-intergenic BNDs). Not typed
tightly here so :mod:pyensembl isn't a hard import dependency
for anyone who just wants the dataclass.
edits: Tuple[TranscriptEdit, ...] = field(default_factory=tuple)
class-attribute
instance-attribute
¶
Edits applied to produce this mutant, sorted by
:attr:TranscriptEdit.cdna_start. Empty tuple means the mutant
is identical to the reference (or, for SV shape, the assembled
segments carry the rearrangement directly without further
point-level edits).
reference_segments: Optional[Tuple[ReferenceSegment, ...]] = None
class-attribute
instance-attribute
¶
Ordered tuple of :class:ReferenceSegment objects that, when
concatenated in order (applying reverse-complement to -
strand segments), produce the mutant cDNA. None for the
point-variant shape; a fusion's segments would be
(5p_partner_segment, 3p_partner_segment). Coordinates are
in each segment's own reference system.
cdna_sequence: Optional[str] = None
class-attribute
instance-attribute
¶
The mutated spliced mRNA, when computed. None if the
producer hasn't materialized it yet.
mutant_protein_sequence: Optional[str] = None
class-attribute
instance-attribute
¶
The translated mutant protein, stopping at the first stop
codon. None if not yet translated, or if the edit set
doesn't produce a coherent ORF (e.g. start-codon loss). Callers
that need a guaranteed-present protein should use the
protein-diff annotator once it lands.
annotator_name: str = 'unknown'
class-attribute
instance-attribute
¶
Name of the :class:EffectAnnotator (or other producer) that
created this MutantTranscript. Used as provenance in
serialization and for A/B comparisons.
evidence: Optional[dict] = None
class-attribute
instance-attribute
¶
Optional producer-specific evidence (RNA read counts, Isovar fragment ids, SpliceAI scores, long-read assembly metadata). Shape is annotator-specific and not part of the stable contract; consumers that care about a particular evidence shape should type-check it at the call site.
is_identical_to_reference: bool
property
¶
True if no edits were applied AND there are no
reference-rearranging segments. Does NOT check
cdna_sequence / mutant_protein_sequence — a producer
can legitimately carry an identical sequence with zero edits
and a single identity segment.
is_structural: bool
property
¶
True when this mutant was assembled from
:attr:reference_segments (SV shape) rather than applying
:attr:edits to a single reference transcript.
total_length_delta: int
property
¶
Sum of :attr:TranscriptEdit.length_delta across all
edits — how much longer or shorter the mutant cDNA is than
the reference (point-variant shape). For SV shape, returns
0; the length of an assembled cDNA is the sum of segment
lengths, not a delta against a single reference.
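The two shape-dependent properties can be sketched as follows. SketchEdit and SketchMutant are hypothetical stand-ins, and defining length_delta as len(alt) - len(ref) is an assumption about :attr:TranscriptEdit.length_delta rather than something stated above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SketchEdit:
    """Hypothetical stand-in for TranscriptEdit."""
    cdna_start: int
    ref: str
    alt: str

    @property
    def length_delta(self) -> int:
        # Assumed definition: positive for insertions, negative for deletions.
        return len(self.alt) - len(self.ref)


@dataclass(frozen=True)
class SketchMutant:
    """Stand-in carrying only the fields the two properties read."""
    edits: Tuple[SketchEdit, ...] = ()
    reference_segments: Optional[Tuple[object, ...]] = None

    @property
    def is_structural(self) -> bool:
        # SV shape is signalled by reference_segments being set.
        return self.reference_segments is not None

    @property
    def total_length_delta(self) -> int:
        # SV shape has no single reference to measure a delta against.
        if self.is_structural:
            return 0
        return sum(edit.length_delta for edit in self.edits)


point = SketchMutant(edits=(SketchEdit(10, "A", "ATGG"), SketchEdit(40, "CC", "C")))
fusion = SketchMutant(reference_segments=("segment_5p", "segment_3p"))
```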
varcode.apply_variant_to_transcript¶
varcode.apply_variant_to_transcript(variant, transcript)
¶
Construct a :class:MutantTranscript by applying variant
to transcript's spliced cDNA.
Returns a :class:MutantTranscript whose cdna_sequence is
populated, plus mutant_protein_sequence when the variant
lies after the start codon (so translation from the canonical
start is well-defined). The codon table is selected from the
transcript's contig — mitochondrial transcripts use NCBI table
2 automatically (see :func:varcode.effects.codon_tables.codon_table_for_transcript).
Returns None when the variant can't be cleanly applied:
- Transcript is not protein-coding or is incomplete.
- Variant doesn't overlap the transcript at all.
- Variant spans more than one exon (splice-junction-crossing variants need the splice-aware path; not handled here).
- Reference allele doesn't match the transcript's cDNA at the computed offset.
Callers that get None should fall back to the fast
:class:EffectAnnotator. The forthcoming protein-diff annotator
layers effect classification on top of this builder.
Source code in varcode/mutant_transcript.py
varcode.apply_variants_to_transcript¶
varcode.apply_variants_to_transcript(variants, transcript)
¶
Apply a list of variants to a single transcript, yielding one
:class:MutantTranscript that carries all the resulting edits
(#269). Used for haplotype-aware joint effect prediction — cis
variants on the same transcript become one combined mutant
rather than N independent per-variant mutants.
Edits are applied in cDNA-coordinate order (highest offset
first, so earlier offsets aren't shifted) to transcript's
spliced cDNA. Returns None when any of the usual
single-variant preconditions fail (non-coding, incomplete, etc.),
or when the provided variants conflict — i.e. claim to edit
overlapping cDNA ranges. The caller is responsible for falling
back to per-variant effects in that case.
mutant_protein_sequence is populated when at least one edit
lands after the canonical CDS start; the joint cDNA is translated
from there to the first stop.
Order of variants doesn't matter — edits are sorted by cDNA
offset internally.
Source code in varcode/mutant_transcript.py
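The ordering and conflict rules described above can be sketched on a plain string. This is an illustration only: apply_edits and the (offset, ref, alt) tuple shape are hypothetical, offsets are taken as 0-based into the spliced cDNA, and the overlap test is a simplified one based on reference-allele ranges.

```python
from typing import List, Optional, Tuple

# (cdna_offset, ref, alt), with a 0-based offset into the spliced cDNA.
Edit = Tuple[int, str, str]


def apply_edits(cdna: str, edits: List[Edit]) -> Optional[str]:
    """Apply non-overlapping edits; return None on conflict or ref mismatch."""
    ordered = sorted(edits, key=lambda edit: edit[0])
    # Reject edits whose reference ranges overlap.
    for (start_a, ref_a, _), (start_b, _, _) in zip(ordered, ordered[1:]):
        if start_a + len(ref_a) > start_b:
            return None
    result = cdna
    # Highest offset first, so lower offsets are not shifted by earlier edits.
    for start, ref, alt in reversed(ordered):
        if result[start:start + len(ref)] != ref:
            return None
        result = result[:start] + alt + result[start + len(ref):]
    return result


cdna = "ATGGCCAAATGA"
joint = apply_edits(cdna, [(6, "AAA", ""), (3, "G", "T")])
```

Because the edits are sorted internally, the caller's input order has no influence on the result, which matches the behaviour described for the real function.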
Annotators¶
varcode.EffectAnnotator¶
varcode.EffectAnnotator
¶
Bases: Protocol
Protocol for an object that annotates variant effects on transcripts.
Conforming objects expose:
- name: short identifier (e.g. "fast") used in the registry and in serialized provenance.
- supports: set of variant-kind tags the annotator can handle (e.g. {"snv", "indel"}). Callers that hand the annotator a variant outside this set get a clear :class:UnsupportedVariantError rather than silently wrong output.
- :meth:annotate_on_transcript: the per-transcript entry point.
Optionally exposes version (string) — used in CSV provenance
headers so readers can detect when a serialized collection came
from a different annotator version. Built-in annotators track
varcode's version; third-party annotators expose their own.
The protocol is intentionally narrow at this stage — additional
methods (annotate_collection, annotate_with_context) will
be added as downstream work needs them. The contract is
duck-typed (@runtime_checkable) so third-party annotators
don't need to inherit from varcode just to register.
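A minimal sketch of what a duck-typed, runtime-checkable protocol of this shape looks like in plain Python. SketchAnnotator and EchoAnnotator are hypothetical names; the real protocol may declare its members differently.

```python
from typing import Any, FrozenSet, Protocol, runtime_checkable


@runtime_checkable
class SketchAnnotator(Protocol):
    """Stand-in for the documented protocol shape."""
    name: str
    supports: FrozenSet[str]

    def annotate_on_transcript(self, variant: Any, transcript: Any) -> Any:
        ...


class EchoAnnotator:
    """Third-party style annotator: conforms without inheriting anything."""
    name = "echo"
    supports = frozenset({"snv"})
    version = "0.1.0"  # optional member, not required by the protocol

    def annotate_on_transcript(self, variant, transcript):
        return (self.name, variant, transcript)


class NotAnAnnotator:
    """Lacks supports and annotate_on_transcript, so it does not conform."""
    name = "broken"


conforms = isinstance(EchoAnnotator(), SketchAnnotator)
rejected = isinstance(NotAnAnnotator(), SketchAnnotator)
```

Note that an isinstance check against a runtime-checkable protocol only verifies that the members exist, not that their signatures or types are right.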
varcode.FastEffectAnnotator¶
varcode.FastEffectAnnotator
¶
Wraps :func:varcode.effects.predict_variant_effect_on_transcript.
version = _varcode_version
class-attribute
instance-attribute
¶
Built-in annotators track varcode's own version. Third-party annotators (isovar's plugin, exacto's plugin) expose their own version string here; CSV provenance headers and round-trip warnings read from this field. See #271.
supports = frozenset({'snv', 'indel', 'mnv'})
class-attribute
instance-attribute
¶
Variant kinds this annotator handles. Splice-possibility sets, structural variants, and phased haplotypes fall outside the fast offset-based path and will be handled by the protein-diff annotator.
annotate_on_transcript(variant, transcript)
¶
Delegate to the existing per-transcript prediction.
No fast-path / slow-path dispatch at this stage; that lives on the protein-diff annotator once it exists.
Source code in varcode/annotators/fast.py
varcode.ProteinDiffEffectAnnotator¶
varcode.ProteinDiffEffectAnnotator
¶
Classify effects by diffing translated mutant protein against the reference protein.
Produces byte-for-byte identical output to
:class:FastEffectAnnotator on the common case (trivial
SNVs and simple indels) because both flow through the same
:func:classify_from_protein_diff classifier. Diverges where
protein-diff's approach is provably more accurate (boundary
codons, frameshift realignment). Any divergence must appear in
the parity harness EXPECTED_DIFFS with an issue link.
annotate_on_transcript(variant, transcript)
¶
Classify the effect of variant on transcript.
Runs the "fast" annotator first to detect splice-adjacent variants
(which stay fast-classified); for everything else, builds a
:class:MutantTranscript and diffs the translated protein.
Source code in varcode/annotators/protein_diff.py
varcode.StructuralVariantAnnotator¶
varcode.StructuralVariantAnnotator
¶
Classify :class:~varcode.StructuralVariant consequences on
a single transcript.
supports advertises the SV kinds the annotator handles. The
:class:~varcode.FastEffectAnnotator and
:class:~varcode.annotators.protein_diff.ProteinDiffEffectAnnotator
advertise {"snv", "indel", "mnv"}; this one advertises the
SV-type tokens used on :attr:StructuralVariant.sv_type.
annotate_on_transcript(variant, transcript)
¶
Classify variant on transcript. Returns a single
effect (typically a MultiOutcomeEffect subclass); consume
effect.outcomes for the full outcome set.
Source code in varcode/annotators/structural_variant.py
Registry¶
varcode.register_annotator(annotator)
¶
Add an annotator to the process-global registry, keyed by its
.name. Re-registering under the same name overrides the
previous entry — this is deliberate so callers can swap
implementations in tests.
Source code in varcode/annotators/registry.py
varcode.get_annotator(name)
¶
Look up a registered annotator by name. Raises KeyError
if no annotator is registered under that name.
varcode.get_default_annotator()
¶
Return the annotator currently configured as the default.
The current default is "protein_diff" (#322–#327 closed the last
known correctness bugs between it and "fast"). "fast" stays
available as an opt-in.
Source code in varcode/annotators/registry.py
varcode.set_default_annotator(name)
¶
Swap the process-wide default annotator. name must refer
to a registered annotator.
Source code in varcode/annotators/registry.py
varcode.use_annotator(name_or_instance)
¶
Context manager that temporarily swaps the default annotator.
Useful for A/B comparisons and scoped overrides without mutating global state across the codebase::
    with varcode.use_annotator("protein_diff"):
        effects = variant_collection.effects()
Accepts the same argument shape as the annotator= kwarg:
a registered-name string, or an annotator instance. Passing an
instance registers it temporarily under its .name so that
name-based lookups inside the block find it; on exit the
previous default and any previously-registered annotator under
that name are restored.
Source code in varcode/annotators/registry.py
Phasing¶
varcode.IsovarPhaseResolver¶
varcode.IsovarPhaseResolver(provider: IsovarAssemblyProvider)
¶
Phase resolver backed by an :class:IsovarAssemblyProvider
(#269, #259).
Two variants are cis if they appear on the same assembled contig. That's direct molecular evidence — not a probabilistic call.
Usage::
isovar_results = run_isovar(bam, vcf)
resolver = IsovarPhaseResolver(isovar_results)
effects = variants.effects(phase_resolver=resolver)
Any effect whose (variant, transcript) is covered by an
assembled contig gets its :attr:~MutationEffect.mutant_transcript
populated with the contig-derived :class:MutantTranscript —
the protein attached to the effect is the protein actually
observed in RNA, not one inferred from the reference.
Source code in varcode/phasing.py
has_contig(variant, transcript) -> bool
¶
mutant_transcript(variant, transcript)
¶
Return the assembled :class:MutantTranscript, or None
when this provider has no contig for (variant, transcript).
Source code in varcode/phasing.py
in_cis(v1, v2, transcript=None) -> Optional[bool]
¶
Return True if v1 and v2 appear on the same
Isovar contig, False if they're each on a different
contig (distinct physical molecules — trans), None when
neither variant has a contig (no evidence).
transcript is required because assemblies may be
isoform-specific; pass None only if the provider is
isoform-agnostic (the protocol allows this).
Source code in varcode/phasing.py
phased_partners(variant, transcript) -> Sequence
¶
Variants observed on the same contig as variant on
transcript — i.e. the cis set. Empty if no contig.
Source code in varcode/phasing.py
varcode.VCFPhaseResolver¶
varcode.VCFPhaseResolver(variant_collection, sample)
¶
Phase resolver backed by VCF GT + PS FORMAT fields.
Reads the phase data that varcode's VCF loader already parses
into :class:~varcode.Genotype (via #267): whether the
GT delimiter was | (phased) or / (unphased), the
PS phase-set identifier, and the per-haplotype allele indices
in :attr:Genotype.alleles.
Two variants are cis when they sit in the same phase set on
the same haplotype slot, trans when they sit in the same
phase set on different slots, and the resolver returns None
("no evidence") for variants that aren't both phased, don't share
a phase set, or lack called alleles.
Compatible with any tool that writes standard-shaped VCF:
WhatsHap, HapCUT2, DeepVariant, GATK HaplotypeCaller, long-read
callers (PEPPER-DeepVariant, Clair3), population phasers
(SHAPEIT5, Eagle2). varcode doesn't care which one wrote the
file — it only reads GT and PS.
Multi-allelic sites are handled: varcode splits those rows into
one :class:~varcode.Variant per ALT, each with an
alt_allele_index preserved on the
:class:~varcode.VariantCollection metadata. The resolver maps
each variant to its GT-encoded index and asks "which haplotype
slot carries this specific alt?".
Single-sample by construction. Phase is per-sample; multi-sample VCFs need one resolver per sample.
Currently supplies the cis/trans query but does not attach a
:class:~varcode.MutantTranscript — DNA phasing alone doesn't
produce an assembled contig. The natural next step is a
HaplotypeEffect / multi-variant apply_variants_to_transcript
helper that, when two or more cis variants overlap the same
transcript, builds a single joint :class:MutantTranscript
applying all edits at once. That's a separate PR — this
resolver already has the inputs it needs (in_cis) to drive
the grouping.
Source code in varcode/phasing.py
in_cis(v1, v2, transcript=None) -> Optional[bool]
¶
Return True if v1 and v2 are on the same
haplotype in the same phase set, False if they're on
different haplotypes in the same phase set, None when
the phase relationship can't be determined (unphased GT,
different phase sets, uncalled alleles).
transcript is accepted for interface symmetry with
:class:IsovarPhaseResolver.in_cis but isn't consulted —
DNA-level phase is isoform-agnostic.
Source code in varcode/phasing.py
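The three-valued cis/trans rule can be sketched on a simplified genotype record. SketchGenotype and its fields are hypothetical and do not mirror :class:~varcode.Genotype exactly; treating a missing PS value as "no evidence" is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class SketchGenotype:
    """Hypothetical view of one sample's call for one split variant."""
    alleles: Tuple[Optional[int], ...]  # allele index per haplotype slot; None = uncalled
    phased: bool                        # True when the GT delimiter was '|'
    phase_set: Optional[str]            # PS value, if any
    alt_index: int                      # GT-encoded index of this variant's ALT


def haplotype_slots(gt: SketchGenotype) -> Set[int]:
    # Which haplotype slots carry this specific alt allele?
    return {slot for slot, allele in enumerate(gt.alleles) if allele == gt.alt_index}


def sketch_in_cis(g1: SketchGenotype, g2: SketchGenotype) -> Optional[bool]:
    if not (g1.phased and g2.phased):
        return None  # unphased GT
    if g1.phase_set is None or g1.phase_set != g2.phase_set:
        return None  # no shared phase set
    slots1, slots2 = haplotype_slots(g1), haplotype_slots(g2)
    if not slots1 or not slots2:
        return None  # alt not called in this sample
    # Same slot means cis; same phase set but different slots means trans.
    return bool(slots1 & slots2)


a = SketchGenotype((0, 1), True, "100", 1)   # 0|1, PS=100
b = SketchGenotype((0, 1), True, "100", 1)   # 0|1, PS=100
c = SketchGenotype((1, 0), True, "100", 1)   # 1|0, PS=100
d = SketchGenotype((0, 1), False, None, 1)   # 0/1, unphased
e = SketchGenotype((0, 1), True, "200", 1)   # 0|1, PS=200
```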
phased_partners(variant, transcript=None)
¶
Variants in the collection that are cis with variant
under this resolver — i.e. sit in the same phase set on the
same haplotype slot. Empty when variant isn't phased or
has no called alt in the sample.
Source code in varcode/phasing.py
varcode.apply_phase_resolver_to_effects¶
varcode.apply_phase_resolver_to_effects(effects, phase_resolver)
¶
Post-process an :class:EffectCollection (or any iterable of
:class:MutationEffect) to attach contig-derived
:class:MutantTranscript objects when the resolver has evidence.
Mutates each effect in place by setting
effect.mutant_transcript. Effects whose transcript isn't
resolvable or whose (variant, transcript) has no contig are
left untouched — so this is safe to call on a mixed collection
where only some variants have RNA evidence.
Source code in varcode/phasing.py
RNA evidence¶
varcode.RNAEvidenceResolver¶
varcode.RNAEvidenceResolver
¶
Bases: Protocol
Source of RNA-observed outcomes for a (variant, transcript)
pair.
Implementers return zero or more :class:~varcode.outcomes.Outcome
objects describing isoforms, fusions, or RNA-level events that were
actually observed in reads. An empty sequence means "no evidence
for this pair" — the existing DNA-predicted outcomes are left
alone.
Returned outcomes should set source to a producer-specific
string ("isovar", "exacto", "longread_assembly", ...)
and populate evidence with whatever shape that producer
natively emits (transcript model IDs, junction read counts, etc.).
See :func:make_rna_outcome for a convenience factory that fills
the common fields.
observed_outcomes(variant, transcript) -> Sequence[Outcome]
¶
Return RNA-observed outcomes for variant on
transcript, or an empty sequence when no evidence is
available. Must not raise on unknown (variant, transcript)
pairs — return an empty sequence instead.
Source code in varcode/rna_evidence.py
varcode.NullRNAEvidenceResolver¶
varcode.NullRNAEvidenceResolver
¶
No-op resolver that always reports "no evidence".
Useful as a default in pipelines where an RNA resolver is optional
and as a baseline in tests. apply_rna_evidence_to_effects is
safe to call with this resolver — it's a no-op walk.
varcode.apply_rna_evidence_to_effects¶
varcode.apply_rna_evidence_to_effects(effects: Iterable, resolver) -> Iterable
¶
Attach RNA-observed outcomes from resolver to each effect
in place.
Walks effects and, for any effect with a resolvable
(variant, transcript), asks resolver.observed_outcomes
for any RNA-observed outcomes and stashes them on the effect's
_extra_outcomes slot. The :attr:MultiOutcomeEffect.outcomes
property (and its overrides) consult that slot via
:meth:MultiOutcomeEffect._with_extra_outcomes so callers see
DNA-predicted outcomes followed by RNA-observed ones.
Single-outcome effects (Missense, FrameShift, etc.) are left
untouched even when the resolver has evidence — those classes
don't expose an outcomes view, and replacing them with a
multi-outcome wrapper would break downstream isinstance checks.
Producers that need to surface RNA observations on point variants
should report them as a separate :class:MultiOutcomeEffect rather
than mutating an existing single-outcome one. (The point-variant
diff is generally already correct from DNA, so this is rarely an
issue in practice.)
Mirrors :func:varcode.phasing.apply_phase_resolver_to_effects:
in-place mutation, safe to call on a mixed collection where only
some variants have RNA evidence, no-op when resolver is None
or doesn't implement the protocol.
Returns effects for chaining convenience.
Source code in varcode/rna_evidence.py
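A self-contained sketch of the walk described above, using hypothetical stand-in classes (DictResolver, SketchMultiOutcomeEffect, SketchSingleEffect) rather than varcode's own. The protocol check is reduced to a hasattr test for brevity.

```python
from typing import Any, Sequence


class NullResolver:
    """Always reports no evidence."""
    def observed_outcomes(self, variant, transcript) -> Sequence[Any]:
        return ()


class DictResolver:
    """Hypothetical resolver backed by a dict keyed on (variant, transcript)."""
    def __init__(self, table):
        self._table = table

    def observed_outcomes(self, variant, transcript) -> Sequence[Any]:
        # Unknown pairs yield an empty sequence instead of raising.
        return tuple(self._table.get((variant, transcript), ()))


class SketchMultiOutcomeEffect:
    def __init__(self, variant, transcript, predicted):
        self.variant, self.transcript = variant, transcript
        self._predicted = tuple(predicted)
        self._extra_outcomes = ()

    @property
    def outcomes(self):
        # DNA-predicted outcomes first, RNA-observed ones after.
        return self._predicted + tuple(self._extra_outcomes)


class SketchSingleEffect:
    def __init__(self, variant, transcript):
        self.variant, self.transcript = variant, transcript


def sketch_apply_rna_evidence(effects, resolver):
    # No-op when there is no usable resolver.
    if resolver is None or not hasattr(resolver, "observed_outcomes"):
        return effects
    for effect in effects:
        if not isinstance(effect, SketchMultiOutcomeEffect):
            continue  # single-outcome effects are left untouched
        observed = resolver.observed_outcomes(effect.variant, effect.transcript)
        if observed:
            effect._extra_outcomes = tuple(observed)
    return effects


multi = SketchMultiOutcomeEffect("v1", "t1", ["dna_a"])
single = SketchSingleEffect("v1", "t1")
effects = [multi, single]
returned = sketch_apply_rna_evidence(effects, DictResolver({("v1", "t1"): ["rna_x"]}))
```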
varcode.make_rna_outcome¶
varcode.make_rna_outcome(effect, *, probability: Optional[float] = None, source: str = 'rna', transcript_model_id: Optional[str] = None, read_count: Optional[int] = None, description: Optional[str] = None, extra_evidence: Optional[Mapping[str, Any]] = None) -> Outcome
¶
Construct an :class:~varcode.outcomes.Outcome carrying
RNA-derived provenance.
Convenience factory for the common fields a reads-based or
long-read assembly tool wants on each observed outcome — keeps
consumers from hand-rolling the evidence dict shape and lets
downstream code rely on a small set of well-known keys.
| PARAMETER | DESCRIPTION |
|---|---|
effect
|
The effect this RNA-observed outcome represents.
TYPE:
|
probability
|
Estimated frequency of this isoform (e.g. expression-supported
fraction).
TYPE:
|
source
|
Producer name; defaults to
TYPE:
|
transcript_model_id
|
Stable ID of the observed transcript model from the producer.
Stored under
TYPE:
|
read_count
|
Supporting read count. Stored under
TYPE:
|
description
|
Human-readable label, passed through to
:attr:
TYPE:
|
extra_evidence
|
Producer-specific extra fields, merged into the evidence dict
on top of the well-known keys above. Allows tool-native fields
(e.g.
TYPE:
|
Source code in varcode/rna_evidence.py
Germline-aware annotation¶
varcode.GermlineContext¶
varcode.GermlineContext(variants: 'VariantCollection', completeness: Completeness = Completeness.COMPLETE, reference_name: Optional[str] = None, metadata: Mapping[str, Any] = dict())
dataclass
¶
The patient's germline, packaged with completeness metadata and reference-build info for cross-VCF validation.
Construct via the from_* classmethods rather than instantiating
directly; the constructors apply the input-shape-specific
validation each route needs.
| ATTRIBUTE | DESCRIPTION |
|---|---|
variants |
The germline variants as a :class:
TYPE:
|
completeness |
How to interpret absence-of-a-call (see :class:
TYPE:
|
reference_name |
The genome reference these variants were called against —
TYPE:
|
metadata |
Open-ended dict for caller-supplied annotations (source caller name, sample identifier, normalization tool, etc.). Not interpreted by varcode; rides along for downstream consumers and serialization.
TYPE:
|
Examples:
Route 1 — full germline call set::
ctx = GermlineContext.from_germline_vcf("normal.vcf")
Route 2: multi-sample VCF, extract a column. The user must declare completeness explicitly because absence from a multi-sample column rarely means ref/ref::
ctx = GermlineContext.from_multi_sample_vcf(
"merged.vcf", sample="NORMAL", completeness=Completeness.SPARSE)
Direct construction (tests, custom pipelines)::
ctx = GermlineContext.from_variants(
germline_variants, completeness=Completeness.COMPLETE,
reference_name="GRCh38")
Explicit empty context — opt-in to reference-relative fallback::
ctx = GermlineContext.empty()
from_germline_vcf(path: str, *, completeness: Completeness = Completeness.COMPLETE, metadata: Optional[Mapping[str, Any]] = None, **load_vcf_kwargs) -> 'GermlineContext'
classmethod
¶
Load a full germline VCF into a context.
load_vcf_kwargs are passed through to
:func:varcode.load_vcf — for example genome= or
only_passing=False. The returned context defaults to
Completeness.COMPLETE; pass completeness= only if the
VCF is something other than a real germline call set.
Source code in varcode/germline.py
from_multi_sample_vcf(path: str, sample: str, *, completeness: Completeness, metadata: Optional[Mapping[str, Any]] = None, **load_vcf_kwargs) -> 'GermlineContext'
classmethod
¶
Load a multi-sample VCF and extract one sample's calls as the germline.
completeness is required (no default) — multi-sample VCFs
from somatic callers (e.g. Mutect2's NORMAL column) are
almost always sparse, but pure-germline multi-sample VCFs
(1000G, gnomAD batch genotyping) are complete. Forcing the
caller to declare prevents subtle correctness bugs from
treating a sparse column as if absence implied ref/ref.
The sample is filtered post-load. If you need per-sample
zygosity information, pass include_info=True (the default)
and consult vc.metadata[variant]["sample_info"][sample]
downstream.
Source code in varcode/germline.py
from_variants(variants, *, completeness: Completeness = Completeness.COMPLETE, reference_name: Optional[str] = None, metadata: Optional[Mapping[str, Any]] = None) -> 'GermlineContext'
classmethod
¶
Construct from an already-built :class:VariantCollection
or any iterable of :class:Variant objects.
Useful for tests, hand-built pipelines, and downstream tools
that already have variants in memory and don't need to re-parse
a VCF. reference_name should be passed explicitly when not
carried by the variants themselves; otherwise cross-VCF
validation will be a no-op.
Source code in varcode/germline.py
empty() -> 'GermlineContext'
classmethod
¶
Explicit no-germline context. Use this in pipelines where
germline= is structurally required but the caller has no
germline data — it documents intent better than passing
None, and downstream code can rely on the kwarg always
being a :class:GermlineContext.
Effect prediction with an empty context falls through to reference-relative annotation (no patient transcript construction), with no warnings — the caller has explicitly opted in to the fallback.
Source code in varcode/germline.py
__bool__() -> bool
¶
Truthy when there's something to apply. EMPTY contexts
are falsy so if germline_context: reads idiomatically.
validate_against(somatic, *, validate_reference: bool = True) -> None
¶
Cross-validate this context with a somatic
:class:VariantCollection. Hard error on reference-build
mismatch unless validate_reference=False; warn on
suspicious shapes (empty germline, sparse coverage with no
overlap with somatic, etc.).
Called automatically by :meth:Variant.effects /
:meth:VariantCollection.effects when a context is supplied;
callers running validation manually can do so up front to fail
fast before annotation.
Source code in varcode/germline.py
variants_in_window(contig: str, start: int, end: int) -> Tuple
¶
Germline variants overlapping [start, end] on
contig (inclusive on both ends).
Used by the window-based lookup machinery (slice 2 of #268). Lazy interval index is built on first call and cached on the instance — subsequent calls are O(log N) per contig.
Returns a tuple (immutable) so callers can safely cache the result without worrying about the underlying index mutating.
Source code in varcode/germline.py
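One way such a lazily built, inclusive window lookup could work is sketched below with the standard-library bisect module. SketchWindowIndex and the (contig, start, end) tuple shape are hypothetical, and the real index may use a different data structure.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Iterable, Tuple

# (contig, start, end) in 1-based inclusive coordinates.
Interval = Tuple[str, int, int]


class SketchWindowIndex:
    def __init__(self, intervals: Iterable[Interval]):
        self._intervals = list(intervals)
        self._by_contig = None  # built on first query, then reused

    def _build(self) -> None:
        grouped = defaultdict(list)
        for interval in self._intervals:
            grouped[interval[0]].append(interval)
        index = {}
        for contig, items in grouped.items():
            items.sort(key=lambda iv: iv[1])
            starts = [iv[1] for iv in items]
            longest = max(iv[2] - iv[1] for iv in items)
            index[contig] = (starts, items, longest)
        self._by_contig = index

    def in_window(self, contig: str, start: int, end: int) -> Tuple[Interval, ...]:
        if self._by_contig is None:
            self._build()
        if contig not in self._by_contig:
            return ()
        starts, items, longest = self._by_contig[contig]
        # Any overlapping interval must start within [start - longest, end].
        lo = bisect_left(starts, start - longest)
        hi = bisect_right(starts, end)
        # Keep candidates whose end reaches the window (inclusive overlap).
        return tuple(iv for iv in items[lo:hi] if iv[2] >= start)


idx = SketchWindowIndex(
    [("1", 100, 100), ("1", 150, 160), ("1", 300, 300), ("2", 100, 100)]
)
```

Returning a tuple keeps the result immutable, in line with the note above about callers caching it safely.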
varcode.Completeness¶
varcode.Completeness
¶
Bases: Enum
How exhaustive the germline call set is — the load-bearing flag that pins what absence of a call at a position means.
The same data structure ("a list of germline variants") can come from very different pipelines, and downstream effect prediction cannot make the right call without knowing which:
- If a position is absent from a real germline VCF emitted by a germline caller that examined the entire normal BAM, the patient is ref/ref there. Effect prediction proceeds reference-relative at that codon.
- If a position is absent from the NORMAL column of a somatic-caller VCF, it likely means the somatic caller didn't emit a row, not that the position is ref/ref. The patient's germline state at that codon is unknown. The honest output is a possibility set including "unknown germline."
- If a position is absent from a panel-of-normals filter list, it definitely doesn't imply ref/ref: the file only lists curated hotspots.
Mis-treating "absent" as "ref/ref" silently produces wrong germline-aware effects on somatic variants in long stretches of the genome the somatic caller never touched. The flag exists so that mistake fails loud (or at least produces an honest possibility set) instead of silently corrupting clinical annotation.
Values
| Value | Typical pipeline of origin | Absence at a position |
|---|---|---|
| :attr:COMPLETE | Germline caller (DeepVariant, HaplotypeCaller, Strelka2 germline) on the normal BAM | ⇒ ref/ref |
| :attr:SPARSE | NORMAL column of a somatic tumor-vs-normal VCF (Mutect2, Strelka2 somatic, VarScan2 somatic) | ⇒ unknown (probably ref/ref but not queried). Honest output is a possibility set. |
| :attr:HOTSPOTS_ONLY | Panel-of-normals filter list, ClinVar pathogenic list, single-hotspot allowlists | ⇒ definitely unknown. Strictly weaker evidence than SPARSE. |
| :attr:EMPTY | "I have no germline data": explicit fallback, used so users opt into reference-relative annotation rather than getting it by accident from a missing kwarg | n/a (no germline-aware logic runs; equivalent to not passing germline= at all) |
What downstream slices do with this
Slice 3 of #268 wires germline= through annotator dispatch.
When a somatic variant lands in a transcript window that has no
germline calls, the annotator reads this flag to decide between:
- COMPLETE → patient is ref/ref in this window; emit a single reference-relative effect.
- SPARSE / HOTSPOTS_ONLY → patient's germline is unknown in this window; emit a possibility set including the reference-relative effect plus "germline-unknown" outcomes so the user sees the uncertainty.
- EMPTY → no germline-aware logic; reference-relative.
Constructors and defaults
:meth:GermlineContext.from_germline_vcf defaults to
COMPLETE because that's almost always what
a real germline VCF is.
:meth:GermlineContext.from_multi_sample_vcf requires the
caller to declare completeness explicitly (no default) — a
multi-sample VCF could be either, and silently defaulting
either direction is a correctness bug waiting to happen.
:meth:GermlineContext.empty always sets EMPTY.
varcode.predict_germline_aware_effect¶
varcode.predict_germline_aware_effect(somatic_variant, transcript, germline_ctx: GermlineContext, annotator, phase_resolver=None, window_fn=default_germline_window, max_hypotheses: int = 8)
¶
Predict the effect of somatic_variant on transcript
against the patient's germline-applied transcript.
Single entry point for germline-aware effect prediction.
:func:varcode.effects.predict_variant_effects calls this whenever
a non-empty :class:GermlineContext is supplied; otherwise it
bypasses the germline path entirely and the existing annotator
dispatch produces today's reference-relative output unchanged.
Behaviour by case:
- No germline in the somatic's window: patient transcript ≡ reference transcript at this locus; delegate to annotator directly. SPARSE / HOTSPOTS_ONLY contexts mark the result with effect.germline_unknown = True so consumers can see the uncertainty.
- Germline in window, phase known (resolver answers, or hemizygous, or all-cis-by-zygosity): single patient haplotype; classify against it via :func:_classify_against_patient_baseline.
- Germline in window, phase unknown: enumerate hypotheses (capped via max_hypotheses), classify each, wrap in :class:~varcode.effects.effect_classes.PhaseAmbiguousEffect.
LOH (somatic matches germline at position+alt with het zygosity)
sets effect.is_loh = True regardless of which branch ran.
window_fn is the pluggable window selector — defaults to
:func:default_germline_window (codon-level, with splice-signal
expansion when the somatic is splice-adjacent). Callers that
need different windows pass their own.
Source code in varcode/germline.py
varcode.apply_germline_to_transcript¶
varcode.apply_germline_to_transcript(transcript, germline_ctx, somatic_variant=None)
¶
Apply germline variants from germline_ctx to transcript,
returning the patient's baseline :class:MutantTranscript.
Lower-level entry point for callers that want the patient
transcript directly without going through full effect prediction.
Used internally by :func:predict_germline_aware_effect; exposed
publicly for downstream tools (Isovar, Exacto) that want to
compute a custom analysis on the patient haplotype.
Behaviour:
- If germline_ctx is empty, returns None.
- If somatic_variant is provided, restricts germline to the somatic's window (per :func:default_germline_window); else applies all germline variants overlapping any exon of the transcript.
- If germline edits conflict (overlapping cDNA ranges) or land outside the CDS, returns None and the caller falls back.
The returned object is the same shape that
:func:varcode.mutant_transcript.apply_variants_to_transcript
produces: a :class:MutantTranscript carrying the germline
edits with mutant_protein_sequence populated when the edits
land after the CDS start.
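The "conflicting edits" condition above (overlapping cDNA ranges) can be illustrated with a small interval check. This is a sketch under assumptions: the `has_conflict` helper and the half-open `(start, end)` cDNA ranges are hypothetical and not taken from varcode's implementation.

```python
def has_conflict(cdna_ranges):
    """True when any two half-open (start, end) cDNA ranges overlap."""
    ordered = sorted(cdna_ranges)
    # After sorting by start, any overlap shows up between neighbours.
    return any(
        prev_end > next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )
```

When such a check reports a conflict, the documented behaviour is to return None so the caller can fall back.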
Source code in varcode/germline.py
varcode.enumerate_phase_hypotheses¶
varcode.enumerate_phase_hypotheses(somatic_variant, germline_in_window, phase_resolver=None, max_hypotheses: int = 8) -> Tuple[PhaseHypothesis, ...]
¶
Enumerate plausible phase configurations of somatic_variant
relative to germline_in_window.
Three regimes:
- Hemizygous chromosome (chrX/Y/M, male X): single haplotype; all germline-in-window is implicitly cis. One hypothesis.
- Resolver answers for every pair (phase_resolver.in_cis returns True/False for each (somatic, germline_v)): a single deterministic hypothesis with cis/trans assigned per the resolver. phase_state="phased".
- Phase unknown: enumerate all 2^n cis/trans assignments across n germline variants. Cap at max_hypotheses; emit a single "unknown" placeholder when the cap is exceeded (consumers see a TooManyHypotheses evidence flag).
The cap is configurable so downstream pipelines that tolerate more hypotheses (long-read with rich phasing, manual analyses) can raise it. Default 8 = up to 3 germline variants in a window fully unphased.
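The phase-unknown regime and its cap can be sketched as below. The `enumerate_assignments` helper and its dict-based return shape are hypothetical; the real function returns PhaseHypothesis objects whose fields are not shown on this page.

```python
from itertools import product


def enumerate_assignments(germline_ids, max_hypotheses=8):
    """Enumerate cis/trans assignments for n unphased germline variants.

    Returns one dict per hypothesis, or a single "unknown" placeholder
    when 2**n exceeds max_hypotheses.
    """
    n = len(germline_ids)
    if 2 ** n > max_hypotheses:
        return [{"phase_state": "unknown"}]
    return [
        dict(zip(germline_ids, labels))
        for labels in product(("cis", "trans"), repeat=n)
    ]
```

With the default cap of 8, three germline variants produce all eight assignments, while a fourth pushes the count to sixteen and collapses the result to the placeholder.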
Source code in varcode/germline.py
varcode.detect_loh¶
varcode.detect_loh(somatic_variant, germline_in_window) -> bool
¶
True when somatic_variant is identical at (position, alt)
to a germline variant in the window.
LOH is the most common "looks somatic but isn't really" case —
the patient was germline het at this position, and the tumor lost
the reference allele, so the variant call says "alt" in tumor and
"het" in normal but the alt itself is the germline allele. We
flag the resulting effect with is_loh=True so consumers can
distinguish a true somatic mutation from a zygosity change.
Only same-position-and-alt counts. A position where germline and somatic disagree on alt is a different mutation, not LOH.
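A minimal sketch of the same-position-and-alt rule, assuming a simplified record type. The `Call` tuple and `is_loh_candidate` function are hypothetical; real varcode variants carry more fields, and zygosity is not checked here.

```python
from collections import namedtuple

# Hypothetical minimal record used only for this illustration.
Call = namedtuple("Call", ["contig", "start", "alt"])


def is_loh_candidate(somatic, germline_in_window):
    """True when a germline call shares the somatic contig, start and alt."""
    key = (somatic.contig, somatic.start, somatic.alt)
    return any((g.contig, g.start, g.alt) == key for g in germline_in_window)
```

A germline call at the same position with a different alt does not match, which mirrors the "different mutation, not LOH" rule.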
Source code in varcode/germline.py
varcode.default_germline_window¶
varcode.default_germline_window(somatic_variant, transcript) -> Tuple[str, int, int]
¶
Default window for looking up germline variants relevant to a somatic variant on a transcript.
Returns (contig, start, end) covering the codon containing the
somatic variant — three reference bases on each side of
somatic_variant.start. This is the window from #268's table
for in-exon coding variants.
Larger windows (splice signal region for splice-adjacent variants,
same exon for frameshift candidates) are useful refinements but
don't change the API. Callers that need them pass a custom
window_fn to :func:predict_germline_aware_effect.
Splice-adjacent: when the somatic is within 6bp of an exon-intron boundary, expand to a 12bp window centered on the boundary so germline edits to the donor / acceptor signal show up in the lookup. This catches the "germline broke the splice site" case without forcing the caller to wire up a separate window function.
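The two window rules can be sketched as follows. Several details are assumptions made for illustration: the `sketch_germline_window` name, the `exon_last_bases` argument (positions of the last exonic base before each boundary), inclusive end coordinates, and ignoring strand.

```python
def sketch_germline_window(contig, start, exon_last_bases=(), flank=3, splice_reach=6):
    """Return an inclusive (contig, start, end) lookup window."""
    for last_exonic in exon_last_bases:
        if abs(start - last_exonic) <= splice_reach:
            # 12 bp centered on the boundary: 6 bases on each side of it.
            return (contig, last_exonic - splice_reach + 1, last_exonic + splice_reach)
    # Default: three reference bases on each side of the variant start.
    return (contig, start - flank, start + flank)
```

A caller needing a different policy would supply its own function with the documented `(somatic_variant, transcript)` signature as window_fn instead.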
Source code in varcode/germline.py
VariantCollection transforms¶
varcode.transforms.pair_breakends¶
varcode.transforms.pair_breakends(vc)
¶
Merge MATEID-paired BND rows into a single combined
StructuralVariant per rearrangement event. (reduces)
For each pair of :class:~varcode.StructuralVariant rows where
row A's MATEID references row B's VCF ID (and vice versa),
emit one combined row carrying both endpoints. The combined
variant's source_variants attribute holds the two originals.
Non-BND variants, single-row TRA, and single-ended BNDs (no
MATEID) pass through unchanged with source_variants=().
Pairing rules:
- Primary key: MATEID field on each variant's info dict against the VCF row ID stored in the collection's source metadata.
- Alias: PARID (used by older GRIDSS) is treated as MATEID.
- Symmetric: row A's MATEID must equal row B's ID and row B's MATEID must equal row A's ID. Asymmetric references are warned and left unpaired.
- If a MATEID points to an ID not present in this collection (filtered out, chunked load), a warning is emitted and the variant passes through unpaired.
- If three or more rows share a MATEID group, the whole group is left unpaired with a warning (pairing is ambiguous).
- Already-paired input (source_variants non-empty) passes through unchanged; :func:pair_breakends is idempotent.
Metadata merge for paired rows:
- Genotype: both halves must agree on the per-sample GT. The combined variant inherits the shared genotype. Disagreement raises :class:ValueError.
- alt_assembly: if exactly one half carries an assembled sequence, the combined variant inherits it. If both differ, A's wins (deterministic via lex source-ID ordering) with a warning.
- filter: union of both halves' FILTER tokens; PASS is dropped if any non-PASS label is present (stricter wins).
- qual: minimum of the two halves' quality scores.
- Other INFO fields: prefer A's; the other half is reachable via combined.source_variants.
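The symmetric pairing rule and the filter/qual merge can be sketched on plain Python data. Both helpers below are hypothetical simplifications: they omit warnings, the PARID alias, ambiguous three-way groups, and genotype checks, and they do not use varcode's classes.

```python
def pair_by_mateid(mate_of):
    """mate_of maps VCF row ID -> MATEID (or None). Returns sorted ID pairs."""
    pairs = []
    for row_id, mate_id in mate_of.items():
        # Symmetric rule: A points at B and B points back at A.
        # row_id < mate_id keeps each pair once, in lexicographic order.
        if mate_id in mate_of and mate_of[mate_id] == row_id and row_id < mate_id:
            pairs.append((row_id, mate_id))
    return sorted(pairs)


def merge_filter_and_qual(filter_a, filter_b, qual_a, qual_b):
    """Union FILTER tokens (dropping PASS if anything stricter is present)
    and keep the lower of the two quality scores."""
    tokens = set(filter_a) | set(filter_b)
    if tokens - {"PASS"}:
        tokens.discard("PASS")
    return sorted(tokens), min(qual_a, qual_b)
```

Rows whose MATEID is missing, dangling, or not reciprocated simply do not appear in the returned pairs, which corresponds to "passes through unpaired" in the rules above.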
| PARAMETER | DESCRIPTION |
|---|---|
| vc | Input collection. May contain a mix of structural and non-structural variants. |

| RETURNS | DESCRIPTION |
|---|---|
| VariantCollection | A new collection. SV rows that were halves of paired BNDs are replaced by combined rows; everything else (including SNVs/indels/MNVs and unpaired SVs) passes through. |
Source code in varcode/transforms/__init__.py
varcode.transforms.left_align_indels¶
varcode.transforms.left_align_indels(vc)
¶
Shift indels to their canonical leftmost equivalent position. (preserves)
Indels in homopolymer or short-tandem-repeat regions can be
represented at any of several equivalent positions — CTT->T
inside a CT-repeat means the same biological event as
CT->_ two positions to the left. Tools that compare variants
by (contig, start, ref, alt) see those representations as
distinct calls. Left-alignment normalizes to a single canonical
representation per indel: the leftmost equivalent position.
The algorithm is the standard variant-normalization left-shift
used by bcftools norm and GATK LeftAlignAndTrimVariants,
applied as an opt-in VariantCollection -> VariantCollection
transform rather than baked into VCF load.
Reference sequence is read via the genome the variants carry —
no explicit reference parameter. Coverage tiers (see
:mod:varcode.genome_sequence):
- Chromosome FASTA attached (via :class:varcode.Genome's fasta slot): indels everywhere shift to canonical positions, including in introns and intergenic regions.
- No FASTA (default pyensembl install): indels fully within an exon shift via the transcript cDNA fallback. Intronic and intergenic indels pass through unchanged. Indels that start exonic but would shift across an exon boundary stop at the boundary and carry info["left_align_partial"] = True.
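The left-shift itself can be sketched on a plain string reference. This is a simplified illustration and not varcode's implementation: `left_shift` is a hypothetical helper, positions are 0-based, and the indel is given as the inserted or deleted bases without an anchor base.

```python
def left_shift(reference, pos, seq):
    """Shift a pure indel to its leftmost equivalent position.

    pos is the 0-based index where seq is deleted from (or inserted before)
    the reference. While the base to the left equals the last base of seq,
    rotate seq by one base and step left.
    """
    while pos > 0 and reference[pos - 1] == seq[-1]:
        seq = reference[pos - 1] + seq[:-1]
        pos -= 1
    return pos, seq
```

For a deletion of "CA" at index 5 of "GCACACAT", the loop walks back through the CA repeat and stops at index 1, where the preceding "G" no longer matches. A boundary-aware variant would stop the loop early at an exon edge, as the no-FASTA tier describes.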
| PARAMETER | DESCRIPTION |
|---|---|
| vc | Input collection. May contain a mix of SNVs, MNVs, indels, complex variants, and SVs; only pure indels (length-different REF/ALT with one side empty after suffix trimming) are considered for shifting. Everything else passes through. |
| RETURNS | DESCRIPTION |
|---|---|
| VariantCollection | A new collection with indels at their canonical leftmost positions. Variants that shifted carry |

See
six (location × FASTA-attached) combinations and the metadata
fields the transform writes.
Examples:
Source code in varcode/transforms/__init__.py
File loading¶
varcode.load_vcf¶
varcode.vcf.load_vcf(path, genome=None, reference_vcf_key='reference', only_passing=True, allow_extended_nucleotides=False, include_info=True, chunk_size=10 ** 5, max_variants=None, sort_key=variant_ascending_position_sort_key, distinct=True, normalize_contig_names=True, convert_ucsc_contig_names=True, parse_structural_variants=False, genome_fasta=None)
¶
Load reference name and Variant objects from the given VCF filename.
Local files are parsed directly. HTTP/HTTPS URLs are downloaded to a
temporary file and load_vcf recurses on the local copy; pandas
doesn't reliably stream gzipped HTTP responses, so we materialize first.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to VCF (.vcf) or compressed VCF (.vcf.gz). |
| genome | Optionally pass in a PyEnsembl Genome object, name of reference, or PyEnsembl release version to specify the reference associated with a VCF (otherwise infer reference from VCF using reference_vcf_key). |
| reference_vcf_key | Name of metadata field which contains path to reference FASTA file (default = 'reference'). |
| only_passing | If true, any entries whose FILTER field is not one of "." or "PASS" are dropped. |
| allow_extended_nucleotides | Allow characters other than A, C, T, G in the ref and alt strings. |
| include_info | Whether to parse the INFO and per-sample columns. If you don't need these, set to False for faster parsing. |
| chunk_size | Number of records to load in memory at once. |
| max_variants | If specified, return only the first max_variants variants. |
| sort_key | Function which maps each element to a sorting criterion. Set to None to leave the variants unsorted. |
| distinct | Don't keep repeated variants. |
| normalize_contig_names | By default contig names will be normalized by converting integers to strings (e.g. 1 -> "1"), and converting any letters after "chr" to uppercase (e.g. "chrx" -> "chrX"). If you don't want this behavior then pass normalize_contig_names=False. |
| convert_ucsc_contig_names | Convert chromosome names from hg19 (e.g. "chr1") to equivalent names for GRCh37 (e.g. "1"). By default this is set to True. If None, it also evaluates to True if the genome of the VCF is a UCSC reference. |
| genome_fasta | Optionally attach a chromosome FASTA to the resolved genome before parsing. Equivalent to wrapping with |
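The only_passing, normalize_contig_names and convert_ucsc_contig_names behaviours described in the table can be approximated in a few lines. These helpers are hypothetical sketches written from the parameter descriptions, not the functions load_vcf actually calls; special cases such as mitochondrial naming are not handled.

```python
def passes_filter(filter_field):
    """only_passing rule: keep records whose FILTER is "." or "PASS"."""
    return filter_field in (".", "PASS")


def normalize_contig(name, convert_ucsc=True):
    """Stringify the contig, uppercase letters after "chr", and
    optionally strip the UCSC "chr" prefix."""
    name = str(name)
    if name.lower().startswith("chr"):
        name = "chr" + name[3:].upper()
        if convert_ucsc:
            name = name[3:]
    return name
```

Applying both steps before constructing variants keeps contig names comparable across VCFs that mix UCSC and Ensembl conventions.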
Source code in varcode/vcf.py
varcode.load_maf¶
varcode.load_maf(path, optional_cols=[], sort_key=variant_ascending_position_sort_key, distinct=True, raise_on_error=True, encoding=None, nrows=None)
¶
Load reference name and Variant objects from MAF filename.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to MAF (*.maf). |
| optional_cols | A list of MAF columns to include as metadata if they are present in the MAF. Does not result in an error if those columns are not present. |
| sort_key | Function which maps each element to a sorting criterion. Set to None to leave the variants unsorted. |
| distinct | Don't keep repeated variants. |
| raise_on_error | Raise an exception upon encountering an error or just log a warning. |
| encoding | Text encoding to use when reading the MAF file. |
| nrows | Limit on the number of rows loaded. |
Source code in varcode/maf.py
varcode.load_maf_dataframe¶
varcode.load_maf_dataframe(path, nrows=None, raise_on_error=True, encoding=None)
¶
Load the guaranteed columns of a TCGA MAF file into a DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to MAF file. |
| nrows | Optional limit on the number of rows loaded. |
| raise_on_error | Raise an exception upon encountering an error or log an error. |
| encoding | Text encoding to use when reading the MAF file. |
Source code in varcode/maf.py
Exceptions¶
varcode.ReferenceMismatchError¶
varcode.ReferenceMismatchError(variant, transcript, expected_ref, observed_ref, transcript_offset=None, genome_start=None, genome_end=None)
¶
Bases: ValueError
Raised when a variant's reported ref allele does not match the reference genome at the variant's position.
This most often means one of:
- The variant was called against a different reference build than the one being used for annotation (e.g. GRCh37 vs GRCh38).
- The variant's ref field was populated with the patient's germline allele rather than the canonical reference. VCF requires the ref field to match the reference genome; germline variants at the same position should be encoded as separate variants.
- Strand confusion: the variant is specified on the negative strand but varcode expects positive-strand coordinates.
Callers who would rather continue past this error can pass
raise_on_error=False to :meth:Variant.effects to receive
Failure effects instead.
Source code in varcode/errors.py
varcode.SampleNotFoundError¶
varcode.SampleNotFoundError
¶
Bases: KeyError
Raised when genotype info is requested for a sample that isn't present in the VariantCollection's source VCF(s).
varcode.GenomeBuildMismatchError¶
varcode.GenomeBuildMismatchError(somatic_reference, germline_reference)
¶
Bases: ValueError
Raised when a germline VCF and a somatic VCF were called against different reference genome builds (e.g. GRCh37 vs GRCh38). Effect coordinates from the two VCFs cannot be meaningfully composed.
Subclasses :class:ValueError so callers that already catch
ValueError for ReferenceMismatchError continue to work.
Set validate_reference=False on the call site if the user
has explicitly lifted over one VCF into the other build and
knows what they're doing.
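Because the exception subclasses ValueError, an existing handler keeps working. The sketch below uses a stand-in class so it runs without varcode installed; `GenomeBuildMismatchSketch`, `check_builds`, and `annotate_or_report` are hypothetical names, and only the constructor arguments and the validate_reference flag come from the text above.

```python
class GenomeBuildMismatchSketch(ValueError):
    """Stand-in for varcode.GenomeBuildMismatchError."""

    def __init__(self, somatic_reference, germline_reference):
        super().__init__(
            "somatic VCF uses %s but germline VCF uses %s"
            % (somatic_reference, germline_reference)
        )
        self.somatic_reference = somatic_reference
        self.germline_reference = germline_reference


def check_builds(somatic_reference, germline_reference, validate_reference=True):
    """Hypothetical check: raise when the builds differ, unless disabled."""
    if validate_reference and somatic_reference != germline_reference:
        raise GenomeBuildMismatchSketch(somatic_reference, germline_reference)


def annotate_or_report(somatic_reference, germline_reference):
    try:
        check_builds(somatic_reference, germline_reference)
    except ValueError as err:  # also catches the build-mismatch subclass
        return "skipped: %s" % err
    return "ok"
```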
Source code in varcode/germline.py
varcode.UnsupportedVariantError¶
varcode.UnsupportedVariantError
¶
Bases: ValueError
Raised when an :class:EffectAnnotator is asked to handle a
variant kind outside its declared supports set.
Prefer this over silent mis-annotation — the whole point of the pluggable-annotator design is that callers can see exactly which annotator handles which variant kinds.