API Documentation

Class I MHC ligand prediction package

class mhcflurry.Class1AffinityPredictor(allele_to_allele_specific_models=None, class1_pan_allele_models=None, allele_to_sequence=None, manifest_df=None, allele_to_percent_rank_transform=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

High-level interface for peptide/MHC I binding affinity prediction.

This class manages low-level Class1NeuralNetwork instances, each of which wraps a single Keras network. The purpose of Class1AffinityPredictor is to implement ensembles, handling of multiple alleles, and predictor loading and saving. It also provides a place to keep track of metadata like prediction histograms for percentile rank calibration.

Parameters
allele_to_allele_specific_models : dict of string -> list of Class1NeuralNetwork

Ensemble of single-allele models to use for each allele.

class1_pan_allele_models : list of Class1NeuralNetwork

Ensemble of pan-allele models.

allele_to_sequence : dict of string -> string

MHC allele name to fixed-length amino acid sequence (sometimes referred to as the pseudosequence). Required only if class1_pan_allele_models is specified.

manifest_df : pandas.DataFrame, optional

Must have columns: model_name, allele, config_json, model. Only required if you want to update an existing serialization of a Class1AffinityPredictor. Otherwise this dataframe will be generated automatically based on the supplied models.

allele_to_percent_rank_transform : dict of string -> PercentRankTransform, optional

PercentRankTransform instances to use for each allele.

metadata_dataframes : dict of string -> pandas.DataFrame, optional

Optional additional dataframes to write to the models dir when save() is called. Useful for tracking provenance.

provenance_string : string, optional

Optional info string to use in __str__.
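
For orientation, here is a minimal usage sketch (not part of the package docstrings). It assumes the default models have already been fetched with the mhcflurry-downloads tool:

from mhcflurry import Class1AffinityPredictor

# Load the default downloaded models.
predictor = Class1AffinityPredictor.load()

# Predicted binding affinities in nM; lower means tighter binding.
affinities = predictor.predict(
    peptides=["SIINFEKL", "SIINFEKD"],
    allele="HLA-A0201")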

property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Based on:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

Returns
pandas.DataFrame
clear_cache()[source]

Clear values cached based on the neural networks in this predictor.

Users should call this after mutating any of the following:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

  • self.allele_to_sequence

Methods that mutate these instance variables will call this method on their own if needed.

property neural_networks

List of the neural networks in the ensemble.

Returns
list of Class1NeuralNetwork
classmethod merge(predictors)[source]

Merge the ensembles of two or more Class1AffinityPredictor instances.

Note: the resulting merged predictor will NOT have calibrated percentile ranks. Call calibrate_percentile_ranks on it if these are needed.

Parameters
predictors : sequence of Class1AffinityPredictor
Returns
Class1AffinityPredictor instance
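
A sketch of merging two previously saved predictors (the directory paths here are hypothetical):

from mhcflurry import Class1AffinityPredictor

# Hypothetical directories containing two previously saved predictors.
predictor_a = Class1AffinityPredictor.load("/path/to/models_a")
predictor_b = Class1AffinityPredictor.load("/path/to/models_b")

merged = Class1AffinityPredictor.merge([predictor_a, predictor_b])

# The merged predictor has no calibrated percentile ranks; recalibrate if needed.
merged.calibrate_percentile_ranks()
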
merge_in_place(others)[source]

Add the models present in other predictors into the current predictor.

Parameters
others : list of Class1AffinityPredictor

Other predictors to merge into the current predictor.

Returns
list of string
Names of the newly added models.
property supported_alleles

Alleles for which predictions can be made.

Returns
list of string
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported by all models, inclusive.

Returns
(int, int) tuple
check_consistency()[source]

Verify that self.manifest_df is consistent with:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1NeuralNetwork, along with per-network files giving the model weights. If there are pan-allele predictors in the ensemble, the allele sequences are also stored in the directory. There is also a small file “index.txt” with basic metadata: when the models were trained, by whom, on what host.

Parameters
models_dir : string

Path to directory. It will be created if it doesn’t exist.

model_names_to_write : list of string, optional

Only write the weights for the specified models. Useful for incremental updates during training.

write_metadata : boolean, optional

Whether to write optional metadata.

static load(models_dir=None, max_models=None, optimization_level=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dir : string

Path to directory. If unspecified the default downloaded models are used.

max_models : int, optional

Maximum number of Class1NeuralNetwork instances to load.

optimization_level : int

If >0, model optimization will be attempted. Defaults to the value of the environment variable MHCFLURRY_OPTIMIZATION_LEVEL.

Returns
Class1AffinityPredictor instance
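
A save/load round trip might look like this sketch (the directory path is arbitrary):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()  # default downloaded models

# Writes manifest.csv, per-network weights files, and metadata.
predictor.save("/tmp/my_models")

# Reload from that directory, optionally capping the ensemble size.
reloaded = Class1AffinityPredictor.load("/tmp/my_models", max_models=8)
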
optimize(warn=True)[source]

EXPERIMENTAL: Optimize the predictor for faster predictions.

Currently the only optimization implemented is to merge multiple pan-allele predictors at the tensorflow level.

The optimization is performed in-place, mutating the instance.

Returns
bool

Whether optimization was performed

static model_name(allele, num)[source]

Generate a model name

Parameters
allele : string
num : int
Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dir : string
model_name : string
Returns
string
property master_allele_encoding

An AlleleEncoding containing the universe of alleles specified by self.allele_to_sequence.

Returns
AlleleEncoding
fit_allele_specific_predictors(n_models, architecture_hyperparameters_list, allele, peptides, affinities, inequalities=None, train_rounds=None, models_dir_for_save=None, verbose=0, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more allele specific predictors for a single allele using one or more neural network architectures.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_models : int

Number of neural networks to fit.

architecture_hyperparameters_list : list of dict

List of hyperparameter sets.

allele : string
peptides : EncodableSequences or list of string
affinities : list of float

nM affinities.

inequalities : list of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

train_rounds : sequence of int

Each training point i will be used on training rounds r for which train_rounds[i] > r, r >= 0.

models_dir_for_save : string, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verbose : int

Keras verbosity.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
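
A minimal training sketch with toy data (real training sets are much larger). The "max_epochs" override is an assumption about the supported hyperparameter names; an empty dict is assumed to fall back to the package defaults:

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor()  # start from an empty predictor

networks = predictor.fit_allele_specific_predictors(
    n_models=2,
    architecture_hyperparameters_list=[{"max_epochs": 5}],  # assumed name
    allele="HLA-A0201",
    peptides=["SIINFEKL", "SIINFEKD", "SIINFEKQ", "AAAWYLWEV"],
    affinities=[120.0, 300.0, 500.0, 25000.0])  # nM
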
fit_class1_pan_allele_models(n_models, architecture_hyperparameters, alleles, peptides, affinities, inequalities, models_dir_for_save=None, verbose=1, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more pan-allele predictors using a single neural network architecture.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_models : int

Number of neural networks to fit.

architecture_hyperparameters : dict
alleles : list of string

Allele names (not sequences) corresponding to each peptide.

peptides : EncodableSequences or list of string
affinities : list of float

nM affinities.

inequalities : list of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

models_dir_for_save : string, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verbose : int

Keras verbosity.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
add_pan_allele_model(model, models_dir_for_save=None)[source]

Add a pan-allele model to the ensemble and optionally do an incremental save.

Parameters
model : Class1NeuralNetwork
models_dir_for_save : string

Directory to save the resulting ensemble to.

percentile_ranks(affinities, allele=None, alleles=None, throw=True)[source]

Return percentile ranks for the given ic50 affinities and alleles.

The ‘allele’ and ‘alleles’ arguments are as in the predict method. Specify one of these.

Parameters
affinities : sequence of float

nM affinities.

allele : string
alleles : sequence of string
throw : boolean

If True, a ValueError will be raised in the case of unsupported alleles. If False, a warning will be logged and NaN will be returned for those percentile ranks.

Returns
numpy.array of float
predict(peptides, alleles=None, allele=None, throw=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model (nM) predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptides : EncodableSequences or list of string
alleles : list of string
allele : string
throw : boolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

centrality_measure : string or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargs : dict

Additional keyword arguments to pass to Class1NeuralNetwork.predict.

Returns
numpy.array of predictions
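
For example (a sketch assuming the default models are downloaded):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()

# One allele for all peptides:
predictor.predict(peptides=["SIINFEKL", "SIINFEKD"], allele="HLA-A0201")

# Or one allele per peptide; 'alleles' must match 'peptides' in length:
predictor.predict(
    peptides=["SIINFEKL", "SIINFEKD"],
    alleles=["HLA-A0201", "HLA-B0702"],
    throw=False)  # unsupported inputs yield NaN instead of raising
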
predict_to_dataframe(peptides, alleles=None, allele=None, throw=True, include_individual_model_predictions=False, include_percentile_ranks=True, include_confidence_intervals=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities. Gives more detailed output than predict method, including 5-95% prediction intervals.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptides : EncodableSequences or list of string
alleles : list of string
allele : string
throw : boolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

include_individual_model_predictions : boolean

If True, the predictions of each individual model are included as columns in the result DataFrame.

include_percentile_ranks : boolean, default True

If True, a “prediction_percentile” column will be included giving the percentile ranks. If no percentile rank info is available, this will be ignored with a warning.

centrality_measure : string or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargs : dict

Additional keyword arguments to pass to Class1NeuralNetwork.predict.

Returns
pandas.DataFrame of predictions
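
A sketch of the dataframe output (the column names mentioned in the comment are indicative, not guaranteed):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()

df = predictor.predict_to_dataframe(
    peptides=["SIINFEKL", "SIINFEKD"],
    allele="HLA-A0201",
    include_individual_model_predictions=True)

# Expect columns along the lines of 'peptide', 'allele', 'prediction',
# 'prediction_low', 'prediction_high', and 'prediction_percentile',
# plus one column per ensemble member.
print(df.columns.tolist())
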
calibrate_percentile_ranks(peptides=None, num_peptides_per_length=100000, alleles=None, bins=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]

Compute the cumulative distribution of ic50 values for a set of alleles over a large universe of random peptides, to enable taking quantiles of this distribution later.

Parameters
peptides : sequence of string or EncodableSequences, optional

Peptides to use.

num_peptides_per_length : int, optional

If the peptides argument is not specified, then num_peptides_per_length peptides are randomly sampled from a uniform distribution for each supported length.

alleles : sequence of string, optional

Alleles to perform calibration for. If not specified all supported alleles will be calibrated.

bins : object

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges. This is in ic50 space.

motif_summary : bool

If True, the length distribution and per-position amino acid frequencies are also calculated for the top x fraction of tightest-binding peptides, where each value of x is given in the summary_top_peptide_fractions list.

summary_top_peptide_fractions : list of float

Only used if motif_summary is True.

verbose : boolean

Whether to print status updates to stdout.

model_kwargs : dict

Additional low-level Class1NeuralNetwork.predict() kwargs.

Returns
dict of string -> pandas.DataFrame
If motif_summary is True, this will have keys “frequency_matrices” and
“length_distributions”. Otherwise it will be empty.
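
A calibration sketch using a reduced random-peptide universe for speed (the default is 100,000 peptides per length):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()

predictor.calibrate_percentile_ranks(
    alleles=["HLA-A0201", "HLA-B0702"],
    num_peptides_per_length=10000)

# Percentile ranks are now available for the calibrated alleles.
predictor.percentile_ranks([100.0, 5000.0], allele="HLA-A0201")
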
model_select(score_function, alleles=None, min_models=1, max_models=10000)[source]

Perform model selection using a user-specified scoring function.

This works only with allele-specific models, not pan-allele models.

Model selection is done using a “step up” variable selection procedure, in which models are repeatedly added to an ensemble until the score stops improving.

Parameters
score_function : Class1AffinityPredictor -> float function

Scoring function.

alleles : list of string, optional

If not specified, model selection is performed for all alleles.

min_models : int, optional

Min models to select per allele.

max_models : int, optional

Max models to select per allele.

Returns
Class1AffinityPredictor
Predictor containing the selected models.
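
A sketch of a possible score function; the held-out validation data and the models directory are hypothetical:

import numpy
from mhcflurry import Class1AffinityPredictor

# Hypothetical directory containing allele-specific models.
predictor = Class1AffinityPredictor.load("/path/to/allele_specific_models")

val_peptides = ["SIINFEKL", "SIINFEKD", "AAAWYLWEV"]  # hypothetical
val_affinities = numpy.array([120.0, 300.0, 25000.0])  # hypothetical, nM

def score_function(candidate):
    # Higher is better: negated mean absolute error in log-affinity space.
    predicted = candidate.predict(peptides=val_peptides, allele="HLA-A0201")
    return -numpy.mean(numpy.abs(
        numpy.log(predicted) - numpy.log(val_affinities)))

selected = predictor.model_select(
    score_function, alleles=["HLA-A0201"], max_models=4)
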
class mhcflurry.Class1NeuralNetwork(**hyperparameters)[source]

Bases: object

Low level class I predictor consisting of a single neural network.

Both single allele and pan-allele prediction are supported.

Users will generally use Class1AffinityPredictor, which gives a higher-level interface and supports ensembles.

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

miscelaneous_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Miscellaneous hyperparameters. These parameters are not used by this class but may be interpreted by other code.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Combined set of all supported hyperparameters and their default values.

hyperparameter_renames = {'embedding_init_method': None, 'embedding_input_dim': None, 'embedding_output_dim': None, 'kmer_size': None, 'left_edge': None, 'min_delta': None, 'mode': None, 'monitor': None, 'peptide_amino_acid_encoding': None, 'pseudosequence_use_embedding': None, 'right_edge': None, 'take_best_epoch': None, 'use_embedding': None, 'verbose': None}
classmethod apply_hyperparameter_renames(hyperparameters)[source]

Handle hyperparameter renames.

Parameters
hyperparameters : dict
Returns
dict
Updated hyperparameters.
KERAS_MODELS_CACHE = {}

Process-wide keras model cache, a map from: architecture JSON string to (Keras model, existing network weights)

classmethod clear_model_cache()[source]

Clear the Keras model cache.

classmethod borrow_cached_network(network_json, network_weights)[source]

Return a keras Model with the specified architecture and weights. As an optimization, when possible this will reuse architectures from a process-wide cache.

The returned object is “borrowed” in the sense that its weights can change later after subsequent calls to this method from other objects.

If you’re using this from a parallel implementation you’ll need to hold a lock while using the returned object.

Parameters
network_json : string of JSON
network_weights : list of numpy.array
Returns
keras.models.Model
network(borrow=False)[source]

Return the keras model associated with this predictor.

Parameters
borrow : bool

Whether to return a cached model if possible. See borrow_cached_network for details.

Returns
keras.models.Model
update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance’s neural network.

static keras_network_cache_key(network_json)[source]

Given a Keras JSON description of a neural network, return a key that uniquely defines this network. Networks that share the same key should have compatible weights matrices and give the same prediction outputs when their weights are the same.

Parameters
network_json : string
Returns
string
get_config()[source]

Serialize all attributes except model weights to a dict.

Returns
dict
classmethod from_config(config, weights=None, weights_loader=None)[source]

Deserialize from a dict returned by get_config().

Parameters
config : dict
weights : list of array, optional

Network weights to restore.

weights_loader : callable, optional

Function to call (no arguments) to load weights when needed.

Returns
Class1NeuralNetwork
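
A serialization round trip through get_config()/get_weights()/from_config() might look like this sketch:

from mhcflurry import Class1AffinityPredictor, Class1NeuralNetwork

predictor = Class1AffinityPredictor.load()
network = predictor.neural_networks[0]

config = network.get_config()    # everything except the weights
weights = network.get_weights()  # list of per-layer numpy arrays

restored = Class1NeuralNetwork.from_config(config, weights=weights)
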
load_weights()[source]

Load weights by evaluating self.network_weights_loader, if needed.

After calling this, self.network_weights_loader will be None and self.network_weights will be the weights list, if available.

get_weights()[source]

Get the network weights.

Returns
list of numpy.array giving weights for each layer, or None if there is no network
peptides_to_network_input(peptides)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
peptides : EncodableSequences or list of string
Returns
numpy.array
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported, inclusive.

Returns
(int, int) tuple
allele_encoding_to_network_input(allele_encoding)[source]

Encode alleles to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
allele_encoding : AlleleEncoding
Returns
(numpy.array, numpy.array)
Indices and allele representations.
static data_dependent_weights_initialization(network, x_dict=None, method='lsuv', verbose=1)[source]

Data dependent weights initialization.

Parameters
network : keras.Model
x_dict : dict of string -> numpy.ndarray

Training data as would be passed to keras.Model.fit().

method : string

Initialization method. Currently only “lsuv” is supported.

verbose : int

Status updates are printed to stdout if verbose > 0.

fit_generator(generator, validation_peptide_encoding, validation_affinities, validation_allele_encoding=None, validation_inequalities=None, validation_output_indices=None, steps_per_epoch=10, epochs=1000, min_epochs=0, patience=10, min_delta=0.0, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit using a generator. Does not support many of the features of fit(), such as random negative peptides.

Fitting proceeds until early stopping is hit, using the peptides, affinities, etc. given by the parameters starting with “validation_”.

This is used for pre-training pan-allele models using data synthesized by the allele-specific models.

Parameters
generator : generator yielding (alleles, peptides, affinities) tuples

where alleles and peptides are lists of strings, and affinities is a list of floats.

validation_peptide_encoding : EncodableSequences
validation_affinities : list of float
validation_allele_encoding : AlleleEncoding
validation_inequalities : list of string
validation_output_indices : list of int
steps_per_epoch : int
epochs : int
min_epochs : int
patience : int
min_delta : float
verbose : int
progress_callback : thunk
progress_preamble : string
progress_print_interval : float
fit(peptides, affinities, allele_encoding=None, inequalities=None, output_indices=None, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
peptides : EncodableSequences or list of string
affinities : list of float

nM affinities. Must be the same length as peptides.

allele_encoding : AlleleEncoding

If not specified, the model will be a single-allele predictor.

inequalities : list of string, each element one of “>”, “<”, or “=”

Inequalities to use for fitting. Same length as affinities. For example, a “>” will train on y_pred > y_true for that element in the training set. Requires using a custom loss that supports inequalities (e.g. mse_with_inequalities). If None, all inequalities are taken to be “=”.

output_indices : list of int

For multi-output models only. Same length as affinities. Indicates the index of the output (starting from 0) for each training example.

sample_weights : list of float

If not specified, all samples (including random negatives added during training) will have equal weight. If specified, the random negatives will be assigned weight=1.0.

shuffle_permutation : list of int

Permutation (integer list) of the same length as peptides and affinities. If None, a random permutation will be generated.

verbose : int

Keras verbosity level.

progress_callback : function

No-argument function to call after each epoch.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress updates. Set to None to disable.
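
A minimal fitting sketch on toy data, with two measurements treated as lower bounds via inequalities (the default hyperparameters are assumed to use a loss that supports inequalities):

from mhcflurry import Class1NeuralNetwork

network = Class1NeuralNetwork()  # default hyperparameters

network.fit(
    peptides=["SIINFEKL", "SIINFEKD", "AAAWYLWEV", "QLLNFDLLK"],
    affinities=[120.0, 300.0, 20000.0, 50000.0],  # nM
    # "=" means measured exactly; ">" means the true affinity is weaker
    # than the stated value (requires e.g. mse_with_inequalities).
    inequalities=["=", "=", ">", ">"],
    verbose=0)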

predict(peptides, allele_encoding=None, batch_size=4096, output_index=0)[source]

Predict affinities.

If peptides are specified as EncodableSequences, then the predictions will be cached for this predictor as long as the EncodableSequences object remains in memory. The cache is keyed on the object identity of the EncodableSequences, not the sequences themselves. The cache is used only for allele-specific models (i.e. when allele_encoding is None).

Parameters
peptides : EncodableSequences or list of string
allele_encoding : AlleleEncoding, optional

Only required when this model is a pan-allele model.

batch_size : int

batch_size passed to Keras.

output_index : int or None

For multi-output models. Gives the output index to return. If set to None, then all outputs are returned as a samples x outputs matrix.

Returns
numpy.array of nM affinity predictions
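
A sketch of the caching behavior; the models directory is hypothetical, and the network is assumed to be allele-specific (a pan-allele network would additionally need allele_encoding):

from mhcflurry import Class1AffinityPredictor
from mhcflurry.encodable_sequences import EncodableSequences

predictor = Class1AffinityPredictor.load("/path/to/allele_specific_models")
network = predictor.neural_networks[0]

# Wrapping peptides in EncodableSequences lets the per-object prediction
# cache take effect across repeated calls on the same object.
peptides = EncodableSequences.create(["SIINFEKL", "SIINFEKD"])
network.predict(peptides)  # computed and cached
network.predict(peptides)  # served from the cache
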
classmethod merge(models, merge_method='average')[source]

Merge multiple models at the tensorflow (or other backend) level.

Only certain neural network architectures support merging. Others will result in a NotImplementedError.

Parameters
models : list of Class1NeuralNetwork

Instances to merge.

merge_method : string, one of “average”, “sum”, or “concatenate”

How to merge the predictions of the different models.

Returns
Class1NeuralNetwork

The merged neural network

make_network(peptide_encoding, allele_amino_acid_encoding, allele_dense_layer_sizes, peptide_dense_layer_sizes, peptide_allele_merge_method, peptide_allele_merge_activation, layer_sizes, dense_layer_l1_regularization, dense_layer_l2_regularization, activation, init, output_activation, dropout_probability, batch_normalization, locally_connected_layers, topology, num_outputs=1, allele_representations=None)[source]

Helper function to make a keras network for class 1 affinity prediction.

clear_allele_representations()[source]

Set allele representations to an empty array. Useful before saving to save a smaller version of the model.

set_allele_representations(allele_representations, force_surgery=False)[source]

Set the allele representations in use by this model. This means mutating the weights for the allele input embedding layer.

Rationale: instead of passing in the allele sequence for each data point during model training or prediction (which is expensive in terms of memory usage), we pass in an allele index between 0 and n-1, where n is the number of alleles in some universe of possible alleles. This index is used in the model to look up the corresponding allele sequence. This function sets the lookup table.

See also: AlleleEncoding.allele_representations()

Parameters
allele_representations : numpy.ndarray of shape (a, l, m)

where a is the total number of alleles, l is the allele sequence length, and m is the length of the vectors used to represent amino acids.

class mhcflurry.Class1ProcessingPredictor(models, manifest_df=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

User-facing interface to antigen processing prediction.

Delegates to an ensemble of Class1ProcessingNeuralNetwork instances.

Instantiate a new Class1ProcessingPredictor

Users will generally call load() to restore a saved predictor rather than using this constructor.

Parameters
models : list of Class1ProcessingNeuralNetwork

Neural networks in the ensemble.

manifest_df : pandas.DataFrame

Manifest dataframe. If not specified a new one will be created when needed.

metadata_dataframes : dict of string -> pandas.DataFrame

Arbitrary metadata associated with this predictor.

provenance_string : string, optional

Optional info string to use in __str__.

property sequence_lengths

Supported maximum sequence lengths.

Passing a peptide longer than the maximum supported length results in an error.

Passing an N- or C-flank sequence longer than the maximum supported length results in some part of it being ignored.

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
add_models(models)[source]

Add models to the ensemble (in-place).

Parameters
models : list of Class1ProcessingNeuralNetwork
Returns
list of string
Names of the new models.
property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Returns
pandas.DataFrame
static model_name(num)[source]

Generate a model name

Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dir : string
model_name : string
Returns
string
predict(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptides : list of string

Peptide sequences.

n_flanks : list of string

Upstream sequence before each peptide.

c_flanks : list of string

Downstream sequence after each peptide.

throw : boolean

If True, a ValueError will be raised in the case of unsupported peptides. If False, a warning will be logged and the predictions for those peptides will be NaN.

batch_size : int

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
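
A usage sketch (it assumes the processing models have been downloaded; flanking sequences are amino acids):

from mhcflurry import Class1ProcessingPredictor

predictor = Class1ProcessingPredictor.load()

scores = predictor.predict(
    peptides=["SIINFEKL", "KSIINFEKL"],
    n_flanks=["MNSAG", "GSSQK"],   # upstream of each peptide
    c_flanks=["LLLVV", "EDKEQ"])   # downstream of each peptide
# scores is a numpy array in [0, 1]; higher means more favorable processing.
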
predict_to_dataframe(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for parameter descriptions.

Returns
pandas.DataFrame
Processing predictions are in the “score” column. Also includes
peptides and flanking sequences.
predict_to_dataframe_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for more information.

Parameters
sequences : FlankingEncoding
batch_size : int
throw : boolean
Returns
pandas.DataFrame
check_consistency()[source]

Verify that self.manifest_df is consistent with instance variables.

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1ProcessingNeuralNetwork, along with per-network files giving the model weights.

Parameters
models_dir : string

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dir : string

Path to directory. If unspecified the default downloaded models are used.

max_models : int, optional

Maximum number of models to load.

Returns
Class1ProcessingPredictor instance
class mhcflurry.Class1ProcessingNeuralNetwork(**hyperparameters)[source]

Bases: object

A neural network for antigen processing prediction

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters. Any values supported by keras may be used.

auxiliary_input_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Allele feature hyperparameters.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Combined set of all supported hyperparameters and their default values.

property sequence_lengths

Supported maximum sequence lengths.

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
network()[source]

Return the keras model associated with this network.

update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance’s neural network.

fit(sequences, targets, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
sequences : FlankingEncoding

Peptides and upstream/downstream flanking sequences.

targets : list of float

1 indicates hit, 0 indicates decoy.

sample_weights : list of float

If not specified all samples have equal weight.

shuffle_permutation : list of int

Permutation (integer list) of the same length as the training data. If None, a random permutation will be generated.

verbose : int

Keras verbosity level.

progress_callback : function

No-argument function to call after each epoch.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress updates. Set to None to disable.

predict(peptides, n_flanks=None, c_flanks=None, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptides : list of string

Peptide sequences.

n_flanks : list of string

Upstream sequence before each peptide.

c_flanks : list of string

Downstream sequence after each peptide.

batch_size : int

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
predict_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
sequences : FlankingEncoding

Peptides and flanking sequences.

throw : boolean

Whether to throw an exception on unsupported peptides.

batch_size : int

Prediction keras batch size.

Returns
numpy.array
network_input(sequences, throw=True)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
sequences : FlankingEncoding

Peptides and flanking sequences.

throw : boolean

Whether to throw an exception on unsupported peptides.

Returns
numpy.array
make_network(amino_acid_encoding, peptide_max_length, n_flank_length, c_flank_length, flanking_averages, convolutional_filters, convolutional_kernel_size, convolutional_activation, convolutional_kernel_l1_l2, dropout_rate, post_convolutional_dense_layer_sizes)[source]

Helper function to make a keras network given hyperparameters.

get_weights()[source]

Get the network weights.

Returns
list of numpy.array giving weights for each layer, or None if there is no network
get_config()[source]

Serialize all attributes except model weights to a dict.

Returns
dict
classmethod from_config(config, weights=None)[source]

Deserialize from a dict returned by get_config().

Parameters
config : dict
weights : list of array, optional

Network weights to restore.

Returns
Class1ProcessingNeuralNetwork
class mhcflurry.Class1PresentationPredictor(affinity_predictor=None, processing_predictor_with_flanks=None, processing_predictor_without_flanks=None, weights_dataframe=None, metadata_dataframes=None, percent_rank_transform=None, provenance_string=None)[source]

Bases: object

A logistic regression model over predicted binding affinity (BA) and antigen processing (AP) score.

Instances of this class delegate to Class1AffinityPredictor and Class1ProcessingPredictor instances to generate BA and AP predictions. These predictions are combined using a logistic regression model to give a “presentation score” prediction.

Most users will call the load static method to get an instance of this class, then call the predict method to generate predictions.

model_inputs = ['affinity_score', 'processing_score']

property supported_alleles

List of alleles supported by the underlying Class1AffinityPredictor

property supported_peptide_lengths

(min, max) of supported peptide lengths, inclusive.

property supports_affinity_prediction

Is there an affinity predictor associated with this instance?

property supports_processing_prediction

Is there a processing predictor associated with this instance?

property supports_presentation_prediction

Can this instance predict presentation?

predict_affinity(peptides, alleles, sample_names=None, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict binding affinities across samples (each corresponding to up to six MHC I alleles).

Two modes are supported: each peptide can be evaluated for binding to any of the alleles in any sample (this is what happens when sample_names is None), or the i’th peptide can be evaluated for binding the alleles of the sample given by the i’th entry in sample_names.

For example, if we don’t specify sample_names, then predictions are taken for all combinations of samples and peptides, for a result size of num peptides * num samples:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample1  11927.161       A0201                6.296
1   PEPTIDE            1     sample1  32507.083       A0201               71.249
2  SIINFEKL            0     sample2   2725.593       C0202                6.662
3   PEPTIDE            1     sample2  28304.330       C0202               54.652

In contrast, here we specify sample_names, so each peptide is evaluated for binding the alleles in the corresponding sample, for a result size equal to the number of peptides:

>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    sample_names=["sample2", "sample1"],
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample2   2725.592       C0202                6.662
1   PEPTIDE            1     sample1  32507.079       A0201               71.249
Parameters
peptides : list of string

Peptide sequences.

alleles : dict of string -> list of string

Keys are sample names, values are the alleles (genotype) for that sample.

sample_names : list of string [same length as peptides]

Sample names corresponding to each peptide. If None, then predictions are generated for all sample genotypes across all peptides.

include_affinity_percentile : bool

Whether to include affinity percentile ranks.

verbose : int

Set to 0 for quiet.

throw : boolean

Whether to throw an exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFrame
Predictions.
predict_processing(peptides, n_flanks=None, c_flanks=None, throw=True, verbose=1)[source]

Predict antigen processing scores for individual peptides, optionally including flanking sequences for better cleavage prediction.

Parameters
peptides : list of string
n_flanks : list of string [same length as peptides]
c_flanks : list of string [same length as peptides]
throw : boolean

Whether to raise an exception on unsupported peptides.

verbose : int
Returns
numpy.array
Antigen processing scores for each peptide.
fit(targets, peptides, sample_names, alleles, n_flanks=None, c_flanks=None, verbose=1)[source]

Fit the presentation score logistic regression model.

Parameters
targets : list of int/float

1 indicates hit, 0 indicates decoy.

peptides : list of string [same length as targets]
sample_names : list of string [same length as targets]
alleles : dict of string -> list of string

Keys are sample names, values are the alleles for that sample.

n_flanks : list of string [same length as targets]
c_flanks : list of string [same length as targets]
verbose : int
get_model(name=None)[source]

Load or instantiate a new logistic regression model. Private helper method.

Parameters
name : string

If None (the default), an un-fit LR model is returned. Otherwise the weights are loaded for the specified model.

Returns
sklearn.linear_model.LogisticRegression
predict(peptides, alleles, sample_names=None, n_flanks=None, c_flanks=None, include_affinity_percentile=False, verbose=1, throw=True)[source]

Predict presentation scores across a set of peptides.

Presentation scores combine predictions for MHC I binding affinity and antigen processing.

This method returns a pandas.DataFrame giving presentation scores plus the binding affinity and processing predictions and other intermediate results.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    n_flanks=["NNN", "SNS"],
...    c_flanks=["CCC", "CNC"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide n_flank c_flank  peptide_num sample_name   affinity best_allele  processing_score  presentation_score  presentation_percentile
0  SIINFEKL     NNN     CCC            0     sample1  11927.161       A0201             0.838               0.145                    2.282
1   PEPTIDE     SNS     CNC            1     sample1  32507.083       A0201             0.025               0.003                  100.000
2  SIINFEKL     NNN     CCC            0     sample2   2725.593       C0202             0.838               0.416                    1.017
3   PEPTIDE     SNS     CNC            1     sample2  28304.330       C0202             0.025               0.003                   99.287

You can also specify sample_names, in which case each peptide is evaluated for binding the alleles in the corresponding sample only. See predict_affinity for an example.

Parameters
peptides : list of string

Peptide sequences.

alleles : list of string or dict of string -> list of string

If you are predicting for a single sample, pass a list of strings (up to 6) indicating the genotype. If you are predicting across multiple samples, pass a dict where the keys are (arbitrary) sample names and the values are the alleles to predict for that sample. Set to an empty list or dict to perform processing prediction only.

sample_names : list of string [same length as peptides]

If you are passing a dict for ‘alleles’, you can use this argument to specify which peptides go with which samples. If it is None, then predictions will be performed for each peptide across all samples.

n_flanks : list of string [same length as peptides]

Upstream sequences before the peptide. Sequences of any length can be given and a suffix of the size supported by the model will be used.

c_flanks : list of string [same length as peptides]

Downstream sequences after the peptide. Sequences of any length can be given and a prefix of the size supported by the model will be used.

include_affinity_percentile : bool

Whether to include affinity percentile ranks.

verbose : int

Set to 0 for quiet.

throw : boolean

Whether to throw an exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFrame
Presentation scores and intermediate results.
predict_sequences(sequences, alleles, result='best', comparison_quantity=None, filter_value=None, peptide_lengths=[8, 9, 10, 11], use_flanks=True, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict presentation across protein sequences.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_sequences(
...    sequences={
...        'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...        'protein2': "SSLPTPEDKEQAQQTHH",
...    },
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    result="filtered",
...    comparison_quantity="affinity",
...    filter_value=500,
...    verbose=0)
  sequence_name  pos     peptide n_flank c_flank sample_name  affinity best_allele  affinity_percentile  processing_score  presentation_score  presentation_percentile
0      protein1   14   LLLVVSNLL   GSRLL             sample1    57.180       A0201                0.398             0.233               0.754                    0.351
1      protein1   13   LLLLVVSNL   KGSRL       L     sample1    57.339       A0201                0.398             0.031               0.586                    0.643
2      protein1    5   SSQKGSRLL   MDSKG   LLLVV     sample2   110.779       C0202                0.782             0.061               0.456                    0.920
3      protein1    6   SQKGSRLLL   DSKGS   LLVVS     sample2   254.480       C0202                1.735             0.102               0.303                    1.356
4      protein1   13  LLLLVVSNLL   KGSRL             sample1   260.390       A0201                1.012             0.158               0.345                    1.215
5      protein1   12  LLLLLVVSNL   QKGSR       L     sample1   308.150       A0201                1.094             0.015               0.206                    1.802
6      protein2    0   SSLPTPEDK           EQAQQ     sample2   410.354       C0202                2.398             0.003               0.158                    2.155
7      protein1    5    SSQKGSRL   MDSKG   LLLLV     sample2   444.321       C0202                2.512             0.026               0.159                    2.138
8      protein2    0   SSLPTPEDK           EQAQQ     sample1   459.296       A0301                0.971             0.003               0.144                    2.292
9      protein1    4   GSSQKGSRL    MDSK   LLLLV     sample2   469.052       C0202                2.595             0.014               0.146                    2.261
Parameters
sequences : str, list of string, or string -> string dict

Protein sequences. If a dict is given, the keys are arbitrary (e.g. protein names), and the values are the amino acid sequences.

alleles : list of string, list of list of string, or dict of string -> list of string

MHC I alleles. Can be: (1) a string (a single allele), (2) a list of strings (a single genotype), (3) a list of lists of strings (multiple genotypes, where the total number of genotypes must equal the number of sequences), or (4) a dict giving multiple genotypes, which will each be run over the sequences.

result : string

Specify ‘best’ to return the strongest peptide for each sequence, ‘all’ to return predictions for all peptides, or ‘filtered’ to return predictions where the comparison_quantity is stronger (i.e. (<) for affinity, (>) for scores) than filter_value.

comparison_quantity : string

One of “presentation_score”, “processing_score”, “affinity”, or “affinity_percentile”. Prediction to use to rank (if result is “best”) or filter (if result is “filtered”) results. Default is “presentation_score”.

filter_value : float

Threshold value to use, only relevant when result is “filtered”. If comparison_quantity is “affinity”, then all results less than (i.e. tighter than) the specified nM affinity are retained. If it’s “presentation_score” or “processing_score” then results greater than the indicated filter_value are retained.

peptide_lengths : list of int

Peptide lengths to predict for.

use_flanks : bool

Whether to include flanking sequences when running the AP predictor (for better cleavage prediction).

include_affinity_percentile : bool

Whether to include affinity percentile ranks in output.

verbose : int

Set to 0 for quiet mode.

throw : boolean

Whether to throw exceptions (vs. log warnings) on invalid inputs.

Returns
pandas.DataFrame with columns:

peptide, n_flank, c_flank, sequence_name, affinity, best_allele, processing_score, presentation_score

save(models_dir, write_affinity_predictor=True, write_processing_predictor=True, write_weights=True, write_percent_ranks=True, write_info=True, write_metdata=True)[source]

Save the predictor to a directory on disk. If the directory does not exist it will be created.

The wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances are included in the saved data.

Parameters
models_dir : string

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

This will also load the wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances.

Parameters
models_dir : string

Path to directory. If unspecified the default downloaded models are used.

max_models : int, optional

Maximum number of affinity and processing models (counted separately) to load.

Returns
Class1PresentationPredictor instance
percentile_ranks(presentation_scores, throw=True)[source]

Return percentile ranks for the given presentation scores.

Parameters
presentation_scores : sequence of float
Returns
numpy.array of float
calibrate_percentile_ranks(scores, bins=None)[source]

Compute the cumulative distribution of scores, to enable taking quantiles of this distribution later.

Parameters
scores : sequence of float

Presentation prediction scores.

bins : object

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges.

Submodules

mhcflurry.allele_encoding module

class mhcflurry.allele_encoding.AlleleEncoding(alleles=None, allele_to_sequence=None, borrow_from=None)[source]

Bases: object

A place to cache encodings for a sequence of alleles.

We frequently work with alleles by integer indices, for example as inputs to neural networks. This class is used to map allele names to integer indices in a consistent way by keeping track of the universe of alleles under use, i.e. a distinction is made between the universe of supported alleles (what’s in allele_to_sequence) and the actual set of alleles used for some task (what’s in alleles).

Parameters
alleles : list of string

Allele names. If any allele is None instead of a string, it will be mapped to the special index value -1.

allele_to_sequence : dict of str -> str

Allele name to amino acid sequence.

borrow_from : AlleleEncoding, optional

If specified, do not specify allele_to_sequence. The sequences from the provided instance are used. This guarantees that the mappings from allele to index and from allele to sequence are the same between the instances.

compact()[source]

Return a new AlleleEncoding in which the universe of supported alleles is only the alleles actually used.

Returns
AlleleEncoding
allele_representations(encoding_name)[source]

Encode the universe of supported allele sequences to a matrix.

Parameters
encoding_name : string

How to represent amino acids. Valid names are “BLOSUM62” or “one-hot”. See amino_acid.ENCODING_DATA_FRAMES.

Returns
numpy.array of shape (num alleles in universe, sequence length, vector size)

where vector size is usually 21 (20 amino acids + X character).
fixed_length_vector_encoded_sequences(encoding_name)[source]

Encode allele sequences (not the universe of alleles) to a matrix.

Parameters
encoding_name : string

How to represent amino acids. Valid names are “BLOSUM62” or “one-hot”. See amino_acid.ENCODING_DATA_FRAMES.

Returns
numpy.array with shape (num alleles, sequence length, vector size)

where vector size is usually 21 (20 amino acids + X character).
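
A usage sketch. It assumes a loaded pan-allele predictor exposes its allele_to_sequence mapping, and that allele names are written in the same format as that mapping’s keys:

from mhcflurry import Class1AffinityPredictor
from mhcflurry.allele_encoding import AlleleEncoding

predictor = Class1AffinityPredictor.load()

encoding = AlleleEncoding(
    alleles=["HLA-A*02:01", "HLA-B*07:02", "HLA-A*02:01"],  # repeats are fine
    allele_to_sequence=predictor.allele_to_sequence)

# Matrix over the full universe of supported alleles:
universe = encoding.allele_representations("BLOSUM62")

# Matrix over just the three alleles listed above:
per_allele = encoding.fixed_length_vector_encoded_sequences("BLOSUM62")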

mhcflurry.amino_acid module

Functions for encoding fixed length sequences of amino acids into various vector representations, such as one-hot and BLOSUM62.

mhcflurry.amino_acid.available_vector_encodings()[source]

Return list of supported amino acid vector encodings.

Returns
list of string
mhcflurry.amino_acid.vector_encoding_length(name)[source]

Return the length of the given vector encoding.

Parameters
name : string
Returns
int
mhcflurry.amino_acid.index_encoding(sequences, letter_to_index_dict)[source]

Encode a sequence of same-length strings to a matrix of integers of the same shape. The map from characters to integers is given by letter_to_index_dict.

Given a sequence of n strings all of length k, return an n * k array where the (i, j)th element is letter_to_index_dict[sequences[i][j]].

Parameters
sequences : list of length n of strings of length k
letter_to_index_dict : dict
Returns
numpy.array of integers with shape (n, k)
mhcflurry.amino_acid.fixed_vectors_encoding(index_encoded_sequences, letter_to_vector_df)[source]

Given an n x k matrix of integers such as that returned by index_encoding() and a dataframe mapping each index to an arbitrary vector, return an n * k * m array where the (i, j)’th element is letter_to_vector_df.iloc[index_encoded_sequences[i][j]].

The dataframe index and column names are ignored here; the indexing is done entirely by integer position in the dataframe.

Parameters
index_encoded_sequences : n x k array of integers
letter_to_vector_df : pandas.DataFrame of shape (alphabet size, m)
Returns
numpy.array with shape (n, k, m)
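
A sketch chaining the two functions; AMINO_ACID_INDEX is assumed to be the module’s letter-to-index dict (any such dict works):

from mhcflurry import amino_acid

peptides = ["SIINFEKL", "QLLNFDLL"]  # must all have the same length

# Letters -> integer indices, shape (2, 8).
index_encoded = amino_acid.index_encoding(
    peptides, amino_acid.AMINO_ACID_INDEX)  # assumed dict name

# Indices -> BLOSUM62 vectors, shape (2, 8, vector size).
vectors = amino_acid.fixed_vectors_encoding(
    index_encoded, amino_acid.ENCODING_DATA_FRAMES["BLOSUM62"])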

mhcflurry.calibrate_percentile_ranks_command module

Calibrate percentile ranks for models. Runs in-place.

mhcflurry.calibrate_percentile_ranks_command.run(argv=sys.argv[1:])[source]
mhcflurry.calibrate_percentile_ranks_command.run_class1_presentation_predictor(args, peptides)[source]
mhcflurry.calibrate_percentile_ranks_command.run_class1_affinity_predictor(args, peptides)[source]
mhcflurry.calibrate_percentile_ranks_command.do_class1_affinity_calibrate_percentile_ranks(alleles, constant_data={})[source]
mhcflurry.calibrate_percentile_ranks_command.class1_affinity_calibrate_percentile_ranks(allele, predictor, peptides=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]

mhcflurry.class1_affinity_predictor module

class mhcflurry.class1_affinity_predictor.Class1AffinityPredictor(allele_to_allele_specific_models=None, class1_pan_allele_models=None, allele_to_sequence=None, manifest_df=None, allele_to_percent_rank_transform=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

High-level interface for peptide/MHC I binding affinity prediction.

This class manages low-level Class1NeuralNetwork instances, each of which wraps a single Keras network. The purpose of Class1AffinityPredictor is to implement ensembles, handling of multiple alleles, and predictor loading and saving. It also provides a place to keep track of metadata like prediction histograms for percentile rank calibration.

Parameters
allele_to_allele_specific_modelsdict of string -> list of Class1NeuralNetwork

Ensemble of single-allele models to use for each allele.

class1_pan_allele_modelslist of Class1NeuralNetwork

Ensemble of pan-allele models.

allele_to_sequencedict of string -> string

MHC allele name to fixed-length amino acid sequence (sometimes referred to as the pseudosequence). Required only if class1_pan_allele_models is specified.

manifest_dfpandas.DataFrame, optional

Must have columns: model_name, allele, config_json, model. Only required if you want to update an existing serialization of a Class1AffinityPredictor. Otherwise this dataframe will be generated automatically based on the supplied models.

allele_to_percent_rank_transformdict of string -> PercentRankTransform, optional

PercentRankTransform instances to use for each allele

metadata_dataframesdict of string -> pandas.DataFrame, optional

Optional additional dataframes to write to the models dir when save() is called. Useful for tracking provenance.

provenance_stringstring, optional

Optional info string to use in __str__.

property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Based on: - self.class1_pan_allele_models - self.allele_to_allele_specific_models

Returns
pandas.DataFrame
clear_cache()[source]

Clear values cached based on the neural networks in this predictor.

Users should call this after mutating any of the following:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

  • self.allele_to_sequence

Methods that mutate these instance variables will call this method on their own if needed.

property neural_networks

List of the neural networks in the ensemble.

Returns
list of Class1NeuralNetwork
classmethod merge(predictors)[source]

Merge the ensembles of two or more Class1AffinityPredictor instances.

Note: the resulting merged predictor will NOT have calibrated percentile ranks. Call calibrate_percentile_ranks on it if these are needed.

Parameters
predictorssequence of Class1AffinityPredictor
Returns
Class1AffinityPredictor instance
merge_in_place(others)[source]

Add the models present in other predictors into the current predictor.

Parameters
otherslist of Class1AffinityPredictor

Other predictors to merge into the current predictor.

Returns
list of stringnames of newly added models
property supported_alleles

Alleles for which predictions can be made.

Returns
list of string
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported by all models, inclusive.

Returns
(int, int) tuple
check_consistency()[source]

Verify that self.manifest_df is consistent with: - self.class1_pan_allele_models - self.allele_to_allele_specific_models

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1NeuralNetwork, along with per-network files giving the model weights. If there are pan-allele predictors in the ensemble, the allele sequences are also stored in the directory. There is also a small file “index.txt” with basic metadata: when the models were trained, by whom, on what host.

Parameters
models_dirstring

Path to directory. It will be created if it doesn’t exist.

model_names_to_writelist of string, optional

Only write the weights for the specified models. Useful for incremental updates during training.

write_metadataboolean, optional

Whether to write optional metadata

static load(models_dir=None, max_models=None, optimization_level=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dirstring

Path to directory. If unspecified the default downloaded models are used.

max_modelsint, optional

Maximum number of Class1NeuralNetwork instances to load

optimization_levelint

If >0, model optimization will be attempted. Defaults to value of environment variable MHCFLURRY_OPTIMIZATION_LEVEL.

Returns
Class1AffinityPredictor instance
optimize(warn=True)[source]

EXPERIMENTAL: Optimize the predictor for faster predictions.

Currently the only optimization implemented is to merge multiple pan- allele predictors at the tensorflow level.

The optimization is performed in-place, mutating the instance.

Returns
bool

Whether optimization was performed

static model_name(allele, num)[source]

Generate a model name

Parameters
allelestring
numint
Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dirstring
model_namestring
Returns
string
property master_allele_encoding

An AlleleEncoding containing the universe of alleles specified by self.allele_to_sequence.

Returns
AlleleEncoding
fit_allele_specific_predictors(n_models, architecture_hyperparameters_list, allele, peptides, affinities, inequalities=None, train_rounds=None, models_dir_for_save=None, verbose=0, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more allele specific predictors for a single allele using one or more neural network architectures.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_modelsint

Number of neural networks to fit

architecture_hyperparameters_listlist of dict

List of hyperparameter sets.

allelestring
peptidesEncodableSequences or list of string
affinitieslist of float

nM affinities

inequalitieslist of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

train_roundssequence of int

Each training point i will be used on training rounds r for which train_rounds[i] > r, r >= 0.

models_dir_for_savestring, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verboseint

Keras verbosity

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
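
A minimal sketch of fitting a single-network, single-allele ensemble. The training data is illustrative only, and the empty hyperparameter dict is assumed to fall back to the package defaults:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor()
    predictor.fit_allele_specific_predictors(
        n_models=1,
        architecture_hyperparameters_list=[{}],  # assumption: defaults apply
        allele="HLA-A*02:01",
        peptides=["SIINFEKL", "SIINFEKD", "SIINFEKQ"],  # illustrative
        affinities=[100.0, 25000.0, 30000.0],           # nM
    )
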
fit_class1_pan_allele_models(n_models, architecture_hyperparameters, alleles, peptides, affinities, inequalities, models_dir_for_save=None, verbose=1, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more pan-allele predictors using a single neural network architecture.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_modelsint

Number of neural networks to fit

architecture_hyperparametersdict
alleleslist of string

Allele names (not sequences) corresponding to each peptide

peptidesEncodableSequences or list of string
affinitieslist of float

nM affinities

inequalitieslist of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

models_dir_for_savestring, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verboseint

Keras verbosity

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
add_pan_allele_model(model, models_dir_for_save=None)[source]

Add a pan-allele model to the ensemble and optionally do an incremental save.

Parameters
modelClass1NeuralNetwork
models_dir_for_savestring

Directory to save resulting ensemble to

percentile_ranks(affinities, allele=None, alleles=None, throw=True)[source]

Return percentile ranks for the given ic50 affinities and alleles.

The ‘allele’ and ‘alleles’ argument are as in the predict method. Specify one of these.

Parameters
affinitiessequence of float

nM affinities

allelestring
allelessequence of string
throwboolean

If True, a ValueError will be raised in the case of unsupported alleles. If False, a warning will be logged and NaN will be returned for those percentile ranks.

Returns
numpy.array of float
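
A sketch using the default downloaded models, which ship with calibrated percentile ranks:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    ranks = predictor.percentile_ranks(
        [100.0, 5000.0],          # nM affinities
        allele="HLA-A*02:01")
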
predict(peptides, alleles=None, allele=None, throw=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model (nM) predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptidesEncodableSequences or list of string
alleleslist of string
allelestring
throwboolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

centrality_measurestring or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargsdict

Additional keyword arguments to pass to Class1NeuralNetwork.predict

Returns
numpy.array of predictions
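
For example, a sketch using the default downloaded models:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    affinities = predictor.predict(
        peptides=["SIINFEKL", "SYFPEITHI"],
        allele="HLA-A*02:01")
    # affinities is a numpy array of predicted nM values, one per peptide.
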
predict_to_dataframe(peptides, alleles=None, allele=None, throw=True, include_individual_model_predictions=False, include_percentile_ranks=True, include_confidence_intervals=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities. Gives more detailed output than predict method, including 5-95% prediction intervals.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptidesEncodableSequences or list of string
alleleslist of string
allelestring
throwboolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

include_individual_model_predictionsboolean

If True, the predictions of each individual model are included as columns in the result DataFrame.

include_percentile_ranksboolean, default True

If True, a “prediction_percentile” column will be included giving the percentile ranks. If no percentile rank info is available, this will be ignored with a warning.

centrality_measurestring or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargsdict

Additional keyword arguments to pass to Class1NeuralNetwork.predict

Returns
pandas.DataFrame of predictions
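
Called the same way as predict, but returning a DataFrame. A sketch:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    df = predictor.predict_to_dataframe(
        peptides=["SIINFEKL", "SYFPEITHI"],
        allele="HLA-A*02:01",
        include_individual_model_predictions=True)
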
calibrate_percentile_ranks(peptides=None, num_peptides_per_length=100000, alleles=None, bins=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]

Compute the cumulative distribution of ic50 values for a set of alleles over a large universe of random peptides, to enable taking quantiles of this distribution later.

Parameters
peptidessequence of string or EncodableSequences, optional

Peptides to use

num_peptides_per_lengthint, optional

If peptides argument is not specified, then num_peptides_per_length peptides are randomly sampled from a uniform distribution for each supported length

allelessequence of string, optional

Alleles to perform calibration for. If not specified all supported alleles will be calibrated.

binsobject

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges. This is in ic50 space.

motif_summarybool

If True, the length distribution and per-position amino acid frequencies are also calculated for the top x fraction of tightest-binding peptides, where each value of x is given in the summary_top_peptide_fractions list.

summary_top_peptide_fractionslist of float

Only used if motif_summary is True

verboseboolean

Whether to print status updates to stdout

model_kwargsdict

Additional low-level Class1NeuralNetwork.predict() kwargs.

Returns
dict of string -> pandas.DataFrame
If motif_summary is True, this will have keys “frequency_matrices” and
“length_distributions”. Otherwise it will be empty.
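
A sketch of recalibrating two alleles on a reduced peptide sample (smaller than the default, for speed):

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    predictor.calibrate_percentile_ranks(
        alleles=["HLA-A*02:01", "HLA-B*07:02"],
        num_peptides_per_length=10000)
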
model_select(score_function, alleles=None, min_models=1, max_models=10000)[source]

Perform model selection using a user-specified scoring function.

This works only with allele-specific models, not pan-allele models.

Model selection is done using a “step up” variable selection procedure, in which models are repeatedly added to an ensemble until the score stops improving.

Parameters
score_functionClass1AffinityPredictor -> float function

Scoring function

alleleslist of string, optional

If not specified, model selection is performed for all alleles.

min_modelsint, optional

Min models to select per allele

max_modelsint, optional

Max models to select per allele

Returns
Class1AffinityPredictorpredictor containing the selected models

mhcflurry.class1_neural_network module

class mhcflurry.class1_neural_network.Class1NeuralNetwork(**hyperparameters)[source]

Bases: object

Low level class I predictor consisting of a single neural network.

Both single allele and pan-allele prediction are supported.

Users will generally use Class1AffinityPredictor, which gives a higher-level interface and supports ensembles.

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

miscelaneous_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Miscellaneous hyperparameters. These parameters are not used by this class but may be interpreted by other code.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Combined set of all supported hyperparameters and their default values.

hyperparameter_renames = {'embedding_init_method': None, 'embedding_input_dim': None, 'embedding_output_dim': None, 'kmer_size': None, 'left_edge': None, 'min_delta': None, 'mode': None, 'monitor': None, 'peptide_amino_acid_encoding': None, 'pseudosequence_use_embedding': None, 'right_edge': None, 'take_best_epoch': None, 'use_embedding': None, 'verbose': None}
classmethod apply_hyperparameter_renames(hyperparameters)[source]

Handle hyperparameter renames.

Parameters
hyperparametersdict
Returns
dictupdated hyperparameters
KERAS_MODELS_CACHE = {}

Process-wide keras model cache, a map from: architecture JSON string to (Keras model, existing network weights)

classmethod clear_model_cache()[source]

Clear the Keras model cache.

classmethod borrow_cached_network(network_json, network_weights)[source]

Return a keras Model with the specified architecture and weights. As an optimization, when possible this will reuse architectures from a process-wide cache.

The returned object is “borrowed” in the sense that its weights can change later after subsequent calls to this method from other objects.

If you’re using this from a parallel implementation you’ll need to hold a lock while using the returned object.

Parameters
network_jsonstring of JSON
network_weightslist of numpy.array
Returns
keras.models.Model
network(borrow=False)[source]

Return the keras model associated with this predictor.

Parameters
borrowbool

Whether to return a cached model if possible. See borrow_cached_network for details

Returns
keras.models.Model
update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance's neural network.

static keras_network_cache_key(network_json)[source]

Given a Keras JSON description of a neural network, return a key that uniquely defines this network. Networks that share the same key should have compatible weights matrices and give the same prediction outputs when their weights are the same.

Parameters
network_jsonstring
Returns
string
get_config()[source]

serialize to a dict all attributes except model weights

Returns
dict
classmethod from_config(config, weights=None, weights_loader=None)[source]

deserialize from a dict returned by get_config().

Parameters
configdict
weightslist of array, optional

Network weights to restore

weights_loadercallable, optional

Function to call (no arguments) to load weights when needed

Returns
Class1NeuralNetwork
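
A sketch of a config round trip on a fresh instance (no trained network is required):

    from mhcflurry.class1_neural_network import Class1NeuralNetwork

    network = Class1NeuralNetwork()      # default hyperparameters
    config = network.get_config()
    weights = network.get_weights()      # None until a network exists
    restored = Class1NeuralNetwork.from_config(config, weights=weights)
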
load_weights()[source]

Load weights by evaluating self.network_weights_loader, if needed.

After calling this, self.network_weights_loader will be None and self.network_weights will be the weights list, if available.

get_weights()[source]

Get the network weights

Returns
list of numpy.array giving weights for each layer or None if there is no
network
peptides_to_network_input(peptides)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
peptidesEncodableSequences or list of string
Returns
numpy.array
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported, inclusive.

Returns
(int, int) tuple
allele_encoding_to_network_input(allele_encoding)[source]

Encode alleles to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
allele_encodingAlleleEncoding
Returns
(numpy.array, numpy.array)
Indices and allele representations.
static data_dependent_weights_initialization(network, x_dict=None, method='lsuv', verbose=1)[source]

Data dependent weights initialization.

Parameters
networkkeras.Model
x_dictdict of string -> numpy.ndarray

Training data, as would be passed to keras.Model.fit().

methodstring

Initialization method. Currently only “lsuv” is supported.

verboseint

Status updates printed to stdout if verbose > 0

fit_generator(generator, validation_peptide_encoding, validation_affinities, validation_allele_encoding=None, validation_inequalities=None, validation_output_indices=None, steps_per_epoch=10, epochs=1000, min_epochs=0, patience=10, min_delta=0.0, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit using a generator. Does not support many of the features of fit(), such as random negative peptides.

Fitting proceeds until early stopping is hit, using the peptides, affinities, etc. given by the parameters starting with “validation_”.

This is used for pre-training pan-allele models using data synthesized by the allele-specific models.

Parameters
generatorgenerator yielding (alleles, peptides, affinities) tuples

where alleles and peptides are lists of strings, and affinities is a list of floats.

validation_peptide_encodingEncodableSequences
validation_affinitieslist of float
validation_allele_encodingAlleleEncoding
validation_inequalitieslist of string
validation_output_indiceslist of int
steps_per_epochint
epochsint
min_epochsint
patienceint
min_deltafloat
verboseint
progress_callbackthunk
progress_preamblestring
progress_print_intervalfloat
fit(peptides, affinities, allele_encoding=None, inequalities=None, output_indices=None, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
peptidesEncodableSequences or list of string
affinitieslist of float

nM affinities. Must be the same length as peptides.

allele_encodingAlleleEncoding

If not specified, the model will be a single-allele predictor.

inequalitieslist of string, each element one of “>”, “<”, or “=”.

Inequalities to use for fitting. Same length as affinities. Each element must be one of “>”, “<”, or “=”. For example, a “>” will train on y_pred > y_true for that element in the training set. Requires using a custom loss that supports inequalities (e.g. mse_with_inequalities). If None all inequalities are taken to be “=”.

output_indiceslist of int

For multi-output models only. Same length as affinities. Indicates the index of the output (starting from 0) for each training example.

sample_weightslist of float

If not specified, all samples (including random negatives added during training) will have equal weight. If specified, the random negatives will be assigned weight=1.0.

shuffle_permutationlist of int

Permutation (integer list) of same length as peptides and affinities. If None, then a random permutation will be generated.

verboseint

Keras verbosity level

progress_callbackfunction

No-argument function to call after each epoch.

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress update. Set to None to disable.

predict(peptides, allele_encoding=None, batch_size=4096, output_index=0)[source]

Predict affinities.

If peptides are specified as EncodableSequences, then the predictions will be cached for this predictor as long as the EncodableSequences object remains in memory. The cache is keyed on the object identity of the EncodableSequences, not the sequences themselves. The cache is used only for allele-specific models (i.e. when allele_encoding is None).

Parameters
peptidesEncodableSequences or list of string
allele_encodingAlleleEncoding, optional

Only required when this model is a pan-allele model

batch_sizeint

batch_size passed to Keras

output_indexint or None

For multi-output models. Gives the output index to return. If set to None, then all outputs are returned as a samples x outputs matrix.

Returns
numpy.array of nM affinity predictions
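
To benefit from the cache, encode the peptides once and reuse the EncodableSequences object across calls. A sketch, assuming the loaded ensemble contains allele-specific networks:

    from mhcflurry import Class1AffinityPredictor
    from mhcflurry.encodable_sequences import EncodableSequences

    predictor = Class1AffinityPredictor.load()
    network = predictor.neural_networks[0]  # a single ensemble member

    sequences = EncodableSequences.create(["SIINFEKL", "SYFPEITHI"])
    first = network.predict(sequences)   # computed, then cached
    second = network.predict(sequences)  # served from the cache
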
classmethod merge(models, merge_method='average')[source]

Merge multiple models at the tensorflow (or other backend) level.

Only certain neural network architectures support merging. Others will result in a NotImplementedError.

Parameters
modelslist of Class1NeuralNetwork

instances to merge

merge_methodstring, one of “average”, “sum”, or “concatenate”

How to merge the predictions of the different models

Returns
Class1NeuralNetwork

The merged neural network

make_network(peptide_encoding, allele_amino_acid_encoding, allele_dense_layer_sizes, peptide_dense_layer_sizes, peptide_allele_merge_method, peptide_allele_merge_activation, layer_sizes, dense_layer_l1_regularization, dense_layer_l2_regularization, activation, init, output_activation, dropout_probability, batch_normalization, locally_connected_layers, topology, num_outputs=1, allele_representations=None)[source]

Helper function to make a keras network for class 1 affinity prediction.

clear_allele_representations()[source]

Set allele representations to an empty array. Useful before saving to save a smaller version of the model.

set_allele_representations(allele_representations, force_surgery=False)[source]

Set the allele representations in use by this model. This means mutating the weights for the allele input embedding layer.

Rationale: instead of passing in the allele sequence for each data point during model training or prediction (which is expensive in terms of memory usage), we pass in an allele index between 0 and n-1, where n is the number of alleles in some universe of possible alleles. This index is used in the model to look up the corresponding allele sequence. This function sets the lookup table.

See also: AlleleEncoding.allele_representations()

Parameters
allele_representationsnumpy.ndarray of shape (a, l, m)

where a is the total number of alleles, l is the allele sequence length, and m is the length of the vectors used to represent amino acids

mhcflurry.class1_presentation_predictor module

class mhcflurry.class1_presentation_predictor.Class1PresentationPredictor(affinity_predictor=None, processing_predictor_with_flanks=None, processing_predictor_without_flanks=None, weights_dataframe=None, metadata_dataframes=None, percent_rank_transform=None, provenance_string=None)[source]

Bases: object

A logistic regression model over predicted binding affinity (BA) and antigen processing (AP) score.

Instances of this class delegate to Class1AffinityPredictor and Class1ProcessingPredictor instances to generate BA and AP predictions. These predictions are combined using a logistic regression model to give a “presentation score” prediction.

Most users will call the load static method to get an instance of this class, then call the predict method to generate predictions.

model_inputs = ['affinity_score', 'processing_score']
property supported_alleles

List of alleles supported by the underlying Class1AffinityPredictor

property supported_peptide_lengths

(min, max) of supported peptide lengths, inclusive.

property supports_affinity_prediction

Is there an affinity predictor associated with this instance?

property supports_processing_prediction

Is there a processing predictor associated with this instance?

property supports_presentation_prediction

Can this instance predict presentation?

predict_affinity(peptides, alleles, sample_names=None, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict binding affinities across samples (each corresponding to up to six MHC I alleles).

Two modes are supported: each peptide can be evaluated for binding to any of the alleles in any sample (this is what happens when sample_names is None), or the i’th peptide can be evaluated for binding the alleles of the sample given by the i’th entry in sample_names.

For example, if we don’t specify sample_names, then predictions are taken for all combinations of samples and peptides, for a result size of num peptides * num samples:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample1  11927.161       A0201                6.296
1   PEPTIDE            1     sample1  32507.083       A0201               71.249
2  SIINFEKL            0     sample2   2725.593       C0202                6.662
3   PEPTIDE            1     sample2  28304.330       C0202               54.652

In contrast, here we specify sample_names, so peptide is evaluated for binding the alleles in the corresponding sample, for a result size equal to the number of peptides:

>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    sample_names=["sample2", "sample1"],
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample2   2725.592       C0202                6.662
1   PEPTIDE            1     sample1  32507.079       A0201               71.249
Parameters
peptideslist of string

Peptide sequences

allelesdict of string -> list of string

Keys are sample names, values are the alleles (genotype) for that sample

sample_nameslist of string [same length as peptides]

Sample names corresponding to each peptide. If None, then predictions are generated for all sample genotypes across all peptides.

include_affinity_percentilebool

Whether to include affinity percentile ranks

verboseint

Set to 0 for quiet.

throwboolean

Whether to throw exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFramepredictions
predict_processing(peptides, n_flanks=None, c_flanks=None, throw=True, verbose=1)[source]

Predict antigen processing scores for individual peptides, optionally including flanking sequences for better cleavage prediction.

Parameters
peptideslist of string
n_flankslist of string [same length as peptides]
c_flankslist of string [same length as peptides]
throwboolean

Whether to raise exception on unsupported peptides

verboseint
Returns
numpy.arrayAntigen processing scores for each peptide
fit(targets, peptides, sample_names, alleles, n_flanks=None, c_flanks=None, verbose=1)[source]

Fit the presentation score logistic regression model.

Parameters
targetslist of int/float

1 indicates hit, 0 indicates decoy

peptideslist of string [same length as targets]
sample_nameslist of string [same length as targets]
allelesdict of string -> list of string

Keys are sample names, values are the alleles for that sample

n_flankslist of string [same length as targets]
c_flankslist of string [same length as targets]
verboseint
get_model(name=None)[source]

Load or instantiate a new logistic regression model. Private helper method.

Parameters
namestring

If None (the default), an un-fit LR model is returned. Otherwise the weights are loaded for the specified model.

Returns
sklearn.linear_model.LogisticRegression
predict(peptides, alleles, sample_names=None, n_flanks=None, c_flanks=None, include_affinity_percentile=False, verbose=1, throw=True)[source]

Predict presentation scores across a set of peptides.

Presentation scores combine predictions for MHC I binding affinity and antigen processing.

This method returns a pandas.DataFrame giving presentation scores plus the binding affinity and processing predictions and other intermediate results.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    n_flanks=["NNN", "SNS"],
...    c_flanks=["CCC", "CNC"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide n_flank c_flank  peptide_num sample_name   affinity best_allele  processing_score  presentation_score  presentation_percentile
0  SIINFEKL     NNN     CCC            0     sample1  11927.161       A0201             0.838               0.145                    2.282
1   PEPTIDE     SNS     CNC            1     sample1  32507.083       A0201             0.025               0.003                  100.000
2  SIINFEKL     NNN     CCC            0     sample2   2725.593       C0202             0.838               0.416                    1.017
3   PEPTIDE     SNS     CNC            1     sample2  28304.330       C0202             0.025               0.003                   99.287

You can also specify sample_names, in which case each peptide is evaluated for binding the alleles in the corresponding sample only. See predict_affinity for examples.

Parameters
peptideslist of string

Peptide sequences

alleleslist of string or dict of string -> list of string

If you are predicting for a single sample, pass a list of strings (up to 6) indicating the genotype. If you are predicting across multiple samples, pass a dict where the keys are (arbitrary) sample names and the values are the alleles to predict for that sample. Set to an empty list or dict to perform processing prediction only.

sample_nameslist of string [same length as peptides]

If you are passing a dict for ‘alleles’, you can use this argument to specify which peptides go with which samples. If it is None, then predictions will be performed for each peptide across all samples.

n_flankslist of string [same length as peptides]

Upstream sequences before the peptide. Sequences of any length can be given and a suffix of the size supported by the model will be used.

c_flankslist of string [same length as peptides]

Downstream sequences after the peptide. Sequences of any length can be given and a prefix of the size supported by the model will be used.

include_affinity_percentilebool

Whether to include affinity percentile ranks

verboseint

Set to 0 for quiet.

throwboolean

Whether to throw exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFrame
Presentation scores and intermediate results.
predict_sequences(sequences, alleles, result='best', comparison_quantity=None, filter_value=None, peptide_lengths=[8, 9, 10, 11], use_flanks=True, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict presentation across protein sequences.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_sequences(
...    sequences={
...        'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...        'protein2': "SSLPTPEDKEQAQQTHH",
...    },
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    result="filtered",
...    comparison_quantity="affinity",
...    filter_value=500,
...    verbose=0)
  sequence_name  pos     peptide n_flank c_flank sample_name  affinity best_allele  affinity_percentile  processing_score  presentation_score  presentation_percentile
0      protein1   14   LLLVVSNLL   GSRLL             sample1    57.180       A0201                0.398             0.233               0.754                    0.351
1      protein1   13   LLLLVVSNL   KGSRL       L     sample1    57.339       A0201                0.398             0.031               0.586                    0.643
2      protein1    5   SSQKGSRLL   MDSKG   LLLVV     sample2   110.779       C0202                0.782             0.061               0.456                    0.920
3      protein1    6   SQKGSRLLL   DSKGS   LLVVS     sample2   254.480       C0202                1.735             0.102               0.303                    1.356
4      protein1   13  LLLLVVSNLL   KGSRL             sample1   260.390       A0201                1.012             0.158               0.345                    1.215
5      protein1   12  LLLLLVVSNL   QKGSR       L     sample1   308.150       A0201                1.094             0.015               0.206                    1.802
6      protein2    0   SSLPTPEDK           EQAQQ     sample2   410.354       C0202                2.398             0.003               0.158                    2.155
7      protein1    5    SSQKGSRL   MDSKG   LLLLV     sample2   444.321       C0202                2.512             0.026               0.159                    2.138
8      protein2    0   SSLPTPEDK           EQAQQ     sample1   459.296       A0301                0.971             0.003               0.144                    2.292
9      protein1    4   GSSQKGSRL    MDSK   LLLLV     sample2   469.052       C0202                2.595             0.014               0.146                    2.261
Parameters
sequencesstr, list of string, or string -> string dict

Protein sequences. If a dict is given, the keys are arbitrary (e.g. protein names), and the values are the amino acid sequences.

alleleslist of string, list of list of string, or dict of string -> list of string

MHC I alleles. Can be: (1) a string (a single allele), (2) a list of strings (a single genotype), (3) a list of list of strings (multiple genotypes, where the total number of genotypes must equal the number of sequences), or (4) a dict giving multiple genotypes, which will each be run over the sequences.

resultstring

Specify ‘best’ to return the strongest peptide for each sequence, ‘all’ to return predictions for all peptides, or ‘filtered’ to return predictions where the comparison_quantity is stronger (i.e. (<) for affinity, (>) for scores) than filter_value.

comparison_quantitystring

One of “presentation_score”, “processing_score”, “affinity”, or “affinity_percentile”. Prediction to use to rank (if result is “best”) or filter (if result is “filtered”) results. Default is “presentation_score”.

filter_valuefloat

Threshold value to use, only relevant when result is “filtered”. If comparison_quantity is “affinity”, then all results less than (i.e. tighter than) the specified nM affinity are retained. If it’s “presentation_score” or “processing_score” then results greater than the indicated filter_value are retained.

peptide_lengthslist of int

Peptide lengths to predict for.

use_flanksbool

Whether to include flanking sequences when running the AP predictor (for better cleavage prediction).

include_affinity_percentilebool

Whether to include affinity percentile ranks in output.

verboseint

Set to 0 for quiet mode.

throwboolean

Whether to throw exceptions (vs. log warnings) on invalid inputs.

Returns
pandas.DataFrame with columns:

peptide, n_flank, c_flank, sequence_name, affinity, best_allele, processing_score, presentation_score

save(models_dir, write_affinity_predictor=True, write_processing_predictor=True, write_weights=True, write_percent_ranks=True, write_info=True, write_metdata=True)[source]

Save the predictor to a directory on disk. If the directory does not exist it will be created.

The wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances are included in the saved data.

Parameters
models_dirstring

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

This will also load the wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances.

Parameters
models_dirstring

Path to directory. If unspecified the default downloaded models are used.

max_modelsint, optional

Maximum number of affinity and processing (counted separately) models to load

Returns
Class1PresentationPredictor instance
percentile_ranks(presentation_scores, throw=True)[source]

Return percentile ranks for the given presentation scores.

Parameters
presentation_scoressequence of float
Returns
numpy.array of float
calibrate_percentile_ranks(scores, bins=None)[source]

Compute the cumulative distribution of scores, to enable taking quantiles of this distribution later.

Parameters
scoressequence of float

Presentation prediction scores

binsobject

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges.

mhcflurry.class1_processing_neural_network module

Antigen processing neural network implementation

class mhcflurry.class1_processing_neural_network.Class1ProcessingNeuralNetwork(**hyperparameters)[source]

Bases: object

A neural network for antigen processing prediction

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters. Any values supported by keras may be used.

auxiliary_input_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Allele feature hyperparameters.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
property sequence_lengths

Supported maximum sequence lengths

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
network()[source]

Return the keras model associated with this network.

update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance's neural network.

fit(sequences, targets, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
sequencesFlankingEncoding

Peptides and upstream/downstream flanking sequences

targetslist of float

1 indicates hit, 0 indicates decoy

sample_weightslist of float

If not specified all samples have equal weight.

shuffle_permutationlist of int

Permutation (integer list) of same length as peptides and affinities. If None, then a random permutation will be generated.

verboseint

Keras verbosity level

progress_callbackfunction

No-argument function to call after each epoch.

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress update. Set to None to disable.

predict(peptides, n_flanks=None, c_flanks=None, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptideslist of string

Peptide sequences

n_flankslist of string

Upstream sequence before each peptide

c_flankslist of string

Downstream sequence after each peptide

batch_sizeint

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
predict_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
sequencesFlankingEncoding

Peptides and flanking sequences

throwboolean

Whether to throw exception on unsupported peptides

batch_sizeint

Prediction keras batch size.

Returns
numpy.array
network_input(sequences, throw=True)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
sequencesFlankingEncoding

Peptides and flanking sequences

throwboolean

Whether to throw exception on unsupported peptides

Returns
numpy.array
make_network(amino_acid_encoding, peptide_max_length, n_flank_length, c_flank_length, flanking_averages, convolutional_filters, convolutional_kernel_size, convolutional_activation, convolutional_kernel_l1_l2, dropout_rate, post_convolutional_dense_layer_sizes)[source]

Helper function to make a keras network given hyperparameters.

get_weights()[source]

Get the network weights

Returns
list of numpy.array giving weights for each layer or None if there is no
network
get_config()[source]

serialize to a dict all attributes except model weights

Returns
dict
classmethod from_config(config, weights=None)[source]

deserialize from a dict returned by get_config().

Parameters
configdict
weightslist of array, optional

Network weights to restore

Returns
Class1ProcessingNeuralNetwork

mhcflurry.class1_processing_predictor module

class mhcflurry.class1_processing_predictor.Class1ProcessingPredictor(models, manifest_df=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

User-facing interface to antigen processing prediction.

Delegates to an ensemble of Class1ProcessingNeuralNetwork instances.

Instantiate a new Class1ProcessingPredictor

Users will generally call load() to restore a saved predictor rather than using this constructor.

Parameters
modelslist of Class1ProcessingNeuralNetwork

Neural networks in the ensemble.

manifest_dfpandas.DataFrame

Manifest dataframe. If not specified a new one will be created when needed.

metadata_dataframesdict of string -> pandas.DataFrame

Arbitrary metadata associated with this predictor

provenance_stringstring, optional

Optional info string to use in __str__.

property sequence_lengths

Supported maximum sequence lengths.

Passing a peptide greater than the maximum supported length results in an error.

Passing an N- or C-flank sequence greater than the maximum supported length results in some part of it being ignored.

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
add_models(models)[source]

Add models to the ensemble (in-place).

Parameters
modelslist of Class1ProcessingNeuralNetwork
Returns
list of string
Names of the new models.
property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Returns
pandas.DataFrame
static model_name(num)[source]

Generate a model name

Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dirstring
model_namestring
Returns
string
predict(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptideslist of string

Peptide sequences

n_flankslist of string

Upstream sequence before each peptide

c_flankslist of string

Downstream sequence after each peptide

throwboolean

If True, a ValueError will be raised in the case of unsupported peptides. If False, a warning will be logged and the predictions for those peptides will be NaN.

batch_sizeint

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
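
For example, a sketch with illustrative flanking sequences:

    from mhcflurry.class1_processing_predictor import Class1ProcessingPredictor

    predictor = Class1ProcessingPredictor.load()
    scores = predictor.predict(
        peptides=["SIINFEKL", "KLGGALQAK"],
        n_flanks=["AAAAA", "GGGGG"],
        c_flanks=["TTTTT", "CCCCC"])
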
predict_to_dataframe(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for parameter descriptions.

Returns
pandas.DataFrame
Processing predictions are in the “score” column. Also includes
peptides and flanking sequences.
predict_to_dataframe_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for more information.

Parameters
sequencesFlankingEncoding
batch_sizeint
throwboolean
Returns
pandas.DataFrame
check_consistency()[source]

Verify that self.manifest_df is consistent with instance variables.

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1ProcessingNeuralNetwork, along with per-network files giving the model weights.

Parameters
models_dirstring

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dirstring

Path to directory. If unspecified the default downloaded models are used.

max_modelsint, optional

Maximum number of models to load

Returns
Class1ProcessingPredictor instance

mhcflurry.cluster_parallelism module

Simple, relatively naive parallel map implementation for HPC clusters.

Used for training MHCflurry models.

mhcflurry.cluster_parallelism.add_cluster_parallelism_args(parser)[source]

Add commandline arguments controlling cluster parallelism to an argparse ArgumentParser.

Parameters
parserargparse.ArgumentParser
mhcflurry.cluster_parallelism.cluster_results_from_args(args, work_function, work_items, constant_data=None, input_serialization_method='pickle', result_serialization_method='pickle', clear_constant_data=False)[source]

Parallel map configurable using commandline arguments. See the cluster_results() function for docs.

The args parameter should be an argparse.Namespace from an argparse parser generated using the add_cluster_parallelism_args() function.

Parameters
args
work_function
work_items
constant_data
result_serialization_method
clear_constant_data
Returns
generator
mhcflurry.cluster_parallelism.cluster_results(work_function, work_items, constant_data=None, submit_command='sh', results_workdir='./cluster-workdir', additional_complete_file=None, script_prefix_path=None, input_serialization_method='pickle', result_serialization_method='pickle', max_retries=3, clear_constant_data=False)[source]

Parallel map on an HPC cluster.

Returns [work_function(item) for item in work_items] where each invocation of work_function is performed as a separate HPC cluster job. Order is preserved.

Optionally, “constant data” can be specified, which will be passed to each work_function() invocation as a keyword argument called constant_data. This data is serialized once and all workers read it from the same source, which is more efficient than serializing it separately for each worker.

Each worker’s input is serialized to a shared NFS directory and the submit_command is used to launch a job to process that input. The shared filesystem is polled occasionally to watch for results, which are fed back to the user.

Parameters
work_functionA -> B
work_itemslist of A
constant_dataobject
submit_commandstring

For running on LSF, we use “bsub” here.

results_workdirstring

Path to NFS shared directory where inputs and results can be written

script_prefix_pathstring

Path to script that will be invoked to run each worker. A line calling the _mhcflurry-cluster-worker-entry-point command will be appended to the contents of this file.

result_serialization_methodstring, one of “pickle” or “save_predictor”

The “save_predictor” works only when the return type of work_function is Class1AffinityPredictor

max_retriesint

How many times to attempt to re-launch a failed worker

clear_constant_databool

If True, the constant data dict is cleared on the launching host after it is serialized to disk.

Returns
generator of B
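
A minimal sketch. The work function must be importable by the worker processes, and the submit command and working directory are site-specific assumptions:

    from mhcflurry.cluster_parallelism import cluster_results

    def square(item):
        return item ** 2

    results = list(cluster_results(
        work_function=square,
        work_items=[1, 2, 3],
        submit_command="bsub",  # e.g. for an LSF cluster
        results_workdir="/nfs/shared/cluster-workdir"))
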
mhcflurry.cluster_parallelism.worker_entry_point(argv=sys.argv[1:])[source]

Entry point for the worker command.

Parameters
argvlist of string

mhcflurry.common module

mhcflurry.common.configure_tensorflow(backend=None, gpu_device_nums=None, num_threads=None)[source]

Configure Keras backend to use GPU or CPU. Only tensorflow is supported.

Parameters
backendstring, optional

one of ‘tensorflow-default’, ‘tensorflow-cpu’, ‘tensorflow-gpu’

gpu_device_numslist of int, optional

GPU devices to potentially use

num_threadsint, optional

Tensorflow threads to use
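
For example, to force CPU execution with a fixed thread count (a sketch):

    from mhcflurry.common import configure_tensorflow

    configure_tensorflow(backend="tensorflow-cpu", num_threads=4)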

mhcflurry.common.configure_logging(verbose=False)[source]

Configure logging module using defaults.

Parameters
verboseboolean

If true, output will be at level DEBUG, otherwise, INFO.

mhcflurry.common.amino_acid_distribution(peptides, smoothing=0.0)[source]

Compute the fraction of each amino acid across a collection of peptides.

Parameters
peptideslist of string
smoothingfloat, optional

Small number (e.g. 0.01) to add to all amino acid fractions. The higher the number the more uniform the distribution.

Returns
pandas.Series indexed by amino acids
mhcflurry.common.random_peptides(num, length=9, distribution=None)[source]

Generate random peptides (kmers).

Parameters
numint

Number of peptides to return

lengthint

Length of each peptide

distributionpandas.Series

Maps 1-letter amino acid abbreviations to probabilities. If not specified a uniform distribution is used.

Returns
list of string
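
These utilities compose. A sketch of sampling peptides that match the amino acid distribution of a reference set:

    from mhcflurry.common import amino_acid_distribution, random_peptides

    reference = random_peptides(1000, length=9)  # uniform composition
    distribution = amino_acid_distribution(reference, smoothing=0.01)
    matched = random_peptides(100, length=9, distribution=distribution)
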
mhcflurry.common.positional_frequency_matrix(peptides)[source]

Given a set of peptides, calculate a length x amino acids frequency matrix.

Parameters
peptideslist of string

All of same length

Returns
pandas.DataFrame

Index is position, columns are amino acids

mhcflurry.common.save_weights(weights_list, filename)[source]

Save model weights to the given filename using numpy’s “.npz” format.

Parameters
weights_listlist of numpy array
filenamestring
mhcflurry.common.load_weights(filename)[source]

Restore model weights from the given filename, which should have been created with save_weights.

Parameters
filenamestring
Returns
list of array
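
A round trip, as a sketch (the path is arbitrary):

    import numpy as np
    from mhcflurry.common import save_weights, load_weights

    weights = [np.zeros((4, 4)), np.ones(4)]
    save_weights(weights, "/tmp/weights.npz")
    restored = load_weights("/tmp/weights.npz")
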
class mhcflurry.common.NumpyJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

JSON encoder (used with json module) that can handle numpy arrays.

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

If specified, separators should be an (item_separator, key_separator) tuple. The default is (‘, ‘, ‘: ‘) if indent is None and (‘,’, ‘: ‘) otherwise. To get the most compact JSON representation, you should specify (‘,’, ‘:’) to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
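
Typical usage with the json module, as a sketch:

    import json
    import numpy as np
    from mhcflurry.common import NumpyJSONEncoder

    json.dumps({"values": np.arange(3)}, cls=NumpyJSONEncoder)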

mhcflurry.custom_loss module

Custom loss functions.

For losses supporting inequalities, each training data point is associated with one of (=), (<), or (>). For e.g. (>) inequalities, penalization is applied only if the prediction is less than the given value.

mhcflurry.custom_loss.get_loss(name)[source]

Get a custom_loss.Loss instance by name.

Parameters
namestring
Returns
custom_loss.Loss
class mhcflurry.custom_loss.Loss(name=None)[source]

Bases: object

Thin wrapper to keep track of neural network loss functions, which could be custom or baked into Keras.

Each subclass or instance should define these properties/methods:

  • name : string

  • loss : string or function

    This is what gets passed to keras.fit()

  • encode_y : numpy.ndarray -> numpy.ndarray

    Transformation to apply to regression target before fitting

loss(y_true, y_pred)[source]
get_keras_loss(reduction='sum_over_batch_size')[source]
class mhcflurry.custom_loss.StandardKerasLoss(loss_name='mse')[source]

Bases: mhcflurry.custom_loss.Loss

A loss function supported by Keras, such as MSE.

supports_inequalities = False
supports_multiple_outputs = False
static encode_y(y)[source]
class mhcflurry.custom_loss.TransformPredictionsLossWrapper(loss, y_pred_transform=None)[source]

Bases: mhcflurry.custom_loss.Loss

Wrapper that applies an arbitrary transform to y_pred before calling an underlying loss function.

The y_pred_transform function should be a tensor -> tensor function.

encode_y(*args, **kwargs)[source]
loss(y_true, y_pred)[source]
class mhcflurry.custom_loss.MSEWithInequalities(name=None)[source]

Bases: mhcflurry.custom_loss.Loss

Supports training a regression model on data that includes inequalities (e.g. x < 100). Mean square error is used as the loss for elements with an (=) inequality. For elements with e.g. a (> 0.5) inequality, the loss for that element is (y - 0.5)^2 (standard MSE) if y < 0.5 and 0 otherwise.

This loss assumes that the normal range for y_true and y_pred is 0 - 1. As a hack, the implementation uses other intervals for y_pred to encode the inequality information.

y_true is interpreted as follows:

between 0 - 1:

Regular MSE loss is used. Penalty (y_pred - y_true)**2 is applied whether y_pred is greater or less than y_true.

between 2 - 3:

Treated as a “>” inequality. Penalty (y_pred - (y_true - 2))**2 is applied only if y_pred is less than y_true - 2.

between 4 - 5:

Treated as a “<” inequality. Penalty (y_pred - (y_true - 4))**2 is applied only if y_pred is greater than y_true - 4.

name = 'mse_with_inequalities'
supports_inequalities = True
supports_multiple_outputs = False
static encode_y(y, inequalities=None)[source]
loss(y_true, y_pred)[source]
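
A sketch of the target encoding, following the intervals documented above:

    import numpy as np
    from mhcflurry.custom_loss import MSEWithInequalities

    y = np.array([0.2, 0.8, 0.5])
    encoded = MSEWithInequalities.encode_y(y, inequalities=["=", ">", "<"])
    # "=" targets stay in [0, 1]; ">" targets shift into [2, 3];
    # "<" targets shift into [4, 5].
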
class mhcflurry.custom_loss.MSEWithInequalitiesAndMultipleOutputs(name=None)[source]

Bases: mhcflurry.custom_loss.Loss

Loss supporting inequalities and multiple outputs.

This loss assumes that the normal range for y_true and y_pred is 0 - 1. As a hack, the implementation uses other intervals for y_pred to encode the inequality and output-index information.

Inequalities are encoded into the regression target as in the MSEWithInequalities loss.

Multiple outputs are encoded by mapping each regression target x (after transforming for inequalities) using the rule x -> x + i * 10 where i is the output index.

The reason for explicitly encoding multiple outputs this way (rather than just making the regression target a matrix instead of a vector) is that in our use cases we frequently have missing data in the regression target. This encoding gives a simple way to penalize only on (data point, output index) pairs that have labels.

name = 'mse_with_inequalities_and_multiple_outputs'
supports_inequalities = True
supports_multiple_outputs = True
static encode_y(y, inequalities=None, output_indices=None)[source]
loss(y_true, y_pred)[source]
class mhcflurry.custom_loss.MultiallelicMassSpecLoss(delta=0.2, multiplier=1.0)[source]

Bases: mhcflurry.custom_loss.Loss

name = 'multiallelic_mass_spec_loss'
supports_inequalities = True
supports_multiple_outputs = False
static encode_y(y)[source]
loss(y_true, y_pred)[source]
mhcflurry.custom_loss.check_shape(name, arr, expected_shape)[source]

Raise ValueError if arr.shape != expected_shape.

Parameters
namestring

Included in error message to aid debugging

arrnumpy.ndarray
expected_shapetuple of int
mhcflurry.custom_loss.cls

alias of mhcflurry.custom_loss.MultiallelicMassSpecLoss

mhcflurry.data_dependent_weights_initialization module

Layer-sequential unit-variance initialization for neural networks.

See:

Mishkin and Matas, “All you need is a good init”. 2016. https://arxiv.org/abs/1511.06422

mhcflurry.data_dependent_weights_initialization.svd_orthonormal(shape)[source]
mhcflurry.data_dependent_weights_initialization.get_activations(model, layer, X_batch)[source]
mhcflurry.data_dependent_weights_initialization.lsuv_init(model, batch, verbose=True, margin=0.1, max_iter=100)[source]

Initialize neural network weights using layer-sequential unit-variance initialization.

See:

Mishkin and Matas, “All you need is a good init”. 2016. https://arxiv.org/abs/1511.06422

Parameters
modelkeras.Model
batchdict

Training data, as would be passed to keras.Model.fit().

verboseboolean

Whether to print progress to stdout

marginfloat
max_iterint
Returns
keras.Model

Same as what was passed in.

mhcflurry.downloads module

Manage local downloaded data.

mhcflurry.downloads.get_downloads_dir()[source]

Return the path to local downloaded data

mhcflurry.downloads.get_current_release()[source]

Return the current downloaded data release

mhcflurry.downloads.get_downloads_metadata()[source]

Return the contents of downloads.yml as a dict

mhcflurry.downloads.get_default_class1_models_dir(test_exists=True)[source]

Return the absolute path to the default class1 models dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_MODELS is set to an absolute path, return that path. If it’s set to a relative path (i.e. does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_MODELS is NOT set, then return the path to downloaded models in the “models_class1” download.

Parameters
test_existsboolean, optional

Whether to raise an exception if the path does not exist

Returns
stringabsolute path
mhcflurry.downloads.get_default_class1_presentation_models_dir(test_exists=True)[source]

Return the absolute path to the default class1 presentation models dir.

See get_default_class1_models_dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_PRESENTATION_MODELS is set to an absolute path, return that path. If it’s set to a relative path (does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.

Parameters
test_existsboolean, optional

Whether to raise an exception if the path does not exist

Returns
stringabsolute path
mhcflurry.downloads.get_default_class1_processing_models_dir(test_exists=True)[source]

Return the absolute path to the default class1 processing models dir.

See get_default_class1_models_dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_PROCESSING_MODELS is set to an absolute path, return that path. If it’s set to a relative path (does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.

Parameters
test_existsboolean, optional

Whether to raise an exception if the path does not exist

Returns
stringabsolute path
mhcflurry.downloads.get_current_release_downloads()[source]

Return a dict of all available downloads in the current release.

The dict keys are the names of the downloads. The values are a dict with three entries:

downloadedbool

Whether the download is currently available locally

metadatadict

Info about the download from downloads.yml such as URL

up_to_datebool or None

Whether the download URL(s) match what was used to download the current data. This is None if it cannot be determined.
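
For example, to summarize local download state (a sketch based only on the dict structure documented above):

    from mhcflurry.downloads import get_current_release_downloads

    downloads = get_current_release_downloads()
    for name, info in downloads.items():
        status = "downloaded" if info["downloaded"] else "not downloaded"
        print(name, status)
        # info["metadata"] holds the downloads.yml entry (URL, etc.);
        # info["up_to_date"] may be True, False, or None.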

mhcflurry.downloads.get_path(download_name, filename='', test_exists=True)[source]

Get the local path to a file in an MHCflurry download

Parameters
download_namestring
filenamestring

Relative path within the download to the file of interest

test_existsboolean

If True (default), throw an error telling the user how to download the data if the file does not exist

Returns
string giving local absolute path
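
A usage sketch. It assumes the "models_class1_pan" download has already been fetched (e.g. with "mhcflurry-downloads fetch models_class1_pan"); the relative filename is illustrative:

    from mhcflurry.downloads import get_path

    # Absolute path to a directory inside the local downloads dir
    models_dir = get_path("models_class1_pan", "models.combined", test_exists=True)
    print(models_dir)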
mhcflurry.downloads.configure()[source]

Setup various global variables based on environment variables.

mhcflurry.downloads_command module

Download MHCflurry released datasets and trained models.

Examples

Fetch the default downloads:

$ mhcflurry-downloads fetch

Fetch a specific download:

$ mhcflurry-downloads fetch models_class1_pan

Get the path to a download:

$ mhcflurry-downloads path models_class1_pan

Get the URL of a download:

$ mhcflurry-downloads url models_class1_pan

Summarize available and fetched downloads:

$ mhcflurry-downloads info

mhcflurry.downloads_command.run(argv=sys.argv[1:])[source]
mhcflurry.downloads_command.mkdir_p(path)[source]

Make directories as needed, similar to mkdir -p in a shell.

From: http://stackoverflow.com/questions/600268/mkdir-p-functionality-in-python

mhcflurry.downloads_command.yes_no(boolean)[source]
class mhcflurry.downloads_command.TqdmUpTo(*args, **kwargs)[source]

Bases: tqdm.std.tqdm

Provides update_to(n) which uses tqdm.update(delta_n).

Parameters
iterableiterable, optional

Iterable to decorate with a progressbar. Leave blank to manually manage the updates.

descstr, optional

Prefix for the progressbar.

totalint or float, optional

The number of expected iterations. If unspecified, len(iterable) is used if possible. If float(“inf”) or as a last resort, only basic progress statistics are displayed (no ETA, no progressbar). If gui is True and this parameter needs subsequent updating, specify an initial arbitrary large positive number, e.g. 9e9.

leavebool, optional

If [default: True], keeps all traces of the progressbar upon termination of iteration. If None, will leave only if position is 0.

fileio.TextIOWrapper or io.StringIO, optional

Specifies where to output the progress messages (default: sys.stderr). Uses file.write(str) and file.flush() methods. For encoding, see write_bytes.

ncolsint, optional

The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).

minintervalfloat, optional

Minimum progress display update interval [default: 0.1] seconds.

maxintervalfloat, optional

Maximum progress display update interval [default: 10] seconds. Automatically adjusts miniters to correspond to mininterval after long display update lag. Only works if dynamic_miniters or monitor thread is enabled.

minitersint or float, optional

Minimum progress display update interval, in iterations. If 0 and dynamic_miniters, will automatically adjust to equal mininterval (more CPU efficient, good for tight loops). If > 0, will skip display of specified number of iterations. Tweak this and mininterval to get very efficient loops. If your progress is erratic with both fast and slow iterations (network, skipping items, etc) you should set miniters=1.

asciibool or str, optional

If unspecified or False, use unicode (smooth blocks) to fill the meter. The fallback is to use ASCII characters " 123456789#".

disablebool, optional

Whether to disable the entire progressbar wrapper [default: False]. If set to None, disable on non-TTY.

unitstr, optional

String that will be used to define the unit of each iteration [default: it].

unit_scalebool or int or float, optional

If 1 or True, the number of iterations will be reduced/scaled automatically and a metric prefix following the International System of Units standard will be added (kilo, mega, etc.) [default: False]. If any other non-zero number, will scale total and n.

dynamic_ncolsbool, optional

If set, constantly alters ncols and nrows to the environment (allowing for window resizes) [default: False].

smoothingfloat, optional

Exponential moving average smoothing factor for speed estimates (ignored in GUI mode). Ranges from 0 (average speed) to 1 (current/instantaneous speed) [default: 0.3].

bar_formatstr, optional

Specify a custom bar string formatting. May impact performance. [default: '{l_bar}{bar}{r_bar}'], where l_bar='{desc}: {percentage:3.0f}%|' and r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'.

Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s.

Note that a trailing ": " is automatically removed after {desc} if the latter is empty.

initialint or float, optional

The initial counter value. Useful when restarting a progress bar [default: 0]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.

positionint, optional

Specify the line offset to print this bar (starting from 0). Automatic if unspecified. Useful to manage multiple bars at once (e.g., from threads).

postfixdict or *, optional

Specify additional stats to display at the end of the bar. Calls set_postfix(**postfix) if possible (dict).

unit_divisorfloat, optional

[default: 1000], ignored unless unit_scale is True.

write_bytesbool, optional

If (default: None) and file is unspecified, bytes will be written in Python 2. If True will also write bytes. In all other cases will default to unicode.

lock_argstuple, optional

Passed to refresh for intermediate output (initialisation, iterating, and updating).

nrowsint, optional

The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20.

guibool, optional

WARNING: internal parameter - do not use. Use tqdm.gui.tqdm(…) instead. If set, will attempt to use matplotlib animations for a graphical output [default: False].

Returns
outdecorated iterator.
update_to(b=1, bsize=1, tsize=None)[source]

Parameters
bint, optional

Number of blocks transferred so far [default: 1].

bsizeint, optional

Size of each block (in tqdm units) [default: 1].

tsizeint, optional

Total size (in tqdm units). If [default: None] remains unchanged.

mhcflurry.downloads_command.fetch_subcommand(args)[source]
mhcflurry.downloads_command.info_subcommand(args)[source]
mhcflurry.downloads_command.path_subcommand(args)[source]

Print the local path to a download

mhcflurry.downloads_command.url_subcommand(args)[source]

Print the URL(s) for a download

mhcflurry.encodable_sequences module

Class for encoding variable-length peptides to fixed-size numerical matrices

exception mhcflurry.encodable_sequences.EncodingError(message, supported_peptide_lengths)[source]

Bases: ValueError

Exception raised when peptides cannot be encoded

class mhcflurry.encodable_sequences.EncodableSequences(sequences)[source]

Bases: object

Class for encoding variable-length peptides to fixed-size numerical matrices

This class caches various encodings of a list of sequences.

In practice this is used only for peptides. To encode MHC allele sequences, see AlleleEncoding.

unknown_character = 'X'
classmethod create(sequences)[source]

Factory that returns an EncodableSequences given a list of strings. As a convenience, you can also pass it an EncodableSequences instance, in which case the object is returned unchanged.

variable_length_to_fixed_length_categorical(alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15)[source]

Encode variable-length sequences to a fixed-size index-encoded (integer) matrix.

See sequences_to_fixed_length_index_encoded_array for details.

Parameters
alignment_methodstring

One of “pad_middle” or “left_pad_right_pad”

left_edgeint, size of fixed-position left side

Only relevant for pad_middle alignment method

right_edgeint, size of the fixed-position right side

Only relevant for pad_middle alignment method

max_lengthmaximum supported peptide length
Returns
numpy.array of integers with shape (num sequences, encoded length)
For pad_middle, the encoded length is max_length. For left_pad_right_pad,
it's 2 * max_length.
variable_length_to_fixed_length_vector_encoding(vector_encoding_name, alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15, trim=False, allow_unsupported_amino_acids=False)[source]

Encode variable-length sequences to a fixed-size matrix. Amino acids are encoded as specified by the vector_encoding_name argument.

See sequences_to_fixed_length_index_encoded_array for details.

See also: variable_length_to_fixed_length_categorical.

Parameters
vector_encoding_namestring

How to represent amino acids. One of “BLOSUM62”, “one-hot”, etc. Full list of supported vector encodings is given by available_vector_encodings().

alignment_methodstring

One of “pad_middle” or “left_pad_right_pad”

left_edgeint

Size of fixed-position left side. Only relevant for pad_middle alignment method

right_edgeint

Size of the fixed-position right side. Only relevant for pad_middle alignment method

max_lengthint

Maximum supported peptide length

trimbool

If True, longer sequences will be trimmed to fit the maximum supported length. Not supported for all alignment methods.

allow_unsupported_amino_acidsbool

If True, non-canonical amino acids will be replaced with the X character before encoding.

Returns
numpy.array with shape (num sequences, encoded length, m)
where
  • m is the vector encoding length (usually 21).

  • encoded length is max_length if alignment_method is pad_middle; 2 * max_length if it's left_pad_right_pad.
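
A minimal usage sketch with illustrative peptides, using the defaults documented above:

    from mhcflurry.encodable_sequences import EncodableSequences

    peptides = EncodableSequences.create(["SIINFEKL", "SIINFEKLL", "SIQNPEK"])
    encoded = peptides.variable_length_to_fixed_length_vector_encoding("BLOSUM62")
    print(encoded.shape)  # (3, 15, 21) for the default pad_middle, max_length=15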

classmethod sequences_to_fixed_length_index_encoded_array(sequences, alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15, trim=False, allow_unsupported_amino_acids=False)[source]

Encode variable-length sequences to a fixed-size index-encoded (integer) matrix.

How variable length sequences get mapped to fixed length is set by the “alignment_method” argument. Supported alignment methods are:

pad_middle

Encoding designed for preserving the anchor positions of class I peptides. This is what is used in allele-specific models.

Each string must be of length at least left_edge + right_edge and at most max_length. The first left_edge characters in the input always map to the first left_edge characters in the output. Similarly for the last right_edge characters. The middle characters are filled in based on the length, with the X character filling in the blanks.

Example:

AAAACDDDD -> AAAAXXXCXXXDDDD

left_pad_centered_right_pad

Encoding that makes no assumptions on anchor positions but is 3x larger than pad_middle, since it duplicates the peptide (left aligned + centered + right aligned). This is what is used for the pan-allele models.

Example:

AAAACDDDD -> AAAACDDDDXXXXXXXXXAAAACDDDDXXXXXXXXXAAAACDDDD

left_pad_right_pad

Same as left_pad_centered_right_pad but only includes left- and right-padded peptide.

Example:

AAAACDDDD -> AAAACDDDDXXXXXXXXXXXXAAAACDDDD

Parameters
sequenceslist of string
alignment_methodstring

One of “pad_middle” or “left_pad_right_pad”

left_edgeint

Size of fixed-position left side. Only relevant for pad_middle alignment method

right_edgeint

Size of the fixed-position right side. Only relevant for pad_middle alignment method

max_lengthint

maximum supported peptide length

trimbool

If True, longer sequences will be trimmed to fit the maximum supported length. Not supported for all alignment methods.

allow_unsupported_amino_acidsbool

If True, non-canonical amino acids will be replaced with the X character before encoding.

Returns
numpy.array of integers with shape (num sequences, encoded length)
For pad_middle, the encoded length is max_length. For left_pad_right_pad,
it’s 2 * max_length. For left_pad_centered_right_pad, it’s
3 * max_length.

mhcflurry.ensemble_centrality module

Measures of centrality (e.g. mean) used to combine predictions across an ensemble. The inputs to these functions are log affinities, and they are expected to return a centrality measure, also in log-space.

mhcflurry.ensemble_centrality.robust_mean(log_values)[source]

Mean of values falling within the 25-75 percentiles.

Parameters
log_values2-d numpy.array

Center is computed along the second axis (i.e. per row).

Returns
centernumpy.array of length log_values.shape[0]
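
A usage sketch; rows are peptides and columns are ensemble members, matching the per-row behavior described above:

    import numpy as np
    from mhcflurry.ensemble_centrality import robust_mean

    log_values = np.log(np.array([
        [100.0, 120.0, 90.0, 110.0, 5000.0],  # one outlier ensemble member
        [50.0, 55.0, 60.0, 45.0, 52.0],
    ]))
    centers = robust_mean(log_values)  # one robust center per row, in log space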

mhcflurry.fasta module

Adapted from pyensembl, github.com/openvax/pyensembl Original implementation by Alex Rubinsteyn.

The worst sin in bioinformatics is to write your own FASTA parser. We're doing it here anyway to avoid adding another dependency to MHCflurry.

mhcflurry.fasta.read_fasta_to_dataframe(filename)[source]
class mhcflurry.fasta.FastaParser[source]

Bases: object

FastaParser object consumes lines of a FASTA file incrementally.

iterate_over_file(fasta_path)[source]

Generator that yields identifiers paired with sequences.

static open_file(fasta_path)[source]

Open either a text file or compressed gzip file as a stream of bytes.
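
A minimal sketch of reading a FASTA file into a dataframe. The path is a placeholder, and the column names are an assumption based on what mhcflurry-predict-scan expects ("sequence_id", "sequence"):

    from mhcflurry.fasta import read_fasta_to_dataframe

    df = read_fasta_to_dataframe("proteins.fasta")  # placeholder path
    print(df.head())  # expected columns: sequence_id, sequence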

mhcflurry.flanking_encoding module

Class for encoding variable-length peptides and their flanking sequences to fixed-size numerical matrices

class mhcflurry.flanking_encoding.EncodingResult(array, peptide_lengths)

Bases: tuple

Create new instance of EncodingResult(array, peptide_lengths)

array

Alias for field number 0

peptide_lengths

Alias for field number 1

class mhcflurry.flanking_encoding.FlankingEncoding(peptides, n_flanks, c_flanks)[source]

Bases: object

Encode peptides and optionally their N- and C-flanking sequences into fixed size numerical matrices. Similar to EncodableSequences but with support for flanking sequences and the encoding scheme used by the processing predictor.

Instances of this class have an immutable list of peptides with flanking sequences. Encodings are cached in the instances for faster performance when the same set of peptides needs to be encoded more than once.

Constructor. Sequences of any length can be passed.

Parameters
peptideslist of string

Peptide sequences

n_flankslist of string [same length as peptides]

Upstream sequences

c_flankslist of string [same length as peptides]

Downstream sequences

unknown_character = 'X'
vector_encode(vector_encoding_name, peptide_max_length, n_flank_length, c_flank_length, allow_unsupported_amino_acids=True, throw=True)[source]

Encode variable-length sequences to a fixed-size matrix.

Parameters
vector_encoding_namestring

How to represent amino acids. One of “BLOSUM62”, “one-hot”, etc. See amino_acid.available_vector_encodings().

peptide_max_lengthint

Maximum supported peptide length.

n_flank_lengthint

Maximum supported N-flank length

c_flank_lengthint

Maximum supported C-flank length

allow_unsupported_amino_acidsbool

If True, non-canonical amino acids will be replaced with the X character before encoding.

throwbool

Whether to raise exception on unsupported peptides

Returns
numpy.array with shape (num sequences, length, m)
where
  • num sequences is number of peptides, i.e. len(self)

  • length is peptide_max_length + n_flank_length + c_flank_length

  • m is the vector encoding length (usually 21).
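
A minimal usage sketch with illustrative peptides and flanks:

    from mhcflurry.flanking_encoding import FlankingEncoding

    encoding = FlankingEncoding(
        peptides=["SIINFEKL", "SYFPEITHI"],
        n_flanks=["AAAWT", "QQQLM"],
        c_flanks=["GGS", "PRK"])
    arr = encoding.vector_encode(
        "BLOSUM62",
        peptide_max_length=15,
        n_flank_length=5,
        c_flank_length=5)
    # per the description above: arr.shape == (2, 15 + 5 + 5, 21)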

static encode(vector_encoding_name, df, peptide_max_length, n_flank_length, c_flank_length, allow_unsupported_amino_acids=False, throw=True)[source]

Encode variable-length sequences to a fixed-size matrix.

Helper function. Users should use vector_encode.

Parameters
vector_encoding_namestring
dfpandas.DataFrame
peptide_max_lengthint
n_flank_lengthint
c_flank_lengthint
allow_unsupported_amino_acidsbool
throwbool
Returns
numpy.array

mhcflurry.hyperparameters module

Hyperparameter (neural network options) management

class mhcflurry.hyperparameters.HyperparameterDefaults(**defaults)[source]

Bases: object

Class for managing hyperparameters. Thin wrapper around a dict.

Instances of this class are a specification of the hyperparameters supported by a model and their defaults. The particular hyperparameter settings to be used, for example, to train a model are kept in plain dicts.

extend(other)[source]

Return a new HyperparameterDefaults instance containing the hyperparameters from the current instance combined with those from other.

It is an error if self and other have any hyperparameters in common.

with_defaults(obj)[source]

Given a dict of hyperparameter settings, return a dict containing those settings augmented by the defaults for any keys missing from the dict.

subselect(obj)[source]

Filter a dict of hyperparameter settings to only those keys defined in this HyperparameterDefaults.

check_valid_keys(obj)[source]

Given a dict of hyperparameter settings, throw an exception if any keys are not defined in this HyperparameterDefaults instance.

models_grid(**kwargs)[source]

Make a grid of models by taking the cartesian product of all specified model parameter lists.

Parameters
The valid kwarg parameters are the entries of this
HyperparameterDefaults instance. Each parameter must be a list
giving the values to search across.
Returns
list of dict giving the parameters for each model. The length of the
list is the product of the lengths of the input lists.
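
A sketch of the intended workflow; the hyperparameter names here are made up for illustration:

    from mhcflurry.hyperparameters import HyperparameterDefaults

    base = HyperparameterDefaults(layer_sizes=[64], dropout_probability=0.0)
    full = base.extend(HyperparameterDefaults(learning_rate=0.001))

    settings = full.with_defaults({"dropout_probability": 0.5})
    # {'layer_sizes': [64], 'dropout_probability': 0.5, 'learning_rate': 0.001}

    grid = full.models_grid(
        dropout_probability=[0.0, 0.5],
        learning_rate=[0.001, 0.01])
    assert len(grid) == 4  # cartesian product of the supplied lists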

mhcflurry.local_parallelism module

Infrastructure for “local” parallelism, i.e. multiprocess parallelism on one compute node.

mhcflurry.local_parallelism.add_local_parallelism_args(parser)[source]

Add local parallelism arguments to the given argparse.ArgumentParser.

Parameters
parserargparse.ArgumentParser
mhcflurry.local_parallelism.worker_pool_with_gpu_assignments_from_args(args)[source]

Create a multiprocessing.Pool where each worker uses its own GPU.

Uses commandline arguments. See worker_pool_with_gpu_assignments.

Parameters
argsargparse.ArgumentParser
Returns
multiprocessing.Pool
mhcflurry.local_parallelism.worker_pool_with_gpu_assignments(num_jobs, num_gpus=0, backend=None, max_workers_per_gpu=1, max_tasks_per_worker=None, worker_log_dir=None)[source]

Create a multiprocessing.Pool where each worker uses its own GPU.

Parameters
num_jobsint

Number of worker processes.

num_gpusint
backendstring
max_workers_per_gpuint
max_tasks_per_workerint
worker_log_dirstring
Returns
multiprocessing.Pool
mhcflurry.local_parallelism.make_worker_pool(processes=None, initializer=None, initializer_kwargs_per_process=None, max_tasks_per_worker=None)[source]

Convenience wrapper to create a multiprocessing.Pool.

This function adds support for per-worker initializer arguments, which are not natively supported by the multiprocessing module. The motivation for this feature is to support allocating each worker to a (different) GPU.

IMPLEMENTATION NOTE:

The per-worker initializer arguments are implemented using a Queue. Each worker reads its arguments from this queue when it starts. When it terminates, it adds its initializer arguments back to the queue, so a future process can initialize itself using these arguments.

There is one issue with this approach, however. If a worker crashes, it never repopulates the queue of initializer arguments, preventing any future worker from re-using those arguments. To deal with this we add a second 'backup queue'. This queue always contains the full set of initializer arguments: whenever a worker reads from it, it immediately pushes the popped args back to the end of the queue. If the primary arg queue is ever empty, workers read from this backup queue.

Parameters
processesint

Number of workers. Default: num CPUs.

initializerfunction, optional

Init function to call in each worker

initializer_kwargs_per_processlist of dict, optional

Arguments to pass to initializer function for each worker. Length of list must equal the number of workers.

max_tasks_per_workerint, optional

Restart workers after this many tasks. Requires Python >=3.2.

Returns
multiprocessing.Pool
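
A hedged sketch of the per-worker initializer mechanism described above; the initializer and its kwargs are illustrative (a real initializer might set CUDA_VISIBLE_DEVICES from the assigned device number):

    from mhcflurry.local_parallelism import make_worker_pool

    def init_worker(gpu_device_num=None):
        print("worker assigned GPU", gpu_device_num)  # illustrative only

    if __name__ == "__main__":
        pool = make_worker_pool(
            processes=2,
            initializer=init_worker,
            initializer_kwargs_per_process=[
                {"gpu_device_num": 0},
                {"gpu_device_num": 1},
            ])
        print(pool.map(abs, [-1, -2, 3]))
        pool.close()
        pool.join()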
mhcflurry.local_parallelism.worker_init_entry_point(init_function, arg_queue=None, backup_arg_queue=None)[source]
mhcflurry.local_parallelism.worker_init(keras_backend=None, gpu_device_nums=None, worker_log_dir=None)[source]
exception mhcflurry.local_parallelism.WrapException[source]

Bases: Exception

Add traceback info to exception so exceptions raised in worker processes can still show traceback info when re-raised in the parent.

mhcflurry.local_parallelism.call_wrapped(function, *args, **kwargs)[source]

Run function on args and kwargs and return result, wrapping any exception raised in a WrapException.

Parameters
functionarbitrary function
Any other arguments provided are passed to the function.
Returns
object
mhcflurry.local_parallelism.call_wrapped_kwargs(function, kwargs)[source]

Invoke function on given kwargs and return result, wrapping any exception raised in a WrapException.

Parameters
functionarbitrary function
kwargsdict
Returns
object
result of calling function(**kwargs)
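
For example (a sketch; the failing function is contrived):

    from mhcflurry.local_parallelism import WrapException, call_wrapped_kwargs

    def divide(numerator, denominator):
        return numerator / denominator

    try:
        call_wrapped_kwargs(divide, {"numerator": 1.0, "denominator": 0.0})
    except WrapException as exc:
        print(exc)  # includes traceback info from the original ZeroDivisionError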

mhcflurry.percent_rank_transform module

Class for transforming arbitrary values into percent ranks given a distribution.

class mhcflurry.percent_rank_transform.PercentRankTransform[source]

Bases: object

Transform arbitrary values into percent ranks.

fit(values, bins)[source]

Fit the transform using the given values (e.g. ic50s).

Parameters
valuespredictions (e.g. ic50 values)
binsbins for the cumulative distribution function

Anything that can be passed to numpy.histogram’s “bins” argument can be used here.

transform(values)[source]

Return percent ranks (range [0, 100]) for the given values.

to_series()[source]

Serialize the fit to a pandas.Series.

The index on the series gives the bin edges and the values give the CDF.

Returns
pandas.Series
static from_series(series)[source]

Deserialize a PercentRankTransform from the given pandas.Series, as returned by to_series().

Parameters
seriespandas.Series
Returns
PercentRankTransform
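
A usage sketch covering fit, transform, and (de)serialization; the predictions are synthetic:

    import numpy as np
    from mhcflurry.percent_rank_transform import PercentRankTransform

    transform = PercentRankTransform()
    predictions = np.random.uniform(1.0, 50000.0, size=100000)  # e.g. ic50s
    transform.fit(predictions, bins=1000)  # bins: anything numpy.histogram accepts

    ranks = transform.transform(np.array([50.0, 500.0, 5000.0]))  # in [0, 100]

    series = transform.to_series()                       # serialize
    restored = PercentRankTransform.from_series(series)  # and restore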

mhcflurry.predict_command module

Run MHCflurry predictor on specified peptides.

By default, the presentation predictor is used, and predictions for MHC I binding affinity, antigen processing, and the composite presentation score are returned. If you just want binding affinity predictions, pass --affinity-only.

Examples:

Write a CSV file containing the contents of INPUT.csv plus additional columns giving MHCflurry predictions:

$ mhcflurry-predict INPUT.csv --out RESULT.csv

The input CSV file is expected to contain columns “allele”, “peptide”, and, optionally, “n_flank”, and “c_flank”.

If --out is not specified, results are written to stdout.

You can also run on alleles and peptides specified on the commandline, in which case predictions are written for all combinations of alleles and peptides:

$ mhcflurry-predict --alleles HLA-A0201 H-2Kb --peptides SIINFEKL DENDREKLLL

Instead of individual alleles (in a CSV or on the command line), you can also give a comma-separated list of alleles specifying a sample genotype. In this case, the tightest binding affinity across the alleles for the sample will be returned. For example:

$ mhcflurry-predict --peptides SIINFEKL DENDREKLLL --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:01,HLA-C*03:01

will give the tightest predicted affinities across alleles for each of the two genotypes specified for each peptide.

mhcflurry.predict_command.run(argv=sys.argv[1:])[source]

mhcflurry.predict_scan_command module

Scan protein sequences using the MHCflurry presentation predictor.

By default, sub-sequences (peptides) with affinity percentile ranks less than 2.0 are returned. You can also specify --results-all to return predictions for all peptides, or --results-best to return the top peptide for each sequence.

Examples:

Scan a set of sequences in a FASTA file for binders to any alleles in an MHC I genotype:

$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02

Instead of a FASTA, you can also pass a CSV that has “sequence_id” and “sequence” columns.

You can also specify multiple MHC I genotypes to scan as space-separated arguments to the --alleles option:

$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01

If --out is not specified, results are written to standard out.

You can also specify sequences on the commandline:

$ mhcflurry-predict-scan --sequences MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02

mhcflurry.predict_scan_command.parse_peptide_lengths(value)[source]
mhcflurry.predict_scan_command.run(argv=sys.argv[1:])[source]

mhcflurry.random_negative_peptides module

class mhcflurry.random_negative_peptides.RandomNegativePeptides(**hyperparameters)[source]

Bases: object

Generate random negative (peptide, allele) pairs. These are used during model training, where they are resampled at each epoch.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for random negative peptides.

Number of random negatives will be:

random_negative_rate * (num measurements) + random_negative_constant

where the exact meaning of (num measurements) depends on the particular random_negative_method in use.

If random_negative_match_distribution is True, then the amino acid frequencies of the training data peptides are used to generate the random peptides.

Valid values for random_negative_method are:

“by_length”: used for allele-specific prediction. See the RandomNegativePeptides.plan_by_length method.

“by_allele”: used for pan-allele prediction. See the RandomNegativePeptides.plan_by_allele method.

“by_allele_equalize_nonbinders”: used for pan-allele prediction. See the RandomNegativePeptides.plan_by_allele_equalize_nonbinders method.

“recommended”: the default. Use by_length if the predictor is allele-specific and by_allele if it’s pan-allele.

plan(peptides, affinities, alleles=None, inequalities=None)[source]

Calculate the number of random negatives for each allele and peptide length. Call this once after instantiating the object.

Parameters
peptideslist of string
affinitieslist of float
alleleslist of string, optional
inequalitieslist of string (“>”, “<”, or “=”), optional
Returns
pandas.DataFrame indicating number of random negatives for each length
and allele.
plan_by_length(df_all, df_binders=None, df_nonbinders=None)[source]

Generate a random negative plan using the “by_length” policy.

Parameters are as in the plan method. No return value.

Used for allele-specific predictors. Does not work well for pan-allele.

Different numbers of random negatives per length. Alleles are sampled proportionally to the number of times they are used in the training data.

plan_by_allele(df_all, df_binders=None, df_nonbinders=None)[source]

Generate a random negative plan using the “by_allele” policy.

Parameters are as in the plan method. No return value.

For each allele, a particular number of random negatives is used for all lengths. Across alleles, the number of random negatives varies; within an allele, the number of random negatives for each length is constant.

plan_by_allele_equalize_nonbinders(df_all, df_binders, df_nonbinders)[source]

Generate a random negative plan using the “by_allele_equalize_nonbinders” policy.

Parameters are as in the plan method. No return value.

Requires that the random_negative_binder_threshold hyperparameter is set.

In a first step, the random negatives selected by the “by_allele” method are added (see plan_by_allele). Then, the total number of non-binders is calculated for each allele and length. This total includes non-binder measurements in the training data plus the random negative peptides added in the first step. In a second step, additional random negative peptides are added so that, for each allele, all peptide lengths have the same total number of non-binders.

get_alleles()[source]

Get the list of alleles corresponding to each random negative peptide as returned by get_peptides. This does NOT change and can be safely called once and reused.

Returns
list of string
get_peptides()[source]

Get the list of random negative peptides. This will be different each time the method is called.

Returns
list of string
get_total_count()[source]

Total number of planned random negative peptides.

Returns
int
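
A hedged sketch of the intended usage. The training data is illustrative, and passing hyperparameters as keyword arguments with the names documented above is an assumption:

    from mhcflurry.random_negative_peptides import RandomNegativePeptides

    planner = RandomNegativePeptides(
        random_negative_rate=0.2,
        random_negative_constant=25,
        random_negative_method="recommended")
    planner.plan(
        peptides=["SIINFEKL", "SYFPEITHI", "KLGGALQAK"],
        affinities=[50.0, 200.0, 25000.0],
        alleles=["HLA-A*02:01", "HLA-A*02:01", "HLA-B*57:01"])

    alleles = planner.get_alleles()        # fixed; safe to call once and reuse
    for _ in range(3):                     # e.g. once per training epoch
        peptides = planner.get_peptides()  # freshly resampled on each call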

mhcflurry.regression_target module

mhcflurry.regression_target.from_ic50(ic50, max_ic50=50000.0)[source]

Convert ic50s to regression targets in the range [0.0, 1.0].

Parameters
ic50numpy.array of float
Returns
numpy.array of float
mhcflurry.regression_target.to_ic50(x, max_ic50=50000.0)[source]

Convert regression targets in the range [0.0, 1.0] to ic50s in the range [0, 50000.0].

Parameters
xnumpy.array of float
Returns
numpy.array of float
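
A round-trip sketch; since the two functions are inverses for ic50s within [1.0, max_ic50], the original values are recovered after the transform:

    import numpy as np
    from mhcflurry.regression_target import from_ic50, to_ic50

    ic50s = np.array([50.0, 500.0, 5000.0, 50000.0])
    targets = from_ic50(ic50s)    # tight binders map toward 1.0
    recovered = to_ic50(targets)  # inverse transform
    assert np.allclose(recovered, ic50s)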

mhcflurry.scoring module

Measures of prediction accuracy

mhcflurry.scoring.make_scores(ic50_y, ic50_y_pred, sample_weight=None, threshold_nm=500, max_ic50=50000)[source]

Calculate AUC, F1, and Kendall Tau scores.

Parameters
ic50_yfloat list

true IC50s (i.e. affinities)

ic50_y_predfloat list

predicted IC50s

sample_weightfloat list [optional]
threshold_nmfloat [optional]
max_ic50float [optional]
Returns
dict with entries “auc”, “f1”, “tau”
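
A minimal sketch with synthetic affinities; with threshold_nm=500, the first two measurements count as binders:

    import numpy as np
    from mhcflurry.scoring import make_scores

    ic50_true = np.array([25.0, 300.0, 1200.0, 20000.0])
    ic50_pred = np.array([40.0, 150.0, 900.0, 30000.0])
    scores = make_scores(ic50_true, ic50_pred, threshold_nm=500)
    print(scores["auc"], scores["f1"], scores["tau"])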

mhcflurry.select_allele_specific_models_command module

Model select class1 single allele models.

mhcflurry.select_allele_specific_models_command.run(argv=sys.argv[1:])[source]
class mhcflurry.select_allele_specific_models_command.ScrambledPredictor(predictor)[source]

Bases: object

predict(peptides, allele)[source]
mhcflurry.select_allele_specific_models_command.model_select(allele, constant_data={})[source]
mhcflurry.select_allele_specific_models_command.cache_encoding(predictor, peptides)[source]
class mhcflurry.select_allele_specific_models_command.ScoreFunction(function, summary=None)[source]

Bases: object

Thin wrapper over a score function (Class1AffinityPredictor -> float). Used to keep a summary string associated with the function.

class mhcflurry.select_allele_specific_models_command.CombinedModelSelector(model_selectors, weights=None, min_contribution_percent=1.0)[source]

Bases: object

Model selector that computes a weighted average over other model selectors.

usable_for_allele(allele)[source]
plan_summary(allele)[source]
score_function(allele, dry_run=False)[source]
class mhcflurry.select_allele_specific_models_command.ConsensusModelSelector(predictor, num_peptides_per_length=10000, multiply_score_by_value=10.0)[source]

Bases: object

Model selector that scores sub-ensembles based on their Kendall tau consistency with the full ensemble over a set of random peptides.

usable_for_allele(allele)[source]
max_absolute_value(allele)[source]
plan_summary(allele)[source]
score_function(allele)[source]
class mhcflurry.select_allele_specific_models_command.MSEModelSelector(df, predictor, min_measurements=1, multiply_score_by_data_size=True)[source]

Bases: object

Model selector that uses mean-squared error to score models. Inequalities are supported.

usable_for_allele(allele)[source]
max_absolute_value(allele)[source]
plan_summary(allele)[source]
score_function(allele)[source]
class mhcflurry.select_allele_specific_models_command.MassSpecModelSelector(df, predictor, decoys_per_length=0, min_measurements=100, multiply_score_by_data_size=True)[source]

Bases: object

Model selector that uses positive predictive value (PPV) at distinguishing hits from decoys in mass-spec experiments.

static ppv(y_true, predictions)[source]
usable_for_allele(allele)[source]
max_absolute_value(allele)[source]
plan_summary(allele)[source]
score_function(allele)[source]

mhcflurry.select_pan_allele_models_command module

Model select class1 pan-allele models.

APPROACH: For each training fold, we select at least min and at most max models (where min and max are set by the --min-models-per-fold and --max-models-per-fold arguments) using a step-up (forward) selection procedure. The final ensemble is the union of all selected models across all folds.

mhcflurry.select_pan_allele_models_command.mse(predictions, actual, inequalities=None, affinities_are_already_01_transformed=False)[source]

Mean squared error of predictions vs. actual

Parameters
predictionslist of float
actuallist of float
inequalitieslist of string (“>”, “<”, or “=”)
affinities_are_already_01_transformedboolean

Predictions and actual are taken to be nanomolar affinities if affinities_are_already_01_transformed is False, otherwise 0-1 values.

Returns
float
mhcflurry.select_pan_allele_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.select_pan_allele_models_command.do_model_select_task(item, constant_data={})[source]
mhcflurry.select_pan_allele_models_command.model_select(fold_num, models, min_models, max_models, constant_data={})[source]

Model select for a fold.

Parameters
fold_numint
modelslist of Class1NeuralNetwork
min_modelsint
max_modelsint
constant_datadict
Returns
dict with keys ‘fold_num’, ‘selected_indices’, ‘summary’

mhcflurry.select_processing_models_command module

Model select antigen processing models.

APPROACH: For each training fold, we select at least min and at most max models (where min and max are set by the --min-models-per-fold and --max-models-per-fold arguments) using a step-up (forward) selection procedure. The final ensemble is the union of all selected models across all folds. AUC is used as the metric.

mhcflurry.select_processing_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.select_processing_models_command.do_model_select_task(item, constant_data={})[source]
mhcflurry.select_processing_models_command.model_select(fold_num, models, min_models, max_models, constant_data={})[source]

Model select for a fold.

Parameters
fold_numint
modelslist of Class1NeuralNetwork
min_modelsint
max_modelsint
constant_datadict
Returns
dict with keys ‘fold_num’, ‘selected_indices’, ‘summary’

mhcflurry.testing_utils module

Utilities used in MHCflurry unit tests.

mhcflurry.testing_utils.startup()[source]

Configure Keras backend for running unit tests.

mhcflurry.testing_utils.cleanup()[source]

Clear tensorflow session and other process-wide resources.

mhcflurry.train_allele_specific_models_command module

Train Class1 single allele models.

mhcflurry.train_allele_specific_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_allele_specific_models_command.alleles_by_similarity(allele)[source]
mhcflurry.train_allele_specific_models_command.train_model(n_models, allele_num, n_alleles, hyperparameter_set_num, num_hyperparameter_sets, allele, hyperparameters, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]
mhcflurry.train_allele_specific_models_command.subselect_df_held_out(df, recriprocal_held_out_fraction=10, seed=0)[source]

mhcflurry.train_pan_allele_models_command module

Train Class1 pan-allele models.

mhcflurry.train_pan_allele_models_command.assign_folds(df, num_folds, held_out_fraction, held_out_max)[source]

Split training data into multiple test/train pairs, which we refer to as folds. Note that a given data point may be assigned to multiple test or train sets; these folds are NOT a non-overlapping partition as used in cross validation.

A fold is defined by a boolean value for each data point, indicating whether it is included in the training data for that fold. If it’s not in the training data, then it’s in the test data.

Folds are balanced in terms of allele content.

Parameters
dfpandas.DataFrame

training data

num_foldsint
held_out_fractionfloat

Fraction of data to hold out as test data in each fold

held_out_max

For a given allele, do not hold out more than held_out_max data points in any fold.

Returns
pandas.DataFrame

index is same as df.index, columns are “fold_0”, … “fold_N” giving whether the data point is in the training data for the fold
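
A hedged sketch of consuming the returned fold assignments. The training DataFrame columns here ("allele", "peptide", "measurement_value") are assumptions for illustration:

    import pandas as pd
    from mhcflurry.train_pan_allele_models_command import assign_folds

    df = pd.DataFrame({
        "allele": ["HLA-A*02:01"] * 50 + ["HLA-B*57:01"] * 50,
        "peptide": ["SIINFEKL"] * 100,
        "measurement_value": [100.0] * 100,
    })
    folds_df = assign_folds(df, num_folds=4, held_out_fraction=0.1, held_out_max=5)
    train_0 = df.loc[folds_df["fold_0"]]   # training data for fold 0
    test_0 = df.loc[~folds_df["fold_0"]]   # held-out (test) data for fold 0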

mhcflurry.train_pan_allele_models_command.pretrain_data_iterator(filename, master_allele_encoding, peptides_per_chunk=1024)[source]

Step through a CSV file giving predictions for a large number of peptides (rows) and alleles (columns).

Parameters
filenamestring
master_allele_encodingAlleleEncoding
peptides_per_chunkint
Returns
Generator of (AlleleEncoding, EncodableSequences, float affinities) tuples
mhcflurry.train_pan_allele_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_pan_allele_models_command.main(args)[source]
mhcflurry.train_pan_allele_models_command.initialize_training(args)[source]
mhcflurry.train_pan_allele_models_command.train_models(args)[source]
mhcflurry.train_pan_allele_models_command.train_model(work_item_name, work_item_num, num_work_items, architecture_num, num_architectures, fold_num, num_folds, replicate_num, num_replicates, hyperparameters, pretrain_data_filename, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]

mhcflurry.train_presentation_models_command module

Train Class1 presentation models.

mhcflurry.train_presentation_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_presentation_models_command.main(args)[source]

mhcflurry.train_processing_models_command module

Train Class1 processing models.

mhcflurry.train_processing_models_command.assign_folds(df, num_folds, held_out_samples)[source]

Split training data into multiple test/train pairs, which we refer to as folds. Note that a given data point may be assigned to multiple test or train sets; these folds are NOT a non-overlapping partition as used in cross validation.

A fold is defined by a boolean value for each data point, indicating whether it is included in the training data for that fold. If it’s not in the training data, then it’s in the test data.

Parameters
dfpandas.DataFrame

training data

num_foldsint
held_out_samplesint
Returns
pandas.DataFrame

index is same as df.index, columns are “fold_0”, … “fold_N” giving whether the data point is in the training data for the fold

mhcflurry.train_processing_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_processing_models_command.main(args)[source]
mhcflurry.train_processing_models_command.initialize_training(args)[source]
mhcflurry.train_processing_models_command.train_models(args)[source]
mhcflurry.train_processing_models_command.train_model(work_item_name, work_item_num, num_work_items, architecture_num, num_architectures, fold_num, num_folds, replicate_num, num_replicates, hyperparameters, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]

mhcflurry.version module