API Documentation

Class I MHC ligand prediction package

class mhcflurry.Class1AffinityPredictor(allele_to_allele_specific_models=None, class1_pan_allele_models=None, allele_to_sequence=None, manifest_df=None, allele_to_percent_rank_transform=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

High-level interface for peptide/MHC I binding affinity prediction.

This class manages low-level Class1NeuralNetwork instances, each of which wraps a single Keras network. The purpose of Class1AffinityPredictor is to implement ensembles, handling of multiple alleles, and predictor loading and saving. It also provides a place to keep track of metadata like prediction histograms for percentile rank calibration.

Parameters
allele_to_allele_specific_models : dict of string -> list of Class1NeuralNetwork

Ensemble of single-allele models to use for each allele.

class1_pan_allele_models : list of Class1NeuralNetwork

Ensemble of pan-allele models.

allele_to_sequence : dict of string -> string

MHC allele name to fixed-length amino acid sequence (sometimes referred to as the pseudosequence). Required only if class1_pan_allele_models is specified.

manifest_df : pandas.DataFrame, optional

Must have columns: model_name, allele, config_json, model. Only required if you want to update an existing serialization of a Class1AffinityPredictor. Otherwise this dataframe will be generated automatically based on the supplied models.

allele_to_percent_rank_transform : dict of string -> PercentRankTransform, optional

PercentRankTransform instances to use for each allele.

metadata_dataframes : dict of string -> pandas.DataFrame, optional

Optional additional dataframes to write to the models dir when save() is called. Useful for tracking provenance.

provenance_string : string, optional

Optional info string to use in __str__.
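
For orientation, here is a minimal usage sketch (not part of the package docstrings). It assumes the default models have already been fetched with the mhcflurry-downloads tool:

from mhcflurry import Class1AffinityPredictor

# Load the default downloaded models.
predictor = Class1AffinityPredictor.load()

# Predicted binding affinities in nM; lower means tighter binding.
affinities = predictor.predict(
    peptides=["SIINFEKL", "SIINFEKD"],
    allele="HLA-A0201")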

property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Based on:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

Returns
pandas.DataFrame
clear_cache()[source]

Clear values cached based on the neural networks in this predictor.

Users should call this after mutating any of the following:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

  • self.allele_to_sequence

Methods that mutate these instance variables will call this method on their own if needed.

property neural_networks

List of the neural networks in the ensemble.

Returns
list of Class1NeuralNetwork
classmethod merge(predictors)[source]

Merge the ensembles of two or more Class1AffinityPredictor instances.

Note: the resulting merged predictor will NOT have calibrated percentile ranks. Call calibrate_percentile_ranks on it if these are needed.

Parameters
predictors : sequence of Class1AffinityPredictor
Returns
Class1AffinityPredictor instance
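
A sketch of merging two previously saved predictors (the directory paths here are hypothetical):

from mhcflurry import Class1AffinityPredictor

# Hypothetical directories containing two previously saved predictors.
predictor_a = Class1AffinityPredictor.load("/path/to/models_a")
predictor_b = Class1AffinityPredictor.load("/path/to/models_b")

merged = Class1AffinityPredictor.merge([predictor_a, predictor_b])

# The merged predictor has no calibrated percentile ranks; recalibrate if needed.
merged.calibrate_percentile_ranks()
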
merge_in_place(others)[source]

Add the models present in other predictors into the current predictor.

Parameters
others : list of Class1AffinityPredictor

Other predictors to merge into the current predictor.

Returns
list of string
Names of the newly added models.
property supported_alleles

Alleles for which predictions can be made.

Returns
list of string
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported by all models, inclusive.

Returns
(int, int) tuple
check_consistency()[source]

Verify that self.manifest_df is consistent with:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1NeuralNetwork, along with per-network files giving the model weights. If there are pan-allele predictors in the ensemble, the allele sequences are also stored in the directory. There is also a small file “index.txt” with basic metadata: when the models were trained, by whom, on what host.

Parameters
models_dir : string

Path to directory. It will be created if it doesn’t exist.

model_names_to_write : list of string, optional

Only write the weights for the specified models. Useful for incremental updates during training.

write_metadata : boolean, optional

Whether to write optional metadata.

static load(models_dir=None, max_models=None, optimization_level=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dir : string

Path to directory. If unspecified the default downloaded models are used.

max_models : int, optional

Maximum number of Class1NeuralNetwork instances to load.

optimization_level : int

If >0, model optimization will be attempted. Defaults to the value of the environment variable MHCFLURRY_OPTIMIZATION_LEVEL.

Returns
Class1AffinityPredictor instance
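
A save/load round trip might look like this sketch (the directory path is arbitrary):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()  # default downloaded models

# Writes manifest.csv, per-network weights files, and metadata.
predictor.save("/tmp/my_models")

# Reload from that directory, optionally capping the ensemble size.
reloaded = Class1AffinityPredictor.load("/tmp/my_models", max_models=8)
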
optimize(warn=True)[source]

EXPERIMENTAL: Optimize the predictor for faster predictions.

Currently the only optimization implemented is to merge multiple pan-allele predictors at the tensorflow level.

The optimization is performed in-place, mutating the instance.

Returns
bool

Whether optimization was performed

static model_name(allele, num)[source]

Generate a model name

Parameters
allele : string
num : int
Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dir : string
model_name : string
Returns
string
property master_allele_encoding

An AlleleEncoding containing the universe of alleles specified by self.allele_to_sequence.

Returns
AlleleEncoding
fit_allele_specific_predictors(n_models, architecture_hyperparameters_list, allele, peptides, affinities, inequalities=None, train_rounds=None, models_dir_for_save=None, verbose=0, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more allele specific predictors for a single allele using one or more neural network architectures.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_models : int

Number of neural networks to fit.

architecture_hyperparameters_list : list of dict

List of hyperparameter sets.

allele : string
peptides : EncodableSequences or list of string
affinities : list of float

nM affinities.

inequalities : list of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

train_rounds : sequence of int

Each training point i will be used on training rounds r for which train_rounds[i] > r, r >= 0.

models_dir_for_save : string, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verbose : int

Keras verbosity.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
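
A minimal training sketch with toy data (real training sets are much larger). The "max_epochs" override is an assumption about the supported hyperparameter names; an empty dict is assumed to fall back to the package defaults:

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor()  # start from an empty predictor

networks = predictor.fit_allele_specific_predictors(
    n_models=2,
    architecture_hyperparameters_list=[{"max_epochs": 5}],  # assumed name
    allele="HLA-A0201",
    peptides=["SIINFEKL", "SIINFEKD", "SIINFEKQ", "AAAWYLWEV"],
    affinities=[120.0, 300.0, 500.0, 25000.0])  # nM
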
fit_class1_pan_allele_models(n_models, architecture_hyperparameters, alleles, peptides, affinities, inequalities, models_dir_for_save=None, verbose=1, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more pan-allele predictors using a single neural network architecture.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_models : int

Number of neural networks to fit.

architecture_hyperparameters : dict
alleles : list of string

Allele names (not sequences) corresponding to each peptide.

peptides : EncodableSequences or list of string
affinities : list of float

nM affinities.

inequalities : list of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

models_dir_for_save : string, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verbose : int

Keras verbosity.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
add_pan_allele_model(model, models_dir_for_save=None)[source]

Add a pan-allele model to the ensemble and optionally do an incremental save.

Parameters
model : Class1NeuralNetwork
models_dir_for_save : string

Directory to save the resulting ensemble to.

percentile_ranks(affinities, allele=None, alleles=None, throw=True)[source]

Return percentile ranks for the given ic50 affinities and alleles.

The ‘allele’ and ‘alleles’ arguments are as in the predict method. Specify one of these.

Parameters
affinities : sequence of float

nM affinities.

allele : string
alleles : sequence of string
throw : boolean

If True, a ValueError will be raised in the case of unsupported alleles. If False, a warning will be logged and NaN will be returned for those percentile ranks.

Returns
numpy.array of float
predict(peptides, alleles=None, allele=None, throw=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model (nM) predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptides : EncodableSequences or list of string
alleles : list of string
allele : string
throw : boolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

centrality_measure : string or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargs : dict

Additional keyword arguments to pass to Class1NeuralNetwork.predict.

Returns
numpy.array of predictions
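
For example (a sketch assuming the default models are downloaded):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()

# One allele for all peptides:
predictor.predict(peptides=["SIINFEKL", "SIINFEKD"], allele="HLA-A0201")

# Or one allele per peptide; 'alleles' must match 'peptides' in length:
predictor.predict(
    peptides=["SIINFEKL", "SIINFEKD"],
    alleles=["HLA-A0201", "HLA-B0702"],
    throw=False)  # unsupported inputs yield NaN instead of raising
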
predict_to_dataframe(peptides, alleles=None, allele=None, throw=True, include_individual_model_predictions=False, include_percentile_ranks=True, include_confidence_intervals=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities. Gives more detailed output than predict method, including 5-95% prediction intervals.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptides : EncodableSequences or list of string
alleles : list of string
allele : string
throw : boolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

include_individual_model_predictions : boolean

If True, the predictions of each individual model are included as columns in the result DataFrame.

include_percentile_ranks : boolean, default True

If True, a “prediction_percentile” column will be included giving the percentile ranks. If no percentile rank info is available, this will be ignored with a warning.

centrality_measure : string or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargs : dict

Additional keyword arguments to pass to Class1NeuralNetwork.predict.

Returns
pandas.DataFrame of predictions
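
A sketch of the dataframe output (the column names mentioned in the comment are indicative, not guaranteed):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()

df = predictor.predict_to_dataframe(
    peptides=["SIINFEKL", "SIINFEKD"],
    allele="HLA-A0201",
    include_individual_model_predictions=True)

# Expect columns along the lines of 'peptide', 'allele', 'prediction',
# 'prediction_low', 'prediction_high', and 'prediction_percentile',
# plus one column per ensemble member.
print(df.columns.tolist())
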
calibrate_percentile_ranks(peptides=None, num_peptides_per_length=100000, alleles=None, bins=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]

Compute the cumulative distribution of ic50 values for a set of alleles over a large universe of random peptides, to enable taking quantiles of this distribution later.

Parameters
peptides : sequence of string or EncodableSequences, optional

Peptides to use.

num_peptides_per_length : int, optional

If the peptides argument is not specified, then num_peptides_per_length peptides are randomly sampled from a uniform distribution for each supported length.

alleles : sequence of string, optional

Alleles to perform calibration for. If not specified all supported alleles will be calibrated.

bins : object

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges. This is in ic50 space.

motif_summary : bool

If True, the length distribution and per-position amino acid frequencies are also calculated for the top x fraction of tightest-binding peptides, where each value of x is given in the summary_top_peptide_fractions list.

summary_top_peptide_fractions : list of float

Only used if motif_summary is True.

verbose : boolean

Whether to print status updates to stdout.

model_kwargs : dict

Additional low-level Class1NeuralNetwork.predict() kwargs.

Returns
dict of string -> pandas.DataFrame
If motif_summary is True, this will have keys “frequency_matrices” and
“length_distributions”. Otherwise it will be empty.
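
A calibration sketch using a reduced random-peptide universe for speed (the default is 100,000 peptides per length):

from mhcflurry import Class1AffinityPredictor

predictor = Class1AffinityPredictor.load()

predictor.calibrate_percentile_ranks(
    alleles=["HLA-A0201", "HLA-B0702"],
    num_peptides_per_length=10000)

# Percentile ranks are now available for the calibrated alleles.
predictor.percentile_ranks([100.0, 5000.0], allele="HLA-A0201")
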
model_select(score_function, alleles=None, min_models=1, max_models=10000)[source]

Perform model selection using a user-specified scoring function.

This works only with allele-specific models, not pan-allele models.

Model selection is done using a “step up” variable selection procedure, in which models are repeatedly added to an ensemble until the score stops improving.

Parameters
score_function : Class1AffinityPredictor -> float function

Scoring function.

alleles : list of string, optional

If not specified, model selection is performed for all alleles.

min_models : int, optional

Min models to select per allele.

max_models : int, optional

Max models to select per allele.

Returns
Class1AffinityPredictor
Predictor containing the selected models.
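
A sketch of a possible score function; the held-out validation data and the models directory are hypothetical:

import numpy
from mhcflurry import Class1AffinityPredictor

# Hypothetical directory containing allele-specific models.
predictor = Class1AffinityPredictor.load("/path/to/allele_specific_models")

val_peptides = ["SIINFEKL", "SIINFEKD", "AAAWYLWEV"]  # hypothetical
val_affinities = numpy.array([120.0, 300.0, 25000.0])  # hypothetical, nM

def score_function(candidate):
    # Higher is better: negated mean absolute error in log-affinity space.
    predicted = candidate.predict(peptides=val_peptides, allele="HLA-A0201")
    return -numpy.mean(numpy.abs(
        numpy.log(predicted) - numpy.log(val_affinities)))

selected = predictor.model_select(
    score_function, alleles=["HLA-A0201"], max_models=4)
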
class mhcflurry.Class1NeuralNetwork(**hyperparameters)[source]

Bases: object

Low level class I predictor consisting of a single neural network.

Both single allele and pan-allele prediction are supported.

Users will generally use Class1AffinityPredictor, which gives a higher-level interface and supports ensembles.

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

miscelaneous_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Miscellaneous hyperparameters. These parameters are not used by this class but may be interpreted by other code.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Combined set of all supported hyperparameters and their default values.

hyperparameter_renames = {'embedding_init_method': None, 'embedding_input_dim': None, 'embedding_output_dim': None, 'kmer_size': None, 'left_edge': None, 'min_delta': None, 'mode': None, 'monitor': None, 'peptide_amino_acid_encoding': None, 'pseudosequence_use_embedding': None, 'right_edge': None, 'take_best_epoch': None, 'use_embedding': None, 'verbose': None}
classmethod apply_hyperparameter_renames(hyperparameters)[source]

Handle hyperparameter renames.

Parameters
hyperparameters : dict
Returns
dict
Updated hyperparameters.
KERAS_MODELS_CACHE = {}

Process-wide keras model cache, a map from: architecture JSON string to (Keras model, existing network weights)

classmethod clear_model_cache()[source]

Clear the Keras model cache.

classmethod borrow_cached_network(network_json, network_weights)[source]

Return a keras Model with the specified architecture and weights. As an optimization, when possible this will reuse architectures from a process-wide cache.

The returned object is “borrowed” in the sense that its weights can change later after subsequent calls to this method from other objects.

If you’re using this from a parallel implementation you’ll need to hold a lock while using the returned object.

Parameters
network_json : string of JSON
network_weights : list of numpy.array
Returns
keras.models.Model
network(borrow=False)[source]

Return the keras model associated with this predictor.

Parameters
borrow : bool

Whether to return a cached model if possible. See borrow_cached_network for details.

Returns
keras.models.Model
update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance’s neural network.

static keras_network_cache_key(network_json)[source]

Given a Keras JSON description of a neural network, return a key that uniquely defines this network. Networks that share the same key should have compatible weights matrices and give the same prediction outputs when their weights are the same.

Parameters
network_json : string
Returns
string
get_config()[source]

Serialize all attributes except model weights to a dict.

Returns
dict
classmethod from_config(config, weights=None, weights_loader=None)[source]

Deserialize from a dict returned by get_config().

Parameters
config : dict
weights : list of array, optional

Network weights to restore.

weights_loader : callable, optional

Function to call (no arguments) to load weights when needed.

Returns
Class1NeuralNetwork
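
A serialization round trip through get_config()/get_weights()/from_config() might look like this sketch:

from mhcflurry import Class1AffinityPredictor, Class1NeuralNetwork

predictor = Class1AffinityPredictor.load()
network = predictor.neural_networks[0]

config = network.get_config()    # everything except the weights
weights = network.get_weights()  # list of per-layer numpy arrays

restored = Class1NeuralNetwork.from_config(config, weights=weights)
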
load_weights()[source]

Load weights by evaluating self.network_weights_loader, if needed.

After calling this, self.network_weights_loader will be None and self.network_weights will be the weights list, if available.

get_weights()[source]

Get the network weights.

Returns
list of numpy.array giving weights for each layer, or None if there is no network
peptides_to_network_input(peptides)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
peptides : EncodableSequences or list of string
Returns
numpy.array
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported, inclusive.

Returns
(int, int) tuple
allele_encoding_to_network_input(allele_encoding)[source]

Encode alleles to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
allele_encoding : AlleleEncoding
Returns
(numpy.array, numpy.array)
Indices and allele representations.
static data_dependent_weights_initialization(network, x_dict=None, method='lsuv', verbose=1)[source]

Data dependent weights initialization.

Parameters
network : keras.Model
x_dict : dict of string -> numpy.ndarray

Training data as would be passed to keras.Model.fit().

method : string

Initialization method. Currently only “lsuv” is supported.

verbose : int

Status updates are printed to stdout if verbose > 0.

fit_generator(generator, validation_peptide_encoding, validation_affinities, validation_allele_encoding=None, validation_inequalities=None, validation_output_indices=None, steps_per_epoch=10, epochs=1000, min_epochs=0, patience=10, min_delta=0.0, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit using a generator. Does not support many of the features of fit(), such as random negative peptides.

Fitting proceeds until early stopping is hit, using the peptides, affinities, etc. given by the parameters starting with “validation_”.

This is used for pre-training pan-allele models using data synthesized by the allele-specific models.

Parameters
generator : generator yielding (alleles, peptides, affinities) tuples

where alleles and peptides are lists of strings, and affinities is a list of floats.

validation_peptide_encoding : EncodableSequences
validation_affinities : list of float
validation_allele_encoding : AlleleEncoding
validation_inequalities : list of string
validation_output_indices : list of int
steps_per_epoch : int
epochs : int
min_epochs : int
patience : int
min_delta : float
verbose : int
progress_callback : thunk
progress_preamble : string
progress_print_interval : float
fit(peptides, affinities, allele_encoding=None, inequalities=None, output_indices=None, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
peptides : EncodableSequences or list of string
affinities : list of float

nM affinities. Must be the same length as peptides.

allele_encoding : AlleleEncoding

If not specified, the model will be a single-allele predictor.

inequalities : list of string, each element one of “>”, “<”, or “=”

Inequalities to use for fitting. Same length as affinities. For example, a “>” will train on y_pred > y_true for that element in the training set. Requires using a custom loss that supports inequalities (e.g. mse_with_inequalities). If None, all inequalities are taken to be “=”.

output_indices : list of int

For multi-output models only. Same length as affinities. Indicates the index of the output (starting from 0) for each training example.

sample_weights : list of float

If not specified, all samples (including random negatives added during training) will have equal weight. If specified, the random negatives will be assigned weight=1.0.

shuffle_permutation : list of int

Permutation (integer list) of the same length as peptides and affinities. If None, a random permutation will be generated.

verbose : int

Keras verbosity level.

progress_callback : function

No-argument function to call after each epoch.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress updates. Set to None to disable.
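
A minimal fitting sketch on toy data, with two measurements treated as lower bounds via inequalities (the default hyperparameters are assumed to use a loss that supports inequalities):

from mhcflurry import Class1NeuralNetwork

network = Class1NeuralNetwork()  # default hyperparameters

network.fit(
    peptides=["SIINFEKL", "SIINFEKD", "AAAWYLWEV", "QLLNFDLLK"],
    affinities=[120.0, 300.0, 20000.0, 50000.0],  # nM
    # "=" means measured exactly; ">" means the true affinity is weaker
    # than the stated value (requires e.g. mse_with_inequalities).
    inequalities=["=", "=", ">", ">"],
    verbose=0)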

predict(peptides, allele_encoding=None, batch_size=4096, output_index=0)[source]

Predict affinities.

If peptides are specified as EncodableSequences, then the predictions will be cached for this predictor as long as the EncodableSequences object remains in memory. The cache is keyed on the object identity of the EncodableSequences, not the sequences themselves. The cache is used only for allele-specific models (i.e. when allele_encoding is None).

Parameters
peptides : EncodableSequences or list of string
allele_encoding : AlleleEncoding, optional

Only required when this model is a pan-allele model.

batch_size : int

batch_size passed to Keras.

output_index : int or None

For multi-output models. Gives the output index to return. If set to None, then all outputs are returned as a samples x outputs matrix.

Returns
numpy.array of nM affinity predictions
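
A sketch of the caching behavior; the models directory is hypothetical, and the network is assumed to be allele-specific (a pan-allele network would additionally need allele_encoding):

from mhcflurry import Class1AffinityPredictor
from mhcflurry.encodable_sequences import EncodableSequences

predictor = Class1AffinityPredictor.load("/path/to/allele_specific_models")
network = predictor.neural_networks[0]

# Wrapping peptides in EncodableSequences lets the per-object prediction
# cache take effect across repeated calls on the same object.
peptides = EncodableSequences.create(["SIINFEKL", "SIINFEKD"])
network.predict(peptides)  # computed and cached
network.predict(peptides)  # served from the cache
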
classmethod merge(models, merge_method='average')[source]

Merge multiple models at the tensorflow (or other backend) level.

Only certain neural network architectures support merging. Others will result in a NotImplementedError.

Parameters
models : list of Class1NeuralNetwork

Instances to merge.

merge_method : string, one of “average”, “sum”, or “concatenate”

How to merge the predictions of the different models.

Returns
Class1NeuralNetwork

The merged neural network

make_network(peptide_encoding, allele_amino_acid_encoding, allele_dense_layer_sizes, peptide_dense_layer_sizes, peptide_allele_merge_method, peptide_allele_merge_activation, layer_sizes, dense_layer_l1_regularization, dense_layer_l2_regularization, activation, init, output_activation, dropout_probability, batch_normalization, locally_connected_layers, topology, num_outputs=1, allele_representations=None)[source]

Helper function to make a keras network for class 1 affinity prediction.

clear_allele_representations()[source]

Set allele representations to an empty array. Useful before saving to save a smaller version of the model.

set_allele_representations(allele_representations, force_surgery=False)[source]

Set the allele representations in use by this model. This means mutating the weights for the allele input embedding layer.

Rationale: instead of passing in the allele sequence for each data point during model training or prediction (which is expensive in terms of memory usage), we pass in an allele index between 0 and n-1, where n is the number of alleles in some universe of possible alleles. This index is used in the model to look up the corresponding allele sequence. This function sets the lookup table.

See also: AlleleEncoding.allele_representations()

Parameters
allele_representations : numpy.ndarray of shape (a, l, m)

where a is the total number of alleles, l is the allele sequence length, and m is the length of the vectors used to represent amino acids.

class mhcflurry.Class1ProcessingPredictor(models, manifest_df=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

User-facing interface to antigen processing prediction.

Delegates to an ensemble of Class1ProcessingNeuralNetwork instances.

Instantiate a new Class1ProcessingPredictor

Users will generally call load() to restore a saved predictor rather than using this constructor.

Parameters
models : list of Class1ProcessingNeuralNetwork

Neural networks in the ensemble.

manifest_df : pandas.DataFrame

Manifest dataframe. If not specified a new one will be created when needed.

metadata_dataframes : dict of string -> pandas.DataFrame

Arbitrary metadata associated with this predictor.

provenance_string : string, optional

Optional info string to use in __str__.

property sequence_lengths

Supported maximum sequence lengths.

Passing a peptide longer than the maximum supported length results in an error.

Passing an N- or C-flank sequence longer than the maximum supported length results in some part of it being ignored.

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
add_models(models)[source]

Add models to the ensemble (in-place).

Parameters
models : list of Class1ProcessingNeuralNetwork
Returns
list of string
Names of the new models.
property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Returns
pandas.DataFrame
static model_name(num)[source]

Generate a model name

Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dir : string
model_name : string
Returns
string
predict(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptides : list of string

Peptide sequences.

n_flanks : list of string

Upstream sequence before each peptide.

c_flanks : list of string

Downstream sequence after each peptide.

throw : boolean

If True, a ValueError will be raised in the case of unsupported peptides. If False, a warning will be logged and the predictions for those peptides will be NaN.

batch_size : int

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
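
A usage sketch (it assumes the processing models have been downloaded; flanking sequences are amino acids):

from mhcflurry import Class1ProcessingPredictor

predictor = Class1ProcessingPredictor.load()

scores = predictor.predict(
    peptides=["SIINFEKL", "KSIINFEKL"],
    n_flanks=["MNSAG", "GSSQK"],   # upstream of each peptide
    c_flanks=["LLLVV", "EDKEQ"])   # downstream of each peptide
# scores is a numpy array in [0, 1]; higher means more favorable processing.
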
predict_to_dataframe(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for parameter descriptions.

Returns
pandas.DataFrame
Processing predictions are in the “score” column. Also includes
peptides and flanking sequences.
predict_to_dataframe_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for more information.

Parameters
sequences : FlankingEncoding
batch_size : int
throw : boolean
Returns
pandas.DataFrame
check_consistency()[source]

Verify that self.manifest_df is consistent with instance variables.

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1ProcessingNeuralNetwork, along with per-network files giving the model weights.

Parameters
models_dir : string

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dir : string

Path to directory. If unspecified the default downloaded models are used.

max_models : int, optional

Maximum number of models to load.

Returns
Class1ProcessingPredictor instance
class mhcflurry.Class1ProcessingNeuralNetwork(**hyperparameters)[source]

Bases: object

A neural network for antigen processing prediction

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters. Any values supported by keras may be used.

auxiliary_input_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Allele feature hyperparameters.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Combined set of all supported hyperparameters and their default values.

property sequence_lengths

Supported maximum sequence lengths.

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
network()[source]

Return the keras model associated with this network.

update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance’s neural network.

fit(sequences, targets, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
sequences : FlankingEncoding

Peptides and upstream/downstream flanking sequences.

targets : list of float

1 indicates hit, 0 indicates decoy.

sample_weights : list of float

If not specified all samples have equal weight.

shuffle_permutation : list of int

Permutation (integer list) of the same length as the training data. If None, a random permutation will be generated.

verbose : int

Keras verbosity level.

progress_callback : function

No-argument function to call after each epoch.

progress_preamble : string

Optional string of information to include in each progress update.

progress_print_interval : float

How often (in seconds) to print progress updates. Set to None to disable.

predict(peptides, n_flanks=None, c_flanks=None, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptides : list of string

Peptide sequences.

n_flanks : list of string

Upstream sequence before each peptide.

c_flanks : list of string

Downstream sequence after each peptide.

batch_size : int

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
predict_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
sequences : FlankingEncoding

Peptides and flanking sequences.

throw : boolean

Whether to throw an exception on unsupported peptides.

batch_size : int

Prediction keras batch size.

Returns
numpy.array
network_input(sequences, throw=True)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
sequences : FlankingEncoding

Peptides and flanking sequences.

throw : boolean

Whether to throw an exception on unsupported peptides.

Returns
numpy.array
make_network(amino_acid_encoding, peptide_max_length, n_flank_length, c_flank_length, flanking_averages, convolutional_filters, convolutional_kernel_size, convolutional_activation, convolutional_kernel_l1_l2, dropout_rate, post_convolutional_dense_layer_sizes)[source]

Helper function to make a keras network given hyperparameters.

get_weights()[source]

Get the network weights.

Returns
list of numpy.array giving weights for each layer, or None if there is no network
get_config()[source]

Serialize all attributes except model weights to a dict.

Returns
dict
classmethod from_config(config, weights=None)[source]

Deserialize from a dict returned by get_config().

Parameters
config : dict
weights : list of array, optional

Network weights to restore.

Returns
Class1ProcessingNeuralNetwork
class mhcflurry.Class1PresentationPredictor(affinity_predictor=None, processing_predictor_with_flanks=None, processing_predictor_without_flanks=None, weights_dataframe=None, metadata_dataframes=None, percent_rank_transform=None, provenance_string=None)[source]

Bases: object

A logistic regression model over predicted binding affinity (BA) and antigen processing (AP) score.

Instances of this class delegate to Class1AffinityPredictor and Class1ProcessingPredictor instances to generate BA and AP predictions. These predictions are combined using a logistic regression model to give a “presentation score” prediction.

Most users will call the load static method to get an instance of this class, then call the predict method to generate predictions.

model_inputs = ['affinity_score', 'processing_score']

property supported_alleles

List of alleles supported by the underlying Class1AffinityPredictor

property supported_peptide_lengths

(min, max) of supported peptide lengths, inclusive.

property supports_affinity_prediction

Is there an affinity predictor associated with this instance?

property supports_processing_prediction

Is there a processing predictor associated with this instance?

property supports_presentation_prediction

Can this instance predict presentation?

predict_affinity(peptides, alleles, sample_names=None, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict binding affinities across samples (each corresponding to up to six MHC I alleles).

Two modes are supported: each peptide can be evaluated for binding to any of the alleles in any sample (this is what happens when sample_names is None), or the i’th peptide can be evaluated for binding the alleles of the sample given by the i’th entry in sample_names.

For example, if we don’t specify sample_names, then predictions are taken for all combinations of samples and peptides, for a result size of num peptides * num samples:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample1  11927.161       A0201                6.296
1   PEPTIDE            1     sample1  32507.083       A0201               71.249
2  SIINFEKL            0     sample2   2725.593       C0202                6.662
3   PEPTIDE            1     sample2  28304.330       C0202               54.652

In contrast, here we specify sample_names, so each peptide is evaluated for binding the alleles in the corresponding sample, for a result size equal to the number of peptides:

>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    sample_names=["sample2", "sample1"],
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample2   2725.592       C0202                6.662
1   PEPTIDE            1     sample1  32507.079       A0201               71.249
Parameters
peptides : list of string

Peptide sequences.

alleles : dict of string -> list of string

Keys are sample names, values are the alleles (genotype) for that sample.

sample_names : list of string [same length as peptides]

Sample names corresponding to each peptide. If None, then predictions are generated for all sample genotypes across all peptides.

include_affinity_percentile : bool

Whether to include affinity percentile ranks.

verbose : int

Set to 0 for quiet.

throw : boolean

Whether to throw an exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFrame
Predictions.
predict_processing(peptides, n_flanks=None, c_flanks=None, throw=True, verbose=1)[source]

Predict antigen processing scores for individual peptides, optionally including flanking sequences for better cleavage prediction.

Parameters
peptides : list of string
n_flanks : list of string [same length as peptides]
c_flanks : list of string [same length as peptides]
throw : boolean

Whether to raise an exception on unsupported peptides.

verbose : int
Returns
numpy.array
Antigen processing scores for each peptide.
fit(targets, peptides, sample_names, alleles, n_flanks=None, c_flanks=None, verbose=1)[source]

Fit the presentation score logistic regression model.

Parameters
targets : list of int/float

1 indicates hit, 0 indicates decoy.

peptides : list of string [same length as targets]
sample_names : list of string [same length as targets]
alleles : dict of string -> list of string

Keys are sample names, values are the alleles for that sample.

n_flanks : list of string [same length as targets]
c_flanks : list of string [same length as targets]
verbose : int
get_model(name=None)[source]

Load or instantiate a new logistic regression model. Private helper method.

Parameters
name : string

If None (the default), an un-fit LR model is returned. Otherwise the weights are loaded for the specified model.

Returns
sklearn.linear_model.LogisticRegression
predict(peptides, alleles, sample_names=None, n_flanks=None, c_flanks=None, include_affinity_percentile=False, verbose=1, throw=True)[source]

Predict presentation scores across a set of peptides.

Presentation scores combine predictions for MHC I binding affinity and antigen processing.

This method returns a pandas.DataFrame giving presentation scores plus the binding affinity and processing predictions and other intermediate results.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    n_flanks=["NNN", "SNS"],
...    c_flanks=["CCC", "CNC"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide n_flank c_flank  peptide_num sample_name   affinity best_allele  processing_score  presentation_score  presentation_percentile
0  SIINFEKL     NNN     CCC            0     sample1  11927.161       A0201             0.838               0.145                    2.282
1   PEPTIDE     SNS     CNC            1     sample1  32507.083       A0201             0.025               0.003                  100.000
2  SIINFEKL     NNN     CCC            0     sample2   2725.593       C0202             0.838               0.416                    1.017
3   PEPTIDE     SNS     CNC            1     sample2  28304.330       C0202             0.025               0.003                   99.287

You can also specify sample_names, in which case each peptide is evaluated for binding the alleles in the corresponding sample only. See predict_affinity for an example.

Parameters
peptides : list of string

Peptide sequences.

alleles : list of string or dict of string -> list of string

If you are predicting for a single sample, pass a list of strings (up to 6) indicating the genotype. If you are predicting across multiple samples, pass a dict where the keys are (arbitrary) sample names and the values are the alleles to predict for that sample. Set to an empty list or dict to perform processing prediction only.

sample_names : list of string [same length as peptides]

If you are passing a dict for ‘alleles’, you can use this argument to specify which peptides go with which samples. If it is None, then predictions will be performed for each peptide across all samples.

n_flanks : list of string [same length as peptides]

Upstream sequences before the peptide. Sequences of any length can be given and a suffix of the size supported by the model will be used.

c_flanks : list of string [same length as peptides]

Downstream sequences after the peptide. Sequences of any length can be given and a prefix of the size supported by the model will be used.

include_affinity_percentile : bool

Whether to include affinity percentile ranks.

verbose : int

Set to 0 for quiet.

throw : boolean

Whether to throw an exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFrame
Presentation scores and intermediate results.
predict_sequences(sequences, alleles, result='best', comparison_quantity=None, filter_value=None, peptide_lengths=[8, 9, 10, 11], use_flanks=True, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict presentation across protein sequences.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_sequences(
...    sequences={
...        'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...        'protein2': "SSLPTPEDKEQAQQTHH",
...    },
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    result="filtered",
...    comparison_quantity="affinity",
...    filter_value=500,
...    verbose=0)
  sequence_name  pos     peptide n_flank c_flank sample_name  affinity best_allele  affinity_percentile  processing_score  presentation_score  presentation_percentile
0      protein1   14   LLLVVSNLL   GSRLL             sample1    57.180       A0201                0.398             0.233               0.754                    0.351
1      protein1   13   LLLLVVSNL   KGSRL       L     sample1    57.339       A0201                0.398             0.031               0.586                    0.643
2      protein1    5   SSQKGSRLL   MDSKG   LLLVV     sample2   110.779       C0202                0.782             0.061               0.456                    0.920
3      protein1    6   SQKGSRLLL   DSKGS   LLVVS     sample2   254.480       C0202                1.735             0.102               0.303                    1.356
4      protein1   13  LLLLVVSNLL   KGSRL             sample1   260.390       A0201                1.012             0.158               0.345                    1.215
5      protein1   12  LLLLLVVSNL   QKGSR       L     sample1   308.150       A0201                1.094             0.015               0.206                    1.802
6      protein2    0   SSLPTPEDK           EQAQQ     sample2   410.354       C0202                2.398             0.003               0.158                    2.155
7      protein1    5    SSQKGSRL   MDSKG   LLLLV     sample2   444.321       C0202                2.512             0.026               0.159                    2.138
8      protein2    0   SSLPTPEDK           EQAQQ     sample1   459.296       A0301                0.971             0.003               0.144                    2.292
9      protein1    4   GSSQKGSRL    MDSK   LLLLV     sample2   469.052       C0202                2.595             0.014               0.146                    2.261
Parameters
sequences : str, list of string, or string -> string dict

Protein sequences. If a dict is given, the keys are arbitrary (e.g. protein names), and the values are the amino acid sequences.

alleles : list of string, list of list of string, or dict of string -> list of string

MHC I alleles. Can be: (1) a string (a single allele), (2) a list of strings (a single genotype), (3) a list of lists of strings (multiple genotypes, where the total number of genotypes must equal the number of sequences), or (4) a dict giving multiple genotypes, which will each be run over the sequences.

result : string

Specify ‘best’ to return the strongest peptide for each sequence, ‘all’ to return predictions for all peptides, or ‘filtered’ to return predictions where the comparison_quantity is stronger (i.e. (<) for affinity, (>) for scores) than filter_value.

comparison_quantity : string

One of “presentation_score”, “processing_score”, “affinity”, or “affinity_percentile”. Prediction to use to rank (if result is “best”) or filter (if result is “filtered”) results. Default is “presentation_score”.

filter_value : float

Threshold value to use, only relevant when result is “filtered”. If comparison_quantity is “affinity”, then all results less than (i.e. tighter than) the specified nM affinity are retained. If it’s “presentation_score” or “processing_score” then results greater than the indicated filter_value are retained.

peptide_lengths : list of int

Peptide lengths to predict for.

use_flanks : bool

Whether to include flanking sequences when running the AP predictor (for better cleavage prediction).

include_affinity_percentile : bool

Whether to include affinity percentile ranks in output.

verbose : int

Set to 0 for quiet mode.

throw : boolean

Whether to throw exceptions (vs. log warnings) on invalid inputs.

Returns
pandas.DataFrame with columns:

peptide, n_flank, c_flank, sequence_name, affinity, best_allele, processing_score, presentation_score

save(models_dir, write_affinity_predictor=True, write_processing_predictor=True, write_weights=True, write_percent_ranks=True, write_info=True, write_metdata=True)[source]

Save the predictor to a directory on disk. If the directory does not exist it will be created.

The wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances are included in the saved data.

Parameters
models_dir : string

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

This will also load the wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances.

Parameters
models_dir : string

Path to directory. If unspecified the default downloaded models are used.

max_models : int, optional

Maximum number of affinity and processing models (counted separately) to load.

Returns
Class1PresentationPredictor instance
percentile_ranks(presentation_scores, throw=True)[source]

Return percentile ranks for the given presentation scores.

Parameters
presentation_scores : sequence of float
Returns
numpy.array of float
calibrate_percentile_ranks(scores, bins=None)[source]

Compute the cumulative distribution of scores, to enable taking quantiles of this distribution later.

Parameters
scores : sequence of float

Presentation prediction scores.

bins : object

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges.

Submodules

mhcflurry.allele_encoding module

class mhcflurry.allele_encoding.AlleleEncoding(alleles=None, allele_to_sequence=None, borrow_from=None)[source]

Bases: object

A place to cache encodings for a sequence of alleles.

We frequently work with alleles by integer indices, for example as inputs to neural networks. This class is used to map allele names to integer indices in a consistent way by keeping track of the universe of alleles under use, i.e. a distinction is made between the universe of supported alleles (what’s in allele_to_sequence) and the actual set of alleles used for some task (what’s in alleles).

Parameters
alleles : list of string

Allele names. If any allele is None instead of a string, it will be mapped to the special index value -1.

allele_to_sequence : dict of str -> str

Allele name to amino acid sequence.

borrow_from : AlleleEncoding, optional

If specified, do not specify allele_to_sequence. The sequences from the provided instance are used. This guarantees that the mappings from allele to index and from allele to sequence are the same between the instances.

compact()[source]

Return a new AlleleEncoding in which the universe of supported alleles is only the alleles actually used.

Returns
AlleleEncoding
allele_representations(encoding_name)[source]

Encode the universe of supported allele sequences to a matrix.

Parameters
encoding_name : string

How to represent amino acids. Valid names are “BLOSUM62” or “one-hot”. See amino_acid.ENCODING_DATA_FRAMES.

Returns
numpy.array of shape (num alleles in universe, sequence length, vector size)

where vector size is usually 21 (20 amino acids + X character).
fixed_length_vector_encoded_sequences(encoding_name)[source]

Encode allele sequences (not the universe of alleles) to a matrix.

Parameters
encoding_name : string

How to represent amino acids. Valid names are “BLOSUM62” or “one-hot”. See amino_acid.ENCODING_DATA_FRAMES.

Returns
numpy.array with shape (num alleles, sequence length, vector size)

where vector size is usually 21 (20 amino acids + X character).
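
A usage sketch. It assumes a loaded pan-allele predictor exposes its allele_to_sequence mapping, and that allele names are written in the same format as that mapping’s keys:

from mhcflurry import Class1AffinityPredictor
from mhcflurry.allele_encoding import AlleleEncoding

predictor = Class1AffinityPredictor.load()

encoding = AlleleEncoding(
    alleles=["HLA-A*02:01", "HLA-B*07:02", "HLA-A*02:01"],  # repeats are fine
    allele_to_sequence=predictor.allele_to_sequence)

# Matrix over the full universe of supported alleles:
universe = encoding.allele_representations("BLOSUM62")

# Matrix over just the three alleles listed above:
per_allele = encoding.fixed_length_vector_encoded_sequences("BLOSUM62")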

mhcflurry.amino_acid module

Functions for encoding fixed length sequences of amino acids into various vector representations, such as one-hot and BLOSUM62.

mhcflurry.amino_acid.available_vector_encodings()[source]

Return list of supported amino acid vector encodings.

Returns
list of string
mhcflurry.amino_acid.vector_encoding_length(name)[source]

Return the length of the given vector encoding.

Parameters
name : string
Returns
int
mhcflurry.amino_acid.index_encoding(sequences, letter_to_index_dict)[source]

Encode a sequence of same-length strings to a matrix of integers of the same shape. The map from characters to integers is given by letter_to_index_dict.

Given a sequence of n strings all of length k, return an n * k array where the (i, j)th element is letter_to_index_dict[sequences[i][j]].

Parameters
sequences : list of length n of strings of length k
letter_to_index_dict : dict
Returns
numpy.array of integers with shape (n, k)
mhcflurry.amino_acid.fixed_vectors_encoding(index_encoded_sequences, letter_to_vector_df)[source]

Given an n x k matrix of integers such as that returned by index_encoding() and a dataframe mapping each index to an arbitrary vector, return an n * k * m array where the (i, j)’th element is letter_to_vector_df.iloc[index_encoded_sequences[i][j]].

The dataframe index and column names are ignored here; the indexing is done entirely by integer position in the dataframe.

Parameters
index_encoded_sequences : n x k array of integers
letter_to_vector_df : pandas.DataFrame of shape (alphabet size, m)
Returns
numpy.array with shape (n, k, m)
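
A sketch chaining the two functions; AMINO_ACID_INDEX is assumed to be the module’s letter-to-index dict (any such dict works):

from mhcflurry import amino_acid

peptides = ["SIINFEKL", "QLLNFDLL"]  # must all have the same length

# Letters -> integer indices, shape (2, 8).
index_encoded = amino_acid.index_encoding(
    peptides, amino_acid.AMINO_ACID_INDEX)  # assumed dict name

# Indices -> BLOSUM62 vectors, shape (2, 8, vector size).
vectors = amino_acid.fixed_vectors_encoding(
    index_encoded, amino_acid.ENCODING_DATA_FRAMES["BLOSUM62"])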

mhcflurry.calibrate_percentile_ranks_command module

Calibrate percentile ranks for models. Runs in-place.

mhcflurry.calibrate_percentile_ranks_command.run(argv=sys.argv[1:])[source]
mhcflurry.calibrate_percentile_ranks_command.run_class1_presentation_predictor(args, peptides)[source]
mhcflurry.calibrate_percentile_ranks_command.run_class1_affinity_predictor(args, peptides)[source]
mhcflurry.calibrate_percentile_ranks_command.do_class1_affinity_calibrate_percentile_ranks(alleles, constant_data={})[source]
mhcflurry.calibrate_percentile_ranks_command.class1_affinity_calibrate_percentile_ranks(allele, predictor, peptides=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]

mhcflurry.class1_affinity_predictor module

class mhcflurry.class1_affinity_predictor.Class1AffinityPredictor(allele_to_allele_specific_models=None, class1_pan_allele_models=None, allele_to_sequence=None, manifest_df=None, allele_to_percent_rank_transform=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

High-level interface for peptide/MHC I binding affinity prediction.

This class manages low-level Class1NeuralNetwork instances, each of which wraps a single Keras network. The purpose of Class1AffinityPredictor is to implement ensembles, handling of multiple alleles, and predictor loading and saving. It also provides a place to keep track of metadata like prediction histograms for percentile rank calibration.

Parameters
allele_to_allele_specific_modelsdict of string -> list of Class1NeuralNetwork

Ensemble of single-allele models to use for each allele.

class1_pan_allele_modelslist of Class1NeuralNetwork

Ensemble of pan-allele models.

allele_to_sequencedict of string -> string

MHC allele name to fixed-length amino acid sequence (sometimes referred to as the pseudosequence). Required only if class1_pan_allele_models is specified.

manifest_dfpandas.DataFrame, optional

Must have columns: model_name, allele, config_json, model. Only required if you want to update an existing serialization of a Class1AffinityPredictor. Otherwise this dataframe will be generated automatically based on the supplied models.

allele_to_percent_rank_transformdict of string -> PercentRankTransform, optional

PercentRankTransform instances to use for each allele

metadata_dataframesdict of string -> pandas.DataFrame, optional

Optional additional dataframes to write to the models dir when save() is called. Useful for tracking provenance.

provenance_stringstring, optional

Optional info string to use in __str__.

property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Based on: - self.class1_pan_allele_models - self.allele_to_allele_specific_models

Returns
pandas.DataFrame
clear_cache()[source]

Clear values cached based on the neural networks in this predictor.

Users should call this after mutating any of the following:
  • self.class1_pan_allele_models

  • self.allele_to_allele_specific_models

  • self.allele_to_sequence

Methods that mutate these instance variables will call this method on their own if needed.

property neural_networks

List of the neural networks in the ensemble.

Returns
list of Class1NeuralNetwork
classmethod merge(predictors)[source]

Merge the ensembles of two or more Class1AffinityPredictor instances.

Note: the resulting merged predictor will NOT have calibrated percentile ranks. Call calibrate_percentile_ranks on it if these are needed.

Parameters
predictorssequence of Class1AffinityPredictor
Returns
Class1AffinityPredictor instance
merge_in_place(others)[source]

Add the models present in other predictors into the current predictor.

Parameters
otherslist of Class1AffinityPredictor

Other predictors to merge into the current predictor.

Returns
list of stringnames of newly added models
property supported_alleles

Alleles for which predictions can be made.

Returns
list of string
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported by all models, inclusive.

Returns
(int, int) tuple
check_consistency()[source]

Verify that self.manifest_df is consistent with: - self.class1_pan_allele_models - self.allele_to_allele_specific_models

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1NeuralNetwork, along with per-network files giving the model weights. If there are pan-allele predictors in the ensemble, the allele sequences are also stored in the directory. There is also a small file “index.txt” with basic metadata: when the models were trained, by whom, on what host.

Parameters
models_dirstring

Path to directory. It will be created if it doesn’t exist.

model_names_to_writelist of string, optional

Only write the weights for the specified models. Useful for incremental updates during training.

write_metadataboolean, optional

Whether to write optional metadata

static load(models_dir=None, max_models=None, optimization_level=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dirstring

Path to directory. If unspecified the default downloaded models are used.

max_modelsint, optional

Maximum number of Class1NeuralNetwork instances to load

optimization_levelint

If >0, model optimization will be attempted. Defaults to value of environment variable MHCFLURRY_OPTIMIZATION_LEVEL.

Returns
Class1AffinityPredictor instance
optimize(warn=True)[source]

EXPERIMENTAL: Optimize the predictor for faster predictions.

Currently the only optimization implemented is to merge multiple pan- allele predictors at the tensorflow level.

The optimization is performed in-place, mutating the instance.

Returns
bool

Whether optimization was performed

static model_name(allele, num)[source]

Generate a model name

Parameters
allelestring
numint
Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dirstring
model_namestring
Returns
string
property master_allele_encoding

An AlleleEncoding containing the universe of alleles specified by self.allele_to_sequence.

Returns
AlleleEncoding
fit_allele_specific_predictors(n_models, architecture_hyperparameters_list, allele, peptides, affinities, inequalities=None, train_rounds=None, models_dir_for_save=None, verbose=0, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more allele specific predictors for a single allele using one or more neural network architectures.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_modelsint

Number of neural networks to fit

architecture_hyperparameters_listlist of dict

List of hyperparameter sets.

allelestring
peptidesEncodableSequences or list of string
affinitieslist of float

nM affinities

inequalitieslist of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

train_roundssequence of int

Each training point i will be used on training rounds r for which train_rounds[i] > r, r >= 0.

models_dir_for_savestring, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verboseint

Keras verbosity

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
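
A minimal sketch of fitting a single-network, single-allele ensemble. The training data is illustrative only, and the empty hyperparameter dict is assumed to fall back to the package defaults:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor()
    predictor.fit_allele_specific_predictors(
        n_models=1,
        architecture_hyperparameters_list=[{}],  # assumption: defaults apply
        allele="HLA-A*02:01",
        peptides=["SIINFEKL", "SIINFEKD", "SIINFEKQ"],  # illustrative
        affinities=[100.0, 25000.0, 30000.0],           # nM
    )
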
fit_class1_pan_allele_models(n_models, architecture_hyperparameters, alleles, peptides, affinities, inequalities, models_dir_for_save=None, verbose=1, progress_preamble='', progress_print_interval=5.0)[source]

Fit one or more pan-allele predictors using a single neural network architecture.

The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.

Parameters
n_modelsint

Number of neural networks to fit

architecture_hyperparametersdict
alleleslist of string

Allele names (not sequences) corresponding to each peptide

peptidesEncodableSequences or list of string
affinitieslist of float

nM affinities

inequalitieslist of string, each element one of “>”, “<”, or “=”

See Class1NeuralNetwork.fit for details.

models_dir_for_savestring, optional

If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.

verboseint

Keras verbosity

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress. Set to None to disable.

Returns
list of Class1NeuralNetwork
add_pan_allele_model(model, models_dir_for_save=None)[source]

Add a pan-allele model to the ensemble and optionally do an incremental save.

Parameters
modelClass1NeuralNetwork
models_dir_for_savestring

Directory to save resulting ensemble to

percentile_ranks(affinities, allele=None, alleles=None, throw=True)[source]

Return percentile ranks for the given ic50 affinities and alleles.

The ‘allele’ and ‘alleles’ argument are as in the predict method. Specify one of these.

Parameters
affinitiessequence of float

nM affinities

allelestring
allelessequence of string
throwboolean

If True, a ValueError will be raised in the case of unsupported alleles. If False, a warning will be logged and NaN will be returned for those percentile ranks.

Returns
numpy.array of float
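
A sketch using the default downloaded models, which ship with calibrated percentile ranks:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    ranks = predictor.percentile_ranks(
        [100.0, 5000.0],          # nM affinities
        allele="HLA-A*02:01")
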
predict(peptides, alleles=None, allele=None, throw=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model (nM) predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptidesEncodableSequences or list of string
alleleslist of string
allelestring
throwboolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

centrality_measurestring or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargsdict

Additional keyword arguments to pass to Class1NeuralNetwork.predict

Returns
numpy.array of predictions
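
For example, a sketch using the default downloaded models:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    affinities = predictor.predict(
        peptides=["SIINFEKL", "SYFPEITHI"],
        allele="HLA-A*02:01")
    # affinities is a numpy array of predicted nM values, one per peptide.
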
predict_to_dataframe(peptides, alleles=None, allele=None, throw=True, include_individual_model_predictions=False, include_percentile_ranks=True, include_confidence_intervals=True, centrality_measure='mean', model_kwargs={})[source]

Predict nM binding affinities. Gives more detailed output than predict method, including 5-95% prediction intervals.

If multiple predictors are available for an allele, the predictions are the geometric means of the individual model predictions.

One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.

Parameters
peptidesEncodableSequences or list of string
alleleslist of string
allelestring
throwboolean

If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.

include_individual_model_predictionsboolean

If True, the predictions of each individual model are included as columns in the result DataFrame.

include_percentile_ranksboolean, default True

If True, a “prediction_percentile” column will be included giving the percentile ranks. If no percentile rank info is available, this will be ignored with a warning.

centrality_measurestring or callable

Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.

model_kwargsdict

Additional keyword arguments to pass to Class1NeuralNetwork.predict

Returns
pandas.DataFrame of predictions
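
Called the same way as predict, but returning a DataFrame. A sketch:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    df = predictor.predict_to_dataframe(
        peptides=["SIINFEKL", "SYFPEITHI"],
        allele="HLA-A*02:01",
        include_individual_model_predictions=True)
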
calibrate_percentile_ranks(peptides=None, num_peptides_per_length=100000, alleles=None, bins=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]

Compute the cumulative distribution of ic50 values for a set of alleles over a large universe of random peptides, to enable taking quantiles of this distribution later.

Parameters
peptidessequence of string or EncodableSequences, optional

Peptides to use

num_peptides_per_lengthint, optional

If peptides argument is not specified, then num_peptides_per_length peptides are randomly sampled from a uniform distribution for each supported length

allelessequence of string, optional

Alleles to perform calibration for. If not specified all supported alleles will be calibrated.

binsobject

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges. This is in ic50 space.

motif_summarybool

If True, the length distribution and per-position amino acid frequencies are also calculated for the top x fraction of tightest-binding peptides, where each value of x is given in the summary_top_peptide_fractions list.

summary_top_peptide_fractionslist of float

Only used if motif_summary is True

verboseboolean

Whether to print status updates to stdout

model_kwargsdict

Additional low-level Class1NeuralNetwork.predict() kwargs.

Returns
dict of string -> pandas.DataFrame
If motif_summary is True, this will have keys “frequency_matrices” and
“length_distributions”. Otherwise it will be empty.
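
A sketch of recalibrating two alleles on a reduced peptide sample (smaller than the default, for speed):

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()
    predictor.calibrate_percentile_ranks(
        alleles=["HLA-A*02:01", "HLA-B*07:02"],
        num_peptides_per_length=10000)
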
model_select(score_function, alleles=None, min_models=1, max_models=10000)[source]

Perform model selection using a user-specified scoring function.

This works only with allele-specific models, not pan-allele models.

Model selection is done using a “step up” variable selection procedure, in which models are repeatedly added to an ensemble until the score stops improving.

Parameters
score_functionClass1AffinityPredictor -> float function

Scoring function

alleleslist of string, optional

If not specified, model selection is performed for all alleles.

min_modelsint, optional

Min models to select per allele

max_modelsint, optional

Max models to select per allele

Returns
Class1AffinityPredictorpredictor containing the selected models

mhcflurry.class1_neural_network module

class mhcflurry.class1_neural_network.Class1NeuralNetwork(**hyperparameters)[source]

Bases: object

Low level class I predictor consisting of a single neural network.

Both single allele and pan-allele prediction are supported.

Users will generally use Class1AffinityPredictor, which gives a higher-level interface and supports ensembles.

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

miscelaneous_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Miscellaneous hyperparameters. These parameters are not used by this class but may be interpreted by other code.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Combined set of all supported hyperparameters and their default values.

hyperparameter_renames = {'embedding_init_method': None, 'embedding_input_dim': None, 'embedding_output_dim': None, 'kmer_size': None, 'left_edge': None, 'min_delta': None, 'mode': None, 'monitor': None, 'peptide_amino_acid_encoding': None, 'pseudosequence_use_embedding': None, 'right_edge': None, 'take_best_epoch': None, 'use_embedding': None, 'verbose': None}
classmethod apply_hyperparameter_renames(hyperparameters)[source]

Handle hyperparameter renames.

Parameters
hyperparametersdict
Returns
dictupdated hyperparameters
KERAS_MODELS_CACHE = {}

Process-wide keras model cache, a map from: architecture JSON string to (Keras model, existing network weights)

classmethod clear_model_cache()[source]

Clear the Keras model cache.

classmethod borrow_cached_network(network_json, network_weights)[source]

Return a keras Model with the specified architecture and weights. As an optimization, when possible this will reuse architectures from a process-wide cache.

The returned object is “borrowed” in the sense that its weights can change later after subsequent calls to this method from other objects.

If you’re using this from a parallel implementation you’ll need to hold a lock while using the returned object.

Parameters
network_jsonstring of JSON
network_weightslist of numpy.array
Returns
keras.models.Model
network(borrow=False)[source]

Return the keras model associated with this predictor.

Parameters
borrowbool

Whether to return a cached model if possible. See borrow_cached_network for details

Returns
keras.models.Model
update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance's neural network.

static keras_network_cache_key(network_json)[source]

Given a Keras JSON description of a neural network, return a key that uniquely defines this network. Networks that share the same key should have compatible weights matrices and give the same prediction outputs when their weights are the same.

Parameters
network_jsonstring
Returns
string
get_config()[source]

serialize to a dict all attributes except model weights

Returns
dict
classmethod from_config(config, weights=None, weights_loader=None)[source]

deserialize from a dict returned by get_config().

Parameters
configdict
weightslist of array, optional

Network weights to restore

weights_loadercallable, optional

Function to call (no arguments) to load weights when needed

Returns
Class1NeuralNetwork
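
A sketch of a config round trip on a fresh instance (no trained network is required):

    from mhcflurry.class1_neural_network import Class1NeuralNetwork

    network = Class1NeuralNetwork()      # default hyperparameters
    config = network.get_config()
    weights = network.get_weights()      # None until a network exists
    restored = Class1NeuralNetwork.from_config(config, weights=weights)
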
load_weights()[source]

Load weights by evaluating self.network_weights_loader, if needed.

After calling this, self.network_weights_loader will be None and self.network_weights will be the weights list, if available.

get_weights()[source]

Get the network weights

Returns
list of numpy.array giving weights for each layer or None if there is no
network
peptides_to_network_input(peptides)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
peptidesEncodableSequences or list of string
Returns
numpy.array
property supported_peptide_lengths

(minimum, maximum) lengths of peptides supported, inclusive.

Returns
(int, int) tuple
allele_encoding_to_network_input(allele_encoding)[source]

Encode alleles to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
allele_encodingAlleleEncoding
Returns
(numpy.array, numpy.array)
Indices and allele representations.
static data_dependent_weights_initialization(network, x_dict=None, method='lsuv', verbose=1)[source]

Data dependent weights initialization.

Parameters
networkkeras.Model
x_dictdict of string -> numpy.ndarray

Training data, as would be passed to keras.Model.fit().

methodstring

Initialization method. Currently only “lsuv” is supported.

verboseint

Status updates printed to stdout if verbose > 0

fit_generator(generator, validation_peptide_encoding, validation_affinities, validation_allele_encoding=None, validation_inequalities=None, validation_output_indices=None, steps_per_epoch=10, epochs=1000, min_epochs=0, patience=10, min_delta=0.0, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit using a generator. Does not support many of the features of fit(), such as random negative peptides.

Fitting proceeds until early stopping is hit, using the peptides, affinities, etc. given by the parameters starting with “validation_”.

This is used for pre-training pan-allele models using data synthesized by the allele-specific models.

Parameters
generatorgenerator yielding (alleles, peptides, affinities) tuples

where alleles and peptides are lists of strings, and affinities is a list of floats.

validation_peptide_encodingEncodableSequences
validation_affinitieslist of float
validation_allele_encodingAlleleEncoding
validation_inequalitieslist of string
validation_output_indiceslist of int
steps_per_epochint
epochsint
min_epochsint
patienceint
min_deltafloat
verboseint
progress_callbackthunk
progress_preamblestring
progress_print_intervalfloat
fit(peptides, affinities, allele_encoding=None, inequalities=None, output_indices=None, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
peptidesEncodableSequences or list of string
affinitieslist of float

nM affinities. Must be the same length as peptides.

allele_encodingAlleleEncoding

If not specified, the model will be a single-allele predictor.

inequalitieslist of string, each element one of “>”, “<”, or “=”.

Inequalities to use for fitting. Same length as affinities. Each element must be one of “>”, “<”, or “=”. For example, a “>” will train on y_pred > y_true for that element in the training set. Requires using a custom loss that supports inequalities (e.g. mse_with_inequalities). If None all inequalities are taken to be “=”.

output_indiceslist of int

For multi-output models only. Same length as affinities. Indicates the index of the output (starting from 0) for each training example.

sample_weightslist of float

If not specified, all samples (including random negatives added during training) will have equal weight. If specified, the random negatives will be assigned weight=1.0.

shuffle_permutationlist of int

Permutation (integer list) of same length as peptides and affinities. If None, then a random permutation will be generated.

verboseint

Keras verbosity level

progress_callbackfunction

No-argument function to call after each epoch.

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress update. Set to None to disable.

predict(peptides, allele_encoding=None, batch_size=4096, output_index=0)[source]

Predict affinities.

If peptides are specified as EncodableSequences, then the predictions will be cached for this predictor as long as the EncodableSequences object remains in memory. The cache is keyed on the object identity of the EncodableSequences, not the sequences themselves. The cache is used only for allele-specific models (i.e. when allele_encoding is None).

Parameters
peptidesEncodableSequences or list of string
allele_encodingAlleleEncoding, optional

Only required when this model is a pan-allele model

batch_sizeint

batch_size passed to Keras

output_indexint or None

For multi-output models. Gives the output index to return. If set to None, then all outputs are returned as a samples x outputs matrix.

Returns
numpy.array of nM affinity predictions
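
To benefit from the cache, encode the peptides once and reuse the EncodableSequences object across calls. A sketch, assuming the loaded ensemble contains allele-specific networks:

    from mhcflurry import Class1AffinityPredictor
    from mhcflurry.encodable_sequences import EncodableSequences

    predictor = Class1AffinityPredictor.load()
    network = predictor.neural_networks[0]  # a single ensemble member

    sequences = EncodableSequences.create(["SIINFEKL", "SYFPEITHI"])
    first = network.predict(sequences)   # computed, then cached
    second = network.predict(sequences)  # served from the cache
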
classmethod merge(models, merge_method='average')[source]

Merge multiple models at the tensorflow (or other backend) level.

Only certain neural network architectures support merging. Others will result in a NotImplementedError.

Parameters
modelslist of Class1NeuralNetwork

instances to merge

merge_methodstring, one of “average”, “sum”, or “concatenate”

How to merge the predictions of the different models

Returns
Class1NeuralNetwork

The merged neural network

make_network(peptide_encoding, allele_amino_acid_encoding, allele_dense_layer_sizes, peptide_dense_layer_sizes, peptide_allele_merge_method, peptide_allele_merge_activation, layer_sizes, dense_layer_l1_regularization, dense_layer_l2_regularization, activation, init, output_activation, dropout_probability, batch_normalization, locally_connected_layers, topology, num_outputs=1, allele_representations=None)[source]

Helper function to make a keras network for class 1 affinity prediction.

clear_allele_representations()[source]

Set allele representations to an empty array. Useful before saving to save a smaller version of the model.

set_allele_representations(allele_representations, force_surgery=False)[source]

Set the allele representations in use by this model. This means mutating the weights for the allele input embedding layer.

Rationale: instead of passing in the allele sequence for each data point during model training or prediction (which is expensive in terms of memory usage), we pass in an allele index between 0 and n-1, where n is the number of alleles in some universe of possible alleles. This index is used in the model to look up the corresponding allele sequence. This function sets the lookup table.

See also: AlleleEncoding.allele_representations()

Parameters
allele_representationsnumpy.ndarray of shape (a, l, m)

where a is the total number of alleles, l is the allele sequence length, and m is the length of the vectors used to represent amino acids

mhcflurry.class1_presentation_predictor module

class mhcflurry.class1_presentation_predictor.Class1PresentationPredictor(affinity_predictor=None, processing_predictor_with_flanks=None, processing_predictor_without_flanks=None, weights_dataframe=None, metadata_dataframes=None, percent_rank_transform=None, provenance_string=None)[source]

Bases: object

A logistic regression model over predicted binding affinity (BA) and antigen processing (AP) score.

Instances of this class delegate to Class1AffinityPredictor and Class1ProcessingPredictor instances to generate BA and AP predictions. These predictions are combined using a logistic regression model to give a “presentation score” prediction.

Most users will call the load static method to get an instance of this class, then call the predict method to generate predictions.

model_inputs = ['affinity_score', 'processing_score']
property supported_alleles

List of alleles supported by the underlying Class1AffinityPredictor

property supported_peptide_lengths

(min, max) of supported peptide lengths, inclusive.

property supports_affinity_prediction

Is there an affinity predictor associated with this instance?

property supports_processing_prediction

Is there a processing predictor associated with this instance?

property supports_presentation_prediction

Can this instance predict presentation?

predict_affinity(peptides, alleles, sample_names=None, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict binding affinities across samples (each corresponding to up to six MHC I alleles).

Two modes are supported: each peptide can be evaluated for binding to any of the alleles in any sample (this is what happens when sample_names is None), or the i’th peptide can be evaluated for binding the alleles of the sample given by the i’th entry in sample_names.

For example, if we don’t specify sample_names, then predictions are taken for all combinations of samples and peptides, for a result size of num peptides * num samples:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample1  11927.161       A0201                6.296
1   PEPTIDE            1     sample1  32507.083       A0201               71.249
2  SIINFEKL            0     sample2   2725.593       C0202                6.662
3   PEPTIDE            1     sample2  28304.330       C0202               54.652

In contrast, here we specify sample_names, so peptide is evaluated for binding the alleles in the corresponding sample, for a result size equal to the number of peptides:

>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    sample_names=["sample2", "sample1"],
...    verbose=0)
    peptide  peptide_num sample_name   affinity best_allele  affinity_percentile
0  SIINFEKL            0     sample2   2725.592       C0202                6.662
1   PEPTIDE            1     sample1  32507.079       A0201               71.249
Parameters
peptideslist of string

Peptide sequences

allelesdict of string -> list of string

Keys are sample names, values are the alleles (genotype) for that sample

sample_nameslist of string [same length as peptides]

Sample names corresponding to each peptide. If None, then predictions are generated for all sample genotypes across all peptides.

include_affinity_percentilebool

Whether to include affinity percentile ranks

verboseint

Set to 0 for quiet.

throwboolean

Whether to throw exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFramepredictions
predict_processing(peptides, n_flanks=None, c_flanks=None, throw=True, verbose=1)[source]

Predict antigen processing scores for individual peptides, optionally including flanking sequences for better cleavage prediction.

Parameters
peptideslist of string
n_flankslist of string [same length as peptides]
c_flankslist of string [same length as peptides]
throwboolean

Whether to raise exception on unsupported peptides

verboseint
Returns
numpy.arrayAntigen processing scores for each peptide
fit(targets, peptides, sample_names, alleles, n_flanks=None, c_flanks=None, verbose=1)[source]

Fit the presentation score logistic regression model.

Parameters
targetslist of int/float

1 indicates hit, 0 indicates decoy

peptideslist of string [same length as targets]
sample_nameslist of string [same length as targets]
allelesdict of string -> list of string

Keys are sample names, values are the alleles for that sample

n_flankslist of string [same length as targets]
c_flankslist of string [same length as targets]
verboseint
get_model(name=None)[source]

Load or instantiate a new logistic regression model. Private helper method.

Parameters
namestring

If None (the default), an un-fit LR model is returned. Otherwise the weights are loaded for the specified model.

Returns
sklearn.linear_model.LogisticRegression
predict(peptides, alleles, sample_names=None, n_flanks=None, c_flanks=None, include_affinity_percentile=False, verbose=1, throw=True)[source]

Predict presentation scores across a set of peptides.

Presentation scores combine predictions for MHC I binding affinity and antigen processing.

This method returns a pandas.DataFrame giving presentation scores plus the binding affinity and processing predictions and other intermediate results.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    n_flanks=["NNN", "SNS"],
...    c_flanks=["CCC", "CNC"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide n_flank c_flank  peptide_num sample_name   affinity best_allele  processing_score  presentation_score  presentation_percentile
0  SIINFEKL     NNN     CCC            0     sample1  11927.161       A0201             0.838               0.145                    2.282
1   PEPTIDE     SNS     CNC            1     sample1  32507.083       A0201             0.025               0.003                  100.000
2  SIINFEKL     NNN     CCC            0     sample2   2725.593       C0202             0.838               0.416                    1.017
3   PEPTIDE     SNS     CNC            1     sample2  28304.330       C0202             0.025               0.003                   99.287

You can also specify sample_names, in which case each peptide is evaluated for binding the alleles in the corresponding sample only. See predict_affinity for examples.

Parameters
peptideslist of string

Peptide sequences

alleleslist of string or dict of string -> list of string

If you are predicting for a single sample, pass a list of strings (up to 6) indicating the genotype. If you are predicting across multiple samples, pass a dict where the keys are (arbitrary) sample names and the values are the alleles to predict for that sample. Set to an empty list or dict to perform processing prediction only.

sample_nameslist of string [same length as peptides]

If you are passing a dict for ‘alleles’, you can use this argument to specify which peptides go with which samples. If it is None, then predictions will be performed for each peptide across all samples.

n_flankslist of string [same length as peptides]

Upstream sequences before the peptide. Sequences of any length can be given and a suffix of the size supported by the model will be used.

c_flankslist of string [same length as peptides]

Downstream sequences after the peptide. Sequences of any length can be given and a prefix of the size supported by the model will be used.

include_affinity_percentilebool

Whether to include affinity percentile ranks

verboseint

Set to 0 for quiet.

throwboolean

Whether to throw exception (vs. just log a warning) on invalid peptides, etc.

Returns
pandas.DataFrame
Presentation scores and intermediate results.
predict_sequences(sequences, alleles, result='best', comparison_quantity=None, filter_value=None, peptide_lengths=[8, 9, 10, 11], use_flanks=True, include_affinity_percentile=True, verbose=1, throw=True)[source]

Predict presentation across protein sequences.

Example:

>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_sequences(
...    sequences={
...        'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...        'protein2': "SSLPTPEDKEQAQQTHH",
...    },
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    result="filtered",
...    comparison_quantity="affinity",
...    filter_value=500,
...    verbose=0)
  sequence_name  pos     peptide n_flank c_flank sample_name  affinity best_allele  affinity_percentile  processing_score  presentation_score  presentation_percentile
0      protein1   14   LLLVVSNLL   GSRLL             sample1    57.180       A0201                0.398             0.233               0.754                    0.351
1      protein1   13   LLLLVVSNL   KGSRL       L     sample1    57.339       A0201                0.398             0.031               0.586                    0.643
2      protein1    5   SSQKGSRLL   MDSKG   LLLVV     sample2   110.779       C0202                0.782             0.061               0.456                    0.920
3      protein1    6   SQKGSRLLL   DSKGS   LLVVS     sample2   254.480       C0202                1.735             0.102               0.303                    1.356
4      protein1   13  LLLLVVSNLL   KGSRL             sample1   260.390       A0201                1.012             0.158               0.345                    1.215
5      protein1   12  LLLLLVVSNL   QKGSR       L     sample1   308.150       A0201                1.094             0.015               0.206                    1.802
6      protein2    0   SSLPTPEDK           EQAQQ     sample2   410.354       C0202                2.398             0.003               0.158                    2.155
7      protein1    5    SSQKGSRL   MDSKG   LLLLV     sample2   444.321       C0202                2.512             0.026               0.159                    2.138
8      protein2    0   SSLPTPEDK           EQAQQ     sample1   459.296       A0301                0.971             0.003               0.144                    2.292
9      protein1    4   GSSQKGSRL    MDSK   LLLLV     sample2   469.052       C0202                2.595             0.014               0.146                    2.261
Parameters
sequencesstr, list of string, or string -> string dict

Protein sequences. If a dict is given, the keys are arbitrary (e.g. protein names), and the values are the amino acid sequences.

alleleslist of string, list of list of string, or dict of string -> list of string

MHC I alleles. Can be: (1) a string (a single allele), (2) a list of strings (a single genotype), (3) a list of list of strings (multiple genotypes, where the total number of genotypes must equal the number of sequences), or (4) a dict giving multiple genotypes, which will each be run over the sequences.

resultstring

Specify ‘best’ to return the strongest peptide for each sequence, ‘all’ to return predictions for all peptides, or ‘filtered’ to return predictions where the comparison_quantity is stronger (i.e. (<) for affinity, (>) for scores) than filter_value.

comparison_quantitystring

One of “presentation_score”, “processing_score”, “affinity”, or “affinity_percentile”. Prediction to use to rank (if result is “best”) or filter (if result is “filtered”) results. Default is “presentation_score”.

filter_valuefloat

Threshold value to use, only relevant when result is “filtered”. If comparison_quantity is “affinity”, then all results less than (i.e. tighter than) the specified nM affinity are retained. If it’s “presentation_score” or “processing_score” then results greater than the indicated filter_value are retained.

peptide_lengthslist of int

Peptide lengths to predict for.

use_flanksbool

Whether to include flanking sequences when running the AP predictor (for better cleavage prediction).

include_affinity_percentilebool

Whether to include affinity percentile ranks in output.

verboseint

Set to 0 for quiet mode.

throwboolean

Whether to throw exceptions (vs. log warnings) on invalid inputs.

Returns
pandas.DataFrame with columns:

peptide, n_flank, c_flank, sequence_name, affinity, best_allele, processing_score, presentation_score

save(models_dir, write_affinity_predictor=True, write_processing_predictor=True, write_weights=True, write_percent_ranks=True, write_info=True, write_metdata=True)[source]

Save the predictor to a directory on disk. If the directory does not exist it will be created.

The wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances are included in the saved data.

Parameters
models_dirstring

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

This will also load the wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances.

Parameters
models_dirstring

Path to directory. If unspecified the default downloaded models are used.

max_modelsint, optional

Maximum number of affinity and processing (counted separately) models to load

Returns
Class1PresentationPredictor instance
percentile_ranks(presentation_scores, throw=True)[source]

Return percentile ranks for the given presentation scores.

Parameters
presentation_scoressequence of float
Returns
numpy.array of float
calibrate_percentile_ranks(scores, bins=None)[source]

Compute the cumulative distribution of scores, to enable taking quantiles of this distribution later.

Parameters
scoressequence of float

Presentation prediction scores

binsobject

Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges.

mhcflurry.class1_processing_neural_network module

Antigen processing neural network implementation

class mhcflurry.class1_processing_neural_network.Class1ProcessingNeuralNetwork(**hyperparameters)[source]

Bases: object

A neural network for antigen processing prediction

network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters (and their default values) that affect the neural network architecture.

fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for neural network training.

early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for early stopping.

compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Loss and optimizer hyperparameters. Any values supported by keras may be used.

auxiliary_input_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Allele feature hyperparameters.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
property sequence_lengths

Supported maximum sequence lengths

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
network()[source]

Return the keras model associated with this network.

update_network_description()[source]

Update self.network_json and self.network_weights properties based on this instance's neural network.

fit(sequences, targets, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]

Fit the neural network.

Parameters
sequencesFlankingEncoding

Peptides and upstream/downstream flanking sequences

targetslist of float

1 indicates hit, 0 indicates decoy

sample_weightslist of float

If not specified all samples have equal weight.

shuffle_permutationlist of int

Permutation (integer list) of same length as peptides and affinities. If None, then a random permutation will be generated.

verboseint

Keras verbosity level

progress_callbackfunction

No-argument function to call after each epoch.

progress_preamblestring

Optional string of information to include in each progress update

progress_print_intervalfloat

How often (in seconds) to print progress update. Set to None to disable.

predict(peptides, n_flanks=None, c_flanks=None, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptideslist of string

Peptide sequences

n_flankslist of string

Upstream sequence before each peptide

c_flankslist of string

Downstream sequence after each peptide

batch_sizeint

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
predict_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
sequencesFlankingEncoding

Peptides and flanking sequences

throwboolean

Whether to throw exception on unsupported peptides

batch_sizeint

Prediction keras batch size.

Returns
numpy.array
network_input(sequences, throw=True)[source]

Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).

Parameters
sequencesFlankingEncoding

Peptides and flanking sequences

throwboolean

Whether to throw exception on unsupported peptides

Returns
numpy.array
make_network(amino_acid_encoding, peptide_max_length, n_flank_length, c_flank_length, flanking_averages, convolutional_filters, convolutional_kernel_size, convolutional_activation, convolutional_kernel_l1_l2, dropout_rate, post_convolutional_dense_layer_sizes)[source]

Helper function to make a keras network given hyperparameters.

get_weights()[source]

Get the network weights

Returns
list of numpy.array giving weights for each layer or None if there is no
network
get_config()[source]

serialize to a dict all attributes except model weights

Returns
dict
classmethod from_config(config, weights=None)[source]

deserialize from a dict returned by get_config().

Parameters
configdict
weightslist of array, optional

Network weights to restore

Returns
Class1ProcessingNeuralNetwork

mhcflurry.class1_processing_predictor module

class mhcflurry.class1_processing_predictor.Class1ProcessingPredictor(models, manifest_df=None, metadata_dataframes=None, provenance_string=None)[source]

Bases: object

User-facing interface to antigen processing prediction.

Delegates to an ensemble of Class1ProcessingNeuralNetwork instances.

Instantiate a new Class1ProcessingPredictor

Users will generally call load() to restore a saved predictor rather than using this constructor.

Parameters
modelslist of Class1ProcessingNeuralNetwork

Neural networks in the ensemble.

manifest_dfpandas.DataFrame

Manifest dataframe. If not specified a new one will be created when needed.

metadata_dataframesdict of string -> pandas.DataFrame

Arbitrary metadata associated with this predictor

provenance_stringstring, optional

Optional info string to use in __str__.

property sequence_lengths

Supported maximum sequence lengths.

Passing a peptide greater than the maximum supported length results in an error.

Passing an N- or C-flank sequence greater than the maximum supported length results in some part of it being ignored.

Returns
dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum
supported sequence length.
add_models(models)[source]

Add models to the ensemble (in-place).

Parameters
modelslist of Class1ProcessingNeuralNetwork
Returns
list of string
Names of the new models.
property manifest_df

A pandas.DataFrame describing the models included in this predictor.

Returns
pandas.DataFrame
static model_name(num)[source]

Generate a model name

Returns
string
static weights_path(models_dir, model_name)[source]

Generate the path to the weights file for a model

Parameters
models_dirstring
model_namestring
Returns
string
predict(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

Parameters
peptideslist of string

Peptide sequences

n_flankslist of string

Upstream sequence before each peptide

c_flankslist of string

Downstream sequence after each peptide

throwboolean

If True, a ValueError will be raised in the case of unsupported peptides. If False, a warning will be logged and the predictions for those peptides will be NaN.

batch_sizeint

Prediction keras batch size.

Returns
numpy.array
Processing scores. Range is 0-1, higher indicates more favorable
processing.
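
For example, a sketch with illustrative flanking sequences:

    from mhcflurry.class1_processing_predictor import Class1ProcessingPredictor

    predictor = Class1ProcessingPredictor.load()
    scores = predictor.predict(
        peptides=["SIINFEKL", "KLGGALQAK"],
        n_flanks=["AAAAA", "GGGGG"],
        c_flanks=["TTTTT", "CCCCC"])
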
predict_to_dataframe(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for parameter descriptions.

Returns
pandas.DataFrame
Processing predictions are in the “score” column. Also includes
peptides and flanking sequences.
predict_to_dataframe_encoded(sequences, throw=True, batch_size=4096)[source]

Predict antigen processing.

See predict method for more information.

Parameters
sequencesFlankingEncoding
batch_sizeint
throwboolean
Returns
pandas.DataFrame
check_consistency()[source]

Verify that self.manifest_df is consistent with instance variables.

Currently only checks for agreement on the total number of models.

Throws AssertionError if inconsistent.

save(models_dir, model_names_to_write=None, write_metadata=True)[source]

Serialize the predictor to a directory on disk. If the directory does not exist it will be created.

The serialization format consists of a file called “manifest.csv” with the configurations of each Class1ProcessingNeuralNetwork, along with per-network files giving the model weights.

Parameters
models_dirstring

Path to directory. It will be created if it doesn’t exist.

classmethod load(models_dir=None, max_models=None)[source]

Deserialize a predictor from a directory on disk.

Parameters
models_dirstring

Path to directory. If unspecified the default downloaded models are used.

max_modelsint, optional

Maximum number of models to load

Returns
Class1ProcessingPredictor instance

mhcflurry.cluster_parallelism module

Simple, relatively naive parallel map implementation for HPC clusters.

Used for training MHCflurry models.

mhcflurry.cluster_parallelism.add_cluster_parallelism_args(parser)[source]

Add commandline arguments controlling cluster parallelism to an argparse ArgumentParser.

Parameters
parserargparse.ArgumentParser
mhcflurry.cluster_parallelism.cluster_results_from_args(args, work_function, work_items, constant_data=None, input_serialization_method='pickle', result_serialization_method='pickle', clear_constant_data=False)[source]

Parallel map configurable using commandline arguments. See the cluster_results() function for docs.

The args parameter should be an argparse.Namespace from an argparse parser generated using the add_cluster_parallelism_args() function.

Parameters
args
work_function
work_items
constant_data
result_serialization_method
clear_constant_data
Returns
generator
mhcflurry.cluster_parallelism.cluster_results(work_function, work_items, constant_data=None, submit_command='sh', results_workdir='./cluster-workdir', additional_complete_file=None, script_prefix_path=None, input_serialization_method='pickle', result_serialization_method='pickle', max_retries=3, clear_constant_data=False)[source]

Parallel map on an HPC cluster.

Returns [work_function(item) for item in work_items] where each invocation of work_function is performed as a separate HPC cluster job. Order is preserved.

Optionally, “constant data” can be specified, which will be passed to each work_function() invocation as a keyword argument called constant_data. This data is serialized once and all workers read it from the same source, which is more efficient than serializing it separately for each worker.

Each worker’s input is serialized to a shared NFS directory and the submit_command is used to launch a job to process that input. The shared filesystem is polled occasionally to watch for results, which are fed back to the user.

Parameters
work_functionA -> B
work_itemslist of A
constant_dataobject
submit_commandstring

For running on LSF, we use “bsub” here.

results_workdirstring

Path to NFS shared directory where inputs and results can be written

script_prefix_pathstring

Path to script that will be invoked to run each worker. A line calling the _mhcflurry-cluster-worker-entry-point command will be appended to the contents of this file.

result_serialization_methodstring, one of “pickle” or “save_predictor”

The “save_predictor” works only when the return type of work_function is Class1AffinityPredictor

max_retriesint

How many times to attempt to re-launch a failed worker

clear_constant_databool

If True, the constant data dict is cleared on the launching host after it is serialized to disk.

Returns
generator of B
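
A minimal sketch. The work function must be importable by the worker processes, and the submit command and working directory are site-specific assumptions:

    from mhcflurry.cluster_parallelism import cluster_results

    def square(item):
        return item ** 2

    results = list(cluster_results(
        work_function=square,
        work_items=[1, 2, 3],
        submit_command="bsub",  # e.g. for an LSF cluster
        results_workdir="/nfs/shared/cluster-workdir"))
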
mhcflurry.cluster_parallelism.worker_entry_point(argv=sys.argv[1:])[source]

Entry point for the worker command.

Parameters
argvlist of string

mhcflurry.common module

mhcflurry.common.configure_tensorflow(backend=None, gpu_device_nums=None, num_threads=None)[source]

Configure Keras backend to use GPU or CPU. Only tensorflow is supported.

Parameters
backendstring, optional

one of ‘tensorflow-default’, ‘tensorflow-cpu’, ‘tensorflow-gpu’

gpu_device_numslist of int, optional

GPU devices to potentially use

num_threadsint, optional

Tensorflow threads to use
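
For example, to force CPU execution with a fixed thread count (a sketch):

    from mhcflurry.common import configure_tensorflow

    configure_tensorflow(backend="tensorflow-cpu", num_threads=4)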

mhcflurry.common.configure_logging(verbose=False)[source]

Configure logging module using defaults.

Parameters
verboseboolean

If true, output will be at level DEBUG, otherwise, INFO.

mhcflurry.common.amino_acid_distribution(peptides, smoothing=0.0)[source]

Compute the fraction of each amino acid across a collection of peptides.

Parameters
peptideslist of string
smoothingfloat, optional

Small number (e.g. 0.01) to add to all amino acid fractions. The higher the number the more uniform the distribution.

Returns
pandas.Series indexed by amino acids
mhcflurry.common.random_peptides(num, length=9, distribution=None)[source]

Generate random peptides (kmers).

Parameters
numint

Number of peptides to return

lengthint

Length of each peptide

distributionpandas.Series

Maps 1-letter amino acid abbreviations to probabilities. If not specified a uniform distribution is used.

Returns
list of string
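
These utilities compose. A sketch of sampling peptides that match the amino acid distribution of a reference set:

    from mhcflurry.common import amino_acid_distribution, random_peptides

    reference = random_peptides(1000, length=9)  # uniform composition
    distribution = amino_acid_distribution(reference, smoothing=0.01)
    matched = random_peptides(100, length=9, distribution=distribution)
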
mhcflurry.common.positional_frequency_matrix(peptides)[source]

Given a set of peptides, calculate a length x amino acids frequency matrix.

Parameters
peptideslist of string

All of same length

Returns
pandas.DataFrame

Index is position, columns are amino acids

mhcflurry.common.save_weights(weights_list, filename)[source]

Save model weights to the given filename using numpy’s “.npz” format.

Parameters
weights_listlist of numpy array
filenamestring
mhcflurry.common.load_weights(filename)[source]

Restore model weights from the given filename, which should have been created with save_weights.

Parameters
filenamestring
Returns
list of array
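
A round trip, as a sketch (the path is arbitrary):

    import numpy as np
    from mhcflurry.common import save_weights, load_weights

    weights = [np.zeros((4, 4)), np.ones(4)]
    save_weights(weights, "/tmp/weights.npz")
    restored = load_weights("/tmp/weights.npz")
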
class mhcflurry.common.NumpyJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

JSON encoder (used with json module) that can handle numpy arrays.

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

If specified, separators should be an (item_separator, key_separator) tuple. The default is (‘, ‘, ‘: ‘) if indent is None and (‘,’, ‘: ‘) otherwise. To get the most compact JSON representation, you should specify (‘,’, ‘:’) to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
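
Typical usage with the json module, as a sketch:

    import json
    import numpy as np
    from mhcflurry.common import NumpyJSONEncoder

    json.dumps({"values": np.arange(3)}, cls=NumpyJSONEncoder)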

mhcflurry.custom_loss module

Custom loss functions.

For losses supporting inequalities, each training data point is associated with one of (=), (<), or (>). For e.g. (>) inequalities, penalization is applied only if the prediction is less than the given value.

mhcflurry.custom_loss.get_loss(name)[source]

Get a custom_loss.Loss instance by name.

Parameters
namestring
Returns
custom_loss.Loss
class mhcflurry.custom_loss.Loss(name=None)[source]

Bases: object

Thin wrapper to keep track of neural network loss functions, which could be custom or baked into Keras.

Each subclass or instance should define these properties/methods:

  • name : string

  • loss : string or function

    This is what gets passed to keras.fit()

  • encode_y : numpy.ndarray -> numpy.ndarray

    Transformation to apply to regression target before fitting

loss(y_true, y_pred)[source]
get_keras_loss(reduction='sum_over_batch_size')[source]
class mhcflurry.custom_loss.StandardKerasLoss(loss_name='mse')[source]

Bases: mhcflurry.custom_loss.Loss

A loss function supported by Keras, such as MSE.

supports_inequalities = False
supports_multiple_outputs = False
static encode_y(y)[source]
class mhcflurry.custom_loss.TransformPredictionsLossWrapper(loss, y_pred_transform=None)[source]

Bases: mhcflurry.custom_loss.Loss

Wrapper that applies an arbitrary transform to y_pred before calling an underlying loss function.

The y_pred_transform function should be a tensor -> tensor function.

encode_y(*args, **kwargs)[source]
loss(y_true, y_pred)[source]
class mhcflurry.custom_loss.MSEWithInequalities(name=None)[source]

Bases: mhcflurry.custom_loss.Loss

Supports training a regression model on data that includes inequalities (e.g. x < 100). Mean square error is used as the loss for elements with an (=) inequality. For elements with e.g. a (> 0.5) inequality, the loss for that element is (y - 0.5)^2 (standard MSE) if y < 0.5 and 0 otherwise.

This loss assumes that the normal range for y_true and y_pred is 0 - 1. As a hack, the implementation uses other intervals for y_pred to encode the inequality information.

y_true is interpreted as follows:

between 0 - 1:

Regular MSE loss is used. Penalty (y_pred - y_true)**2 is applied whether y_pred is greater or less than y_true.

between 2 - 3:

Treated as a “>” inequality. Penalty (y_pred - (y_true - 2))**2 is applied only if y_pred is less than y_true - 2.

between 4 - 5:

Treated as a “<” inequality. Penalty (y_pred - (y_true - 4))**2 is applied only if y_pred is greater than y_true - 4.

name = 'mse_with_inequalities'
supports_inequalities = True
supports_multiple_outputs = False
static encode_y(y, inequalities=None)[source]
loss(y_true, y_pred)[source]
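
A sketch of the target encoding, following the intervals documented above:

    import numpy as np
    from mhcflurry.custom_loss import MSEWithInequalities

    y = np.array([0.2, 0.8, 0.5])
    encoded = MSEWithInequalities.encode_y(y, inequalities=["=", ">", "<"])
    # "=" targets stay in [0, 1]; ">" targets shift into [2, 3];
    # "<" targets shift into [4, 5].
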
class mhcflurry.custom_loss.MSEWithInequalitiesAndMultipleOutputs(name=None)[source]

Bases: mhcflurry.custom_loss.Loss

Loss supporting inequalities and multiple outputs.

This loss assumes that the normal range for y_true and y_pred is 0 - 1. As a hack, the implementation uses other intervals for y_pred to encode the inequality and output-index information.

Inequalities are encoded into the regression target as in the MSEWithInequalities loss.

Multiple outputs are encoded by mapping each regression target x (after transforming for inequalities) using the rule x -> x + i * 10 where i is the output index.

The reason for explicitly encoding multiple outputs this way (rather than just making the regression target a matrix instead of a vector) is that in our use cases we frequently have missing data in the regression target. This encoding gives a simple way to penalize only on (data point, output index) pairs that have labels.

name = 'mse_with_inequalities_and_multiple_outputs'
supports_inequalities = True
supports_multiple_outputs = True
static encode_y(y, inequalities=None, output_indices=None)[source]
loss(y_true, y_pred)[source]
class mhcflurry.custom_loss.MultiallelicMassSpecLoss(delta=0.2, multiplier=1.0)[source]

Bases: mhcflurry.custom_loss.Loss

name = 'multiallelic_mass_spec_loss'
supports_inequalities = True
supports_multiple_outputs = False
static encode_y(y)[source]
loss(y_true, y_pred)[source]
mhcflurry.custom_loss.check_shape(name, arr, expected_shape)[source]

Raise ValueError if arr.shape != expected_shape.

Parameters
namestring

Included in error message to aid debugging

arrnumpy.ndarray
expected_shapetuple of int
mhcflurry.custom_loss.cls

alias of mhcflurry.custom_loss.MultiallelicMassSpecLoss

mhcflurry.data_dependent_weights_initialization module

Layer-sequential unit-variance initialization for neural networks.

See:

Mishkin and Matas, “All you need is a good init”. 2016. https://arxiv.org/abs/1511.06422

mhcflurry.data_dependent_weights_initialization.svd_orthonormal(shape)[source]
mhcflurry.data_dependent_weights_initialization.get_activations(model, layer, X_batch)[source]
mhcflurry.data_dependent_weights_initialization.lsuv_init(model, batch, verbose=True, margin=0.1, max_iter=100)[source]

Initialize neural network weights using layer-sequential unit-variance initialization.

See:

Mishkin and Matas, “All you need is a good init”. 2016. https://arxiv.org/abs/1511.06422

Parameters
modelkeras.Model
batchdict

Training data, as would be passed to keras.Model.fit().

verboseboolean

Whether to print progress to stdout

marginfloat
max_iterint
Returns
keras.Model

Same as what was passed in.

mhcflurry.downloads module

Manage local downloaded data.

mhcflurry.downloads.get_downloads_dir()[source]

Return the path to local downloaded data

mhcflurry.downloads.get_current_release()[source]

Return the current downloaded data release

mhcflurry.downloads.get_downloads_metadata()[source]

Return the contents of downloads.yml as a dict

mhcflurry.downloads.get_default_class1_models_dir(test_exists=True)[source]

Return the absolute path to the default class1 models dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_MODELS is set to an absolute path, return that path. If it’s set to a relative path (i.e. does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_MODELS is NOT set, then return the path to downloaded models in the “models_class1” download.

Parameters
test_existsboolean, optional

Whether to raise an exception if the path does not exist

Returns
stringabsolute path
mhcflurry.downloads.get_default_class1_presentation_models_dir(test_exists=True)[source]

Return the absolute path to the default class1 presentation models dir.

See get_default_class1_models_dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_PRESENTATION_MODELS is set to an absolute path, return that path. If it’s set to a relative path (does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.

Parameters
test_existsboolean, optional

Whether to raise an exception if the path does not exist

Returns
stringabsolute path
mhcflurry.downloads.get_default_class1_processing_models_dir(test_exists=True)[source]

Return the absolute path to the default class1 processing models dir.

See get_default_class1_models_dir.

If environment variable MHCFLURRY_DEFAULT_CLASS1_PROCESSING_MODELS is set to an absolute path, return that path. If it’s set to a relative path (does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.

Parameters
test_existsboolean, optional

Whether to raise an exception if the path does not exist

Returns
stringabsolute path
mhcflurry.downloads.get_current_release_downloads()[source]

Return a dict of all available downloads in the current release.

The dict keys are the names of the downloads. The values are a dict with three entries:

downloadedbool

Whether the download is currently available locally

metadatadict

Info about the download from downloads.yml such as URL

up_to_datebool or None

Whether the download URL(s) match what was used to download the current data. This is None if it cannot be determined.
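
For example, to summarize local download state (a sketch based only on the dict structure documented above):

    from mhcflurry.downloads import get_current_release_downloads

    downloads = get_current_release_downloads()
    for name, info in downloads.items():
        status = "downloaded" if info["downloaded"] else "not downloaded"
        print(name, status)
        # info["metadata"] holds the downloads.yml entry (URL, etc.);
        # info["up_to_date"] may be True, False, or None.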

mhcflurry.downloads.get_path(download_name, filename='', test_exists=True)[source]

Get the local path to a file in an MHCflurry download

Parameters
download_namestring
filenamestring

Relative path within the download to the file of interest

test_existsboolean

If True (default), throw an error telling the user how to download the data if the file does not exist

Returns
string giving local absolute path
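
A usage sketch. It assumes the "models_class1_pan" download has already been fetched (e.g. with "mhcflurry-downloads fetch models_class1_pan"); the relative filename is illustrative:

    from mhcflurry.downloads import get_path

    # Absolute path to a directory inside the local downloads dir
    models_dir = get_path("models_class1_pan", "models.combined", test_exists=True)
    print(models_dir)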
mhcflurry.downloads.configure()[source]

Setup various global variables based on environment variables.

mhcflurry.downloads_command module

Download MHCflurry released datasets and trained models.

Examples

Fetch the default downloads:

$ mhcflurry-downloads fetch

Fetch a specific download:

$ mhcflurry-downloads fetch models_class1_pan

Get the path to a download:

$ mhcflurry-downloads path models_class1_pan

Get the URL of a download:

$ mhcflurry-downloads url models_class1_pan

Summarize available and fetched downloads:

$ mhcflurry-downloads info

mhcflurry.downloads_command.run(argv=sys.argv[1:])[source]
mhcflurry.downloads_command.mkdir_p(path)[source]

Make directories as needed, similar to mkdir -p in a shell.

From: http://stackoverflow.com/questions/600268/mkdir-p-functionality-in-python

mhcflurry.downloads_command.yes_no(boolean)[source]
class mhcflurry.downloads_command.TqdmUpTo(*args, **kwargs)[source]

Bases: tqdm.std.tqdm

Provides update_to(n) which uses tqdm.update(delta_n).

Parameters
iterableiterable, optional

Iterable to decorate with a progressbar. Leave blank to manually manage the updates.

descstr, optional

Prefix for the progressbar.

totalint or float, optional

The number of expected iterations. If unspecified, len(iterable) is used if possible. If float(“inf”) or as a last resort, only basic progress statistics are displayed (no ETA, no progressbar). If gui is True and this parameter needs subsequent updating, specify an initial arbitrary large positive number, e.g. 9e9.

leavebool, optional

If [default: True], keeps all traces of the progressbar upon termination of iteration. If None, will leave only if position is 0.

fileio.TextIOWrapper or io.StringIO, optional

Specifies where to output the progress messages (default: sys.stderr). Uses file.write(str) and file.flush() methods. For encoding, see write_bytes.

ncolsint, optional

The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).

minintervalfloat, optional

Minimum progress display update interval [default: 0.1] seconds.

maxintervalfloat, optional

Maximum progress display update interval [default: 10] seconds. Automatically adjusts miniters to correspond to mininterval after long display update lag. Only works if dynamic_miniters or monitor thread is enabled.

minitersint or float, optional

Minimum progress display update interval, in iterations. If 0 and dynamic_miniters, will automatically adjust to equal mininterval (more CPU efficient, good for tight loops). If > 0, will skip display of specified number of iterations. Tweak this and mininterval to get very efficient loops. If your progress is erratic with both fast and slow iterations (network, skipping items, etc) you should set miniters=1.

asciibool or str, optional

If unspecified or False, use unicode (smooth blocks) to fill the meter. The fallback is to use ASCII characters " 123456789#".

disablebool, optional

Whether to disable the entire progressbar wrapper [default: False]. If set to None, disable on non-TTY.

unitstr, optional

String that will be used to define the unit of each iteration [default: it].

unit_scalebool or int or float, optional

If 1 or True, the number of iterations will be reduced/scaled automatically and a metric prefix following the International System of Units standard will be added (kilo, mega, etc.) [default: False]. If any other non-zero number, will scale total and n.

dynamic_ncolsbool, optional

If set, constantly alters ncols and nrows to the environment (allowing for window resizes) [default: False].

smoothingfloat, optional

Exponential moving average smoothing factor for speed estimates (ignored in GUI mode). Ranges from 0 (average speed) to 1 (current/instantaneous speed) [default: 0.3].

bar_formatstr, optional

Specify a custom bar string formatting. May impact performance. [default: '{l_bar}{bar}{r_bar}'], where l_bar='{desc}: {percentage:3.0f}%|' and r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'.

Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s.

Note that a trailing ": " is automatically removed after {desc} if the latter is empty.

initialint or float, optional

The initial counter value. Useful when restarting a progress bar [default: 0]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.

positionint, optional

Specify the line offset to print this bar (starting from 0). Automatic if unspecified. Useful to manage multiple bars at once (e.g., from threads).

postfixdict or *, optional

Specify additional stats to display at the end of the bar. Calls set_postfix(**postfix) if possible (dict).

unit_divisorfloat, optional

[default: 1000], ignored unless unit_scale is True.

write_bytesbool, optional

If (default: None) and file is unspecified, bytes will be written in Python 2. If True will also write bytes. In all other cases will default to unicode.

lock_argstuple, optional

Passed to refresh for intermediate output (initialisation, iterating, and updating).

nrowsint, optional

The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20.

guibool, optional

WARNING: internal parameter - do not use. Use tqdm.gui.tqdm(…) instead. If set, will attempt to use matplotlib animations for a graphical output [default: False].

Returns
outdecorated iterator.
update_to(b=1, bsize=1, tsize=None)[source]

Parameters
bint, optional

Number of blocks transferred so far [default: 1].

bsizeint, optional

Size of each block (in tqdm units) [default: 1].

tsizeint, optional

Total size (in tqdm units). If [default: None] remains unchanged.

mhcflurry.downloads_command.fetch_subcommand(args)[source]
mhcflurry.downloads_command.info_subcommand(args)[source]
mhcflurry.downloads_command.path_subcommand(args)[source]

Print the local path to a download

mhcflurry.downloads_command.url_subcommand(args)[source]

Print the URL(s) for a download

mhcflurry.encodable_sequences module

Class for encoding variable-length peptides to fixed-size numerical matrices

exception mhcflurry.encodable_sequences.EncodingError(message, supported_peptide_lengths)[source]

Bases: ValueError

Exception raised when peptides cannot be encoded

class mhcflurry.encodable_sequences.EncodableSequences(sequences)[source]

Bases: object

Class for encoding variable-length peptides to fixed-size numerical matrices

This class caches various encodings of a list of sequences.

In practice this is used only for peptides. To encode MHC allele sequences, see AlleleEncoding.

unknown_character = 'X'
classmethod create(sequences)[source]

Factory that returns an EncodableSequences given a list of strings. As a convenience, you can also pass it an EncodableSequences instance, in which case the object is returned unchanged.

variable_length_to_fixed_length_categorical(alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15)[source]

Encode variable-length sequences to a fixed-size index-encoded (integer) matrix.

See sequences_to_fixed_length_index_encoded_array for details.

Parameters
alignment_methodstring

One of “pad_middle” or “left_pad_right_pad”

left_edgeint, size of fixed-position left side

Only relevant for pad_middle alignment method

right_edgeint, size of the fixed-position right side

Only relevant for pad_middle alignment method

max_lengthmaximum supported peptide length
Returns
numpy.array of integers with shape (num sequences, encoded length)
For pad_middle, the encoded length is max_length. For left_pad_right_pad,
it's 2 * max_length.
variable_length_to_fixed_length_vector_encoding(vector_encoding_name, alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15, trim=False, allow_unsupported_amino_acids=False)[source]

Encode variable-length sequences to a fixed-size matrix. Amino acids are encoded as specified by the vector_encoding_name argument.

See sequences_to_fixed_length_index_encoded_array for details.

See also: variable_length_to_fixed_length_categorical.

Parameters
vector_encoding_namestring

How to represent amino acids. One of “BLOSUM62”, “one-hot”, etc. Full list of supported vector encodings is given by available_vector_encodings().

alignment_methodstring

One of “pad_middle” or “left_pad_right_pad”

left_edgeint

Size of fixed-position left side. Only relevant for pad_middle alignment method

right_edgeint

Size of the fixed-position right side. Only relevant for pad_middle alignment method

max_lengthint

Maximum supported peptide length

trimbool

If True, longer sequences will be trimmed to fit the maximum supported length. Not supported for all alignment methods.

allow_unsupported_amino_acidsbool

If True, non-canonical amino acids will be replaced with the X character before encoding.

Returns
numpy.array with shape (num sequences, encoded length, m)
where
  • m is the vector encoding length (usually 21).

  • encoded length is max_length if alignment_method is pad_middle; 2 * max_length if it's left_pad_right_pad.
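
A minimal usage sketch with illustrative peptides, using the defaults documented above:

    from mhcflurry.encodable_sequences import EncodableSequences

    peptides = EncodableSequences.create(["SIINFEKL", "SIINFEKLL", "SIQNPEK"])
    encoded = peptides.variable_length_to_fixed_length_vector_encoding("BLOSUM62")
    print(encoded.shape)  # (3, 15, 21) for the default pad_middle, max_length=15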

classmethod sequences_to_fixed_length_index_encoded_array(sequences, alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15, trim=False, allow_unsupported_amino_acids=False)[source]

Encode variable-length sequences to a fixed-size index-encoded (integer) matrix.

How variable length sequences get mapped to fixed length is set by the “alignment_method” argument. Supported alignment methods are:

pad_middle

Encoding designed for preserving the anchor positions of class I peptides. This is what is used in allele-specific models.

Each string must be of length at least left_edge + right_edge and at most max_length. The first left_edge characters in the input always map to the first left_edge characters in the output. Similarly for the last right_edge characters. The middle characters are filled in based on the length, with the X character filling in the blanks.

Example:

AAAACDDDD -> AAAAXXXCXXXDDDD

left_pad_centered_right_pad

Encoding that makes no assumptions on anchor positions but is 3x larger than pad_middle, since it duplicates the peptide (left aligned + centered + right aligned). This is what is used for the pan-allele models.

Example:

AAAACDDDD -> AAAACDDDDXXXXXXXXXAAAACDDDDXXXXXXXXXAAAACDDDD

left_pad_right_pad

Same as left_pad_centered_right_pad but only includes left- and right-padded peptide.

Example:

AAAACDDDD -> AAAACDDDDXXXXXXXXXXXXAAAACDDDD

Parameters
sequenceslist of string
alignment_methodstring

One of “pad_middle” or “left_pad_right_pad”

left_edgeint

Size of fixed-position left side. Only relevant for pad_middle alignment method

right_edgeint

Size of the fixed-position right side. Only relevant for pad_middle alignment method

max_lengthint

maximum supported peptide length

trimbool

If True, longer sequences will be trimmed to fit the maximum supported length. Not supported for all alignment methods.

allow_unsupported_amino_acidsbool

If True, non-canonical amino acids will be replaced with the X character before encoding.

Returns
numpy.array of integers with shape (num sequences, encoded length)
For pad_middle, the encoded length is max_length. For left_pad_right_pad,
it’s 2 * max_length. For left_pad_centered_right_pad, it’s
3 * max_length.

mhcflurry.ensemble_centrality module

Measures of centrality (e.g. mean) used to combine predictions across an ensemble. The inputs to these functions are log affinities, and they are expected to return a centrality measure, also in log-space.

mhcflurry.ensemble_centrality.robust_mean(log_values)[source]

Mean of values falling within the 25-75 percentiles.

Parameters
log_values2-d numpy.array

Center is computed along the second axis (i.e. per row).

Returns
centernumpy.array of length log_values.shape[0]
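
A usage sketch; rows are peptides and columns are ensemble members, matching the per-row behavior described above:

    import numpy as np
    from mhcflurry.ensemble_centrality import robust_mean

    log_values = np.log(np.array([
        [100.0, 120.0, 90.0, 110.0, 5000.0],  # one outlier ensemble member
        [50.0, 55.0, 60.0, 45.0, 52.0],
    ]))
    centers = robust_mean(log_values)  # one robust center per row, in log space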

mhcflurry.fasta module

Adapted from pyensembl, github.com/openvax/pyensembl Original implementation by Alex Rubinsteyn.

The worst sin in bioinformatics is to write your own FASTA parser. We're doing it here anyway to avoid adding another dependency to MHCflurry.

mhcflurry.fasta.read_fasta_to_dataframe(filename)[source]
class mhcflurry.fasta.FastaParser[source]

Bases: object

FastaParser object consumes lines of a FASTA file incrementally.

iterate_over_file(fasta_path)[source]

Generator that yields identifiers paired with sequences.

static open_file(fasta_path)[source]

Open either a text file or compressed gzip file as a stream of bytes.
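
A minimal sketch of reading a FASTA file into a dataframe. The path is a placeholder, and the column names are an assumption based on what mhcflurry-predict-scan expects ("sequence_id", "sequence"):

    from mhcflurry.fasta import read_fasta_to_dataframe

    df = read_fasta_to_dataframe("proteins.fasta")  # placeholder path
    print(df.head())  # expected columns: sequence_id, sequence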

mhcflurry.flanking_encoding module

Class for encoding variable-length peptides and their flanking sequences to fixed-size numerical matrices

class mhcflurry.flanking_encoding.EncodingResult(array, peptide_lengths)

Bases: tuple

Create new instance of EncodingResult(array, peptide_lengths)

array

Alias for field number 0

peptide_lengths

Alias for field number 1

class mhcflurry.flanking_encoding.FlankingEncoding(peptides, n_flanks, c_flanks)[source]

Bases: object

Encode peptides and optionally their N- and C-flanking sequences into fixed size numerical matrices. Similar to EncodableSequences but with support for flanking sequences and the encoding scheme used by the processing predictor.

Instances of this class have an immutable list of peptides with flanking sequences. Encodings are cached in the instances for faster performance when the same set of peptides needs to be encoded more than once.

Constructor. Sequences of any length can be passed.

Parameters
peptideslist of string

Peptide sequences

n_flankslist of string [same length as peptides]

Upstream sequences

c_flankslist of string [same length as peptides]

Downstream sequences

unknown_character = 'X'
vector_encode(vector_encoding_name, peptide_max_length, n_flank_length, c_flank_length, allow_unsupported_amino_acids=True, throw=True)[source]

Encode variable-length sequences to a fixed-size matrix.

Parameters
vector_encoding_namestring

How to represent amino acids. One of “BLOSUM62”, “one-hot”, etc. See amino_acid.available_vector_encodings().

peptide_max_lengthint

Maximum supported peptide length.

n_flank_lengthint

Maximum supported N-flank length

c_flank_lengthint

Maximum supported C-flank length

allow_unsupported_amino_acidsbool

If True, non-canonical amino acids will be replaced with the X character before encoding.

throwbool

Whether to raise exception on unsupported peptides

Returns
numpy.array with shape (num sequences, length, m)
where
  • num sequences is number of peptides, i.e. len(self)

  • length is peptide_max_length + n_flank_length + c_flank_length

  • m is the vector encoding length (usually 21).
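
A minimal usage sketch with illustrative peptides and flanks:

    from mhcflurry.flanking_encoding import FlankingEncoding

    encoding = FlankingEncoding(
        peptides=["SIINFEKL", "SYFPEITHI"],
        n_flanks=["AAAWT", "QQQLM"],
        c_flanks=["GGS", "PRK"])
    arr = encoding.vector_encode(
        "BLOSUM62",
        peptide_max_length=15,
        n_flank_length=5,
        c_flank_length=5)
    # per the description above: arr.shape == (2, 15 + 5 + 5, 21)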

static encode(vector_encoding_name, df, peptide_max_length, n_flank_length, c_flank_length, allow_unsupported_amino_acids=False, throw=True)[source]

Encode variable-length sequences to a fixed-size matrix.

Helper function. Users should use vector_encode.

Parameters
vector_encoding_namestring
dfpandas.DataFrame
peptide_max_lengthint
n_flank_lengthint
c_flank_lengthint
allow_unsupported_amino_acidsbool
throwbool
Returns
numpy.array

mhcflurry.hyperparameters module

Hyperparameter (neural network options) management

class mhcflurry.hyperparameters.HyperparameterDefaults(**defaults)[source]

Bases: object

Class for managing hyperparameters. Thin wrapper around a dict.

Instances of this class are a specification of the hyperparameters supported by a model and their defaults. The particular hyperparameter settings to be used, for example, to train a model are kept in plain dicts.

extend(other)[source]

Return a new HyperparameterDefaults instance containing the hyperparameters from the current instance combined with those from other.

It is an error if self and other have any hyperparameters in common.

with_defaults(obj)[source]

Given a dict of hyperparameter settings, return a dict containing those settings augmented by the defaults for any keys missing from the dict.

subselect(obj)[source]

Filter a dict of hyperparameter settings to only those keys defined in this HyperparameterDefaults.

check_valid_keys(obj)[source]

Given a dict of hyperparameter settings, throw an exception if any keys are not defined in this HyperparameterDefaults instance.

models_grid(**kwargs)[source]

Make a grid of models by taking the cartesian product of all specified model parameter lists.

Parameters
The valid kwarg parameters are the entries of this
HyperparameterDefaults instance. Each parameter must be a list
giving the values to search across.
Returns
list of dict giving the parameters for each model. The length of the
list is the product of the lengths of the input lists.
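
A sketch of the intended workflow; the hyperparameter names here are made up for illustration:

    from mhcflurry.hyperparameters import HyperparameterDefaults

    base = HyperparameterDefaults(layer_sizes=[64], dropout_probability=0.0)
    full = base.extend(HyperparameterDefaults(learning_rate=0.001))

    settings = full.with_defaults({"dropout_probability": 0.5})
    # {'layer_sizes': [64], 'dropout_probability': 0.5, 'learning_rate': 0.001}

    grid = full.models_grid(
        dropout_probability=[0.0, 0.5],
        learning_rate=[0.001, 0.01])
    assert len(grid) == 4  # cartesian product of the supplied lists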

mhcflurry.local_parallelism module

Infrastructure for “local” parallelism, i.e. multiprocess parallelism on one compute node.

mhcflurry.local_parallelism.add_local_parallelism_args(parser)[source]

Add local parallelism arguments to the given argparse.ArgumentParser.

Parameters
parserargparse.ArgumentParser
mhcflurry.local_parallelism.worker_pool_with_gpu_assignments_from_args(args)[source]

Create a multiprocessing.Pool where each worker uses its own GPU.

Uses commandline arguments. See worker_pool_with_gpu_assignments.

Parameters
argsargparse.ArgumentParser
Returns
multiprocessing.Pool
mhcflurry.local_parallelism.worker_pool_with_gpu_assignments(num_jobs, num_gpus=0, backend=None, max_workers_per_gpu=1, max_tasks_per_worker=None, worker_log_dir=None)[source]

Create a multiprocessing.Pool where each worker uses its own GPU.

Parameters
num_jobsint

Number of worker processes.

num_gpusint
backendstring
max_workers_per_gpuint
max_tasks_per_workerint
worker_log_dirstring
Returns
multiprocessing.Pool
mhcflurry.local_parallelism.make_worker_pool(processes=None, initializer=None, initializer_kwargs_per_process=None, max_tasks_per_worker=None)[source]

Convenience wrapper to create a multiprocessing.Pool.

This function adds support for per-worker initializer arguments, which are not natively supported by the multiprocessing module. The motivation for this feature is to support allocating each worker to a (different) GPU.

IMPLEMENTATION NOTE:

The per-worker initializer arguments are implemented using a Queue. Each worker reads its arguments from this queue when it starts. When it terminates, it adds its initializer arguments back to the queue, so a future process can initialize itself using these arguments.

There is one issue with this approach, however. If a worker crashes, it never repopulates the queue of initializer arguments, preventing any future worker from re-using those arguments. To deal with this we add a second 'backup queue'. This queue always contains the full set of initializer arguments: whenever a worker reads from it, it immediately pushes the popped args back to the end of the queue. If the primary arg queue is ever empty, workers read from this backup queue.

Parameters
processesint

Number of workers. Default: num CPUs.

initializerfunction, optional

Init function to call in each worker

initializer_kwargs_per_processlist of dict, optional

Arguments to pass to initializer function for each worker. Length of list must equal the number of workers.

max_tasks_per_workerint, optional

Restart workers after this many tasks. Requires Python >=3.2.

Returns
multiprocessing.Pool
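
A hedged sketch of the per-worker initializer mechanism described above; the initializer and its kwargs are illustrative (a real initializer might set CUDA_VISIBLE_DEVICES from the assigned device number):

    from mhcflurry.local_parallelism import make_worker_pool

    def init_worker(gpu_device_num=None):
        print("worker assigned GPU", gpu_device_num)  # illustrative only

    if __name__ == "__main__":
        pool = make_worker_pool(
            processes=2,
            initializer=init_worker,
            initializer_kwargs_per_process=[
                {"gpu_device_num": 0},
                {"gpu_device_num": 1},
            ])
        print(pool.map(abs, [-1, -2, 3]))
        pool.close()
        pool.join()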
mhcflurry.local_parallelism.worker_init_entry_point(init_function, arg_queue=None, backup_arg_queue=None)[source]
mhcflurry.local_parallelism.worker_init(keras_backend=None, gpu_device_nums=None, worker_log_dir=None)[source]
exception mhcflurry.local_parallelism.WrapException[source]

Bases: Exception

Add traceback info to exception so exceptions raised in worker processes can still show traceback info when re-raised in the parent.

mhcflurry.local_parallelism.call_wrapped(function, *args, **kwargs)[source]

Run function on args and kwargs and return result, wrapping any exception raised in a WrapException.

Parameters
functionarbitrary function
Any other arguments provided are passed to the function.
Returns
object
mhcflurry.local_parallelism.call_wrapped_kwargs(function, kwargs)[source]

Invoke function on given kwargs and return result, wrapping any exception raised in a WrapException.

Parameters
functionarbitrary function
kwargsdict
Returns
object
result of calling function(**kwargs)
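
For example (a sketch; the failing function is contrived):

    from mhcflurry.local_parallelism import WrapException, call_wrapped_kwargs

    def divide(numerator, denominator):
        return numerator / denominator

    try:
        call_wrapped_kwargs(divide, {"numerator": 1.0, "denominator": 0.0})
    except WrapException as exc:
        print(exc)  # includes traceback info from the original ZeroDivisionError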

mhcflurry.percent_rank_transform module

Class for transforming arbitrary values into percent ranks given a distribution.

class mhcflurry.percent_rank_transform.PercentRankTransform[source]

Bases: object

Transform arbitrary values into percent ranks.

fit(values, bins)[source]

Fit the transform using the given values (e.g. ic50s).

Parameters
valuespredictions (e.g. ic50 values)
binsbins for the cumulative distribution function

Anything that can be passed to numpy.histogram’s “bins” argument can be used here.

transform(values)[source]

Return percent ranks (range [0, 100]) for the given values.

to_series()[source]

Serialize the fit to a pandas.Series.

The index on the series gives the bin edges and the values give the CDF.

Returns
pandas.Series
static from_series(series)[source]

Deserialize a PercentRankTransform from the given pandas.Series, as returned by to_series().

Parameters
seriespandas.Series
Returns
PercentRankTransform
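
A usage sketch covering fit, transform, and (de)serialization; the predictions are synthetic:

    import numpy as np
    from mhcflurry.percent_rank_transform import PercentRankTransform

    transform = PercentRankTransform()
    predictions = np.random.uniform(1.0, 50000.0, size=100000)  # e.g. ic50s
    transform.fit(predictions, bins=1000)  # bins: anything numpy.histogram accepts

    ranks = transform.transform(np.array([50.0, 500.0, 5000.0]))  # in [0, 100]

    series = transform.to_series()                       # serialize
    restored = PercentRankTransform.from_series(series)  # and restore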

mhcflurry.predict_command module

Run MHCflurry predictor on specified peptides.

By default, the presentation predictor is used, and predictions for MHC I binding affinity, antigen processing, and the composite presentation score are returned. If you just want binding affinity predictions, pass --affinity-only.

Examples:

Write a CSV file containing the contents of INPUT.csv plus additional columns giving MHCflurry predictions:

$ mhcflurry-predict INPUT.csv --out RESULT.csv

The input CSV file is expected to contain columns “allele”, “peptide”, and, optionally, “n_flank”, and “c_flank”.

If --out is not specified, results are written to stdout.

You can also run on alleles and peptides specified on the commandline, in which case predictions are written for all combinations of alleles and peptides:

$ mhcflurry-predict --alleles HLA-A0201 H-2Kb --peptides SIINFEKL DENDREKLLL

Instead of individual alleles (in a CSV or on the command line), you can also give a comma-separated list of alleles specifying a sample genotype. In this case, the tightest binding affinity across the alleles for the sample will be returned. For example:

$ mhcflurry-predict --peptides SIINFEKL DENDREKLLL --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:01,HLA-C*03:01

will give the tightest predicted affinities across alleles for each of the two genotypes specified for each peptide.

mhcflurry.predict_command.run(argv=sys.argv[1:])[source]

mhcflurry.predict_scan_command module

Scan protein sequences using the MHCflurry presentation predictor.

By default, sub-sequences (peptides) with affinity percentile ranks less than 2.0 are returned. You can also specify --results-all to return predictions for all peptides, or --results-best to return the top peptide for each sequence.

Examples:

Scan a set of sequences in a FASTA file for binders to any alleles in an MHC I genotype:

$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02

Instead of a FASTA, you can also pass a CSV that has “sequence_id” and “sequence” columns.

You can also specify multiple MHC I genotypes to scan as space-separated arguments to the --alleles option:

$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01

If --out is not specified, results are written to standard out.

You can also specify sequences on the commandline:

$ mhcflurry-predict-scan --sequences MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02

mhcflurry.predict_scan_command.parse_peptide_lengths(value)[source]
mhcflurry.predict_scan_command.run(argv=sys.argv[1:])[source]

mhcflurry.random_negative_peptides module

class mhcflurry.random_negative_peptides.RandomNegativePeptides(**hyperparameters)[source]

Bases: object

Generate random negative (peptide, allele) pairs. These are used during model training, where they are resampled at each epoch.

hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>

Hyperparameters for random negative peptides.

Number of random negatives will be:

random_negative_rate * (num measurements) + random_negative_constant

where the exact meaning of (num measurements) depends on the particular random_negative_method in use.

If random_negative_match_distribution is True, then the amino acid frequencies of the training data peptides are used to generate the random peptides.

Valid values for random_negative_method are:

“by_length”: used for allele-specific prediction. See the RandomNegativePeptides.plan_by_length method.

“by_allele”: used for pan-allele prediction. See the RandomNegativePeptides.plan_by_allele method.

“by_allele_equalize_nonbinders”: used for pan-allele prediction. See the RandomNegativePeptides.plan_by_allele_equalize_nonbinders method.

“recommended”: the default. Use by_length if the predictor is allele-specific and by_allele if it’s pan-allele.

plan(peptides, affinities, alleles=None, inequalities=None)[source]

Calculate the number of random negatives for each allele and peptide length. Call this once after instantiating the object.

Parameters
peptideslist of string
affinitieslist of float
alleleslist of string, optional
inequalitieslist of string (“>”, “<”, or “=”), optional
Returns
pandas.DataFrame indicating number of random negatives for each length
and allele.
plan_by_length(df_all, df_binders=None, df_nonbinders=None)[source]

Generate a random negative plan using the “by_length” policy.

Parameters are as in the plan method. No return value.

Used for allele-specific predictors. Does not work well for pan-allele.

Different numbers of random negatives per length. Alleles are sampled proportionally to the number of times they are used in the training data.

plan_by_allele(df_all, df_binders=None, df_nonbinders=None)[source]

Generate a random negative plan using the “by_allele” policy.

Parameters are as in the plan method. No return value.

For each allele, a particular number of random negatives is used for all lengths. Across alleles, the number of random negatives varies; within an allele, the number of random negatives for each length is constant.

plan_by_allele_equalize_nonbinders(df_all, df_binders, df_nonbinders)[source]

Generate a random negative plan using the “by_allele_equalize_nonbinders” policy.

Parameters are as in the plan method. No return value.

Requires that the random_negative_binder_threshold hyperparameter is set.

In a first step, the random negatives selected by the “by_allele” method are added (see plan_by_allele). Then, the total number of non-binders is calculated for each allele and length. This total includes non-binder measurements in the training data plus the random negative peptides added in the first step. In a second step, additional random negative peptides are added so that, for each allele, all peptide lengths have the same total number of non-binders.

get_alleles()[source]

Get the list of alleles corresponding to each random negative peptide as returned by get_peptides. This does NOT change and can be safely called once and reused.

Returns
list of string
get_peptides()[source]

Get the list of random negative peptides. This will be different each time the method is called.

Returns
list of string
get_total_count()[source]

Total number of planned random negative peptides.

Returns
int
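
A hedged sketch of the intended usage. The training data is illustrative, and passing hyperparameters as keyword arguments with the names documented above is an assumption:

    from mhcflurry.random_negative_peptides import RandomNegativePeptides

    planner = RandomNegativePeptides(
        random_negative_rate=0.2,
        random_negative_constant=25,
        random_negative_method="recommended")
    planner.plan(
        peptides=["SIINFEKL", "SYFPEITHI", "KLGGALQAK"],
        affinities=[50.0, 200.0, 25000.0],
        alleles=["HLA-A*02:01", "HLA-A*02:01", "HLA-B*57:01"])

    alleles = planner.get_alleles()        # fixed; safe to call once and reuse
    for _ in range(3):                     # e.g. once per training epoch
        peptides = planner.get_peptides()  # freshly resampled on each call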

mhcflurry.regression_target module

mhcflurry.regression_target.from_ic50(ic50, max_ic50=50000.0)[source]

Convert ic50s to regression targets in the range [0.0, 1.0].

Parameters
ic50numpy.array of float
Returns
numpy.array of float
mhcflurry.regression_target.to_ic50(x, max_ic50=50000.0)[source]

Convert regression targets in the range [0.0, 1.0] to ic50s in the range [0, 50000.0].

Parameters
xnumpy.array of float
Returns
numpy.array of float
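
A round-trip sketch; since the two functions are inverses for ic50s within [1.0, max_ic50], the original values are recovered after the transform:

    import numpy as np
    from mhcflurry.regression_target import from_ic50, to_ic50

    ic50s = np.array([50.0, 500.0, 5000.0, 50000.0])
    targets = from_ic50(ic50s)    # tight binders map toward 1.0
    recovered = to_ic50(targets)  # inverse transform
    assert np.allclose(recovered, ic50s)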

mhcflurry.scoring module

Measures of prediction accuracy

mhcflurry.scoring.make_scores(ic50_y, ic50_y_pred, sample_weight=None, threshold_nm=500, max_ic50=50000)[source]

Calculate AUC, F1, and Kendall Tau scores.

Parameters
ic50_yfloat list

true IC50s (i.e. affinities)

ic50_y_predfloat list

predicted IC50s

sample_weightfloat list [optional]
threshold_nmfloat [optional]
max_ic50float [optional]
Returns
dict with entries “auc”, “f1”, “tau”
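
A minimal sketch with synthetic affinities; with threshold_nm=500, the first two measurements count as binders:

    import numpy as np
    from mhcflurry.scoring import make_scores

    ic50_true = np.array([25.0, 300.0, 1200.0, 20000.0])
    ic50_pred = np.array([40.0, 150.0, 900.0, 30000.0])
    scores = make_scores(ic50_true, ic50_pred, threshold_nm=500)
    print(scores["auc"], scores["f1"], scores["tau"])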

mhcflurry.select_allele_specific_models_command module

Model select class1 single allele models.

mhcflurry.select_allele_specific_models_command.run(argv=sys.argv[1:])[source]
class mhcflurry.select_allele_specific_models_command.ScrambledPredictor(predictor)[source]

Bases: object

predict(peptides, allele)[source]
mhcflurry.select_allele_specific_models_command.model_select(allele, constant_data={})[source]
mhcflurry.select_allele_specific_models_command.cache_encoding(predictor, peptides)[source]
class mhcflurry.select_allele_specific_models_command.ScoreFunction(function, summary=None)[source]

Bases: object

Thin wrapper over a score function (Class1AffinityPredictor -> float). Used to keep a summary string associated with the function.

class mhcflurry.select_allele_specific_models_command.CombinedModelSelector(model_selectors, weights=None, min_contribution_percent=1.0)[source]

Bases: object

Model selector that computes a weighted average over other model selectors.

usable_for_allele(allele)[source]
plan_summary(allele)[source]
score_function(allele, dry_run=False)[source]
class mhcflurry.select_allele_specific_models_command.ConsensusModelSelector(predictor, num_peptides_per_length=10000, multiply_score_by_value=10.0)[source]

Bases: object

Model selector that scores sub-ensembles based on their Kendall tau consistency with the full ensemble over a set of random peptides.

usable_for_allele(allele)[source]
max_absolute_value(allele)[source]
plan_summary(allele)[source]
score_function(allele)[source]
class mhcflurry.select_allele_specific_models_command.MSEModelSelector(df, predictor, min_measurements=1, multiply_score_by_data_size=True)[source]

Bases: object

Model selector that uses mean-squared error to score models. Inequalities are supported.

usable_for_allele(allele)[source]
max_absolute_value(allele)[source]
plan_summary(allele)[source]
score_function(allele)[source]
class mhcflurry.select_allele_specific_models_command.MassSpecModelSelector(df, predictor, decoys_per_length=0, min_measurements=100, multiply_score_by_data_size=True)[source]

Bases: object

Model selector that uses positive predictive value (PPV) at distinguishing hits from decoys in mass-spec experiments.

static ppv(y_true, predictions)[source]
usable_for_allele(allele)[source]
max_absolute_value(allele)[source]
plan_summary(allele)[source]
score_function(allele)[source]

mhcflurry.select_pan_allele_models_command module

Model select class1 pan-allele models.

APPROACH: For each training fold, we select at least min and at most max models (where min and max are set by the --min-models-per-fold and --max-models-per-fold arguments) using a step-up (forward) selection procedure. The final ensemble is the union of all selected models across all folds.

mhcflurry.select_pan_allele_models_command.mse(predictions, actual, inequalities=None, affinities_are_already_01_transformed=False)[source]

Mean squared error of predictions vs. actual

Parameters
predictionslist of float
actuallist of float
inequalitieslist of string (“>”, “<”, or “=”)
affinities_are_already_01_transformedboolean

Predictions and actual are taken to be nanomolar affinities if affinities_are_already_01_transformed is False, otherwise 0-1 values.

Returns
float
mhcflurry.select_pan_allele_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.select_pan_allele_models_command.do_model_select_task(item, constant_data={})[source]
mhcflurry.select_pan_allele_models_command.model_select(fold_num, models, min_models, max_models, constant_data={})[source]

Model select for a fold.

Parameters
fold_numint
modelslist of Class1NeuralNetwork
min_modelsint
max_modelsint
constant_datadict
Returns
dict with keys ‘fold_num’, ‘selected_indices’, ‘summary’

mhcflurry.select_processing_models_command module

Model select antigen processing models.

APPROACH: For each training fold, we select at least min and at most max models (where min and max are set by the --min-models-per-fold and --max-models-per-fold arguments) using a step-up (forward) selection procedure. The final ensemble is the union of all selected models across all folds. AUC is used as the metric.

mhcflurry.select_processing_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.select_processing_models_command.do_model_select_task(item, constant_data={})[source]
mhcflurry.select_processing_models_command.model_select(fold_num, models, min_models, max_models, constant_data={})[source]

Model select for a fold.

Parameters
fold_numint
modelslist of Class1NeuralNetwork
min_modelsint
max_modelsint
constant_datadict
Returns
dict with keys ‘fold_num’, ‘selected_indices’, ‘summary’

mhcflurry.testing_utils module

Utilities used in MHCflurry unit tests.

mhcflurry.testing_utils.startup()[source]

Configure Keras backend for running unit tests.

mhcflurry.testing_utils.cleanup()[source]

Clear tensorflow session and other process-wide resources.

mhcflurry.train_allele_specific_models_command module

Train Class1 single allele models.

mhcflurry.train_allele_specific_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_allele_specific_models_command.alleles_by_similarity(allele)[source]
mhcflurry.train_allele_specific_models_command.train_model(n_models, allele_num, n_alleles, hyperparameter_set_num, num_hyperparameter_sets, allele, hyperparameters, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]
mhcflurry.train_allele_specific_models_command.subselect_df_held_out(df, recriprocal_held_out_fraction=10, seed=0)[source]

mhcflurry.train_pan_allele_models_command module

Train Class1 pan-allele models.

mhcflurry.train_pan_allele_models_command.assign_folds(df, num_folds, held_out_fraction, held_out_max)[source]

Split training data into multiple test/train pairs, which we refer to as folds. Note that a given data point may be assigned to multiple test or train sets; these folds are NOT a non-overlapping partition as used in cross validation.

A fold is defined by a boolean value for each data point, indicating whether it is included in the training data for that fold. If it’s not in the training data, then it’s in the test data.

Folds are balanced in terms of allele content.

Parameters
dfpandas.DataFrame

training data

num_foldsint
held_out_fractionfloat

Fraction of data to hold out as test data in each fold

held_out_max

For a given allele, do not hold out more than held_out_max data points in any fold.

Returns
pandas.DataFrame

index is same as df.index, columns are “fold_0”, … “fold_N” giving whether the data point is in the training data for the fold
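
A hedged sketch of consuming the returned fold assignments. The training DataFrame columns here ("allele", "peptide", "measurement_value") are assumptions for illustration:

    import pandas as pd
    from mhcflurry.train_pan_allele_models_command import assign_folds

    df = pd.DataFrame({
        "allele": ["HLA-A*02:01"] * 50 + ["HLA-B*57:01"] * 50,
        "peptide": ["SIINFEKL"] * 100,
        "measurement_value": [100.0] * 100,
    })
    folds_df = assign_folds(df, num_folds=4, held_out_fraction=0.1, held_out_max=5)
    train_0 = df.loc[folds_df["fold_0"]]   # training data for fold 0
    test_0 = df.loc[~folds_df["fold_0"]]   # held-out (test) data for fold 0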

mhcflurry.train_pan_allele_models_command.pretrain_data_iterator(filename, master_allele_encoding, peptides_per_chunk=1024)[source]

Step through a CSV file giving predictions for a large number of peptides (rows) and alleles (columns).

Parameters
filenamestring
master_allele_encodingAlleleEncoding
peptides_per_chunkint
Returns
Generator of (AlleleEncoding, EncodableSequences, float affinities) tuples
mhcflurry.train_pan_allele_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_pan_allele_models_command.main(args)[source]
mhcflurry.train_pan_allele_models_command.initialize_training(args)[source]
mhcflurry.train_pan_allele_models_command.train_models(args)[source]
mhcflurry.train_pan_allele_models_command.train_model(work_item_name, work_item_num, num_work_items, architecture_num, num_architectures, fold_num, num_folds, replicate_num, num_replicates, hyperparameters, pretrain_data_filename, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]

mhcflurry.train_presentation_models_command module

Train Class1 presentation models.

mhcflurry.train_presentation_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_presentation_models_command.main(args)[source]

mhcflurry.train_processing_models_command module

Train Class1 processing models.

mhcflurry.train_processing_models_command.assign_folds(df, num_folds, held_out_samples)[source]

Split training data into multiple test/train pairs, which we refer to as folds. Note that a given data point may be assigned to multiple test or train sets; these folds are NOT a non-overlapping partition as used in cross validation.

A fold is defined by a boolean value for each data point, indicating whether it is included in the training data for that fold. If it’s not in the training data, then it’s in the test data.

Parameters
dfpandas.DataFrame

training data

num_foldsint
held_out_samplesint
Returns
pandas.DataFrame

index is same as df.index, columns are “fold_0”, … “fold_N” giving whether the data point is in the training data for the fold

mhcflurry.train_processing_models_command.run(argv=sys.argv[1:])[source]
mhcflurry.train_processing_models_command.main(args)[source]
mhcflurry.train_processing_models_command.initialize_training(args)[source]
mhcflurry.train_processing_models_command.train_models(args)[source]
mhcflurry.train_processing_models_command.train_model(work_item_name, work_item_num, num_work_items, architecture_num, num_architectures, fold_num, num_folds, replicate_num, num_replicates, hyperparameters, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]

mhcflurry.version module