API Documentation
Class I MHC ligand prediction package
-
class mhcflurry.Class1AffinityPredictor(allele_to_allele_specific_models=None, class1_pan_allele_models=None, allele_to_sequence=None, manifest_df=None, allele_to_percent_rank_transform=None, metadata_dataframes=None, provenance_string=None)
Bases: object
High-level interface for peptide/MHC I binding affinity prediction.
This class manages low-level Class1NeuralNetwork instances, each of which wraps a single Keras network. The purpose of Class1AffinityPredictor is to implement ensembles, handling of multiple alleles, and predictor loading and saving. It also provides a place to keep track of metadata like prediction histograms for percentile rank calibration.
- Parameters
- allele_to_allele_specific_models : dict of string -> list of Class1NeuralNetwork
Ensemble of single-allele models to use for each allele.
- class1_pan_allele_models : list of Class1NeuralNetwork
Ensemble of pan-allele models.
- allele_to_sequence : dict of string -> string
MHC allele name to fixed-length amino acid sequence (sometimes referred to as the pseudosequence). Required only if class1_pan_allele_models is specified.
- manifest_df : pandas.DataFrame, optional
Must have columns: model_name, allele, config_json, model. Only required if you want to update an existing serialization of a Class1AffinityPredictor. Otherwise this dataframe will be generated automatically based on the supplied models.
- allele_to_percent_rank_transform : dict of string -> PercentRankTransform, optional
PercentRankTransform instances to use for each allele.
- metadata_dataframes : dict of string -> pandas.DataFrame, optional
Optional additional dataframes to write to the models dir when save() is called. Useful for tracking provenance.
- provenance_string : string, optional
Optional info string to use in __str__.
-
property manifest_df
A pandas.DataFrame describing the models included in this predictor.
Based on:
- self.class1_pan_allele_models
- self.allele_to_allele_specific_models
- Returns
- pandas.DataFrame
-
clear_cache()
Clear values cached based on the neural networks in this predictor.
Users should call this after mutating any of the following:
- self.class1_pan_allele_models
- self.allele_to_allele_specific_models
- self.allele_to_sequence
Methods that mutate these instance variables will call this method on their own if needed.
-
property neural_networks
List of the neural networks in the ensemble.
- Returns
- list of Class1NeuralNetwork
-
classmethod merge(predictors)
Merge the ensembles of two or more Class1AffinityPredictor instances.
Note: the resulting merged predictor will NOT have calibrated percentile ranks. Call calibrate_percentile_ranks on it if these are needed.
- Parameters
- predictors : sequence of Class1AffinityPredictor
- Returns
- Class1AffinityPredictor instance
-
merge_in_place(others)
Add the models present in other predictors into the current predictor.
- Parameters
- others : list of Class1AffinityPredictor
Other predictors to merge into the current predictor.
- Returns
- list of string
Names of newly added models.
-
property supported_alleles
Alleles for which predictions can be made.
- Returns
- list of string
-
property supported_peptide_lengths
(minimum, maximum) lengths of peptides supported by all models, inclusive.
- Returns
- (int, int) tuple
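The range reported here is the one every model in the ensemble can handle. A minimal sketch of that intersection follows, assuming each model reports its own inclusive (minimum, maximum) pair. The function name `common_length_range` is hypothetical and not part of mhcflurry.

```python
def common_length_range(per_model_ranges):
    """Intersect inclusive (min, max) peptide length ranges, one per model."""
    if not per_model_ranges:
        raise ValueError("need at least one model range")
    # The shared minimum is the largest of the minimums; the shared
    # maximum is the smallest of the maximums.
    lowest = max(low for low, _ in per_model_ranges)
    highest = min(high for _, high in per_model_ranges)
    if lowest > highest:
        raise ValueError("models share no common peptide length")
    return (lowest, highest)


# Hypothetical per-model ranges, for illustration only.
print(common_length_range([(8, 15), (5, 15), (8, 12)]))
```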
-
check_consistency()
Verify that self.manifest_df is consistent with:
- self.class1_pan_allele_models
- self.allele_to_allele_specific_models
Currently only checks for agreement on the total number of models.
Raises AssertionError if inconsistent.
-
save(models_dir, model_names_to_write=None, write_metadata=True)
Serialize the predictor to a directory on disk. If the directory does not exist it will be created.
The serialization format consists of a file called “manifest.csv” with the configurations of each Class1NeuralNetwork, along with per-network files giving the model weights. If there are pan-allele predictors in the ensemble, the allele sequences are also stored in the directory. There is also a small file “index.txt” with basic metadata: when the models were trained, by whom, on what host.
- Parameters
- models_dir : string
Path to directory. It will be created if it doesn’t exist.
- model_names_to_write : list of string, optional
Only write the weights for the specified models. Useful for incremental updates during training.
- write_metadata : boolean, optional
Whether to write optional metadata.
-
static load(models_dir=None, max_models=None, optimization_level=None)
Deserialize a predictor from a directory on disk.
- Parameters
- models_dir : string
Path to directory. If unspecified the default downloaded models are used.
- max_models : int, optional
Maximum number of Class1NeuralNetwork instances to load.
- optimization_level : int
If >0, model optimization will be attempted. Defaults to value of environment variable MHCFLURRY_OPTIMIZATION_LEVEL.
- Returns
- Class1AffinityPredictor instance
-
optimize(warn=True)
EXPERIMENTAL: Optimize the predictor for faster predictions.
Currently the only optimization implemented is to merge multiple pan-allele predictors at the tensorflow level.
The optimization is performed in-place, mutating the instance.
- Returns
- bool
Whether optimization was performed
-
static model_name(allele, num)
Generate a model name.
- Parameters
- allele : string
- num : int
- Returns
- string
-
static weights_path(models_dir, model_name)
Generate the path to the weights file for a model.
- Parameters
- models_dir : string
- model_name : string
- Returns
- string
-
property master_allele_encoding
An AlleleEncoding containing the universe of alleles specified by self.allele_to_sequence.
- Returns
- AlleleEncoding
-
fit_allele_specific_predictors(n_models, architecture_hyperparameters_list, allele, peptides, affinities, inequalities=None, train_rounds=None, models_dir_for_save=None, verbose=0, progress_preamble='', progress_print_interval=5.0)
Fit one or more allele specific predictors for a single allele using one or more neural network architectures.
The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.
- Parameters
- n_models : int
Number of neural networks to fit
- architecture_hyperparameters_list : list of dict
List of hyperparameter sets.
- allele : string
- peptides : EncodableSequences or list of string
- affinities : list of float
nM affinities
- inequalities : list of string, each element one of “>”, “<”, or “=”
See Class1NeuralNetwork.fit for details.
- train_rounds : sequence of int
Each training point i will be used on training rounds r for which train_rounds[i] > r, r >= 0.
- models_dir_for_save : string, optional
If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.
- verbose : int
Keras verbosity
- progress_preamble : string
Optional string of information to include in each progress update
- progress_print_interval : float
How often (in seconds) to print progress. Set to None to disable.
- Returns
- list of Class1NeuralNetwork
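The train_rounds rule above can be read as a per-round filter over the training points. The sketch below only illustrates that rule as stated; the helper name `points_for_round` is hypothetical and is not an mhcflurry function.

```python
def points_for_round(train_rounds, r):
    """Indices i of the training points used in round r, i.e. train_rounds[i] > r."""
    if r < 0:
        raise ValueError("round index must be >= 0")
    return [i for i, limit in enumerate(train_rounds) if limit > r]


# Hypothetical values: point 3 is never used, point 1 is used in rounds 0-2.
train_rounds = [1, 3, 2, 0]
for r in range(max(train_rounds)):
    print(r, points_for_round(train_rounds, r))
```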
-
fit_class1_pan_allele_models(n_models, architecture_hyperparameters, alleles, peptides, affinities, inequalities, models_dir_for_save=None, verbose=1, progress_preamble='', progress_print_interval=5.0)
Fit one or more pan-allele predictors using a single neural network architecture.
The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.
- Parameters
- n_models : int
Number of neural networks to fit
- architecture_hyperparameters : dict
- alleles : list of string
Allele names (not sequences) corresponding to each peptide
- peptides : EncodableSequences or list of string
- affinities : list of float
nM affinities
- inequalities : list of string, each element one of “>”, “<”, or “=”
See Class1NeuralNetwork.fit for details.
- models_dir_for_save : string, optional
If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.
- verbose : int
Keras verbosity
- progress_preamble : string
Optional string of information to include in each progress update
- progress_print_interval : float
How often (in seconds) to print progress. Set to None to disable.
- Returns
- list of Class1NeuralNetwork
-
add_pan_allele_model(model, models_dir_for_save=None)
Add a pan-allele model to the ensemble and optionally do an incremental save.
- Parameters
- model : Class1NeuralNetwork
- models_dir_for_save : string
Directory to save resulting ensemble to
-
percentile_ranks(affinities, allele=None, alleles=None, throw=True)
Return percentile ranks for the given ic50 affinities and alleles.
The ‘allele’ and ‘alleles’ arguments are as in the predict method. Specify one of these.
- Parameters
- affinities : sequence of float
nM affinities
- allele : string
- alleles : sequence of string
- throw : boolean
If True, a ValueError will be raised in the case of unsupported alleles. If False, a warning will be logged and NaN will be returned for those percentile ranks.
- Returns
- numpy.array of float
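As a rough illustration of the lookup and of the throw behaviour described above, the sketch below ranks query affinities against a per-allele list of calibration affinities. It assumes the rank is the percentage of calibration values at or below the query. The function `sketch_percentile_ranks` and the calibration numbers are hypothetical, and this is not the PercentRankTransform implementation.

```python
import bisect
import logging
import math


def sketch_percentile_ranks(affinities, allele, calibration, throw=True):
    """Percent of calibration affinities (nM) at or below each query value."""
    if allele not in calibration:
        if throw:
            raise ValueError("unsupported allele: %s" % allele)
        logging.warning("unsupported allele %s; returning NaN", allele)
        return [math.nan] * len(affinities)
    reference = sorted(calibration[allele])
    return [
        100.0 * bisect.bisect_right(reference, value) / len(reference)
        for value in affinities
    ]


# Hypothetical calibration data for a single allele.
calibration = {"HLA-A*02:01": [10.0, 50.0, 500.0, 5000.0, 20000.0]}
print(sketch_percentile_ranks([50.0, 30000.0], "HLA-A*02:01", calibration))
```

Lower ranks correspond to tighter predicted binding under this convention, since fewer calibration peptides bind at least as strongly.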
-
predict(peptides, alleles=None, allele=None, throw=True, centrality_measure='mean', model_kwargs={})
Predict nM binding affinities.
If multiple predictors are available for an allele, the predictions are the geometric means of the individual model (nM) predictions.
One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.
- Parameters
- peptides : EncodableSequences or list of string
- alleles : list of string
- allele : string
- throw : boolean
If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.
- centrality_measure : string or callable
Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.
- model_kwargs : dict
Additional keyword arguments to pass to Class1NeuralNetwork.predict
- Returns
- numpy.array of predictions
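To see why a 'mean' centrality measure yields a geometric mean, the sketch below applies the measure to log-transformed nM values and exponentiates the result. That log-space step is an assumption made for illustration. `combine_ensemble` is a hypothetical helper, and only the mean and median options are sketched.

```python
import math
import statistics


def combine_ensemble(per_model_nm, centrality_measure="mean"):
    """Combine one peptide's per-model nM predictions in log space."""
    logs = [math.log(value) for value in per_model_nm]
    if centrality_measure == "mean":
        center = statistics.fmean(logs)  # arithmetic mean of logs
    elif centrality_measure == "median":
        center = statistics.median(logs)
    else:
        raise ValueError("unsupported centrality measure: %r" % centrality_measure)
    # Back-transforming the mean of logs gives the geometric mean in nM.
    return math.exp(center)


# Hypothetical predictions from a two-model ensemble.
print(combine_ensemble([10.0, 1000.0]))
```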
-
predict_to_dataframe(peptides, alleles=None, allele=None, throw=True, include_individual_model_predictions=False, include_percentile_ranks=True, include_confidence_intervals=True, centrality_measure='mean', model_kwargs={})
Predict nM binding affinities. Gives more detailed output than the predict method, including 5-95% prediction intervals.
If multiple predictors are available for an allele, the predictions are the geometric means of the individual model predictions.
One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.
- Parameters
- peptides : EncodableSequences or list of string
- alleles : list of string
- allele : string
- throw : boolean
If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.
- include_individual_model_predictions : boolean
If True, the predictions of each individual model are included as columns in the result DataFrame.
- include_percentile_ranks : boolean, default True
If True, a “prediction_percentile” column will be included giving the percentile ranks. If no percentile rank info is available, this will be ignored with a warning.
- centrality_measure : string or callable
Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.
- model_kwargs : dict
Additional keyword arguments to pass to Class1NeuralNetwork.predict
- Returns
- pandas.DataFrame of predictions
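One plausible reading of the 5-95% prediction interval is the 5th and 95th percentiles of the individual model predictions for a peptide. The sketch below works under that assumption only. `percentile` and `prediction_interval` are hypothetical helpers, and the interpolation rule used by the library may differ.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolating between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def prediction_interval(per_model_nm):
    """(low, high) bounds taken across the ensemble's predictions for one peptide."""
    return (percentile(per_model_nm, 5.0), percentile(per_model_nm, 95.0))


# Hypothetical nM predictions from five models.
print(prediction_interval([100.0, 200.0, 300.0, 400.0, 500.0]))
```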
-
calibrate_percentile_ranks(peptides=None, num_peptides_per_length=100000, alleles=None, bins=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})
Compute the cumulative distribution of ic50 values for a set of alleles over a large universe of random peptides, to enable taking quantiles of this distribution later.
- Parameters
- peptides : sequence of string or EncodableSequences, optional
Peptides to use
- num_peptides_per_length : int, optional
If peptides argument is not specified, then num_peptides_per_length peptides are randomly sampled from a uniform distribution for each supported length
- alleles : sequence of string, optional
Alleles to perform calibration for. If not specified all supported alleles will be calibrated.
- bins : object
Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges. This is in ic50 space.
- motif_summary : bool
If True, the length distribution and per-position amino acid frequencies are also calculated for the top x fraction of tightest-binding peptides, where each value of x is given in the summary_top_peptide_fractions list.
- summary_top_peptide_fractions : list of float
Only used if motif_summary is True
- verbose : boolean
Whether to print status updates to stdout
- model_kwargs : dict
Additional low-level Class1NeuralNetwork.predict() kwargs.
- Returns
- dict of string -> pandas.DataFrame
If motif_summary is True, this will have keys “frequency_matrices” and “length_distributions”. Otherwise it will be empty.
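The two ingredients of calibration, uniform random peptides per length and a cumulative distribution over ic50 bin edges, can be sketched as follows. This is a simplified illustration: the helper names and the sample values are hypothetical, and in practice the ic50 values would come from the predictor rather than being supplied by hand.

```python
import bisect
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_peptides(num_per_length, lengths, seed=0):
    """Sample peptides with each residue drawn uniformly, num_per_length per length."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for length in lengths
        for _ in range(num_per_length)
    ]


def cumulative_distribution(ic50_values, bin_edges):
    """Percent of ic50 values (nM) at or below each bin edge."""
    ordered = sorted(ic50_values)
    return [
        100.0 * bisect.bisect_right(ordered, edge) / len(ordered)
        for edge in bin_edges
    ]


peptides = random_peptides(num_per_length=3, lengths=range(8, 11))
print(len(peptides), peptides[0])
# Hypothetical predicted affinities and bin edges.
print(cumulative_distribution([5.0, 50.0, 500.0, 5000.0], [10.0, 100.0, 1000.0, 50000.0]))
```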
-
model_select(score_function, alleles=None, min_models=1, max_models=10000)
Perform model selection using a user-specified scoring function.
This works only with allele-specific models, not pan-allele models.
Model selection is done using a “step up” variable selection procedure, in which models are repeatedly added to an ensemble until the score stops improving.
- Parameters
- score_function : Class1AffinityPredictor -> float function
Scoring function
- alleles : list of string, optional
If not specified, model selection is performed for all alleles.
- min_models : int, optional
Min models to select per allele
- max_models : int, optional
Max models to select per allele
- Returns
- Class1AffinityPredictor
Predictor containing the selected models.
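The “step up” procedure is a greedy forward selection. The sketch below shows the general technique on plain Python lists. It is not the library's code: here the hypothetical `score_function` receives a list of candidates, whereas model_select passes a Class1AffinityPredictor, and the tie-breaking rule is an arbitrary choice.

```python
def step_up_select(candidates, score_function, min_models=1, max_models=10000):
    """Greedily add the candidate that gives the best ensemble score each step."""
    selected = []
    remaining = list(candidates)
    best_score = None
    while remaining and len(selected) < max_models:
        # Score every ensemble formed by adding one more candidate.
        scored = [
            (score_function(selected + [candidate]), index)
            for index, candidate in enumerate(remaining)
        ]
        score, index = max(scored)
        improved = best_score is None or score > best_score
        # Stop once the score no longer improves, unless below the minimum size.
        if not improved and len(selected) >= min_models:
            break
        selected.append(remaining.pop(index))
        best_score = score
    return selected


def toy_score(ensemble):
    """Toy score: closeness of the ensemble mean to a target of 5.0."""
    return -abs(sum(ensemble) / len(ensemble) - 5.0)


print(step_up_select([1.0, 5.0, 9.0], toy_score))
```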
-
class mhcflurry.Class1NeuralNetwork(**hyperparameters)
Bases: object
Low level class I predictor consisting of a single neural network.
Both single allele and pan-allele prediction are supported.
Users will generally use Class1AffinityPredictor, which gives a higher-level interface and supports ensembles.
-
network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Hyperparameters (and their default values) that affect the neural network architecture.
-
compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Loss and optimizer hyperparameters.
-
fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Hyperparameters for neural network training.
-
early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Hyperparameters for early stopping.
-
miscelaneous_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Miscellaneous hyperparameters. These parameters are not used by this class but may be interpreted by other code.
-
hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Combined set of all supported hyperparameters and their default values.
-
hyperparameter_renames = {'embedding_init_method': None, 'embedding_input_dim': None, 'embedding_output_dim': None, 'kmer_size': None, 'left_edge': None, 'min_delta': None, 'mode': None, 'monitor': None, 'peptide_amino_acid_encoding': None, 'pseudosequence_use_embedding': None, 'right_edge': None, 'take_best_epoch': None, 'use_embedding': None, 'verbose': None}
-
classmethod apply_hyperparameter_renames(hyperparameters)
Handle hyperparameter renames.
- Parameters
- hyperparameters : dict
- Returns
- dict
Updated hyperparameters.
-
KERAS_MODELS_CACHE = {}
Process-wide keras model cache, a map from architecture JSON string to (Keras model, existing network weights).
-
classmethod borrow_cached_network(network_json, network_weights)
Return a keras Model with the specified architecture and weights. As an optimization, when possible this will reuse architectures from a process-wide cache.
The returned object is “borrowed” in the sense that its weights can change later after subsequent calls to this method from other objects.
If you’re using this from a parallel implementation you’ll need to hold a lock while using the returned object.
- Parameters
- network_json : string of JSON
- network_weights : list of numpy.array
- Returns
- keras.models.Model
-
network(borrow=False)
Return the keras model associated with this predictor.
- Parameters
- borrow : bool
Whether to return a cached model if possible. See borrow_cached_network for details.
- Returns
- keras.models.Model
-
update_network_description()
Update self.network_json and self.network_weights properties based on this instance’s neural network.
-
static keras_network_cache_key(network_json)
Given a Keras JSON description of a neural network, return a key that uniquely defines this network. Networks that share the same key should have compatible weights matrices and give the same prediction outputs when their weights are the same.
- Parameters
- network_json : string
- Returns
- string
-
classmethod from_config(config, weights=None, weights_loader=None)
Deserialize from a dict returned by get_config().
- Parameters
- config : dict
- weights : list of array, optional
Network weights to restore
- weights_loader : callable, optional
Function to call (no arguments) to load weights when needed
- Returns
- Class1NeuralNetwork
-
load_weights()
Load weights by evaluating self.network_weights_loader, if needed.
After calling this, self.network_weights_loader will be None and self.network_weights will be the weights list, if available.
-
get_weights()
Get the network weights.
- Returns
- list of numpy.array giving weights for each layer, or None if there is no network
-
peptides_to_network_input(peptides)
Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).
- Parameters
- peptides : EncodableSequences or list of string
- Returns
- numpy.array
-
property supported_peptide_lengths
(minimum, maximum) lengths of peptides supported, inclusive.
- Returns
- (int, int) tuple
-
allele_encoding_to_network_input(allele_encoding)
Encode alleles to the fixed-length encoding expected by the neural network (which depends on the architecture).
- Parameters
- allele_encoding : AlleleEncoding
- Returns
- (numpy.array, numpy.array)
Indices and allele representations.
-
static data_dependent_weights_initialization(network, x_dict=None, method='lsuv', verbose=1)
Data dependent weights initialization.
- Parameters
- network : keras.Model
- x_dict : dict of string -> numpy.ndarray
Training data as would be passed to keras.Model.fit().
- method : string
Initialization method. Currently only “lsuv” is supported.
- verbose : int
Status updates printed to stdout if verbose > 0
-
fit_generator(generator, validation_peptide_encoding, validation_affinities, validation_allele_encoding=None, validation_inequalities=None, validation_output_indices=None, steps_per_epoch=10, epochs=1000, min_epochs=0, patience=10, min_delta=0.0, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)
Fit using a generator. Does not support many of the features of fit(), such as random negative peptides.
Fitting proceeds until early stopping is hit, using the peptides, affinities, etc. given by the parameters starting with “validation_”.
This is used for pre-training pan-allele models using data synthesized by the allele-specific models.
- Parameters
- generator : generator yielding (alleles, peptides, affinities) tuples
where alleles and peptides are lists of strings, and affinities is a list of floats.
- validation_peptide_encoding : EncodableSequences
- validation_affinities : list of float
- validation_allele_encoding : AlleleEncoding
- validation_inequalities : list of string
- validation_output_indices : list of int
- steps_per_epoch : int
- epochs : int
- min_epochs : int
- patience : int
- min_delta : float
- verbose : int
- progress_callback : thunk
- progress_preamble : string
- progress_print_interval : float
-
fit(peptides, affinities, allele_encoding=None, inequalities=None, output_indices=None, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)
Fit the neural network.
- Parameters
- peptides : EncodableSequences or list of string
- affinities : list of float
nM affinities. Must be the same length as peptides.
- allele_encoding : AlleleEncoding
If not specified, the model will be a single-allele predictor.
- inequalities : list of string, each element one of “>”, “<”, or “=”
Inequalities to use for fitting. Same length as affinities. For example, a “>” will train on y_pred > y_true for that element in the training set. Requires using a custom loss that supports inequalities (e.g. mse_with_inequalities). If None, all inequalities are taken to be “=”.
- output_indices : list of int
For multi-output models only. Same length as affinities. Indicates the index of the output (starting from 0) for each training example.
- sample_weights : list of float
If not specified, all samples (including random negatives added during training) will have equal weight. If specified, the random negatives will be assigned weight=1.0.
- shuffle_permutation : list of int
Permutation (integer list) of the same length as peptides and affinities. If None, then a random permutation will be generated.
- verbose : int
Keras verbosity level
- progress_callback : function
No-argument function to call after each epoch.
- progress_preamble : string
Optional string of information to include in each progress update
- progress_print_interval : float
How often (in seconds) to print progress update. Set to None to disable.
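The effect of the inequalities argument on a squared-error loss can be sketched in plain Python. This is an illustration of the idea only, not the library's loss: the function name `sketch_mse_with_inequalities` is hypothetical, and it assumes y_true and y_pred are on the same scale, with “>” meaning that any prediction above the target is penalty-free.

```python
def sketch_mse_with_inequalities(y_true, y_pred, inequalities):
    """Mean squared error where '>' and '<' targets act as one-sided bounds."""
    total = 0.0
    for true, pred, inequality in zip(y_true, y_pred, inequalities):
        if inequality not in (">", "<", "="):
            raise ValueError("unknown inequality: %r" % inequality)
        diff = pred - true
        if inequality == ">" and diff > 0:
            diff = 0.0  # prediction already satisfies y_pred > y_true
        elif inequality == "<" and diff < 0:
            diff = 0.0  # prediction already satisfies y_pred < y_true
        total += diff ** 2
    return total / len(y_true)


# The same prediction is free under ">", penalised under "<" and "=".
print(sketch_mse_with_inequalities([0.5, 0.5, 0.5], [0.7, 0.7, 0.7], [">", "<", "="]))
```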
-
predict(peptides, allele_encoding=None, batch_size=4096, output_index=0)
Predict affinities.
If peptides are specified as EncodableSequences, then the predictions will be cached for this predictor as long as the EncodableSequences object remains in memory. The cache is keyed on the object identity of the EncodableSequences, not the sequences themselves. The cache is used only for allele-specific models (i.e. when allele_encoding is None).
- Parameters
- peptides : EncodableSequences or list of string
- allele_encoding : AlleleEncoding, optional
Only required when this model is a pan-allele model
- batch_size : int
batch_size passed to Keras
- output_index : int or None
For multi-output models. Gives the output index to return. If set to None, then all outputs are returned as a samples x outputs matrix.
- Returns
- numpy.array of nM affinity predictions
-
classmethod merge(models, merge_method='average')
Merge multiple models at the tensorflow (or other backend) level.
Only certain neural network architectures support merging. Others will result in a NotImplementedError.
- Parameters
- models : list of Class1NeuralNetwork
Instances to merge
- merge_method : string, one of “average”, “sum”, or “concatenate”
How to merge the predictions of the different models
- Returns
- Class1NeuralNetwork
The merged neural network
-
make_network(peptide_encoding, allele_amino_acid_encoding, allele_dense_layer_sizes, peptide_dense_layer_sizes, peptide_allele_merge_method, peptide_allele_merge_activation, layer_sizes, dense_layer_l1_regularization, dense_layer_l2_regularization, activation, init, output_activation, dropout_probability, batch_normalization, locally_connected_layers, topology, num_outputs=1, allele_representations=None)
Helper function to make a keras network for class 1 affinity prediction.
-
clear_allele_representations()
Set allele representations to an empty array. Useful before saving to save a smaller version of the model.
-
set_allele_representations(allele_representations, force_surgery=False)
Set the allele representations in use by this model. This means mutating the weights for the allele input embedding layer.
Rationale: instead of passing in the allele sequence for each data point during model training or prediction (which is expensive in terms of memory usage), we pass in an allele index between 0 and n-1 where n is the number of alleles in some universe of possible alleles. This index is used in the model to look up the corresponding allele sequence. This function sets the lookup table.
See also: AlleleEncoding.allele_representations()
- Parameters
- allele_representations : numpy.ndarray of shape (a, l, m)
where a is the total number of alleles, l is the allele sequence length, and m is the length of the vectors used to represent amino acids.
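The index-based lookup described in the rationale can be sketched without any neural network. Each data point carries a small integer and the sequences are stored once in a table. The helper `build_allele_lookup` and the short sequences below are hypothetical placeholders, not real pseudosequences.

```python
def build_allele_lookup(allele_to_sequence):
    """Return (allele -> index, table) where table[index] is the allele sequence."""
    alleles = sorted(allele_to_sequence)
    allele_to_index = {allele: index for index, allele in enumerate(alleles)}
    table = [allele_to_sequence[allele] for allele in alleles]
    return allele_to_index, table


# Placeholder sequences, for illustration only.
allele_to_sequence = {
    "HLA-A*02:01": "YFAMYGEK",
    "HLA-B*07:02": "YYSEYRNI",
}
allele_to_index, table = build_allele_lookup(allele_to_sequence)

# Each data point stores only an index between 0 and n-1.
data_point_alleles = ["HLA-B*07:02", "HLA-A*02:01", "HLA-B*07:02"]
indices = [allele_to_index[allele] for allele in data_point_alleles]
print(indices, [table[i] for i in indices])
```

Storing one integer per data point rather than a full sequence is what keeps memory usage low when the same allele appears many times.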
-
-
class mhcflurry.Class1ProcessingPredictor(models, manifest_df=None, metadata_dataframes=None, provenance_string=None)
Bases: object
User-facing interface to antigen processing prediction.
Delegates to an ensemble of Class1ProcessingNeuralNetwork instances.
Instantiate a new Class1ProcessingPredictor
Users will generally call load() to restore a saved predictor rather than using this constructor.
- Parameters
- models : list of Class1ProcessingNeuralNetwork
Neural networks in the ensemble.
- manifest_df : pandas.DataFrame
Manifest dataframe. If not specified a new one will be created when needed.
- metadata_dataframes : dict of string -> pandas.DataFrame
Arbitrary metadata associated with this predictor
- provenance_string : string, optional
Optional info string to use in __str__.
-
property sequence_lengths
Supported maximum sequence lengths.
Passing a peptide longer than the maximum supported length results in an error.
Passing an N- or C-flank sequence longer than the maximum supported length results in some part of it being ignored.
- Returns
- dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum supported sequence length.
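The text above does not say which part of an over-long flank is ignored. The sketch below assumes, purely for illustration, that the residues nearest the peptide are kept: the end of the N-flank and the start of the C-flank. `clip_flanks` is a hypothetical helper and the lengths used are made-up values.

```python
def clip_flanks(peptide, n_flank, c_flank, max_lengths):
    """Reject over-long peptides; trim flanks to the supported maximum lengths."""
    if len(peptide) > max_lengths["peptide"]:
        raise ValueError("peptide longer than %d residues" % max_lengths["peptide"])
    n_keep = max_lengths["n_flank"]
    # ASSUMPTION: keep the residues adjacent to the peptide on each side.
    # Slicing with [-0:] would return the whole string, so handle 0 explicitly.
    kept_n = n_flank[-n_keep:] if n_keep > 0 else ""
    kept_c = c_flank[: max_lengths["c_flank"]]
    return kept_n, kept_c


# Hypothetical limits, in the same shape as the sequence_lengths dict.
limits = {"peptide": 15, "n_flank": 3, "c_flank": 3}
print(clip_flanks("SIINFEKL", "MKTAYIAK", "QRQISFVK", limits))
```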
-
add_models(models)
Add models to the ensemble (in-place).
- Parameters
- models : list of Class1ProcessingNeuralNetwork
- Returns
- list of string
Names of the new models.
-
property manifest_df
A pandas.DataFrame describing the models included in this predictor.
- Returns
- pandas.DataFrame
-
static weights_path(models_dir, model_name)
Generate the path to the weights file for a model.
- Parameters
- models_dir : string
- model_name : string
- Returns
- string
-
predict(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)
Predict antigen processing.
- Parameters
- peptides : list of string
Peptide sequences
- n_flanks : list of string
Upstream sequence before each peptide
- c_flanks : list of string
Downstream sequence after each peptide
- throw : boolean
If True, a ValueError will be raised in the case of unsupported peptides. If False, a warning will be logged and the predictions for those peptides will be NaN.
- batch_size : int
Prediction keras batch size.
- Returns
- numpy.array
Processing scores. Range is 0-1, higher indicates more favorable processing.
-
predict_to_dataframe(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)
Predict antigen processing.
See the predict method for parameter descriptions.
- Returns
- pandas.DataFrame
Processing predictions are in the “score” column. Also includes peptides and flanking sequences.
-
predict_to_dataframe_encoded(sequences, throw=True, batch_size=4096)
Predict antigen processing.
See the predict method for more information.
- Parameters
- sequences : FlankingEncoding
- batch_size : int
- throw : boolean
- Returns
- pandas.DataFrame
-
check_consistency()
Verify that self.manifest_df is consistent with instance variables.
Currently only checks for agreement on the total number of models.
Raises AssertionError if inconsistent.
-
save(models_dir, model_names_to_write=None, write_metadata=True)
Serialize the predictor to a directory on disk. If the directory does not exist it will be created.
The serialization format consists of a file called “manifest.csv” with the configurations of each Class1ProcessingNeuralNetwork, along with per-network files giving the model weights.
- Parameters
- models_dir : string
Path to directory. It will be created if it doesn’t exist.
-
classmethod load(models_dir=None, max_models=None)
Deserialize a predictor from a directory on disk.
- Parameters
- models_dir : string
Path to directory. If unspecified the default downloaded models are used.
- max_models : int, optional
Maximum number of models to load
- Returns
- Class1ProcessingPredictor instance
-
class mhcflurry.Class1ProcessingNeuralNetwork(**hyperparameters)
Bases: object
A neural network for antigen processing prediction.
-
network_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Hyperparameters (and their default values) that affect the neural network architecture.
-
fit_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Hyperparameters for neural network training.
-
early_stopping_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Hyperparameters for early stopping.
-
compile_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Loss and optimizer hyperparameters. Any values supported by keras may be used.
-
auxiliary_input_hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
Allele feature hyperparameters.
-
hyperparameter_defaults = <mhcflurry.hyperparameters.HyperparameterDefaults object>
-
property sequence_lengths
Supported maximum sequence lengths.
- Returns
- dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum supported sequence length.
-
update_network_description()
Update self.network_json and self.network_weights properties based on this instance’s neural network.
-
fit(sequences, targets, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)
Fit the neural network.
- Parameters
- sequences : FlankingEncoding
Peptides and upstream/downstream flanking sequences
- targets : list of float
1 indicates hit, 0 indicates decoy
- sample_weights : list of float
If not specified all samples have equal weight.
- shuffle_permutation : list of int
Permutation (integer list) of the same length as sequences and targets. If None, then a random permutation will be generated.
- verbose : int
Keras verbosity level
- progress_callback : function
No-argument function to call after each epoch.
- progress_preamble : string
Optional string of information to include in each progress update
- progress_print_interval : float
How often (in seconds) to print progress update. Set to None to disable.
-
predict(peptides, n_flanks=None, c_flanks=None, batch_size=4096)
Predict antigen processing.
- Parameters
- peptides : list of string
Peptide sequences
- n_flanks : list of string
Upstream sequence before each peptide
- c_flanks : list of string
Downstream sequence after each peptide
- batch_size : int
Prediction keras batch size.
- Returns
- numpy.array
Processing scores. Range is 0-1, higher indicates more favorable processing.
-
predict_encoded(sequences, throw=True, batch_size=4096)
Predict antigen processing.
- Parameters
- sequences : FlankingEncoding
Peptides and flanking sequences
- throw : boolean
Whether to throw exception on unsupported peptides
- batch_size : int
Prediction keras batch size.
- Returns
- numpy.array
-
network_input
(sequences, throw=True)[source]¶ Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).
- Parameters
- sequencesFlankingEncoding
Peptides and flanking sequences
- throwboolean
Whether to throw exception on unsupported peptides
- Returns
- numpy.array
-
make_network
(amino_acid_encoding, peptide_max_length, n_flank_length, c_flank_length, flanking_averages, convolutional_filters, convolutional_kernel_size, convolutional_activation, convolutional_kernel_l1_l2, dropout_rate, post_convolutional_dense_layer_sizes)[source]¶ Helper function to make a keras network given hyperparameters.
-
-
class
mhcflurry.
Class1PresentationPredictor
(affinity_predictor=None, processing_predictor_with_flanks=None, processing_predictor_without_flanks=None, weights_dataframe=None, metadata_dataframes=None, percent_rank_transform=None, provenance_string=None)[source]¶ Bases:
object
A logistic regression model over predicted binding affinity (BA) and antigen processing (AP) score.
Instances of this class delegate to Class1AffinityPredictor and Class1ProcessingPredictor instances to generate BA and AP predictions. These predictions are combined using a logistic regression model to give a “presentation score” prediction.
Most users will call the load static method to get an instance of this class, then call the predict method to generate predictions.
-
model_inputs
= ['affinity_score', 'processing_score']¶
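As a rough illustration of how a logistic regression over these two inputs yields a presentation score, the sketch below combines an affinity score and a processing score. The coefficients, the intercept, and the nM-to-score transform (1 - log(ic50) / log(50000), clipped to [0, 1]) are assumptions made for illustration only; the fitted values belong to the loaded predictor and are not reproduced here.

```python
import math

MAX_IC50 = 50000.0  # assumed upper bound used to rescale nM affinities


def affinity_score_from_ic50(ic50_nm):
    """Map an nM affinity to [0, 1]; higher means tighter binding (assumed transform)."""
    score = 1.0 - math.log(max(ic50_nm, 1e-12)) / math.log(MAX_IC50)
    return min(1.0, max(0.0, score))


def presentation_score(affinity_score, processing_score, coefficients, intercept):
    """Logistic regression over the two model inputs."""
    z = (
        intercept
        + coefficients["affinity_score"] * affinity_score
        + coefficients["processing_score"] * processing_score
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, chosen only to make the example concrete.
weights = {"affinity_score": 8.0, "processing_score": 3.0}
strong = presentation_score(affinity_score_from_ic50(50.0), 0.9, weights, intercept=-7.0)
weak = presentation_score(affinity_score_from_ic50(30000.0), 0.1, weights, intercept=-7.0)
```

With these made-up weights, a tight binder with a high processing score lands well above a weak binder with a low processing score, which is the qualitative behaviour the class description implies.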
-
property
supported_alleles
¶ List of alleles supported by the underlying Class1AffinityPredictor
-
property
supported_peptide_lengths
¶ (min, max) of supported peptide lengths, inclusive.
-
property
supports_affinity_prediction
¶ Is there an affinity predictor associated with this instance?
-
property
supports_processing_prediction
¶ Is there a processing predictor associated with this instance?
-
property
supports_presentation_prediction
¶ Can this instance predict presentation?
-
predict_affinity
(peptides, alleles, sample_names=None, include_affinity_percentile=True, verbose=1, throw=True)[source]¶ Predict binding affinities across samples (each corresponding to up to six MHC I alleles).
Two modes are supported: each peptide can be evaluated for binding to any of the alleles in any sample (this is what happens when sample_names is None), or the i’th peptide can be evaluated for binding the alleles of the sample given by the i’th entry in sample_names.
For example, if we don’t specify sample_names, then predictions are taken for all combinations of samples and peptides, for a result size of num peptides * num samples:
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide  peptide_num  sample_name   affinity  best_allele  affinity_percentile
0  SIINFEKL            0      sample1  11927.161        A0201                6.296
1   PEPTIDE            1      sample1  32507.083        A0201               71.249
2  SIINFEKL            0      sample2   2725.593        C0202                6.662
3   PEPTIDE            1      sample2  28304.330        C0202               54.652
In contrast, here we specify sample_names, so each peptide is evaluated for binding the alleles in the corresponding sample, for a result size equal to the number of peptides:
>>> predictor.predict_affinity(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    sample_names=["sample2", "sample1"],
...    verbose=0)
    peptide  peptide_num  sample_name   affinity  best_allele  affinity_percentile
0  SIINFEKL            0      sample2   2725.592        C0202                6.662
1   PEPTIDE            1      sample1  32507.079        A0201               71.249
- Parameters
- peptideslist of string
Peptide sequences
- allelesdict of string -> list of string
Keys are sample names, values are the alleles (genotype) for that sample
- sample_nameslist of string [same length as peptides]
Sample names corresponding to each peptide. If None, then predictions are generated for all sample genotypes across all peptides.
- include_affinity_percentilebool
Whether to include affinity percentile ranks
- verboseint
Set to 0 for quiet.
- throwboolean
Whether to throw exception (vs. just log a warning) on invalid peptides, etc.
- Returns
- pandas.DataFrame
Predictions
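The two modes described above differ only in how peptides are paired with samples. A minimal sketch of that pairing, independent of the library, is given below; the function name and the row layout are hypothetical, and the row order of the real output is not something this sketch guarantees.

```python
def pair_peptides_with_samples(peptides, alleles, sample_names=None):
    """Return (peptide_num, peptide, sample_name) rows for the two modes."""
    if sample_names is None:
        # Mode 1: every peptide is evaluated against every sample.
        return [
            (num, peptide, sample)
            for sample in alleles
            for num, peptide in enumerate(peptides)
        ]
    # Mode 2: the i'th peptide goes with the i'th sample name.
    if len(sample_names) != len(peptides):
        raise ValueError("sample_names must have the same length as peptides")
    return [
        (num, peptide, sample)
        for num, (peptide, sample) in enumerate(zip(peptides, sample_names))
    ]


peptides = ["SIINFEKL", "PEPTIDE"]
alleles = {
    "sample1": ["A0201", "A0301", "B0702"],
    "sample2": ["A0101", "C0202"],
}
all_pairs = pair_peptides_with_samples(peptides, alleles)
matched_pairs = pair_peptides_with_samples(peptides, alleles, ["sample2", "sample1"])
```

The first call yields num peptides * num samples rows; the second yields one row per peptide.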
-
predict_processing
(peptides, n_flanks=None, c_flanks=None, throw=True, verbose=1)[source]¶ Predict antigen processing scores for individual peptides, optionally including flanking sequences for better cleavage prediction.
- Parameters
- peptideslist of string
- n_flankslist of string [same length as peptides]
- c_flankslist of string [same length as peptides]
- throwboolean
Whether to raise exception on unsupported peptides
- verboseint
- Returns
- numpy.array
Antigen processing scores for each peptide
-
fit
(targets, peptides, sample_names, alleles, n_flanks=None, c_flanks=None, verbose=1)[source]¶ Fit the presentation score logistic regression model.
- Parameters
- targetslist of int/float
1 indicates hit, 0 indicates decoy
- peptideslist of string [same length as targets]
- sample_nameslist of string [same length as targets]
- allelesdict of string -> list of string
Keys are sample names, values are the alleles for that sample
- n_flankslist of string [same length as targets]
- c_flankslist of string [same length as targets]
- verboseint
-
get_model
(name=None)[source]¶ Load or instantiate a new logistic regression model. Private helper method.
- Parameters
- namestring
If None (the default), an un-fit LR model is returned. Otherwise the weights are loaded for the specified model.
- Returns
- sklearn.linear_model.LogisticRegression
-
predict
(peptides, alleles, sample_names=None, n_flanks=None, c_flanks=None, include_affinity_percentile=False, verbose=1, throw=True)[source]¶ Predict presentation scores across a set of peptides.
Presentation scores combine predictions for MHC I binding affinity and antigen processing.
This method returns a pandas.DataFrame giving presentation scores plus the binding affinity and processing predictions and other intermediate results.
Example:
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict(
...    peptides=["SIINFEKL", "PEPTIDE"],
...    n_flanks=["NNN", "SNS"],
...    c_flanks=["CCC", "CNC"],
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    verbose=0)
    peptide  n_flank  c_flank  peptide_num  sample_name   affinity  best_allele  processing_score  presentation_score  presentation_percentile
0  SIINFEKL      NNN      CCC            0      sample1  11927.161        A0201             0.838               0.145                    2.282
1   PEPTIDE      SNS      CNC            1      sample1  32507.083        A0201             0.025               0.003                  100.000
2  SIINFEKL      NNN      CCC            0      sample2   2725.593        C0202             0.838               0.416                    1.017
3   PEPTIDE      SNS      CNC            1      sample2  28304.330        C0202             0.025               0.003                   99.287
You can also specify sample_names, in which case each peptide is evaluated for binding the alleles in the corresponding sample only. See predict_affinity for an example.
- Parameters
- peptideslist of string
Peptide sequences
- alleleslist of string or dict of string -> list of string
If you are predicting for a single sample, pass a list of strings (up to 6) indicating the genotype. If you are predicting across multiple samples, pass a dict where the keys are (arbitrary) sample names and the values are the alleles to predict for that sample. Set to an empty list or dict to perform processing prediction only.
- sample_nameslist of string [same length as peptides]
If you are passing a dict for ‘alleles’, you can use this argument to specify which peptides go with which samples. If it is None, then predictions will be performed for each peptide across all samples.
- n_flankslist of string [same length as peptides]
Upstream sequences before the peptide. Sequences of any length can be given and a suffix of the size supported by the model will be used.
- c_flankslist of string [same length as peptides]
Downstream sequences after the peptide. Sequences of any length can be given and a prefix of the size supported by the model will be used.
- include_affinity_percentilebool
Whether to include affinity percentile ranks
- verboseint
Set to 0 for quiet.
- throwboolean
Whether to throw exception (vs. just log a warning) on invalid peptides, etc.
- Returns
- pandas.DataFrame
- Presentation scores and intermediate results.
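The n_flanks and c_flanks descriptions above say that flanks of any length are accepted and that only a suffix (upstream) or prefix (downstream) is used. A minimal sketch of that trimming follows; the function name and the flank lengths passed in are hypothetical, since the supported lengths depend on the loaded processing model.

```python
def trim_flanks(n_flank, c_flank, n_flank_length, c_flank_length):
    """Keep only the flank residues closest to the peptide."""
    # Upstream flank: keep the suffix. Guard against length 0, because
    # n_flank[-0:] would return the whole string instead of an empty one.
    n_used = n_flank[-n_flank_length:] if n_flank_length > 0 else ""
    # Downstream flank: keep the prefix.
    c_used = c_flank[:c_flank_length]
    return n_used, c_used


trimmed = trim_flanks("MDSKGSSQKG", "LLLVVSNLL", 5, 5)
short = trim_flanks("AB", "C", 5, 5)
```

Flanks shorter than the supported length pass through unchanged in this sketch.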
-
predict_sequences
(sequences, alleles, result='best', comparison_quantity=None, filter_value=None, peptide_lengths=(8, 9, 10, 11), use_flanks=True, include_affinity_percentile=True, verbose=1, throw=True)[source]¶ Predict presentation across protein sequences.
Example:
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_sequences(
...    sequences={
...        'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...        'protein2': "SSLPTPEDKEQAQQTHH",
...    },
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    result="filtered",
...    comparison_quantity="affinity",
...    filter_value=500,
...    verbose=0)
   sequence_name  pos     peptide  n_flank  c_flank  sample_name  affinity  best_allele  affinity_percentile  processing_score  presentation_score  presentation_percentile
0       protein1   14   LLLVVSNLL    GSRLL               sample1    57.180        A0201                0.398             0.233               0.754                    0.351
1       protein1   13   LLLLVVSNL    KGSRL        L      sample1    57.339        A0201                0.398             0.031               0.586                    0.643
2       protein1    5   SSQKGSRLL    MDSKG    LLLVV      sample2   110.779        C0202                0.782             0.061               0.456                    0.920
3       protein1    6   SQKGSRLLL    DSKGS    LLVVS      sample2   254.480        C0202                1.735             0.102               0.303                    1.356
4       protein1   13  LLLLVVSNLL    KGSRL               sample1   260.390        A0201                1.012             0.158               0.345                    1.215
5       protein1   12  LLLLLVVSNL    QKGSR        L      sample1   308.150        A0201                1.094             0.015               0.206                    1.802
6       protein2    0   SSLPTPEDK             EQAQQ      sample2   410.354        C0202                2.398             0.003               0.158                    2.155
7       protein1    5    SSQKGSRL    MDSKG    LLLLV      sample2   444.321        C0202                2.512             0.026               0.159                    2.138
8       protein2    0   SSLPTPEDK             EQAQQ      sample1   459.296        A0301                0.971             0.003               0.144                    2.292
9       protein1    4   GSSQKGSRL     MDSK    LLLLV      sample2   469.052        C0202                2.595             0.014               0.146                    2.261
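The pos, peptide, n_flank and c_flank columns in the example can be understood as a sliding window over each protein. The sketch below reproduces that enumeration in plain Python; the function name is hypothetical, and the flank length of 5 is an assumption read off the example output rather than a documented constant.

```python
def enumerate_peptides(sequence, peptide_lengths=(8, 9, 10, 11), flank_length=5):
    """List every peptide of the requested lengths with its flanking residues."""
    rows = []
    for length in peptide_lengths:
        for pos in range(len(sequence) - length + 1):
            end = pos + length
            rows.append({
                "pos": pos,
                "peptide": sequence[pos:end],
                # Flanks are truncated at the protein termini.
                "n_flank": sequence[max(0, pos - flank_length):pos],
                "c_flank": sequence[end:end + flank_length],
            })
    return rows


protein1 = "MDSKGSSQKGSRLLLLLVVSNLL"
rows = enumerate_peptides(protein1)
last_9mer = [r for r in rows if r["pos"] == 14 and len(r["peptide"]) == 9][0]
```

For the 9-mer at position 14 this gives the peptide LLLVVSNLL with upstream flank GSRLL and an empty downstream flank, matching the first row of the example.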
- Parameters
- sequencesstr, list of string, or string -> string dict
Protein sequences. If a dict is given, the keys are arbitrary (e.g. protein names), and the values are the amino acid sequences.
- alleleslist of string, list of list of string, or dict of string -> list of string
MHC I alleles. Can be: (1) a string (a single allele), (2) a list of strings (a single genotype), (3) a list of list of strings (multiple genotypes, where the total number of genotypes must equal the number of sequences), or (4) a dict giving multiple genotypes, which will each be run over the sequences.
- resultstring
Specify ‘best’ to return the strongest peptide for each sequence, ‘all’ to return predictions for all peptides, or ‘filtered’ to return predictions where the comparison_quantity is stronger (i.e. < for affinity, > for scores) than filter_value.
- comparison_quantitystring
One of “presentation_score”, “processing_score”, “affinity”, or “affinity_percentile”. Prediction to use to rank (if result is “best”) or filter (if result is “filtered”) results. Default is “presentation_score”.
- filter_valuefloat
Threshold value to use, only relevant when result is “filtered”. If comparison_quantity is “affinity”, then all results less than (i.e. tighter than) the specified nM affinity are retained. If it’s “presentation_score” or “processing_score” then results greater than the indicated filter_value are retained.
- peptide_lengthslist of int
Peptide lengths to predict for.
- use_flanksbool
Whether to include flanking sequences when running the AP predictor (for better cleavage prediction).
- include_affinity_percentilebool
Whether to include affinity percentile ranks in output.
- verboseint
Set to 0 for quiet mode.
- throwboolean
Whether to throw exceptions (vs. log warnings) on invalid inputs.
- Returns
- pandas.DataFrame with columns:
peptide, n_flank, c_flank, sequence_name, affinity, best_allele, processing_score, presentation_score
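The direction of the comparison used by result and filter_value depends on the quantity: lower is stronger for affinities, higher is stronger for scores. The following sketch spells that out on plain dictionaries. The helper names are hypothetical, and treating "affinity_percentile" as lower-is-stronger is an assumption of this sketch, since the parameter descriptions above only state the direction for affinity and for the two scores.

```python
# Quantities where a smaller value means a stronger prediction. Including
# "affinity_percentile" here is an assumption of this sketch.
LOWER_IS_STRONGER = {"affinity", "affinity_percentile"}


def passes_filter(row, comparison_quantity, filter_value):
    """Return True if the row is stronger than filter_value."""
    value = row[comparison_quantity]
    if comparison_quantity in LOWER_IS_STRONGER:
        return value < filter_value
    return value > filter_value


def best_row(rows, comparison_quantity):
    """Pick the strongest row among those given."""
    if comparison_quantity in LOWER_IS_STRONGER:
        return min(rows, key=lambda row: row[comparison_quantity])
    return max(rows, key=lambda row: row[comparison_quantity])


# Values copied from the examples in this document.
rows = [
    {"peptide": "LLLVVSNLL", "affinity": 57.180, "presentation_score": 0.754},
    {"peptide": "SIINFEKL", "affinity": 11927.161, "presentation_score": 0.145},
]
kept = [row["peptide"] for row in rows if passes_filter(row, "affinity", 500)]
```

Note that result='best' is described as returning the strongest peptide per sequence, so a fuller version would apply best_row within each sequence group.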
-
save
(models_dir, write_affinity_predictor=True, write_processing_predictor=True, write_weights=True, write_percent_ranks=True, write_info=True, write_metdata=True)[source]¶ Save the predictor to a directory on disk. If the directory does not exist it will be created.
The wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances are included in the saved data.
- Parameters
- models_dirstring
Path to directory. It will be created if it doesn’t exist.
-
classmethod
load
(models_dir=None, max_models=None)[source]¶ Deserialize a predictor from a directory on disk.
This will also load the wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances.
- Parameters
- models_dirstring
Path to directory. If unspecified the default downloaded models are used.
- max_modelsint, optional
Maximum number of affinity and processing (counted separately) models to load
- Returns
Class1PresentationPredictor
instance
-
percentile_ranks
(presentation_scores, throw=True)[source]¶ Return percentile ranks for the given presentation scores.
- Parameters
- presentation_scoressequence of float
- Returns
- numpy.array of float
-
calibrate_percentile_ranks
(scores, bins=None)[source]¶ Compute the cumulative distribution of scores, to enable taking quantiles of this distribution later.
- Parameters
- scoressequence of float
Presentation prediction scores
- binsobject
Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges.
-
Submodules¶
mhcflurry.allele_encoding module¶
-
class
mhcflurry.allele_encoding.
AlleleEncoding
(alleles=None, allele_to_sequence=None, borrow_from=None)[source]¶ Bases:
object
A place to cache encodings for a sequence of alleles.
We frequently work with alleles by integer indices, for example as inputs to neural networks. This class is used to map allele names to integer indices in a consistent way by keeping track of the universe of alleles under use, i.e. a distinction is made between the universe of supported alleles (what’s in allele_to_sequence) and the actual set of alleles used for some task (what’s in alleles).
- Parameters
- alleleslist of string
Allele names. If any allele is None instead of string, it will be mapped to the special index value -1.
- allele_to_sequencedict of str -> str
Allele name to amino acid sequence
- borrow_fromAlleleEncoding, optional
If specified, do not specify allele_to_sequence. The sequences from the provided instance are used. This guarantees that the mappings from allele to index and from allele to sequence are the same between the instances.
-
compact
()[source]¶ Return a new AlleleEncoding in which the universe of supported alleles is only the alleles actually used.
- Returns
- AlleleEncoding
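To make the distinction between the universe of supported alleles and the alleles actually used more concrete, here is a small stand-alone sketch of the index mapping and of compaction. It is not the class's implementation: the helper names are hypothetical, the sequences are shortened placeholders, and assigning indices in sorted order is an assumption.

```python
def build_universe(allele_to_sequence):
    """Assign each supported allele a stable integer index (sorted order assumed)."""
    return {allele: index for index, allele in enumerate(sorted(allele_to_sequence))}


def encode(alleles, allele_to_index):
    """Map allele names to indices; None maps to the special index -1."""
    return [-1 if allele is None else allele_to_index[allele] for allele in alleles]


def compact(alleles, allele_to_sequence):
    """Restrict the universe to the alleles actually used."""
    used = {allele for allele in alleles if allele is not None}
    return {a: s for a, s in allele_to_sequence.items() if a in used}


# Hypothetical, shortened sequences; real pseudosequences are longer.
allele_to_sequence = {
    "HLA-A*02:01": "YFAMYGEK",
    "HLA-B*07:02": "YYSEYRNI",
    "HLA-C*02:02": "YYAGYREK",
}
alleles = ["HLA-B*07:02", None, "HLA-A*02:01", "HLA-B*07:02"]
indices = encode(alleles, build_universe(allele_to_sequence))
compacted = compact(alleles, allele_to_sequence)
```

After compaction the unused allele is dropped from the universe, so indices computed against the compacted universe can differ from the original ones.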
-
allele_representations
(encoding_name)[source]¶ Encode the universe of supported allele sequences to a matrix.
- Parameters
- encoding_namestring
How to represent amino acids. Valid names are “BLOSUM62” or “one-hot”. See amino_acid.ENCODING_DATA_FRAMES.
- Returns
- numpy.array of shape (num alleles in universe, sequence length, vector size), where vector size is usually 21 (20 amino acids + X character)
-
fixed_length_vector_encoded_sequences
(encoding_name)[source]¶ Encode allele sequences (not the universe of alleles) to a matrix.
- Parameters
- encoding_namestring
How to represent amino acids. Valid names are “BLOSUM62” or “one-hot”. See amino_acid.ENCODING_DATA_FRAMES.
- Returns
- numpy.array with shape (num alleles, sequence length, vector size), where vector size is usually 21 (20 amino acids + X character)
mhcflurry.amino_acid module¶
Functions for encoding fixed length sequences of amino acids into various vector representations, such as one-hot and BLOSUM62.
-
mhcflurry.amino_acid.
available_vector_encodings
()[source]¶ Return list of supported amino acid vector encodings.
- Returns
- list of string
-
mhcflurry.amino_acid.
vector_encoding_length
(name)[source]¶ Return the length of the given vector encoding.
- Parameters
- namestring
- Returns
- int
-
mhcflurry.amino_acid.
index_encoding
(sequences, letter_to_index_dict)[source]¶ Encode a sequence of same-length strings to a matrix of integers of the same shape. The map from characters to integers is given by letter_to_index_dict.
Given a sequence of n strings all of length k, return an n * k array where the (i, j)th element is letter_to_index_dict[sequence[i][j]].
- Parameters
- sequenceslist of length n of strings of length k
- letter_to_index_dictdict
- Returns
- numpy.array of integers with shape (n, k)
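The element rule above, letter_to_index_dict[sequence[i][j]], can be shown with nested lists. This is a plain-Python sketch rather than the library function: the name index_encoding_sketch and the four-letter alphabet are hypothetical, and the result is a list of lists instead of a numpy array.

```python
def index_encoding_sketch(sequences, letter_to_index_dict):
    """Row i holds the integer codes of string i, one code per position j."""
    lengths = {len(s) for s in sequences}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same length")
    return [[letter_to_index_dict[letter] for letter in s] for s in sequences]


letter_to_index = {"A": 0, "C": 1, "D": 2, "X": 3}  # hypothetical alphabet
encoded = index_encoding_sketch(["ACD", "DXA"], letter_to_index)
```

Here two strings of length three give two rows of three integers each.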
-
mhcflurry.amino_acid.
fixed_vectors_encoding
(index_encoded_sequences, letter_to_vector_df)[source]¶ Given an n x k matrix of integers such as that returned by index_encoding() and a dataframe mapping each index to an arbitrary vector, return an n * k * m array where the (i, j)’th element is letter_to_vector_df.iloc[sequence[i][j]].
The dataframe index and column names are ignored here; the indexing is done entirely by integer position in the dataframe.
- Parameters
- index_encoded_sequencesn x k array of integers
- letter_to_vector_dfpandas.DataFrame of shape (alphabet size, m)
- Returns
- numpy.array of integers with shape (n, k, m)
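The lookup described for fixed_vectors_encoding replaces each integer with a row of the vector table, selected purely by position. A dependency-free sketch is shown below; the function name is hypothetical, and a list of rows stands in for the pandas.DataFrame, so this only illustrates the shape logic.

```python
def fixed_vectors_encoding_sketch(index_encoded_sequences, letter_to_vector_rows):
    """Replace each integer index with the vector stored at that row position."""
    return [
        [list(letter_to_vector_rows[index]) for index in row]
        for row in index_encoded_sequences
    ]


# Hypothetical 4-letter alphabet with m = 2 values per letter.
letter_to_vector_rows = [
    [1, 0],   # position 0
    [0, 1],   # position 1
    [1, 1],   # position 2
    [0, 0],   # position 3
]
index_encoded = [[0, 1, 2], [2, 3, 0]]  # n = 2 sequences of length k = 3
vectors = fixed_vectors_encoding_sketch(index_encoded, letter_to_vector_rows)
```

The result has n entries, each with k vectors of length m.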
mhcflurry.calibrate_percentile_ranks_command module¶
Calibrate percentile ranks for models. Runs in-place.
-
mhcflurry.calibrate_percentile_ranks_command.
run
(argv=sys.argv[1:])[source]¶
-
mhcflurry.calibrate_percentile_ranks_command.
run_class1_presentation_predictor
(args, peptides)[source]¶
mhcflurry.class1_affinity_predictor module¶
-
class
mhcflurry.class1_affinity_predictor.
Class1AffinityPredictor
(allele_to_allele_specific_models=None, class1_pan_allele_models=None, allele_to_sequence=None, manifest_df=None, allele_to_percent_rank_transform=None, metadata_dataframes=None, provenance_string=None)[source]¶ Bases:
object
High-level interface for peptide/MHC I binding affinity prediction.
This class manages low-level Class1NeuralNetwork instances, each of which wraps a single Keras network. The purpose of Class1AffinityPredictor is to implement ensembles, handling of multiple alleles, and predictor loading and saving. It also provides a place to keep track of metadata like prediction histograms for percentile rank calibration.
- Parameters
- allele_to_allele_specific_modelsdict of string -> list of Class1NeuralNetwork
Ensemble of single-allele models to use for each allele.
- class1_pan_allele_modelslist of Class1NeuralNetwork
Ensemble of pan-allele models.
- allele_to_sequencedict of string -> string
MHC allele name to fixed-length amino acid sequence (sometimes referred to as the pseudosequence). Required only if class1_pan_allele_models is specified.
- manifest_dfpandas.DataFrame, optional
Must have columns: model_name, allele, config_json, model. Only required if you want to update an existing serialization of a Class1AffinityPredictor. Otherwise this dataframe will be generated automatically based on the supplied models.
- allele_to_percent_rank_transformdict of string -> PercentRankTransform, optional
PercentRankTransform instances to use for each allele
- metadata_dataframesdict of string -> pandas.DataFrame, optional
Optional additional dataframes to write to the models dir when save() is called. Useful for tracking provenance.
- provenance_stringstring, optional
Optional info string to use in __str__.
-
property
manifest_df
¶ A pandas.DataFrame describing the models included in this predictor.
Based on: - self.class1_pan_allele_models - self.allele_to_allele_specific_models
- Returns
- pandas.DataFrame
-
clear_cache
()[source]¶ Clear values cached based on the neural networks in this predictor.
- Users should call this after mutating any of the following:
self.class1_pan_allele_models
self.allele_to_allele_specific_models
self.allele_to_sequence
Methods that mutate these instance variables will call this method on their own if needed.
-
property
neural_networks
¶ List of the neural networks in the ensemble.
- Returns
- list of Class1NeuralNetwork
-
classmethod
merge
(predictors)[source]¶ Merge the ensembles of two or more Class1AffinityPredictor instances.
Note: the resulting merged predictor will NOT have calibrated percentile ranks. Call calibrate_percentile_ranks on it if these are needed.
- Parameters
- predictorssequence of Class1AffinityPredictor
- Returns
Class1AffinityPredictor instance
-
merge_in_place
(others)[source]¶ Add the models present in other predictors into the current predictor.
- Parameters
- otherslist of Class1AffinityPredictor
Other predictors to merge into the current predictor.
- Returns
- list of string
Names of newly added models
-
property
supported_alleles
¶ Alleles for which predictions can be made.
- Returns
- list of string
-
property
supported_peptide_lengths
¶ (minimum, maximum) lengths of peptides supported by all models, inclusive.
- Returns
- (int, int) tuple
-
check_consistency
()[source]¶ Verify that self.manifest_df is consistent with: - self.class1_pan_allele_models - self.allele_to_allele_specific_models
Currently only checks for agreement on the total number of models.
Throws AssertionError if inconsistent.
-
save
(models_dir, model_names_to_write=None, write_metadata=True)[source]¶ Serialize the predictor to a directory on disk. If the directory does not exist it will be created.
The serialization format consists of a file called “manifest.csv” with the configurations of each Class1NeuralNetwork, along with per-network files giving the model weights. If there are pan-allele predictors in the ensemble, the allele sequences are also stored in the directory. There is also a small file “index.txt” with basic metadata: when the models were trained, by whom, on what host.
- Parameters
- models_dirstring
Path to directory. It will be created if it doesn’t exist.
- model_names_to_writelist of string, optional
Only write the weights for the specified models. Useful for incremental updates during training.
- write_metadataboolean, optional
Whether to write optional metadata
-
static
load
(models_dir=None, max_models=None, optimization_level=None)[source]¶ Deserialize a predictor from a directory on disk.
- Parameters
- models_dirstring
Path to directory. If unspecified the default downloaded models are used.
- max_modelsint, optional
Maximum number of Class1NeuralNetwork instances to load
- optimization_levelint
If >0, model optimization will be attempted. Defaults to value of environment variable MHCFLURRY_OPTIMIZATION_LEVEL.
- Returns
Class1AffinityPredictor
instance
-
optimize
(warn=True)[source]¶ EXPERIMENTAL: Optimize the predictor for faster predictions.
Currently the only optimization implemented is to merge multiple pan-allele predictors at the tensorflow level.
The optimization is performed in-place, mutating the instance.
- Returns
- bool
Whether optimization was performed
-
static
model_name
(allele, num)[source]¶ Generate a model name
- Parameters
- allelestring
- numint
- Returns
- string
-
static
weights_path
(models_dir, model_name)[source]¶ Generate the path to the weights file for a model
- Parameters
- models_dirstring
- model_namestring
- Returns
- string
-
property
master_allele_encoding
¶ An AlleleEncoding containing the universe of alleles specified by self.allele_to_sequence.
- Returns
- AlleleEncoding
-
fit_allele_specific_predictors
(n_models, architecture_hyperparameters_list, allele, peptides, affinities, inequalities=None, train_rounds=None, models_dir_for_save=None, verbose=0, progress_preamble='', progress_print_interval=5.0)[source]¶ Fit one or more allele specific predictors for a single allele using one or more neural network architectures.
The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.
- Parameters
- n_modelsint
Number of neural networks to fit
- architecture_hyperparameters_listlist of dict
List of hyperparameter sets.
- allelestring
- peptidesEncodableSequences or list of string
- affinitieslist of float
nM affinities
- inequalitieslist of string, each element one of “>”, “<”, or “=”
See Class1NeuralNetwork.fit for details.
- train_roundssequence of int
Each training point i will be used on training rounds r for which train_rounds[i] > r, r >= 0.
- models_dir_for_savestring, optional
If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.
- verboseint
Keras verbosity
- progress_preamblestring
Optional string of information to include in each progress update
- progress_print_intervalfloat
How often (in seconds) to print progress. Set to None to disable.
- Returns
- list of Class1NeuralNetwork
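The train_rounds rule above (point i is used on round r when train_rounds[i] > r) is easy to misread, so here is a small sketch of which points take part in each round. The helper name is hypothetical and the round counts are made up.

```python
def points_for_round(train_rounds, r):
    """Indices of training points used on round r (rounds count from 0)."""
    return [i for i, rounds in enumerate(train_rounds) if rounds > r]


train_rounds = [1, 3, 2, 0]  # hypothetical values, one per training point
schedule = [points_for_round(train_rounds, r) for r in range(4)]
```

A value of 0 means the point is never used, and a value of 3 means it is used on rounds 0, 1 and 2.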
-
fit_class1_pan_allele_models
(n_models, architecture_hyperparameters, alleles, peptides, affinities, inequalities, models_dir_for_save=None, verbose=1, progress_preamble='', progress_print_interval=5.0)[source]¶ Fit one or more pan-allele predictors using a single neural network architecture.
The new predictors are saved in the Class1AffinityPredictor instance and will be used on subsequent calls to predict.
- Parameters
- n_modelsint
Number of neural networks to fit
- architecture_hyperparametersdict
- alleleslist of string
Allele names (not sequences) corresponding to each peptide
- peptidesEncodableSequences or list of string
- affinitieslist of float
nM affinities
- inequalitieslist of string, each element one of “>”, “<”, or “=”
See Class1NeuralNetwork.fit for details.
- models_dir_for_savestring, optional
If specified, the Class1AffinityPredictor is (incrementally) written to the given models dir after each neural network is fit.
- verboseint
Keras verbosity
- progress_preamblestring
Optional string of information to include in each progress update
- progress_print_intervalfloat
How often (in seconds) to print progress. Set to None to disable.
- Returns
- list of Class1NeuralNetwork
-
add_pan_allele_model
(model, models_dir_for_save=None)[source]¶ Add a pan-allele model to the ensemble and optionally do an incremental save.
- Parameters
- modelClass1NeuralNetwork
- models_dir_for_savestring
Directory to save resulting ensemble to
-
percentile_ranks
(affinities, allele=None, alleles=None, throw=True)[source]¶ Return percentile ranks for the given ic50 affinities and alleles.
The ‘allele’ and ‘alleles’ arguments are as in the predict method. Specify one of these.
- Parameters
- affinitiessequence of float
nM affinities
- allelestring
- allelessequence of string
- throwboolean
If True, a ValueError will be raised in the case of unsupported alleles. If False, a warning will be logged and NaN will be returned for those percentile ranks.
- Returns
- numpy.array of float
-
predict
(peptides, alleles=None, allele=None, throw=True, centrality_measure='mean', model_kwargs={})[source]¶ Predict nM binding affinities.
If multiple predictors are available for an allele, the predictions are the geometric means of the individual model (nM) predictions.
One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.
- Parameters
- peptidesEncodableSequences or list of string
- alleleslist of string
- allelestring
- throwboolean
If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.
- centrality_measurestring or callable
Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.
- model_kwargsdict
Additional keyword arguments to pass to Class1NeuralNetwork.predict
- Returns
- numpy.array of predictions
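The statement that ensemble predictions are geometric means of the per-model nM values is equivalent to averaging in log space and exponentiating. A minimal sketch of that arithmetic follows; the function name is hypothetical, and how the other centrality_measure options are applied is not covered here.

```python
import math


def combine_nm_predictions(per_model_nm):
    """Geometric mean of per-model nM predictions for one peptide."""
    logs = [math.log(value) for value in per_model_nm]
    return math.exp(sum(logs) / len(logs))


# Two hypothetical models predicting 10 nM and 1000 nM for the same peptide.
combined = combine_nm_predictions([10.0, 1000.0])
arithmetic = (10.0 + 1000.0) / 2
```

The geometric mean (about 100 nM here) is much less dominated by a single weak prediction than the arithmetic mean (505 nM).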
-
predict_to_dataframe
(peptides, alleles=None, allele=None, throw=True, include_individual_model_predictions=False, include_percentile_ranks=True, include_confidence_intervals=True, centrality_measure='mean', model_kwargs={})[source]¶ Predict nM binding affinities. Gives more detailed output than the predict method, including 5-95% prediction intervals.
If multiple predictors are available for an allele, the predictions are the geometric means of the individual model predictions.
One of ‘allele’ or ‘alleles’ must be specified. If ‘allele’ is specified all predictions will be for the given allele. If ‘alleles’ is specified it must be the same length as ‘peptides’ and give the allele corresponding to each peptide.
- Parameters
- peptidesEncodableSequences or list of string
- alleleslist of string
- allelestring
- throwboolean
If True, a ValueError will be raised in the case of unsupported alleles or peptide lengths. If False, a warning will be logged and the predictions for the unsupported alleles or peptides will be NaN.
- include_individual_model_predictionsboolean
If True, the predictions of each individual model are included as columns in the result DataFrame.
- include_percentile_ranksboolean, default True
If True, a “prediction_percentile” column will be included giving the percentile ranks. If no percentile rank info is available, this will be ignored with a warning.
- centrality_measurestring or callable
Measure of central tendency to use to combine predictions in the ensemble. Options include: mean, median, robust_mean.
- model_kwargsdict
Additional keyword arguments to pass to Class1NeuralNetwork.predict
- Returns
pandas.DataFrame of predictions
-
calibrate_percentile_ranks
(peptides=None, num_peptides_per_length=100000, alleles=None, bins=None, motif_summary=False, summary_top_peptide_fractions=[0.001], verbose=False, model_kwargs={})[source]¶ Compute the cumulative distribution of ic50 values for a set of alleles over a large universe of random peptides, to enable taking quantiles of this distribution later.
- Parameters
- peptidessequence of string or EncodableSequences, optional
Peptides to use
- num_peptides_per_lengthint, optional
If peptides argument is not specified, then num_peptides_per_length peptides are randomly sampled from a uniform distribution for each supported length
- allelessequence of string, optional
Alleles to perform calibration for. If not specified all supported alleles will be calibrated.
- binsobject
Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges. This is in ic50 space.
- motif_summarybool
If True, the length distribution and per-position amino acid frequencies are also calculated for the top x fraction of tightest-binding peptides, where each value of x is given in the summary_top_peptide_fractions list.
- summary_top_peptide_fractionslist of float
Only used if motif_summary is True
- verboseboolean
Whether to print status updates to stdout
- model_kwargsdict
Additional low-level Class1NeuralNetwork.predict() kwargs.
- Returns
- dict of string -> pandas.DataFrame
If motif_summary is True, this will have keys “frequency_matrices” and “length_distributions”. Otherwise it will be empty.
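The idea behind calibration is to record the distribution of predicted ic50 values over many peptides and then report where a new prediction falls in it. The sketch below uses an exact empirical distribution over a sorted list, which is a simplification: the method described above works from a histogram over bins, and the helper names and calibration values here are hypothetical.

```python
import bisect


def calibrate(ic50_values):
    """Store the calibration predictions in sorted order."""
    return sorted(ic50_values)


def percentile_rank(calibration, ic50):
    """Percent of calibration predictions that are at or below the given ic50."""
    return 100.0 * bisect.bisect_right(calibration, ic50) / len(calibration)


# Hypothetical predictions for a handful of random peptides (nM).
calibration = calibrate([20000.0, 100.0, 5000.0, 500.0, 1000.0])
rank = percentile_rank(calibration, 500.0)
```

Lower percentile ranks correspond to tighter predicted binding relative to the calibration peptides.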
-
model_select
(score_function, alleles=None, min_models=1, max_models=10000)[source]¶ Perform model selection using a user-specified scoring function.
This works only with allele-specific models, not pan-allele models.
Model selection is done using a “step up” variable selection procedure, in which models are repeatedly added to an ensemble until the score stops improving.
- Parameters
- score_functionClass1AffinityPredictor -> float function
Scoring function
- alleleslist of string, optional
If not specified, model selection is performed for all alleles.
- min_modelsint, optional
Min models to select per allele
- max_modelsint, optional
Max models to select per allele
- Returns
- Class1AffinityPredictor
Predictor containing the selected models
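The “step up” procedure can be sketched as a greedy forward selection: at each step, add the candidate that gives the best ensemble score, and stop once the score no longer improves. The version below is an illustration under simplifying assumptions, not the method's implementation: models are stand-in numbers, and score_function receives a list of them rather than a Class1AffinityPredictor.

```python
def step_up_select(candidates, score_function, min_models=1, max_models=10000):
    """Greedily add models while the ensemble score keeps improving."""
    selected = []
    remaining = list(candidates)
    current_score = float("-inf")
    while remaining and len(selected) < max_models:
        # Score every possible one-model extension of the current ensemble.
        scored = [
            (score_function(selected + [model]), index)
            for index, model in enumerate(remaining)
        ]
        score, index = max(scored)
        # Stop when no candidate improves the score, unless below min_models.
        if score <= current_score and len(selected) >= min_models:
            break
        selected.append(remaining.pop(index))
        current_score = score
    return selected


# Toy setup: each "model" is a single prediction and the ensemble predicts
# the mean. The score is the negative error against a known target of 10.
def toy_score(models):
    return -abs(sum(models) / len(models) - 10.0)


chosen = step_up_select([8.0, 12.0, 20.0, 9.0], toy_score)
```

In this toy run the outlier 20.0 is never added, because including it would lower the ensemble score.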
mhcflurry.class1_neural_network module¶
-
class
mhcflurry.class1_neural_network.
Class1NeuralNetwork
(**hyperparameters)[source]¶ Bases:
object
Low level class I predictor consisting of a single neural network.
Both single allele and pan-allele prediction are supported.
Users will generally use Class1AffinityPredictor, which gives a higher-level interface and supports ensembles.
-
network_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters (and their default values) that affect the neural network architecture.
-
compile_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Loss and optimizer hyperparameters.
-
fit_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters for neural network training.
-
early_stopping_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters for early stopping.
-
miscelaneous_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Miscellaneous hyperparameters. These parameters are not used by this class but may be interpreted by other code.
-
hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Combined set of all supported hyperparameters and their default values.
-
hyperparameter_renames
= {'embedding_init_method': None, 'embedding_input_dim': None, 'embedding_output_dim': None, 'kmer_size': None, 'left_edge': None, 'min_delta': None, 'mode': None, 'monitor': None, 'peptide_amino_acid_encoding': None, 'pseudosequence_use_embedding': None, 'right_edge': None, 'take_best_epoch': None, 'use_embedding': None, 'verbose': None}¶
-
classmethod
apply_hyperparameter_renames
(hyperparameters)[source]¶ Handle hyperparameter renames.
- Parameters
- hyperparametersdict
- Returns
- dictupdated hyperparameters
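As a rough sketch of how a rename table like hyperparameter_renames could be applied: the helper below assumes that a target of None means the old hyperparameter is simply dropped. That reading is an assumption based on the all-None table above; `apply_renames` is a hypothetical stand-in, not the library method.

```python
def apply_renames(hyperparameters, renames):
    """Return a copy of hyperparameters with renames applied.

    renames maps old name -> new name. A new name of None drops the entry
    (assumed semantics, for illustration only).
    """
    updated = dict(hyperparameters)
    for old_name, new_name in renames.items():
        if old_name in updated:
            value = updated.pop(old_name)
            if new_name is not None:
                updated[new_name] = value
    return updated


renames = {"use_embedding": None, "verbose": None}
result = apply_renames({"use_embedding": False, "layer_sizes": [32]}, renames)
```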
-
KERAS_MODELS_CACHE
= {}¶ Process-wide keras model cache, a map from: architecture JSON string to (Keras model, existing network weights)
-
classmethod
borrow_cached_network
(network_json, network_weights)[source]¶ Return a keras Model with the specified architecture and weights. As an optimization, when possible this will reuse architectures from a process-wide cache.
The returned object is “borrowed” in the sense that its weights can change later after subsequent calls to this method from other objects.
If you’re using this from a parallel implementation you’ll need to hold a lock while using the returned object.
- Parameters
- network_jsonstring of JSON
- network_weightslist of numpy.array
- Returns
- keras.models.Model
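The “borrowed” semantics can be illustrated without Keras. In this sketch, `FakeModel`, `MODEL_CACHE`, `MODEL_CACHE_LOCK`, and `borrow_model` are all hypothetical stand-ins. It shows why a borrowed object's weights can change under you, and why parallel callers need a lock.

```python
import threading


class FakeModel:
    """Stand-in for a Keras model: an architecture plus mutable weights."""

    def __init__(self, architecture_json):
        self.architecture_json = architecture_json
        self.weights = None

    def set_weights(self, weights):
        self.weights = list(weights)


MODEL_CACHE = {}  # architecture JSON string -> FakeModel
MODEL_CACHE_LOCK = threading.Lock()


def borrow_model(architecture_json, weights):
    """Return a cached model for this architecture with the weights loaded."""
    model = MODEL_CACHE.get(architecture_json)
    if model is None:
        model = FakeModel(architecture_json)
        MODEL_CACHE[architecture_json] = model
    model.set_weights(weights)
    return model


# Hold the lock for as long as the borrowed object is in use.
with MODEL_CACHE_LOCK:
    first = borrow_model('{"layers": 2}', [0.1, 0.2])
    first_weights = list(first.weights)

with MODEL_CACHE_LOCK:
    second = borrow_model('{"layers": 2}', [0.9, 0.8])
```

After the second call, `first` and `second` are the same object, so `first.weights` now holds the second caller's weights.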
-
network
(borrow=False)[source]¶ Return the keras model associated with this predictor.
- Parameters
- borrowbool
Whether to return a cached model if possible. See borrow_cached_network for details
- Returns
- keras.models.Model
-
update_network_description
()[source]¶ Update self.network_json and self.network_weights properties based on this instance’s neural network.
-
static
keras_network_cache_key
(network_json)[source]¶ Given a Keras JSON description of a neural network, return a key that uniquely defines this network. Networks that share the same key should have compatible weights matrices and give the same prediction outputs when their weights are the same.
- Parameters
- network_jsonstring
- Returns
- string
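One plausible way to build such a key is to canonicalize the JSON and discard details that do not affect weight compatibility. The sketch below assumes that layer "name" fields are the only such detail. That is an illustrative assumption, and `architecture_cache_key` is a hypothetical function rather than the library's.

```python
import json


def architecture_cache_key(network_json):
    """Return a canonical string for an architecture, ignoring "name" fields."""

    def strip_names(node):
        if isinstance(node, dict):
            return {k: strip_names(v) for k, v in node.items() if k != "name"}
        if isinstance(node, list):
            return [strip_names(item) for item in node]
        return node

    # sort_keys makes the key independent of the key order in the input JSON.
    return json.dumps(strip_names(json.loads(network_json)), sort_keys=True)


key_a = architecture_cache_key('{"layers": [{"name": "dense_1", "units": 8}]}')
key_b = architecture_cache_key('{"layers": [{"units": 8, "name": "dense_7"}]}')
key_c = architecture_cache_key('{"layers": [{"name": "dense_1", "units": 16}]}')
```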
-
classmethod
from_config
(config, weights=None, weights_loader=None)[source]¶ Deserialize from a dict returned by get_config().
- Parameters
- configdict
- weightslist of array, optional
Network weights to restore
- weights_loadercallable, optional
Function to call (no arguments) to load weights when needed
- Returns
- Class1NeuralNetwork
-
load_weights
()[source]¶ Load weights by evaluating self.network_weights_loader, if needed.
After calling this, self.network_weights_loader will be None and self.network_weights will be the weights list, if available.
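The lazy-loading behavior described here can be sketched as follows. `LazyWeights` is a hypothetical class. Only the attribute names network_weights and network_weights_loader are taken from the text above.

```python
class LazyWeights:
    """Holds weights directly or a no-argument loader that produces them."""

    def __init__(self, weights=None, weights_loader=None):
        self.network_weights = weights
        self.network_weights_loader = weights_loader

    def load_weights(self):
        """Evaluate the loader once, then discard it."""
        if self.network_weights_loader is not None:
            self.network_weights = self.network_weights_loader()
            self.network_weights_loader = None

    def get_weights(self):
        self.load_weights()
        return self.network_weights


calls = []


def loader():
    calls.append(1)  # record each invocation so the laziness is observable
    return [[0.5, -0.5]]


holder = LazyWeights(weights_loader=loader)
```

Nothing is loaded at construction time. The loader runs on the first `get_weights()` call and never again.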
-
get_weights
()[source]¶ Get the network weights
- Returns
- list of numpy.array giving weights for each layer, or None if there is no network
-
peptides_to_network_input
(peptides)[source]¶ Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).
- Parameters
- peptidesEncodableSequences or list of string
- Returns
- numpy.array
-
property
supported_peptide_lengths
¶ (minimum, maximum) lengths of peptides supported, inclusive.
- Returns
- (int, int) tuple
-
allele_encoding_to_network_input
(allele_encoding)[source]¶ Encode alleles to the fixed-length encoding expected by the neural network (which depends on the architecture).
- Parameters
- allele_encodingAlleleEncoding
- Returns
- (numpy.array, numpy.array)
- Indices and allele representations.
-
static
data_dependent_weights_initialization
(network, x_dict=None, method='lsuv', verbose=1)[source]¶ Data dependent weights initialization.
- Parameters
- networkkeras.Model
- x_dictdict of string -> numpy.ndarray
Training data as would be passed keras.Model.fit().
- methodstring
Initialization method. Currently only “lsuv” is supported.
- verboseint
Status updates printed to stdout if verbose > 0
-
fit_generator
(generator, validation_peptide_encoding, validation_affinities, validation_allele_encoding=None, validation_inequalities=None, validation_output_indices=None, steps_per_epoch=10, epochs=1000, min_epochs=0, patience=10, min_delta=0.0, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]¶ Fit using a generator. Does not support many of the features of fit(), such as random negative peptides.
Fitting proceeds until early stopping is hit, using the peptides, affinities, etc. given by the parameters starting with “validation_”.
This is used for pre-training pan-allele models using data synthesized by the allele-specific models.
- Parameters
- generatorgenerator yielding (alleles, peptides, affinities) tuples
where alleles and peptides are lists of strings, and affinities is list of floats.
- validation_peptide_encodingEncodableSequences
- validation_affinitieslist of float
- validation_allele_encodingAlleleEncoding
- validation_inequalitieslist of string
- validation_output_indiceslist of int
- steps_per_epochint
- epochsint
- min_epochsint
- patienceint
- min_deltafloat
- verboseint
- progress_callbackthunk
- progress_preamblestring
- progress_print_intervalfloat
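The interaction of patience, min_delta, and min_epochs is not spelled out above. One common reading of these early-stopping parameters is sketched below. The exact rule MHCflurry uses may differ, and `epochs_until_stop` is a hypothetical helper.

```python
def epochs_until_stop(val_losses, patience=10, min_delta=0.0, min_epochs=0):
    """Return the number of epochs run before early stopping triggers.

    val_losses is the validation loss observed after each epoch. An epoch
    counts as an improvement only if it beats the best loss by more than
    min_delta (assumed semantics).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        # Stop when more than `patience` epochs pass without improvement,
        # but never before min_epochs.
        elif epoch >= min_epochs and epoch - best_epoch > patience:
            return epoch + 1
    return len(val_losses)


losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7]
stopped_after = epochs_until_stop(losses, patience=2)
```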
-
fit
(peptides, affinities, allele_encoding=None, inequalities=None, output_indices=None, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]¶ Fit the neural network.
- Parameters
- peptidesEncodableSequences or list of string
- affinitieslist of float
nM affinities. Must be the same length as peptides.
- allele_encodingAlleleEncoding
If not specified, the model will be a single-allele predictor.
- inequalitieslist of string, each element one of “>”, “<”, or “=”.
Inequalities to use for fitting. Same length as affinities. For example, a “>” will train on y_pred > y_true for that element in the training set. Requires a custom loss that supports inequalities (e.g. mse_with_inequalities). If None, all inequalities are taken to be “=”.
- output_indiceslist of int
For multi-output models only. Same length as affinities. Indicates the index of the output (starting from 0) for each training example.
- sample_weightslist of float
If not specified, all samples (including random negatives added during training) will have equal weight. If specified, the random negatives will be assigned weight=1.0.
- shuffle_permutationlist of int
Permutation (integer list) of the same length as peptides and affinities. If None, then a random permutation will be generated.
- verboseint
Keras verbosity level
- progress_callbackfunction
No-argument function to call after each epoch.
- progress_preamblestring
Optional string of information to include in each progress update
- progress_print_intervalfloat
How often (in seconds) to print progress update. Set to None to disable.
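To make the inequalities argument concrete, here is a small pure-Python sketch of a squared-error loss that penalizes only one side for “>” and “<” targets, following the y_pred > y_true description above. It is an illustration in nM space, not MHCflurry's loss implementation (which may operate on transformed values). `inequality_mse` is a hypothetical name.

```python
def inequality_mse(y_true, y_pred, inequalities=None):
    """Mean squared error where ">" and "<" targets penalize one side only."""
    if inequalities is None:
        inequalities = ["="] * len(y_true)
    total = 0.0
    for true, pred, op in zip(y_true, y_pred, inequalities):
        if op not in (">", "<", "="):
            raise ValueError("unknown inequality: %r" % (op,))
        diff = pred - true
        if op == ">" and diff > 0:
            diff = 0.0  # prediction already above the bound: no penalty
        elif op == "<" and diff < 0:
            diff = 0.0  # prediction already below the bound: no penalty
        total += diff * diff
    return total / len(y_true)


# Same measurement and prediction under each of the three inequality types.
loss = inequality_mse(
    [100.0, 100.0, 100.0], [150.0, 150.0, 150.0], ["=", ">", "<"]
)
```

Only the “=” and “<” elements contribute here, since a prediction of 150 already satisfies “> 100”.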
-
predict
(peptides, allele_encoding=None, batch_size=4096, output_index=0)[source]¶ Predict affinities.
If peptides are specified as EncodableSequences, then the predictions will be cached for this predictor as long as the EncodableSequences object remains in memory. The cache is keyed on the object identity of the EncodableSequences, not the sequences themselves. The cache is used only for allele-specific models (i.e. when allele_encoding is None).
- Parameters
- peptidesEncodableSequences or list of string
- allele_encodingAlleleEncoding, optional
Only required when this model is a pan-allele model
- batch_sizeint
batch_size passed to Keras
- output_indexint or None
For multi-output models. Gives the output index to return. If set to None, then all outputs are returned as a samples x outputs matrix.
- Returns
- numpy.array of nM affinity predictions
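A cache keyed on object identity that lives only as long as the key object does can be sketched with weakref.WeakKeyDictionary. Whether MHCflurry uses this exact mechanism is an assumption. `Sequences` and `CachingPredictor` are hypothetical stand-ins, and the “model” is a placeholder.

```python
import weakref


class Sequences:
    """Stand-in for EncodableSequences; hashes by identity (the default)."""

    def __init__(self, sequences):
        self.sequences = list(sequences)


class CachingPredictor:
    def __init__(self):
        # Entries disappear automatically once the key object is collected.
        self.cache = weakref.WeakKeyDictionary()
        self.model_calls = 0

    def predict(self, peptides):
        if peptides in self.cache:
            return self.cache[peptides]
        self.model_calls += 1
        # Placeholder "model": the prediction is just the peptide length.
        result = [float(len(s)) for s in peptides.sequences]
        self.cache[peptides] = result
        return result
```

Two `Sequences` objects with identical contents are different cache keys, so reusing the same object across calls is what makes the cache effective.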
-
classmethod
merge
(models, merge_method='average')[source]¶ Merge multiple models at the tensorflow (or other backend) level.
Only certain neural network architectures support merging. Others will result in a NotImplementedError.
- Parameters
- modelslist of Class1NeuralNetwork
instances to merge
- merge_methodstring, one of “average”, “sum”, or “concatenate”
How to merge the predictions of the different models
- Returns
- Class1NeuralNetwork
The merged neural network
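The three merge methods differ only in how per-model outputs are combined. The sketch below shows that arithmetic on plain lists. It does not merge networks at the backend level as the real method does, and `merge_outputs` is a hypothetical helper.

```python
def merge_outputs(per_model_outputs, merge_method="average"):
    """Combine per-model prediction lists (one list per model, same length)."""
    columns = list(zip(*per_model_outputs))  # one tuple of outputs per sample
    if merge_method == "average":
        return [sum(col) / len(col) for col in columns]
    if merge_method == "sum":
        return [sum(col) for col in columns]
    if merge_method == "concatenate":
        return [list(col) for col in columns]
    raise NotImplementedError("Unsupported merge method: %s" % merge_method)


outputs = [[1.0, 2.0], [3.0, 6.0]]  # two models, two samples each
averaged = merge_outputs(outputs, "average")
summed = merge_outputs(outputs, "sum")
concatenated = merge_outputs(outputs, "concatenate")
```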
-
make_network
(peptide_encoding, allele_amino_acid_encoding, allele_dense_layer_sizes, peptide_dense_layer_sizes, peptide_allele_merge_method, peptide_allele_merge_activation, layer_sizes, dense_layer_l1_regularization, dense_layer_l2_regularization, activation, init, output_activation, dropout_probability, batch_normalization, locally_connected_layers, topology, num_outputs=1, allele_representations=None)[source]¶ Helper function to make a keras network for class 1 affinity prediction.
-
clear_allele_representations
()[source]¶ Set allele representations to an empty array. Useful before saving to save a smaller version of the model.
-
set_allele_representations
(allele_representations, force_surgery=False)[source]¶ Set the allele representations in use by this model. This means mutating the weights for the allele input embedding layer.
Rationale: instead of passing in the allele sequence for each data point during model training or prediction (which is expensive in terms of memory usage), we pass in an allele index between 0 and n-1 where n is the number of alleles in some universe of possible alleles. This index is used in the model to lookup the corresponding allele sequence. This function sets the lookup table.
See also: AlleleEncoding.allele_representations()
- Parameters
- allele_representationsnumpy.ndarray of shape (a, l, m)
where a is the total number of alleles, l is the allele sequence length, and m is the length of the vectors used to represent amino acids
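The lookup-table idea in the rationale can be sketched with nested lists standing in for the (a, l, m) array. `AlleleLookup` is a hypothetical class used only to show that each data point carries a small integer index rather than a full encoded sequence.

```python
class AlleleLookup:
    """Maps allele indices to encoded allele sequences (l x m nested lists)."""

    def __init__(self):
        self.table = []

    def set_allele_representations(self, allele_representations):
        # Copy so later changes to the caller's data do not alter the table.
        self.table = [[list(row) for row in rep] for rep in allele_representations]

    def lookup(self, allele_indices):
        return [self.table[i] for i in allele_indices]


# a=2 alleles, l=2 positions, m=3 values per amino acid (toy sizes).
representations = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 0, 1], [1, 0, 0]],
]
lookup = AlleleLookup()
lookup.set_allele_representations(representations)
batch = lookup.lookup([1, 0, 1])  # three data points, one index each
```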
-
mhcflurry.class1_presentation_predictor module¶
-
class
mhcflurry.class1_presentation_predictor.
Class1PresentationPredictor
(affinity_predictor=None, processing_predictor_with_flanks=None, processing_predictor_without_flanks=None, weights_dataframe=None, metadata_dataframes=None, percent_rank_transform=None, provenance_string=None)[source]¶ Bases:
object
A logistic regression model over predicted binding affinity (BA) and antigen processing (AP) score.
Instances of this class delegate to Class1AffinityPredictor and Class1ProcessingPredictor instances to generate BA and AP predictions. These predictions are combined using a logistic regression model to give a “presentation score” prediction.
Most users will call the load class method to get an instance of this class, then call the predict method to generate predictions.
-
model_inputs
= ['affinity_score', 'processing_score']¶
-
property
supported_alleles
¶ List of alleles supported by the underlying Class1AffinityPredictor
-
property
supported_peptide_lengths
¶ (min, max) of supported peptide lengths, inclusive.
-
property
supports_affinity_prediction
¶ Is there an affinity predictor associated with this instance?
-
property
supports_processing_prediction
¶ Is there a processing predictor associated with this instance?
-
property
supports_presentation_prediction
¶ Can this instance predict presentation?
-
predict_affinity
(peptides, alleles, sample_names=None, include_affinity_percentile=True, verbose=1, throw=True)[source]¶ Predict binding affinities across samples (each corresponding to up to six MHC I alleles).
Two modes are supported: each peptide can be evaluated for binding to any of the alleles in any sample (this is what happens when sample_names is None), or the i’th peptide can be evaluated for binding the alleles of the sample given by the i’th entry in sample_names.
For example, if we don’t specify sample_names, then predictions are taken for all combinations of samples and peptides, for a result size of num peptides * num samples:
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_affinity(
...     peptides=["SIINFEKL", "PEPTIDE"],
...     alleles={
...         "sample1": ["A0201", "A0301", "B0702"],
...         "sample2": ["A0101", "C0202"],
...     },
...     verbose=0)
    peptide  peptide_num  sample_name   affinity  best_allele  affinity_percentile
0  SIINFEKL            0      sample1  11927.161        A0201                6.296
1   PEPTIDE            1      sample1  32507.083        A0201               71.249
2  SIINFEKL            0      sample2   2725.593        C0202                6.662
3   PEPTIDE            1      sample2  28304.330        C0202               54.652
In contrast, here we specify sample_names, so each peptide is evaluated for binding the alleles in the corresponding sample, for a result size equal to the number of peptides:
>>> predictor.predict_affinity(
...     peptides=["SIINFEKL", "PEPTIDE"],
...     alleles={
...         "sample1": ["A0201", "A0301", "B0702"],
...         "sample2": ["A0101", "C0202"],
...     },
...     sample_names=["sample2", "sample1"],
...     verbose=0)
    peptide  peptide_num  sample_name   affinity  best_allele  affinity_percentile
0  SIINFEKL            0      sample2   2725.592        C0202                6.662
1   PEPTIDE            1      sample1  32507.079        A0201               71.249
- Parameters
- peptideslist of string
Peptide sequences
- allelesdict of string -> list of string
Keys are sample names, values are the alleles (genotype) for that sample
- sample_nameslist of string [same length as peptides]
Sample names corresponding to each peptide. If None, then predictions are generated for all sample genotypes across all peptides.
- include_affinity_percentilebool
Whether to include affinity percentile ranks
- verboseint
Set to 0 for quiet.
- throwbool
Whether to throw exception (vs. just log a warning) on invalid peptides, etc.
- Returns
- pandas.DataFramepredictions
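The two modes differ only in which (peptide, sample) pairs are evaluated. The sketch below enumerates those pairs without running any predictor. `peptide_sample_pairs` is a hypothetical helper, and the row order simply mirrors the example output above.

```python
def peptide_sample_pairs(peptides, alleles, sample_names=None):
    """Return (peptide_num, peptide, sample_name) rows to be predicted."""
    if sample_names is None:
        # Mode 1: every peptide against every sample.
        return [
            (i, peptide, sample)
            for sample in alleles
            for i, peptide in enumerate(peptides)
        ]
    if len(sample_names) != len(peptides):
        raise ValueError("sample_names must be the same length as peptides")
    # Mode 2: the i'th peptide against the i'th sample only.
    return [(i, p, s) for i, (p, s) in enumerate(zip(peptides, sample_names))]


peptides = ["SIINFEKL", "PEPTIDE"]
alleles = {
    "sample1": ["A0201", "A0301", "B0702"],
    "sample2": ["A0101", "C0202"],
}
all_pairs = peptide_sample_pairs(peptides, alleles)
paired = peptide_sample_pairs(peptides, alleles, ["sample2", "sample1"])
```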
-
predict_processing
(peptides, n_flanks=None, c_flanks=None, throw=True, verbose=1)[source]¶ Predict antigen processing scores for individual peptides, optionally including flanking sequences for better cleavage prediction.
- Parameters
- peptideslist of string
- n_flankslist of string [same length as peptides]
- c_flankslist of string [same length as peptides]
- throwboolean
Whether to raise exception on unsupported peptides
- verboseint
- Returns
- numpy.arrayAntigen processing scores for each peptide
-
fit
(targets, peptides, sample_names, alleles, n_flanks=None, c_flanks=None, verbose=1)[source]¶ Fit the presentation score logistic regression model.
- Parameters
- targetslist of int/float
1 indicates hit, 0 indicates decoy
- peptideslist of string [same length as targets]
- sample_nameslist of string [same length as targets]
- allelesdict of string -> list of string
Keys are sample names, values are the alleles for that sample
- n_flankslist of string [same length as targets]
- c_flankslist of string [same length as targets]
- verboseint
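For intuition, a logistic regression over the two model_inputs combines them as shown below. The coefficients are made up for illustration and are not MHCflurry's fitted values. How the nM affinity is rescaled into affinity_score is not described here, so the inputs are simply assumed to lie between 0 and 1.

```python
import math


def presentation_score(affinity_score, processing_score, weights, intercept):
    """Logistic regression over the two inputs; returns a value in (0, 1)."""
    z = (
        intercept
        + weights["affinity_score"] * affinity_score
        + weights["processing_score"] * processing_score
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
weights = {"affinity_score": 4.0, "processing_score": 2.0}
intercept = -5.0
strong = presentation_score(0.9, 0.8, weights, intercept)
weak = presentation_score(0.1, 0.1, weights, intercept)
```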
-
get_model
(name=None)[source]¶ Load or instantiate a new logistic regression model. Private helper method.
- Parameters
- namestring
If None (the default), an un-fit LR model is returned. Otherwise the weights are loaded for the specified model.
- Returns
- sklearn.linear_model.LogisticRegression
-
predict
(peptides, alleles, sample_names=None, n_flanks=None, c_flanks=None, include_affinity_percentile=False, verbose=1, throw=True)[source]¶ Predict presentation scores across a set of peptides.
Presentation scores combine predictions for MHC I binding affinity and antigen processing.
This method returns a pandas.DataFrame giving presentation scores plus the binding affinity and processing predictions and other intermediate results.
Example:
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict(
...     peptides=["SIINFEKL", "PEPTIDE"],
...     n_flanks=["NNN", "SNS"],
...     c_flanks=["CCC", "CNC"],
...     alleles={
...         "sample1": ["A0201", "A0301", "B0702"],
...         "sample2": ["A0101", "C0202"],
...     },
...     verbose=0)
    peptide  n_flank  c_flank  peptide_num  sample_name   affinity  best_allele  processing_score  presentation_score  presentation_percentile
0  SIINFEKL      NNN      CCC            0      sample1  11927.161        A0201             0.838               0.145                    2.282
1   PEPTIDE      SNS      CNC            1      sample1  32507.083        A0201             0.025               0.003                  100.000
2  SIINFEKL      NNN      CCC            0      sample2   2725.593        C0202             0.838               0.416                    1.017
3   PEPTIDE      SNS      CNC            1      sample2  28304.330        C0202             0.025               0.003                   99.287
You can also specify sample_names, in which case each peptide is evaluated for binding the alleles in the corresponding sample only. See predict_affinity for an example.
- Parameters
- peptideslist of string
Peptide sequences
- alleleslist of string or dict of string -> list of string
If you are predicting for a single sample, pass a list of strings (up to 6) indicating the genotype. If you are predicting across multiple samples, pass a dict where the keys are (arbitrary) sample names and the values are the alleles to predict for that sample. Set to an empty list or dict to perform processing prediction only.
- sample_nameslist of string [same length as peptides]
If you are passing a dict for ‘alleles’, you can use this argument to specify which peptides go with which samples. If it is None, then predictions will be performed for each peptide across all samples.
- n_flankslist of string [same length as peptides]
Upstream sequences before the peptide. Sequences of any length can be given and a suffix of the size supported by the model will be used.
- c_flankslist of string [same length as peptides]
Downstream sequences after the peptide. Sequences of any length can be given and a prefix of the size supported by the model will be used.
- include_affinity_percentilebool
Whether to include affinity percentile ranks
- verboseint
Set to 0 for quiet.
- throwbool
Whether to throw exception (vs. just log a warning) on invalid peptides, etc.
- Returns
- pandas.DataFrame
- Presentation scores and intermediate results.
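The flank handling described for n_flanks and c_flanks keeps the residues adjacent to the peptide: a suffix of the upstream flank and a prefix of the downstream flank. A minimal sketch, where `trim_flanks` and the flank lengths are hypothetical:

```python
def trim_flanks(n_flank, c_flank, n_flank_length, c_flank_length):
    """Keep only the residues adjacent to the peptide on each side."""
    # Upstream flank: keep the suffix (residues closest to the peptide).
    # The explicit check avoids n_flank[-0:], which would keep everything.
    trimmed_n = n_flank[-n_flank_length:] if n_flank_length > 0 else ""
    # Downstream flank: keep the prefix.
    trimmed_c = c_flank[:c_flank_length]
    return trimmed_n, trimmed_c


trimmed = trim_flanks("MDSKGSSQ", "LLLVVSNLL", 5, 5)
```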
-
predict_sequences
(sequences, alleles, result='best', comparison_quantity=None, filter_value=None, peptide_lengths=(8, 9, 10, 11), use_flanks=True, include_affinity_percentile=True, verbose=1, throw=True)[source]¶ Predict presentation across protein sequences.
Example:
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.predict_sequences(
...     sequences={
...         'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...         'protein2': "SSLPTPEDKEQAQQTHH",
...     },
...     alleles={
...         "sample1": ["A0201", "A0301", "B0702"],
...         "sample2": ["A0101", "C0202"],
...     },
...     result="filtered",
...     comparison_quantity="affinity",
...     filter_value=500,
...     verbose=0)
   sequence_name  pos     peptide  n_flank  c_flank  sample_name  affinity  best_allele  affinity_percentile  processing_score  presentation_score  presentation_percentile
0       protein1   14   LLLVVSNLL    GSRLL               sample1    57.180        A0201                0.398             0.233               0.754                    0.351
1       protein1   13   LLLLVVSNL    KGSRL        L      sample1    57.339        A0201                0.398             0.031               0.586                    0.643
2       protein1    5   SSQKGSRLL    MDSKG    LLLVV      sample2   110.779        C0202                0.782             0.061               0.456                    0.920
3       protein1    6   SQKGSRLLL    DSKGS    LLVVS      sample2   254.480        C0202                1.735             0.102               0.303                    1.356
4       protein1   13  LLLLVVSNLL    KGSRL               sample1   260.390        A0201                1.012             0.158               0.345                    1.215
5       protein1   12  LLLLLVVSNL    QKGSR        L      sample1   308.150        A0201                1.094             0.015               0.206                    1.802
6       protein2    0   SSLPTPEDK             EQAQQ      sample2   410.354        C0202                2.398             0.003               0.158                    2.155
7       protein1    5    SSQKGSRL    MDSKG    LLLLV      sample2   444.321        C0202                2.512             0.026               0.159                    2.138
8       protein2    0   SSLPTPEDK             EQAQQ      sample1   459.296        A0301                0.971             0.003               0.144                    2.292
9       protein1    4   GSSQKGSRL     MDSK    LLLLV      sample2   469.052        C0202                2.595             0.014               0.146                    2.261
- Parameters
- sequencesstr, list of string, or string -> string dict
Protein sequences. If a dict is given, the keys are arbitrary (e.g. protein names), and the values are the amino acid sequences.
- allelesstring, list of string, list of list of string, or dict of string -> list of string
MHC I alleles. Can be: (1) a string (a single allele), (2) a list of strings (a single genotype), (3) a list of list of strings (multiple genotypes, where the total number of genotypes must equal the number of sequences), or (4) a dict giving multiple genotypes, which will each be run over the sequences.
- resultstring
Specify ‘best’ to return the strongest peptide for each sequence, ‘all’ to return predictions for all peptides, or ‘filtered’ to return predictions where the comparison_quantity is stronger (i.e. < for affinity, > for scores) than filter_value.
- comparison_quantitystring
One of “presentation_score”, “processing_score”, “affinity”, or “affinity_percentile”. Prediction to use to rank (if result is “best”) or filter (if result is “filtered”) results. Default is “presentation_score”.
- filter_valuefloat
Threshold value to use, only relevant when result is “filtered”. If comparison_quantity is “affinity”, then all results less than (i.e. tighter than) the specified nM affinity are retained. If it’s “presentation_score” or “processing_score” then results greater than the indicated filter_value are retained.
- peptide_lengthslist of int
Peptide lengths to predict for.
- use_flanksbool
Whether to include flanking sequences when running the AP predictor (for better cleavage prediction).
- include_affinity_percentilebool
Whether to include affinity percentile ranks in output.
- verboseint
Set to 0 for quiet mode.
- throwboolean
Whether to throw exceptions (vs. log warnings) on invalid inputs.
- Returns
- pandas.DataFrame with columns:
peptide, n_flank, c_flank, sequence_name, affinity, best_allele, processing_score, presentation_score
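The pos, peptide, n_flank, and c_flank columns in the example output are consistent with sliding a window of each peptide length over the protein and taking up to five flanking residues on each side. The sketch below reproduces that enumeration. The flank length of 5 is inferred from the example rather than stated in the text, and `enumerate_peptides` is a hypothetical helper.

```python
def enumerate_peptides(sequence, peptide_lengths=(8, 9, 10, 11), flank_length=5):
    """Yield (pos, peptide, n_flank, c_flank) for every window in a sequence."""
    for length in peptide_lengths:
        for pos in range(len(sequence) - length + 1):
            end = pos + length
            yield (
                pos,
                sequence[pos:end],
                # Flanks are shorter (or empty) near the ends of the protein.
                sequence[max(0, pos - flank_length):pos],
                sequence[end:end + flank_length],
            )


rows = list(enumerate_peptides("MDSKGSSQKGSRLLLLLVVSNLL"))
```

Each window would then be scored, and the result and filter_value arguments decide which rows are returned.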
-
save
(models_dir, write_affinity_predictor=True, write_processing_predictor=True, write_weights=True, write_percent_ranks=True, write_info=True, write_metdata=True)[source]¶ Save the predictor to a directory on disk. If the directory does not exist it will be created.
The wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances are included in the saved data.
- Parameters
- models_dirstring
Path to directory. It will be created if it doesn’t exist.
-
classmethod
load
(models_dir=None, max_models=None)[source]¶ Deserialize a predictor from a directory on disk.
This will also load the wrapped Class1AffinityPredictor and Class1ProcessingPredictor instances.
- Parameters
- models_dirstring
Path to directory. If unspecified the default downloaded models are used.
- max_modelsint, optional
Maximum number of affinity and processing (counted separately) models to load
- Returns
- Class1PresentationPredictor instance
-
percentile_ranks
(presentation_scores, throw=True)[source]¶ Return percentile ranks for the given presentation scores.
- Parameters
- presentation_scoressequence of float
- Returns
- numpy.array of float
-
calibrate_percentile_ranks
(scores, bins=None)[source]¶ Compute the cumulative distribution of scores, to enable taking quantiles of this distribution later.
- Parameters
- scoressequence of float
Presentation prediction scores
- binsobject
Anything that can be passed to numpy.histogram’s “bins” argument can be used here, i.e. either an integer or a sequence giving bin edges.
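Calibration and lookup can be sketched in pure Python. The sketch stores the cumulative fraction of calibration scores at each bin edge, then reads a percentile rank from it. The direction (higher score gives a lower percentile rank) is inferred from the example output of predict. MHCflurry's actual transform may bin and interpolate differently, and both function names are hypothetical.

```python
import bisect


def calibrate(scores, bin_edges):
    """Return the cumulative fraction of scores at or below each bin edge."""
    ordered = sorted(scores)
    return [bisect.bisect_right(ordered, edge) / len(ordered) for edge in bin_edges]


def percentile_rank(score, bin_edges, cdf):
    """Percent of calibration scores above the bin edge at or below `score`."""
    # Index of the last edge <= score; scores below the first edge use CDF 0.
    index = bisect.bisect_right(bin_edges, score) - 1
    below = cdf[index] if index >= 0 else 0.0
    return 100.0 * (1.0 - below)


calibration_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
cdf = calibrate(calibration_scores, edges)
```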
-
mhcflurry.class1_processing_neural_network module¶
Antigen processing neural network implementation
-
class
mhcflurry.class1_processing_neural_network.
Class1ProcessingNeuralNetwork
(**hyperparameters)[source]¶ Bases:
object
A neural network for antigen processing prediction
-
network_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters (and their default values) that affect the neural network architecture.
-
fit_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters for neural network training.
-
early_stopping_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters for early stopping.
-
compile_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Loss and optimizer hyperparameters. Any values supported by keras may be used.
-
auxiliary_input_hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Allele feature hyperparameters.
-
hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶
-
property
sequence_lengths
¶ Supported maximum sequence lengths
- Returns
- dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum supported sequence length.
-
update_network_description
()[source]¶ Update self.network_json and self.network_weights properties based on this instance’s neural network.
-
fit
(sequences, targets, sample_weights=None, shuffle_permutation=None, verbose=1, progress_callback=None, progress_preamble='', progress_print_interval=5.0)[source]¶ Fit the neural network.
- Parameters
- sequencesFlankingEncoding
Peptides and upstream/downstream flanking sequences
- targetslist of float
1 indicates hit, 0 indicates decoy
- sample_weightslist of float
If not specified all samples have equal weight.
- shuffle_permutationlist of int
Permutation (integer list) of the same length as sequences and targets. If None, then a random permutation will be generated.
- verboseint
Keras verbosity level
- progress_callbackfunction
No-argument function to call after each epoch.
- progress_preamblestring
Optional string of information to include in each progress update
- progress_print_intervalfloat
How often (in seconds) to print progress update. Set to None to disable.
-
predict
(peptides, n_flanks=None, c_flanks=None, batch_size=4096)[source]¶ Predict antigen processing.
- Parameters
- peptideslist of string
Peptide sequences
- n_flankslist of string
Upstream sequence before each peptide
- c_flankslist of string
Downstream sequence after each peptide
- batch_sizeint
Prediction keras batch size.
- Returns
- numpy.array
Processing scores. Range is 0-1; higher indicates more favorable processing.
-
predict_encoded
(sequences, throw=True, batch_size=4096)[source]¶ Predict antigen processing.
- Parameters
- sequencesFlankingEncoding
Peptides and flanking sequences
- throwboolean
Whether to throw exception on unsupported peptides
- batch_sizeint
Prediction keras batch size.
- Returns
- numpy.array
-
network_input
(sequences, throw=True)[source]¶ Encode peptides to the fixed-length encoding expected by the neural network (which depends on the architecture).
- Parameters
- sequencesFlankingEncoding
Peptides and flanking sequences
- throwboolean
Whether to throw exception on unsupported peptides
- Returns
- numpy.array
-
make_network
(amino_acid_encoding, peptide_max_length, n_flank_length, c_flank_length, flanking_averages, convolutional_filters, convolutional_kernel_size, convolutional_activation, convolutional_kernel_l1_l2, dropout_rate, post_convolutional_dense_layer_sizes)[source]¶ Helper function to make a keras network given hyperparameters.
-
mhcflurry.class1_processing_predictor module¶
-
class
mhcflurry.class1_processing_predictor.
Class1ProcessingPredictor
(models, manifest_df=None, metadata_dataframes=None, provenance_string=None)[source]¶ Bases:
object
User-facing interface to antigen processing prediction.
Delegates to an ensemble of Class1ProcessingNeuralNetwork instances.
Instantiate a new Class1ProcessingPredictor.
Users will generally call load() to restore a saved predictor rather than using this constructor.
- Parameters
- modelslist of Class1ProcessingNeuralNetwork
Neural networks in the ensemble.
- manifest_dfpandas.DataFrame
Manifest dataframe. If not specified a new one will be created when needed.
- metadata_dataframesdict of string -> pandas.DataFrame
Arbitrary metadata associated with this predictor
- provenance_stringstring, optional
Optional info string to use in __str__.
-
property
sequence_lengths
¶ Supported maximum sequence lengths.
Passing a peptide greater than the maximum supported length results in an error.
Passing an N- or C-flank sequence greater than the maximum supported length results in some part of it being ignored.
- Returns
- dict of string -> int
Keys are “peptide”, “n_flank”, “c_flank”. Values give the maximum supported sequence length.
-
add_models
(models)[source]¶ Add models to the ensemble (in-place).
- Parameters
- modelslist of Class1ProcessingNeuralNetwork
- Returns
- list of string
- Names of the new models.
-
property
manifest_df
¶ A pandas.DataFrame describing the models included in this predictor.
- Returns
- pandas.DataFrame
-
static
weights_path
(models_dir, model_name)[source]¶ Generate the path to the weights file for a model
- Parameters
- models_dirstring
- model_namestring
- Returns
- string
-
predict
(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]¶ Predict antigen processing.
- Parameters
- peptideslist of string
Peptide sequences
- n_flankslist of string
Upstream sequence before each peptide
- c_flankslist of string
Downstream sequence after each peptide
- throwboolean
If True, a ValueError will be raised in the case of unsupported peptides. If False, a warning will be logged and the predictions for those peptides will be NaN.
- batch_sizeint
Prediction keras batch size.
- Returns
- numpy.array
Processing scores. Range is 0-1; higher indicates more favorable processing.
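The throw behavior described above (ValueError versus a logged warning with NaN predictions) follows a common pattern, sketched here. `predict_with_throw`, the length check, and the scoring callback are hypothetical. The real predictor may log its warning differently.

```python
import math
import warnings


def predict_with_throw(peptides, max_length, score_function, throw=True):
    """Score peptides; unsupported ones raise (throw=True) or yield NaN."""
    results = []
    for peptide in peptides:
        if len(peptide) > max_length:
            message = "Peptide %r exceeds max length %d" % (peptide, max_length)
            if throw:
                raise ValueError(message)
            warnings.warn(message)
            results.append(math.nan)
        else:
            results.append(score_function(peptide))
    return results


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the example output quiet
    scores = predict_with_throw(
        ["SIINFEKL", "A" * 20], 15, lambda peptide: 0.5, throw=False
    )
```

With throw=False the output stays aligned with the input, so callers can filter out NaN entries afterwards.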
-
predict_to_dataframe
(peptides, n_flanks=None, c_flanks=None, throw=True, batch_size=4096)[source]¶ Predict antigen processing.
See the predict method for parameter descriptions.
- Returns
- pandas.DataFrame
Processing predictions are in the “score” column. Also includes peptides and flanking sequences.
-
predict_to_dataframe_encoded
(sequences, throw=True, batch_size=4096)[source]¶ Predict antigen processing.
See the predict method for more information.
- Parameters
- sequencesFlankingEncoding
- batch_sizeint
- throwboolean
- Returns
- pandas.DataFrame
-
check_consistency
()[source]¶ Verify that self.manifest_df is consistent with instance variables.
Currently only checks for agreement on the total number of models.
Throws AssertionError if inconsistent.
-
save
(models_dir, model_names_to_write=None, write_metadata=True)[source]¶ Serialize the predictor to a directory on disk. If the directory does not exist it will be created.
The serialization format consists of a file called “manifest.csv” with the configurations of each Class1ProcessingNeuralNetwork, along with per-network files giving the model weights.
- Parameters
- models_dirstring
Path to directory. It will be created if it doesn’t exist.
-
classmethod
load
(models_dir=None, max_models=None)[source]¶ Deserialize a predictor from a directory on disk.
- Parameters
- models_dirstring
Path to directory. If unspecified the default downloaded models are used.
- max_modelsint, optional
Maximum number of models to load
- Returns
- Class1ProcessingPredictor instance
mhcflurry.cluster_parallelism module¶
Simple, relatively naive parallel map implementation for HPC clusters.
Used for training MHCflurry models.
-
mhcflurry.cluster_parallelism.
add_cluster_parallelism_args
(parser)[source]¶ Add commandline arguments controlling cluster parallelism to an argparse ArgumentParser.
- Parameters
- parserargparse.ArgumentParser
-
mhcflurry.cluster_parallelism.
cluster_results_from_args
(args, work_function, work_items, constant_data=None, input_serialization_method='pickle', result_serialization_method='pickle', clear_constant_data=False)[source]¶ Parallel map configurable using commandline arguments. See the cluster_results() function for docs.
The args parameter should be an argparse.Namespace from an argparse parser generated using the add_cluster_parallelism_args() function.
- Parameters
- args
- work_function
- work_items
- constant_data
- result_serialization_method
- clear_constant_data
- Returns
- generator
-
mhcflurry.cluster_parallelism.
cluster_results
(work_function, work_items, constant_data=None, submit_command='sh', results_workdir='./cluster-workdir', additional_complete_file=None, script_prefix_path=None, input_serialization_method='pickle', result_serialization_method='pickle', max_retries=3, clear_constant_data=False)[source]¶ Parallel map on an HPC cluster.
Returns [work_function(item) for item in work_items] where each invocation of work_function is performed as a separate HPC cluster job. Order is preserved.
Optionally, “constant data” can be specified, which will be passed to each work_function() invocation as a keyword argument called constant_data. This data is serialized once and all workers read it from the same source, which is more efficient than serializing it separately for each worker.
Each worker’s input is serialized to a shared NFS directory and the submit_command is used to launch a job to process that input. The shared filesystem is polled occasionally to watch for results, which are fed back to the user.
- Parameters
- work_functionA -> B
- work_itemslist of A
- constant_dataobject
- submit_commandstring
For running on LSF, we use “bsub” here.
- results_workdirstring
Path to NFS shared directory where inputs and results can be written
- script_prefix_pathstring
Path to script that will be invoked to run each worker. A line calling the _mhcflurry-cluster-worker-entry-point command will be appended to the contents of this file.
- result_serialization_methodstring, one of “pickle” or “save_predictor”
The “save_predictor” works only when the return type of work_function is Class1AffinityPredictor
- max_retriesint
How many times to attempt to re-launch a failed worker
- clear_constant_databool
If True, the constant data dict is cleared on the launching host after it is serialized to disk.
- Returns
- generator of B
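The map semantics described for cluster_results can be sketched without a cluster. In the minimal sketch below, local_results is a hypothetical stand-in (it is not part of mhcflurry) that runs each work item sequentially in the current process and yields results in input order. Passing constant_data as a keyword argument only when it is provided is an assumption made for the sketch.

```python
def local_results(work_function, work_items, constant_data=None):
    """Hypothetical sequential stand-in for a cluster parallel map."""
    for item in work_items:
        if constant_data is None:
            yield work_function(item)
        else:
            # The same shared object is handed to every invocation
            # under the keyword name "constant_data".
            yield work_function(item, constant_data=constant_data)


def scale(item, constant_data):
    # Example work function that reads the shared data.
    return item * constant_data["factor"]


results = list(local_results(scale, [1, 2, 3], constant_data={"factor": 10}))
```

A work function written against this stand-in takes one work item plus an optional constant_data keyword, which is the calling convention the text above describes.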
mhcflurry.common module¶
-
mhcflurry.common.
configure_tensorflow
(backend=None, gpu_device_nums=None, num_threads=None)[source]¶ Configure Keras backend to use GPU or CPU. Only tensorflow is supported.
- Parameters
- backendstring, optional
one of ‘tensorflow-default’, ‘tensorflow-cpu’, ‘tensorflow-gpu’
- gpu_device_numslist of int, optional
GPU devices to potentially use
- num_threadsint, optional
Tensorflow threads to use
-
mhcflurry.common.
configure_logging
(verbose=False)[source]¶ Configure logging module using defaults.
- Parameters
- verboseboolean
If true, output will be at level DEBUG, otherwise, INFO.
-
mhcflurry.common.
amino_acid_distribution
(peptides, smoothing=0.0)[source]¶ Compute the fraction of each amino acid across a collection of peptides.
- Parameters
- peptideslist of string
- smoothingfloat, optional
Small number (e.g. 0.01) to add to all amino acid fractions. The higher the number the more uniform the distribution.
- Returns
- pandas.Series indexed by amino acids
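To illustrate what the smoothing parameter does, here is a minimal pure-Python sketch. The function name amino_acid_fractions, the 20-letter alphabet, and the "add the constant, then renormalize" rule are assumptions made for illustration; the library function returns a pandas.Series and may differ in detail.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 canonical residues


def amino_acid_fractions(peptides, smoothing=0.0):
    """Fraction of each amino acid across all peptides (illustrative)."""
    counts = Counter("".join(peptides))
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("No canonical amino acids found")
    fractions = {aa: counts[aa] / total for aa in AMINO_ACIDS}
    if smoothing:
        # Assumption: add the constant to every fraction, then renormalize
        # so that the fractions sum to 1 again.
        norm = 1.0 + smoothing * len(AMINO_ACIDS)
        fractions = {aa: (f + smoothing) / norm for aa, f in fractions.items()}
    return fractions


unsmoothed = amino_acid_fractions(["AAC", "ACD"])
smoothed = amino_acid_fractions(["AAC", "ACD"], smoothing=0.01)
```

With smoothing, residues that never occur get a small non-zero fraction and frequent residues lose a little mass, which is the "more uniform" effect mentioned above.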
-
mhcflurry.common.
random_peptides
(num, length=9, distribution=None)[source]¶ Generate random peptides (kmers).
- Parameters
- numint
Number of peptides to return
- lengthint
Length of each peptide
- distributionpandas.Series
Maps 1-letter amino acid abbreviations to probabilities. If not specified a uniform distribution is used.
- Returns
- list of string
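A rough sketch of sampling peptides from a residue distribution, using only the standard library. The name sample_peptides and the seed argument are hypothetical, and a plain dict stands in for the pandas.Series that the library function accepts.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 canonical residues


def sample_peptides(num, length=9, distribution=None, seed=None):
    """Draw `num` random peptides of the given length (illustrative)."""
    rng = random.Random(seed)
    if distribution is None:
        letters, weights = list(AMINO_ACIDS), None  # uniform sampling
    else:
        letters = list(distribution)
        weights = [distribution[aa] for aa in letters]
    return [
        "".join(rng.choices(letters, weights=weights, k=length))
        for _ in range(num)
    ]


peptides = sample_peptides(5, length=9, seed=0)
only_a = sample_peptides(3, length=4, distribution={"A": 1.0, "C": 0.0})
```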
-
mhcflurry.common.
positional_frequency_matrix
(peptides)[source]¶ Given a set of peptides, calculate a length x amino acids frequency matrix.
- Parameters
- peptideslist of string
All of same length
- Returns
- pandas.DataFrame
Index is position, columns are amino acids
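The length x amino acids matrix can be sketched in plain Python as a list of per-position dicts. This is an illustration under stated assumptions (the name positional_frequencies and the 20-letter alphabet are hypothetical); the library function returns a pandas.DataFrame.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 canonical residues


def positional_frequencies(peptides):
    """Per-position amino acid frequencies for equal-length peptides."""
    lengths = {len(p) for p in peptides}
    if len(lengths) != 1:
        raise ValueError("All peptides must have the same length")
    (length,) = lengths
    rows = []
    for position in range(length):
        column = [p[position] for p in peptides]
        rows.append(
            {aa: column.count(aa) / len(peptides) for aa in AMINO_ACIDS}
        )
    return rows  # rows[position][amino_acid] -> frequency


matrix = positional_frequencies(["ACD", "AGD", "ACE", "TCD"])
```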
-
mhcflurry.common.
save_weights
(weights_list, filename)[source]¶ Save model weights to the given filename using numpy’s “.npz” format.
- Parameters
- weights_listlist of numpy array
- filenamestring
-
mhcflurry.common.
load_weights
(filename)[source]¶ Restore model weights from the given filename, which should have been created with save_weights.
- Parameters
- filenamestring
- Returns
- list of array
-
class
mhcflurry.common.
NumpyJSONEncoder
(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶ Bases:
json.encoder.JSONEncoder
JSON encoder (used with json module) that can handle numpy arrays.
Constructor for JSONEncoder, with sensible defaults.
If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.
If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.
If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.
If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.
If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.
If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.
-
default
(obj)[source]¶ Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
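As an illustration of how such an encoder is used with the json module, the sketch below converts any object that exposes a tolist() method (numpy arrays do, as does the standard library array type used here so the example needs no third-party packages). ToListJSONEncoder is a hypothetical name and this is not the NumpyJSONEncoder source.

```python
import json
from array import array


class ToListJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder: serializes objects exposing tolist()."""

    def default(self, obj):
        tolist = getattr(obj, "tolist", None)
        if callable(tolist):
            return tolist()
        # Fall back to the base class, which raises TypeError.
        return super().default(obj)


encoded = json.dumps(
    {"weights": array("d", [0.5, 1.5])}, cls=ToListJSONEncoder)
```

The encoder is selected through the cls argument of json.dumps; objects it does not recognize still raise TypeError through the base class.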
-
mhcflurry.custom_loss module¶
Custom loss functions.
For losses supporting inequalities, each training data point is associated with one of (=), (<), or (>). For e.g. (>) inequalities, penalization is applied only if the prediction is less than the given value.
-
mhcflurry.custom_loss.
get_loss
(name)[source]¶ Get a custom_loss.Loss instance by name.
- Parameters
- namestring
- Returns
- custom_loss.Loss
-
class
mhcflurry.custom_loss.
Loss
(name=None)[source]¶ Bases:
object
Thin wrapper to keep track of neural network loss functions, which could be custom or baked into Keras.
Each subclass or instance should define these properties/methods:
- name : string
- loss : string or function
This is what gets passed to keras.fit()
- encode_y : numpy.ndarray -> numpy.ndarray
Transformation to apply to regression target before fitting
-
class
mhcflurry.custom_loss.
StandardKerasLoss
(loss_name='mse')[source]¶ Bases:
mhcflurry.custom_loss.Loss
A loss function supported by Keras, such as MSE.
-
supports_inequalities
= False¶
-
supports_multiple_outputs
= False¶
-
-
class
mhcflurry.custom_loss.
TransformPredictionsLossWrapper
(loss, y_pred_transform=None)[source]¶ Bases:
mhcflurry.custom_loss.Loss
Wrapper that applies an arbitrary transform to y_pred before calling an underlying loss function.
The y_pred_transform function should be a tensor -> tensor function.
-
class
mhcflurry.custom_loss.
MSEWithInequalities
(name=None)[source]¶ Bases:
mhcflurry.custom_loss.Loss
Supports training a regression model on data that includes inequalities (e.g. x < 100). Mean square error is used as the loss for elements with an (=) inequality. For elements with an inequality such as (> 0.5), the loss for that element is (y - 0.5)^2 (standard MSE) if y < 0.5 and 0 otherwise.
This loss assumes that the normal range for y_true and y_pred is 0 - 1. As a hack, the implementation uses other intervals for y_true to encode the inequality information.
y_true is interpreted as follows:
- between 0 - 1
Regular MSE loss is used. Penalty (y_pred - y_true)**2 is applied if y_pred is greater or less than y_true.
- between 2 - 3:
Treated as a “>” inequality. Penalty (y_pred - (y_true - 2))**2 is applied only if y_pred is less than y_true - 2.
- between 4 - 5:
Treated as a “<” inequality. Penalty (y_pred - (y_true - 4))**2 is applied only if y_pred is greater than y_true - 4.
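The three y_true intervals listed above can be written out per element as follows. This is a scalar sketch for illustration only: encode_target and element_loss are hypothetical names, and the real loss operates on tensors and averages over elements.

```python
OFFSETS = {"=": 0.0, ">": 2.0, "<": 4.0}


def encode_target(value, inequality="="):
    """Shift a target in [0, 1] into the interval for its inequality."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("Targets are expected to lie in [0, 1]")
    return value + OFFSETS[inequality]


def element_loss(y_true, y_pred):
    """Squared error for one element, honoring the encoded inequality."""
    if 0.0 <= y_true <= 1.0:  # (=): always penalized
        return (y_pred - y_true) ** 2
    if 2.0 <= y_true <= 3.0:  # (>): penalized only when too low
        target = y_true - 2.0
        return (y_pred - target) ** 2 if y_pred < target else 0.0
    if 4.0 <= y_true <= 5.0:  # (<): penalized only when too high
        target = y_true - 4.0
        return (y_pred - target) ** 2 if y_pred > target else 0.0
    raise ValueError("y_true outside the encoded intervals")


satisfied = element_loss(encode_target(0.5, ">"), 0.7)
violated = element_loss(encode_target(0.5, ">"), 0.3)
```

Here a (> 0.5) target with a prediction of 0.7 contributes no loss, while a prediction of 0.3 is penalized by (0.3 - 0.5)^2.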
-
name
= 'mse_with_inequalities'¶
-
supports_inequalities
= True¶
-
supports_multiple_outputs
= False¶
-
class
mhcflurry.custom_loss.
MSEWithInequalitiesAndMultipleOutputs
(name=None)[source]¶ Bases:
mhcflurry.custom_loss.Loss
Loss supporting inequalities and multiple outputs.
This loss assumes that the normal range for y_true and y_pred is 0 - 1. As a hack, the implementation uses other intervals for y_true to encode the inequality and output-index information.
Inequalities are encoded into the regression target as in the MSEWithInequalities loss.
Multiple outputs are encoded by mapping each regression target x (after transforming for inequalities) using the rule x -> x + i * 10 where i is the output index.
The reason for explicitly encoding multiple outputs this way (rather than just making the regression target a matrix instead of a vector) is that in our use cases we frequently have missing data in the regression target. This encoding gives a simple way to penalize only on (data point, output index) pairs that have labels.
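The rule x -> x + i * 10 is easy to invert, which is what lets a single vector carry both the target and the output index. A small sketch, with hypothetical helper names and assuming the inequality-encoded target x lies in [0, 5]:

```python
def encode_output(x, output_index):
    """Fold the output index into an inequality-encoded target x."""
    return x + output_index * 10


def decode_output(encoded):
    """Recover (x, output_index) from an encoded target."""
    output_index = int(encoded // 10)
    return encoded - output_index * 10, output_index


encoded = encode_output(2.5, 3)   # a ">" target of 0.5 for output 3
decoded = decode_output(encoded)
```

Entries with no label for a given output simply never appear in the encoded vector, so they contribute nothing to the loss.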
-
name
= 'mse_with_inequalities_and_multiple_outputs'¶
-
supports_inequalities
= True¶
-
supports_multiple_outputs
= True¶
-
-
class
mhcflurry.custom_loss.
MultiallelicMassSpecLoss
(delta=0.2, multiplier=1.0)[source]¶ Bases:
mhcflurry.custom_loss.Loss
-
name
= 'multiallelic_mass_spec_loss'¶
-
supports_inequalities
= True¶
-
supports_multiple_outputs
= False¶
-
-
mhcflurry.custom_loss.
check_shape
(name, arr, expected_shape)[source]¶ Raise ValueError if arr.shape != expected_shape.
- Parameters
- namestring
Included in error message to aid debugging
- arrnumpy.ndarray
- expected_shapetuple of int
-
mhcflurry.custom_loss.
cls
¶
mhcflurry.data_dependent_weights_initialization module¶
Layer-sequential unit-variance initialization for neural networks.
- See:
Mishkin and Matas, “All you need is a good init”. 2016. https://arxiv.org/abs/1511.06422
-
mhcflurry.data_dependent_weights_initialization.
lsuv_init
(model, batch, verbose=True, margin=0.1, max_iter=100)[source]¶ Initialize neural network weights using layer-sequential unit-variance initialization.
- See:
Mishkin and Matas, “All you need is a good init”. 2016. https://arxiv.org/abs/1511.06422
- Parameters
- modelkeras.Model
- batchdict
Training data, as would be passed keras.Model.fit()
- verboseboolean
Whether to print progress to stdout
- marginfloat
- max_iterint
- Returns
- keras.Model
Same as what was passed in.
mhcflurry.downloads module¶
Manage local downloaded data.
-
mhcflurry.downloads.
get_downloads_metadata
()[source]¶ Return the contents of downloads.yml as a dict
-
mhcflurry.downloads.
get_default_class1_models_dir
(test_exists=True)[source]¶ Return the absolute path to the default class1 models dir.
If environment variable MHCFLURRY_DEFAULT_CLASS1_MODELS is set to an absolute path, return that path. If it’s set to a relative path (i.e. does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.
If environment variable MHCFLURRY_DEFAULT_CLASS1_MODELS is NOT set, then return the path to downloaded models in the “models_class1” download.
- Parameters
- test_existsboolean, optional
Whether to raise an exception if the path does not exist
- Returns
- stringabsolute path
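The resolution rule for the environment variable can be sketched as a small pure function. The name resolve_models_dir and its arguments are hypothetical, POSIX path rules are assumed, and the exact location of the models inside the default download is not modeled.

```python
import posixpath


def resolve_models_dir(env_value, downloads_dir,
                       default_download="models_class1"):
    """Illustrative resolution of a models-dir environment variable."""
    if env_value:
        if posixpath.isabs(env_value):
            # Absolute paths are returned unchanged.
            return env_value
        # Relative paths are taken relative to the downloads dir.
        return posixpath.join(downloads_dir, env_value)
    # Variable not set: fall back to the default download.
    return posixpath.join(downloads_dir, default_download)


absolute = resolve_models_dir("/opt/models", "/data/mhcflurry")
relative = resolve_models_dir("custom", "/data/mhcflurry")
unset = resolve_models_dir(None, "/data/mhcflurry")
```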
-
mhcflurry.downloads.
get_default_class1_presentation_models_dir
(test_exists=True)[source]¶ Return the absolute path to the default class1 presentation models dir.
See get_default_class1_models_dir.
If environment variable MHCFLURRY_DEFAULT_CLASS1_PRESENTATION_MODELS is set to an absolute path, return that path. If it’s set to a relative path (does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.
- Parameters
- test_existsboolean, optional
Whether to raise an exception if the path does not exist
- Returns
- stringabsolute path
-
mhcflurry.downloads.
get_default_class1_processing_models_dir
(test_exists=True)[source]¶ Return the absolute path to the default class1 processing models dir.
See get_default_class1_models_dir.
If environment variable MHCFLURRY_DEFAULT_CLASS1_PROCESSING_MODELS is set to an absolute path, return that path. If it’s set to a relative path (does not start with /) then return that path taken to be relative to the mhcflurry downloads dir.
- Parameters
- test_existsboolean, optional
Whether to raise an exception if the path does not exist
- Returns
- stringabsolute path
-
mhcflurry.downloads.
get_current_release_downloads
()[source]¶ Return a dict of all available downloads in the current release.
The dict keys are the names of the downloads. The values are a dict with three entries:
- downloadedbool
Whether the download is currently available locally
- metadatadict
Info about the download from downloads.yml such as URL
- up_to_datebool or None
Whether the download URL(s) match what was used to download the current data. This is None if it cannot be determined.
-
mhcflurry.downloads.
get_path
(download_name, filename='', test_exists=True)[source]¶ Get the local path to a file in a MHCflurry download
- Parameters
- download_namestring
- filenamestring
Relative path within the download to the file of interest
- test_existsboolean
If True (default) throw an error telling the user how to download the data if the file does not exist
- Returns
- string giving local absolute path
mhcflurry.downloads_command module¶
Download MHCflurry released datasets and trained models.
Examples
- Fetch the default downloads:
$ mhcflurry-downloads fetch
- Fetch a specific download:
$ mhcflurry-downloads fetch models_class1_pan
- Get the path to a download:
$ mhcflurry-downloads path models_class1_pan
- Get the URL of a download:
$ mhcflurry-downloads url models_class1_pan
- Summarize available and fetched downloads:
$ mhcflurry-downloads info
-
mhcflurry.downloads_command.
run
(argv=sys.argv[1:])[source]¶
-
mhcflurry.downloads_command.
mkdir_p
(path)[source]¶ Make directories as needed, similar to mkdir -p in a shell.
From: http://stackoverflow.com/questions/600268/mkdir-p-functionality-in-python
-
class
mhcflurry.downloads_command.
TqdmUpTo
(*args, **kwargs)[source]¶ Bases:
tqdm.std.tqdm
Provides update_to(n) which uses tqdm.update(delta_n).
- Parameters
- iterableiterable, optional
Iterable to decorate with a progressbar. Leave blank to manually manage the updates.
- descstr, optional
Prefix for the progressbar.
- totalint or float, optional
The number of expected iterations. If unspecified, len(iterable) is used if possible. If float(“inf”) or as a last resort, only basic progress statistics are displayed (no ETA, no progressbar). If gui is True and this parameter needs subsequent updating, specify an initial arbitrary large positive number, e.g. 9e9.
- leavebool, optional
If [default: True], keeps all traces of the progressbar upon termination of iteration. If None, will leave only if position is 0.
- fileio.TextIOWrapper or io.StringIO, optional
Specifies where to output the progress messages (default: sys.stderr). Uses file.write(str) and file.flush() methods. For encoding, see write_bytes.
- ncolsint, optional
The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).
- minintervalfloat, optional
Minimum progress display update interval [default: 0.1] seconds.
- maxintervalfloat, optional
Maximum progress display update interval [default: 10] seconds. Automatically adjusts miniters to correspond to mininterval after long display update lag. Only works if dynamic_miniters or monitor thread is enabled.
- minitersint or float, optional
Minimum progress display update interval, in iterations. If 0 and dynamic_miniters, will automatically adjust to equal mininterval (more CPU efficient, good for tight loops). If > 0, will skip display of specified number of iterations. Tweak this and mininterval to get very efficient loops. If your progress is erratic with both fast and slow iterations (network, skipping items, etc) you should set miniters=1.
- asciibool or str, optional
If unspecified or False, use unicode (smooth blocks) to fill the meter. The fallback is to use ASCII characters ” 123456789#”.
- disablebool, optional
Whether to disable the entire progressbar wrapper [default: False]. If set to None, disable on non-TTY.
- unitstr, optional
String that will be used to define the unit of each iteration [default: it].
- unit_scalebool or int or float, optional
If 1 or True, the number of iterations will be reduced/scaled automatically and a metric prefix following the International System of Units standard will be added (kilo, mega, etc.) [default: False]. If any other non-zero number, will scale total and n.
- dynamic_ncolsbool, optional
If set, constantly alters ncols and nrows to the environment (allowing for window resizes) [default: False].
- smoothingfloat, optional
Exponential moving average smoothing factor for speed estimates (ignored in GUI mode). Ranges from 0 (average speed) to 1 (current/instantaneous speed) [default: 0.3].
- bar_formatstr, optional
Specify a custom bar string formatting. May impact performance. [default: '{l_bar}{bar}{r_bar}'], where l_bar='{desc}: {percentage:3.0f}%|' and r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'.
Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s.
Note that a trailing “: ” is automatically removed after {desc} if the latter is empty.
- initialint or float, optional
The initial counter value. Useful when restarting a progress bar [default: 0]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.
- positionint, optional
Specify the line offset to print this bar (starting from 0) Automatic if unspecified. Useful to manage multiple bars at once (eg, from threads).
- postfixdict or *, optional
Specify additional stats to display at the end of the bar. Calls set_postfix(**postfix) if possible (dict).
- unit_divisorfloat, optional
[default: 1000], ignored unless unit_scale is True.
- write_bytesbool, optional
If (default: None) and file is unspecified, bytes will be written in Python 2. If True will also write bytes. In all other cases will default to unicode.
- lock_argstuple, optional
Passed to refresh for intermediate output (initialisation, iterating, and updating).
- nrowsint, optional
The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20.
- guibool, optional
WARNING: internal parameter - do not use. Use tqdm.gui.tqdm(…) instead. If set, will attempt to use matplotlib animations for a graphical output [default: False].
- Returns
- outdecorated iterator.
mhcflurry.encodable_sequences module¶
Class for encoding variable-length peptides to fixed-size numerical matrices
-
exception
mhcflurry.encodable_sequences.
EncodingError
(message, supported_peptide_lengths)[source]¶ Bases:
ValueError
Exception raised when peptides cannot be encoded
-
class
mhcflurry.encodable_sequences.
EncodableSequences
(sequences)[source]¶ Bases:
object
Class for encoding variable-length peptides to fixed-size numerical matrices
This class caches various encodings of a list of sequences.
In practice this is used only for peptides. To encode MHC allele sequences, see AlleleEncoding.
-
unknown_character
= 'X'¶
-
classmethod
create
(sequences)[source]¶ Factory that returns an EncodableSequences given a list of strings. As a convenience, you can also pass it an EncodableSequences instance, in which case the object is returned unchanged.
-
variable_length_to_fixed_length_categorical
(alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15)[source]¶ Encode variable-length sequences to a fixed-size index-encoded (integer) matrix.
See sequences_to_fixed_length_index_encoded_array for details.
- Parameters
- alignment_methodstring
One of “pad_middle”, “left_pad_centered_right_pad”, or “left_pad_right_pad”
- left_edgeint, size of fixed-position left side
Only relevant for pad_middle alignment method
- right_edgeint, size of the fixed-position right side
Only relevant for pad_middle alignment method
- max_lengthmaximum supported peptide length
- Returns
- numpy.array of integers with shape (num sequences, encoded length)
- For pad_middle, the encoded length is max_length. For left_pad_right_pad,
- it’s 2 * max_length. For left_pad_centered_right_pad, it’s
- 3 * max_length.
-
variable_length_to_fixed_length_vector_encoding
(vector_encoding_name, alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15, trim=False, allow_unsupported_amino_acids=False)[source]¶ Encode variable-length sequences to a fixed-size matrix. Amino acids are encoded as specified by the vector_encoding_name argument.
See sequences_to_fixed_length_index_encoded_array for details.
See also: variable_length_to_fixed_length_categorical.
- Parameters
- vector_encoding_namestring
How to represent amino acids. One of “BLOSUM62”, “one-hot”, etc. Full list of supported vector encodings is given by available_vector_encodings().
- alignment_methodstring
One of “pad_middle”, “left_pad_centered_right_pad”, or “left_pad_right_pad”
- left_edgeint
Size of fixed-position left side. Only relevant for pad_middle alignment method
- right_edgeint
Size of the fixed-position right side. Only relevant for pad_middle alignment method
- max_lengthint
Maximum supported peptide length
- trimbool
If True, longer sequences will be trimmed to fit the maximum supported length. Not supported for all alignment methods.
- allow_unsupported_amino_acidsbool
If True, non-canonical amino acids will be replaced with the X character before encoding.
- Returns
- numpy.array with shape (num sequences, encoded length, m)
- where
m is the vector encoding length (usually 21).
encoded length is max_length if alignment_method is pad_middle; 2 * max_length if it’s left_pad_right_pad; 3 * max_length if it’s left_pad_centered_right_pad.
-
classmethod
sequences_to_fixed_length_index_encoded_array
(sequences, alignment_method='pad_middle', left_edge=4, right_edge=4, max_length=15, trim=False, allow_unsupported_amino_acids=False)[source]¶ Encode variable-length sequences to a fixed-size index-encoded (integer) matrix.
How variable length sequences get mapped to fixed length is set by the “alignment_method” argument. Supported alignment methods are:
- pad_middle
Encoding designed for preserving the anchor positions of class I peptides. This is what is used in allele-specific models.
Each string must be of length at least left_edge + right_edge and at most max_length. The first left_edge characters in the input always map to the first left_edge characters in the output. Similarly for the last right_edge characters. The middle characters are filled in based on the length, with the X character filling in the blanks.
Example:
AAAACDDDD -> AAAAXXXCXXXDDDD
- left_pad_centered_right_pad
Encoding that makes no assumptions on anchor positions but is 3x larger than pad_middle, since it duplicates the peptide (left aligned + centered + right aligned). This is what is used for the pan-allele models.
Example:
AAAACDDDD -> AAAACDDDDXXXXXXXXXAAAACDDDDXXXXXXXXXAAAACDDDD
- left_pad_right_pad
Same as left_pad_centered_right_pad but only includes left- and right-padded peptide.
Example:
AAAACDDDD -> AAAACDDDDXXXXXXXXXXXXAAAACDDDD
- Parameters
- sequenceslist of string
- alignment_methodstring
One of “pad_middle”, “left_pad_centered_right_pad”, or “left_pad_right_pad”
- left_edgeint
Size of fixed-position left side. Only relevant for pad_middle alignment method
- right_edgeint
Size of the fixed-position right side. Only relevant for pad_middle alignment method
- max_lengthint
maximum supported peptide length
- trimbool
If True, longer sequences will be trimmed to fit the maximum supported length. Not supported for all alignment methods.
- allow_unsupported_amino_acidsbool
If True, non-canonical amino acids will be replaced with the X character before encoding.
- Returns
- numpy.array of integers with shape (num sequences, encoded length)
- For pad_middle, the encoded length is max_length. For left_pad_right_pad,
- it’s 2 * max_length. For left_pad_centered_right_pad, it’s
- 3 * max_length.
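The three alignment methods documented for sequences_to_fixed_length_index_encoded_array can be reproduced at the string level with a few lines of Python. This is a sketch, not the library's implementation: the function names are hypothetical, the output is a padded string rather than an integer index array, and the handling of odd padding amounts is an assumption.

```python
def pad_middle(seq, left_edge=4, right_edge=4, max_length=15, pad="X"):
    """Keep both edges fixed and pad around the middle residues."""
    if not left_edge + right_edge <= len(seq) <= max_length:
        raise ValueError("Unsupported peptide length: %d" % len(seq))
    middle = seq[left_edge:len(seq) - right_edge]
    num_pad = max_length - len(seq)
    # Assumption: when the padding is odd, the extra pad goes on the left.
    pad_left = (num_pad + 1) // 2
    pad_right = num_pad - pad_left
    return (
        seq[:left_edge] + pad * pad_left + middle
        + pad * pad_right + seq[len(seq) - right_edge:]
    )


def left_pad_right_pad(seq, max_length=15, pad="X"):
    """Left-aligned copy followed by a right-aligned copy."""
    if len(seq) > max_length:
        raise ValueError("Peptide longer than max_length")
    return seq.ljust(max_length, pad) + seq.rjust(max_length, pad)


def left_pad_centered_right_pad(seq, max_length=15, pad="X"):
    """Left-aligned, centered, and right-aligned copies."""
    if len(seq) > max_length:
        raise ValueError("Peptide longer than max_length")
    # Assumption: when the padding is odd, the extra pad goes on the right.
    left = (max_length - len(seq)) // 2
    centered = (pad * left + seq).ljust(max_length, pad)
    return seq.ljust(max_length, pad) + centered + seq.rjust(max_length, pad)


peptide = "AAAACDDDD"
lengths = {
    "pad_middle": len(pad_middle(peptide)),
    "left_pad_right_pad": len(left_pad_right_pad(peptide)),
    "left_pad_centered_right_pad": len(left_pad_centered_right_pad(peptide)),
}
```

For the 9-mer used in the examples, the encoded lengths are max_length, 2 * max_length, and 3 * max_length respectively, matching the Returns description.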
-
mhcflurry.ensemble_centrality module¶
Measures of centrality (e.g. mean) used to combine predictions across an ensemble. The input to these functions are log affinities, and they are expected to return a centrality measure also in log-space.
mhcflurry.fasta module¶
Adapted from pyensembl, github.com/openvax/pyensembl Original implementation by Alex Rubinsteyn.
The worst sin in bioinformatics is to write your own FASTA parser. We’re doing this to avoid adding another dependency to MHCflurry, however.
mhcflurry.flanking_encoding module¶
Class for encoding variable-length flanking and peptides to fixed-size numerical matrices
-
class
mhcflurry.flanking_encoding.
EncodingResult
(array, peptide_lengths)¶ Bases:
tuple
Create new instance of EncodingResult(array, peptide_lengths)
-
array
¶ Alias for field number 0
-
peptide_lengths
¶ Alias for field number 1
-
-
class
mhcflurry.flanking_encoding.
FlankingEncoding
(peptides, n_flanks, c_flanks)[source]¶ Bases:
object
Encode peptides and optionally their N- and C-flanking sequences into fixed size numerical matrices. Similar to EncodableSequences but with support for flanking sequences and the encoding scheme used by the processing predictor.
Instances of this class have an immutable list of peptides with flanking sequences. Encodings are cached in the instances for faster performance when the same set of peptides needs to be encoded more than once.
Constructor. Sequences of any lengths can be passed.
- Parameters
- peptideslist of string
Peptide sequences
- n_flankslist of string [same length as peptides]
Upstream sequences
- c_flankslist of string [same length as peptides]
Downstream sequences
-
unknown_character
= 'X'¶
-
vector_encode
(vector_encoding_name, peptide_max_length, n_flank_length, c_flank_length, allow_unsupported_amino_acids=True, throw=True)[source]¶ Encode variable-length sequences to a fixed-size matrix.
- Parameters
- vector_encoding_namestring
How to represent amino acids. One of “BLOSUM62”, “one-hot”, etc. See amino_acid.available_vector_encodings().
- peptide_max_lengthint
Maximum supported peptide length.
- n_flank_lengthint
Maximum supported N-flank length
- c_flank_lengthint
Maximum supported C-flank length
- allow_unsupported_amino_acidsbool
If True, non-canonical amino acids will be replaced with the X character before encoding.
- throwbool
Whether to raise exception on unsupported peptides
- Returns
- numpy.array with shape (num sequences, length, m)
- where
num sequences is number of peptides, i.e. len(self)
length is peptide_max_length + n_flank_length + c_flank_length
m is the vector encoding length (usually 21).
-
static
encode
(vector_encoding_name, df, peptide_max_length, n_flank_length, c_flank_length, allow_unsupported_amino_acids=False, throw=True)[source]¶ Encode variable-length sequences to a fixed-size matrix.
Helper function. Users should use vector_encode.
- Parameters
- vector_encoding_namestring
- dfpandas.DataFrame
- peptide_max_lengthint
- n_flank_lengthint
- c_flank_lengthint
- allow_unsupported_amino_acidsbool
- throwbool
- Returns
- numpy.array
mhcflurry.hyperparameters module¶
Hyperparameter (neural network options) management
-
class
mhcflurry.hyperparameters.
HyperparameterDefaults
(**defaults)[source]¶ Bases:
object
Class for managing hyperparameters. Thin wrapper around a dict.
Instances of this class are a specification of the hyperparameters supported by a model and their defaults. The particular hyperparameter settings to be used, for example, to train a model are kept in plain dicts.
-
extend
(other)[source]¶ Return a new HyperparameterDefaults instance containing the hyperparameters from the current instance combined with those from other.
It is an error if self and other have any hyperparameters in common.
-
with_defaults
(obj)[source]¶ Given a dict of hyperparameter settings, return a dict containing those settings augmented by the defaults for any keys missing from the dict.
-
subselect
(obj)[source]¶ Filter a dict of hyperparameter settings to only those keys defined in this HyperparameterDefaults instance.
-
check_valid_keys
(obj)[source]¶ Given a dict of hyperparameter settings, throw an exception if any keys are not defined in this HyperparameterDefaults instance.
-
models_grid
(**kwargs)[source]¶ Make a grid of models by taking the cartesian product of all specified model parameter lists.
- Parameters
- The valid kwarg parameters are the entries of this
- HyperparameterDefaults instance. Each parameter must be a list
- giving the values to search across.
- Returns
- list of dict giving the parameters for each model. The length of the
- list is the product of the lengths of the input lists.
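The behavior of these methods can be sketched as a thin wrapper around a dict. DefaultsSketch is a hypothetical stand-in written for illustration; details such as the exception type and whether with_defaults validates keys are assumptions rather than documented behavior.

```python
import itertools


class DefaultsSketch:
    """Illustrative stand-in for a hyperparameter-defaults container."""

    def __init__(self, **defaults):
        self.defaults = dict(defaults)

    def check_valid_keys(self, obj):
        invalid = [key for key in obj if key not in self.defaults]
        if invalid:
            raise ValueError("Unsupported hyperparameters: %s" % invalid)

    def with_defaults(self, obj):
        self.check_valid_keys(obj)  # assumption: unknown keys are rejected
        merged = dict(self.defaults)
        merged.update(obj)
        return merged

    def subselect(self, obj):
        return {k: v for k, v in obj.items() if k in self.defaults}

    def models_grid(self, **kwargs):
        self.check_valid_keys(kwargs)
        keys = list(kwargs)
        # Cartesian product over the value lists, one dict per combination.
        return [
            dict(zip(keys, combo))
            for combo in itertools.product(*(kwargs[k] for k in keys))
        ]


sketch = DefaultsSketch(layer_sizes=[32], dropout=0.0)
grid = sketch.models_grid(layer_sizes=[[16], [32]], dropout=[0.0, 0.5])
```

Two values for each of two parameters give a grid of four model specifications, one per combination.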
-
mhcflurry.local_parallelism module¶
Infrastructure for “local” parallelism, i.e. multiprocess parallelism on one compute node.
-
mhcflurry.local_parallelism.
add_local_parallelism_args
(parser)[source]¶ Add local parallelism arguments to the given argparse.ArgumentParser.
- Parameters
- parserargparse.ArgumentParser
-
mhcflurry.local_parallelism.
worker_pool_with_gpu_assignments_from_args
(args)[source]¶ Create a multiprocessing.Pool where each worker uses its own GPU.
Uses commandline arguments. See worker_pool_with_gpu_assignments.
- Parameters
- argsargparse.ArgumentParser
- Returns
- multiprocessing.Pool
-
mhcflurry.local_parallelism.
worker_pool_with_gpu_assignments
(num_jobs, num_gpus=0, backend=None, max_workers_per_gpu=1, max_tasks_per_worker=None, worker_log_dir=None)[source]¶ Create a multiprocessing.Pool where each worker uses its own GPU.
- Parameters
- num_jobsint
Number of worker processes.
- num_gpusint
- backendstring
- max_workers_per_gpuint
- max_tasks_per_workerint
- worker_log_dirstring
- Returns
- multiprocessing.Pool
-
mhcflurry.local_parallelism.
make_worker_pool
(processes=None, initializer=None, initializer_kwargs_per_process=None, max_tasks_per_worker=None)[source]¶ Convenience wrapper to create a multiprocessing.Pool.
This function adds support for per-worker initializer arguments, which are not natively supported by the multiprocessing module. The motivation for this feature is to support allocating each worker to a (different) GPU.
- IMPLEMENTATION NOTE:
The per-worker initializer arguments are implemented using a Queue. Each worker reads its arguments from this queue when it starts. When it terminates, it adds its initializer arguments back to the queue, so a future process can initialize itself using these arguments.
There is one issue with this approach, however. If a worker crashes, it never repopulates the queue of initializer arguments. This will prevent any future worker from re-using those arguments. To deal with this issue we add a second ‘backup queue’. This queue always contains the full set of initializer arguments: whenever a worker reads from it, it always pushes the pop’d args back to the end of the queue immediately. If the primary arg queue is ever empty, then workers will read from this backup queue.
- Parameters
- processesint
Number of workers. Default: num CPUs.
- initializerfunction, optional
Init function to call in each worker
- initializer_kwargs_per_processlist of dict, optional
Arguments to pass to initializer function for each worker. Length of list must equal the number of workers.
- max_tasks_per_workerint, optional
Restart workers after this many tasks. Requires Python >=3.2.
- Returns
- multiprocessing.Pool
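The primary/backup queue scheme described in the implementation note can be sketched in a single process. This is a simplified illustration only: it uses `queue.Queue` rather than a multiprocessing queue, and the helper name `next_init_kwargs` is hypothetical.

```python
import queue


def next_init_kwargs(arg_queue, backup_arg_queue):
    """Fetch initializer kwargs for a starting worker (simplified sketch)."""
    try:
        # Normal case: take arguments released by a cleanly finished worker.
        return arg_queue.get(block=False)
    except queue.Empty:
        # Primary queue is empty (e.g. a worker crashed without returning
        # its arguments): rotate the backup queue instead.
        kwargs = backup_arg_queue.get(block=False)
        backup_arg_queue.put(kwargs)
        return kwargs


all_kwargs = [{"gpu_device_nums": [0]}, {"gpu_device_nums": [1]}]
arg_queue = queue.Queue()
backup_arg_queue = queue.Queue()
for kwargs in all_kwargs:
    arg_queue.put(kwargs)
    backup_arg_queue.put(kwargs)

first = next_init_kwargs(arg_queue, backup_arg_queue)
second = next_init_kwargs(arg_queue, backup_arg_queue)
# Suppose the first worker exits cleanly and returns its arguments,
# while the second one crashes and returns nothing.
arg_queue.put(first)
third = next_init_kwargs(arg_queue, backup_arg_queue)   # reuses `first`
fourth = next_init_kwargs(arg_queue, backup_arg_queue)  # served by the backup
```

Because the backup queue is rotated rather than drained, it never runs out of arguments, at the cost of possibly handing the same arguments to two live workers.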
-
mhcflurry.local_parallelism.
worker_init_entry_point
(init_function, arg_queue=None, backup_arg_queue=None)[source]¶
-
mhcflurry.local_parallelism.
worker_init
(keras_backend=None, gpu_device_nums=None, worker_log_dir=None)[source]¶
-
exception
mhcflurry.local_parallelism.
WrapException
[source]¶ Bases:
Exception
Add traceback info to exception so exceptions raised in worker processes can still show traceback info when re-raised in the parent.
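The general idea of carrying traceback text along with an exception can be sketched as follows. The class `TracebackCarryingError` and the helper `run_in_worker` are hypothetical stand-ins written for this example; they do not reproduce WrapException itself.

```python
import traceback


class TracebackCarryingError(Exception):
    """Hypothetical stand-in: stores a formatted traceback as text."""

    def __init__(self, original):
        self.original = original
        self.formatted = "".join(
            traceback.format_exception(
                type(original), original, original.__traceback__))
        super().__init__(self.formatted)


def run_in_worker(function, *args):
    """Call function; on failure raise an error whose message has the traceback."""
    try:
        return function(*args)
    except Exception as error:
        raise TracebackCarryingError(error) from None


def failing_task(x):
    return 1 / x


try:
    run_in_worker(failing_task, 0)
except TracebackCarryingError as error:
    message = str(error)
    print(message)
```

The printed message names both the original exception type and the function that raised it, which is the information that would otherwise be lost when an error crosses a process boundary.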
mhcflurry.percent_rank_transform module¶
Class for transforming arbitrary values into percent ranks given a distribution.
-
class
mhcflurry.percent_rank_transform.
PercentRankTransform
[source]¶ Bases:
object
Transform arbitrary values into percent ranks.
-
fit
(values, bins)[source]¶ Fit the transform using the given values (e.g. ic50s).
- Parameters
- valuespredictions (e.g. ic50 values)
- binsbins for the cumulative distribution function
Anything that can be passed to numpy.histogram’s “bins” argument can be used here.
-
to_series
()[source]¶ Serialize the fit to a pandas.Series.
The index on the series gives the bin edges and the values give the CDF.
- Returns
- pandas.Series
-
static
from_series
(series)[source]¶ Deserialize a PercentRankTransform from the given pandas.Series, as returned by to_series().
- Parameters
- seriespandas.Series
- Returns
- PercentRankTransform
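To make the bin-edge/CDF representation concrete, here is a minimal pure-Python sketch of the idea. It assumes explicit bin edges (not an integer bin count) and uses hypothetical function names; the actual class is described above in terms of numpy.histogram and pandas, and may differ in details such as interpolation.

```python
from bisect import bisect_right


def fit_cdf(values, bin_edges):
    """Return the cumulative percentage of values at or below each bin edge."""
    ordered = sorted(values)
    total = len(ordered)
    return [100.0 * bisect_right(ordered, edge) / total for edge in bin_edges]


def percent_rank(value, bin_edges, cdf):
    """Look up the percent rank of value from the fitted bin edges."""
    index = bisect_right(bin_edges, value) - 1
    index = min(max(index, 0), len(cdf) - 1)  # clamp out-of-range values
    return cdf[index]


# Made-up IC50-like values, for illustration only.
values = [10, 50, 200, 1000, 5000]
bin_edges = [1, 100, 1000, 10000]
cdf = fit_cdf(values, bin_edges)

# A serialized form analogous to "bin edges as index, CDF as values".
serialized = dict(zip(bin_edges, cdf))
rank = percent_rank(500, bin_edges, cdf)
```

Finer bin edges give a finer-grained percent rank; with the coarse edges used here, every value between 100 and 1000 maps to the same rank.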
-
mhcflurry.predict_command module¶
Run MHCflurry predictor on specified peptides.
By default, the presentation predictor is used, and predictions for MHC I binding affinity, antigen processing, and the composite presentation score are returned. If you just want binding affinity predictions, pass --affinity-only.
Examples:
Write a CSV file containing the contents of INPUT.csv plus additional columns giving MHCflurry predictions:
$ mhcflurry-predict INPUT.csv --out RESULT.csv
The input CSV file is expected to contain columns “allele” and “peptide”, and, optionally, “n_flank” and “c_flank”.
If --out is not specified, results are written to stdout.
You can also run on alleles and peptides specified on the commandline, in which case predictions are written for all combinations of alleles and peptides:
$ mhcflurry-predict --alleles HLA-A0201 H-2Kb --peptides SIINFEKL DENDREKLLL
Instead of individual alleles (in a CSV or on the command line), you can also give a comma separated list of alleles giving a sample genotype. In this case, the tightest binding affinity across the alleles for the sample will be returned. For example:
$ mhcflurry-predict --peptides SIINFEKL DENDREKLLL --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:01,HLA-C*03:01
will give the tightest predicted affinities across alleles for each of the two genotypes specified for each peptide.
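The genotype reduction described above amounts to taking, for each peptide, the minimum predicted affinity (in nM) over the alleles of a genotype, since a lower value means tighter binding. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def tightest_affinity(per_allele_affinities_nm):
    """Return (allele, affinity) with the lowest nM value, i.e. tightest binding."""
    best_allele = min(per_allele_affinities_nm, key=per_allele_affinities_nm.get)
    return best_allele, per_allele_affinities_nm[best_allele]


# Made-up nanomolar values for one peptide, for illustration only.
example = {"HLA-A*02:01": 12000.0, "HLA-B*57:01": 350.0, "HLA-C*07:02": 4100.0}
best = tightest_affinity(example)
print(best)
```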
mhcflurry.predict_scan_command module¶
Scan protein sequences using the MHCflurry presentation predictor.
By default, sub-sequences (peptides) with affinity percentile ranks less than 2.0 are returned. You can also specify --results-all to return predictions for all peptides, or --results-best to return the top peptide for each sequence.
Examples:
Scan a set of sequences in a FASTA file for binders to any alleles in a MHC I genotype:
$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02
Instead of a FASTA, you can also pass a CSV that has “sequence_id” and “sequence” columns.
You can also specify multiple MHC I genotypes to scan as space-separated arguments to the --alleles option:
$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01
If --out is not specified, results are written to standard out.
You can also specify sequences on the commandline:
$ mhcflurry-predict-scan --sequences MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02
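Conceptually, scanning a protein means enumerating its sub-sequences and keeping those whose predicted percentile rank falls below a cutoff. The sketch below illustrates only that enumeration and filtering; the function names and the default peptide lengths are assumptions for this example, and no predictor is called.

```python
def enumerate_peptides(sequence, lengths=(8, 9, 10, 11)):
    """Yield (offset, peptide) for every sub-sequence of the given lengths."""
    for length in lengths:
        for offset in range(len(sequence) - length + 1):
            yield offset, sequence[offset:offset + length]


def keep_strong_candidates(scored_peptides, max_percentile_rank=2.0):
    """Keep (peptide, rank) pairs whose percentile rank is below the cutoff."""
    return [(p, r) for (p, r) in scored_peptides if r < max_percentile_rank]


sequence = "MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT"
peptides = list(enumerate_peptides(sequence, lengths=(8, 9)))

# Made-up percentile ranks, for illustration only.
kept = keep_strong_candidates([("AAA", 0.5), ("BBB", 2.0), ("CCC", 5.0)])
```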
mhcflurry.random_negative_peptides module¶
-
class
mhcflurry.random_negative_peptides.
RandomNegativePeptides
(**hyperparameters)[source]¶ Bases:
object
Generate random negative (peptide, allele) pairs. These are used during model training, where they are resampled at each epoch.
-
hyperparameter_defaults
= <mhcflurry.hyperparameters.HyperparameterDefaults object>¶ Hyperparameters for random negative peptides.
- Number of random negatives will be:
random_negative_rate * (num measurements) + random_negative_constant
where the exact meaning of (num measurements) depends on the particular random_negative_method in use.
If random_negative_match_distribution is True, then the amino acid frequencies of the training data peptides are used to generate the random peptides.
- Valid values for random_negative_method are:
- “by_length”: used for allele-specific prediction. See description in RandomNegativePeptides.plan_by_length method.
- “by_allele”: used for pan-allele prediction. See RandomNegativePeptides.plan_by_allele method.
- “by_allele_equalize_nonbinders”: used for pan-allele prediction. See RandomNegativePeptides.plan_by_allele_equalize_nonbinders method.
- “recommended”: the default. Use by_length if the predictor is allele-specific and by_allele if it’s pan-allele.
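The count formula above can be written out directly. The helper name and the truncation to an integer are assumptions for this sketch; what counts as "num measurements" depends on the method in use, as noted.

```python
def num_random_negatives(num_measurements, random_negative_rate,
                         random_negative_constant):
    """Apply rate * (num measurements) + constant, truncated to an int."""
    return int(
        random_negative_rate * num_measurements + random_negative_constant)


count = num_random_negatives(
    num_measurements=1000,
    random_negative_rate=0.5,
    random_negative_constant=25)
print(count)
```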
-
plan
(peptides, affinities, alleles=None, inequalities=None)[source]¶ Calculate the number of random negatives for each allele and peptide length. Call this once after instantiating the object.
- Parameters
- peptideslist of string
- affinitieslist of float
- alleleslist of string, optional
- inequalitieslist of string (“>”, “<”, or “=”), optional
- Returns
- pandas.DataFrame indicating number of random negatives for each length and allele.
-
plan_by_length
(df_all, df_binders=None, df_nonbinders=None)[source]¶ Generate a random negative plan using the “by_length” policy.
Parameters are as in the plan method. No return value.
Used for allele-specific predictors. Does not work well for pan-allele.
Different numbers of random negatives per length. Alleles are sampled proportionally to the number of times they are used in the training data.
-
plan_by_allele
(df_all, df_binders=None, df_nonbinders=None)[source]¶ Generate a random negative plan using the “by_allele” policy.
Parameters are as in the plan method. No return value.
For each allele, a particular number of random negatives are used for all lengths. Across alleles, the number of random negatives varies; within an allele, the number of random negatives for each length is a constant.
-
plan_by_allele_equalize_nonbinders
(df_all, df_binders, df_nonbinders)[source]¶ Generate a random negative plan using the “by_allele_equalize_nonbinders” policy.
Parameters are as in the plan method. No return value.
Requires that the random_negative_binder_threshold hyperparameter is set.
In a first step, the number of random negatives selected by the “by_allele” method are added (see plan_by_allele). Then, the total number of non-binders is calculated for each allele and length. This total includes non-binder measurements in the training data plus the random negative peptides added in the first step. In a second step, additional random negative peptides are added so that for each allele, all peptide lengths have the same total number of non-binders.
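The two-step procedure can be illustrated for a single allele. This sketch assumes that "the same total" means topping every length up to the largest per-length total; the helper name and the input numbers are made up for the example.

```python
def equalize_nonbinders(measured_nonbinders, by_allele_negatives):
    """Return total random negatives per length for one allele.

    measured_nonbinders: dict of peptide length -> non-binder measurements.
    by_allele_negatives: constant number added to every length in step one.
    """
    # Step one: every length receives the same "by_allele" count.
    totals = {
        length: count + by_allele_negatives
        for length, count in measured_nonbinders.items()
    }
    # Step two: top up each length to the largest total (an assumption).
    target = max(totals.values())
    return {
        length: by_allele_negatives + (target - total)
        for length, total in totals.items()
    }


measured = {8: 5, 9: 120, 10: 40}
plan = equalize_nonbinders(measured, by_allele_negatives=100)
print(plan)
```

After both steps, measured non-binders plus random negatives sum to the same number for every length of this allele.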
-
get_alleles
()[source]¶ Get the list of alleles corresponding to each random negative peptide as returned by get_peptides. This does NOT change and can be safely called once and reused.
- Returns
- list of string
-
mhcflurry.regression_target module¶
mhcflurry.scoring module¶
Measures of prediction accuracy
-
mhcflurry.scoring.
make_scores
(ic50_y, ic50_y_pred, sample_weight=None, threshold_nm=500, max_ic50=50000)[source]¶ Calculate AUC, F1, and Kendall Tau scores.
- Parameters
- ic50_yfloat list
true IC50s (i.e. affinities)
- ic50_y_predfloat list
predicted IC50s
- sample_weightfloat list [optional]
- threshold_nmfloat [optional]
- max_ic50float [optional]
- Returns
- dict with entries “auc”, “f1”, “tau”
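To show how a nanomolar threshold turns affinities into a classification problem, here is a pure-Python sketch that computes AUC and F1 only (Kendall tau is omitted for brevity). The log transform in `to_01` and the function names are assumptions for this example, not the library's implementation, and the input needs at least one binder and one non-binder.

```python
import math


def to_01(ic50, max_ic50=50000.0):
    """Map an IC50 in nM to [0, 1]; larger means tighter (assumed transform)."""
    return min(1.0, max(0.0, 1.0 - math.log(ic50) / math.log(max_ic50)))


def sketch_scores(ic50_y, ic50_y_pred, threshold_nm=500.0, max_ic50=50000.0):
    y_true = [ic50 <= threshold_nm for ic50 in ic50_y]
    y_pred = [to_01(ic50, max_ic50) for ic50 in ic50_y_pred]
    cutoff = to_01(threshold_nm, max_ic50)

    # AUC: probability that a true binder outscores a true non-binder.
    pos = [p for p, t in zip(y_pred, y_true) if t]
    neg = [p for p, t in zip(y_pred, y_true) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    # F1 on predictions binarized at the same threshold.
    called = [p >= cutoff for p in y_pred]
    tp = sum(c and t for c, t in zip(called, y_true))
    precision = tp / max(1, sum(called))
    recall = tp / max(1, sum(y_true))
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return {"auc": auc, "f1": f1}


# Made-up IC50 values in nM, for illustration only.
scores = sketch_scores([50, 200, 1000, 20000], [80, 900, 400, 30000])
```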
mhcflurry.select_allele_specific_models_command module¶
Model select class1 single allele models.
-
mhcflurry.select_allele_specific_models_command.
run
(argv=sys.argv[1:])[source]¶
-
class
mhcflurry.select_allele_specific_models_command.
ScrambledPredictor
(predictor)[source]¶ Bases:
object
-
class
mhcflurry.select_allele_specific_models_command.
ScoreFunction
(function, summary=None)[source]¶ Bases:
object
Thin wrapper over a score function (Class1AffinityPredictor -> float). Used to keep a summary string associated with the function.
-
class
mhcflurry.select_allele_specific_models_command.
CombinedModelSelector
(model_selectors, weights=None, min_contribution_percent=1.0)[source]¶ Bases:
object
Model selector that computes a weighted average over other model selectors.
-
class
mhcflurry.select_allele_specific_models_command.
ConsensusModelSelector
(predictor, num_peptides_per_length=10000, multiply_score_by_value=10.0)[source]¶ Bases:
object
Model selector that scores sub-ensembles based on their Kendall tau consistency with the full ensemble over a set of random peptides.
-
class
mhcflurry.select_allele_specific_models_command.
MSEModelSelector
(df, predictor, min_measurements=1, multiply_score_by_data_size=True)[source]¶ Bases:
object
Model selector that uses mean-squared error to score models. Inequalities are supported.
mhcflurry.select_pan_allele_models_command module¶
Model select class1 pan-allele models.
APPROACH: For each training fold, we select at least min and at most max models (where min and max are set by the --{min/max}-models-per-fold argument) using a step-up (forward) selection procedure. The final ensemble is the union of all selected models across all folds.
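A step-up (forward) selection with a minimum and maximum ensemble size can be sketched as below. The function `forward_select` and its `score_ensemble` callable are hypothetical; the stopping rule (stop once the minimum is met and the score no longer improves) is an assumption about what "step-up" implies here.

```python
def forward_select(num_models, score_ensemble, min_models, max_models):
    """Greedy step-up selection of model indices.

    score_ensemble(indices) is a hypothetical callable returning a score
    where higher is better.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_models:
        candidates = [i for i in range(num_models) if i not in selected]
        if not candidates:
            break
        scored = [(score_ensemble(selected + [i]), i) for i in candidates]
        score, index = max(scored)
        # Stop once min_models is reached and adding a model no longer helps.
        if len(selected) >= min_models and score <= best_score:
            break
        selected.append(index)
        best_score = score
    return selected


# Toy setup: each model has a bias; an ensemble scores better when the
# mean bias of its members is closer to zero. Values are made up.
biases = [5, -4, 1, 9]


def toy_score(indices):
    return -abs(sum(biases[i] for i in indices) / len(indices))


small = forward_select(len(biases), toy_score, min_models=1, max_models=3)
larger = forward_select(len(biases), toy_score, min_models=2, max_models=3)
```

Running the selection once per fold and taking the union of the selected indices would give the final ensemble described above.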
-
mhcflurry.select_pan_allele_models_command.
mse
(predictions, actual, inequalities=None, affinities_are_already_01_transformed=False)[source]¶ Mean squared error of predictions vs. actual
- Parameters
- predictionslist of float
- actuallist of float
- inequalitieslist of string (“>”, “<”, or “=”)
- affinities_are_already_01_transformedboolean
Predictions and actual are taken to be nanomolar affinities if affinities_are_already_01_transformed is False, otherwise 0-1 values.
- Returns
- float
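The handling of inequalities can be sketched as follows: a measurement flagged “>” or “<” contributes no error when the prediction already lies on the stated side of it. The `to_01` transform (log scale capped at 50000 nM) and the function name are assumptions for this illustration.

```python
import math


def to_01(ic50, max_ic50=50000.0):
    """Map an IC50 in nM to [0, 1]; larger means tighter (assumed transform)."""
    return min(1.0, max(0.0, 1.0 - math.log(ic50) / math.log(max_ic50)))


def mse_with_inequalities(predictions, actual, inequalities=None):
    """MSE on 0-1 transformed affinities, ignoring satisfied inequalities."""
    if inequalities is None:
        inequalities = ["="] * len(actual)
    squared = []
    for pred_nm, actual_nm, inequality in zip(predictions, actual, inequalities):
        deviation = to_01(pred_nm) - to_01(actual_nm)
        # Inequalities are stated in nM, so their direction flips after the
        # transform: ">" (weaker than measured) is satisfied when the
        # transformed prediction is lower than the transformed measurement.
        if inequality == ">" and deviation < 0:
            deviation = 0.0
        elif inequality == "<" and deviation > 0:
            deviation = 0.0
        squared.append(deviation ** 2)
    return sum(squared) / len(squared)


# Made-up values in nM: the prediction respects the ">" measurement.
satisfied = mse_with_inequalities([40000.0], [20000.0], [">"])
violated = mse_with_inequalities([100.0], [500.0], [">"])
```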
-
mhcflurry.select_pan_allele_models_command.
run
(argv=sys.argv[1:])[source]¶
-
mhcflurry.select_pan_allele_models_command.
model_select
(fold_num, models, min_models, max_models, constant_data={})[source]¶ Model select for a fold.
- Parameters
- fold_numint
- modelslist of Class1NeuralNetwork
- min_modelsint
- max_modelsint
- constant_datadict
- Returns
- dict with keys ‘fold_num’, ‘selected_indices’, ‘summary’
mhcflurry.select_processing_models_command module¶
Model select antigen processing models.
APPROACH: For each training fold, we select at least min and at most max models (where min and max are set by the --{min/max}-models-per-fold argument) using a step-up (forward) selection procedure. The final ensemble is the union of all selected models across all folds. AUC is used as the metric.
-
mhcflurry.select_processing_models_command.
run
(argv=sys.argv[1:])[source]¶
-
mhcflurry.select_processing_models_command.
model_select
(fold_num, models, min_models, max_models, constant_data={})[source]¶ Model select for a fold.
- Parameters
- fold_numint
- modelslist of Class1NeuralNetwork
- min_modelsint
- max_modelsint
- constant_datadict
- Returns
- dict with keys ‘fold_num’, ‘selected_indices’, ‘summary’
mhcflurry.testing_utils module¶
Utilities used in MHCflurry unit tests.
mhcflurry.train_allele_specific_models_command module¶
Train Class1 single allele models.
-
mhcflurry.train_allele_specific_models_command.
run
(argv=sys.argv[1:])[source]¶
mhcflurry.train_pan_allele_models_command module¶
Train Class1 pan-allele models.
-
mhcflurry.train_pan_allele_models_command.
assign_folds
(df, num_folds, held_out_fraction, held_out_max)[source]¶ Split training data into multiple test/train pairs, which we refer to as folds. Note that a given data point may be assigned to multiple test or train sets; these folds are NOT a non-overlapping partition as used in cross validation.
A fold is defined by a boolean value for each data point, indicating whether it is included in the training data for that fold. If it’s not in the training data, then it’s in the test data.
Folds are balanced in terms of allele content.
- Parameters
- dfpandas.DataFrame
training data
- num_foldsint
- held_out_fractionfloat
Fraction of data to hold out as test data in each fold
- held_out_max
For a given allele, do not hold out more than held_out_max number of data points in any fold.
- Returns
- pandas.DataFrame
index is same as df.index, columns are “fold_0”, … “fold_N” giving whether the data point is in the training data for the fold
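The fold layout described here (one boolean per data point and fold, with independent hold-outs per fold) can be sketched without pandas. The function name, the plain-dict return type, and the per-allele sampling details are assumptions for this example.

```python
import random
from collections import defaultdict


def sketch_assign_folds(alleles, num_folds, held_out_fraction, held_out_max,
                        seed=0):
    """Return {"fold_i": [bool, ...]} where True means "in training data"."""
    rng = random.Random(seed)
    rows_by_allele = defaultdict(list)
    for row, allele in enumerate(alleles):
        rows_by_allele[allele].append(row)

    folds = {}
    for fold in range(num_folds):
        in_train = [True] * len(alleles)
        for rows in rows_by_allele.values():
            # Hold out a fraction of each allele's rows, capped at held_out_max.
            num_held_out = min(int(len(rows) * held_out_fraction), held_out_max)
            for row in rng.sample(rows, num_held_out):
                in_train[row] = False
        folds["fold_%d" % fold] = in_train
    return folds


# Made-up allele labels: a small allele and a large one.
alleles = ["A"] * 10 + ["B"] * 100
folds = sketch_assign_folds(
    alleles, num_folds=3, held_out_fraction=0.25, held_out_max=5)
```

Because each fold samples its held-out rows independently, a data point can be held out in several folds or in none, which is why the folds are not a partition.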
-
mhcflurry.train_pan_allele_models_command.
pretrain_data_iterator
(filename, master_allele_encoding, peptides_per_chunk=1024)[source]¶ Step through a CSV file giving predictions for a large number of peptides (rows) and alleles (columns).
- Parameters
- filenamestring
- master_allele_encodingAlleleEncoding
- peptides_per_chunkint
- Returns
- Generator of (AlleleEncoding, EncodableSequences, float affinities) tuples
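Stepping through a large peptides-by-alleles table in fixed-size chunks can be sketched with the standard csv module. The layout assumed here (peptide in the first column, one column per allele) and the function name are illustrative; the sketch yields plain lists rather than AlleleEncoding or EncodableSequences objects.

```python
import csv
import io


def iterate_prediction_chunks(handle, peptides_per_chunk=1024):
    """Yield (alleles, peptides, rows_of_affinities) for successive chunks."""
    reader = csv.reader(handle)
    header = next(reader)
    alleles = header[1:]  # first column is assumed to hold the peptide
    peptides, affinities = [], []
    for row in reader:
        peptides.append(row[0])
        affinities.append([float(value) for value in row[1:]])
        if len(peptides) == peptides_per_chunk:
            yield alleles, peptides, affinities
            peptides, affinities = [], []
    if peptides:
        # Final, possibly shorter, chunk.
        yield alleles, peptides, affinities


# Made-up values, for illustration only.
text = (
    "peptide,HLA-A*02:01,HLA-B*57:01\n"
    "SIINFEKL,120.0,9000.0\n"
    "DENDREKLLL,30000.0,45.0\n"
    "SIINFEKLL,800.0,2500.0\n"
)
chunks = list(iterate_prediction_chunks(io.StringIO(text), peptides_per_chunk=2))
```

A generator keeps memory use bounded by the chunk size, which matters when the file is too large to load at once.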
-
mhcflurry.train_pan_allele_models_command.
run
(argv=sys.argv[1:])[source]¶
-
mhcflurry.train_pan_allele_models_command.
train_model
(work_item_name, work_item_num, num_work_items, architecture_num, num_architectures, fold_num, num_folds, replicate_num, num_replicates, hyperparameters, pretrain_data_filename, verbose, progress_print_interval, predictor, save_to, constant_data={})[source]¶
mhcflurry.train_presentation_models_command module¶
Train Class1 presentation models.
mhcflurry.train_processing_models_command module¶
Train Class1 processing models.
-
mhcflurry.train_processing_models_command.
assign_folds
(df, num_folds, held_out_samples)[source]¶ Split training data into multiple test/train pairs, which we refer to as folds. Note that a given data point may be assigned to multiple test or train sets; these folds are NOT a non-overlapping partition as used in cross validation.
A fold is defined by a boolean value for each data point, indicating whether it is included in the training data for that fold. If it’s not in the training data, then it’s in the test data.
- Parameters
- dfpandas.DataFrame
training data
- num_foldsint
- held_out_samplesint
- Returns
- pandas.DataFrame
index is same as df.index, columns are “fold_0”, … “fold_N” giving whether the data point is in the training data for the fold