Python library tutorial¶
The MHCflurry Python API exposes additional options and features beyond those supported by the commandline tools and can be more convenient for interactive analyses and bioinformatic pipelines. This tutorial gives a basic overview of the most important functionality. See the API Documentation for further details.
Loading a predictor¶
Most prediction tasks can be performed using the
Class1PresentationPredictor
class, which provides a programmatic API
to the functionality in the mhcflurry-predict and
mhcflurry-predict-scan commands.
Instances of Class1PresentationPredictor
wrap a
Class1AffinityPredictor
to generate binding affinity predictions
and a Class1ProcessingPredictor
to generate antigen processing
predictions. The presentation score is computed using a logistic regression
model over binding affinity and processing predictions.
Use the load
static method to load a
trained predictor from disk. With no arguments this method will load the predictor
released with MHCflurry (see Downloading models). If you pass a path to a
models directory, then it will load that predictor instead.
>>> from mhcflurry import Class1PresentationPredictor
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.supported_alleles[:5]
['Atbe-B*01:01', 'Atbe-E*03:01', 'Atbe-G*03:01', 'Atbe-G*03:02', 'Atbe-G*06:01']
Predicting for individual peptides¶
To generate predictions for individual peptides, we can use the
predict
method of the Class1PresentationPredictor
,
loaded above. This method returns a pandas.DataFrame
with binding affinity, processing, and presentation
predictions:
>>> predictor.predict(
... peptides=["SIINFEKL", "NLVPMVATV"],
... alleles=["HLA-A0201", "HLA-A0301"],
... verbose=0)
peptide peptide_num sample_name affinity best_allele processing_score presentation_score
0 SIINFEKL 0 sample1 12906.786173 HLA-A0201 0.101473 0.012503
1 NLVPMVATV 1 sample1 15.038358 HLA-A0201 0.676289 0.975463
Here, the list of alleles is taken to be an individual’s MHC I genotype (i.e. up to 6 alleles), and the strongest binder across alleles for each peptide is reported.
Note
MHCflurry normalizes allele names using the mhcnames
package. Names like HLA-A0201
or A*02:01
will be
normalized to HLA-A*02:01
, so most naming conventions can be used
with methods such as predict
.
If you have multiple sample genotypes, you can pass a dict, where the keys are arbitrary sample names:
>>> predictor.predict(
... peptides=["KSEYMTSWFY", "NLVPMVATV"],
... alleles={
... "sample1": ["A0201", "A0301", "B0702", "B4402", "C0201", "C0702"],
... "sample2": ["A0101", "A0206", "B5701", "C0202"],
... },
... verbose=0)
peptide peptide_num sample_name affinity best_allele processing_score presentation_score
0 KSEYMTSWFY 0 sample1 16737.745268 A0301 0.381632 0.026550
1 NLVPMVATV 1 sample1 15.038358 A0201 0.676289 0.975463
2 KSEYMTSWFY 0 sample2 62.540779 A0101 0.381632 0.796731
3 NLVPMVATV 1 sample2 15.765500 A0206 0.676289 0.974439
Here the strongest binder for each sample / peptide pair is returned.
Many users will focus on the binding affinity predictions, as the
processing and presentation predictions are experimental. If you do use the latter
scores, however, when available you should provide the upstream (N-flank)
and downstream (C-flank) sequences from the source proteins of the peptides for
a small boost in accuracy. To do so, specify the n_flank
and c_flank
arguments, which give the flanking sequences for the corresponding peptides:
>>> predictor.predict(
... peptides=["KSEYMTSWFY", "NLVPMVATV"],
... n_flanks=["NNNNNNN", "SSSSSSSS"],
... c_flanks=["CCCCCCCC", "YYYAAAA"],
... alleles={
... "sample1": ["A0201", "A0301", "B0702", "B4402", "C0201", "C0702"],
... "sample2": ["A0101", "A0206", "B5701", "C0202"],
... },
... verbose=0)
peptide n_flank c_flank peptide_num sample_name affinity best_allele processing_score presentation_score
0 KSEYMTSWFY NNNNNNN CCCCCCCC 0 sample1 16737.745268 A0301 0.605816 0.056190
1 NLVPMVATV SSSSSSSS YYYAAAA 1 sample1 15.038358 A0201 0.824994 0.986719
2 KSEYMTSWFY NNNNNNN CCCCCCCC 0 sample2 62.540779 A0101 0.605816 0.897493
3 NLVPMVATV SSSSSSSS YYYAAAA 1 sample2 15.765500 A0206 0.824994 0.986155
Scanning protein sequences¶
The predict_sequences
method supports
scanning protein sequences for MHC ligands. Here’s an example to identify all
peptides with a predicted binding affinity of 500 nM or tighter to any allele
across two sample genotypes and two short peptide sequences.
>>> predictor.predict_sequences(
... sequences={
... 'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
... 'protein2': "SSLPTPEDKEQAQQTHH",
... },
... alleles={
... "sample1": ["A0201", "A0301", "B0702"],
... "sample2": ["A0101", "C0202"],
... },
... result="filtered",
... comparison_quantity="affinity",
... filter_value=500,
... verbose=0)
sequence_name pos peptide n_flank c_flank sample_name affinity best_allele affinity_percentile processing_score presentation_score
0 protein1 13 LLLLVVSNL MDSKGSSQKGSRL L sample1 38.206225 A0201 0.380125 0.017644 0.571060
1 protein1 14 LLLVVSNLL MDSKGSSQKGSRLL sample1 42.243472 A0201 0.420250 0.090984 0.619213
2 protein1 5 SSQKGSRLL MDSKG LLLVVSNLL sample2 66.749223 C0202 0.803375 0.383608 0.774468
3 protein1 6 SQKGSRLLL MDSKGS LLVVSNLL sample2 178.033467 C0202 1.820000 0.275019 0.482206
4 protein1 13 LLLLVVSNLL MDSKGSSQKGSRL sample1 202.208167 A0201 1.112500 0.058782 0.261320
5 protein1 12 LLLLLVVSNL MDSKGSSQKGSR L sample1 202.506582 A0201 1.112500 0.010025 0.225648
6 protein2 0 SSLPTPEDK EQAQQTHH sample1 335.529377 A0301 1.011750 0.010443 0.156798
7 protein2 0 SSLPTPEDK EQAQQTHH sample2 353.451759 C0202 2.674250 0.010443 0.150753
8 protein1 8 KGSRLLLLL MDSKGSSQ VVSNLL sample2 410.327286 C0202 2.887000 0.121374 0.194081
9 protein1 5 SSQKGSRL MDSKG LLLLVVSNLL sample2 477.285937 C0202 3.107375 0.111982 0.168572
When using predict_sequences
, the flanking sequences for each peptide are
automatically included in the processing and presentation predictions.
See the documentation for Class1PresentationPredictor
for other
useful methods.
Lower level interfaces¶
The Class1PresentationPredictor
delegates to a
Class1AffinityPredictor
instance for binding affinity predictions.
If all you need are binding affinities, you can use this instance directly.
Here’s an example:
>>> from mhcflurry import Class1AffinityPredictor
>>> predictor = Class1AffinityPredictor.load()
>>> predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])
peptide allele prediction prediction_low prediction_high prediction_percentile
0 SIINFEKL HLA-A0201 12906.786173 8829.460289 18029.923061 6.566375
1 SIINFEQL HLA-A0201 13025.300796 9050.056312 18338.004869 6.623625
The prediction_low
and prediction_high
fields give the 5-95 percentile
predictions across the models in the ensemble. This detailed information is not
available through the higher-level Class1PresentationPredictor
interface.
Under the hood, Class1AffinityPredictor
itself delegates to an ensemble of
of Class1NeuralNetwork
instances, which implement the neural network
models used for prediction. To fit your own affinity prediction models, call
fit
.
You can similarly use Class1ProcessingPredictor
directly for
antigen processing prediction, and there is a low-level
Class1ProcessingNeuralNetwork
with a fit
method.
See the API documentation of these classes for details.