Python library tutorial

The MHCflurry Python API exposes additional options and features beyond those supported by the commandline tools and can be more convenient for interactive analyses and bioinformatic pipelines. This tutorial gives a basic overview of the most important functionality. See the API Documentation for further details.

Loading a predictor

Most prediction tasks can be performed using the Class1PresentationPredictor class, which provides a programmatic API to the functionality in the mhcflurry-predict and mhcflurry-predict-scan commands.

Instances of Class1PresentationPredictor wrap a Class1AffinityPredictor to generate binding affinity predictions and a Class1ProcessingPredictor to generate antigen processing predictions. The presentation score is computed using a logistic regression model over binding affinity and processing predictions.

Use the load static method to load a trained predictor from disk. With no arguments this method will load the predictor released with MHCflurry (see Downloading models). If you pass a path to a models directory, then it will load that predictor instead.

>>> from mhcflurry import Class1PresentationPredictor
>>> predictor = Class1PresentationPredictor.load()
>>> predictor.supported_alleles[:5]
['Atbe-B*01:01', 'Atbe-E*03:01', 'Atbe-G*03:01', 'Atbe-G*03:02', 'Atbe-G*06:01']

Predicting for individual peptides

To generate predictions for individual peptides, we can use the predict method of the Class1PresentationPredictor, loaded above. This method returns a pandas.DataFrame with binding affinity, processing, and presentation predictions:

>>> predictor.predict(
...     peptides=["SIINFEKL", "NLVPMVATV"],
...     alleles=["HLA-A0201", "HLA-A0301"],
...     verbose=0)
     peptide  peptide_num sample_name      affinity best_allele  processing_score  presentation_score
0   SIINFEKL            0     sample1  12906.786173   HLA-A0201          0.101473            0.012503
1  NLVPMVATV            1     sample1     15.038358   HLA-A0201          0.676289            0.975463

Here, the list of alleles is taken to be an individual’s MHC I genotype (i.e. up to 6 alleles), and the strongest binder across alleles for each peptide is reported.

Note

MHCflurry normalizes allele names using the mhcnames package. Names like HLA-A0201 or A*02:01 will be normalized to HLA-A*02:01, so most naming conventions can be used with methods such as predict.

If you have multiple sample genotypes, you can pass a dict, where the keys are arbitrary sample names:

>>> predictor.predict(
...     peptides=["KSEYMTSWFY", "NLVPMVATV"],
...     alleles={
...        "sample1": ["A0201", "A0301", "B0702", "B4402", "C0201", "C0702"],
...        "sample2": ["A0101", "A0206", "B5701", "C0202"],
...     },
...     verbose=0)
      peptide  peptide_num sample_name      affinity best_allele  processing_score  presentation_score
0  KSEYMTSWFY            0     sample1  16737.745268       A0301          0.381632            0.026550
1   NLVPMVATV            1     sample1     15.038358       A0201          0.676289            0.975463
2  KSEYMTSWFY            0     sample2     62.540779       A0101          0.381632            0.796731
3   NLVPMVATV            1     sample2     15.765500       A0206          0.676289            0.974439

Here the strongest binder for each sample / peptide pair is returned.

Many users will focus on the binding affinity predictions, as the processing and presentation predictions are experimental. If you do use the latter scores, however, when available you should provide the upstream (N-flank) and downstream (C-flank) sequences from the source proteins of the peptides for a small boost in accuracy. To do so, specify the n_flank and c_flank arguments, which give the flanking sequences for the corresponding peptides:

>>> predictor.predict(
...     peptides=["KSEYMTSWFY", "NLVPMVATV"],
...     n_flanks=["NNNNNNN", "SSSSSSSS"],
...     c_flanks=["CCCCCCCC", "YYYAAAA"],
...     alleles={
...        "sample1": ["A0201", "A0301", "B0702", "B4402", "C0201", "C0702"],
...        "sample2": ["A0101", "A0206", "B5701", "C0202"],
...     },
...     verbose=0)
      peptide   n_flank   c_flank  peptide_num sample_name      affinity best_allele  processing_score  presentation_score
0  KSEYMTSWFY   NNNNNNN  CCCCCCCC            0     sample1  16737.745268       A0301          0.605816            0.056190
1   NLVPMVATV  SSSSSSSS   YYYAAAA            1     sample1     15.038358       A0201          0.824994            0.986719
2  KSEYMTSWFY   NNNNNNN  CCCCCCCC            0     sample2     62.540779       A0101          0.605816            0.897493
3   NLVPMVATV  SSSSSSSS   YYYAAAA            1     sample2     15.765500       A0206          0.824994            0.986155

Scanning protein sequences

The predict_sequences method supports scanning protein sequences for MHC ligands. Here’s an example to identify all peptides with a predicted binding affinity of 500 nM or tighter to any allele across two sample genotypes and two short peptide sequences.

>>> predictor.predict_sequences(
...    sequences={
...        'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
...        'protein2': "SSLPTPEDKEQAQQTHH",
...    },
...    alleles={
...        "sample1": ["A0201", "A0301", "B0702"],
...        "sample2": ["A0101", "C0202"],
...    },
...    result="filtered",
...    comparison_quantity="affinity",
...    filter_value=500,
...    verbose=0)
  sequence_name  pos     peptide         n_flank     c_flank sample_name    affinity best_allele  affinity_percentile  processing_score  presentation_score
0      protein1   13   LLLLVVSNL   MDSKGSSQKGSRL           L     sample1   38.206225       A0201             0.380125          0.017644            0.571060
1      protein1   14   LLLVVSNLL  MDSKGSSQKGSRLL                 sample1   42.243472       A0201             0.420250          0.090984            0.619213
2      protein1    5   SSQKGSRLL           MDSKG   LLLVVSNLL     sample2   66.749223       C0202             0.803375          0.383608            0.774468
3      protein1    6   SQKGSRLLL          MDSKGS    LLVVSNLL     sample2  178.033467       C0202             1.820000          0.275019            0.482206
4      protein1   13  LLLLVVSNLL   MDSKGSSQKGSRL                 sample1  202.208167       A0201             1.112500          0.058782            0.261320
5      protein1   12  LLLLLVVSNL    MDSKGSSQKGSR           L     sample1  202.506582       A0201             1.112500          0.010025            0.225648
6      protein2    0   SSLPTPEDK                    EQAQQTHH     sample1  335.529377       A0301             1.011750          0.010443            0.156798
7      protein2    0   SSLPTPEDK                    EQAQQTHH     sample2  353.451759       C0202             2.674250          0.010443            0.150753
8      protein1    8   KGSRLLLLL        MDSKGSSQ      VVSNLL     sample2  410.327286       C0202             2.887000          0.121374            0.194081
9      protein1    5    SSQKGSRL           MDSKG  LLLLVVSNLL     sample2  477.285937       C0202             3.107375          0.111982            0.168572

When using predict_sequences, the flanking sequences for each peptide are automatically included in the processing and presentation predictions.

See the documentation for Class1PresentationPredictor for other useful methods.

Lower level interfaces

The Class1PresentationPredictor delegates to a Class1AffinityPredictor instance for binding affinity predictions. If all you need are binding affinities, you can use this instance directly.

Here’s an example:

>>> from mhcflurry import Class1AffinityPredictor
>>> predictor = Class1AffinityPredictor.load()
>>> predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])
    peptide     allele    prediction  prediction_low  prediction_high  prediction_percentile
0  SIINFEKL  HLA-A0201  12906.786173     8829.460289     18029.923061               6.566375
1  SIINFEQL  HLA-A0201  13025.300796     9050.056312     18338.004869               6.623625

The prediction_low and prediction_high fields give the 5-95 percentile predictions across the models in the ensemble. This detailed information is not available through the higher-level Class1PresentationPredictor interface.

Under the hood, Class1AffinityPredictor itself delegates to an ensemble of of Class1NeuralNetwork instances, which implement the neural network models used for prediction. To fit your own affinity prediction models, call fit.

You can similarly use Class1ProcessingPredictor directly for antigen processing prediction, and there is a low-level Class1ProcessingNeuralNetwork with a fit method.

See the API documentation of these classes for details.