Python library tutorial
=======================

The MHCflurry Python API exposes additional options and features beyond those
supported by the command-line tools and can be more convenient for interactive
analyses and bioinformatic pipelines. This tutorial gives a basic overview of
the most important functionality. See the :ref:`API-documentation` for further
details.

Loading a predictor
----------------------------------

Most prediction tasks can be performed using the
`~mhcflurry.Class1PresentationPredictor` class, which provides a programmatic
API to the functionality in the :ref:`mhcflurry-predict` and
:ref:`mhcflurry-predict-scan` commands.

Instances of `~mhcflurry.Class1PresentationPredictor` wrap a
`~mhcflurry.Class1AffinityPredictor` to generate binding affinity predictions
and a `~mhcflurry.Class1ProcessingPredictor` to generate antigen processing
predictions. The presentation score is computed using a logistic regression
model over binding affinity and processing predictions.

Use the `~mhcflurry.Class1PresentationPredictor.load` static method to load a
trained predictor from disk. With no arguments, this method loads the predictor
released with MHCflurry (see :ref:`downloading`). If you pass a path to a
models directory, it loads that predictor instead.

.. doctest::

    >>> from mhcflurry import Class1PresentationPredictor
    >>> predictor = Class1PresentationPredictor.load()
    >>> predictor.supported_alleles[:5]
    ['Atbe-B*01:01', 'Atbe-E*03:01', 'Atbe-G*03:01', 'Atbe-G*03:02', 'Atbe-G*06:01']

Predicting for individual peptides
----------------------------------

To generate predictions for individual peptides, we can use the
`~mhcflurry.Class1PresentationPredictor.predict` method of the
`~mhcflurry.Class1PresentationPredictor` loaded above. This method returns a
`pandas.DataFrame` with binding affinity, processing, and presentation
predictions:

.. doctest::

    >>> predictor.predict(
    ...     peptides=["SIINFEKL", "NLVPMVATV"],
    ...     alleles=["HLA-A0201", "HLA-A0301"],
    ...     verbose=0)
         peptide  peptide_num  sample_name      affinity  best_allele  processing_score  presentation_score
    0   SIINFEKL            0      sample1  12906.786173    HLA-A0201          0.101473            0.012503
    1  NLVPMVATV            1      sample1     15.038358    HLA-A0201          0.676289            0.975463

Here, the list of alleles is taken to be an individual's MHC I genotype (i.e.
up to 6 alleles), and the strongest binder across alleles for each peptide is
reported.

.. note::

    MHCflurry normalizes allele names using the ``mhcnames`` package. Names
    like ``HLA-A0201`` or ``A*02:01`` will be normalized to ``HLA-A*02:01``,
    so most naming conventions can be used with methods such as
    `~mhcflurry.Class1PresentationPredictor.predict`.

If you have multiple sample genotypes, you can pass a dict, where the keys are
arbitrary sample names:

.. doctest::

    >>> predictor.predict(
    ...     peptides=["KSEYMTSWFY", "NLVPMVATV"],
    ...     alleles={
    ...         "sample1": ["A0201", "A0301", "B0702", "B4402", "C0201", "C0702"],
    ...         "sample2": ["A0101", "A0206", "B5701", "C0202"],
    ...     },
    ...     verbose=0)
          peptide  peptide_num  sample_name      affinity  best_allele  processing_score  presentation_score
    0  KSEYMTSWFY            0      sample1  16737.745268        A0301          0.381632            0.026550
    1   NLVPMVATV            1      sample1     15.038358        A0201          0.676289            0.975463
    2  KSEYMTSWFY            0      sample2     62.540779        A0101          0.381632            0.796731
    3   NLVPMVATV            1      sample2     15.765500        A0206          0.676289            0.974439

Here the strongest binder for each sample / peptide pair is returned.

Many users will focus on the binding affinity predictions, as the processing
and presentation predictions are experimental. If you do use the latter
scores, provide the upstream (N-flank) and downstream (C-flank) sequences from
the source proteins of the peptides whenever they are available, for a small
boost in accuracy. To do so, specify the ``n_flanks`` and ``c_flanks``
arguments, which give the flanking sequences for the corresponding peptides:

.. doctest::

    >>> predictor.predict(
    ...     peptides=["KSEYMTSWFY", "NLVPMVATV"],
    ...     n_flanks=["NNNNNNN", "SSSSSSSS"],
    ...     c_flanks=["CCCCCCCC", "YYYAAAA"],
    ...     alleles={
    ...         "sample1": ["A0201", "A0301", "B0702", "B4402", "C0201", "C0702"],
    ...         "sample2": ["A0101", "A0206", "B5701", "C0202"],
    ...     },
    ...     verbose=0)
          peptide   n_flank   c_flank  peptide_num  sample_name      affinity  best_allele  processing_score  presentation_score
    0  KSEYMTSWFY   NNNNNNN  CCCCCCCC            0      sample1  16737.745268        A0301          0.605816            0.056190
    1   NLVPMVATV  SSSSSSSS   YYYAAAA            1      sample1     15.038358        A0201          0.824994            0.986719
    2  KSEYMTSWFY   NNNNNNN  CCCCCCCC            0      sample2     62.540779        A0101          0.605816            0.897493
    3   NLVPMVATV  SSSSSSSS   YYYAAAA            1      sample2     15.765500        A0206          0.824994            0.986155

Scanning protein sequences
--------------------------

The `~mhcflurry.Class1PresentationPredictor.predict_sequences` method supports
scanning protein sequences for MHC ligands. Here's an example that identifies
all peptides with a predicted binding affinity of 500 nM or tighter to any
allele across two sample genotypes and two short protein sequences.

.. doctest::

    >>> predictor.predict_sequences(
    ...     sequences={
    ...         'protein1': "MDSKGSSQKGSRLLLLLVVSNLL",
    ...         'protein2': "SSLPTPEDKEQAQQTHH",
    ...     },
    ...     alleles={
    ...         "sample1": ["A0201", "A0301", "B0702"],
    ...         "sample2": ["A0101", "C0202"],
    ...     },
    ...     result="filtered",
    ...     comparison_quantity="affinity",
    ...     filter_value=500,
    ...     verbose=0)
       sequence_name  pos     peptide         n_flank     c_flank  sample_name    affinity  best_allele  affinity_percentile  processing_score  presentation_score
    0       protein1   13   LLLLVVSNL   MDSKGSSQKGSRL           L      sample1   38.206225        A0201             0.380125          0.017644            0.571060
    1       protein1   14   LLLVVSNLL  MDSKGSSQKGSRLL                  sample1   42.243472        A0201             0.420250          0.090984            0.619213
    2       protein1    5   SSQKGSRLL           MDSKG   LLLVVSNLL      sample2   66.749223        C0202             0.803375          0.383608            0.774468
    3       protein1    6   SQKGSRLLL          MDSKGS    LLVVSNLL      sample2  178.033467        C0202             1.820000          0.275019            0.482206
    4       protein1   13  LLLLVVSNLL   MDSKGSSQKGSRL                  sample1  202.208167        A0201             1.112500          0.058782            0.261320
    5       protein1   12  LLLLLVVSNL    MDSKGSSQKGSR           L      sample1  202.506582        A0201             1.112500          0.010025            0.225648
    6       protein2    0   SSLPTPEDK                    EQAQQTHH      sample1  335.529377        A0301             1.011750          0.010443            0.156798
    7       protein2    0   SSLPTPEDK                    EQAQQTHH      sample2  353.451759        C0202             2.674250          0.010443            0.150753
    8       protein1    8   KGSRLLLLL        MDSKGSSQ      VVSNLL      sample2  410.327286        C0202             2.887000          0.121374            0.194081
    9       protein1    5    SSQKGSRL           MDSKG  LLLLVVSNLL      sample2  477.285937        C0202             3.107375          0.111982            0.168572

When using ``predict_sequences``, the flanking sequences for each peptide are
automatically included in the processing and presentation predictions.

See the documentation for `~mhcflurry.Class1PresentationPredictor` for other
useful methods.

Lower level interfaces
----------------------------------

The `~mhcflurry.Class1PresentationPredictor` delegates to a
`~mhcflurry.Class1AffinityPredictor` instance for binding affinity
predictions. If all you need are binding affinities, you can use this instance
directly.

Here's an example:

.. doctest::

    >>> from mhcflurry import Class1AffinityPredictor
    >>> predictor = Class1AffinityPredictor.load()
    >>> predictor.predict_to_dataframe(allele="HLA-A0201", peptides=["SIINFEKL", "SIINFEQL"])
        peptide     allele    prediction  prediction_low  prediction_high  prediction_percentile
    0  SIINFEKL  HLA-A0201  12906.786173     8829.460289     18029.923061               6.566375
    1  SIINFEQL  HLA-A0201  13025.300796     9050.056312     18338.004869               6.623625

The ``prediction_low`` and ``prediction_high`` fields give the 5th and 95th
percentile predictions across the models in the ensemble. This detailed
information is not available through the higher-level
`~mhcflurry.Class1PresentationPredictor` interface.

Under the hood, `~mhcflurry.Class1AffinityPredictor` itself delegates to an
ensemble of `~mhcflurry.Class1NeuralNetwork` instances, which implement the
neural network models used for prediction. To fit your own affinity prediction
models, call `~mhcflurry.Class1NeuralNetwork.fit`.

You can similarly use `~mhcflurry.Class1ProcessingPredictor` directly for
antigen processing prediction, and there is a low-level
`~mhcflurry.Class1ProcessingNeuralNetwork` with a
`~mhcflurry.Class1ProcessingNeuralNetwork.fit` method.

See the API documentation of these classes for details.
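
To build intuition for the ``pos``, ``peptide``, ``n_flank`` and ``c_flank`` columns in the ``predict_sequences`` output shown earlier, the following is a minimal pure-Python sketch of how candidate peptides and their flanks can be enumerated from a protein sequence. It is not MHCflurry's implementation: ``scan_peptides`` is a hypothetical helper, and the peptide lengths (8 to 11) and the flank cap (15 residues) are assumptions chosen for illustration.

```python
def scan_peptides(sequence, lengths=(8, 9, 10, 11), max_flank=15):
    """Yield (pos, peptide, n_flank, c_flank) for every window in `sequence`.

    Hypothetical helper for illustration only; not part of the MHCflurry API.
    `lengths` and `max_flank` are assumed values, not documented defaults.
    """
    for length in lengths:
        # Slide a window of the given length over the sequence.
        for pos in range(len(sequence) - length + 1):
            end = pos + length
            yield (
                pos,
                sequence[pos:end],
                sequence[max(0, pos - max_flank):pos],  # upstream residues
                sequence[end:end + max_flank],          # downstream residues
            )


windows = list(scan_peptides("MDSKGSSQKGSRLLLLLVVSNLL"))
```

Each tuple corresponds to one candidate row before any prediction or filtering is applied. For instance, the window at position 13 with length 9 yields the peptide ``LLLLVVSNL`` with N-flank ``MDSKGSSQKGSRL`` and C-flank ``L``, matching the first row of the scan output above.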