Command-line reference

See also the tutorial.

mhcflurry-predict

Run MHCflurry predictor on specified peptides.

By default, the presentation predictor is used, and predictions for MHC I binding affinity, antigen processing, and the composite presentation score are returned. If you just want binding affinity predictions, pass --affinity-only.

Examples:

Write a CSV file containing the contents of INPUT.csv plus additional columns giving MHCflurry predictions:

$ mhcflurry-predict INPUT.csv --out RESULT.csv

The input CSV file is expected to contain columns “allele”, “peptide”, and, optionally, “n_flank”, and “c_flank”.
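For example, a minimal input file and invocation might look like the following sketch. The alleles and peptides here are illustrative, and the predictor call is guarded so the snippet is safe to run even where MHCflurry is not installed:

```shell
# Create a minimal input CSV with the expected "allele" and "peptide"
# columns ("n_flank" and "c_flank" are optional).
cat > INPUT.csv <<'EOF'
allele,peptide
HLA-A0201,SIINFEKL
H-2Kb,DENDREKLLL
EOF

# Run the predictor only if it is installed; predictions are written
# to RESULT.csv alongside the input columns.
if command -v mhcflurry-predict >/dev/null 2>&1; then
    mhcflurry-predict INPUT.csv --out RESULT.csv
fi
```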

If --out is not specified, results are written to stdout.

You can also run on alleles and peptides specified on the command line, in which case predictions are written for all combinations of alleles and peptides:

$ mhcflurry-predict --alleles HLA-A0201 H-2Kb --peptides SIINFEKL DENDREKLLL

Instead of individual alleles (in a CSV or on the command line), you can also give a comma-separated list of alleles specifying a sample genotype. In this case, the tightest binding affinity across the alleles in the genotype is returned. For example:

$ mhcflurry-predict --peptides SIINFEKL DENDREKLLL --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:01,HLA-C*03:01

will give the tightest predicted affinities across alleles for each of the two genotypes specified for each peptide.

usage: mhcflurry-predict [-h] [--list-supported-alleles] [--list-supported-peptide-lengths] [--version] [--alleles ALLELE [ALLELE ...]]
                         [--peptides PEPTIDE [PEPTIDE ...]] [--allele-column NAME] [--peptide-column NAME] [--n-flank-column NAME] [--c-flank-column NAME]
                         [--no-throw] [--out OUTPUT.csv] [--prediction-column-prefix NAME] [--output-delimiter CHAR] [--no-affinity-percentile]
                         [--always-include-best-allele] [--models DIR] [--affinity-only] [--no-flanking]
                         [INPUT.csv]
input.csv

Input CSV

-h, --help

Show this help message and exit

--list-supported-alleles

Prints the list of supported alleles and exits

--list-supported-peptide-lengths

Prints the list of supported peptide lengths and exits

--version

show program’s version number and exit

--alleles <allele>

Alleles to predict (exclusive with passing an input CSV)

--peptides <peptide>

Peptides to predict (exclusive with passing an input CSV)

--allele-column <name>

Input column name for alleles. Default: ‘allele’

--peptide-column <name>

Input column name for peptides. Default: ‘peptide’

--n-flank-column <name>

Column giving N-terminal flanking sequence. Default: ‘n_flank’

--c-flank-column <name>

Column giving C-terminal flanking sequence. Default: ‘c_flank’

--no-throw

Return NaNs for unsupported alleles or peptides instead of raising

--out <output.csv>

Output CSV

--prediction-column-prefix <name>

Prefix for output column names. Default: ‘mhcflurry_’

--output-delimiter <char>

Delimiter character for results. Default: ‘,’

--no-affinity-percentile

Do not include affinity percentile rank

--always-include-best-allele

Always include the best_allele column even when it is identical to the allele column (i.e. all queries are monoallelic).

--models <dir>

Directory containing models. Either a binding affinity predictor or a presentation predictor can be used. Default: /Users/tim/Library/Application Support/mhcflurry/4/2.0.0/models_class1_presentation/models

--affinity-only

Affinity prediction only (no antigen processing or presentation)

--no-flanking

Do not use flanking sequence information even when available

mhcflurry-predict-scan

Scan protein sequences using the MHCflurry presentation predictor.

By default, sub-sequences (peptides) with affinity percentile ranks less than 2.0 are returned. You can also specify --results-all to return predictions for all peptides, or --results-best to return the top peptide for each sequence.

Examples:

Scan a set of sequences in a FASTA file for binders to any alleles in a MHC I genotype:

$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02

Instead of a FASTA, you can also pass a CSV that has “sequence_id” and “sequence” columns.

You can also specify multiple MHC I genotypes to scan as space-separated arguments to the --alleles option:

$ mhcflurry-predict-scan test/data/example.fasta --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02 HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01

If --out is not specified, results are written to stdout.

You can also specify sequences on the command line:

$ mhcflurry-predict-scan --sequences MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT --alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02
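As a worked sketch, the FASTA workflow can be reproduced end to end. The FASTA contents here are illustrative, and the scan itself is guarded so the snippet does not fail where MHCflurry is absent; note the allele list is quoted so the shell does not glob-expand the `*` characters:

```shell
# Write a small FASTA file to scan.
cat > example.fasta <<'EOF'
>protein1
MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT
EOF

# Scan against one genotype if mhcflurry-predict-scan is installed;
# filtered results land in scan_results.csv.
if command -v mhcflurry-predict-scan >/dev/null 2>&1; then
    mhcflurry-predict-scan example.fasta \
        --alleles 'HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:01,HLA-C*07:02' \
        --out scan_results.csv
fi
```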

usage: mhcflurry-predict-scan [-h] [--list-supported-alleles] [--list-supported-peptide-lengths] [--version] [--input-format {guess,csv,fasta}]
                              [--alleles ALLELE [ALLELE ...]] [--sequences SEQ [SEQ ...]] [--sequence-id-column NAME] [--sequence-column NAME] [--no-throw]
                              [--peptide-lengths L] [--results-all] [--results-best {presentation_score,processing_score,affinity,affinity_percentile}]
                              [--results-filtered {presentation_score,processing_score,affinity,affinity_percentile}]
                              [--threshold-presentation-score THRESHOLD_PRESENTATION_SCORE] [--threshold-processing-score THRESHOLD_PROCESSING_SCORE]
                              [--threshold-affinity THRESHOLD_AFFINITY] [--threshold-affinity-percentile THRESHOLD_AFFINITY_PERCENTILE] [--out OUTPUT.csv]
                              [--output-delimiter CHAR] [--no-affinity-percentile] [--models DIR] [--no-flanking]
                              [INPUT]
input

Input CSV or FASTA

-h, --help

Show this help message and exit

--list-supported-alleles

Prints the list of supported alleles and exits

--list-supported-peptide-lengths

Prints the list of supported peptide lengths and exits

--version

show program’s version number and exit

--input-format {guess,csv,fasta}

Format of input file. By default, it is guessed from the file extension.

--alleles <allele>

Alleles to predict

--sequences <seq>

Sequences to predict (exclusive with passing an input file)

--sequence-id-column <name>

Input CSV column name for sequence IDs. Default: ‘sequence_id’

--sequence-column <name>

Input CSV column name for sequences. Default: ‘sequence’

--no-throw

Return NaNs for unsupported alleles or peptides instead of raising

--peptide-lengths <l>

Peptide lengths to consider. Pass as START-END (e.g. 8-11) or a comma-separated list (8,9,10,11). When using START-END, the range is INCLUSIVE on both ends. Default: 8-11.

--results-all

Return results for all peptides regardless of affinity, etc.

--results-best {presentation_score,processing_score,affinity,affinity_percentile}

Take the top result for each sequence according to the specified predicted quantity

--results-filtered {presentation_score,processing_score,affinity,affinity_percentile}

Filter results by the specified quantity.

--threshold-presentation-score <threshold_presentation_score>

Threshold if filtering by presentation score. Default: 0.7

--threshold-processing-score <threshold_processing_score>

Threshold if filtering by processing score. Default: 0.5

--threshold-affinity <threshold_affinity>

Threshold if filtering by affinity. Default: 500

--threshold-affinity-percentile <threshold_affinity_percentile>

Threshold if filtering by affinity percentile. Default: 2.0

--out <output.csv>

Output CSV

--output-delimiter <char>

Delimiter character for results. Default: ‘,’

--no-affinity-percentile

Do not include affinity percentile rank

--models <dir>

Directory containing presentation models. Default: /Users/tim/Library/Application Support/mhcflurry/4/2.0.0/models_class1_presentation/models

--no-flanking

Do not use flanking sequence information in predictions

mhcflurry-downloads

Download MHCflurry released datasets and trained models.

Examples

Fetch the default downloads:

$ mhcflurry-downloads fetch

Fetch a specific download:

$ mhcflurry-downloads fetch models_class1_pan

Get the path to a download:

$ mhcflurry-downloads path models_class1_pan

Get the URL of a download:

$ mhcflurry-downloads url models_class1_pan

Summarize available and fetched downloads:

$ mhcflurry-downloads info
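The path subcommand composes naturally with the --models option of the prediction tools. Here is a sketch, assuming the models_class1_presentation download name that appears in the default model paths above; it is guarded so it degrades gracefully when MHCflurry is not installed:

```shell
# Resolve the local directory of a fetched download, falling back to a
# placeholder when mhcflurry-downloads is unavailable.
if command -v mhcflurry-downloads >/dev/null 2>&1; then
    MODELS_DIR="$(mhcflurry-downloads path models_class1_presentation)"
else
    MODELS_DIR="(mhcflurry not installed)"
fi
echo "Models directory: $MODELS_DIR"

# The resolved directory can then be passed to the predictors, e.g.:
#   mhcflurry-predict INPUT.csv --models "$MODELS_DIR/models"
```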

usage: mhcflurry-downloads [-h] [--quiet] [--verbose] {fetch,info,path,url} ...
-h, --help

show this help message and exit

--quiet

Output less

--verbose, -v

Output more

mhcflurry-downloads fetch

usage: mhcflurry-downloads fetch [-h] [--keep] [--release RELEASE] [--already-downloaded-dir DIR] [DOWNLOAD [DOWNLOAD ...]]
download

Items to download

-h, --help

show this help message and exit

--keep

Don’t delete archives after they are extracted

--release <release>

Release to download. Default: 2.0.0

--already-downloaded-dir <dir>

Don’t download files, get them from DIR

mhcflurry-downloads info

usage: mhcflurry-downloads info [-h]
-h, --help

show this help message and exit

mhcflurry-downloads path

usage: mhcflurry-downloads path [-h] [download_name]
download_name
-h, --help

show this help message and exit

mhcflurry-downloads url

usage: mhcflurry-downloads url [-h] [download_name]
download_name
-h, --help

show this help message and exit

mhcflurry-class1-train-allele-specific-models

Train Class1 single allele models.
-h, --help

show this help message and exit

--data <file.csv>

Training data CSV. Expected columns: allele, peptide, measurement_value

--out-models-dir <dir>

Directory to write models and manifest

--hyperparameters <file.json>

JSON or YAML of hyperparameters

--allele <allele>

Alleles to train models for. If not specified, all alleles with enough measurements will be used.

--min-measurements-per-allele <n>

Train models for alleles with >=N measurements.

--held-out-fraction-reciprocal <n>

Hold out 1/N of the data (e.g., for subsequent model selection). For example, specify 5 to hold out 20 percent of the data.

--held-out-fraction-seed <n>

Seed for randomizing which measurements are held out. Only matters when --held-out-fraction-reciprocal is specified. Default: 0.

--ignore-inequalities

Do not use affinity value inequalities even when present in data

--n-models <n>

Ensemble size, i.e. how many models to train for each architecture. If specified here it overrides any ‘n_models’ specified in the hyperparameters.

--max-epochs <n>

Max training epochs. If specified here it overrides any ‘max_epochs’ specified in the hyperparameters.

--allele-sequences <file.csv>

Allele sequences file. Used for computing allele similarity matrix.

--save-interval <n>

Write models to disk every N seconds. Only affects parallel runs; serial runs write each model to disk as it is trained.

--verbosity <verbosity>

Keras verbosity. Default: 0

--num-jobs <n>

Number of local processes to parallelize training over. Set to 0 for serial run. Default: 0.

--backend {tensorflow-gpu,tensorflow-cpu,tensorflow-default}

Keras backend. If not specified will use system default.

--gpus <n>

Number of GPUs to attempt to parallelize across. Requires running in parallel.

--max-workers-per-gpu <n>

Maximum number of workers to assign to a GPU. Additional tasks will run on CPU.

--max-tasks-per-worker <n>

Restart workers after N tasks. Workaround for tensorflow memory leaks. Requires Python >=3.2.

--worker-log-dir <worker_log_dir>

Write worker stdout and stderr logs to given directory.

mhcflurry-class1-select-allele-specific-models

Model selection for Class1 single allele models.
-h, --help

show this help message and exit

--data <file.csv>

Model selection data CSV. Expected columns: allele, peptide, measurement_value

--exclude-data <file.csv>

Data to EXCLUDE from model selection. Useful to specify the original training data used

--models-dir <dir>

Directory to read models

--out-models-dir <dir>

Directory to write selected models

--out-unselected-predictions <file.csv>

Write predictions for validation data using unselected predictor to FILE.csv

--unselected-accuracy-scorer <scorer>
--unselected-accuracy-scorer-num-samples <unselected_accuracy_scorer_num_samples>
--unselected-accuracy-percentile-threshold <x>
--allele <allele>

Alleles to select models for. If not specified, all alleles with enough measurements will be used.

--combined-min-models <n>

Min number of models to select per allele when using combined selector

--combined-max-models <n>

Max number of models to select per allele when using combined selector

--combined-min-contribution-percent <x>

Use only model selectors that can contribute at least X% to the total score. Default: 1.0

--mass-spec-min-measurements <n>

Min number of measurements required for an allele to use mass-spec model selection

--mass-spec-min-models <n>

Min number of models to select per allele when using mass-spec selector

--mass-spec-max-models <n>

Max number of models to select per allele when using mass-spec selector

--mse-min-measurements <n>

Min number of measurements required for an allele to use MSE model selection

--mse-min-models <n>

Min number of models to select per allele when using MSE selector

--mse-max-models <n>

Max number of models to select per allele when using MSE selector

--scoring <scoring>

Scoring procedures to use in order

--consensus-min-models <n>

Min number of models to select per allele when using consensus selector

--consensus-max-models <n>

Max number of models to select per allele when using consensus selector

--consensus-num-peptides-per-length <consensus_num_peptides_per_length>

Number of peptides per length to use for consensus scoring

--mass-spec-regex <regex>

Regular expression for mass-spec data. Runs on the measurement_source column. Default: mass[- ]spec.

--verbosity <verbosity>

Keras verbosity. Default: 0

--num-jobs <n>

Number of local processes to parallelize training over. Set to 0 for serial run. Default: 0.

--backend {tensorflow-gpu,tensorflow-cpu,tensorflow-default}

Keras backend. If not specified will use system default.

--gpus <n>

Number of GPUs to attempt to parallelize across. Requires running in parallel.

--max-workers-per-gpu <n>

Maximum number of workers to assign to a GPU. Additional tasks will run on CPU.

--max-tasks-per-worker <n>

Restart workers after N tasks. Workaround for tensorflow memory leaks. Requires Python >=3.2.

--worker-log-dir <worker_log_dir>

Write worker stdout and stderr logs to given directory.

mhcflurry-class1-train-pan-allele-models

Train Class1 pan-allele models.
-h, --help

show this help message and exit

--data <file.csv>

Training data CSV. Expected columns: allele, peptide, measurement_value

--pretrain-data <file.csv>

Pre-training data CSV. Expected columns: allele, peptide, measurement_value

--out-models-dir <dir>

Directory to write models and manifest

--hyperparameters <file.json>

JSON or YAML of hyperparameters

--held-out-measurements-per-allele-fraction-and-max <x>

Fraction of measurements per allele to hold out, and maximum number

--ignore-inequalities

Do not use affinity value inequalities even when present in data

--num-folds <n>

Number of training folds.

--num-replicates <n>

Number of replicates per (architecture, fold) pair to train.

--max-epochs <n>

Max training epochs. If specified here it overrides any ‘max_epochs’ specified in the hyperparameters.

--allele-sequences <file.csv>

Allele sequences file.

--verbosity <verbosity>

Keras verbosity. Default: 0

--debug

Launch python debugger on error

--continue-incomplete

Continue training models from an incomplete training run. If this is specified then the only required argument is --out-models-dir

--only-initialize

Do not actually train models. The initialized run can be continued later with --continue-incomplete.

--num-jobs <n>

Number of local processes to parallelize training over. Set to 0 for serial run. Default: 0.

--backend {tensorflow-gpu,tensorflow-cpu,tensorflow-default}

Keras backend. If not specified will use system default.

--gpus <n>

Number of GPUs to attempt to parallelize across. Requires running in parallel.

--max-workers-per-gpu <n>

Maximum number of workers to assign to a GPU. Additional tasks will run on CPU.

--max-tasks-per-worker <n>

Restart workers after N tasks. Workaround for tensorflow memory leaks. Requires Python >=3.2.

--worker-log-dir <worker_log_dir>

Write worker stdout and stderr logs to given directory.

--cluster-parallelism
--cluster-submit-command <cluster_submit_command>

Default: sh

--cluster-results-workdir <cluster_results_workdir>

Default: ./cluster-workdir

--additional-complete-file <additional_complete_file>

Additional file to monitor for job completion. Default: STDERR

--cluster-script-prefix-path <cluster_script_prefix_path>
--cluster-max-retries <cluster_max_retries>

How many times to rerun failing jobs. Default: 3

mhcflurry-class1-select-pan-allele-models

Model selection for Class1 pan-allele models.

APPROACH: For each training fold, we select at least min and at most max models
(where min and max are set by the --{min/max}-models-per-fold argument) using a
step-up (forward) selection procedure. The final ensemble is the union of all
selected models across all folds.
-h, --help

show this help message and exit

--data <file.csv>

Model selection data CSV. Expected columns: allele, peptide, measurement_value

--models-dir <dir>

Directory to read models

--out-models-dir <dir>

Directory to write selected models

--min-models-per-fold <n>

Min number of models to select per fold

--max-models-per-fold <n>

Max number of models to select per fold

--mass-spec-regex <regex>

Regular expression for mass-spec data. Runs on the measurement_source column. Default: mass[- ]spec.

--verbosity <verbosity>

Keras verbosity. Default: 0

--num-jobs <n>

Number of local processes to parallelize training over. Set to 0 for serial run. Default: 0.

--backend {tensorflow-gpu,tensorflow-cpu,tensorflow-default}

Keras backend. If not specified will use system default.

--gpus <n>

Number of GPUs to attempt to parallelize across. Requires running in parallel.

--max-workers-per-gpu <n>

Maximum number of workers to assign to a GPU. Additional tasks will run on CPU.

--max-tasks-per-worker <n>

Restart workers after N tasks. Workaround for tensorflow memory leaks. Requires Python >=3.2.

--worker-log-dir <worker_log_dir>

Write worker stdout and stderr logs to given directory.

--cluster-parallelism
--cluster-submit-command <cluster_submit_command>

Default: sh

--cluster-results-workdir <cluster_results_workdir>

Default: ./cluster-workdir

--additional-complete-file <additional_complete_file>

Additional file to monitor for job completion. Default: STDERR

--cluster-script-prefix-path <cluster_script_prefix_path>
--cluster-max-retries <cluster_max_retries>

How many times to rerun failing jobs. Default: 3

mhcflurry-class1-train-processing-models

Train Class1 processing models.
-h, --help

show this help message and exit

--data <file.csv>

Training data CSV. Expected columns: peptide, n_flank, c_flank, hit

--out-models-dir <dir>

Directory to write models and manifest

--hyperparameters <file.json>

JSON or YAML of hyperparameters

--held-out-samples <n>

Number of experiments to hold out per fold

--num-folds <n>

Number of training folds.

--num-replicates <n>

Number of replicates per (architecture, fold) pair to train.

--max-epochs <n>

Max training epochs. If specified here it overrides any ‘max_epochs’ specified in the hyperparameters.

--verbosity <verbosity>

Keras verbosity. Default: 0

--debug

Launch python debugger on error

--continue-incomplete

Continue training models from an incomplete training run. If this is specified then the only required argument is --out-models-dir

--only-initialize

Do not actually train models. The initialized run can be continued later with --continue-incomplete.

--num-jobs <n>

Number of local processes to parallelize training over. Set to 0 for serial run. Default: 0.

--backend {tensorflow-gpu,tensorflow-cpu,tensorflow-default}

Keras backend. If not specified will use system default.

--gpus <n>

Number of GPUs to attempt to parallelize across. Requires running in parallel.

--max-workers-per-gpu <n>

Maximum number of workers to assign to a GPU. Additional tasks will run on CPU.

--max-tasks-per-worker <n>

Restart workers after N tasks. Workaround for tensorflow memory leaks. Requires Python >=3.2.

--worker-log-dir <worker_log_dir>

Write worker stdout and stderr logs to given directory.

--cluster-parallelism
--cluster-submit-command <cluster_submit_command>

Default: sh

--cluster-results-workdir <cluster_results_workdir>

Default: ./cluster-workdir

--additional-complete-file <additional_complete_file>

Additional file to monitor for job completion. Default: STDERR

--cluster-script-prefix-path <cluster_script_prefix_path>
--cluster-max-retries <cluster_max_retries>

How many times to rerun failing jobs. Default: 3

mhcflurry-class1-select-processing-models

Model selection for antigen processing models.

APPROACH: For each training fold, we select at least min and at most max models
(where min and max are set by the --{min/max}-models-per-fold argument) using a
step-up (forward) selection procedure. The final ensemble is the union of all
selected models across all folds. AUC is used as the metric.
-h, --help

show this help message and exit

--data <file.csv>

Model selection data CSV. Expected columns: peptide, hit, fold_0, …, fold_N

--models-dir <dir>

Directory to read models

--out-models-dir <dir>

Directory to write selected models

--min-models-per-fold <n>

Min number of models to select per fold

--max-models-per-fold <n>

Max number of models to select per fold

--verbosity <verbosity>

Keras verbosity. Default: 0

--num-jobs <n>

Number of local processes to parallelize training over. Set to 0 for serial run. Default: 0.

--backend {tensorflow-gpu,tensorflow-cpu,tensorflow-default}

Keras backend. If not specified will use system default.

--gpus <n>

Number of GPUs to attempt to parallelize across. Requires running in parallel.

--max-workers-per-gpu <n>

Maximum number of workers to assign to a GPU. Additional tasks will run on CPU.

--max-tasks-per-worker <n>

Restart workers after N tasks. Workaround for tensorflow memory leaks. Requires Python >=3.2.

--worker-log-dir <worker_log_dir>

Write worker stdout and stderr logs to given directory.

--cluster-parallelism
--cluster-submit-command <cluster_submit_command>

Default: sh

--cluster-results-workdir <cluster_results_workdir>

Default: ./cluster-workdir

--additional-complete-file <additional_complete_file>

Additional file to monitor for job completion. Default: STDERR

--cluster-script-prefix-path <cluster_script_prefix_path>
--cluster-max-retries <cluster_max_retries>

How many times to rerun failing jobs. Default: 3

mhcflurry-class1-train-presentation-models

Train Class1 presentation models.
-h, --help

show this help message and exit

--data <file.csv>

Training data CSV. Expected columns: peptide, n_flank, c_flank, hit

--out-models-dir <dir>

Directory to write models and manifest

--affinity-predictor <dir>

Affinity predictor models dir

--processing-predictor-with-flanks <dir>

Processing predictor with flanks

--processing-predictor-without-flanks <dir>

Processing predictor without flanks

--verbosity <verbosity>

Default: 1

--debug

Launch python debugger on error

--hla-column <hla_column>

Column in data giving space-separated MHC I alleles

--target-column <target_column>

Column in data giving hit (1) vs decoy (0)