Execution configuration (Manifest)
MHCXGraph is a Python package designed primarily to be used as a command-line interface (CLI). This section describes the most important parameters, along with their use cases and recommended configurations.
All parameters are provided through a JSON configuration file, referred to as the manifest. The file name is user-defined. The manifest is organized into four top-level sections: settings, classes, inputs, and selectors.
Overview
Each of the four sections is responsible for a different stage of the execution pipeline:
Settings: Defines execution parameters such as graph construction, filtering thresholds, and output configuration.
Inputs: Specifies which structure files are loaded and which selectors are applied to them.
Selectors: Defines residue selection rules used to restrict the analysis to specific regions of the structures.
Classes: Defines grouping rules that map residues or continuous descriptors into shared categories during triad construction.
An example of a manifest is shown below:
{
  "settings": {
    "run_name": "multiple-run",
    "run_mode": "multiple",
    "max_chunks": 5,
    "output_path": "examples/results/multiple",
    "debug_logs": false,
    "debug_tracking": false,
    "track_steps": false,
    "edge_threshold": 10,
    "node_granularity": "ca_only",
    "include_ligands": true,
    "include_noncanonical_residues": true,
    "include_waters": false,
    "triad_rsa": false,
    "rsa_filter": 0.1,
    "local_distance_diff_threshold": 1.0,
    "global_distance_diff_threshold": 2.0,
    "distance_bin_width": 2,
    "close_tolerance": 0.1
  },
  "inputs": [
    {
      "path": "examples/input/renumbered",
      "enable_tui": false,
      "extensions": [".pdb", ".cif"],
      "selectors": [
        { "name": "MHC1" }
      ]
    }
  ],
  "selectors": {
    "MHC1": {
      "chains": ["C"],
      "structures": {},
      "residues": {
        "A": [18,19,42,43,44,54,55,56,58,59,61,62,63,64,65,66,68,69,70,71,72,73,75,76,79,80,83,84,89,108,109,142,143,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,161,162,163,165,166,167,169,170,171]
      }
    },
    "MHC2": {
      "chains": ["C"],
      "residues": {
        "A": [37,51,52,53,55,56,58,59,60,62,63,65,66,67,69],
        "B": [56,57,59,60,61,62,63,65,66,67,68,69,70,71,72,73,74,77,78,81]
      }
    },
    "general": {
      "chains": ["C"],
      "structures": ["helix"],
      "residues": {}
    }
  }
}
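As a quick sanity check before a run, a manifest can be inspected with a few lines of Python. The sketch below is not part of MHCXGraph: the helper `missing_sections` and the inlined, trimmed-down manifest are hypothetical and only illustrate the structure described above, including the link between selector references in inputs and definitions in selectors.

```python
import json

# Top-level sections named in this documentation.
EXPECTED_SECTIONS = ("settings", "classes", "inputs", "selectors")


def missing_sections(manifest: dict) -> list:
    """Return the documented top-level sections that are absent from a manifest."""
    return [name for name in EXPECTED_SECTIONS if name not in manifest]


# A trimmed-down manifest, inlined as a string so the sketch is self-contained.
manifest_text = """
{
  "settings": {"run_name": "multiple-run", "run_mode": "multiple"},
  "inputs": [
    {"path": "examples/input/renumbered", "selectors": [{"name": "MHC1"}]}
  ],
  "selectors": {"MHC1": {"chains": ["C"]}}
}
"""
manifest = json.loads(manifest_text)

# Selector names referenced by an input rule but never defined under "selectors".
undefined_selectors = [
    ref["name"]
    for rule in manifest["inputs"]
    for ref in rule.get("selectors", [])
    if ref["name"] not in manifest["selectors"]
]

print(missing_sections(manifest))  # informational: the classes section is optional
print(undefined_selectors)
```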
Settings
This section defines the execution parameters that control graph construction, residue filtering, and output generation. The table below provides an overview, followed by a detailed description of each parameter.
| Category | Parameter | Description | Default |
|---|---|---|---|
| Execution | run_name | Name of the run | |
| | run_mode | Execution mode: multiple, pairwise, or screening | |
| | output_path | Path for results output | |
| | reference_structure | Path to reference structure (required for screening mode) | |
| Graph | node_granularity | Atomic representation for residue nodes: ca_only, all_atoms, backbone, or sidechain | |
| | edge_threshold | Distance cutoff (Å) for defining edges between nodes | |
| | include_ligands | Include ligands as graph nodes | |
| | include_noncanonical_residues | Include modified amino acids as nodes | |
| | include_waters | Include water molecules as nodes | |
| | filter_triads_by_chain | Restrict triads to those with at least one node from the specified chain | |
| | max_gap_helix | Max residue gap between helices to treat them as continuous | |
| Triad comparison | distance_bin_width | Width of distance bins for discretization | |
| | local_distance_diff_threshold | Max distance difference (d1/d2/d3) between triads for association | |
| | global_distance_diff_threshold | Max distance difference between non-adjacent nodes across structures (frame generation step) | |
| | close_tolerance | Tolerance for placing distances at bin center | |
| Surface representation | rsa_filter | Min RSA for canonical residues to be included as nodes | |
| | asa_filter | Min ASA for non-canonical residues, waters, and ligands | |
| RSA tokens² | triad_rsa | Use RSA values in triad token representation | |
| | rsa_bin_width | Width of RSA bins for discretization | |
| | rsa_diff_threshold | Max RSA difference between triad nodes for association | |
| | close_tolerance_rsa | Tolerance for placing RSA values at bin center | |
run_name:
The name assigned to the current execution. The raw JSON file containing the graph nodes is saved as graph_{run_name}.json.
run_mode:
There are three possible run_mode options, which determine how the execution is performed: multiple, pairwise, and screening.
- multiple: compares all input pMHC structures simultaneously, identifying regions that are common across all of them.
- pairwise: performs pairwise comparisons between all input pMHC structures.
- screening: uses the reference_structure to compare a reference pMHC against all other input structures. This mode is particularly useful when you have a target pMHC and want to evaluate it against a larger set of structures.
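The difference between the three modes can be pictured as the set of structure groups that end up being compared. The following is a minimal sketch under that reading, not MHCXGraph's internal scheduling; the function `comparison_groups` and the file names are hypothetical.

```python
from itertools import combinations


def comparison_groups(run_mode, structures, reference_structure=None):
    """List the groups of structures that are compared together in each run mode."""
    if run_mode == "multiple":
        # One comparison that involves every structure at once.
        return [tuple(structures)]
    if run_mode == "pairwise":
        # Every unordered pair of structures.
        return list(combinations(structures, 2))
    if run_mode == "screening":
        if reference_structure is None:
            raise ValueError("screening mode requires reference_structure")
        # The reference against each other structure.
        return [(reference_structure, s) for s in structures if s != reference_structure]
    raise ValueError(f"unknown run_mode: {run_mode!r}")


structures = ["a.pdb", "b.pdb", "c.pdb"]
for mode in ("multiple", "pairwise"):
    print(mode, comparison_groups(mode, structures))
print("screening", comparison_groups("screening", structures, "a.pdb"))
```

With n input structures this gives one group for multiple, n(n-1)/2 pairs for pairwise, and one pair per non-reference structure for screening.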
output_path:
The destination folder for all generated results. Any parent folders that do not exist are created automatically.
reference_structure:
The target structure that is compared against all other structures listed in the inputs section when the program runs in screening mode.
node_granularity:
The node_granularity parameter determines which atoms of each residue are used to compute centroid positions. This choice strongly affects graph connectivity: in combination with edge_threshold, it can determine whether two residues are considered adjacent (i.e., whether an edge is formed between their corresponding nodes).
Four granularities are available: ca_only, all_atoms, backbone, and sidechain.
- ca_only: uses only the C\(\alpha\) atom of each residue.
- all_atoms: uses all heavy atoms in the residue.
- backbone: uses only heavy atoms belonging to the backbone.
- sidechain: uses only heavy atoms belonging to the side chain.
Note
Hydrogen atoms are not considered in any granularity. All centroid computations are performed using heavy atoms only.
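To make the effect of each granularity concrete, the sketch below computes a residue centroid from a toy atom list. It is an illustration only: the atom tuple layout, the `BACKBONE_ATOMS` set (N, CA, C, O), the helper names, and the coordinates are assumptions, not MHCXGraph internals.

```python
# Assumed backbone atom names; everything else heavy is treated as side chain.
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def select_atoms(atoms, granularity):
    """Pick the heavy atoms used for the centroid under a given granularity.

    Each atom is a tuple (atom_name, element, (x, y, z)).
    """
    heavy = [a for a in atoms if a[1] != "H"]  # hydrogens are never used
    if granularity == "ca_only":
        return [a for a in heavy if a[0] == "CA"]
    if granularity == "all_atoms":
        return heavy
    if granularity == "backbone":
        return [a for a in heavy if a[0] in BACKBONE_ATOMS]
    if granularity == "sidechain":
        return [a for a in heavy if a[0] not in BACKBONE_ATOMS]
    raise ValueError(f"unknown granularity: {granularity!r}")


def centroid(atoms, granularity):
    """Mean position of the selected atoms, or None if nothing is selected."""
    coords = [a[2] for a in select_atoms(atoms, granularity)]
    if not coords:
        return None  # e.g. a residue with no side-chain heavy atoms
    return tuple(sum(c[i] for c in coords) / len(coords) for i in range(3))


# Toy alanine-like residue with made-up coordinates.
residue = [
    ("N", "N", (0.0, 0.0, 0.0)),
    ("CA", "C", (1.0, 0.0, 0.0)),
    ("C", "C", (2.0, 0.0, 0.0)),
    ("O", "O", (3.0, 0.0, 0.0)),
    ("CB", "C", (1.0, 2.0, 0.0)),
    ("HA", "H", (9.0, 9.0, 9.0)),
]
for g in ("ca_only", "all_atoms", "backbone", "sidechain"):
    print(g, centroid(residue, g))
```

The same residue yields four different centroid positions, which is why the granularity interacts with edge_threshold when edges are formed.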
edge_threshold:
The edge_threshold parameter defines the maximum distance (in Å) between the centroids of two residues for an edge to be created between their corresponding nodes.
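A minimal sketch of this rule, assuming centroids are already available as 3D coordinates keyed by a node label. The function `build_edges`, the label format, and the inclusive comparison at exactly the threshold are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def build_edges(centroids, edge_threshold):
    """Connect every pair of nodes whose centroids lie within edge_threshold (Å)."""
    return [
        (a, b)
        for a, b in combinations(sorted(centroids), 2)
        if dist(centroids[a], centroids[b]) <= edge_threshold
    ]


# Made-up centroid positions in Å.
centroids = {
    "A:18": (0.0, 0.0, 0.0),
    "A:19": (3.8, 0.0, 0.0),
    "C:5": (0.0, 12.0, 0.0),
}
print(build_edges(centroids, edge_threshold=10))
print(build_edges(centroids, edge_threshold=13))
```

Raising the threshold from 10 Å to 13 Å adds edges to the distant node, which shows how strongly this parameter controls graph density.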
include_ligands:
If True, ligand molecules bound to the structure are included as nodes and can participate in edge formation.
Note
This option does not include water molecules. Use include_waters to control their inclusion.
include_noncanonical_residues:
If True, non-canonical residues (e.g., modified amino acids) are included as nodes and can participate in edge formation.
include_waters:
If True, water molecules are included as nodes and can participate in edge formation.
filter_triads_by_chain:
Restricts triads to those that include at least one residue from the specified chains.
This parameter is particularly useful when multiple pMHC structures share the same MHC but differ in their bound peptides. In such cases, it restricts the analysis to the peptide and the surrounding MHC residues that interact with it.
For example, passing ["C"] ensures that, for any triad \((v_1, v_2, v_3)\), at least one residue belongs to chain C. Triads that do not satisfy this condition are discarded.
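The condition can be written as a one-line filter. In this sketch a triad is a tuple of three (chain, residue number) pairs; that representation and the function name are assumptions chosen for illustration.

```python
def keep_triads_with_chain(triads, chains):
    """Keep triads in which at least one node belongs to one of the given chains."""
    allowed = set(chains)
    return [t for t in triads if any(chain in allowed for chain, _ in t)]


triads = [
    (("A", 62), ("A", 66), ("C", 4)),   # contains a chain C residue: kept
    (("A", 62), ("A", 66), ("A", 70)),  # chain A only: discarded
    (("C", 2), ("C", 3), ("C", 4)),     # all chain C: kept
]
kept = keep_triads_with_chain(triads, ["C"])
print(kept)
```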
max_gap_helix:
Maximum allowed gap (in residue index) between consecutive helical segments for them to be treated as a single continuous helix.
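One possible reading of this rule is sketched below. Helical segments are given as inclusive (start, end) residue numbers on one chain, and the gap is taken to be the number of residues lying strictly between two segments; both the representation and that gap definition are assumptions, and the tool may count the gap differently.

```python
def merge_helices(segments, max_gap_helix):
    """Merge consecutive helical segments separated by at most max_gap_helix residues."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] - 1 <= max_gap_helix:
            # Close enough to the previous helix: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


segments = [(57, 84), (86, 90), (140, 150)]
print(merge_helices(segments, max_gap_helix=2))
print(merge_helices(segments, max_gap_helix=0))
```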
distance_bin_width:
Defines the width of distance bins used to discretize inter-residue distances in triads.
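As an illustration of discretization by a fixed bin width, the sketch below assigns each distance to the bin that contains it, with bins starting at 0 Å. The function name and the choice of bin origin are assumptions; the tolerance parameters described later in this section are not modeled here.

```python
def distance_bin(distance, distance_bin_width):
    """Index of the fixed-width bin that contains the distance (bin 0 starts at 0 Å)."""
    return int(distance // distance_bin_width)


for d in (3.9, 4.1, 5.3):
    print(d, distance_bin(d, distance_bin_width=2))
```

With a width of 2 Å, 4.1 and 5.3 share a bin while 3.9 falls into the previous one, even though 3.9 and 4.1 differ by only 0.2 Å.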
local_distance_diff_threshold:
Maximum allowed difference between corresponding distances (\(d_1\), \(d_2\), \(d_3\)) when comparing two triads. This threshold is applied during triad association to enforce local geometric consistency.
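In other words, two triads with distances \((d_1, d_2, d_3)\) and \((d_1', d_2', d_3')\) are compatible only if \(|d_i - d_i'|\) stays within the threshold for every \(i\). A minimal sketch of that check follows; the function name, the inclusive comparison, and the example distances are assumptions.

```python
def triads_locally_consistent(dists_a, dists_b, local_distance_diff_threshold):
    """True if every pair of corresponding distances differs by at most the threshold."""
    if len(dists_a) != len(dists_b):
        raise ValueError("triads must have the same number of distances")
    return all(
        abs(a - b) <= local_distance_diff_threshold
        for a, b in zip(dists_a, dists_b)
    )


reference = (5.2, 7.9, 9.1)
print(triads_locally_consistent(reference, (5.8, 7.5, 9.9), 1.0))   # all differences below 1.0
print(triads_locally_consistent(reference, (5.2, 7.9, 10.6), 1.0))  # third distance differs by 1.5
```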
global_distance_diff_threshold:
Maximum allowed distance difference between non-adjacent nodes across structures during frame generation. This parameter enforces global geometric consistency.
close_tolerance:
Tolerance used when assigning distances to discretization bins. Values within this tolerance of a bin center may be associated with that bin.
rsa_filter:
Minimum relative solvent accessibility (RSA) required for canonical residues to be included as nodes in the graph.
asa_filter:
Minimum absolute solvent accessibility (ASA) required for non-canonical residues, water molecules, and ligands to be included as nodes.
triad_rsa:
If True, RSA values are included as features in the triad token representation, allowing solvent exposure to influence matching.
Note
When using node_granularity = ca_only, it is recommended to set triad_rsa = False, since centroids computed from C\(\alpha\) atoms are not well correlated with RSA values.
rsa_bin_width:
Defines the width of RSA bins used to discretize solvent accessibility values in triads.
rsa_diff_threshold:
Maximum allowed difference in RSA values between corresponding nodes of two triads during association.
close_tolerance_rsa:
Tolerance used when assigning RSA values to discretization bins, analogous to close_tolerance for distances.
Inputs
The inputs section defines which structure files are loaded for the analysis and which selectors are applied to them. Each entry in inputs is an input rule describing where the files are located, how they should be collected, and which selector definitions should be used.
This section allows the user to process either individual files or entire directories. When a directory is provided, all files matching the allowed extensions are considered. Selectors referenced in an input rule are resolved for each matching file before graph construction.
Each input rule may contain the following fields:
| Field | Type | Description |
|---|---|---|
| path | str or list[str] | Path to a structure file or directory. Multiple paths may also be provided. |
| extensions | list[str] | File extensions allowed when scanning directories, such as .pdb and .cif. |
| enable_tui | bool | If false, all matching files are collected automatically. |
| selectors | list[dict] | List of selector references applied to the files matched by this input rule. |
An example is shown below:
"inputs": [
  {
    "path": "examples/input/renumbered",
    "enable_tui": false,
    "extensions": [".pdb", ".cif"],
    "selectors": [
      { "name": "MHC1" }
    ]
  }
]
In this example, all .pdb and .cif files inside examples/input/renumbered are considered as inputs, and the selector MHC1 is applied to each file.
Multiple input rules may be defined in the same manifest. This makes it possible to process structures from different folders, apply different selector sets, or combine file-specific and directory-based rules in a single execution.
Note
If enable_tui is set to False, all matching files are collected automatically.
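The automatic collection step can be approximated with pathlib. This is a sketch, not the package's loader: the function `collect_structure_files` is hypothetical, directories are scanned non-recursively, and explicitly listed files are accepted regardless of extension, all of which are assumptions.

```python
import tempfile
from pathlib import Path


def collect_structure_files(paths, extensions):
    """Gather files from one or more paths, filtering directory contents by extension."""
    if isinstance(paths, (str, Path)):
        paths = [paths]  # "path" may be a single string or a list
    allowed = {ext.lower() for ext in extensions}
    found = []
    for entry in map(Path, paths):
        if entry.is_dir():
            found.extend(
                sorted(
                    p for p in entry.iterdir()
                    if p.is_file() and p.suffix.lower() in allowed
                )
            )
        elif entry.is_file():
            found.append(entry)
    return found


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1abc.pdb", "2xyz.cif", "notes.txt"):
        (Path(tmp) / name).touch()
    collected_names = [
        p.name for p in collect_structure_files(tmp, [".pdb", ".cif"])
    ]

print(collected_names)
```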
Selectors
The selectors section defines how residues are chosen from each input structure after graph construction. A selector can restrict the analysis by chain, residue number, secondary structure, or a logical combination of these criteria.
Selectors are not applied globally by default. Instead, they are referenced from the inputs section, where one or more selector names can be attached to a given input rule. During execution, the selected constraints are resolved for each file and then combined with the solvent-exposure filtering step.
In practice, selectors allow the user to focus the analysis on structurally or biologically relevant regions, such as residues from a given chain, residues belonging to helices, or specific residue positions known to participate in recognition.
The selectors section is a dictionary in which each key is a selector name and each value defines a set of constraints. A selector may contain the following fields:
| Field | Type | Description |
|---|---|---|
| chains | list[str] | Selects all nodes belonging to the specified chains. |
| residues | dict[str, list[int]] | Selects specific residue numbers for each chain. |
| structures | list[str] or dict[str, list[str]] | Selects nodes by secondary-structure annotation. It may be defined globally as a list, or per chain as a dictionary. |
| logic | str | Boolean expression combining named sets such as exposed, chains, residues, and structures. |
A simple example is shown below:
"selectors": {
  "MHC1": {
    "chains": ["C"],
    "residues": {
      "A": [18, 19, 42, 43]
    }
  },
  "general": {
    "chains": ["C"],
    "structures": ["helix"],
    "residues": {}
  }
}
In this example, MHC1 selects chain C together with a predefined set of residues from chain A. Under the default combination rule, the selector general selects the residues in chain C together with all residues annotated as helices.
By default, when no logic expression is provided, the sets defined by chains, residues, and structures are first combined by union, and the result is then intersected with the set of exposed residues. If none of these fields is provided, the selector reduces to the exposed-residue filter alone.
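The default rule maps directly onto Python set operations. In the sketch below, nodes are plain "chain:residue" strings and each field has already been resolved to a set of nodes; the function name, the node labels, and the treatment of empty fields as "not provided" are assumptions.

```python
def default_selection(exposed, chains=None, residues=None, structures=None):
    """Union of the provided sets, intersected with the exposed residues."""
    provided = [s for s in (chains, residues, structures) if s]
    if not provided:
        # No constraint given: only the exposure filter applies.
        return set(exposed)
    return set().union(*provided) & set(exposed)


exposed = {"A:18", "A:60", "C:1", "C:2"}
chain_c = {"C:1", "C:2", "C:3"}        # nodes matched by "chains": ["C"]
listed = {"A:18", "A:19"}              # nodes matched by "residues": {"A": [18, 19]}

print(sorted(default_selection(exposed, chains=chain_c, residues=listed)))
print(sorted(default_selection(exposed)))
```

Here C:3 and A:19 are dropped because they are not exposed, and A:60 is dropped because no field selects it.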
When a logic expression is provided, it is evaluated explicitly. The supported operators are & for intersection, | for union, and ! for negation. Parentheses may also be used. For example:
"selectors": {
  "example_selector": {
    "chains": ["A", "C"],
    "structures": ["helix"],
    "logic": "exposed & chains:C & structures"
  }
}
This expression keeps only exposed residues that belong to chain C and are part of a helix.
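The same expression can be mirrored with Python sets, where & and | have the same meaning and negation becomes a difference from the full node set. This sketch does not parse the logic string; the node labels and set contents are made up for illustration.

```python
# All nodes of a toy structure, and the named sets resolved for it.
all_nodes = {"A:60", "A:61", "C:1", "C:2"}
exposed = {"A:60", "C:1", "C:2"}
chains_c = {"C:1", "C:2"}            # chains:C
structures = {"A:60", "A:61", "C:2"}  # nodes annotated as helix

# "exposed & chains:C & structures"
selected = exposed & chains_c & structures

# "exposed & !chains:C" (negation taken relative to all nodes)
selected_not_c = exposed & (all_nodes - chains_c)

print(sorted(selected))
print(sorted(selected_not_c))
```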
Selectors may also define chain-specific structural filters using a dictionary:
"selectors": {
  "mhc2_like": {
    "structures": {
      "A": ["helix"],
      "B": ["sheet"]
    }
  }
}
In this case, residues are selected according to the allowed secondary-structure types for each chain.
Note
After selector evaluation, isolated ligand and water nodes are removed automatically if they have no edges within the selected subgraph.
Note
If a logical expression is provided and does not explicitly include exposed, the exposure filter is still applied afterward.
Classes
The classes section defines optional grouping rules used during triad generation. These classes allow different residues or continuous values to be mapped into the same category before token construction.
This mechanism is particularly useful when the analysis should emphasize broader physicochemical similarity rather than exact identity. For example, residues with similar chemical properties can be assigned to the same class, allowing them to contribute to the same triad token.
The classes section may define classes for residues, distances, and RSA values.
| Field | Type | Description |
|---|---|---|
| residues | dict[str, list[str]] | Groups amino acids into user-defined residue classes. |
| | dict[str, list[float]] or similar | Defines custom distance classes used instead of automatic distance discretization. |
| | dict[str, list[float]] or similar | Defines custom RSA classes used instead of automatic RSA discretization. |
When residue classes are provided, each residue is mapped to its corresponding class name before the triad token is created. If no residue classes are defined, the residue identity itself is used.
For example, the following configuration groups amino acids by general physicochemical type:
"classes": {
  "residues": {
    "hydrophobic": ["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"],
    "polar": ["SER", "THR", "ASN", "GLN", "TYR", "CYS", "GLY"],
    "positive": ["LYS", "ARG", "HIS"],
    "negative": ["ASP", "GLU"]
  }
}
With this definition, a triad containing LEU, ILE, and VAL will be represented using the shared class hydrophobic rather than the individual residue names.
This can increase the tolerance of the comparison by allowing chemically similar residues to match even when their exact identities differ.
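The mapping from residue names to class labels can be sketched as a reverse lookup. The helper names below are hypothetical, and falling back to the residue name for residues that belong to no class is an assumption rather than documented behavior.

```python
residue_classes = {
    "hydrophobic": ["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"],
    "polar": ["SER", "THR", "ASN", "GLN", "TYR", "CYS", "GLY"],
    "positive": ["LYS", "ARG", "HIS"],
    "negative": ["ASP", "GLU"],
}

# Invert the class definition into a residue -> class lookup table.
residue_to_class = {
    residue: class_name
    for class_name, members in residue_classes.items()
    for residue in members
}


def triad_labels(residues):
    """Replace each residue name by its class; unclassified names are kept as-is."""
    return tuple(residue_to_class.get(r, r) for r in residues)


print(triad_labels(("LEU", "ILE", "VAL")))
print(triad_labels(("LYS", "ASP", "MSE")))
```

Two triads such as (LEU, ILE, VAL) and (ALA, PHE, MET) receive identical labels under this grouping, which is what allows them to match despite different residue identities.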
Custom distance and RSA classes may also be provided. When these are defined, they replace the default discretization based on parameters such as distance_bin_width, rsa_bin_width, and their associated tolerances.
Note
The classes section is optional. If it is omitted, residues, distances, and RSA values are handled using their default representations.
Note
Residue classes affect triad tokens directly. As a result, changing these definitions may significantly alter the number and type of associations detected.