Execution configuration (Manifest)

MHCXGraph is a Python package designed primarily to be used as a command-line interface (CLI). This section describes the most important parameters, along with their use cases and recommended configurations.

All parameters are provided through a JSON configuration file, referred to as the manifest. The file name is user-defined. The manifest is organized into four top-level sections: settings, classes, inputs, and selectors.

Overview

The manifest is organized into four main sections, each responsible for a different stage of the execution pipeline:

  • Settings Defines execution parameters such as graph construction, filtering thresholds, and output configuration.

  • Inputs Specifies which structure files are loaded and which selectors are applied to them.

  • Selectors Defines residue selection rules used to restrict the analysis to specific regions of the structures.

  • Classes Defines grouping rules that map residues or continuous descriptors into shared categories during triad construction.

An example of a manifest is shown below:

{
  "settings": {
    "run_name": "multiple-run",
    "run_mode": "multiple",
    "max_chunks": 5,
    "output_path": "examples/results/multiple",
    "debug_logs": false,
    "debug_tracking": false,
    "track_steps": false,
    "edge_threshold": 10,
    "node_granularity": "ca_only",
    "include_ligands": true,
    "include_noncanonical_residues": true,
    "include_waters": false,
    "triad_rsa": false,
    "rsa_filter": 0.1,
    "local_distance_diff_threshold": 1.0,
    "global_distance_diff_threshold": 2.0,
    "distance_bin_width": 2,
    "close_tolerance": 0.1
  },

  "inputs": [
    {
      "path": "examples/input/renumbered",
      "enable_tui": false,
      "extensions": [".pdb", ".cif"],
      "selectors": [
        { "name": "MHC1" }
      ]
    }
  ],

  "selectors": {
    "MHC1": {
      "chains": ["C"],
      "structures": {},
      "residues": {
        "A": [18,19,42,43,44,54,55,56,58,59,61,62,63,64,65,66,68,69,70,71,72,73,75,76,79,80,83,84,89,108,109,142,143,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,161,162,163,165,166,167,169,170,171]
      }
    },
    "MHC2": {
      "chains": ["C"],
      "residues": {
        "A": [37,51,52,53,55,56,58,59,60,62,63,65,66,67,69],
        "B": [56,57,59,60,61,62,63,65,66,67,68,69,70,71,72,73,74,77,78,81]
      }
    },
    "general": {
      "chains": ["C"],
      "structures": ["helix"],
      "residues": {}
    }
  }
}

Settings

This section defines the execution parameters that control graph construction, residue filtering, and output generation. The table below provides an overview, followed by a detailed description of each parameter.

Key Configuration Parameters¹

Category

Parameter

Description

Default

Execution

run_name

Name of the run

test

run_mode

Execution mode: pairwise, multiple, or screening

multiple

output_path

Path for results output

./outputs

reference_structure

Path to reference structure (required for screening mode)

None

Graph

node_granularity

Atomic representation for residue nodes: ca_only, all_atoms, backbone, or sidechain

all_atoms

edge_threshold

Distance cutoff (Å) for defining edges between nodes

8.5

include_ligands

Include ligands as graph nodes

true

include_noncanonical_residues

Include modified amino acids as nodes

true

include_waters

Include water molecules as nodes

true

filter_triads_by_chain

Restrict triads to those with at least one node from the specified chain

None

max_gap_helix

Max residue gap between helices to treat them as continuous

0

Triad comparison

distance_bin_width

Width of distance bins for discretization

2.0

local_distance_diff_threshold

Max distance difference (d1/d2/d3) between triads for association

1.0

global_distance_diff_threshold

Max distance difference between non-adjacent nodes across structures (frame generation step)

2.0

close_tolerance

Tolerance for placing distances at bin center

0.1

Surface representation

rsa_filter

Min RSA for canonical residues to be included as nodes

0.1

asa_filter

Min ASA for non-canonical residues, waters, and ligands

5

RSA tokens²

triad_rsa

Use RSA values in triad token representation

false

rsa_bin_width

Width of RSA bins for discretization

0.3

rsa_diff_threshold

Max RSA difference between triad nodes for association

0.3

close_tolerance_rsa

Tolerance for placing RSA values at bin center

0.01

run_name: (type: string, default: “test”)

The name assigned to the current execution. The raw JSON file containing the graph nodes is saved as graph_{run_name}.json.

run_mode: (type: string, default: “multiple”)

There are three possible run_mode options, which determine how the execution is performed: multiple, pairwise, and screening.

  • multiple: compares all input pMHC structures simultaneously, identifying regions that are common across all of them.

  • pairwise: performs pairwise comparisons between all input pMHC structures.

  • screening: uses the reference_structure to compare a reference pMHC against all other input structures. This mode is particularly useful when you have a target pMHC and want to evaluate it against a larger set of structures.

output_path: (type: string, default: “./outputs”)

The folder destination of all results generated. All the parents folder will be created even though they doesn’t exist.

reference_structure: (type: string | None, default: None)

The target structure that will be compared against all the other structures passed in the input parameter when the program is executed in screening mode.

node_granularity: (type: string, default: “all_atoms”)

The node_granularity parameter determines which atoms of each residue are used to compute centroid positions. This choice strongly affects graph connectivity: in combination with edge_threshold, it can determine whether two residues are considered adjacent (i.e., whether an edge is formed between their corresponding nodes).

Four granularities are available: ca_only, all_atoms, backbone, and sidechain.

  • ca_only: uses only the C\(\alpha\) atom of each residue.

  • all_atoms: uses all heavy atoms in the residue.

  • backbone: uses only heavy atoms belonging to the backbone.

  • sidechain: uses only heavy atoms belonging to the side chain.

Node granularity illustrationNode granularity illustration

Note

Hydrogen atoms are not considered in any granularity. All centroid computations are performed using heavy atoms only.

edge_threshold: (type: float, default: 8.5)

The edge_threshold parameter defines the maximum distance (in Å) between the centroids of two residues for an edge to be created between their corresponding nodes.

include_ligands: (type: bool, default: True)

If True, ligand molecules bound to the structure are included as nodes and can participate in edge formation.

Note

This option does not include water molecules. Use include_waters to control their inclusion.

include_noncanonical_residues: (type: bool, default: True)

If True, non-canonical residues (e.g., modified amino acids) are included as nodes and can participate in edge formation.

include_waters: (type: bool, default: True)

If True, water molecules are included as nodes and can participate in edge formation.

filter_triads_by_chain: (type: list | None, default: None)

Restricts triads to those that include at least one residue from the specified chains.

This parameter is particularly useful when multiple pMHC structures share the same MHC but differ in their bound peptides. In such cases, it restricts the analysis to the peptide and the surrounding MHC residues that interact with it.

For example, passing ["C"] ensures that, for any triad \((v_1, v_2, v_3)\), at least one residue belongs to chain C. Triads that do not satisfy this condition are discarded.

max_gap_helix: (type: int, default: 0)

Maximum allowed gap (in residue index) between consecutive helical segments for them to be treated as a single continuous helix.

distance_bin_width: (type: float, default: 2.0)

Defines the width of distance bins used to discretize inter-residue distances in triads.

local_distance_diff_threshold: (type: float, default: 1.0)

Maximum allowed difference between corresponding distances (\(d_1\), \(d_2\), \(d_3\)) when comparing two triads. This threshold is applied during triad association to enforce local geometric consistency.

global_distance_diff_threshold: (type: float, default: 2.0)

Maximum allowed distance difference between non-adjacent nodes across structures during frame generation. This parameter enforces global geometric consistency.

close_tolerance: (type: float, default: 0.1)

Tolerance used when assigning distances to discretization bins. Values within this tolerance of a bin center may be associated with that bin.

rsa_filter: (type: float, default: 0.1)

Minimum relative solvent accessibility (RSA) required for canonical residues to be included as nodes in the graph.

asa_filter: (type: float, default: 5)

Minimum absolute solvent accessibility (ASA) required for non-canonical residues and ligands to be included as nodes.

triad_rsa: (type: bool, default: False)

If True, RSA values are included as features in the triad token representation, allowing solvent exposure to influence`` matching.

Note

When using node_granularity = ca_only, it is recommended to set triad_rsa = False, since centroids computed from C\(\alpha\) atoms are not well correlated with RSA values.

rsa_bin_width: (type: float, default: 0.3)

Defines the width of RSA bins used to discretize solvent accessibility values in triads.

rsa_diff_threshold: (type: float, default: 0.3)

Maximum allowed difference in RSA values between corresponding nodes of two triads during association.

close_tolerance_rsa: (type: float, default: 0.01)

Tolerance used when assigning RSA values to discretization bins, analogous to close_tolerance for distances.

Inputs

The inputs section defines which structure files are loaded for the analysis and which selectors are applied to them. Each entry in inputs is an input rule describing where the files are located, how they should be collected, and which selector definitions should be used.

This section allows the user to process either individual files or entire directories. When a directory is provided, all files matching the allowed extensions are considered. Selectors referenced in an input rule are resolved for each matching file before graph construction.

Each input rule may contain the following fields:

Input fields

Field

Type

Description

path

str or list[str]

Path to a structure file or directory. Multiple paths may also be provided.

extensions

list[str]

File extensions allowed when scanning directories, such as [".pdb", ".cif"].

enable_tui

bool

If True, an interactive terminal selection is shown when the input path is a directory.

selectors

list[dict]

List of selectors references applied to the files matched by this input rule.

An example is shown below:

"inputs": [
  {
    "path": "examples/input/renumbered",
    "enable_tui": false,
    "extensions": [".pdb", ".cif"],
    "selectors": [
      { "name": "MHC1" }
    ]
  }
]

In this example, all .pdb and .cif files inside examples/input/renumbered are considered as inputs, and the selector MHC1 is applied to each file.

Multiple input rules may be defined in the same manifest. This makes it possible to process structures from different folders, apply different selector sets, or combine file-specific and directory-based rules in a single execution.

Note

If enable_tui is set to False, all matching files are collected automatically.

Selectors

The selectors section defines how residues are chosen from each input structure after graph construction. A selector can restrict the analysis by chain, residue number, secondary structure, or a logical combination of these criteria.

Selectors are not applied globally by default. Instead, they are referenced from the inputs section, where one or more selector names can be attached to a given input rule. During execution, the selected constraints are resolved for each file and then combined with the solvent-exposure filtering step.

In practice, selectors allow the user to focus the analysis on structurally or biologically relevant regions, such as residues from a given chain, residues belonging to helices, or specific residue positions known to participate in recognition.

The selectors section is a dictionary in which each key is a selector name and each value defines a set of constraints. A selector may contain the following fields:

Selector fields

Field

Type

Description

chains

list[str]

Selects all nodes belonging to the specified chains.

residues

dict[str, list[int]]

Selects specific residue numbers for each chain.

structures

list[str] or dict[str, list[str]]

Selects nodes by secondary-structure annotation. It may be defined globally as a list, or per chain as a dictionary.

logic

str

Boolean expression combining named sets such as chains, residues, structures, and exposed.

A simple example is shown below:

"selectors": {
  "MHC1": {
    "chains": ["C"],
    "residues": {
      "A": [18, 19, 42, 43]
    }
  },
  "general": {
    "chains": ["C"],
    "structures": ["helix"],
    "residues": {}
  }
}

In this example, MHC1 restricts the selection to chain C together with a predefined set of residues from chain A. The selector general restricts the graph to residues in chain C that are also annotated as helices.

By default, when no logic expression is provided, the sets defined by chains, residues, and structures are first combined by union, and the result is then intersected with the set of exposed residues. If none of these fields is provided, the selector reduces to the exposed-residue filter alone.

When a logic expression is provided, it is evaluated explicitly. The supported operators are & for intersection, | for union, and ! for negation. Parentheses may also be used. For example:

"selectors": {
  "example_selector": {
    "chains": ["A", "C"],
    "structures": ["helix"],
    "logic": "exposed & chains:C & structures"
  }
}

This expression keeps only exposed residues that belong to chain C and are part of a helix.

Selectors may also define chain-specific structural filters using a dictionary:

"selectors": {
  "mhc2_like": {
    "structures": {
      "A": ["helix"],
      "B": ["sheet"]
    }
  }
}

In this case, residues are selected according to the allowed secondary-structure types for each chain.

Note

After selector evaluation, isolated ligand and water nodes are removed automatically if they have no edges within the selected subgraph.

Note

If a logical expression is provided and does not explicitly include exposed, the exposure filter is still applied afterward.

Classes

The classes section defines optional grouping rules used during triad generation. These classes allow different residues or continuous values to be mapped into the same category before token construction.

This mechanism is particularly useful when the analysis should emphasize broader physicochemical similarity rather than exact identity. For example, residues with similar chemical properties can be assigned to the same class, allowing them to contribute to the same triad token.

The classes section may define classes for residues, distances, and RSA values.

Class categories

Field

Type

Description

residues

dict[str, list[str]]

Groups amino acids into user-defined residue classes.

distance

dict[str, list[float]] or similar

Defines custom distance classes used instead of automatic distance discretization.

rsa

dict[str, list[float]] or similar

Defines custom RSA classes used instead of automatic RSA discretization.

When residue classes are provided, each residue is mapped to its corresponding class name before the triad token is created. If no residue classes are defined, the residue identity itself is used.

For example, the following configuration groups amino acids by general physicochemical type:

"classes": {
  "residues": {
    "hydrophobic": ["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"],
    "polar": ["SER", "THR", "ASN", "GLN", "TYR", "CYS", "GLY"],
    "positive": ["LYS", "ARG", "HIS"],
    "negative": ["ASP", "GLU"]
  }
}

With this definition, a triad containing LEU, ILE, and VAL will be represented using the shared class hydrophobic rather than the individual residue names.

This can increase the tolerance of the comparison by allowing chemically similar residues to match even when their exact identities differ.

Custom distance and RSA classes may also be provided. When these are defined, they replace the default discretization based on parameters such as distance_bin_width, rsa_bin_width, and their associated tolerances.

Note

The classes section is optional. If it is omitted, residues, distances, and RSA values are handled using their default representations.

Note

Residue classes affect triad tokens directly. As a result, changing these definitions may significantly alter the number and type of associations detected.