Operators¶

Operator blocks transform molecules within a workflow. Most are 1:1 transforms; some (conformer generation, ionization, stereo enumeration) produce multiple outputs per input.

Block properties to keep in mind when building workflows:

Required Inputs — file paths or text values that must be provided before running the workflow. Pass them as constructor keyword arguments in Python (MyBlock(key="value")), or via the MCP agent using run_workflow set_inputs.
Output Properties — named properties attached to each output molecule when the block is included. Downstream blocks (filters, selectors, score blocks) can reference these by name.
Mutable Parameters — numeric or categorical settings tuned automatically by Bayesian optimization. Set defaults at construction; the optimizer adjusts them during optimize_workflow.

Note: By convention, the constructor signatures list only the extra keyword arguments explicitly. All required inputs and mutable parameters can also be passed as keyword arguments.

Standardization¶

`cmxflow.operators.standardize.MoleculeStandardizeBlock(canonicalize_tautomers: bool = False, **kwargs: Any)` ¶

Bases: MoleculeBlock

Standardize molecules for drug discovery preprocessing.

Applies a standard pipeline: metal disconnection, normalization, salt/fragment removal, and charge neutralization. Optionally canonicalizes tautomers.

Example

workflow.add(
    MoleculeSourceBlock(),
    MoleculeStandardizeBlock(),
    MoleculeSinkBlock()
)

Initialize the MoleculeStandardizeBlock.

Parameters:

Name	Type	Description	Default
`canonicalize_tautomers`	`bool`	Whether to canonicalize tautomers.	`False`
`**kwargs`	`Any`	Additional keyword arguments passed to set_inputs.	`{}`

Deduplication¶

`cmxflow.operators.dedup.MoleculeDeduplicateBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Remove duplicate molecules from a stream based on canonical SMILES.

Keeps the first occurrence and discards subsequent duplicates. Uses RDKit canonical SMILES as the deduplication key. Cannot be parallelized because it relies on shared mutable state.
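
The keep-first behavior can be pictured with a small library-free sketch. The helper name `deduplicate_first` and the use of plain strings as keys are illustrative assumptions, not part of cmxflow; the real block keys on RDKit canonical SMILES.

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def deduplicate_first(items: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Yield each item whose key has not been seen before (hypothetical helper)."""
    seen = set()  # shared mutable state: the reason this step runs serially
    for item in items:
        k = key(item)
        if k in seen:
            continue  # later duplicate: discard
        seen.add(k)
        yield item


# Strings stand in for molecules; the key is the string itself,
# as if every entry were already a canonical SMILES.
stream = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "c1ccccc1"]
unique = list(deduplicate_first(stream, key=lambda s: s))
print(unique)
```

Because the `seen` set must observe every molecule in stream order, splitting the stream across workers would let the same key slip through more than once.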

Example

workflow.add(
    MoleculeSourceBlock(),
    MoleculeDeduplicateBlock(),
    MoleculeSinkBlock()
)

Initialize the de-duplication block.

RDKit Methods¶

`cmxflow.operators.method.RDKitBlock(method: Callable[[Chem.Mol], Any] | str, name: str | None = None, **kwargs)` ¶

Bases: MoleculeBlock

Apply an arbitrary RDKit method to each molecule in the stream.

The method can be a callable or a dot-separated string path (e.g., "rdkit.Chem.Descriptors.MolWt"). If the method returns a Mol, it replaces the molecule; if it returns a scalar (int, float, str, bool), the value is stored as a molecule property; if it returns None, the molecule is filtered out.
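
The two ideas in this paragraph, resolving a dot-separated path and dispatching on the return type, can be sketched without RDKit. `FakeMol`, `resolve_dotted`, and `apply_method` are hypothetical names for illustration only, and `math.sqrt` stands in for an RDKit function; the handling of unsupported return types is an assumption.

```python
import importlib
from typing import Any, Callable, Dict, Optional


def resolve_dotted(path: str) -> Callable[..., Any]:
    """Import the module part of 'pkg.module.attr' and return the attribute."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class FakeMol:
    """Stand-in for a molecule: a name plus a property dictionary."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.props: Dict[str, Any] = {}


def apply_method(
    mol: FakeMol, method: Callable[[FakeMol], Any], prop_name: str
) -> Optional[FakeMol]:
    """Apply `method` and return the molecule to pass on, or None to drop it."""
    result = method(mol)
    if result is None:
        return None  # filtered out of the stream
    if isinstance(result, FakeMol):
        return result  # the returned molecule replaces the input
    if isinstance(result, (int, float, str, bool)):
        mol.props[prop_name] = result  # scalar stored as a property
        return mol
    # Assumption: other return types are rejected; the source does not say.
    raise TypeError(f"unsupported result type: {type(result).__name__}")


sqrt = resolve_dotted("math.sqrt")  # same lookup idea as a dotted RDKit path
kept = apply_method(FakeMol("ethanol"), lambda m: len(m.name), "name_length")
dropped = apply_method(FakeMol("water"), lambda m: None, "unused")
print(sqrt(9.0), kept.props, dropped)
```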

Output Properties

<method_name>: Scalar result stored as a molecule property. The key is the method name (or the explicit name argument).

Example

workflow.add(
    MoleculeSourceBlock(),
    RDKitBlock("rdkit.Chem.Descriptors.MolWt"),
    MoleculeSinkBlock()
)

Initialize with an RDKit method.

Parameters:

Name	Type	Description	Default
`method`	`Callable[[Mol], Any] | str`	RDKit method as callable or string path (e.g., "rdkit.Chem.Descriptors.MolWt").	required
`name`	`str | None`	Optional property name for scalar results. Defaults to the method name extracted from callable or path.	`None`

Filtering¶

`cmxflow.operators.filter.SubstructureFilterBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Filter molecules based on substructure matches.

Molecules are flagged using SMARTS patterns and/or built-in RDKit filter catalogs (PAINS, BRENK, NIH, ZINC). Uses OR logic: a molecule is flagged if it matches any pattern or catalog entry. The mode input controls whether matching molecules are removed or kept.

Required Inputs

query (text): Space-separated catalog names and/or SMARTS patterns.
mode (text): "remove" (default) drops matches; "keep" retains only matches.
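
The interaction between OR logic and `mode` can be sketched as follows. This is only an illustration: naive substring tests stand in for SMARTS and catalog matching (which they are not equivalent to), and `substructure_filter` is a hypothetical helper, not the block's implementation.

```python
from typing import Callable, List, Sequence

Matcher = Callable[[str], bool]


def substructure_filter(
    smiles_list: Sequence[str], matchers: Sequence[Matcher], mode: str = "remove"
) -> List[str]:
    """Keep or drop entries flagged by ANY matcher, depending on `mode`."""
    if mode not in ("remove", "keep"):
        raise ValueError(f"mode must be 'remove' or 'keep', got {mode!r}")
    kept = []
    for smi in smiles_list:
        flagged = any(match(smi) for match in matchers)  # OR logic
        # "keep" passes flagged entries; "remove" passes unflagged ones.
        if flagged == (mode == "keep"):
            kept.append(smi)
    return kept


# Stand-in matchers: plain substring tests, NOT real SMARTS matching.
matchers = [lambda s: "N" in s, lambda s: "Cl" in s]
mols = ["CCO", "CCN", "CCCl"]

removed_mode = substructure_filter(mols, matchers, mode="remove")
keep_mode = substructure_filter(mols, matchers, mode="keep")
print(removed_mode, keep_mode)
```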

Example

workflow.add(
    MoleculeSourceBlock(),
    SubstructureFilterBlock(query="PAINS BRENK", mode="remove"),
    MoleculeSinkBlock()
)

Initialize the SubstructureFilterBlock.

`cmxflow.operators.filter.PropertyFilterBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Filter molecules based on numeric property conditions.

Conditions are specified in the filters input text using AND logic — a molecule must satisfy every condition to pass. Supported syntax: simple comparisons (MW>200), reversed comparisons (200<MW), range expressions (200<MW<500), and comma-separated multiple conditions. Supported operators: <, >, <=, >=, ==, !=.

Required Inputs

filters (text): Comma-separated filter expressions (e.g. "200<MolWt<500, logP>0").
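
One way to read the expression syntax is as a list of binary comparisons, with a range such as 200<MolWt<500 expanding into two of them. The parser below is a minimal sketch under that reading; `parse_filters` and `passes` are hypothetical helpers, and treating a missing property as a failed condition is an assumption the source does not confirm.

```python
import operator
import re
from typing import Dict, List, Tuple, Union

Operand = Union[float, str]
Condition = Tuple[Operand, str, Operand]

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

# Two-character operators come first so "<=" is not read as "<" then "=".
_SPLIT = re.compile(r"\s*(<=|>=|==|!=|<|>)\s*")


def _operand(text: str) -> Operand:
    """Numbers become floats; anything else is treated as a property name."""
    try:
        return float(text)
    except ValueError:
        return text


def parse_filters(text: str) -> List[Condition]:
    """Expand 'a<X<b, Y>c' into (left, op, right) comparisons."""
    conditions: List[Condition] = []
    for expr in (e.strip() for e in text.split(",")):
        if not expr:
            continue
        parts = _SPLIT.split(expr)  # operands at even indices, operators at odd
        if len(parts) not in (3, 5) or any(not p for p in parts[0::2]):
            raise ValueError(f"cannot parse filter expression: {expr!r}")
        operands = [_operand(p) for p in parts[0::2]]
        for i, op in enumerate(parts[1::2]):
            conditions.append((operands[i], op, operands[i + 1]))
    return conditions


def passes(props: Dict[str, float], conditions: List[Condition]) -> bool:
    """AND logic: every comparison must hold."""
    for left, op, right in conditions:
        try:
            lhs = left if isinstance(left, float) else float(props[left])
            rhs = right if isinstance(right, float) else float(props[right])
        except KeyError:
            return False  # assumption: a missing property fails the filter
        if not OPS[op](lhs, rhs):
            return False
    return True


conds = parse_filters("200<MolWt<500, logP>0")
print(conds)
print(passes({"MolWt": 320.4, "logP": 2.1}, conds))
```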

Example

workflow.add(
    MoleculeSourceBlock(),
    RDKitBlock("rdkit.Chem.Descriptors.MolWt"),
    PropertyFilterBlock(filters="200<MolWt<500"),
    MoleculeSinkBlock()
)

Initialize the PropertyFilterBlock.

Selection¶

`cmxflow.operators.select.PropertyHeadBlock(**kwargs: Any)` ¶

Bases: PropertySelectBlock

Return molecules with the highest values of a specified property.

Collects all input molecules, sorts by the specified property in descending order, and yields the top N (highest values first).

Required Inputs

property (text): Name of the molecule property to sort by.
count (text): Number of molecules to return. 0 or empty returns all, sorted.
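
The collect, sort, and slice behavior can be sketched with dictionaries standing in for molecules. `select_by_property` is a hypothetical helper; its `highest` flag covers both the head (descending) and tail (ascending) variants. The sketch assumes every molecule carries the property, since the source does not say how missing values are handled.

```python
from typing import Any, Dict, List


def select_by_property(
    mols: List[Dict[str, Any]], prop: str, count: str = "", highest: bool = True
) -> List[Dict[str, Any]]:
    """Sort by `prop` and return the first N; "0" or "" returns all, sorted."""
    ranked = sorted(mols, key=lambda m: float(m[prop]), reverse=highest)
    n = int(count) if count.strip() else 0  # count arrives as text
    return ranked if n == 0 else ranked[:n]


mols = [
    {"name": "a", "MolWt": 180.2},
    {"name": "b", "MolWt": 412.5},
    {"name": "c", "MolWt": 95.1},
    {"name": "d", "MolWt": 305.0},
]
head = [m["name"] for m in select_by_property(mols, "MolWt", count="2")]
tail = [m["name"] for m in select_by_property(mols, "MolWt", count="2", highest=False)]
everything = [m["name"] for m in select_by_property(mols, "MolWt", count="0")]
print(head, tail, everything)
```

Because the sort needs the full set, a selector like this cannot emit anything until the upstream stream is exhausted.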

Example

workflow.add(
    MoleculeSourceBlock(),
    RDKitBlock("rdkit.Chem.Descriptors.MolWt"),
    PropertyHeadBlock(property="MolWt", count="10"),
    MoleculeSinkBlock()
)

Initialize the PropertyHeadBlock.

`cmxflow.operators.select.PropertyTailBlock(**kwargs: Any)` ¶

Bases: PropertySelectBlock

Return molecules with the lowest values of a specified property.

Collects all input molecules, sorts by the specified property in ascending order, and yields the bottom N (lowest values first).

Required Inputs

property (text): Name of the molecule property to sort by.
count (text): Number of molecules to return. 0 or empty returns all, sorted.

Example

workflow.add(
    MoleculeSourceBlock(),
    RDKitBlock("rdkit.Chem.Descriptors.MolWt"),
    PropertyTailBlock(property="MolWt", count="10"),
    MoleculeSinkBlock()
)

Initialize the PropertyTailBlock.

2D Similarity¶

`cmxflow.operators.sim2d.MoleculeSimilarityBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Compute 2D fingerprint similarity against a set of query molecules.

For each input molecule, computes the maximum fingerprint similarity across all query molecules and attaches the score and best-matching query name as properties.
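
The max-over-queries step can be sketched with fingerprints represented as sets of on-bit indices and Tanimoto as the metric. The helper names and the toy bit sets are assumptions for illustration; the returned dictionary reuses the output property names documented for this block.

```python
from typing import Dict, FrozenSet, Union


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient on sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_query(
    fp: FrozenSet[int], queries: Dict[str, FrozenSet[int]]
) -> Dict[str, Union[str, float]]:
    """Return the best score and the name of the query that produced it."""
    name, score = max(
        ((qname, tanimoto(fp, qfp)) for qname, qfp in queries.items()),
        key=lambda pair: pair[1],
    )
    return {"max_similarity": score, "most_similar_query": name}


# Toy fingerprints: each set lists the indices of bits that are on.
queries = {
    "ref_a": frozenset({1, 2, 3, 4, 5, 6}),
    "ref_b": frozenset({3, 4, 7}),
}
result = best_query(frozenset({1, 2, 3, 4}), queries)
print(result)
```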

Required Inputs

queries (file): Path to query molecule file (SDF, SMILES, etc.).

Output Properties

max_similarity: Maximum similarity score to any query molecule.
most_similar_query: Name or index of the most similar query molecule.

Example

workflow.add(
    MoleculeSourceBlock(),
    MoleculeSimilarityBlock(queries="reference_ligands.sdf"),
    MoleculeSinkBlock()
)

Mutable Parameters

fingerprint_type: Fingerprint algorithm (morgan, rdkit, maccs, atom_pair, topological_torsion).
similarity_metric: Similarity function (tanimoto, dice, cosine, sokal, russel).
radius: Morgan fingerprint radius (1–4).
nbits: Fingerprint bit length (512–4096).

Initialize the similarity search block.

3D Similarity¶

`cmxflow.operators.sim3d.Molecule3DSimilarityBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Compute 3D molecular similarity against a set of query molecules.

Both input and query molecules must have pre-existing 3D conformers. For each input molecule, computes maximum similarity across all conformer pairs and attaches the result as properties.

Required Inputs

query (file): Path to query molecule file with 3D conformers.

Output Properties

similarity_3d: Maximum 3D similarity score to any query conformer.
most_similar_query_3d: Name of the most similar query molecule.
similarity_3d_method: Similarity method used.
similarity_3d_conf_id: Conformer ID that gave the best similarity.

Example

workflow.add(
    MoleculeSourceBlock(),
    EnumerateStereoBlock(),
    ConformerGenerationBlock(),
    MoleculeAlignBlock(query="reference.sdf"),
    Molecule3DSimilarityBlock(query="reference.sdf"),
    MoleculeSinkBlock()
)

Mutable Parameters

method: Similarity method (shape_tanimoto, shape_tversky, usr, usrcat).
tversky_alpha: Tversky alpha parameter (0.0–1.0).
tversky_beta: Tversky beta parameter (0.0–1.0).

Initialize the 3D similarity block.

Ionization¶

`cmxflow.operators.ionize.IonizeMoleculeBlock(ph_min: float = 6.4, ph_max: float = 8.4, **kwargs: Any)` ¶

Bases: MoleculeBlock

Generate pH-dependent ionization states using dimorphite_dl.

This is a 1:N transform: one input molecule can yield multiple protonation variants. Includes automatic correction for tertiary amide nitrogens that dimorphite_dl incorrectly protonates.

When the input carries a 3D conformer, that conformer is preserved. dimorphite_dl changes only formal charges and hydrogen counts on an unchanged heavy-atom skeleton, so the protonation state is transferred back onto the original 3D heavy atoms (matched exactly by atom-map number, not by substructure search), and any hydrogens added during protonation get coordinates derived from the heavy-atom geometry. Inputs without a 3D conformer take the plain SMILES path.

Example

workflow.add(
    MoleculeSourceBlock(),
    IonizeMoleculeBlock(),
    MoleculeSinkBlock()
)

Mutable Parameters

precision: pH precision window around min/max (0.1–3.0).
max_variants: Maximum number of ionization variants per molecule (1–128).

Initialize the IonizeMoleculeBlock.

Parameters:

Name	Type	Description	Default
`ph_min`	`float`	Minimum pH for protonation.	`6.4`
`ph_max`	`float`	Maximum pH for protonation.	`8.4`
`**kwargs`	`Any`	Additional keyword arguments passed to set_inputs.	`{}`

Stereoisomers¶

`cmxflow.operators.confgen.EnumerateStereoBlock()` ¶

Bases: MoleculeBlock

Enumerate all stereoisomers of each input molecule.

This is a 1:N transform that yields all possible stereoisomers for each input molecule. Properties from the input molecule are copied to each output stereoisomer.

Example

workflow.add(
    MoleculeSourceBlock(),
    EnumerateStereoBlock(),
    MoleculeSinkBlock()
)

Initialize the stereoisomer enumeration block.

Conformer Generation¶

`cmxflow.operators.confgen.ConformerGenerationBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Generate 3D conformers using RDKit's ETKDGv3 algorithm.

Molecules must have fully specified stereochemistry before conformer generation. Use EnumerateStereoBlock upstream to resolve any unspecified stereocenters.

Example

workflow.add(
    MoleculeSourceBlock(),
    EnumerateStereoBlock(),
    ConformerGenerationBlock(),
    MoleculeSinkBlock()
)

Mutable Parameters

numConfs: Number of conformers to generate (1–100).
pruneRmsThresh: RMS threshold for pruning similar conformers (0.0–3.0).
useRandomCoords: Use random initial coordinates instead of distance geometry.

Initialize the conformer generation block.

Alignment¶

`cmxflow.operators.align.MoleculeAlignBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Align input molecule conformers to a set of 3D reference molecules.

For each input molecule, all conformers are aligned to all reference conformers using the selected method. The single best-scoring conformer (highest shape Tanimoto after alignment) is retained and returned. Input molecules must already have 3D conformers.

Required Inputs

query (file): Path to reference molecule file with 3D conformers.

Output Properties

alignment_shape_similarity: Shape Tanimoto similarity of the best-aligned conformer.
alignment_score: RMSD of the best alignment.
alignment_reference: Name of the reference molecule used.
alignment_method: Alignment algorithm used.
alignment_ref_index: Index of the reference molecule used.
alignment_mcs: MCS SMARTS pattern (only present when method is mcs).

Example

workflow.add(
    MoleculeSourceBlock(),
    EnumerateStereoBlock(),
    ConformerGenerationBlock(),
    MoleculeAlignBlock(query="reference.sdf"),
    MoleculeSinkBlock()
)

Mutable Parameters

alignment_method: Alignment algorithm (crippen_o3a, mmff_o3a, mcs).

Initialize the molecular alignment block.

Docking¶

`cmxflow.operators.dock.MoleculeDockBlock(score_components: bool = True, score_strain: bool = False, score_only: bool = False, receptor_most_populated_only: bool = False, budget: int = 1, **kwargs: Any)` ¶

Bases: MoleculeBlock

MoleculeBlock for docking ligands into protein binding sites.

Performs pose optimization using an empirical scoring function with configurable parameters. Supports both rigid-body and flexible (torsional) docking. Electrostatic complementarity (EC) is evaluated once on the final pose and reported as docking_ec — a standalone score, never part of the search.

Two modes, both requiring a receptor and a site_reference:

Free docking (index_poses=False, default): multi-start search (n_starts) anchored on the site-reference centroid.
Scaffold-indexed docking (index_poses=True): the first molecule of each Bemis-Murcko scaffold is docked fully and its core pose cached; later siblings transfer that pose and run a single constrained local search — much faster for congeneric series, and series-consistent. The site_reference ligand is seeded as the first scaffold entry, so its experimentally grounded pose is the preferred template.
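
The control flow of scaffold-indexed docking can be sketched as a cache lookup keyed by scaffold. Everything here is a stand-in: `dock_with_scaffold_index`, the string "poses", and the name-prefix scaffold function are hypothetical, and the in-memory dictionary replaces whatever persistent index the block actually uses.

```python
from typing import Callable, Dict, List, Optional, Tuple


def dock_with_scaffold_index(
    mols: List[str],
    scaffold_of: Callable[[str], str],
    full_dock: Callable[[str], str],
    constrained_dock: Callable[[str, str], str],
    index: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str, str]]:
    """Return (molecule, mode, pose) per input, filling `index` as it goes."""
    index = {} if index is None else index
    results = []
    for mol in mols:
        scaffold = scaffold_of(mol)
        if scaffold in index:
            # Sibling: transfer the cached pose, then one constrained search.
            pose = constrained_dock(mol, index[scaffold])
            results.append((mol, "constrained", pose))
        else:
            # First molecule of this scaffold: full search, then cache the pose.
            pose = full_dock(mol)
            index[scaffold] = pose
            results.append((mol, "full", pose))
    return results


# Stand-ins: the scaffold is the name prefix, and poses are plain strings.
seeded = {"benzamide": "pose(crystal_ligand)"}  # reference ligand seeded first
runs = dock_with_scaffold_index(
    ["benzamide-1", "indole-1", "indole-2", "benzamide-2"],
    scaffold_of=lambda name: name.split("-")[0],
    full_dock=lambda mol: f"pose({mol})",
    constrained_dock=lambda mol, template: f"pose({mol} from {template})",
    index=seeded,
)
modes = [mode for _, mode, _ in runs]
print(modes)
```

Seeding the index before the loop is what makes the reference ligand's pose the template for every molecule that shares its scaffold.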

Required Inputs

receptor (file): Path to receptor PDB file.
site_reference (file): Molecule file (.sdf, .mol2, etc.) whose heavy-atom centroid defines the binding site center. Sobol restart samples are anchored to this pocket, so molecules dock from a freshly generated conformer and no preceding alignment step is required. It is also the reference template seeded into the scaffold index when index_poses=True. Technically optional: omit it only for MCS/overlay refinement workflows where the input pose is already in the binding site (the search then recenters on the input conformer position).

Output Properties

docking_initial_pose_score: Score before optimization.
docking_score: Final optimized score (empirical + EC adjustment, plus ligand strain when score_strain=True).
docking_empirical: Pure empirical score (without EC term).
docking_ec: Electrostatic complementarity of the final pose, in [-1, 1] (0.0 only when EC protein data is unavailable).
docking_strain: Ligand strain penalty — intramolecular energy added vs the input conformer (>=0). Reported regardless of score_strain.
docking_converged: Whether optimization converged.

When score_components=True (default), the block also writes the raw (pre-torsion-divisor) weighted terms, matching smina's term log; the torsion divisor is applied only to docking_score/docking_empirical:
docking_gauss1: Gaussian term (weight * raw sum, no torsion divisor).
docking_repulsion: Repulsion term (weight * raw sum, no torsion divisor).
docking_hydrophobic: Hydrophobic term (weight * raw sum, no torsion divisor).
docking_hbond: H-bond term (weight * raw sum, no torsion divisor).
docking_n_rot: Torsional entropy energetic term (w_rot * N_rot).
docking_scoring_function: Scoring weights used, for reproducibility.

Example

# site_reference recenters the search, so a fresh conformer docks
# directly — no MoleculeAlignBlock required.
workflow.add(
    MoleculeSourceBlock(),
    EnumerateStereoBlock(),
    ConformerGenerationBlock(),
    MoleculeDockBlock(
        receptor="protein.pdb",
        site_reference="crystal_ligand.sdf",
    ),
    MoleculeSinkBlock(),
)

Mutable Parameters

w_gauss1: Vinardo Gaussian attractive term weight.
w_repulsion: Vinardo repulsion term weight.
w_hydrophobic: Vinardo hydrophobic term weight.
w_hbond: Vinardo hydrogen bond term weight.
w_rot: Torsional entropy divisor weight (0=pure Vinardo, 0.02=smina default).
n_starts: Number of L-BFGS-B restarts. 1 = local minimize from the input pose only. For blind docking (with site_reference), use 1+2^k for ideal Sobol balance: 3, 5, 9, 17, 33, 65. Row 0 always minimizes from the aligned pose; rows 1+ sample the binding site box.
basin_hops: Iterated-local-search refinement steps per restart (0 = single minimize). Higher finds lower-energy poses at more cost.
basin_hop_starts: Number of top-scoring minimized starts to carry into basin hopping (two-stage: minimize all starts, hop only the best few). 0 (default) or >= n_starts hops every start. No effect when basin_hops = 0.
max_iterations: Maximum L-BFGS-B iterations per restart.
box_size: Translation search box half-width in Angstroms (default 5.0). Centred on site_reference centroid when provided, otherwise on the input conformer position.
rigid: If True, only rigid-body optimization (no torsions).
index_poses: If True, scaffold-indexed (template) docking (see above). A mode toggle, not a search dimension — freeze it during optimization.
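
The 1+2^k recommendation for n_starts follows from row 0 being the input pose and the remaining rows being Sobol samples, which are best balanced when their count is a power of two. The short sketch below reproduces the listed values and checks an arbitrary candidate; both helper names are hypothetical.

```python
from typing import List


def balanced_n_starts(max_k: int = 6) -> List[int]:
    """n_starts values of the form 1 + 2**k for k = 1..max_k."""
    return [1 + 2**k for k in range(1, max_k + 1)]


def is_balanced(n_starts: int) -> bool:
    """True when n_starts - 1 is a power of two that is at least 2."""
    sobol_rows = n_starts - 1  # row 0 is the input pose, not a Sobol sample
    return sobol_rows >= 2 and (sobol_rows & (sobol_rows - 1)) == 0


values = balanced_n_starts()
print(values)
print([n for n in range(1, 20) if is_balanced(n)])
```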

Initialize the molecular docking block.

Parameters:

Name	Type	Description	Default
`score_components`	`bool`	If True (default), write per-term weighted score components as SDF properties on each docked molecule.	`True`
`budget`	`int`	Integer sampling-budget multiplier on the per-conformer orientation budget (default 1). Scales both the center orientations and the per-offset spread together, so a larger search box (blind docking) is met by raising budget rather than starving each translation offset. Denominated for RAM: at the schema ceiling (~100 heavy atoms, n_orientation_samples at its 3200 max, default max_confs) each unit costs roughly 1 GB of RAM per process, so set budget to about the GB you can allot per worker. A resource/deployment knob, not an optimized parameter.	`1`
`score_only`	`bool`	If True, skip pose optimization entirely and score the input pose as-is. Useful for rescoring pre-docked poses and for isolating scoring-function cost from the search. Not a mutable parameter -- it changes the behavior of the forward pass only.	`False`
`score_strain`	`bool`	If True, add the ligand strain penalty (intramolecular energy added vs the input conformer, >=0) into `docking_score` and into multistart selection. Default False keeps `docking_score` purely intermolecular (smina-comparable). The strain value is always written as `docking_strain` regardless.	`False`
`receptor_most_populated_only`	`bool`	If True, only the most populated altloc in the loaded PDB file is used. If False, all altlocs are used, with scoring weighted by occupancy.	`False`
`**kwargs`	`Any`	Passed to `set_inputs`. Accepts the inputs `receptor` and `site_reference` (file paths) and any mutable parameter by name (`n_starts`, `basin_hops`, `max_iterations`, `box_size`, `rigid`, score weights, and `index_poses`). `index_poses` is a bool (default `False`): when `True` the block runs in scaffold-indexed (template) docking mode -- the first molecule of each Bemis-Murcko scaffold is docked fully and its scaffold pose cached at `./.cmxflow/scaffold_index.db`; later molecules sharing that scaffold transfer the cached pose and run a single constrained local search (faster for congeneric series, and series-consistent). The cache persists across runs and is reused. Cache keys are namespaced by the docking parameters and the receptor/reference paths, so changing score weights or search settings, or pointing at a different target/site, never reuses a stale pose. `index_poses` is a mode toggle: leave it out of the optimized parameter space.	`{}`

Clustering¶

`cmxflow.operators.cluster.RepresentativeClusterBlock(**kwargs: Any)` ¶

Bases: MoleculeBlock

Assign molecules to clusters using streaming leader clustering.

For each molecule, computes an ECFP4 fingerprint (or Murcko scaffold fingerprint) and compares it against all existing cluster representatives via Tanimoto similarity. If the best similarity meets the threshold, the molecule joins that cluster; otherwise a new cluster is created. All molecules pass through annotated with cluster metadata.
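
Streaming leader clustering as described above can be sketched with sets of on-bit indices standing in for ECFP4 fingerprints. `leader_cluster` is a hypothetical helper, and reporting a similarity of 1.0 for a molecule that founds a new cluster is an assumption.

```python
from typing import FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient on sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(
    fps: List[FrozenSet[int]], threshold: float = 0.5
) -> List[Tuple[int, float]]:
    """Return (cluster_id, similarity_to_representative) for each fingerprint."""
    leaders: List[FrozenSet[int]] = []  # one representative per cluster
    assignments = []
    for fp in fps:
        best_id, best_sim = -1, 0.0
        for cid, leader in enumerate(leaders):
            sim = tanimoto(fp, leader)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= threshold:
            assignments.append((best_id, best_sim))  # join the best cluster
        else:
            leaders.append(fp)  # this molecule becomes a new representative
            assignments.append((len(leaders) - 1, 1.0))  # assumed self-similarity
    return assignments


fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 12, 13}),
]
clusters = leader_cluster(fps, threshold=0.5)
print(clusters)
```

Since representatives are fixed as soon as they are created, the result depends on input order: a different ordering of the same molecules can produce different clusters.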

Output Properties

cluster_id: Integer index of the assigned cluster.
cluster_representative: SMILES of the cluster's representative molecule.
cluster_similarity: Tanimoto similarity to the cluster representative.

Example

workflow.add(
    MoleculeSourceBlock(),
    RepresentativeClusterBlock(),
    MoleculeSinkBlock()
)

Mutable Parameters

threshold: Tanimoto similarity threshold for cluster assignment (0.05–0.95).
scaffold: If True, cluster by Murcko scaffold fingerprint instead of full molecule.

Initialize the representative cluster block.
