COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES/SDF/MOL2/XYZ workflows, 2D depiction, native 3D conformer generation, UFF/MMFF optimization, fingerprints, batch processing, and protein-focused structural biology APIs.
The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms are explicit about whether they return new values or mutate in place.
COSMolKit is designed for array-oriented structural data access, keeping molecular data efficient and natural for NumPy, PyTorch, and model-building workflows.
- Python documentation: https://kit.cosmol.org/
- Rust crate notes: `crates/cosmolkit/README.md`

Install with pip:

```bash
pip install cosmolkit
```

- Value-style molecules: methods such as `with_hydrogens()`, `without_hydrogens()`, `with_kekulized_bonds()`, and `with_2d_coordinates()` return new molecule values.
- Explicit mutation: in-place `Molecule` operations always end with `_`. The trailing underscore has no other public `Molecule` meaning.
- Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
- Batch-native processing: `MoleculeBatch` keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
- Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.
- Source-backed 3D workflows: conformer generation and UFF/MMFF optimization are available through the public Python API.
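The value-style and trailing-underscore conventions listed above can be illustrated without COSMolKit itself. The sketch below uses a hypothetical `ToyMolecule` class with made-up methods `with_atom()` and `add_atom_()`; none of these names belong to the library, and the class only demonstrates the naming contract, not any chemistry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyMolecule:
    """Hypothetical stand-in for a molecule; NOT the COSMolKit class."""

    atoms: List[str] = field(default_factory=list)

    def with_atom(self, symbol: str) -> "ToyMolecule":
        # Value-style: build and return a new object, leaving `self` untouched.
        return ToyMolecule(atoms=[*self.atoms, symbol])

    def add_atom_(self, symbol: str) -> None:
        # In-place: the trailing underscore signals that `self` is mutated.
        self.atoms.append(symbol)


original = ToyMolecule(["C", "C", "O"])
extended = original.with_atom("H")  # new value; `original` is unchanged

scratch = ToyMolecule(["C"])
scratch.add_atom_("H")  # mutates `scratch` itself
```

With this split, a reader can tell from the method name alone whether a call site can affect other references to the same object.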
Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while COSMolKit can share unchanged internal storage efficiently.
```python
from cosmolkit import Molecule

mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()  # returns a new molecule; `mol` is not modified
assert mol is not mol_h
```

A broader example covering depiction, 3D embedding, fingerprints, and batch processing:

```python
from cosmolkit import Molecule, MoleculeBatch

mol = Molecule.from_smiles("c1ccccc1O")  # phenol

# 2D coordinates and SMILES output
mol_2d = mol.with_2d_coordinates()
print(mol_2d.to_smiles())
print(mol_2d.coordinates_2d())

# 3D conformer on a copy with explicit hydrogens
mol_3d = mol.with_hydrogens().with_3d_conformer()
print(mol_3d.coordinates_3d().shape)

# Depiction
svg = mol_2d.to_svg(width=400, height=300)
mol_2d.write_png("phenol.png", width=400, height=300)

# Morgan fingerprint
fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())

# Ordered batch with per-record error handling
batch = (
    MoleculeBatch.from_smiles_list(
        ["CCO", "c1ccccc1", "CC(=O)O"],
        sanitize=True,
        errors="keep",
    )
    .with_parallel_jobs(8)
    .with_progress_bar(False)
)

prepared = batch.with_hydrogens(errors="keep").with_2d_coordinates(errors="keep")
print(prepared.valid_mask())
print(prepared.to_smiles_list())

prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="keep",
    filenames=["ethanol", "benzene", "acetic_acid"],
)
```

Use `Protein` when the workflow is focused on protein chains rather than the full structural table.
```python
from cosmolkit import Protein

protein = Protein.from_pdb("1crn.pdb")
print(protein.num_chains())
print(protein.num_residues())
print(protein.num_atoms())

# Walk the chain -> residue hierarchy
for chain in protein.chains():
    print(chain.index(), chain.kind(), len(chain))
    for residue in chain.residues():
        print(residue.name(), residue.kind(), len(residue))
```

`SdfDataset` builds a lightweight index of SDF record byte ranges, so individual records and chunks can be read without loading an entire file into memory. Molfile-only readers such as `Molecule.read_mol()` follow RDKit `MolFromMolBlock` boundaries: they stop after the first `M  END` line and leave trailing SDF data fields to the SDF APIs.
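To make the byte-range idea concrete, here is a minimal pure-Python sketch of how such an index could work. It is not COSMolKit's implementation: the helper names `index_sdf_records`, `read_record`, and `molblock_of` are hypothetical, and the two records are toy data with empty atom blocks. The sketch relies only on the SDF convention that each record ends with a `$$$$` line and that the molfile part ends at `M  END`.

```python
import os
import tempfile
from typing import List, Tuple


def index_sdf_records(path: str) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets for every "$$$$"-terminated record.

    Only offsets are kept, so memory grows with the record count rather than
    the file size. A trailing record without "$$$$" is ignored in this sketch.
    """
    ranges = []
    start = 0
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"$$$$"):
                end = handle.tell()
                ranges.append((start, end))
                start = end
    return ranges


def read_record(path: str, byte_range: Tuple[int, int]) -> str:
    """Read one record by seeking directly to its byte range."""
    start, end = byte_range
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).decode("utf-8")


def molblock_of(record_text: str) -> str:
    """Keep lines up to and including the first "M  END" line."""
    kept = []
    for line in record_text.splitlines():
        kept.append(line)
        if line.startswith("M  END"):
            break
    return "\n".join(kept) + "\n"


# Toy records: header lines, an empty V2000 counts line, and one data field.
COUNTS = "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
record_a = "toy_a\n  sketch\n\n" + COUNTS + "M  END\n> <ID>\n1\n\n$$$$\n"
record_b = "toy_b\n  sketch\n\n" + COUNTS + "M  END\n> <ID>\n2\n\n$$$$\n"

with tempfile.TemporaryDirectory() as tmp:
    sdf_path = os.path.join(tmp, "toy.sdf")
    with open(sdf_path, "w", newline="\n") as out:
        out.write(record_a + record_b)
    ranges = index_sdf_records(sdf_path)
    second = read_record(sdf_path, ranges[1])  # random access to record 2
    second_molblock = molblock_of(second)      # data fields are dropped
```

The index costs one sequential pass over the file; after that, any record can be fetched with a single seek and read.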
```python
from cosmolkit import SdfDataset

dataset = SdfDataset.open("library.sdf")
print(len(dataset))

# Random access to a single record
record = dataset[0]
mol = record.molecule()

# Chunked iteration over the whole file
for batch in dataset.batches(size=1024, errors="keep", n_jobs=8):
    smiles = batch.to_smiles_list()
```

3D embedding and force-field optimization with explicit parameters:

```python
from cosmolkit import EmbedParameters, Molecule

mol = Molecule.from_smiles("CC(=O)NC").with_hydrogens()

params = EmbedParameters.etkdg_v3()
params.random_seed = 0xF00D
params.num_threads = 1
params.track_failures = True

# Single conformer
embedded = mol.with_3d_conformer(params)
print(embedded.num_conformers())
print(embedded.coordinates_3d().shape)
print(params.failures)

# Multiple conformers
multi = mol.with_3d_conformers(5, params)
print(multi.num_conformers())

# Optimize only when force-field parameters are available
if embedded.has_uff_params():
    uff = embedded.with_uff_optimized(max_iters=200)
    print(uff.energy())

if embedded.has_mmff_params():
    mmff = embedded.with_mmff_optimized(max_iters=200)
    print(mmff.needs_more())
```

`with_3d_conformer()` follows RDKit's ETKDG behavior for trusted molecular graphs: molecules without explicit hydrogens are embedded as heavy-atom-only conformers instead of failing or automatically adding hydrogens. Calling `with_hydrogens()` first is recommended for all-atom geometry, force-field optimization, and hydrogen-bond-sensitive workflows. Coordinate-only inputs such as XYZ blocks do not contain a bond topology and are not valid ETKDG inputs until a trusted graph has been constructed.
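The point about coordinate-only inputs is easy to see by parsing an XYZ block by hand. The sketch below is a plain-Python illustration, not the COSMolKit reader: `parse_xyz_block` is a hypothetical helper and the water coordinates are toy values. Everything the format provides is an atom count, a comment line, and one element symbol plus three coordinates per atom.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]  # (element symbol, x, y, z)


def parse_xyz_block(block: str) -> List[Atom]:
    """Parse a single XYZ block into (symbol, x, y, z) tuples."""
    lines = block.strip().splitlines()
    count = int(lines[0])  # line 1: number of atoms
    # lines[1] is a free-form comment and carries no structural data.
    atoms = []
    for raw in lines[2:2 + count]:
        symbol, x, y, z = raw.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != count:
        raise ValueError("atom count does not match the header line")
    return atoms


WATER_XYZ = """3
water (toy coordinates)
O  0.000  0.000  0.117
H  0.000  0.757 -0.469
H  0.000 -0.757 -0.469
"""

atoms = parse_xyz_block(WATER_XYZ)
# The parsed data has symbols and positions only: no bonds, bond orders,
# or formal charges, so there is no molecular graph to embed yet.
```

Any bond information has to come from a separate step, which is why a trusted graph must be constructed before such input can be embedded.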
- Molecular graph construction and inspection
- SMILES parsing and writing
- MOL/SDF reading and writing
- MOL2 reading with RDKit-style `Mol2ParserParams`
- XYZ block reading
- Hydrogen transforms and Kekulization
- Sanitization and chemistry problem detection
- 2D coordinate generation and SVG/PNG depiction
- Native 3D conformer generation with DG/KDG/ETDG/ETKDG parameter presets
- UFF/MMFF optimization of generated or imported 3D conformers
- Morgan and Avalon fingerprints
- Distance-geometry bounds matrices
- Substructure matching and SMARTS parse metadata
- Ordered batch transforms and exports
- PDB/mmCIF molecule-block parsing and protein projection APIs
- Support-status metadata for public features
COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.
- Correctness comes before breadth.
- Public transforms use value semantics.
- Mutation-capable workflows are explicit.
- Unsupported chemistry should fail clearly.
- RDKit-parity behavior is the correctness floor for supported cheminformatics features.
- High-throughput APIs should preserve input order and expose per-record failures.
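The last principle, preserving input order while exposing per-record failures, can be sketched with the standard library alone. This is an illustration of the pattern, not COSMolKit's batch engine: `run_batch` and `toy_validate` are hypothetical names, and the "validation" is a stand-in check on parentheses rather than real SMILES parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Result = Tuple[Optional[str], Optional[str]]  # (value, error message)


def toy_validate(text: str) -> str:
    """Hypothetical per-record step: reject empty or unbalanced input."""
    if not text or text.count("(") != text.count(")"):
        raise ValueError(f"cannot parse {text!r}")
    return text


def run_batch(
    records: List[str], step: Callable[[str], str], jobs: int = 4
) -> List[Result]:
    """Apply `step` to every record, keeping failures instead of raising."""

    def guarded(record: str) -> Result:
        try:
            return step(record), None
        except Exception as exc:  # keep the failure as data for this record
            return None, str(exc)

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map yields results in input order, whatever order
        # the workers happen to finish in.
        return list(pool.map(guarded, records))


results = run_batch(["CCO", "C(C", "c1ccccc1"], toy_validate)
valid_mask = [error is None for _, error in results]
```

Because every input position gets exactly one result slot, the mask and the results can be zipped back against the original inputs without any bookkeeping.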
Python examples live in `python/examples/`.
Status labels:
- ✅ available in the public Python API
- 🧪 implemented or partially available, still being hardened
- 🚧 planned / not yet public
Goal: keep the supported molecular core correct before expanding breadth.
- ✅ Molecule, atom, and bond graph model
- ✅ SMILES parsing
- ✅ SMILES writing with RDKit-style writer options for supported branches
- ✅ Ring perception, valence handling, aromaticity, and Kekulization
- ✅ Hydrogen addition and removal
- ✅ Sanitization for supported chemistry workflows
- ✅ Stereochemistry inspection for supported atom and bond states
- ✅ Distance-geometry bounds matrices
- ✅ Native 3D conformer generation and UFF/MMFF post-optimization for supported molecules
- 🧪 Morgan fingerprints and Tanimoto similarity
- 🧪 Avalon fingerprints
- 🧪 Substructure matching and Python SMARTS parse metadata
- 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
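For readers unfamiliar with the Tanimoto similarity mentioned in the list above, it is the size of the intersection of two fingerprints' on-bits divided by the size of their union. The sketch below computes it on plain Python sets; the `tanimoto` function is a hypothetical helper, not the COSMolKit API, and it assumes on-bit indices are available as integers (as the earlier `fp.on_bits()` call suggests, though the exact return type is not stated here).

```python
from typing import Iterable


def tanimoto(on_bits_a: Iterable[int], on_bits_b: Iterable[int]) -> float:
    """Tanimoto similarity |A & B| / |A | B| on sets of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: 0.0 is a convention chosen for this
        # sketch; libraries may define this case differently.
        return 0.0
    return len(a & b) / union


bits_a = {1, 5, 9, 12}
bits_b = {5, 9, 20}
similarity = tanimoto(bits_a, bits_b)  # 2 shared bits out of 5 distinct bits
```

The value ranges from 0.0 (no shared bits) to 1.0 (identical on-bit sets).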
Goal: make common molecule import, export, and visualization workflows usable from Python.
- ✅ MOL/SDF reading
- ✅ MOL2 reading
- ✅ XYZ block reading
- ✅ SDF dataset indexing for large files
- ✅ SDF writing for supported V2000/V3000 branches
- ✅ PDB block to molecule conversion
- ✅ mmCIF block to molecule conversion through the same molecule-conversion profile
- ✅ 2D coordinate generation
- ✅ SVG drawing
- ✅ PNG export
- 🧪 RDKit-style visual parity testing for supported depiction output
- 🚧 Annotation overlays and richer drawing customization
- ✅ 3D conformer generation and embedding APIs
Goal: make high-throughput molecule preparation and export a core product identity.
- ✅ Ordered `MoleculeBatch.from_smiles_list()`
- ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
- ✅ Configurable parallelism with `with_parallel_jobs()`
- ✅ Configurable progress display with `with_progress_bar()`
- ✅ Per-record errors, valid masks, and error reports
- ✅ Batch SMILES, image, and SDF export paths
- 🧪 Golden parity tests for parallel batch behavior
- 🚧 More streaming and chunked dataset workflows
Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.
- ✅ `Protein.from_pdb()` / `Protein.from_mmcif()` high-level entry points
- ✅ Protein chain, residue, and atom iteration
- ✅ Protein-only projection from broader structural data
- 🧪 PDB/mmCIF structural parsing
- 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
- 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs
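As a rough picture of what a chain, residue, and atom hierarchy looks like relative to a flat structural table, the sketch below groups PDB `ATOM` records by chain and residue using the format's fixed columns. It is an illustration only: `make_atom_line` and `group_atoms` are hypothetical helpers, the coordinates are toy values, and COSMolKit's own parser is not shown or implied.

```python
from typing import Dict, List, Tuple

ResidueKey = Tuple[int, str]  # (residue sequence number, residue name)


def make_atom_line(serial: int, name: str, res_name: str, chain: str,
                   res_seq: int, x: float, y: float, z: float) -> str:
    """Format a fixed-column PDB ATOM line (toy helper for the demo)."""
    padded_name = f" {name:<3s}"  # one-letter elements start in column 14
    return (
        f"ATOM  {serial:5d} {padded_name} {res_name:>3s} {chain:1s}"
        f"{res_seq:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def group_atoms(pdb_text: str) -> Dict[str, Dict[ResidueKey, List[str]]]:
    """Group ATOM records as chain -> residue -> list of atom names."""
    chains: Dict[str, Dict[ResidueKey, List[str]]] = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # HETATM and other record types are skipped here
        atom_name = line[12:16].strip()  # columns 13-16
        res_name = line[17:20].strip()   # columns 18-20
        chain_id = line[21]              # column 22
        res_seq = int(line[22:26])       # columns 23-26
        residues = chains.setdefault(chain_id, {})
        residues.setdefault((res_seq, res_name), []).append(atom_name)
    return chains


lines = [
    make_atom_line(1, "N", "THR", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, "CA", "THR", "A", 1, 1.5, 0.0, 0.0),
    make_atom_line(3, "N", "THR", "A", 2, 3.0, 0.0, 0.0),
    make_atom_line(4, "N", "GLY", "B", 1, 9.0, 0.0, 0.0),
]
chains = group_atoms("\n".join(lines))

for chain_id, residues in chains.items():
    for (res_seq, res_name), atom_names in residues.items():
        print(chain_id, res_seq, res_name, atom_names)
```

Python dictionaries keep insertion order, so chains and residues come back in the order they appear in the file.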
Goal: expose verified molecular behavior through a practical Python interface.
- ✅ Value-style molecule transformations
- ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
- ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
- 🧪 Type stubs and documentation coverage
- 🚧 Stable model-ready graph exports
- 🚧 NumPy / PyTorch oriented adapters
- 🚧 Molecular tokenization and AI-native geometry helpers
Goal: support lightweight chemistry workflows outside native Python processes.
- 🚧 WASM compilation target
- 🚧 JavaScript bindings
- 🚧 Browser-native SMILES/SDF parsing and depiction
COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and RDKit-parity behavior where appropriate, while offering a deterministic Python API and AI-native extension surface.