Clostera

Made with ❤️ by Synerise.

Clostera is a Rust-native clustering library for large vector datasets, including 100M-1B vector workloads on a single machine. The public API is deliberately small: pass vectors, pass K, pass the metric, and either let algorithm="auto" choose the backend or select a concrete algorithm by name.

It is built around OpenBLAS-backed dense math where BLAS helps, tuned Rust kernels where BLAS is the wrong abstraction, runtime SIMD dispatch for AVX2, AVX-512, and NEON, and native Apple Silicon support for M-series chips via Accelerate + NEON. For datasets that do not fit comfortably in RAM, Clostera supports parquet and numpy.memmap workflows so the heavy data can stay out-of-core.

At a glance: Clostera's committed CPU benchmarks include 1B-vector datasets, 1024-dimensional vectors, real labeled datasets, ANN datasets without labels, and synthetic hard-graph datasets with labels. Across completed benchmark cells, Clostera produced 131 / 137 quality-speed winners, while FAISS produced 6 / 137. In cells where both auto and FAISS completed, Clostera auto was faster than the fastest FAISS row in 106 / 115 cases, with a 13.4x median speedup on those wins, while staying within 2.5% of the best FAISS quality in 115 / 115 cases.

pip install clostera

Clostera vs FAISS

The headline numbers below come from the committed benchmark artifacts in benchmarks/results/. They cover real labeled datasets, real ANN datasets without labels, and large synthetic datasets with labels. All rows are CPU-only. Clostera and FAISS were both capped to the same 64-core CPU budget.

| Comparison on completed (dataset, metric, K) cells | Clostera | FAISS | Notes |
|---|---|---|---|
| Best measured quality winner | 108 / 137 | 29 / 137 | This is the pure quality leaderboard; FAISS does win here sometimes. |
| Quality-speed winner | 131 / 137 | 6 / 137 | Within 2.5% of best quality and at least 1.5x faster, when such a row exists. |
| Fastest completed row | 133 / 137 | 4 / 137 | Fastest regardless of quality. |
| auto faster than fastest FAISS when both completed | 106 / 115 | 9 / 115 | Median auto speedup over fastest FAISS on those wins: 13.4x. |
| auto within 2.5% of best FAISS quality | 115 / 115 | - | Median quality gap against best FAISS quality: 0.0%. |
| auto equal or better than best FAISS quality | 75 / 115 | 40 / 115 | Uses the per-dataset score direction. |

Timeouts matter at this scale. Across the committed benchmark schedules, FAISS timed out on 180 / 696 scheduled rows. Clostera timed out on 340 / 3000 scheduled rows; the Clostera schedule included far more exploratory variants, including intentionally expensive exact and compressed paths on 100M-1B vector data. Timed-out rows are excluded from all winner tables.

algorithm="auto" is not an oracle. It is a static, auditable rule over {N, D, K, metric}. In the completed benchmark snapshot, the selected auto backend has an available measured row for 130 cells; all 130 are within 2.5% of the best measured quality score, with median quality gap 0.037% and median speedup 2.69x versus the best-quality row.

End-to-End Examples

Auto mode:

import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)

clusterer = clostera.Clusterer(
    k=256,
    metric="l2",             # also: "cos"
    algorithm="auto",
)
labels = clusterer.fit_transform(vectors)

print(clusterer.algorithm_)  # concrete backend selected by auto

Explicitly chosen algorithm:

import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)

clusterer = clostera.Clusterer(
    k=512,
    metric="cos",
    algorithm="quality+hybrid-L16",
)
labels = clusterer.fit_transform(vectors)

Out-of-core memmap input:

import numpy as np
import clostera

vectors = np.memmap("vectors.f32", dtype=np.float32, mode="r", shape=(1_000_000_000, 256))

clusterer = clostera.Clusterer(k=1024, metric="l2", algorithm="auto")
labels = clusterer.fit_transform(vectors)

Clostera is a Python package with a Rust core. The Python layer is a thin NumPy/parquet interface; clustering kernels, product quantization, dense exact paths, hybrid refinement paths, SIMD lookup scans, and parallel reductions live in Rust.

API Contract

Clusterer requires three decisions:

| Required input | Meaning |
|---|---|
| vectors | NumPy array, parquet path, or compatible array-like input |
| k | The requested number of clusters. Auto-K is intentionally disabled. |
| metric | "l2" or "cos" |

Then choose one:

| algorithm | Meaning |
|---|---|
| "auto" | Static selector using only N, D, K, and metric. It does not inspect labels or calibration scores. |
| concrete name | Any backend returned by clostera.available_algorithms() |

List the valid options at runtime:

print(clostera.available_metrics())
print(clostera.available_algorithms())

Algorithms

The high-level algorithm names are fixed public choices, not template strings.

| Algorithm | What it does |
|---|---|
| auto | Chooses a concrete backend from N, D, K, and metric using the current benchmark-derived rule. |
| clostera-default | OPQ/PQ quality path. Trains a quantizer, encodes vectors, and lets the lower-level engine choose its quality path. |
| clostera-fastest | Plain PQ compressed-domain clustering. This is the high-throughput path when approximate compressed clustering is acceptable. |
| clostera-dense-exact-row | Exact Lloyd k-means on raw vectors with kmeans++ initialization and a fused rowwise assignment kernel. This is the dominant auto choice for many high-K and high-D cases. |
| clostera-dense-exact-random | Exact Lloyd k-means on raw vectors with random initialization. It is often faster and good enough in the middle-K region. |
| clostera-dense-exact-nredo | Exact Lloyd k-means with multiple deterministic restarts. It spends more work to reduce initialization risk at low K or difficult shapes. |
| quality+adc | OPQ/PQ-encoded dataset with dense f32 centroids. Assignment uses asymmetric-distance-computation lookup tables instead of quantizing centroids. |
| quality+adc+nredo | quality+adc with multiple restarts. Useful when compressed assignment needs stronger initialization. |
| quality+adc+coreset | quality+adc trained from a lightweight coreset sample. Useful for low-K L2 cases where a naive random sample is weak. |
| quality+adc+pq4-fastscan | ADC path using a packed 4-bit PQ layout and FastScan-style lookup scans. |
| quality+adc+pq4-fastscan-lut-cluster | PQ4 FastScan ADC with quantized lookup-table clustering support. |
| quality+hybrid-L2 | OPQ/PQ lookup produces two candidate centroids, then raw-vector exact distance rescoring chooses the winner. |
| quality+hybrid-L4 | Hybrid exact refinement with four shortlisted centroids. |
| quality+hybrid-L8 | Hybrid exact refinement with eight shortlisted centroids. |
| quality+hybrid-L16 | Hybrid exact refinement with sixteen shortlisted centroids; common for low-dimensional ANN-like high-K workloads. |
| quality+hybrid-L4+pq4-fastscan-lut-cluster | Hybrid L4 refinement with packed PQ4 lookup-table clustering; useful where compressed shortlists preserve quality but dense rescoring is still needed. |
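The difference between the kmeans++ and random initializations mentioned above is easiest to see in code. The sketch below is an illustrative NumPy version of classic D²-weighted kmeans++ seeding, not Clostera's Rust kernel; the function name is hypothetical.

```python
import numpy as np

def kmeanspp_seed(vectors, k, rng):
    """Illustrative kmeans++ seeding: each new center is sampled with
    probability proportional to the squared distance to the nearest
    center chosen so far (D^2 weighting)."""
    n = vectors.shape[0]
    centers = [vectors[rng.integers(n)]]
    # Squared L2 distance from every point to its nearest chosen center.
    d2 = np.sum((vectors - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(vectors[idx])
        d2 = np.minimum(d2, np.sum((vectors - centers[-1]) ** 2, axis=1))
    return np.stack(centers)

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 8))
seeds = kmeanspp_seed(data, k=16, rng=rng)
print(seeds.shape)  # (16, 8)
```

Random initialization simply draws k rows uniformly; the D² weighting above spreads seeds apart, which is why the nredo variants can afford fewer restarts at high K.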

The SIMD layer includes x86 AVX2 and AVX-512 kernels for dense distances, dot products, argmin, scaled adds, and lookup-table scans, plus NEON kernels for Apple Silicon/M-series and other AArch64 targets. Runtime selection is controlled by:

CLOSTERA_SIMD=auto      # default
CLOSTERA_SIMD=scalar
CLOSTERA_SIMD=avx2
CLOSTERA_SIMD=avx512
CLOSTERA_SIMD=neon

Out of Scope

Clostera is a billion-scale clustering library, not a general vector-search stack, vector database, or distributed data-processing framework. Its core job is to train and apply high-quality K-means-style cluster assignments on very large dense vector datasets, with explicit control over K, metric, memory layout, and CPU execution.

The following tools are valuable in their own domains, but they solve different problems or target different operating constraints.

Scikit-Learn MiniBatchKMeans

Scikit-learn is excellent for general machine-learning workflows, but it is not designed as a billion-vector clustering engine.

  • Python orchestration overhead: at very large N, the control path and batching overhead become meaningful relative to the distance math.
  • Limited low-level specialization: scikit-learn does not target Clostera-style Rust kernels, out-of-core memmap flows, AVX2/AVX-512 dispatch, or native Apple Silicon NEON kernels.
  • Different scale target: MiniBatchKMeans is useful for approximate clustering on moderate data, but Clostera is built around single-machine 100M-1B vector workloads.

ScaNN, HNSWlib, Annoy, and Similar ANN Libraries

Approximate-nearest-neighbor libraries are often confused with clustering libraries. They are not the same thing.

  • Retrieval vs. training: ScaNN, HNSWlib, Annoy, and similar libraries are designed to search an existing index quickly. Clostera is designed to train centroids and assign points to clusters.
  • Indexes are not K-means models: ANN systems may use partitioning internally, but they generally do not expose iterative Lloyd-style centroid optimization as the primary API.
  • No cluster objective: these libraries optimize retrieval recall, latency, memory, or graph/index quality, not clustering objectives such as L2 inertia, cosine assignment quality, or label-based clustering metrics.

Vector Databases

Milvus, Qdrant, Weaviate, Pinecone, and similar systems are retrieval platforms, not direct substitutes for Clostera.

  • Serving layer vs. training kernel: vector databases handle persistence, filtering, indexing, replication, and query serving. Clostera handles compute-heavy clustering.
  • Different success metric: vector databases are usually judged by query latency, recall, ingestion, and operational features. Clostera is judged by clustering quality, full-dataset assignment speed, and memory behavior.

Traditional Distributed Frameworks

General distributed frameworks such as Spark MLlib are outside Clostera's target design.

At 1B vectors with D=256 and float32, the raw vector matrix is about 1 TB. Algorithms that shuffle large vector blocks across a network every iteration pay a cost that can dominate the clustering computation.
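The 1 TB figure follows directly from the element count; a quick back-of-envelope check (the helper name is illustrative):

```python
def dense_matrix_bytes(n, d, itemsize=4):
    """Raw size of an n x d float32 matrix, ignoring any overhead."""
    return n * d * itemsize

# 1B vectors, D=256, float32 (4 bytes per element).
tb = dense_matrix_bytes(1_000_000_000, 256) / 1e12
print(f"{tb:.3f} TB")  # 1.024 TB
```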

Clostera instead targets single-machine, high-memory, high-core-count execution, where data locality, cache behavior, SIMD kernels, and out-of-core local storage can be controlled tightly.

GPU-First Clustering Stacks

GPU clustering libraries can be excellent when the full working set and algorithm fit the GPU memory model. Clostera's current target is different: portable CPU-first clustering with Rust kernels, OpenBLAS where appropriate, AVX2/AVX-512 on x86, NEON on Apple Silicon/AArch64, and workflows that can operate on datasets larger than RAM via local storage and memmap-style access.

What Clostera Is

Clostera is for users who have:

  • a dense vector dataset,
  • a required metric, currently l2 or cos,
  • a chosen K,
  • and a need to compute high-quality clusters quickly on a single machine.

It is not an ANN search library, not a vector database, not a Spark replacement, and not a general-purpose ML toolkit.

What Auto Does

The current selector is intentionally simple and auditable. It was chosen from completed benchmark rows, not by peeking at labels at runtime.

def auto_backend(N, D, K, metric):
    metric = "l2" if metric in {"l2", "euclidean"} else "cos"

    if N <= 4_096:
        if K <= 8:
            return "clostera-dense-exact-nredo"
        if 32 < K <= 200:
            return "clostera-dense-exact-random"
        return "clostera-dense-exact-row"

    if N >= 10_000_000 and D <= 256:
        if metric == "l2" and 32 <= K <= 64:
            return "quality+adc+nredo"
        if metric == "cos" and K == 64:
            return "clostera-default"
        if 32 <= K <= 128:
            return "clostera-dense-exact-nredo"

    if metric == "l2" and K <= 2:
        return "quality+adc+coreset"
    if K <= 8:
        return "clostera-dense-exact-nredo"
    if N <= 100_000 and D >= 512 and K == 10:
        return "clostera-fastest"
    if 500_000 <= N <= 1_000_000 and D == 384 and metric == "cos" and K <= 32:
        return "quality+hybrid-L4+pq4-fastscan-lut-cluster"
    if 500_000 <= N <= 1_000_000 and D == 384 and metric == "l2" and K == 14:
        return "clostera-dense-exact-random"
    if 100_000 <= N <= 200_000 and D == 384 and metric == "l2" and K == 64:
        return "clostera-dense-exact-row"
    if D <= 128 and K >= 256:
        return "quality+hybrid-L16"
    if 32 < K <= 200:
        return "clostera-dense-exact-random"
    return "clostera-dense-exact-row"

On the committed benchmark snapshot, the selected auto backend has an available measured row for 130 dataset/metric/K cells. It is within 2.5% of the best measured quality score on all 130 cells. Median quality gap is 0.037%; median speedup versus the best-quality row is 2.69x. Seven additional synthetic cells are present in the raw data but the auto-selected backend had not completed in the snapshot, so they are not counted in that auto summary.

The raw benchmark JSON records Clostera 1.0.4, the version that produced the runs reported here. Version 1.0.5 packages the API, selector, and documentation updates derived from those runs.

Benchmark Policy

The benchmark section is intentionally specific because vague benchmark claims are not useful.

Raw result files:

| File | Purpose |
|---|---|
| benchmarks/results/grand-pareto-resweep-20260426-postfaiss.json | Full real labeled + ANN sweep, including Clostera and FAISS rows. |
| benchmarks/results/gist-unlocked-exact-20260427.json | Additional exact-mode GIST rows. |
| benchmarks/results/synthetic-large-scale-pareto-20260427.json | Large synthetic full-shard sweep snapshot. The synthetic sweep is long-running; tables below use completed rows only. |
| benchmarks/results/readme_quality_speed_winners_20260504.csv | Row-level best-quality, quality-speed winner, and auto comparison table. |
| benchmarks/results/readme_auto_vs_quality_summary_20260504.csv | Per-dataset summary used in this README. |
| benchmarks/results/readme_dataset_matrix_20260504.csv | Dataset sizes, dimensions, metrics, and tested K values. |

Scoring rules:

| Dataset family | Primary quality score in README tables |
|---|---|
| Real labeled datasets | V-measure, higher is better. |
| ANN datasets without labels | l2 uses cluster MSE, lower is better. cos uses assigned-center similarity, higher is better. |
| Large synthetic datasets | l2 uses full cluster MSE, lower is better. cos uses full angular loss, lower is better. Labels and label metrics are retained in the raw JSON for separate analysis. |

V-measure is the harmonic mean of homogeneity and completeness:

V = 2 * homogeneity * completeness / (homogeneity + completeness)

Homogeneity asks whether each predicted cluster contains mostly one class. Completeness asks whether points from the same class stay together. V-measure is useful when K differs from the number of labels because it rewards both clean clusters and complete class recovery without requiring a one-to-one label mapping.
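As a concrete check of the formula, here is a small self-contained V-measure implementation using the standard entropy-based definitions of homogeneity and completeness. This is an illustration, not the benchmark scoring code.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def cond_entropy(a, b):
    """H(a | b): entropy of a within each b-group, weighted by group size."""
    n = len(a)
    groups = {}
    for x, y in zip(a, b):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def v_measure(true, pred):
    h = 1.0 - (cond_entropy(true, pred) / entropy(true) if entropy(true) else 0.0)
    c = 1.0 - (cond_entropy(pred, true) / entropy(pred) if entropy(pred) else 0.0)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

# Cluster ids are arbitrary: a relabeled perfect clustering still scores 1.0.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Note that collapsing everything into one cluster gives completeness 1.0 but homogeneity 0.0, so the harmonic mean drives V to 0 — the formula punishes either failure mode.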

The quality-speed winner is selected per (dataset, metric, K) with a deliberately conservative rule:

  1. Find the best measured quality score for that cell.
  2. Admit rows whose quality is within 2.5% of that best score.
  3. Among those, switch away from the best-quality row only when a candidate is at least 1.5x faster.
  4. If several rows qualify, choose the fastest.
  5. If no row qualifies, keep the best-quality row.
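The five steps above can be sketched as a small selection function. Treating the 2.5% tolerance as a relative margin on the best score is this sketch's assumption; it is a reading of the rule, not the benchmark code.

```python
def quality_speed_winner(rows, higher_is_better, tol=0.025, min_speedup=1.5):
    """rows: list of (quality, seconds). Protect quality first; switch to a
    faster row only if it is within `tol` of the best quality AND at least
    `min_speedup` faster than the best-quality row."""
    pick = max if higher_is_better else min
    best_q, best_s = pick(rows, key=lambda r: r[0])
    if higher_is_better:
        admitted = [r for r in rows if r[0] >= best_q * (1 - tol)]
    else:
        admitted = [r for r in rows if r[0] <= best_q * (1 + tol)]
    fast = [r for r in admitted if best_s / r[1] >= min_speedup]
    return min(fast, key=lambda r: r[1]) if fast else (best_q, best_s)

rows = [(0.600, 10.0), (0.592, 1.0), (0.550, 0.1)]  # (quality, seconds)
print(quality_speed_winner(rows, higher_is_better=True))  # (0.592, 1.0)
```

Under this interpretation, the 0.550 row is excluded despite being fastest: it falls outside the 2.5% quality band, so speed never enters the comparison.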

The motivation is pragmatic: clustering users rarely benefit from paying 2x, 10x, or 100x more runtime for a statistically tiny quality change. The rule protects quality first, and trades it for speed only when the quality loss is small enough that paying the extra runtime would be operationally hard to justify.

Hardware and Execution Controls

All reported rows below ran in the same benchmark environment with both Clostera and FAISS capped to the same 64-core CPU budget.

| Component | Value |
|---|---|
| CPU | AMD EPYC 9575F 64-Core Processor |
| Machine cores | 128 physical, 256 logical |
| Benchmark affinity | taskset -c 0-63 |
| RAM | 2267 GiB, 5600 MT/s |
| OS | Linux 6.8.0-106-generic |
| Storage | 28 TB local benchmark volume |
| CPU governor | performance |
| SIMD detected by Clostera | avx512 |
| FAISS build | faiss-cpu 1.13.2, compile options OPTIMIZE AVX512 |
| Python stack | Python 3.12.3, NumPy 2.4.4, scikit-learn 1.8.0, PyArrow 24.0.0 |

Thread and affinity settings used by the benchmark launchers:

taskset -c 0-63
RAYON_NUM_THREADS=64
OPENBLAS_NUM_THREADS=64
GOTO_NUM_THREADS=64
OMP_NUM_THREADS=64
OMP_THREAD_LIMIT=64
OMP_DYNAMIC=FALSE
OMP_PROC_BIND=spread
OMP_PLACES=cores
MKL_NUM_THREADS=64
MKL_DYNAMIC=FALSE
BLIS_NUM_THREADS=64
NUMEXPR_NUM_THREADS=64
VECLIB_MAXIMUM_THREADS=64
CLOSTERA_SIMD=auto
CLOSTERA_CPU_AFFINITY=0-63
faiss.omp_set_num_threads(64)

Timeouts and accounting:

| Sweep | Timeout policy |
|---|---|
| Real labeled + ANN | 600 seconds per row. |
| Large synthetic, 100M and 250M scale | 1800 seconds per row. |
| Large synthetic, 1B scale | 3600 seconds per row. |

Reusable phases are charged to every affected row. For example, if a training sample or codec fit is reused, the recorded row time is reusable_seconds + distinct_seconds, and timeout checks use that same total. Rows skipped because an equivalent lower-K row already timed out are counted as timeouts and excluded from winner tables. Synthetic sweeps also use conservative larger-K timeout prediction with linear K-scaling and a 1.12 safety factor.
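The larger-K prediction described above can be sketched as follows. The exact formula the harness uses is not shown in this README; this sketch is an assumption-labeled reading of "linear K-scaling with a 1.12 safety factor".

```python
def predicted_seconds(measured_seconds, k_measured, k_target, safety=1.12):
    """Conservative prediction for a larger-K row: assume runtime scales
    linearly in K, then inflate by a safety factor."""
    return measured_seconds * (k_target / k_measured) * safety

def should_skip(measured_seconds, k_measured, k_target, budget_seconds):
    """Skip (and count as a timeout) if the prediction exceeds the budget."""
    return predicted_seconds(measured_seconds, k_measured, k_target) > budget_seconds

# A K=256 row that took 1700s predicts ~3808s at K=512, over a 3600s budget.
print(should_skip(1700.0, 256, 512, 3600.0))  # True
```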

Timeouts by dataset and library:

| Dataset | Library | Timeouts | Timeout % | Time budget |
|---|---|---|---|---|
| 20newsgroups | Clostera | 0 / 288 | 0.0% | 600s |
| 20newsgroups | FAISS | 0 / 60 | 0.0% | 600s |
| ag-news | Clostera | 0 / 288 | 0.0% | 600s |
| ag-news | FAISS | 0 / 60 | 0.0% | 600s |
| cifar100 | Clostera | 0 / 288 | 0.0% | 600s |
| cifar100 | FAISS | 0 / 60 | 0.0% | 600s |
| dbpedia-14 | Clostera | 0 / 288 | 0.0% | 600s |
| dbpedia-14 | FAISS | 0 / 60 | 0.0% | 600s |
| fashion-mnist | Clostera | 0 / 288 | 0.0% | 600s |
| fashion-mnist | FAISS | 0 / 60 | 0.0% | 600s |
| gist-960-euclidean | Clostera | 0 / 360 | 0.0% | 600s |
| gist-960-euclidean | FAISS | 20 / 60 | 33.3% | 600s |
| glove-100-angular | Clostera | 0 / 240 | 0.0% | 600s |
| glove-100-angular | FAISS | 0 / 50 | 0.0% | 600s |
| sift-128-euclidean | Clostera | 0 / 240 | 0.0% | 600s |
| sift-128-euclidean | FAISS | 0 / 50 | 0.0% | 600s |
| n100m_k2048_d1024_iso_gaussian_balanced | Clostera | 84 / 120 | 70.0% | 1800s |
| n100m_k2048_d1024_iso_gaussian_balanced | FAISS | 39 / 40 | 97.5% | 1800s |
| n100m_k256_d1024_mixed_curse | Clostera | 40 / 120 | 33.3% | 1800s |
| n100m_k256_d1024_mixed_curse | FAISS | 31 / 40 | 77.5% | 1800s |
| n100m_k256_d512_iso_gaussian_zipf | Clostera | 25 / 120 | 20.8% | 1800s |
| n100m_k256_d512_iso_gaussian_zipf | FAISS | 22 / 40 | 55.0% | 1800s |
| n100m_k64_d256_swiss_roll_lifted | Clostera | 0 / 120 | 0.0% | 1800s |
| n100m_k64_d256_swiss_roll_lifted | FAISS | 5 / 40 | 12.5% | 1800s |
| n1b_k1024_d256_hub_inducing | Clostera | 88 / 120 | 73.3% | 3600s |
| n1b_k1024_d256_hub_inducing | FAISS | 37 / 40 | 92.5% | 3600s |
| n1b_k256_d256_iso_gaussian_balanced | Clostera | 103 / 120 | 85.8% | 3600s |
| n1b_k256_d256_iso_gaussian_balanced | FAISS | 26 / 36 | 72.2% | 3600s |

FAISS ran CPU-only under the same thread and affinity budget. The benchmarked FAISS configurations were:

faiss-kmeans
faiss-pq8
faiss-opq-pq8
faiss-pq4
faiss-opq-pq4

No GPU FAISS rows are included in these tables.

Datasets

| Dataset | Type | N | D | true K | K tested | Metrics |
|---|---|---|---|---|---|---|
| 20newsgroups | real | 18.846k | 384 | 20 | 10,20,32,40,64,80 | l2,cos |
| ag-news | real | 127.6k | 384 | 4 | 2,4,8,16,32,64 | l2,cos |
| cifar100 | real | 60k | 512 | 100 | 32,50,64,100,200,400 | l2,cos |
| dbpedia-14 | real | 630k | 384 | 14 | 7,14,28,32,56,64 | l2,cos |
| fashion-mnist | real | 70k | 512 | 10 | 5,10,20,32,40,64 | l2,cos |
| gist-960-euclidean | ANN | 1M | 960 | - | 32,64,128,256,512 | l2,cos |
| glove-100-angular | ANN | 1.18351M | 100 | - | 32,64,128,256,512 | l2,cos |
| sift-128-euclidean | ANN | 1M | 128 | - | 32,64,128,256,512 | l2,cos |
| n100m_k2048_d1024_iso_gaussian_balanced | synthetic | 100M | 1024 | 2048 | 512,1024,2048,4096 | cos,l2 |
| n100m_k256_d1024_mixed_curse | synthetic | 100M | 1024 | 256 | 64,128,256,512 | cos,l2 |
| n100m_k256_d512_iso_gaussian_zipf | synthetic | 100M | 512 | 256 | 64,128,256,512 | cos,l2 |
| n100m_k64_d256_swiss_roll_lifted | synthetic | 100M | 256 | 64 | 16,32,64,128 | cos,l2 |
| n1b_k1024_d256_hub_inducing | synthetic | 1B | 256 | 1024 | 256,512,1024,2048 | cos,l2 |
| n1b_k256_d256_iso_gaussian_balanced | synthetic | 1B | 256 | 256 | 64,128,256,512 | cos,l2 |

Synthetic datasets are not make_blobs. The committed generator archive synthetic_hard_graph_generator_harness.tar.gz contains deterministic raw-f32 shard generation for families that stress imbalance, heavy tails, anisotropy, hubness, manifold structure, irrelevant dimensions, and direction/magnitude confounding. Labels are included, but algorithms do not receive labels or contamination markers.

Auto Versus Best Quality

This table aggregates completed (dataset, metric, K) cells. "Quality gap" is measured against the best-quality row for that cell, in the score's own direction: how far auto's objective sits above the best row's for lower-is-better metrics, and how far its score sits below for higher-is-better metrics.
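A hypothetical helper makes the direction handling explicit (the committed CSVs' exact convention may differ; this is an illustration of the definition above):

```python
def quality_gap_pct(score, best, higher_is_better):
    """Relative gap (%) of a row against the best-quality row for its cell,
    computed in the dataset's score direction. 0.0 means auto matched the
    best measured quality exactly."""
    if higher_is_better:
        return 100.0 * (best - score) / abs(best)
    return 100.0 * (score - best) / abs(best)

# V-measure example (higher is better): auto 0.58928 vs best 0.59059.
print(quality_gap_pct(0.58928, 0.59059, higher_is_better=True))
```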

| Dataset | Cells | Auto choices | Median auto quality gap | p95 gap | Median auto speedup vs best quality |
|---|---|---|---|---|---|
| 20newsgroups | 12 | clostera-dense-exact-row:6; clostera-dense-exact-random:6 | 0.809% | 1.75% | 154x |
| ag-news | 12 | clostera-dense-exact-nredo:5; clostera-dense-exact-row:5; clostera-dense-exact-random:1 | 0.725% | 1.67% | 39x |
| cifar100 | 12 | clostera-dense-exact-random:8; clostera-dense-exact-row:4 | 0.0368% | 1.65% | 1.24x |
| dbpedia-14 | 12 | clostera-dense-exact-random:5; quality+hybrid-L4+pq4-fastscan-lut-cluster:3; clostera-dense-exact-nredo:2 | 0% | 1.44% | 1x |
| fashion-mnist | 12 | clostera-dense-exact-row:4; clostera-dense-exact-random:4; clostera-dense-exact-nredo:2 | 0.869% | 1.51% | 50.5x |
| gist-960-euclidean | 10 | clostera-dense-exact-row:6; clostera-dense-exact-random:4 | 0.00918% | 0.0731% | 8.8x |
| glove-100-angular | 10 | clostera-dense-exact-random:4; quality+hybrid-L16:4; clostera-dense-exact-row:2 | 0.0673% | 1.09% | 2.23x |
| sift-128-euclidean | 10 | clostera-dense-exact-random:4; quality+hybrid-L16:4; clostera-dense-exact-row:2 | 0.0169% | 0.119% | 6.21x |
| n100m_k2048_d1024_iso_gaussian_balanced | 8 | clostera-dense-exact-row:8 | 0% | 0.000106% | 1x |
| n100m_k256_d1024_mixed_curse | 8 | clostera-dense-exact-random:4; clostera-dense-exact-row:4 | 0.227% | 0.472% | 2.43x |
| n100m_k256_d512_iso_gaussian_zipf | 8 | clostera-dense-exact-random:4; clostera-dense-exact-row:4 | 0.0522% | 0.246% | 2.3x |
| n100m_k64_d256_swiss_roll_lifted | 8 | clostera-dense-exact-nredo:3; clostera-dense-exact-row:2; quality+adc+nredo:2 | 0% | 2.29% | 1x |
| n1b_k1024_d256_hub_inducing | 8 | clostera-dense-exact-row:8 | 0% | 0.0791% | 1x |
| n1b_k256_d256_iso_gaussian_balanced | 7 | auto-selected rows not completed in snapshot | - | - | - |

Row-Level Examples

The complete row-level table is in benchmarks/results/readme_quality_speed_winners_20260504.csv. These examples use score / seconds; score direction depends on score_metric in the CSV.

20newsgroups, cos, K=20

  • Best quality: quality+hybrid-L4, 0.59059 / 3.28s
  • Quality-speed winner: clostera-dense-exact-random, 0.58277 / 0.0298s
  • Auto: clostera-dense-exact-row, 0.58928 / 0.0355s

ag-news, l2, K=4

  • Best quality: quality+hybrid-exact+flash, 0.59778 / 5.06s
  • Quality-speed winner: clostera-dense-exact-bound, 0.59709 / 0.0351s
  • Auto: clostera-dense-exact-nredo, 0.59639 / 0.106s

cifar100, l2, K=100

  • Best quality: clostera-dense-exact-nredo, 0.56788 / 0.322s
  • Quality-speed winner: clostera-dense-exact-random, 0.56641 / 0.0782s
  • Auto: clostera-dense-exact-random, 0.56641 / 0.0782s

gist-960-euclidean, l2, K=512

  • Best quality: faiss-kmeans, 0.0011905 / 321s
  • Quality-speed winner: clostera-dense-exact-row, 0.0011912 / 10.7s
  • Auto: clostera-dense-exact-row, 0.0011912 / 10.7s

n100m_k2048_d1024_iso_gaussian_balanced, l2, K=2048

  • Best quality: clostera-dense-exact-row, 1.0331 / 391s
  • Quality-speed winner: clostera-dense-exact-row, 1.0331 / 391s
  • Auto: clostera-dense-exact-row, 1.0331 / 391s

n1b_k1024_d256_hub_inducing, cos, K=1024

  • Best quality: clostera-dense-exact-row, 6.1402e+08 / 1200s
  • Quality-speed winner: clostera-dense-exact-row, 6.1402e+08 / 1200s
  • Auto: clostera-dense-exact-row, 6.1402e+08 / 1200s

Practical Notes

  • Dense exact paths are often the right answer at small and medium scale. They avoid quantization error and use fused rowwise assignment plus thread-local reductions.
  • Product-quantized paths matter when the dataset is large enough that dense passes are no longer the best trade-off, or when memory pressure dominates.
  • Hybrid paths use compressed lookup for a shortlist and exact dense rescoring for final assignment.
  • algorithm="auto" is conservative. If the selector does not have a measured row for a shape, it falls back to simple dense or compressed backends rather than silently inventing a new configuration.
  • Path-like parquet and memmap workflows remain supported. Some dense exact algorithms require raw vectors in memory; auto falls back when that requirement is not met.
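For intuition about the fused rowwise path, here is an illustrative NumPy equivalent of exact L2 assignment that streams rows in blocks so the full N x K distance matrix never materializes. The real kernel lives in Rust with SIMD dispatch and thread-local reductions; this sketch is not Clostera code and its names are hypothetical.

```python
import numpy as np

def assign_l2(vectors, centroids, block=4096):
    """Exact L2 cluster assignment, one row block at a time."""
    c_norms = np.sum(centroids ** 2, axis=1)  # ||c||^2, shape (K,)
    labels = np.empty(vectors.shape[0], dtype=np.int32)
    for start in range(0, vectors.shape[0], block):
        x = vectors[start:start + block]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is
        # constant per row, so the argmin only needs -2 x.c + ||c||^2.
        scores = c_norms - 2.0 * (x @ centroids.T)
        labels[start:start + block] = np.argmin(scores, axis=1)
    return labels

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 64)).astype(np.float32)
c = rng.standard_normal((16, 64)).astype(np.float32)
print(assign_l2(x, c).shape)  # (10000,)
```

Because the block loop also works on a np.memmap input, the same shape of computation extends to out-of-core datasets.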

Reproducing the Benchmarks

Install benchmark dependencies:

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip maturin
python -m pip install -e ".[benchmarks]"

Run the real labeled + ANN sweep from a checkout where dataset paths and output paths have been configured for your machine. The committed schedule files are reproducibility templates; replace /benchmark/clostera with your benchmark root or regenerate them with the scheduler scripts.

bash benchmarks/schedules/grand-pareto-resweep-20260426-postfaiss.sh
bash benchmarks/schedules/gist-unlocked-exact-20260427.sh

Run the large synthetic sweep:

bash benchmarks/schedules/synthetic-large-scale-pareto-20260427.sh

Regenerate the README summary CSV files from raw result JSON:

python scripts/summarize_benchmark_evidence.py

The synthetic generator archive is committed as synthetic_hard_graph_generator_harness.tar.gz. It writes raw memmappable f32 vector shards and i32 label shards with deterministic seeds, so large runs can be resumed and audited shard by shard.

Development

Build locally:

python -m pip install -U maturin
python -m maturin develop --release

Run tests:

python -m pytest -q
cargo test

On macOS, the default build links against Accelerate. On Linux, the default build uses the system BLAS path detected by pkg-config or falls back to -lopenblas. Explicit Cargo features remain available for OpenBLAS system/static builds.

About

Billion scale vector clustering. One Machine. Zero GPUs.