How I Built dimRed: A Dimensional Reduction Toolkit


dimRed started as a pragmatic fix for a problem I kept encountering: great dimensional-reduction tutorials existed, but turning those ideas into repeatable, production-ready pipelines was messy. I was moving between Python notebooks, Scala services, and PySpark jobs, and I wanted a toolkit that made those transitions seamless without rewriting the same math three times.

Motivation: reproducible dimensional reduction at scale

The initial scope was small, but the requirements were clear:

  • Consistent API across Python, Scala, and PySpark.
  • Research-friendly defaults that matched standard literature.
  • Production-ready hooks for scaling large datasets.
  • Readable outputs for debugging and post-hoc analysis.

Those constraints pushed me toward a minimal, well-typed core with adapters for each runtime, rather than a single monolithic implementation.

Implementation choices that kept the codebase sane

I focused on keeping the API surface area tight: fit, transform, and fit_transform with predictable inputs and outputs. Internally, each algorithm shared a similar flow: normalize data, compute the projection, and capture metadata for inspection.

A small API that scales

from dimred import PCA

# Whitened PCA down to three components
pca = PCA(n_components=3, whiten=True)
embedding = pca.fit_transform(matrix)

# Standard metadata is captured during fit for inspection
print(pca.explained_variance_ratio_)

That pattern made it straightforward to plug into notebooks or scheduled batch jobs without bespoke glue code.
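The shared internal flow described above (normalize the data, compute the projection, capture metadata) can be sketched in a few lines. This is a hypothetical illustration, not dimRed's actual implementation; `MiniPCA` and its `metadata_` dict are names invented here for the example:

```python
import numpy as np

class MiniPCA:
    """Hypothetical sketch of the shared flow: normalize, project, record metadata."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.metadata_ = {}

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # 1. Normalize: center the data
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        # 2. Compute the projection via SVD of the centered matrix
        _, s, vt = np.linalg.svd(Xc, full_matrices=False)
        self.components_ = vt[: self.n_components]
        # 3. Capture metadata for debugging and post-hoc analysis
        var = (s ** 2) / (len(X) - 1)
        self.metadata_["explained_variance_ratio"] = (var / var.sum())[: self.n_components]
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)
```

Keeping every algorithm on this three-step skeleton is what made the per-runtime adapters tractable: only step 2 really changes between implementations.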

PySpark challenges: distributed linear algebra is different

PySpark forced me to rethink a few assumptions. In a local environment, you can compute eigenvectors and call it a day. In Spark, you have to respect the cost of shuffles, caching, and the underlying row/column matrix abstractions.

  • RowMatrix vs. DataFrame: switching contexts too often was expensive.
  • Iterative algorithms (like t-SNE) needed guardrails to avoid runaway jobs.
  • Memory pressure meant careful partitioning and explicit caching.
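The guardrails for iterative algorithms amount to a bounded loop with an early-stopping check. A minimal sketch, assuming a `step` callback that advances the optimization and returns the current cost (the function name and parameters here are illustrative, not dimRed's API):

```python
def run_with_guardrails(step, state, max_iters=250, patience=10, tol=1e-6):
    """Hypothetical guardrail loop: cap total iterations and stop once the
    cost stops improving, so an iterative job (t-SNE-style) can't run away."""
    best = float("inf")
    stale = 0
    for _ in range(max_iters):
        state, cost = step(state)
        if cost < best - tol:
            # Meaningful progress: record it and reset the staleness counter
            best, stale = cost, 0
        else:
            stale += 1
            if stale >= patience:
                # No improvement for `patience` rounds: bail out early
                break
    return state, best
```

On a cluster, each un-needed iteration is a full round of shuffles, so capping the loop this way bounds the worst-case cost of a misconfigured job rather than relying on someone noticing and killing it.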

I leaned on MLlib primitives where possible and built custom glue where I needed better controls.

Results and what came next

The library landed with a small but steady audience and eventually crossed 62 GitHub stars. More importantly, it became the backbone of my own dimensional-reduction experiments and sparked a longer series of articles and projects.

Key outcomes:

  1. A consistent toolkit across three runtimes.
  2. Fewer “forked” implementations of the same algorithm.
  3. Clearer debugging through standardized metadata.

If I were to revisit dimRed today, I’d invest in faster GPU backends and stronger benchmarking harnesses. But the core idea still stands: good abstractions make the science easier to scale.