Aethron Labs / Projects

Building foundation models for
scientific interpretation.

Transforming scientific data into actionable discovery. Starting with mass spectrometry — the language of molecules.

01Overview

What We Do

Aethron Labs is an independent research lab focused on developing large-scale machine learning systems for interpreting complex scientific data. Our work is centered on building foundational capabilities rather than narrow tools or application-specific models.

02NexaMol

Progress Log

YouTube / Problem Overview

Loom / Technical Demo

12.38M

Model parameters

45.7%

Contrastive loss reduction

0.4255

Morgan fingerprint cosine

5M

Spectra indexed in Qdrant

Foundation & Data

◈Acquired large-scale MS/MS dataset (~579 GiB)
◈Built Rust + Python preprocessing pipeline
◈Achieved ~160K spectra/sec on commodity CPUs
◈Processed 100% into verified, versioned shards (GeMS v1)
◈Strict train/test/validation splits — no leakage
◈Arrow-based training shards + HDF5 workflows

Model Training

◈Trained Final_V3 encoder-only transformer (12.38M params)
◈Phase-1 campaign: V1–V3 over ~201M spectra
◈Contrastive loss: 0.6206 → 0.3373 (45.7% reduction)
◈Validation loss fell 74.2% across V1–V3
◈Embedding std stable at 0.072 — promoted Final_V3
◈Hardware: 2× RTX PRO 6000 Blackwell, W&B observability

Commercial Execution

◈Identified CROs as primary commercial segment
◈Drafted GTM messaging and LOI templates
◈Conducted direct CRO outreach
◈Applied to BoostVC, Convergent, Artizen, Founders Inc
◈Written technical docs and 1-2 page proposals
◈Capital-efficient ignition → pre-seed execution plan

02.5Milestone

GeMS v1 — Dataset Complete

MILESTONE ACHIEVED

GeMS v1 (General Mass Spectrometry Dataset v1) — ML-ready training corpus assembled. Week of March 10, 2026.

579.6 GiB

Total dataset size

338

Total shards

~18K/s

Spectra throughput

Train

270 shards

80% of corpus

Test

33 shards

10% of corpus

Validation

35 shards

10% of corpus

GeMS v1 / Shard Distribution VisualizationCOMPLETE

Pipeline Architecture

◈Rust + Python hybrid preprocessing
◈Arrow-format training shards
◈Strict no-leakage train/test/val splits
◈Versioned, reproducible shard generation

Data Quality

◈~18K spectra/sec processing throughput
◈Instrument-agnostic normalization
◈HuggingFace → VM → storage pipeline
◈Verified checksums on all 338 shards

03Milestone

Final_V3 — Foundation Encoder

CHECKPOINT PROMOTED

Final_V3 encoder promoted as stable SSL foundation checkpoint. April 14, 2026. Published: A Reproducible ML Systems Stack for Structure-Aware MS/MS Representation Learning.

12.38M

Model parameters

45.7%

Contrastive loss reduction (V1–V3)

~201M

Spectra in phase-1 campaign

74.2%

Validation loss reduction

80.4%

Gradient norm reduction

0.072

Embedding std (stable)

Phase V1COMPLETE

0.6206

Cold start · 27 train shards · ~66-67M spectra

Phase V2COMPLETE

~0.45

V1 checkpoint lineage · 27 train shards · ~66-67M spectra

Phase V3PROMOTED

0.3373

V2 checkpoint lineage · 27 train shards · ~66-67M spectra

Architecture

◈Encoder-only transformer, 12.38M parameters
◈Self-supervised contrastive pretraining
◈Phase-gated training with promotion gates
◈W&B observability across all runs

Systems Stack

◈2× RTX PRO 6000 Blackwell training lane
◈Object-storage-first artifacts (Wasabi)
◈Resumable hydration + role-specific GPU containers
◈Full checkpoint lineage and manifest discipline

04Milestone

V26 — Structure Alignment

ALIGNMENT VALIDATED

Final_V3 encoder attached to RDKit Morgan fingerprint targets on a corrected label-gated surface. Structure signal confirmed without representation collapse.

0.0653

Validation structure BCE

0.4255

Validation fingerprint cosine

11/20

Ground-truth candidate matches

What Worked

◈Encoder attaches to RDKit Morgan targets without collapse
◈Stable embedding variance across alignment run
◈20-example validation gallery: 11/20 ground-truth matches
◈Fingerprint decoding shows usable structure signal

Open Frontier

◈Trained retrieval projection remains weak
◈Top-1 ranking not yet reliable
◈Confidence calibration negative — not promoted
◈Next: reranking, ambiguity tiers, confidence-as-abstention

05Milestone

Inference Layer — Qdrant Atlas

PROTOTYPE LIVE

First inference system built around Final_V3. Qdrant vector indexing, chemical-space atlas, outlier triage, and family-analysis surfaces. Practical framing: structure-elucidation support, not autonomous identification.

5M

Spectra indexed (Qdrant)

100K

Rich chemical-space atlas

3,129

Formulas in atlas

Inference Surfaces

◈Qdrant vector indexing — nearest-neighbor retrieval
◈1M / 5M density atlases for chemical-space inspection
◈Outlier triage and rare-compound detection
◈Family-analysis surfaces for compound clustering
◈4,107 InChIKey prefixes in rich atlas

Product Framing

◈Candidate narrowing, not autonomous identification
◈Structure-elucidation support for analyst workflows
◈Evidence aggregation over ambiguous candidate sets
◈Not a de novo generator — bounded, defensible claims
◈Ranking and confidence remain the next engineering target

06Context

The Problem

Across the life sciences and molecular research, data generation has dramatically outpaced our ability to interpret it. Core analytical technologies produce enormous volumes of rich, high-dimensional measurements, yet downstream understanding still depends on fragile heuristics, limited reference data, and manual analysis.

This gap constrains discovery, slows research, and limits what can be reliably inferred from experimental data.

07Technical

Our Approach

We believe this is fundamentally a representation problem. Aethron Labs is building foundation models that learn directly from raw scientific data, capturing underlying structure in a way that generalizes across instruments, conditions, and experimental settings.

The goal is not to replace existing workflows, but to create a new computational substrate that makes scientific interpretation more scalable, reliable, and extensible.

08Business

Market Opportunity

Pre-commercial stage — projections based on preliminary market research and industry analysis

Top-down Context

$200B

Global pharma R&D annually

$90B

Global CRO market annually

$50B+

Addressable analytical services

Bottom-up Entry Wedge

30-50%

Reduction in manual interpretation time

10-100+

Instruments per large CRO

1M+

Spectra analyzed annually

Mid-to-large CROs typically operate 10s–100s of LC-MS/MS instruments processing millions of spectra per year, with teams of analysts whose time is the primary cost driver. This spend is recurring, operational, and directly tied to throughput and turnaround time.

Initial commercialization targets enterprise API licensing priced against analyst time and throughput. Targeting ~200–500 CROs and pharma analytical groups globally, with early adopters likely the top 10–50 CROs by analytical volume. Initial contracts plausibly in the $100K–$1M ARR range per customer — supporting a credible $50–200M serviceable obtainable market before broader expansion.

06Strategy

Go-To-Market

Simple and Credible

The initial GTM is intentionally narrow and execution-driven. Aethron Labs targets CROs first — the organizations that feel the MS/MS bottleneck most acutely.

Turnaround time, analyst throughput, and defensibility of results directly determine their margins and competitiveness. The goal is not rapid scaling at first, but credible proof that this infrastructure works in real workflows.

Direct CRO Outreach

Identify high-pain workflows: metabolite ID, impurity analysis, dereplication.

Scoped Pilots

Run alongside existing tools. Measured on time saved, coverage, analyst effort.

API Integration

Embed into existing pipelines — no UI disruption, no workflow replacement.

Validated Conversion

Convert pilots into paid API access or enterprise licensing.

07Roadmap

What This Becomes

What begins as programmatic molecular search for LC-MS/MS expands as models and representations mature:

Phase 1

Molecular Interpretation Infrastructure

◈LC-MS/MS annotation and search
◈Metabolomics and impurity identification
◈Direct integration into CRO and pharma pipelines

Phase 2

Embedded Discovery Infrastructure

◈Drug discovery, DMPK, metabolomics, materials research
◈Standard interpretation layer — not a standalone tool
◈Cross-instrument generalization

Phase 3

Scientific Foundation Infrastructure

◈Reusable substrate for molecular and materials science
◈TAM expands to multiple tens of billions
◈Core scientific computing infrastructure

08Team

Founder Profile

Allan

Founder

5 yrs

ML & Scientific Computing

[email protected]

3 yrs — Open-Source Research

◈Molecular science
◈Biomaterials
◈Quantum systems
◈Computational fluid dynamics

2 yrs — Industry

◈Startups
◈Large-scale production systems
◈Infrastructure engineering

This background spans the full stack required for this problem: scientific domain understanding, large-scale ML systems, and production engineering realities. Aethron Labs is structured to reflect this combination from day one.

09Vision

Motivation

This effort is motivated by a rare convergence:

◈

Scientific fields are generating orders of magnitude more data

◎

Interpretation remains the bottleneck — not collection

◉

Modern ML can now operate at the scale and complexity required

The opportunity is not incremental optimization. It is to define a new category of scientific infrastructure that sits between raw experimental data and downstream discovery.

By starting with a concrete, economically grounded use case (CRO workflows) and expanding deliberately, Aethron Labs aims to accelerate scientific discovery, improve reproducibility, and create durable infrastructure with impact beyond a single domain.

This is a long-term bet on advancing science as a system, not just improving a workflow.

10Connect

Get in Touch

If you work in scientific research, analytical chemistry, pharma, or scientific machine learning, and are interested in exchanging perspectives — I welcome the conversation.

Building foundation models forscientific interpretation.