Diren Kumaratilleke

Crystara.

A new AI training paradigm past transformers. Transformers scale a fixed architectural object — self-attention — by pouring more parameters, more data, and more compute through it. Crystara (TCD-JEPA) refuses the premise. Instead of scaling the predictor, it grows the predictor: a recursive three-system loop explores the energy landscape with Fisher-information-metric Langevin dynamics, runs Vietoris-Rips persistent homology on the exploration trajectories, and crystallizes the stable topological features into typed H₀ / H₁ / H₂ predictor modules at runtime. The architecture is not designed; it is discovered. Across three real heterogeneous graphs — Georgetown CSET semiconductor (519 entities), GDELT global news (380 entities), SEC EDGAR (9,725 entities / ~3.9M edges) — Crystara adds +20 to +36.6 AUC points to baseline JEPA, beats supervised GAT (DeepMind), GCN (Google Brain), and GraphSAGE on the semiconductor graph, and scales to entity counts where GAT runs out of memory. On the semiconductor graph, the pipeline crystallizes 16 interpretable modules that map 1-to-1 to real industry clusters — with no labels, no prompting. The first runtime-discovered predictor architecture for the JEPA family, and the first concrete instance of a paradigm that moves past transformer scaling.


Past transformers.

The dominant paradigm of the 2020s is transformer scaling: fix the architecture at self-attention, pour more parameters and more tokens through it, and ride the loss curve downward. The paradigm has worked — and it is also paradigmatically conservative. Every frontier model is the same object at a bigger size. The architecture is static; only the compute moves.

TCD-JEPA proposes a different paradigm. Instead of scaling a fixed predictor, grow the predictor. Let the model explore the places it is currently most uncertain, run topological analysis on the trajectories, and crystallize stable geometric features into typed predictor modules at training time. The architecture is not handed down; it is discovered — typed by the homology group the feature was born from (H₀ attractors, H₁ cycles, H₂ boundaries), lifecycle-managed by a registry, and routed at inference by a learned gate. This is a post-transformer move because it changes the object. It does not scale the transformer; it replaces the fixed-architecture assumption underneath.

The problem with JEPA.

Joint Embedding Predictive Architectures are the right shape for self-supervised representation learning. They are also static in exactly the transformer-paradigm way: a single predictor head tries to span the entire structure of the input distribution. When that distribution is geometrically rich — clusters, loops, voids, room transitions, citation communities, supply-chain hierarchies — one head is wrong on average in more than one way, and the error is visible in the k-NN quality of the embeddings and in link-prediction AUC.

Crystara (implemented as tcd-jepa) keeps the JEPA backbone and replaces the single predictor with a runtime-grown family. The family is not designed. It is crystallized — in the literal topological sense — out of where the model is currently most wrong. It is the first runtime-discovered predictor architecture for the JEPA family, and the first concrete instance of a post-transformer paradigm that moves by changing the object rather than scaling it.

The three-system loop.

System 1 — Stream Encoder.

A ViT backbone with EMA target network, instrumentation hooks (per-layer statistics, representation diversity), and the energy surface E(z) = ‖p(s_θ(x)) − sg(s_ξ(y))‖². System 1 never stops; it publishes the current landscape.
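The objective itself is small enough to sketch in a few lines of numpy; `pred_head` is a hypothetical stand-in for p ∘ s_θ, and the stop-gradient sg(·) is modeled by treating the target embedding as a plain constant:

```python
import numpy as np

def energy(pred_head, s_x, s_target):
    """E(z) = || p(s_theta(x)) - sg(s_xi(y)) ||^2.

    pred_head : the predictor p applied to the context embedding s_theta(x).
    s_target  : the EMA target embedding s_xi(y); treating it as a constant
                plays the role of the stop-gradient sg(.).
    """
    residual = pred_head(s_x) - s_target
    return float(np.sum(residual ** 2))

# A perfect predictor sits at the bottom of the energy surface:
s_x = np.array([1.0, 2.0])
assert energy(lambda z: z, s_x, s_x) == 0.0
```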

System 2 — Recursive Manifold Explorer.

Reads the landscape and detects “blank spaces” — regions of low Hessian eigenvalue or high predictor variance. It then walks them via Langevin dynamics on a Fisher-information metric, biased toward uncertain regions:

z_{t+1} = z_t − η G(z_t)⁻¹ ∇E(z_t) + √(2η / β) · ε_t

The inverse temperature β is spatially biased toward the blank regions, so the walk is noisiest exactly where the predictor is least certain. The explorer's job is to leave trails: not to fit, not to predict, just to walk where the predictor currently has no opinion.
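One explorer step translates directly into numpy, assuming the energy gradient and the Fisher metric G(z) are supplied by the encoder; `langevin_step` and its argument names are illustrative, not the repo's API:

```python
import numpy as np

def langevin_step(z, grad_E, G, beta, eta=1e-2, rng=None):
    """One Fisher-preconditioned Langevin update:

        z' = z - eta * G(z)^-1 grad E(z) + sqrt(2 eta / beta) * eps

    beta is the local inverse temperature: a lower beta in a "blank"
    region injects more noise, biasing the walk toward uncertainty.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    drift = np.linalg.solve(G, grad_E)             # G(z)^-1 grad E(z)
    noise = rng.standard_normal(z.shape)
    return z - eta * drift + np.sqrt(2.0 * eta / beta) * noise

# Walk a toy quadratic energy E(z) = ||z||^2 / 2 (so grad E = z, G = I):
rng = np.random.default_rng(42)
z = np.full(3, 10.0)
for _ in range(500):
    z = langevin_step(z, grad_E=z, G=np.eye(3), beta=4.0, rng=rng)
assert np.all(np.isfinite(z)) and np.linalg.norm(z) < 10.0
```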

System 3 — Module Crystallizer.

Runs Vietoris–Rips persistent homology on the exploration trajectories. Stable topological features become predictor modules, typed by the homology group they were born from: H₀ connected components crystallize into attractor modules, H₁ loops into cycle modules, and H₂ voids into boundary modules.
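A library like Ripser or GUDHI would compute the full H₁/H₂ barcodes; as a self-contained sketch, the H₀ part of a Vietoris–Rips filtration reduces to single-linkage union-find, where a component's persistence is the radius at which it merges away and long-lived components are the candidates for crystallization:

```python
import numpy as np
from itertools import combinations

def h0_persistence(points):
    """H0 barcode of a Vietoris-Rips filtration via single-linkage union-find.

    Every component is born at radius 0; it dies at the edge length that
    merges it into another component. Returns the n-1 finite death times
    (one component lives forever).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    # Process edges in increasing filtration order.
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
            deaths.append(d)                # one component dies at radius d
    return deaths

# Two well-separated trajectory clusters -> one H0 death far later than the
# rest, i.e. one highly persistent feature worth crystallizing:
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
deaths = sorted(h0_persistence(pts))
assert deaths[-1] > 5 * deaths[-2]
```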

The loop is closed: new modules feed back into the predictor, reshaping the energy surface for the next exploration pass. Convergence is monitored via module count M(t), representation distribution R(t), and energy-landscape smoothness S(t):

C(t) = |M(t) − M(t−1)| / M(t)
      + KL( R(t) ‖ R(t−1) )
      + |S(t) − S(t−1)|
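The criterion translates line-for-line, assuming R(t) is a normalized histogram over representation usage (an assumption; the repo may track a different distribution):

```python
import numpy as np

def convergence_score(M_t, M_prev, R_t, R_prev, S_t, S_prev, eps=1e-12):
    """C(t) = |M(t)-M(t-1)| / M(t) + KL(R(t) || R(t-1)) + |S(t)-S(t-1)|.

    M: module count, R: representation distribution (sums to 1),
    S: scalar energy-landscape smoothness.
    """
    module_churn = abs(M_t - M_prev) / M_t
    kl = float(np.sum(R_t * np.log((R_t + eps) / (R_prev + eps))))
    smoothness_drift = abs(S_t - S_prev)
    return module_churn + kl + smoothness_drift

# A fully crystallized pass changes nothing, so C(t) collapses to zero:
R = np.array([0.25, 0.25, 0.5])
assert convergence_score(16, 16, R, R, 0.8, 0.8) < 1e-9
```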

Empirical benchmark — three real heterogeneous graphs.

The authoritative benchmark runs self-supervised link prediction (AUC) across three real-world graphs spanning four orders of magnitude in edge count, against baseline JEPA and three supervised GNN baselines — GAT (DeepMind), GCN (Google Brain), GraphSAGE.

+36.6 AUC pts vs JEPA (CSET semiconductor graph, 519 entities)
+22.1 AUC pts vs JEPA (GDELT global news, 380 entities)
+20.0 AUC pts vs JEPA (SEC EDGAR, 9,725 entities / ~3.9M edges)
16 modules discovered, 1-to-1 with real industry clusters

Semiconductor supply chain — CSET, 519 entities.

Georgetown Center for Security and Emerging Technology data, covering the physical semiconductor supply chain (design, fab, packaging, specialty chemicals, lithography, and their interdependencies). Crystara beats every supervised baseline and every self-supervised baseline.

Model | Link-prediction AUC | Δ vs Crystara
Crystara | 82.7% | —
GAT (DeepMind) | 70.3% | −12.4 pts
GCN (Google Brain) | 63.9% | −18.8 pts
Baseline JEPA | 46.1% | −36.6 pts
GraphSAGE | 33.8% | −48.9 pts

GDELT global news events — 380 entities.

News-derived event graph. Crystara substantially beats baseline JEPA but is behind the supervised GAT and GCN baselines — an honest result: the news-event graph has less physical-cluster structure than the semiconductor one, and the topological-module vocabulary is a weaker match to its geometry.

Model | Link-prediction AUC | Δ vs Crystara
GAT | 92.1% | +23.0 pts
GCN | 85.2% | +16.1 pts
Crystara | 69.1% | —
GraphSAGE | 59.9% | −9.2 pts
Baseline JEPA | 47.0% | −22.1 pts

SEC EDGAR financial filings — 9,725 entities, ~3.9M edges.

The scale test. Crystara completes training; GAT runs out of memory and returns no result; GraphSAGE produces no usable output; GCN hits 90.8% on edge-reconstruction but only ~7% on downstream classification, which is near-random and makes the edge number misleading.

Model | Link-prediction AUC | Note
Crystara | 66.4% | Completed at 9,725 entities.
GCN | 90.8% (edges) / ~7% (cls) | Near-random on classification.
Baseline JEPA | 46.4% | −20.0 pts vs Crystara.
GraphSAGE | — | No usable results.
GAT | OOM | Out of memory.

Qualitative breakthrough — 16 semiconductor modules.

On the CSET semiconductor graph, Crystara crystallized 16 interpretable modules that mapped 1-to-1 to real semiconductor supply-chain clusters, validated against primary sources (CSET, industry trade data). These modules emerged from persistent homology on Langevin trajectories — no labels, no prompting.

The modules capture both obvious clusters (national packaging hubs) and hidden dependencies (EUV ↔ etch/clean flows, AI ASIC ↔ specialty-equipment relationships). That is the signature of a predictor family that has grown into the shape of the data: H₀ attractors pulling industry clusters together, H₁ cycles linking feedback loops in the manufacturing pipeline, H₂ boundary modules handling interfaces between pipeline stages.

Runtime structure discovery & dynamic routing.

Crystallized modules live inside a DynamicPredictor. Each module is typed (H₀ / H₁ / H₂), persistence-scored, and lifecycle-managed. A query embedding is routed to modules using geometry specific to each module's type.

A learned gate mixes the module predictions with the base JEPA prediction:

combined = base_pred
  + α · token_gate(base_pred)
        · module_norm( Σ_m route[m] · module_m(z) )

α is a conservatively initialized sigmoid-gated scalar (module_weight = 0.05), so modules never dominate the base head unless they earn it. A registry tracks utilization: low-use modules decay, high-use modules can be re-seeded. That lifecycle is what turns “a trained model” into structure-as-artifact at inference time: the predictor family is itself an auditable, composable object you can inspect, prune, and route against.
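The mixing rule above can be sketched as follows; the per-token gate is abstracted into a hypothetical scalar logit, and `combined_prediction` and its argument names are illustrative, not the repo's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combined_prediction(base_pred, z, modules, route, gate_logit,
                        module_weight=0.05):
    """combined = base + alpha * gate * module_norm(sum_m route[m] * module_m(z)).

    modules    : the crystallized predictor family (callables z -> prediction)
    route      : per-module weights from the type-specific router
    gate_logit : hypothetical scalar standing in for token_gate(base_pred)
    The conservative module_weight = 0.05 keeps the base head dominant
    until the modules earn influence.
    """
    mix = sum(w * m(z) for w, m in zip(route, modules))
    mix = mix / (np.linalg.norm(mix) + 1e-8)        # module_norm
    alpha = module_weight * sigmoid(gate_logit)
    return base_pred + alpha * mix

# With all routing weights at zero, the modules contribute nothing:
base = np.ones(4)
out = combined_prediction(base, np.ones(4), [lambda z: z], [0.0], 0.0)
assert np.allclose(out, base)
```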

Representation geometry — supporting diagnostics.

On the controlled Two Rooms environment (64×64 gridworld with a doorway), the mechanism is visible in the embedding geometry:

Metric | Vanilla JEPA | TCD-JEPA | Δ relative
Linear probe | 57.93% ± 15.73% | 57.73% ± 7.60% | −0.3%, variance halved
k-NN k=1 | 24.33% ± 1.07% | 29.93% ± 0.87% | +23.0%
k-NN k=5 | 26.13% ± 1.80% | 29.73% ± 2.60% | +13.8%
k-NN k=20 | 24.17% ± 0.50% | 34.40% ± 0.00% | +42.3%

Linear probe is roughly unchanged — the global class boundary was always learnable. Every local-neighborhood metric (k-NN) improves, with the biggest gain at the widest neighborhood. Variance on the linear probe halves (15.73% → 7.60%), which is the clearest evidence that the explorer-crystallizer loop converges to roughly the same set of modules from different initializations — the persistent-homology features are properties of the data, not of the seed.

Engineering surface.