Past transformers.
The dominant paradigm of the 2020s is transformer scaling: fix the architecture at self-attention, pour more parameters and more tokens through it, and ride the loss curve downward. The paradigm has worked — and it is also paradigmatically conservative. Every frontier model is the same object at a bigger size. The architecture is static; only the compute moves.
TCD-JEPA proposes a different paradigm. Instead of scaling a fixed predictor, grow the predictor. Let the model explore the places it is currently most uncertain, run topological analysis on the trajectories, and crystallize stable geometric features into typed predictor modules at training time. The architecture is not handed down; it is discovered — typed by the homology group the feature was born from (H₀ attractors, H₁ cycles, H₂ boundaries), lifecycle-managed by a registry, and routed at inference by a learned gate. This is a post-transformer move because it changes the object. It does not scale the transformer; it replaces the fixed-architecture assumption underneath.
The problem with JEPA.
Joint Embedding Predictive Architectures are the right shape for self-supervised representation learning. They are also static in exactly the transformer-paradigm way: a single predictor head tries to span the entire structure of the input distribution. When that distribution is geometrically rich — clusters, loops, voids, room transitions, citation communities, supply-chain hierarchies — one head is wrong on average in more than one way, and the error is visible in the k-NN quality of the embeddings and in link-prediction AUC.
Crystara (implemented as tcd-jepa) keeps the JEPA backbone and replaces the single predictor with a runtime-grown family. The family is not designed. It is crystallized — in the literal topological sense — out of where the model is currently most wrong. It is the first runtime-discovered predictor architecture for the JEPA family, and the first concrete instance of a post-transformer paradigm that moves by changing the object rather than scaling it.
The three-system loop.
System 1 — Stream Encoder.
A ViT backbone with EMA target network, instrumentation hooks (per-layer statistics, representation diversity), and the energy surface E(z) = ‖p(s_θ(x)) − sg(s_ξ(y))‖². System 1 never stops; it publishes the current landscape.
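The energy surface can be read directly off the formula. A minimal numpy sketch, with the encoders abstracted away to plain vectors (stop-gradient is a no-op outside autograd, so it is omitted here):

```python
import numpy as np

def energy(pred_z, target_z):
    """JEPA energy E(z): squared L2 distance between the predictor's
    output p(s_theta(x)) and the stop-gradient target sg(s_xi(y))."""
    return float(np.sum((pred_z - target_z) ** 2))

# identical embeddings sit at the bottom of the energy surface
z = np.ones(4)
print(energy(z, z))  # → 0.0
```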
System 2 — Recursive Manifold Explorer.
Reads the landscape and detects “blank spaces” — regions of low Hessian eigenvalue or high predictor variance. It then walks them via Langevin dynamics on a Fisher-information metric, biased toward uncertain regions:
z_{t+1} = z_t − η G(z_t)⁻¹ ∇E(z_t) + √(2η / β) · ε_t

Temperature β is spatially biased toward the blank regions. The explorer’s job is to leave trails — not to fit, not to predict, just to walk where the predictor currently has no opinion.
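The update above can be sketched in a few lines of numpy. This is an illustrative sketch, not the repo's implementation: the Fisher metric G is approximated by its diagonal, and `grad_E` / `fisher_diag` are placeholder callables supplied by the caller.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(z, grad_E, fisher_diag, eta=0.01, beta=10.0):
    """One preconditioned Langevin step:
    z' = z - eta * G(z)^-1 grad_E(z) + sqrt(2*eta/beta) * noise,
    with G(z) approximated by a diagonal (fisher_diag)."""
    drift = -eta * grad_E(z) / fisher_diag(z)
    noise = np.sqrt(2.0 * eta / beta) * rng.standard_normal(z.shape)
    return z + drift + noise

# toy quadratic energy E(z) = ||z||^2 with identity metric:
trail = [np.full(2, 5.0)]
for _ in range(200):
    trail.append(langevin_step(trail[-1],
                               grad_E=lambda z: 2 * z,
                               fisher_diag=lambda z: np.ones_like(z)))
# the walk drifts toward the energy minimum while the noise keeps it exploring
print(np.linalg.norm(trail[-1]) < np.linalg.norm(trail[0]))
```

Lowering β raises the temperature and widens the walk, which is exactly the bias the explorer applies inside blank regions.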
System 3 — Module Crystallizer.
Runs Vietoris–Rips persistent homology on the exploration trajectories. Stable topological features become predictor modules, typed by the homology group they were born from:
- H₀ (connected components) → AttractorModule. A local predictor centered on the cluster centroid. Gaussian attention weight `exp(−d² / 2r²)` multiplied by an `MLP(embed_dim → 2·embed_dim → embed_dim)`. Discovers dense regions — e.g. visual categories, industry clusters, citation communities.
- H₁ (loops) → CycleModule. A periodic predictor: projection onto `num_frequencies = 8`, then `sin(·f + φ) + cos(·f + φ)`, then `Linear(2·f, embed_dim)`. Captures oscillatory structure — cyclic transitions, feedback loops in supply chains.
- H₂ (voids / cavities) → BoundaryModule. A boundary-interpolated predictor with a learned gate `α = σ(MLP(z))` blending two sub-predictors: `α · p_a(z) + (1 − α) · p_b(z)`. Discovers interfaces — handoff regions between stages of a pipeline, doorways between rooms.
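The three module types above can be sketched in plain numpy. The repo's actual modules are torch networks with learned weights; here the weights are random and the shapes illustrative, so this shows the wiring only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embed_dim (illustrative)

def mlp(z, W1, W2):
    """Tiny MLP(embed_dim -> 2*embed_dim -> embed_dim) with ReLU."""
    return np.maximum(z @ W1, 0.0) @ W2

class AttractorModule:               # H0: local predictor around a centroid
    def __init__(self, centroid, radius):
        self.c, self.r = centroid, radius
        self.W1 = rng.standard_normal((D, 2 * D)) * 0.1
        self.W2 = rng.standard_normal((2 * D, D)) * 0.1
    def __call__(self, z):
        d2 = np.sum((z - self.c) ** 2)
        w = np.exp(-d2 / (2 * self.r ** 2))        # Gaussian attention weight
        return w * mlp(z, self.W1, self.W2)

class CycleModule:                   # H1: periodic predictor
    def __init__(self, num_frequencies=8):
        self.P = rng.standard_normal((D, num_frequencies)) * 0.1
        self.phi = np.zeros(num_frequencies)       # learned phase in the real model
        self.out = rng.standard_normal((2 * num_frequencies, D)) * 0.1
    def __call__(self, z):
        f = z @ self.P + self.phi                  # project onto frequency basis
        return np.concatenate([np.sin(f), np.cos(f)]) @ self.out

class BoundaryModule:                # H2: gated blend of two sub-predictors
    def __init__(self, p_a, p_b):
        self.p_a, self.p_b = p_a, p_b
        self.w = rng.standard_normal(D) * 0.1      # gate MLP reduced to a linear probe
    def __call__(self, z):
        alpha = 1.0 / (1.0 + np.exp(-(z @ self.w)))  # alpha = sigmoid(gate(z))
        return alpha * self.p_a(z) + (1 - alpha) * self.p_b(z)
```

At the centroid the attractor's Gaussian weight is 1 and it fully asserts its local prediction; far from the centroid its contribution decays toward zero, which is what keeps each H₀ module local.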
The loop is closed: new modules feed back into the predictor, reshaping the energy surface for the next exploration pass. Convergence is monitored via module count M(t), representation distribution R(t), and energy-landscape smoothness S(t):
C(t) = |M(t) − M(t−1)| / M(t)
+ KL( R(t) ‖ R(t−1) )
+ |S(t) − S(t−1)|

Empirical benchmark — three real heterogeneous graphs.
The authoritative benchmark runs self-supervised link prediction (AUC) across three real-world graphs spanning four orders of magnitude in edge count, against baseline JEPA and three supervised GNN baselines — GAT (DeepMind), GCN (Google Brain), GraphSAGE.
Semiconductor supply chain — CSET, 519 entities.
Georgetown Center for Security and Emerging Technology data, covering the physical semiconductor supply chain (design, fab, packaging, specialty chemicals, lithography, and their interdependencies). Crystara beats every supervised baseline and every self-supervised baseline.
| Model | Link-prediction AUC | Δ vs Crystara |
|---|---|---|
| Crystara | 82.7% | — |
| GAT (DeepMind) | 70.3% | −12.4 pts |
| GCN (Google Brain) | 63.9% | −18.8 pts |
| Baseline JEPA | 46.1% | −36.6 pts |
| GraphSAGE | 33.8% | −48.9 pts |
GDELT global news events — 380 entities.
News-derived event graph. Crystara substantially beats baseline JEPA but is behind the supervised GAT and GCN baselines — an honest result: the news-event graph has less physical-cluster structure than the semiconductor one, and the topological-module vocabulary is a weaker match to its geometry.
| Model | Link-prediction AUC | Δ vs Crystara |
|---|---|---|
| GAT | 92.1% | +23.0 pts |
| GCN | 85.2% | +16.1 pts |
| Crystara | 69.1% | — |
| GraphSAGE | 59.9% | −9.2 pts |
| Baseline JEPA | 47.0% | −22.1 pts |
SEC EDGAR financial filings — 9,725 entities, ~3.9M edges.
The scale test. Crystara completes training; GAT runs out of memory and returns no result; GraphSAGE produces no usable output; GCN hits 90.8% on edge-reconstruction but only ~7% on downstream classification, which is near-random and makes the edge number misleading.
| Model | Link-prediction AUC | Note |
|---|---|---|
| Crystara | 66.4% | Completed at 9,725 entities. |
| GCN | 90.8% (edges) / ~7% (cls) | Near-random on classification. |
| Baseline JEPA | 46.4% | −20.0 pts. |
| GraphSAGE | — | No usable results. |
| GAT | OOM | Out of memory. |
Qualitative breakthrough — 16 semiconductor modules.
On the CSET semiconductor graph, Crystara crystallized 16 interpretable modules that mapped 1-to-1 to real semiconductor supply-chain clusters, validated against primary sources (CSET, industry trade data). These modules emerged from persistent homology on Langevin trajectories — no labels, no prompting.
- CMP (chemical-mechanical polishing) pipeline.
- Netherlands / ASML lithography ecosystem.
- Singapore assembly-test-packaging corridor.
- China packaging cluster.
- Design-to-fab chain.
- Specialty chemicals cluster.
- EUV ↔ etch/clean flows.
- AI ASICs ↔ Hitachi dependency.
- Lithography ↔ CMP handoff.
- …and seven more, each independently validated.
The modules capture both obvious clusters (national packaging hubs) and hidden dependencies (EUV ↔ etch/clean flows, AI ASIC ↔ specialty-equipment relationships). That is the signature of a predictor family that has grown into the shape of the data: H₀ attractors pulling industry clusters together, H₁ cycles linking feedback loops in the manufacturing pipeline, H₂ boundary modules handling interfaces between pipeline stages.
Runtime structure discovery & dynamic routing.
Crystallized modules live inside a DynamicPredictor. Each module is typed (H₀ / H₁ / H₂), persistence-scored, and lifecycle-managed. A query embedding is routed to modules using type-specific geometry:
- H₀ modules — routed by centroid distance.
- H₁ modules — routed by phase alignment in the learned frequency basis.
- H₂ modules — routed by distance to the discovered boundary.
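The type-specific routing can be sketched as a softmax over per-module geometric scores. The scoring formulas below are illustrative stand-ins for the repo's actual routing (the dictionary keys `centroid`, `freq_basis`, `phase`, `normal`, `offset` are hypothetical names):

```python
import numpy as np

def route_scores(z, modules):
    """Type-specific routing: each crystallized module scores a query
    embedding using the geometry of the homology class it came from,
    then a softmax turns scores into routing weights."""
    scores = []
    for m in modules:
        if m["type"] == "H0":    # attractor: closer to centroid -> higher score
            scores.append(-np.linalg.norm(z - m["centroid"]))
        elif m["type"] == "H1":  # cycle: phase alignment in the frequency basis
            phase = z @ m["freq_basis"]
            scores.append(float(np.cos(phase - m["phase"]).mean()))
        else:                    # H2: closer to the discovered boundary -> higher
            scores.append(-abs(float(z @ m["normal"]) - m["offset"]))
    s = np.asarray(scores)
    e = np.exp(s - s.max())      # stable softmax over modules
    return e / e.sum()

modules = [{"type": "H0", "centroid": np.zeros(3)},
           {"type": "H0", "centroid": np.full(3, 10.0)}]
w = route_scores(np.zeros(3), modules)
print(w[0] > w[1])  # the nearby attractor wins the routing mass
```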
A learned gate mixes the module predictions with the base JEPA prediction:
combined = base_pred
+ α · token_gate(base_pred)
· module_norm( Σ_m route[m] · module_m(z) )

α is a conservatively-initialized sigmoid-gated scalar (`module_weight = 0.05`), so modules never dominate the base head unless they earn it. A registry tracks utilization — low-use modules decay, high-use modules can be re-seeded. That lifecycle is what turns “a trained model” into structure-as-artifact at inference time: the predictor family is itself an auditable, composable object you can inspect, prune, and route against.
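A minimal sketch of the gated combination, with `token_gate` and `module_norm` reduced to simple stand-ins (a sigmoid on a scalar summary of the base prediction, and plain L2 normalization) — not the repo's exact operators:

```python
import numpy as np

def combine(base_pred, module_preds, route, module_weight=0.05):
    """Mix crystallized-module predictions into the base JEPA prediction:
    combined = base + alpha * token_gate(base) * module_norm(sum_m route[m]*pred_m)."""
    mix = sum(r * p for r, p in zip(route, module_preds))   # route-weighted module sum
    mix = mix / (np.linalg.norm(mix) + 1e-8)                # module_norm (stand-in)
    gate = 1.0 / (1.0 + np.exp(-base_pred.mean()))          # token_gate (stand-in)
    return base_pred + module_weight * gate * mix

base = np.zeros(4)
preds = [np.ones(4), -np.ones(4)]
out = combine(base, preds, route=[0.9, 0.1])
# with module_weight = 0.05 the module term barely perturbs the base prediction
print(np.linalg.norm(out - base) < 0.1)
```

The small initial `module_weight` is what makes the gate conservative: a freshly crystallized module can only nudge the output until its routing utilization justifies more influence.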
Representation geometry — supporting diagnostics.
On the controlled Two Rooms environment (64×64 gridworld with a doorway), the mechanism is visible in the embedding geometry:
| Metric | Vanilla JEPA | TCD-JEPA | Δ relative |
|---|---|---|---|
| Linear probe | 57.93% ± 15.73% | 57.73% ± 7.60% | −0.3%, variance halved |
| k-NN k=1 | 24.33% ± 1.07% | 29.93% ± 0.87% | +23.0% |
| k-NN k=5 | 26.13% ± 1.80% | 29.73% ± 2.60% | +13.8% |
| k-NN k=20 | 24.17% ± 0.50% | 34.40% ± 0.00% | +42.3% |
Linear probe is roughly unchanged — the global class boundary was always learnable. Every local-neighborhood metric (k-NN) improves, with the biggest gain at the widest neighborhood. Variance on the linear probe halves (15.73% → 7.60%), which is the clearest evidence that the explorer-crystallizer loop converges to roughly the same set of modules from different initializations — the persistent-homology features are properties of the data, not of the seed.
Engineering surface.
- Three orchestrators (training, distributed, manifold) plus the recursive-loop glue.
- Persistent-homology backends with fallback chain: giotto-tda → ripser → scipy.
- 186 tests. NaN guards, DDP / FSDP compatibility, SLURM scripts for multi-node runs.
- Nine real-world data adapters. CSET semiconductor, GDELT, SEC EDGAR, USPTO, PubMed, ogbn-arxiv, and more — the same three-system loop handles each without changing the predictor class.
- Typed-module API. Dynamic loading, runtime routing, lifecycle management, and persistence-score-based pruning.
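The homology-backend fallback chain is the kind of thing worth making explicit. A hedged sketch of the pattern, assuming the standard entry points of each library (the function name `load_homology_backend` is illustrative, not the repo's API):

```python
def load_homology_backend():
    """Pick the first available persistent-homology backend, in the
    order giotto-tda -> ripser -> a scipy-based distance fallback."""
    try:
        from gtda.homology import VietorisRipsPersistence
        return ("giotto-tda", VietorisRipsPersistence)
    except ImportError:
        pass
    try:
        from ripser import ripser
        return ("ripser", ripser)
    except ImportError:
        pass
    # last resort: scipy gives pairwise distances only, no persistence pairs
    from scipy.spatial.distance import pdist
    return ("scipy", pdist)

name, backend = load_homology_backend()
print(name)  # whichever backend is installed first in the chain
```

The scipy tier is a degraded mode: it can feed a distance matrix to downstream code but cannot compute persistence diagrams itself, so module crystallization quality depends on which tier is active.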