Participatory Data Estate — Diren Kumaratilleke

The Participatory Data Estate — what the framework is.

Standard data-governance architectures assume three things that modern governance environments can no longer afford: that data ingestion is a batch (scrape, chunk, embed, freeze), that moderation is a private workflow, and that the audit trail is a compliance artifact collected for regulators rather than shared with constituents. Each of those assumptions fails the moment the governance substrate is amended, contested, reorganized, or audited in public.

The Participatory Data Estate inverts all three. Ingestion is a continuous pipeline (Submit → Moderate → Thin → Crystallize) in which raw human submissions are accepted without being trusted, moderated through an explicit approval surface, chunked into independently retrievable units, and embedded into a hybrid vector + full-text index. Moderation is a public transition: every state change is a row in an approval_log table with a public-read RLS policy, so the full provenance chain of any document (who submitted it, when it was approved, by whom, with what notes) is queryable by anyone with SQL. Retrieval never sees pending or rejected content. Security is federal-agency-derived, not student-gov-derived: constant-time auth comparison, RLS on every table, rate limits on four action classes, CSP/HSTS/X-Frame-Options headers, XSS detection. That combination of participatory ingestion, a public audit ledger, and federal-hardening controls does not exist in the dominant data-governance stacks. It is what the framework proposes: a new shape for digital data governance.

The framework generalizes. Any organization whose knowledge-corpus is amended through human submissions (a municipal agency, an NGO, a standards body, a scholarly society, a regulator, a policy platform) needs exactly this shape. SGUNCCH, below, is the first live deployment — a full-platform UNC student-government stack running the framework end-to-end. It exists because student governance is an unusually severe test: amendments are frequent, contests are real, the electorate is the same size as a small city, and the platform must degrade gracefully when a central service is unreachable. If the framework runs here, it runs elsewhere.

SGUNCCH — the first live deployment.

Most student-government platforms ship as a WordPress site with a public feedback form. SGUNCCH ships with every element of the Participatory Data Estate framework, hardened with patterns drawn from federal-agency security guidance (NIST 800-53 / OWASP ASVS families). Every item below is implemented in the repository — not aspirational, not planned — and I do not believe any other student-government platform currently in production runs this stack:

Control	Implementation
Admin authentication	Environment-variable secret, constant-time comparison to prevent timing attacks.
Session management	4-hour inactivity expiry, rotating tokens.
Rate limiting	API 100 / min · Login 5 / 15 min · Forms 10 / min · Feedback 5 / hour.
Input sanitization	`sanitizeText`, `sanitizeEmail`, `sanitizeURL`, `sanitizePhone`, `sanitizeObject`, schema-based `validateFormData`.
XSS defense	Content Security Policy + `containsXSS` detector + HTML-encoded text inputs.
HTTP headers	`X-Frame-Options: DENY`, `X-Content-Type-Options: nosniff`, `X-XSS-Protection`, `Referrer-Policy`, `Permissions-Policy`, `Strict-Transport-Security`.
Row-level security	Enabled on every Supabase table. Public-read policies only on explicitly-approved content.
API hardening	HTTP method allowlist on all routes, rate-limit middleware, error messages sanitized in production, no sensitive data in error responses.
Audit trail	`approval_log` table, publicly-readable RLS policy, indexed by document and by time.
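
Two of the controls in the table, constant-time secret comparison and per-class rate limiting, can be sketched in a few lines. This is a minimal illustration in Python, not the repository's code: the names `verify_admin_secret` and `SlidingWindowLimiter` are hypothetical, and the sliding-window strategy is an assumption, since the table gives only the limits and not the algorithm behind them.

```python
import hmac
import time
from collections import defaultdict, deque

# Limits copied from the table: action class -> (max requests, window in seconds).
LIMITS = {
    "api": (100, 60),
    "login": (5, 15 * 60),
    "forms": (10, 60),
    "feedback": (5, 60 * 60),
}


def verify_admin_secret(supplied: str, expected: str) -> bool:
    """Compare two secrets in constant time so timing does not reveal a partial match."""
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


class SlidingWindowLimiter:
    """In-memory limiter keyed by (action class, client id). Assumed design."""

    def __init__(self, limits=LIMITS, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.hits = defaultdict(deque)

    def allow(self, action: str, client_id: str) -> bool:
        max_hits, window = self.limits[action]
        now = self.clock()
        recent = self.hits[(action, client_id)]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= window:
            recent.popleft()
        if len(recent) >= max_hits:
            return False
        recent.append(now)
        return True
```

An in-memory limiter of this kind resets on restart and does not coordinate across processes, so a multi-instance deployment would need a shared store.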

Platform scope.

SGUNCCH is not one feature. It is four composable surfaces shipped in one hardened codebase:

The Scroll — governance-document RAG. Submit → Moderate → Thin → Crystallize, hybrid pgvector + GIN full-text retrieval, hash-deduplicated submissions, 15 seed documents (constitution, statutes, policies, conduct code), public approval log.
The Budget Engine — AI-scored funding allocation. 10 category classes (events, travel, merch, supplies, wellness, food, marketing, technology, emergency, other), 5 request statuses (pending, approved, denied, reallocated, spent), SG-priority alignment scoring across 5 priority categories (wellness, basic-needs, academic-support, safety, sustainability), price-check calls to Groq for reality-checking line items.
The Knowledge Base + Chat Layer — a Groq-backed chat interface grounded in the RAG corpus, with 1536-dim embeddings, an active message history table, and transparent source citation.
The Policy Platform — 40 policies across 8 departments (Student Wellness, Basic Needs, Academic Affairs, Civic Engagement, Communications, DEI, Environmental, State & External), real-time progress tracking, mobile-first responsive design.

Each surface shares the same security posture, the same audit trail, and the same RLS discipline. The platform degrades gracefully: if Supabase is unreachable, a file-based fallback backed by the committed codex-seed.json takes over so the governance corpus remains queryable. Graceful degradation of this shape is standard in hardened SaaS and uncommon in student-gov infrastructure.
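
The fallback behavior can be illustrated with a small loader that tries the primary store first and reads the seed file only on failure. This is a sketch under assumptions: `load_corpus` and `fetch_primary` are hypothetical names, and the seed file is assumed to be a JSON array of documents, which the text does not specify.

```python
import json
from pathlib import Path
from typing import Callable


def load_corpus(fetch_primary: Callable[[], list], seed_path: Path) -> tuple[list, str]:
    """Return (documents, source), where source is "primary" or "seed"."""
    try:
        return fetch_primary(), "primary"
    except Exception:
        # Assumption: any failure of the primary store counts as "unreachable".
        with seed_path.open(encoding="utf-8") as fh:
            return json.load(fh), "seed"
```

Returning the source alongside the data lets the caller tell users that they are reading the committed snapshot rather than the live corpus.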

The substrate has to stay alive.

Static knowledge bases rot. The dominant RAG pipelines treat documents as a one-time batch: scrape, chunk, embed, freeze. Real organizations do not work that way. Governance is amended. Policies are drafted, contested, approved, superseded. A platform that pretends otherwise is inaccurate by construction.

The Participatory Data Estate (PDE) is the ingestion primitive. It is currently instantiated as The Scroll — the governance-document RAG layer of a UNC-scale student-government policy platform. Its job is to keep the substrate honest: to accept arbitrary human submissions, moderate them, chunk them, embed them, and expose the full provenance chain through an auditable ledger.

The pipeline.

1 · Submit.

A governance_documents row is created with raw content, metadata, and a status of pending. The submission is content-addressable by hash; duplicates are rejected at the index layer (idx_gov_docs_hash).
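
A rough Python model of the Submit step follows. The dictionary stands in for the unique index on the hash column; SHA-256 and the whitespace normalization are assumptions, since the text names the index but not the hash function or any normalization rule. `SubmissionStore` and `DuplicateSubmission` are hypothetical names.

```python
import hashlib


class DuplicateSubmission(Exception):
    """Raised when a submission's content hash already exists."""


def content_hash(content: str) -> str:
    # Assumption: collapse runs of whitespace before hashing.
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SubmissionStore:
    def __init__(self):
        self.by_hash = {}  # stand-in for a unique index on the hash column

    def submit(self, content: str, metadata: dict) -> dict:
        digest = content_hash(content)
        if digest in self.by_hash:
            raise DuplicateSubmission(digest)
        row = {
            "content": content,
            "metadata": metadata,
            "status": "pending",  # every submission starts untrusted
            "content_hash": digest,
        }
        self.by_hash[digest] = row
        return row
```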

2 · Moderate.

An approval workflow transitions status to approved or rejected. Row-level security ensures that the public can read only approved documents, while admins retain full management rights. Every state transition is captured as a row in approval_log: timestamp, actor, document id, outcome. That table is the audit trail.
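
The Moderate step behaves like a small state machine that writes a log row on every transition. The sketch below models it in Python with hypothetical names (`Document`, `Moderation`, `publicly_readable`). It assumes that only pending documents can be moderated; the text does not say whether a decision can later be reversed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumption: a document is moderated once, starting from pending.
ALLOWED = {("pending", "approved"), ("pending", "rejected")}


@dataclass
class Document:
    id: str
    content: str
    status: str = "pending"


@dataclass
class Moderation:
    approval_log: list = field(default_factory=list)

    def transition(self, doc: Document, outcome: str, actor: str, notes: str = "") -> None:
        if (doc.status, outcome) not in ALLOWED:
            raise ValueError(f"illegal transition {doc.status} -> {outcome}")
        doc.status = outcome
        # One log row per state change, mirroring the approval_log columns.
        self.approval_log.append({
            "document_id": doc.id,
            "action": outcome,
            "actor": actor,
            "performed_at": datetime.now(timezone.utc),
            "notes": notes,
        })


def publicly_readable(docs):
    """Model of the public read policy: only approved documents are visible."""
    return [d for d in docs if d.status == "approved"]
```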

3 · Thin.

Approved documents are broken into document_chunks: paragraph- or section-sized units suited to retrieval, each linked back to its parent by document_id with ON DELETE CASCADE. A generated search_vector column (GIN-indexed) powers full-text search on the chunk body. Thinning is where a blob of prose becomes a set of addressable units.
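
One plausible way to thin a document is to split on blank lines and pack neighboring paragraphs up to a size budget. The function below is a sketch of that idea, not the platform's chunker: `thin`, `chunk_index`, and the `max_chars` default of 1200 are all assumptions.

```python
import re


def thin(document_id: str, content: str, max_chars: int = 1200) -> list[dict]:
    """Split on blank lines, then pack paragraphs into chunks near max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", content) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Each chunk keeps a pointer back to its parent document.
    return [
        {"document_id": document_id, "chunk_index": i, "content": c}
        for i, c in enumerate(chunks)
    ]
```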

4 · Crystallize.

Each chunk is embedded into a 1536-dimensional vector and stored via pgvector. The retrieval function match_document_chunks(query_embedding, …) ranks chunks by cosine distance, embedding <=> query_embedding, and reports similarity as 1 minus that distance. The hybrid layer fuses vector similarity with the GIN full-text index: semantic retrieval plus exact keyword grounding, in a single query. If pgvector is unavailable, the pipeline falls back gracefully to full-text search (FTS) only.
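
The text does not say how the vector and full-text signals are fused, so the following is only one possible reading: a weighted sum of cosine similarity and a crude keyword-overlap score, with a keyword-only path when no query embedding is available. The names `hybrid_search` and `keyword_score` and the weight `alpha` are hypothetical, and the keyword score is a stand-in for a real Postgres full-text rank.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the text (stand-in for an FTS rank)."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(query, query_embedding, chunks, alpha=0.7, limit=10):
    """Return (score, chunk id) pairs, best first.

    Passing query_embedding=None models the FTS-only fallback.
    """
    scored = []
    for chunk in chunks:
        kw = keyword_score(query, chunk["content"])
        if query_embedding is None:
            score = kw
        else:
            sim = cosine_similarity(query_embedding, chunk["embedding"])
            score = alpha * sim + (1 - alpha) * kw
        scored.append((score, chunk["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```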

The transparent allocation ledger.

The ledger is not a conceptual pattern — it is a Postgres table:

CREATE TABLE approval_log (
  id            UUID PRIMARY KEY,
  document_id   UUID REFERENCES governance_documents(id) ON DELETE CASCADE,
  action        TEXT NOT NULL,
  actor         TEXT,
  performed_at  TIMESTAMPTZ NOT NULL,
  notes         TEXT
);

CREATE INDEX idx_approval_log_doc  ON approval_log(document_id);
CREATE INDEX idx_approval_log_time ON approval_log(performed_at DESC);

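-- A policy takes effect only once RLS is enabled on the table
-- (the platform enables RLS on every table).
ALTER TABLE approval_log ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Anyone can read approval log"
  ON approval_log FOR SELECT USING (true);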

Publicly readable, time-indexed, document-indexed. You can walk the full provenance of any document — who submitted it, when it was approved, by whom, with what notes — as a SQL query. The RLS policy explicitly makes the log transparent to the world. That is the “allocation” in participatory: the chain from a constituent’s submission to an answer is readable, not laundered through an opaque moderator.
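
The provenance walk can be tried locally with a stand-in table. The sketch below uses Python's built-in sqlite3 in place of Postgres, so UUID and TIMESTAMPTZ columns become TEXT. The sample rows, actor names, and the "submitted" action value are invented for illustration; the text does not list the actual action values.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for approval_log with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE approval_log ("
        "id TEXT PRIMARY KEY, document_id TEXT, action TEXT NOT NULL, "
        "actor TEXT, performed_at TEXT NOT NULL, notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO approval_log VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("l2", "doc-1", "approved", "moderator-a", "2025-01-02T09:30:00Z", "matches signed PDF"),
            ("l1", "doc-1", "submitted", "constituent-7", "2025-01-01T18:00:00Z", None),
            ("l3", "doc-2", "rejected", "moderator-b", "2025-01-03T12:00:00Z", "duplicate"),
        ],
    )
    return conn


def provenance(conn: sqlite3.Connection, document_id: str) -> list:
    """Return every logged transition for one document, oldest first."""
    cursor = conn.execute(
        "SELECT action, actor, performed_at, notes FROM approval_log "
        "WHERE document_id = ? ORDER BY performed_at ASC",
        (document_id,),
    )
    return cursor.fetchall()
```

The same SELECT, with a Postgres parameter placeholder, is the kind of query a reader with read access could run against the real table.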

The retrieval contract.

-- Given a 1536-dim query vector q and a threshold t:

SELECT
  dc.id,
  dc.document_id,
  dc.content,
  1 - (dc.embedding <=> q) AS similarity
FROM   document_chunks dc
JOIN   governance_documents gd ON gd.id = dc.document_id
WHERE  gd.status = 'approved'
  AND  1 - (dc.embedding <=> q) > t
ORDER  BY dc.embedding <=> q
LIMIT  10;

The contract is narrow and auditable. Retrieval never sees pending or rejected content. Every result carries a document_id that resolves, through a public foreign key, into the approval log. An answer that cites document X, chunk 3 can be walked back to the moment of approval, the approver, and the full original text, without leaving the database.
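
The three clauses of the contract (status gate, similarity threshold, ordered limit) can be mirrored in plain Python to make the guarantees easy to test. `match_chunks_sketch` is a hypothetical name and this is not the platform's match_document_chunks function. Treating a zero-length vector as maximally distant is an assumption made to keep the sketch total.

```python
import math


def cosine_distance(a, b) -> float:
    """Cosine distance, the quantity the <=> operator computes in the query above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if not norm:
        return 1.0  # assumption: zero vectors are treated as unrelated
    return 1.0 - dot / norm


def match_chunks_sketch(query_embedding, chunks, documents, threshold=0.5, limit=10):
    results = []
    for chunk in chunks:
        # Status gate: pending and rejected documents are never visible.
        if documents[chunk["document_id"]]["status"] != "approved":
            continue
        similarity = 1.0 - cosine_distance(chunk["embedding"], query_embedding)
        if similarity > threshold:
            results.append({
                "id": chunk["id"],
                "document_id": chunk["document_id"],
                "content": chunk["content"],
                "similarity": similarity,
            })
    # Highest similarity first is the same order as lowest distance first.
    results.sort(key=lambda r: r["similarity"], reverse=True)
    return results[:limit]
```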

Why this is the ingestion primitive.

Coordination (BTUT) is linear-time. Structure discovery (Crystara) is runtime. Signal (NIV) is transparent and validated. The fourth primitive has to make the whole thing continuously feedable. PDE does that:

It accepts arbitrary submissions without treating them as trusted.
It has an explicit approval surface — not a hidden moderation queue.
It thins before it embeds — every chunk is an independently retrievable, independently citable unit.
It crystallizes to a hybrid index — vector + full-text, with graceful degradation.
It logs every state transition publicly, so the provenance chain is queryable by anyone.

The Latent Ocean needs a substrate that can absorb the world as the world changes. A model that cannot ingest new governance in the same week that the governance was amended is not a substrate — it is a snapshot. PDE is the primitive that keeps the snapshot from being the final word.
