Storage
Storage is the set of patterns every entity in Omniglass lands on, so an operator can trust that scope, audit, retention, and lineage behave the same way no matter which table the data lives in. This page describes how storage works and the patterns every other leaf’s entities land on; it is not a per-table column dump.
Postgres is the relational system of record: it holds the entities, events, alarms, actions,
audit, config, and the platform settings store. It is the record/state/intent lane. It is never a
message bus: the live signal travels on NATS JetStream, and Postgres earns its place as the durable
record. Two write paths land here, and only one is the request path. Operator mutations and the
record/state/intent lane (config, ack/snooze, settings, manual commands, plus the event and
alarm rows an event_rule consumer commits in one transaction) are written synchronously through
the Storage Gateway. The datapoint tables are an async SINK: a NATS persistence consumer
batch-writes datapoints off the data lane (datapoints), idempotent on
(series, ts), so the rule engine never waits on a datapoint reaching Postgres. Committed changes on
the record lane are fanned out by a leader-elected CDC publisher (logical decoding of the WAL) to
JetStream; there is no dual-write, the change is born in the commit and CDC carries it. The column schemas live
with each owning feature: datapoints (the three
kind-tables), events (the event row), alarms and
actions (alarm / action), config and
credentials (variable / config / tags), core
entities and templates (the structural and
template tables), collection (interfaces and tasks),
calculations (the rule families), files,
time, and identity and access.
Conventions
- No tenant_id. Isolation is per-database (a database per tenant); there is no tenant column anywhere. The key registries datapoint_type and event_type carry a scope (template / org / official) deciding where the name is unique (key scope), and the non-template registries (interface_type, component_type, variable_type) carry an official boolean, the same axis minus the template layer: official: true rows are the ship-with canonical set distributed with the binary, and official: false rows are operator- or org-authored, local to this deployment.
- Three storage shapes. Ground-truth records are append-only and immutable, each named for what it is: log_datapoint (a datapoint kind), audit_log (operator actions), and the standing *_log ground-truth logs (session_log, internal_log, plus the collection_log / node_log companions). There is no telemetry table: datapoints are published to the JetStream data lane, not synchronously inserted, so the raw payload is not persisted in steady state; the persistence consumer sinks the typed datapoint, and raw appears only on a collection.failed event or a dev raw-mode tap (datapoints). A schedule fire is not a record here: it is an event with origin=scheduled. There is no separate rule-execution table: derived rows carry their lineage on the row. Datapoints (metric_datapoint / state_datapoint / log_datapoint) are the typed observation firehose. Stateful entities and projections (alarm, action, current-value) hold state directly or are rebuildable read models, views by default. The model is not event-sourced.
- Provenance and lineage on every datapoint: provenance (observed / calculated / intended), source (which sensor or path, for observed), and a lineage pointer. observed and calculated both carry source_rule (+ version), the function or calc_rule that produced the row; intended carries event_id (the command). A CHECK enforces the pointer per provenance; observed vs calculated is the provenance value itself, not a column-presence trick. Declared config is not a datapoint provenance; it lives in config, keyed to the same signal.
- Ownership is the exclusive-arc on every datapoint table, event, alarm, and variable: owner_kind enum plus the matching typed FK (component_id / system_id / location_id / node_id, or none for the singleton global) plus a CHECK that exactly the matching column is set (or all null for global). System-, location-, node-, and global-level datapoints are first-class. The full pattern is on core entities.
- Keys: datapoints and events use a surrogate id plus ts; the key registry datapoint_type carries a scope (template / org / official) deciding where the name is unique ((template_id, name) at template scope, name at org/official); structural entities are name-keyed; a task is content-addressed (hash(interface, kind, schedule, params)); a node is keyed by name.
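The exclusive-arc ownership rule can be sketched as a table-level CHECK. The example below is a minimal illustration only: it uses Python's built-in sqlite3 as a stand-in for Postgres, models just the ownership columns of a trimmed, hypothetical variable table, and omits the foreign keys that would back each typed id. The helper names open_db and try_insert are made up for the sketch.

```python
import sqlite3

# Hypothetical, trimmed table: only the ownership columns named in the text.
# SQLite stands in for Postgres; the foreign keys on the typed ids are omitted.
DDL = """
CREATE TABLE variable (
    id           INTEGER PRIMARY KEY,
    owner_kind   TEXT NOT NULL,
    component_id INTEGER,
    system_id    INTEGER,
    location_id  INTEGER,
    node_id      INTEGER,
    CHECK (
        (owner_kind = 'component' AND component_id IS NOT NULL
            AND system_id IS NULL AND location_id IS NULL AND node_id IS NULL)
     OR (owner_kind = 'system' AND system_id IS NOT NULL
            AND component_id IS NULL AND location_id IS NULL AND node_id IS NULL)
     OR (owner_kind = 'location' AND location_id IS NOT NULL
            AND component_id IS NULL AND system_id IS NULL AND node_id IS NULL)
     OR (owner_kind = 'node' AND node_id IS NOT NULL
            AND component_id IS NULL AND system_id IS NULL AND location_id IS NULL)
     OR (owner_kind = 'global'
            AND component_id IS NULL AND system_id IS NULL
            AND location_id IS NULL AND node_id IS NULL)
    )
);
"""


def open_db():
    """Create an in-memory database holding the sketch table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def try_insert(conn, owner_kind, component_id=None, system_id=None,
               location_id=None, node_id=None):
    """Return True if the row satisfies the exclusive-arc CHECK, else False."""
    try:
        conn.execute(
            "INSERT INTO variable "
            "(owner_kind, component_id, system_id, location_id, node_id) "
            "VALUES (?, ?, ?, ?, ?)",
            (owner_kind, component_id, system_id, location_id, node_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

A row owned by a component with only component_id set is accepted, while a component row that sets system_id instead, or a global row that sets any id, is rejected by the database itself rather than by application code.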
How the records relate
The relationships, not the columns. The columns of each table live on its owning leaf (linked above).
The structural and template entities (component / system / location and the *_template /
*_template_version / system_template_member / system_member families) relate as shown on
core entities and templates; the
collection entities (interface_type / interface / task) on
collection.
Two lanes land in Postgres differently
Every row in Postgres arrives on one of two lanes, and the lane decides how the row is written and how the rest of the platform learns it changed.
- The data lane (a sink). Observed and calculated datapoints live on the JetStream data lane. The rule engine consumes them directly off NATS; Postgres is the durable record, not the live signal. The persistence consumer is a durable JetStream consumer that batch-writes the metric_datapoint / state_datapoint / log_datapoint tables as an async sink, idempotent on (series, ts), so a redelivery lands the same row and the firehose never blocks on the database. Datapoints do not flow through CDC: they are already on NATS.
- The record/state/intent lane (PG-first, CDC-out). Events, alarms, actions, and operator mutations (config, ack/snooze, settings, manual commands) are born in a Postgres transaction. When an event_rule consumer fires, it writes the event row and the alarm transition in one transaction (the alarm transition is serialized per (event_rule, owner)); the API writes config, acks, and settings the same way. There is no row-lock single-fire worklist and no LISTEN/NOTIFY fan-out: the change is committed once, and the CDC publisher carries it outward.
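The idempotent sink on the data lane can be sketched as follows. This is an illustration under stated assumptions, not the platform's consumer: sqlite3 stands in for Postgres, the value column and the function names are hypothetical, and SQLite's INSERT OR IGNORE plays the role that an ON CONFLICT (series, ts) DO NOTHING clause would play in Postgres.

```python
import sqlite3


def open_sink():
    """In-memory stand-in for the datapoint table, unique on (series, ts)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE metric_datapoint (
            id     INTEGER PRIMARY KEY,   -- surrogate id
            series TEXT    NOT NULL,
            ts     INTEGER NOT NULL,
            value  REAL    NOT NULL,      -- hypothetical payload column
            UNIQUE (series, ts)
        )
        """
    )
    return conn


def sink_batch(conn, batch):
    """Write one batch in one transaction; existing (series, ts) rows are skipped."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO metric_datapoint (series, ts, value) "
            "VALUES (?, ?, ?)",
            batch,
        )


def row_count(conn):
    """Number of datapoints currently stored."""
    return conn.execute("SELECT COUNT(*) FROM metric_datapoint").fetchone()[0]
```

Calling sink_batch twice with the same batch leaves the row count unchanged, which is the property that lets the consumer acknowledge late or redelivered messages without checking what has already been written.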
The CDC publisher is leader-elected (exactly one active, fail over on death) via a NATS KV
CAS lock, the same singleton pattern the clock uses (time). It reads the WAL
by logical decoding and publishes each committed change to JetStream, where action_rule,
reconcile, and projection consumers react. The replication slot and publication it reads are
ensured in the idempotent boot phase (the same phase that upserts ship-with reference data),
not a run-once migration: boot creates them if absent and leaves them untouched if present, so a
fresh database and an existing one converge to the same state. Delivery is at-least-once with an
idempotency key per change, so a consumer that sees a change twice treats the repeat as a no-op.
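The leader-election idea (one active publisher, decided by a compare-and-set on a shared key) can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: InMemoryKV is not the NATS client API, and the lease with an expiry time is one plausible way to get "fail over on death", which the text does not specify.

```python
class InMemoryKV:
    """Tiny stand-in for a revisioned key-value bucket (not the NATS client API)."""

    def __init__(self):
        self._entries = {}  # key -> (revision, value)

    def get(self, key):
        """Return (revision, value) or None if the key is absent."""
        return self._entries.get(key)

    def compare_and_set(self, key, value, expected_revision):
        """Write only if the key is still at expected_revision (0 means absent)."""
        current = self._entries.get(key)
        current_revision = current[0] if current else 0
        if current_revision != expected_revision:
            return False  # someone else wrote first; the caller lost the race
        self._entries[key] = (current_revision + 1, value)
        return True


def try_acquire(kv, key, node, now, ttl):
    """Take or renew the leader lease; True means node is the leader afterwards."""
    lease = {"holder": node, "expires": now + ttl}
    entry = kv.get(key)
    if entry is None:
        return kv.compare_and_set(key, lease, 0)
    revision, current = entry
    # Renew our own lease, or take over one whose holder stopped renewing.
    if current["holder"] == node or current["expires"] <= now:
        return kv.compare_and_set(key, lease, revision)
    return False
```

With a ttl of 10, node a acquiring at time 0 blocks node b at time 5; if a stops renewing, b's attempt succeeds once the lease has expired, and a's later attempt fails because b now holds an unexpired lease.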
Ground-truth records
The immutable, append-only records, each named for what it is. They are the lineage targets and what
a backtest reads; none is derived. The detailed columns of audit_log live on
audit, session_log on nodes; the rest is a
compact list here because storage is their natural architectural home:
- log_datapoint (a component’s own words, a datapoint kind, datapoints);
- audit_log (operator actions: actor, verb, resource, old -> new; the lineage target for operator writes; secret decrypts always recorded, audit);
- session_log (connection-lifecycle transitions, node-reported; the connection log, nodes);
- internal_log (platform self-narration: startup / reconcile / migration / node-reg / config-sync, workers);
- the collection_log / node_log companions (the cheap per-run execution record and the node’s operational narration).
There is no separate rule-execution table: a derived row is the evidence of its rule’s run, carrying its lineage on the row (below).
The lineage CHECK (the pattern)
Lineage lives on the derived row, with no separate execution table. This is the pattern every derived
row follows: source_rule (+ version) is set for observed and calculated (the function or calc_rule
that produced the row); intended carries the command event_id. The pointer per provenance is enforced
so e.g. “intended with no command event” is impossible at the storage layer. One example, the datapoint
tables:
```sql
CHECK (
  (provenance IN ('observed','calculated') AND source_rule IS NOT NULL AND event_id IS NULL)
  OR (provenance = 'intended' AND event_id IS NOT NULL AND source_rule IS NULL)
)
```

Observed and calculated both carry source_rule; they are distinguished by the provenance
column, not a pointer-presence trick (an edge function versus a calc_rule). The intended split is
the one the CHECK enforces. This is one of three layers: the CHECK enforces which pointers are populated, foreign keys enforce
the ids are real, and the app enforces the value type matches the key’s kind.
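To see the CHECK layer reject an impossible row, the constraint can be exercised against a throwaway table. This is a sketch only: sqlite3 stands in for Postgres, the table is trimmed to the three columns the CHECK mentions, the column types are guesses, and the helper names are hypothetical. The foreign-key and value-type layers are not modeled.

```python
import sqlite3

# Trimmed stand-in table: only the columns the lineage CHECK refers to.
DDL = """
CREATE TABLE metric_datapoint (
    id          INTEGER PRIMARY KEY,
    provenance  TEXT NOT NULL,
    source_rule TEXT,
    event_id    INTEGER,
    CHECK (
        (provenance IN ('observed','calculated')
            AND source_rule IS NOT NULL AND event_id IS NULL)
     OR (provenance = 'intended'
            AND event_id IS NOT NULL AND source_rule IS NULL)
    )
);
"""


def open_db():
    """Create an in-memory database holding the sketch table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def accepts(conn, provenance, source_rule=None, event_id=None):
    """Return True if the storage layer accepts this lineage combination."""
    try:
        conn.execute(
            "INSERT INTO metric_datapoint (provenance, source_rule, event_id) "
            "VALUES (?, ?, ?)",
            (provenance, source_rule, event_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

An intended row without an event_id, an observed row without a source_rule, and any row carrying both pointers are all refused, which is the "impossible at the storage layer" property the text describes.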
The datapoint tables also carry nullable correlation_id and caused_by_event_id trace
columns. These are orthogonal to the lineage pointers above: they are not lineage pointers, so they
do not participate in the exclusive-lineage CHECK. They carry causation across the command -> device
-> observed-datapoint round trip so the cycle guard walks a real id (datapoints,
alarms and actions). On the wire these ride in NATS message
headers: a datapoint published to the data lane carries its correlation_id / caused_by_event_id
in the message header alongside the Nats-Msg-Id dedup key, and the persistence consumer lands them
into these columns, so the trace is unbroken from the live signal to the durable record.
Current value and projections: views by default
alarm and action are stateful entities that hold their own current state in a real table
(not event-sourced). Everything else that is “current state” is a read model, and the default is
a plain SQL view (always-correct, never stale, zero maintenance). A worker-maintained table is a
measured optimization, earned only when a read profile shows a view too slow.
| Read model | Of | Shape | Notes |
|---|---|---|---|
| current_value | latest datapoint per (owner, key, instance, provenance), fused across sources per the key’s fusion_policy | view | the dashboard read; per-provenance so observed and intended are both visible (the divergence model needs both), per-instance so siblings of one key stay distinct, fusion applied on read. The one table candidate if a profile earns it, metric kind only |
| session | session_log | view | low-volume; node, interface, status, opened_at, last_activity_at, command/error counts |
When the view stops scaling. A latest-per-key view’s cost scales with the number of distinct
keys (a loose index scan), not total rows. Point and scoped reads (“current value of X on Y”) are
a covering-index probe, fast at any size. A full-fleet “every current value” is O(distinct keys):
comfortable to hundreds of thousands, painful past a few million. A naive DISTINCT ON scans the
whole log and dies on the firehose; never that plan.
So only current_value for the metric firehose is even a table candidate, and only when
frequent full-fleet reads meet low-millions-plus distinct keys. The sparse kinds (state / log)
stay views indefinitely. A worker-maintained table costs one upsert per datapoint write (write
amplification, hot-key contention) and reintroduces a staleness window; that cost must be earned by
a read profile, not assumed. Never a materialized view: a PG MV is stale between refreshes and
has no incremental refresh, so a refresh is a full firehose recompute. The choice is plain view
(default) versus inline table (profiled).
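A latest-per-key view that does one index probe per distinct key, rather than scanning the log, might look like the sketch below. It is illustrative only: sqlite3 stands in for Postgres, the series table that enumerates the distinct (owner, key, instance, provenance) tuples is an assumption, key_name replaces key to sidestep keyword clashes, and fusion across sources is left out.

```python
import sqlite3

# Hypothetical schema: a small "series" table lists the distinct keys, and the
# view probes the (series, ts) index once per series for its newest row.
SCHEMA = """
CREATE TABLE series (
    id         INTEGER PRIMARY KEY,
    owner      TEXT NOT NULL,
    key_name   TEXT NOT NULL,
    instance   TEXT NOT NULL,
    provenance TEXT NOT NULL,
    UNIQUE (owner, key_name, instance, provenance)
);
CREATE TABLE metric_datapoint (
    series INTEGER NOT NULL REFERENCES series(id),
    ts     INTEGER NOT NULL,
    value  REAL    NOT NULL,
    PRIMARY KEY (series, ts)
);
CREATE VIEW current_value AS
SELECT s.owner, s.key_name, s.instance, s.provenance, d.ts, d.value
FROM series AS s
JOIN metric_datapoint AS d
  ON d.series = s.id
 AND d.ts = (SELECT MAX(m.ts) FROM metric_datapoint AS m WHERE m.series = s.id);
"""


def open_db():
    """Create an in-memory database with the tables and the view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def current_values(conn):
    """Read the view: one row per (owner, key, instance, provenance)."""
    return conn.execute(
        "SELECT owner, key_name, instance, provenance, ts, value "
        "FROM current_value "
        "ORDER BY owner, key_name, instance, provenance"
    ).fetchall()
```

Because the view is computed on read, it is never stale, and because observed and intended are separate series, both latest values appear side by side for the same key.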
Partitioning and retention
- Append-only tables are range-partitioned by ts (native declarative partitioning; pg_partman where the provider permits, else a documented manual roll). The firehose (metric_datapoint) is the partitioning-critical one.
- Retention is per table, set by policy, not one global TTL: metric_datapoint short, state_datapoint / log_datapoint longer, audit_log longest (compliance), internal_log short. On-row lineage ages out with its datapoint. The per-table defaults are cascade-resolved (cascade) with global defaults, so a class or entity can hold longer or shorter without a global change.
- The raw_sample buffer (the opt-in raw-retention policy, collection) is range-partitioned by ts and cold-tierable like the metric partitions, on a short retention. It is bounded, sampled, and short-lived; it is not a telemetry table.
- Views are not partitioned (bounded by fleet size, not time) and are computed from the underlying tables, never the source of truth.
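Cascade-resolved retention can be pictured as "the most specific setting that exists wins". The sketch below assumes that order (entity, then class, then per-table default, then global default); the real cascade rules live on the cascade page and may differ. All durations are invented placeholders that only preserve the short / longer / longest ordering given above.

```python
from datetime import timedelta

# Placeholder values, chosen only to respect the ordering in the text.
GLOBAL_DEFAULT = timedelta(days=30)
TABLE_DEFAULTS = {
    "metric_datapoint": timedelta(days=30),
    "state_datapoint": timedelta(days=365),
    "log_datapoint": timedelta(days=365),
    "audit_log": timedelta(days=7 * 365),
    "internal_log": timedelta(days=14),
}


def resolve_retention(table, class_override=None, entity_override=None):
    """Return the effective retention: the most specific non-None level wins."""
    for candidate in (entity_override, class_override, TABLE_DEFAULTS.get(table)):
        if candidate is not None:
            return candidate
    return GLOBAL_DEFAULT
```

Under this assumed order, a class that needs a year of metric history sets one class-level override, and every other class keeps the table default without any global change.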
The Storage Gateway and tiering
The Storage Gateway is the only door to the database (no direct access, no
PostgREST); it is also where IAM scope is injected, per action: every query carries
visible_set(P, action) for the specific action it performs, so a read filters by read-scope and an
:ack write filters by ack-scope. A write whose action-scoped predicate matches 0 rows is surfaced to
the handler as a 403 or 404, never a silent success, matching the up-front canDo decision
(identity and access). Isolation is per-database (one database per
tenant, paired one-to-one with one NATS account, datapoints), so there
is no tenant context to set. Every read and write lands here: the synchronous request path runs in
scoped mode, and the persistence-consumer datapoint sink and the CDC publisher run in system
mode (trusted internal work, all-visibility), the same three-mode contract identity and access
describes. The CDC publisher reads committed changes by logical decoding of the WAL, a
replication-protocol stream beneath the table surface; that is how it learns of a change without
re-querying, not a second application path around the Gateway. Because every
application read and write goes through the Gateway, the physical backend is swappable beneath it:
- default: Postgres for everything (datapoints, ground-truth records, views, registries). In single-binary mode the one binary embeds a real Postgres (the same code path runs an external Postgres at scale); the data lane’s persistence consumer and the record lane’s CDC publisher both target this one backend.
- tiering: the firehose does not stay in hot Postgres forever. Aged metric_datapoint / log_datapoint partitions tier out to a columnar or object store (Parquet on S3-compatible, or an embedded columnar engine) behind the same gateway, so historical queries fan across hot and cold with no model change. The cold tier is partitioned by ts.
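The "zero matched rows is never a silent success" rule for scoped writes can be sketched as below. This is a simplified illustration: sqlite3 stands in for Postgres, the alarm table is trimmed to hypothetical columns, the caller's ack-scope is modeled as a plain set of component ids rather than a real visible_set(P, action) predicate, and ScopeMiss is a made-up exception that a handler would translate into 403 or 404.

```python
import sqlite3


class ScopeMiss(Exception):
    """The action-scoped write matched no row; the handler maps this to 403/404."""


def open_db():
    """In-memory stand-in with two alarms owned by different components."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alarm ("
        "id INTEGER PRIMARY KEY, "
        "component_id INTEGER NOT NULL, "
        "acked INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO alarm (id, component_id) VALUES (?, ?)", [(1, 10), (2, 20)]
    )
    conn.commit()
    return conn


def ack_alarm(conn, alarm_id, ack_scope):
    """Ack one alarm, restricted to the component ids in the caller's ack-scope."""
    ids = sorted(ack_scope)
    if not ids:
        raise ScopeMiss(alarm_id)
    # Only the number of placeholders is dynamic; every value is a bound parameter.
    placeholders = ", ".join("?" for _ in ids)
    with conn:
        cursor = conn.execute(
            f"UPDATE alarm SET acked = 1 "
            f"WHERE id = ? AND component_id IN ({placeholders})",
            [alarm_id, *ids],
        )
    if cursor.rowcount == 0:
        raise ScopeMiss(alarm_id)
```

The scope predicate is part of the statement itself, so an out-of-scope alarm is indistinguishable from a missing one at the SQL level, and the write either changes a row or raises.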
Query construction: typed, parameterized, generated
The gateway builds every query with jet, a type-safe SQL builder
whose column and table types are generated from the dbmate-managed schema (dbmate stays the single
schema authority; jet regenerates after migrate). The shape is dynamic (the per-action scope predicate,
the filter expression, order, pagination compose at runtime) but the safety
is structural, not by discipline:
- Values are always bound parameters, never interpolated into SQL text.
- Identifiers (columns, tables) are typed constants from the generated schema, so a wrong or attacker-supplied column name is a compile error, never a string. The filter language’s field names resolve against those same generated columns before they become a predicate.
- Operators are a closed set.
A wrong column or type fails the build, so the compiler and tests catch a bad query before runtime, which
is what keeps the gateway safe to evolve and safe for an AI to edit. Because all dynamic construction
lives in this one module, the injection-safe discipline is a single reviewable chokepoint. The one
carve-out is the high-volume datapoint insert (the persistence consumer), which may use pgx COPY for
throughput, still inside the gateway. It runs in all-visibility system mode, not per-row scoped: its
safety rests on the typed column targets plus the upstream admission consumer having already confined
owners (identity and access), not on a per-write scope predicate.
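The three rules (bound values, identifiers from a fixed set, a closed operator set) can be illustrated without jet. The sketch below is not jet's API and does not give compile-time guarantees: an Enum stands in for the generated column constants, the column names are hypothetical, and an unknown field fails at runtime where the generated types would fail the build.

```python
from enum import Enum


class AlarmColumn(Enum):
    """Stand-in for generated column constants (hypothetical column names)."""
    ID = "id"
    SEVERITY = "severity"
    STATE = "state"


# Closed operator set: a filter token maps to SQL text only through this table.
OPERATORS = {"eq": "=", "ne": "<>", "lt": "<", "lte": "<=", "gt": ">", "gte": ">="}


def build_predicate(field, op, value):
    """Return (sql_fragment, params); raise ValueError on unknown field or operator."""
    try:
        column = AlarmColumn(field)  # resolve the filter field against known columns
    except ValueError:
        raise ValueError(f"unknown field: {field!r}") from None
    if op not in OPERATORS:
        raise ValueError(f"unknown operator: {op!r}")
    # The value never enters the SQL text; it travels as a bound parameter.
    return f"{column.value} {OPERATORS[op]} ?", [value]
```

For example, build_predicate("severity", "gte", 3) yields the fragment severity >= ? with the parameter list [3], while a field such as "id; DROP TABLE alarm" is rejected before any SQL text is produced.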