Semantic Models

Semantic models are YAML files that describe your database tables as datasets with typed fields, relationships, and metrics. They follow the Open Semantic Interchange (OSI) spec and use DuckDB SQL syntax for expressions.

archmax uses the OSI YAML format as its internal storage format — every dataset, field, relationship, and metric is persisted as spec-compliant YAML on disk. When AI agents request model information through MCP tools, archmax converts the OSI YAML into a compressed markdown digest on the fly. This digest preserves all semantically relevant information (field types, descriptions, enums, relationships, examples) while using 3–5× fewer tokens than the equivalent YAML, making agent interactions significantly cheaper and faster.

Each project stores its semantic models under `<ARCHMAX_DATA_DIR>/projects/<projectId>/src/`. A model consists of a root file and per-dataset files:

```
src/
├── ecommerce.yaml        # Root file: name, description, relationships, metrics
└── ecommerce/
    ├── orders.yaml       # Dataset: orders table
    ├── customers.yaml    # Dataset: customers table
    └── products.yaml     # Dataset: products table
```

The root YAML file contains model-level metadata:

```yaml
name: ecommerce
description: E-commerce data model for order analytics
ai_context:
  instructions: Use this model for revenue and order analysis
  synonyms:
    - sales model
    - shop data
relationships:
  - name: order_customer
    from_model: orders
    from_columns: [customer_id]
    to_model: customers
    to_columns: [customer_id]
    ai_context: Links orders to the customer who placed them
metrics:
  - name: total_revenue
    description: Sum of all order amounts
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "SUM(orders.total_amount)"
    ai_context: Total revenue across all orders
```
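To illustrate what a relationship entry encodes, the sketch below builds a DuckDB-style JOIN clause from the `order_customer` relationship above. The `join_clause` helper is hypothetical — it is not part of archmax — and the dict simply mirrors the YAML:

```python
# Sketch: turning an OSI relationship entry into a SQL JOIN clause.
# The dict mirrors the order_customer relationship above; join_clause
# is a hypothetical helper, not archmax code.
relationship = {
    "name": "order_customer",
    "from_model": "orders",
    "from_columns": ["customer_id"],
    "to_model": "customers",
    "to_columns": ["customer_id"],
}

def join_clause(rel: dict) -> str:
    # Pair up from/to columns into equality predicates.
    conditions = " AND ".join(
        f"{rel['from_model']}.{f} = {rel['to_model']}.{t}"
        for f, t in zip(rel["from_columns"], rel["to_columns"])
    )
    return f"JOIN {rel['to_model']} ON {conditions}"

print(join_clause(relationship))
# → JOIN customers ON orders.customer_id = customers.customer_id
```

Multi-column relationships work the same way: each `from_columns`/`to_columns` pair becomes one equality predicate, joined with AND.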

Each dataset maps to a database table or view:

```yaml
name: orders
source: shopify.public.orders
primary_key: [order_id]
description: Customer orders with line items and totals
ai_context:
  instructions: Each row is one order. Use total_amount for revenue calculations.
fields:
  - name: order_id
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "order_id"
    custom_extensions:
      - vendor_name: COMMON
        data:
          data_type: INTEGER
  - name: total_amount
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "total_amount"
    custom_extensions:
      - vendor_name: COMMON
        data:
          data_type: DECIMAL
  - name: created_at
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "created_at"
    dimension:
      is_time: true
    custom_extensions:
      - vendor_name: COMMON
        data:
          data_type: TIMESTAMP
```
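To show how per-field expressions compose into a query against the dataset's `source`, here is a minimal sketch. The dialect nesting is flattened for brevity, and `render_select` is an illustrative helper, not archmax's actual query builder:

```python
# Sketch: composing a SELECT from a dataset's field expressions.
# Mirrors the orders dataset above (dialects flattened for brevity);
# render_select is hypothetical, not archmax's query builder.
dataset = {
    "name": "orders",
    "source": "shopify.public.orders",
    "fields": [
        {"name": "order_id", "expression": "order_id"},
        {"name": "total_amount", "expression": "total_amount"},
        {"name": "created_at", "expression": "created_at"},
    ],
}

def render_select(ds: dict) -> str:
    # Alias each expression to its field name in the SELECT list.
    cols = ", ".join(f"{f['expression']} AS {f['name']}" for f in ds["fields"])
    return f"SELECT {cols} FROM {ds['source']}"

print(render_select(dataset))
# → SELECT order_id AS order_id, total_amount AS total_amount,
#   created_at AS created_at FROM shopify.public.orders
```

Because field expressions can be arbitrary DuckDB SQL (not just column names), the same composition works for computed fields like `total_amount / 100`.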

When a model contains many datasets, you can organize them into groups, which appear as visual bounding-box rectangles in the graph view. Groups are stored in the root YAML file's `custom_extensions`:

```yaml
custom_extensions:
  - vendor_name: COMMON
    data: '{"dataset_groups":[{"id":"grp_abc12345","name":"Order Management","datasets":["orders","order_items","customers"],"color":"sage"}]}'
```
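Note that the `data` value is a JSON string embedded in the YAML, so tools reading group definitions must decode it in a second step. A minimal stdlib-only sketch, using the extension payload shown above:

```python
import json

# The COMMON extension's data value is a JSON string, as in the YAML above.
data = ('{"dataset_groups":[{"id":"grp_abc12345","name":"Order Management",'
        '"datasets":["orders","order_items","customers"],"color":"sage"}]}')

# Decode the embedded JSON and list each group's members.
groups = json.loads(data)["dataset_groups"]
for g in groups:
    print(f"{g['name']} ({g['color']}): {', '.join(g['datasets'])}")
# → Order Management (sage): orders, order_items, customers
```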
Groups are managed through the graph view's context menu:

  • Right-click a dataset → “Create group” to start a new group containing that dataset
  • Right-click a dataset → “Add to group” to move it into an existing group
  • Right-click a dataset → “Remove from group” to ungroup it
  • Right-click a group box → “Rename group” or “Delete group”
  • Double-click a group label to rename it inline

The AI builder automatically creates groups when building models with 4 or more datasets, clustering by schema prefix, star-schema topology, or business domain.

Groups use a 4-color CI palette: sage, rose, blue, purple. Colors are assigned automatically when creating groups.

Every entity (model, dataset, field, relationship, metric) supports `ai_context`, which can be either a plain string or a structured object with instructions, synonyms, and examples. This metadata is surfaced to AI agents through MCP tools, helping them understand what the data means and how to use it.

The admin UI includes a chat-based AI agent that can build semantic models for you. Navigate to Semantic Models and start a new conversation describing the model you want. The agent works through five steps:

  1. Schema discovery: list tables and columns from your connections
  2. Field mapping: create typed fields with correct SQL expressions
  3. Enum detection: find columns with limited distinct values
  4. Relationship inference: detect foreign keys and join paths
  5. Metric definition: suggest common aggregations
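Enum detection (step 3) can be approximated with a distinct-count heuristic. The sketch below is illustrative only — the threshold of 10 and the sample values are assumptions, not the agent's actual logic:

```python
# Sketch: flag a column as enum-like when it has few distinct values.
# The max_distinct threshold and sample data are illustrative assumptions.
def looks_like_enum(values, max_distinct=10):
    distinct = {v for v in values if v is not None}  # ignore NULLs
    return 0 < len(distinct) <= max_distinct, sorted(distinct)

statuses = ["pending", "shipped", "shipped", "delivered", "pending"]
is_enum, distinct_values = looks_like_enum(statuses)
print(is_enum, distinct_values)
# → True ['delivered', 'pending', 'shipped']
```

When a column qualifies, its distinct values can be recorded as `distinct_values` in the field's `custom_extensions`, as shown in the digest example below.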

You can also create and edit models manually through the YAML files or the admin UI editor.

After editing a model, click Publish to make it available to MCP clients. Publishing assembles the split source files (`src/`) into optimized single-file YAMLs in the `build/` directory. MCP tools in production always read from the published build.
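Conceptually, publishing merges the root file's metadata with each per-dataset file into one document. A simplified sketch, where plain dicts stand in for the parsed YAML files (this is not the real build step):

```python
# Sketch: assembling split source files into a single model document.
# Dicts stand in for parsed YAML files; not archmax's actual publish code.
root = {"name": "ecommerce", "relationships": [], "metrics": []}
datasets = [{"name": "orders"}, {"name": "customers"}, {"name": "products"}]

def assemble(root: dict, datasets: list[dict]) -> dict:
    built = dict(root)            # copy root-level metadata
    built["datasets"] = datasets  # inline the per-dataset files
    return built

build = assemble(root, datasets)
print([d["name"] for d in build["datasets"]])
# → ['orders', 'customers', 'products']
```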

AI agents never interact with the raw OSI YAML files. Instead, the MCP tools (`get_semantic_model`, `get_datasets`) convert the YAML into a compressed markdown digest before returning it to the agent. This conversion:

  • Reduces token usage by 3–5× — the OSI YAML format includes verbose structures like `expression.dialects[].expression`, `custom_extensions[].vendor_name`, and deeply nested `ai_context` objects. The digest flattens these into compact markdown tables and bullet lists.
  • Preserves all agent-relevant context — field types, descriptions, enum values, example data, synonyms, relationships, metric expressions, and query instructions are all included.
  • Supports pagination — large models with many datasets or fields are paginated so agents can drill into specific sections without loading everything at once.

For example, a field that looks like this in OSI YAML:

```yaml
- name: status
  expression:
    dialects:
      - dialect: ANSI_SQL
        expression: "status"
  description: Current order status
  custom_extensions:
    - vendor_name: COMMON
      data: '{"data_type":"VARCHAR","distinct_values":["pending","shipped","delivered"]}'
  ai_context:
    synonyms:
      - order state
    instructions: Filter on this field to segment by fulfillment stage
```

Becomes a single line in the markdown digest:

```
- **status** `VARCHAR` {pending, shipped, delivered} — Current order status | _order state_ | Note: Filter on this field to segment by fulfillment stage
```
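The flattening step can be sketched in a few lines. The `digest_field` helper below is hypothetical — it mirrors the example output but is not archmax's actual converter:

```python
import json

# Sketch: flattening an OSI field entry into one digest line.
# digest_field is a hypothetical helper, not archmax's converter;
# the dict mirrors the status field above.
field = {
    "name": "status",
    "description": "Current order status",
    "custom_extensions": [
        {"vendor_name": "COMMON",
         "data": '{"data_type":"VARCHAR","distinct_values":["pending","shipped","delivered"]}'},
    ],
    "ai_context": {
        "synonyms": ["order state"],
        "instructions": "Filter on this field to segment by fulfillment stage",
    },
}

def digest_field(field: dict) -> str:
    # Pull the COMMON extension payload (a JSON string in OSI YAML).
    common = {}
    for ext in field.get("custom_extensions", []):
        if ext.get("vendor_name") == "COMMON":
            common = json.loads(ext["data"])
    line = f"- **{field['name']}** `{common.get('data_type', '?')}`"
    if common.get("distinct_values"):
        line += " {" + ", ".join(common["distinct_values"]) + "}"
    if field.get("description"):
        line += f" — {field['description']}"
    ai = field.get("ai_context") or {}
    if ai.get("synonyms"):
        line += " | _" + ", ".join(ai["synonyms"]) + "_"
    if ai.get("instructions"):
        line += f" | Note: {ai['instructions']}"
    return line

print(digest_field(field))
```

Running this prints exactly the digest line shown above: one markdown bullet carrying the type, enum values, description, synonyms, and query instructions.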

This compression is what makes it practical to give agents full context about large models without burning through token budgets.