Semantic Models

Semantic models are YAML files that describe your database tables as datasets with typed fields, relationships, and metrics. They follow the Open Semantic Interchange (OSI) spec and use DuckDB SQL syntax for expressions.

archmax uses the OSI YAML format as its internal storage format — every dataset, field, relationship, and metric is persisted as spec-compliant YAML on disk. When AI agents request model information through MCP tools, archmax converts the OSI YAML into a compressed markdown digest on the fly. This digest preserves all semantically relevant information (field types, descriptions, enums, relationships, examples) while using 3–5× fewer tokens than the equivalent YAML, making agent interactions significantly cheaper and faster.

Each project stores its semantic models under `<ARCHMAX_DATA_DIR>/projects/<projectId>/src/`. A model consists of a root file and per-dataset files:

```
src/
├── ecommerce.yaml        # Root file: name, description, relationships, metrics
└── ecommerce/
    ├── orders.yaml       # Dataset: orders table
    ├── customers.yaml    # Dataset: customers table
    └── products.yaml     # Dataset: products table
```

The root YAML file contains model-level metadata:

```yaml
name: ecommerce
description: E-commerce data model for order analytics
ai_context:
  instructions: Use this model for revenue and order analysis
  synonyms:
    - sales model
    - shop data
relationships:
  - name: order_customer
    from_model: orders
    from_columns: [customer_id]
    to_model: customers
    to_columns: [customer_id]
    ai_context: Links orders to the customer who placed them
metrics:
  - name: total_revenue
    description: Sum of all order amounts
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "SUM(orders.total_amount)"
    ai_context: Total revenue across all orders
```
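To illustrate what a relationship entry encodes, the sketch below builds a DuckDB-style JOIN clause from the `order_customer` relationship above. The `join_clause` helper is hypothetical — it is not part of archmax — and the dict simply mirrors the YAML:

```python
# Sketch: turning an OSI relationship entry into a SQL JOIN clause.
# The dict mirrors the order_customer relationship above; join_clause
# is a hypothetical helper, not archmax code.
relationship = {
    "name": "order_customer",
    "from_model": "orders",
    "from_columns": ["customer_id"],
    "to_model": "customers",
    "to_columns": ["customer_id"],
}

def join_clause(rel: dict) -> str:
    # Pair up from/to columns into equality predicates.
    conditions = " AND ".join(
        f"{rel['from_model']}.{f} = {rel['to_model']}.{t}"
        for f, t in zip(rel["from_columns"], rel["to_columns"])
    )
    return f"JOIN {rel['to_model']} ON {conditions}"

print(join_clause(relationship))
# → JOIN customers ON orders.customer_id = customers.customer_id
```

Multi-column relationships work the same way: each `from_columns`/`to_columns` pair becomes one equality predicate, joined with AND.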

Each dataset maps to a database table or view:

```yaml
name: orders
source: shopify.public.orders
primary_key: [order_id]
description: Customer orders with line items and totals
ai_context:
  instructions: Each row is one order. Use total_amount for revenue calculations.
fields:
  - name: order_id
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "order_id"
    custom_extensions:
      - vendor_name: COMMON
        data:
          data_type: INTEGER
  - name: total_amount
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "total_amount"
    custom_extensions:
      - vendor_name: COMMON
        data:
          data_type: DECIMAL
  - name: created_at
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: "created_at"
    dimension:
      is_time: true
    custom_extensions:
      - vendor_name: COMMON
        data:
          data_type: TIMESTAMP
```
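To show how per-field expressions compose into a query against the dataset's `source`, here is a minimal sketch. The dialect nesting is flattened for brevity, and `render_select` is an illustrative helper, not archmax's actual query builder:

```python
# Sketch: composing a SELECT from a dataset's field expressions.
# Mirrors the orders dataset above (dialects flattened for brevity);
# render_select is hypothetical, not archmax's query builder.
dataset = {
    "name": "orders",
    "source": "shopify.public.orders",
    "fields": [
        {"name": "order_id", "expression": "order_id"},
        {"name": "total_amount", "expression": "total_amount"},
        {"name": "created_at", "expression": "created_at"},
    ],
}

def render_select(ds: dict) -> str:
    # Alias each expression to its field name in the SELECT list.
    cols = ", ".join(f"{f['expression']} AS {f['name']}" for f in ds["fields"])
    return f"SELECT {cols} FROM {ds['source']}"

print(render_select(dataset))
# → SELECT order_id AS order_id, total_amount AS total_amount,
#   created_at AS created_at FROM shopify.public.orders
```

Because field expressions can be arbitrary DuckDB SQL (not just column names), the same composition works for computed fields like `total_amount / 100`.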

When a model contains many datasets, you can organize them into groups, which appear as visual bounding-box rectangles in the graph view. Groups are stored in the root YAML file's `custom_extensions`:

```yaml
custom_extensions:
  - vendor_name: COMMON
    data: '{"dataset_groups":[{"id":"grp_abc12345","name":"Order Management","datasets":["orders","order_items","customers"],"color":"sage"}]}'
```
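Note that the `data` value is a JSON string embedded in the YAML, so tools reading group definitions must decode it in a second step. A minimal stdlib-only sketch, using the extension payload shown above:

```python
import json

# The COMMON extension's data value is a JSON string, as in the YAML above.
data = ('{"dataset_groups":[{"id":"grp_abc12345","name":"Order Management",'
        '"datasets":["orders","order_items","customers"],"color":"sage"}]}')

# Decode the embedded JSON and list each group's members.
groups = json.loads(data)["dataset_groups"]
for g in groups:
    print(f"{g['name']} ({g['color']}): {', '.join(g['datasets'])}")
# → Order Management (sage): orders, order_items, customers
```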
Groups are managed through the graph view's context menu:

  • Right-click a dataset → “Create group” to start a new group containing that dataset
  • Right-click a dataset → “Add to group” to move it into an existing group
  • Right-click a dataset → “Remove from group” to ungroup it
  • Right-click a group box → “Rename group” or “Delete group”
  • Double-click a group label to rename it inline

The AI builder automatically creates groups when building models with 4 or more datasets, clustering by schema prefix, star-schema topology, or business domain.

Groups use a 4-color CI palette: sage, rose, blue, purple. Colors are assigned automatically when creating groups.

Every entity (model, dataset, field, relationship, metric) supports `ai_context`, which can be either a plain string or a structured object with instructions, synonyms, and examples. This metadata is surfaced to AI agents through MCP tools, helping them understand what the data means and how to use it.

The admin UI includes a chat-based AI agent that can build semantic models for you. Navigate to Semantic Models and start a new conversation describing the model you want. The agent works through five steps:

  1. Schema discovery: list tables and columns from your connections
  2. Field mapping: create typed fields with correct SQL expressions
  3. Enum detection: find columns with limited distinct values
  4. Relationship inference: detect foreign keys and join paths
  5. Metric definition: suggest common aggregations
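Enum detection (step 3) can be approximated with a distinct-count heuristic. The sketch below is illustrative only — the threshold of 10 and the sample values are assumptions, not the agent's actual logic:

```python
# Sketch: flag a column as enum-like when it has few distinct values.
# The max_distinct threshold and sample data are illustrative assumptions.
def looks_like_enum(values, max_distinct=10):
    distinct = {v for v in values if v is not None}  # ignore NULLs
    return 0 < len(distinct) <= max_distinct, sorted(distinct)

statuses = ["pending", "shipped", "shipped", "delivered", "pending"]
is_enum, distinct_values = looks_like_enum(statuses)
print(is_enum, distinct_values)
# → True ['delivered', 'pending', 'shipped']
```

When a column qualifies, its distinct values can be recorded as `distinct_values` in the field's `custom_extensions`, as shown in the digest example below.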

You can also create and edit models manually through the YAML files or the admin UI editor.

After editing a model, click Publish to make it available to MCP clients. Publishing assembles the split source files (`src/`) into optimized single-file YAMLs in the `build/` directory. MCP tools in production always read from the published build.
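Conceptually, publishing merges the root file's metadata with each per-dataset file into one document. A simplified sketch, where plain dicts stand in for the parsed YAML files (this is not the real build step):

```python
# Sketch: assembling split source files into a single model document.
# Dicts stand in for parsed YAML files; not archmax's actual publish code.
root = {"name": "ecommerce", "relationships": [], "metrics": []}
datasets = [{"name": "orders"}, {"name": "customers"}, {"name": "products"}]

def assemble(root: dict, datasets: list[dict]) -> dict:
    built = dict(root)            # copy root-level metadata
    built["datasets"] = datasets  # inline the per-dataset files
    return built

build = assemble(root, datasets)
print([d["name"] for d in build["datasets"]])
# → ['orders', 'customers', 'products']
```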

AI agents never interact with the raw OSI YAML files. Instead, the MCP tools (`get_semantic_model`, `get_datasets`) convert the YAML into a compressed markdown digest before returning it to the agent. This conversion:

  • Reduces token usage by 3–5× — the OSI YAML format includes verbose structures like `expression.dialects[].expression`, `custom_extensions[].vendor_name`, and deeply nested `ai_context` objects. The digest flattens these into compact markdown tables and bullet lists.
  • Preserves all agent-relevant context — field types, descriptions, enum values, example data, synonyms, relationships, metric expressions, and query instructions are all included.
  • Supports pagination — large models with many datasets or fields are paginated so agents can drill into specific sections without loading everything at once.

For example, a field that looks like this in OSI YAML:

```yaml
- name: status
  expression:
    dialects:
      - dialect: ANSI_SQL
        expression: "status"
  description: Current order status
  custom_extensions:
    - vendor_name: COMMON
      data: '{"data_type":"VARCHAR","distinct_values":["pending","shipped","delivered"]}'
  ai_context:
    synonyms:
      - order state
    instructions: Filter on this field to segment by fulfillment stage
```

Becomes a single line in the markdown digest:

```
- **status** `VARCHAR` {pending, shipped, delivered} — Current order status | _order state_ | Note: Filter on this field to segment by fulfillment stage
```
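The flattening step can be sketched in a few lines. The `digest_field` helper below is hypothetical — it mirrors the example output but is not archmax's actual converter:

```python
import json

# Sketch: flattening an OSI field entry into one digest line.
# digest_field is a hypothetical helper, not archmax's converter;
# the dict mirrors the status field above.
field = {
    "name": "status",
    "description": "Current order status",
    "custom_extensions": [
        {"vendor_name": "COMMON",
         "data": '{"data_type":"VARCHAR","distinct_values":["pending","shipped","delivered"]}'},
    ],
    "ai_context": {
        "synonyms": ["order state"],
        "instructions": "Filter on this field to segment by fulfillment stage",
    },
}

def digest_field(field: dict) -> str:
    # Pull the COMMON extension payload (a JSON string in OSI YAML).
    common = {}
    for ext in field.get("custom_extensions", []):
        if ext.get("vendor_name") == "COMMON":
            common = json.loads(ext["data"])
    line = f"- **{field['name']}** `{common.get('data_type', '?')}`"
    if common.get("distinct_values"):
        line += " {" + ", ".join(common["distinct_values"]) + "}"
    if field.get("description"):
        line += f" — {field['description']}"
    ai = field.get("ai_context") or {}
    if ai.get("synonyms"):
        line += " | _" + ", ".join(ai["synonyms"]) + "_"
    if ai.get("instructions"):
        line += f" | Note: {ai['instructions']}"
    return line

print(digest_field(field))
```

Running this prints exactly the digest line shown above: one markdown bullet carrying the type, enum values, description, synonyms, and query instructions.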

This compression is what makes it practical to give agents full context about large models without burning through token budgets.