Data Federation
archmax uses DuckDB as an in-process query engine that can federate queries across multiple database connections. Each project gets its own DuckDB instance with connections attached as named schemas.
How It Works
Section titled “How It Works”When you add a connection to a project and mark it as active, archmax attaches it to the project’s DuckDB instance using DuckDB’s extension system:
| Database Type | DuckDB Extension | Registry |
|---|---|---|
| PostgreSQL | postgres | Core |
| MySQL | mysql | Core |
| SQL Server | mssql | Community |
| SQLite | sqlite | Core |
| DuckDB | Native | - |
| Iceberg REST Catalog | iceberg + httpfs | Core |
Each connection’s slug becomes the DuckDB schema alias. For example, a connection with slug shopify attached from a Postgres database makes its tables available as shopify.<schema>.<table>.
SQL Server (MSSQL) Connections
Section titled “SQL Server (MSSQL) Connections”The SQL Server connector uses the DuckDB MSSQL community extension, which is installed automatically from the community registry.
When using Connection Details (host/port fields), archmax builds an ADO.NET connection string:
Server=host,port;Database=db;User Id=user;Password=pass;Encrypt=yesWhen using Connection URI, you can pass any format the extension accepts (ADO.NET or URI) directly.
The Encrypt connection (TLS) toggle controls the Encrypt parameter and defaults to on. Disable it only for local development servers that don’t support TLS.
Iceberg REST Catalog Connections
Section titled “Iceberg REST Catalog Connections”Iceberg connections let you query data in Apache Iceberg tables exposed through an Iceberg REST Catalog (Lakekeeper, Polaris, Cloudflare R2, etc.).
Unlike RDBMS connections, Iceberg uses a two-step attach: archmax creates a DuckDB secret with the bearer token, then attaches the catalog with the REST endpoint.
Required fields:
| Field | Description |
|---|---|
| Catalog Endpoint | URL of the Iceberg REST Catalog (e.g. https://catalog.example.com) |
| Warehouse | Warehouse identifier registered in the catalog |
| Bearer Token | Authentication token for the catalog |
Once attached, Iceberg namespaces and tables are available as <slug>.<namespace>.<table> and can be joined with any other connection type.
Limitations:
- Read-only access (no INSERT/UPDATE/DELETE)
- Bearer token auth only (OAuth2 support planned)
- Backed by S3 or GCS object storage (other backends not yet supported by DuckDB)
Cross-Connection Queries
Section titled “Cross-Connection Queries”Once multiple connections are attached, you can query across them in a single SQL statement:
SELECT s.order_id, s.total_amount, d.invoice_numberFROM shopify.public.orders sJOIN datev.dbo.invoices d ON s.order_id = d.external_order_idWHERE s.created_at > '2024-01-01'This is used internally by:
- The Data Browser for exploring connection schemas
- The Semantic Model Agent for discovering and mapping tables
- The MCP
execute_querytool for AI agent queries (through scoped VIEWs)
Scoped VIEWs
Section titled “Scoped VIEWs”When AI agents use execute_query or runModelQuery, they don’t access raw connection schemas directly. Instead, each semantic model gets a set of VIEWs whose SELECT bodies are authored explicitly by the modeller in each dataset’s view_query custom extension. This is the curated access layer.
Per-model VIEWs are (re-)materialised on every model-scoped query with CREATE OR REPLACE VIEW; there is no in-memory cache. Editing a dataset’s view_query takes effect on the very next call. Agents reference datasets by their bare name (e.g. FROM "orders") — the platform resolves the name to the right VIEW automatically. A dataset without view_query is unqueryable, and the tools return a clear error identifying the offending datasets.
Bare-name references are enforced structurally. Every query passes through a validator that parses the SQL with DuckDB’s own parser and rejects any base-table reference whose AST has a non-empty schema_name or catalog_name, regardless of how the reference is spelled in source text — quoted ("shopify"."public"."orders"), unicode-escaped, or dollar-quoted variants are all rejected on the parse tree, not on the raw text. See the execute_query reference for the full denylist.
Limits of the View Layer
Section titled “Limits of the View Layer”What the view layer does protect against
Section titled “What the view layer does protect against”These properties hold whenever the validator and view_query materialisation are working correctly:
- No raw catalog escape from MCP. An
execute_querycaller cannot reach attached connections directly — anycatalog.schema.tablereference is rejected on the AST regardless of quoting variant ("shopify"."public"."orders",U&"…", dollar-quoted, mixed-case). - No system-catalog enumeration.
information_schema,pg_catalog,sqlite_master,main,temp, andsystemare denied at the validator. So is anything beginning with_scope_orduckdb_(covers cross-model schema references and DuckDB metadata views likeduckdb_secrets). - No file or network egress.
enable_external_access = falseblocksread_csv/read_parquet/pg_read_file/COPY/ HTTP fetches at the engine level; the validator denylist covers the same surface at parse time. - No DDL or mutation. Only SELECT / WITH / set-operation / EXPLAIN / DESCRIBE statements pass;
EXPLAIN ANALYZE(and comment-evasion variants) is rejected because it executes the wrapped statement. - No stale view exposure. A
view_querythat fails to rematerialise (validator rejection or DuckDB error) makes the model unqueryable until fixed; MCP refuses to fall back to the previous view body.
What the view layer does NOT protect against
Section titled “What the view layer does NOT protect against”- Modeller mistakes. A
view_queryofSELECT * FROM shop.public.usersexposes every column ofusers, including PII, password hashes, and audit columns. The platform never inspects view bodies for sensitive content, never auto-redacts columns, and never enforces row-level filters. Treat eachview_queryas an authorisation contract you have to write by hand. - Co-tenancy across MCP tokens scoped to the same model. Two MCP bearer tokens granting access to model
ecommercesee exactly the same rows. There is no per-token row filter, nocurrent_user-aware view body, no automatic tenant-id injection. If you need per-tenant isolation, model the tenant column explicitly inview_queryand split tenants across separate archmax projects (or build a tenant-awareview_querythat reads the agent identity fromparams). - Co-tenancy across models in the same project. Every model in a project shares one DuckDB instance, one
duckdb.dbfile, and the same set of attached connections with the same database credentials. There is no per-model database role. A connection user with DELETE rights upstream gives DuckDB latent DELETE rights for every model in the project — onlyenable_external_access = falseand the read-only validator keep that latent power from being used. - DuckDB / extension privilege-escalation bugs. The validator parses with DuckDB’s own parser and runs queries on the same process that holds every attached catalog. A binder, optimizer, or extension bug that lets a structurally-valid SELECT reach an unintended object (or trigger arbitrary code via a malicious community extension) is not contained by the view layer. Keep DuckDB and its extensions up to date.
- Resource exhaustion across the project.
threads = 2andmemory_limit = '512MB'are session settings on a shared instance. A pathological query under one model can still cause memory pressure for queries running under another model in the same project, and the per-projectwithProjectQuerySlotsemaphore caps concurrency but does not isolate work. - Side-channels through the persistent file.
<ARCHMAX_DATA_DIR>/projects/<projectId>/duckdb.dbholds every model’s_scope_<model>schema in plain DDL. Anyone who can read the file can read every materialised view body. Protect the data directory with the same filesystem permissions you would give a database data dir. - Upstream credential leakage. Connection passwords / tokens are encrypted at rest and redacted in API responses, but they are still passed to DuckDB’s connector for every
ATTACH. A query that triggers a driver-level error does not leak them to the API caller (the connection-test endpoint sanitises and the federated query path returns generic errors), but anything that gets into operator logs is your responsibility to retain or rotate.
Operator checklist
Section titled “Operator checklist”- Use a least-privilege database user for every connection — read-only on exactly the schemas your
view_querybodies need. The validator is defence-in-depth on top of this, not a replacement for it. - Treat
view_queryas the authorisation boundary. Project only the columns the model needs (SELECT id, status, total_amount FROM …, neverSELECT *). Apply the same row filters you would apply in a database VIEW intended for an external consumer (WHERE deleted_at IS NULL,WHERE tenant_id = …). - Don’t put two trust domains in the same project. Separate customers, separate environments (prod vs staging), or separate sensitivity tiers go into separate archmax projects so they don’t share the DuckDB instance, the persistent file, or the connection set.
- Rotate the upstream database password when scope changes. Removing a dataset’s
view_querydoesn’t revoke any database privilege — only deactivating or deleting the connection does. If a model previously had access to a sensitive column and you tightened the body, also rotate the upstream credential. - Lock down
ARCHMAX_DATA_DIRwith filesystem-level permissions; treat each project’sduckdb.dbas sensitive.
Connection Lifecycle
Section titled “Connection Lifecycle”- Active connections are attached to DuckDB and available for querying
- Inactive connections are detached from DuckDB but their configuration is preserved
- Deleted connections are soft-deleted and fully detached
Each project’s DuckDB state lives in a persistent duckdb.db file under <ARCHMAX_DATA_DIR>/projects/<projectId>/. The instance is opened lazily on first use and closed on graceful shutdown so the file lock is released cleanly. Materialised VIEWs survive restarts because they are stored in that file. The file MUST NOT be committed to a project’s Git repo (it is in the default project .gitignore) and it is safe to delete when the project is offline.
Persistent DuckDB File (Operations)
Section titled “Persistent DuckDB File (Operations)”Each project’s DuckDB instance is backed by a single file at <ARCHMAX_DATA_DIR>/projects/<projectId>/duckdb.db (plus its .wal and .tmp siblings during writes). Materialised VIEWs survive process restarts because they are stored in this file.
Operational notes:
- Where it lives —
<ARCHMAX_DATA_DIR>/projects/<projectId>/duckdb.db.ARCHMAX_DATA_DIRdefaults to./datain development. - Resetting it — call
POST /api/projects/<projectId>/connections/reinit?reset=true. The API disposes the cached instance, deletes the file, and rebuilds from scratch on the next query. Without?reset=true, the file is preserved acrossreinit. - Backups — the file is a regular DuckDB database. Stop the process, copy the file (and its
.wal/.tmpsiblings if present), and restart. There is no online-backup contract. - Git —
duckdb.db,duckdb.db.wal, andduckdb.db.tmpare excluded from every project’s.gitignoreand from the repository-level.gitignore. Do not commit them — they leak transient state, are noisy, and may contain catalog metadata. - Graceful shutdown — the API closes every cached instance on SIGTERM/SIGINT before exit so the file lock is released cleanly. Force-killing the process may leave the lock held; the next start opens the file fine, but you may briefly see a recovery message in the logs.
- Disk pressure — the file grows roughly with the count and complexity of attached catalogs and materialised VIEWs. There is no automatic compaction; reset via
?reset=trueif a project’s file grows unexpectedly large.
Refreshing Schemas
Section titled “Refreshing Schemas”DuckDB reads upstream schema information at the time each connection is attached and keeps serving that snapshot from the cached instance. If a table is added, dropped, or altered in the underlying database, those changes are not visible until the project’s DuckDB instance is rebuilt.
Use the Re-explore schemas button in the Data Sources page header to force a refresh. It disposes the cached project DuckDB instance, re-attaches every active connection from scratch, and probes the result so the data browser, semantic-model agent, and MCP tools immediately see the current schema. The button reports the total number of tables visible after the refresh; if a connection is unreachable, an error toast surfaces the underlying message.