Skip to content

Data Federation

archmax uses DuckDB as an in-process query engine that can federate queries across multiple database connections. Each project gets its own DuckDB instance with connections attached as named schemas.

When you add a connection to a project and mark it as active, archmax attaches it to the project’s DuckDB instance using DuckDB’s extension system:

Database TypeDuckDB ExtensionRegistry
PostgreSQLpostgresCore
MySQLmysqlCore
SQL ServermssqlCommunity
SQLitesqliteCore
DuckDBNative-
Iceberg REST Catalogiceberg + httpfsCore

Each connection’s slug becomes the DuckDB schema alias. For example, a connection with slug shopify attached from a Postgres database makes its tables available as shopify.<schema>.<table>.

The SQL Server connector uses the DuckDB MSSQL community extension, which is installed automatically from the community registry.

When using Connection Details (host/port fields), archmax builds an ADO.NET connection string:

Server=host,port;Database=db;User Id=user;Password=pass;Encrypt=yes

When using Connection URI, you can pass any format the extension accepts (ADO.NET or URI) directly.

The Encrypt connection (TLS) toggle controls the Encrypt parameter and defaults to on. Disable it only for local development servers that don’t support TLS.

Iceberg connections let you query data in Apache Iceberg tables exposed through an Iceberg REST Catalog (Lakekeeper, Polaris, Cloudflare R2, etc.).

Unlike RDBMS connections, Iceberg uses a two-step attach: archmax creates a DuckDB secret with the bearer token, then attaches the catalog with the REST endpoint.

Required fields:

FieldDescription
Catalog EndpointURL of the Iceberg REST Catalog (e.g. https://catalog.example.com)
WarehouseWarehouse identifier registered in the catalog
Bearer TokenAuthentication token for the catalog

Once attached, Iceberg namespaces and tables are available as <slug>.<namespace>.<table> and can be joined with any other connection type.

Limitations:

  • Read-only access (no INSERT/UPDATE/DELETE)
  • Bearer token auth only (OAuth2 support planned)
  • Backed by S3 or GCS object storage (other backends not yet supported by DuckDB)

Once multiple connections are attached, you can query across them in a single SQL statement:

SELECT
s.order_id,
s.total_amount,
d.invoice_number
FROM shopify.public.orders s
JOIN datev.dbo.invoices d
ON s.order_id = d.external_order_id
WHERE s.created_at > '2024-01-01'

This is used internally by:

  • The Data Browser for exploring connection schemas
  • The Semantic Model Agent for discovering and mapping tables
  • The MCP execute_query tool for AI agent queries (through scoped VIEWs)

When AI agents use execute_query or runModelQuery, they don’t access raw connection schemas directly. Instead, each semantic model gets a set of VIEWs whose SELECT bodies are authored explicitly by the modeller in each dataset’s view_query custom extension. This is the curated access layer.

Per-model VIEWs are (re-)materialised on every model-scoped query with CREATE OR REPLACE VIEW; there is no in-memory cache. Editing a dataset’s view_query takes effect on the very next call. Agents reference datasets by their bare name (e.g. FROM "orders") — the platform resolves the name to the right VIEW automatically. A dataset without view_query is unqueryable, and the tools return a clear error identifying the offending datasets.

Bare-name references are enforced structurally. Every query passes through a validator that parses the SQL with DuckDB’s own parser and rejects any base-table reference whose AST has a non-empty schema_name or catalog_name, regardless of how the reference is spelled in source text — quoted ("shopify"."public"."orders"), unicode-escaped, or dollar-quoted variants are all rejected on the parse tree, not on the raw text. See the execute_query reference for the full denylist.

These properties hold whenever the validator and view_query materialisation are working correctly:

  • No raw catalog escape from MCP. An execute_query caller cannot reach attached connections directly — any catalog.schema.table reference is rejected on the AST regardless of quoting variant ("shopify"."public"."orders", U&"…", dollar-quoted, mixed-case).
  • No system-catalog enumeration. information_schema, pg_catalog, sqlite_master, main, temp, and system are denied at the validator. So is anything beginning with _scope_ or duckdb_ (covers cross-model schema references and DuckDB metadata views like duckdb_secrets).
  • No file or network egress. enable_external_access = false blocks read_csv / read_parquet / pg_read_file / COPY / HTTP fetches at the engine level; the validator denylist covers the same surface at parse time.
  • No DDL or mutation. Only SELECT / WITH / set-operation / EXPLAIN / DESCRIBE statements pass; EXPLAIN ANALYZE (and comment-evasion variants) is rejected because it executes the wrapped statement.
  • No stale view exposure. A view_query that fails to rematerialise (validator rejection or DuckDB error) makes the model unqueryable until fixed; MCP refuses to fall back to the previous view body.

What the view layer does NOT protect against

Section titled “What the view layer does NOT protect against”
  • Modeller mistakes. A view_query of SELECT * FROM shop.public.users exposes every column of users, including PII, password hashes, and audit columns. The platform never inspects view bodies for sensitive content, never auto-redacts columns, and never enforces row-level filters. Treat each view_query as an authorisation contract you have to write by hand.
  • Co-tenancy across MCP tokens scoped to the same model. Two MCP bearer tokens granting access to model ecommerce see exactly the same rows. There is no per-token row filter, no current_user-aware view body, no automatic tenant-id injection. If you need per-tenant isolation, model the tenant column explicitly in view_query and split tenants across separate archmax projects (or build a tenant-aware view_query that reads the agent identity from params).
  • Co-tenancy across models in the same project. Every model in a project shares one DuckDB instance, one duckdb.db file, and the same set of attached connections with the same database credentials. There is no per-model database role. A connection user with DELETE rights upstream gives DuckDB latent DELETE rights for every model in the project — only enable_external_access = false and the read-only validator keep that latent power from being used.
  • DuckDB / extension privilege-escalation bugs. The validator parses with DuckDB’s own parser and runs queries on the same process that holds every attached catalog. A binder, optimizer, or extension bug that lets a structurally-valid SELECT reach an unintended object (or trigger arbitrary code via a malicious community extension) is not contained by the view layer. Keep DuckDB and its extensions up to date.
  • Resource exhaustion across the project. threads = 2 and memory_limit = '512MB' are session settings on a shared instance. A pathological query under one model can still cause memory pressure for queries running under another model in the same project, and the per-project withProjectQuerySlot semaphore caps concurrency but does not isolate work.
  • Side-channels through the persistent file. <ARCHMAX_DATA_DIR>/projects/<projectId>/duckdb.db holds every model’s _scope_<model> schema in plain DDL. Anyone who can read the file can read every materialised view body. Protect the data directory with the same filesystem permissions you would give a database data dir.
  • Upstream credential leakage. Connection passwords / tokens are encrypted at rest and redacted in API responses, but they are still passed to DuckDB’s connector for every ATTACH. A query that triggers a driver-level error does not leak them to the API caller (the connection-test endpoint sanitises and the federated query path returns generic errors), but anything that gets into operator logs is your responsibility to retain or rotate.
  • Use a least-privilege database user for every connection — read-only on exactly the schemas your view_query bodies need. The validator is defence-in-depth on top of this, not a replacement for it.
  • Treat view_query as the authorisation boundary. Project only the columns the model needs (SELECT id, status, total_amount FROM …, never SELECT *). Apply the same row filters you would apply in a database VIEW intended for an external consumer (WHERE deleted_at IS NULL, WHERE tenant_id = …).
  • Don’t put two trust domains in the same project. Separate customers, separate environments (prod vs staging), or separate sensitivity tiers go into separate archmax projects so they don’t share the DuckDB instance, the persistent file, or the connection set.
  • Rotate the upstream database password when scope changes. Removing a dataset’s view_query doesn’t revoke any database privilege — only deactivating or deleting the connection does. If a model previously had access to a sensitive column and you tightened the body, also rotate the upstream credential.
  • Lock down ARCHMAX_DATA_DIR with filesystem-level permissions; treat each project’s duckdb.db as sensitive.
  • Active connections are attached to DuckDB and available for querying
  • Inactive connections are detached from DuckDB but their configuration is preserved
  • Deleted connections are soft-deleted and fully detached

Each project’s DuckDB state lives in a persistent duckdb.db file under <ARCHMAX_DATA_DIR>/projects/<projectId>/. The instance is opened lazily on first use and closed on graceful shutdown so the file lock is released cleanly. Materialised VIEWs survive restarts because they are stored in that file. The file MUST NOT be committed to a project’s Git repo (it is in the default project .gitignore) and it is safe to delete when the project is offline.

Each project’s DuckDB instance is backed by a single file at <ARCHMAX_DATA_DIR>/projects/<projectId>/duckdb.db (plus its .wal and .tmp siblings during writes). Materialised VIEWs survive process restarts because they are stored in this file.

Operational notes:

  • Where it lives<ARCHMAX_DATA_DIR>/projects/<projectId>/duckdb.db. ARCHMAX_DATA_DIR defaults to ./data in development.
  • Resetting it — call POST /api/projects/<projectId>/connections/reinit?reset=true. The API disposes the cached instance, deletes the file, and rebuilds from scratch on the next query. Without ?reset=true, the file is preserved across reinit.
  • Backups — the file is a regular DuckDB database. Stop the process, copy the file (and its .wal/.tmp siblings if present), and restart. There is no online-backup contract.
  • Gitduckdb.db, duckdb.db.wal, and duckdb.db.tmp are excluded from every project’s .gitignore and from the repository-level .gitignore. Do not commit them — they leak transient state, are noisy, and may contain catalog metadata.
  • Graceful shutdown — the API closes every cached instance on SIGTERM/SIGINT before exit so the file lock is released cleanly. Force-killing the process may leave the lock held; the next start opens the file fine, but you may briefly see a recovery message in the logs.
  • Disk pressure — the file grows roughly with the count and complexity of attached catalogs and materialised VIEWs. There is no automatic compaction; reset via ?reset=true if a project’s file grows unexpectedly large.

DuckDB reads upstream schema information at the time each connection is attached and keeps serving that snapshot from the cached instance. If a table is added, dropped, or altered in the underlying database, those changes are not visible until the project’s DuckDB instance is rebuilt.

Use the Re-explore schemas button in the Data Sources page header to force a refresh. It disposes the cached project DuckDB instance, re-attaches every active connection from scratch, and probes the result so the data browser, semantic-model agent, and MCP tools immediately see the current schema. The button reports the total number of tables visible after the refresh; if a connection is unreachable, an error toast surfaces the underlying message.