Aegis

Open, audit-grade agentic data quality framework — LLM-powered diagnosis, full audit trail, runs everywhere.

Aegis orchestrates a 5-node LangGraph pipeline that validates your data, diagnoses failures with an LLM, traces root causes through your lineage graph, and logs every decision to a searchable audit trail — all from a single YAML file.

Get Started in 5 minutes View on GitHub

See it in action

╭──────────────────────────────────────────────────────╮
│ Aegis DQ  —  RetailCo E-commerce Demo                │
│ LLM: amazon.nova-pro-v1:0 via AWS Bedrock            │
╰──────────────────────────────────────────────────────╯

✓ Pipeline complete in 7.1s · 12 rules · $0.0056 LLM cost

╭──────────────── Validation Summary ─────────────────╮
│  Rules checked  │  12                               │
│  Passed         │  1   │  Failed  │  11             │
│  Pass rate      │  8%  │  Cost    │  $0.005576      │
╰─────────────────────────────────────────────────────╯

Failures by Severity
  ● CRITICAL (6)  customers_email_not_null · orders_amount_positive
                  orders_customer_fk · payments_order_fk
                  products_price_positive · products_sku_unique
  ● HIGH     (4)  customers_email_not_empty · orders_date_order
                  orders_status_valid · products_stock_non_negative
  ● MEDIUM   (1)  customers_tier_accepted

LLM Diagnoses
  orders_customer_fk  →  Order placed with customer_id=99 that does not exist.
                         Likely cause: customer deleted or test record not cleaned up.

Remediation SQL (LLM-generated)
  orders_status_valid          UPDATE orders SET status = 'SHIPPED' WHERE status = 'DISPATCHED';
  products_price_positive      UPDATE products SET price = ABS(price) WHERE price < 0;
  products_stock_non_negative  UPDATE products SET stock_quantity = 0 WHERE stock_quantity < 0;

Key features

Feature	Detail
5-node pipeline	plan → parallel_table → reconcile → remediate → report (tables run concurrently)
31 rule types	completeness, uniqueness, validity, referential, statistical, timeliness, volume, ML anomaly
6 warehouse adapters	DuckDB, Postgres/Redshift, BigQuery, Databricks, Athena, Snowflake
4 LLM providers	Anthropic Claude, OpenAI, Ollama (local/offline), AWS Bedrock
SQL verification	3-stage pipeline — syntax, schema-aware, dry-run — with LLM self-correction
Rule versioning	`version`, `status` (draft/active/deprecated), `generated_by` on every rule
LLM rule generation	`aegis generate TABLE --db path --kb policy.md` — schema-aware structural rules + business validation rules from a KB document
Full audit trail	Every LLM call and decision logged to SQLite with FTS5 search
GitHub Action	CI/CD gate — fails the job when rules fail, outputs pass-rate and report JSON
MCP server	Use Aegis as a Claude tool — run checks from Claude Desktop
Fine-tuning export	`aegis audit export-dataset` dumps ShareGPT JSONL for model training
Apache 2.0	Fully open source, self-hosted, no SaaS required

How it compares

	Aegis	Great Expectations	Soda Core	Monte Carlo	dbt tests
License	Apache 2.0	Apache 2.0	Apache 2.0	Commercial	Apache 2.0
Self-hosted	✅	✅	✅	❌	✅
LLM-powered diagnosis	✅	❌	❌	Partial	❌
Root cause analysis	✅	❌	❌	✅	❌
SQL auto-fix proposals	✅	❌	❌	❌	❌
LLM rule generation	✅	❌	❌	❌	❌
ML anomaly detection	✅	❌	❌	✅	❌
Audit trail	✅	Partial	Partial	✅	❌
Local LLM (Ollama)	✅	❌	❌	❌	❌
AWS Bedrock	✅	❌	❌	❌	❌
GitHub Action	✅	❌	Partial	❌	✅
Fine-tuning export	✅	❌	❌	❌	❌
MCP server	✅	❌	❌	❌	❌

Full comparison →

Architecture

rules.yaml
    │
    ▼
  plan ──► parallel_table ──► reconcile ──► remediate ──► report
                 │
         ┌──────────────────┐
         │  per table:      │
         │  execute         │
         │  classify        │
         │  diagnose        │
         │  rca             │
         └──────────────────┘

Adapters

LLM adapters:        Anthropic  •  OpenAI  •  Ollama (local)  •  AWS Bedrock
Warehouse adapters:  DuckDB  •  Postgres/Redshift  •  BigQuery  •  Databricks  •  Athena  •  Snowflake

Full architecture docs →

Install

pip install aegis-dq

Then follow the 5-minute quickstart →