All notable changes to SMEAGLE will be documented in this file.
The format is based on Keep a Changelog.
- deploy-do.yml workflow with SSH-based deploy/destroy lifecycle, a persistent DO Volume for the model cache, and full-precision OLMo 3.1 32B Think (BF16)
- docker-compose.olmo-32b-think-bf16.yml for allenai/Olmo-3.1-32B-Think at full 65k context on H200
- docker-compose.olmo-32b-fp8.yml for kaitchup/Olmo-3.1-32B-Instruct-fp8-dynamic with FP8 KV cache on L40S
- docker-compose.olmo-32b-8bit.yml for cyankiwi/Olmo-3.1-32B-Think-AWQ-8bit as an alternative
- Replaced cross-encoder/ms-marco-MiniLM-L-12-v2 (33M params) with mixedbread-ai/mxbai-rerank-large-v2 (435M params) for higher-quality passage reranking
- Switched to kaitchup/Olmo-3.1-32B-Instruct-fp8-dynamic (FP8, 20k context, FP8 KV cache) instead of the AWQ 4-bit Think variant. Weave traces showed peak usage at ~16k tokens, so 20k context fits comfortably on an L40S 48GB
- LLM_REPETITION_PENALTY raised from 1.1 to 1.25 to reduce circular chain-of-thought
- /smeagle/{service} log groups with 14-day retention
- [DEPLOY_TIMELINE] lines for each deploy stage
- make update-ip refreshes security group rules across both regions
- corpus/source_perspectives.json maps all 88 sources to perspective categories
- Sweep runner (eval/sweep.py): define a manifest of parameter variations (repetition penalty, top_k, temperature, etc.), run all experiments in one command, and generate a markdown comparison report (see the manifest sketch after this list)
- [S##] references tracked per response in Weave evals
- Replaced httpx with openai.AsyncOpenAI for auto-instrumented token tracking and retries
- LLM_REPETITION_PENALTY (1.1) and LLM_MAX_TOKENS (4,096) curb circular chain-of-thought
- rerank_enabled, max_context_tokens, top_k_*, and llm_model now match deployed values
- <think> tag handling; the skip-RAG path also captures chain-of-thought
- --max-model-len 49152 and MAX_CONTEXT_CHARS 80000 for reasoning headroom
- Cross-encoder reranking with cross-encoder/ms-marco-MiniLM-L-12-v2 (Microsoft Research): reranks the merged vector+lexical candidate pool before final top-K truncation for more precise passage selection (see the reranking sketch after this list)
- top_k_rerank setting: new config parameter (default 50) caps how many candidates enter the cross-encoder, balancing accuracy vs latency
- GET /v1/bundles/{bundle_id}/health returns chunk size distribution, metadata coverage, duplicate detection, and index consistency diagnostics for ingestion tuning (see the request sketch after this list)
- answer_choice field: API responses now include an explicit answer_choice field for A/B/C questions, parsed via dedicated logic independent of LLM output format (see the parsing sketch after this list)
- compare_rag now prints RAG vs No-RAG scoring directly to the terminal after completion (previously only in HTML reports)
- compare_rag auto-loads environment variables from the .env file for API_KEY and API_URL
- feb8withletters-mini.txt (3 questions) for quick smoke tests
- docker-compose.dev.yml mirrors the production stack (Caddy, OAuth, all services) but routes LLM calls to the remote AWS GPU instance via SSM port forwarding; no local GPU required
- make dev / make dev-down / make dev-logs / make dev-tunnel, etc.
- Live reload of api/ and tools/ via volume mounts + uvicorn --reload
- make dev-ingest
- Switched the LLM from Ministral-3-14B-Instruct-2512 (FP8) to Ministral-3-14B-Reasoning-2512 (BF16) for chain-of-thought reasoning. Hypothesis: native [THINK] reasoning should improve worldview-vs-source discrimination where prompt engineering alone could not.
- Context window reduced from 64k to 32k tokens to fit BF16 weights in VRAM
- top_k_final reduced from 20 to 8 to cut noise for the 14B model
- Removed Tags: and Sources: metadata lines from the constitution prompt to eliminate source number collisions with RAG citation labels ([S9], [S10], etc.)
- SmeagleModel in weave_eval.py now logs full server-side config (LLM model, top_k, chunk_size, chunk_overlap, rerank) per eval run for reproducibility
- Embedding model switched from all-MiniLM-L6-v2 (384d, 256 tokens) to nomic-ai/nomic-embed-text-v1.5 (768d, 8,192 tokens) for significantly better retrieval quality; chunks are now fully represented without truncation
- task field (search_query / search_document) for asymmetric retrieval, as required by the nomic model (see the embedding sketch after this list)
- Removed --concurrency and --multi-turn flags (questions now always run sequentially)
- Prompts moved from eval/prompts/ to api/rag/prompts/default.txt to ship with the API code in Docker containers
- compare_rag progress prints now appear immediately in all terminal environments
- docker-compose.local.yml, docker-compose.local-auth.yml, and docker-compose.override.yml: all replaced by docker-compose.dev.yml
- make local, make local-simple, and related targets replaced by the make dev family
- api/gpu_manager.py, admin GPU start/stop/status endpoints, and the frontend GPU control panel: no longer needed since everything runs on a single GPU instance
- deploy.yml
- eval.prompts
- eval.questions
- [S1], [S2] references matching the master spreadsheet, with Google Drive links
- skip_rag and system_prompt API parameters for A/B testing (see the request sketch after this list)
- /v1/config endpoint exposing RAG configuration
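
The sweep manifest format is not documented in this changelog. Below is a minimal sketch, assuming a JSON manifest with one override set per experiment; the file path, field names, and the --manifest/--report flags are illustrative, not the actual interface of eval/sweep.py.

```python
# Illustrative sweep manifest for eval/sweep.py; the real schema and CLI flags may differ.
import json

manifest = {
    "experiments": [
        {"name": "baseline", "overrides": {}},
        {"name": "rep-penalty-1.25", "overrides": {"LLM_REPETITION_PENALTY": 1.25}},
        {"name": "top-k-12", "overrides": {"top_k_final": 12}},
        {"name": "temp-0.2", "overrides": {"LLM_TEMPERATURE": 0.2}},
    ]
}

with open("eval/sweeps/example.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Hypothetical invocation: run every experiment and emit a markdown comparison report.
#   python -m eval.sweep --manifest eval/sweeps/example.json --report sweep_report.md
```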
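The reranking stage described above can be reproduced with the sentence-transformers CrossEncoder API. This is a minimal sketch of that stage, not SMEAGLE's actual retrieval code, and the candidate dict shape is assumed.

```python
# Sketch: rerank the merged vector+lexical candidate pool with a cross-encoder,
# capping the pool at top_k_rerank before scoring and keeping top_k_final passages.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[dict], top_k_rerank: int = 50, top_k_final: int = 8) -> list[dict]:
    pool = candidates[:top_k_rerank]                      # cap cross-encoder input (accuracy vs latency)
    scores = reranker.predict([(query, c["text"]) for c in pool])
    ranked = sorted(zip(pool, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k_final]]           # final top-K truncation
```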
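Pulling the ingestion diagnostics is a plain GET against the new bundle health endpoint. In the sketch below, the bearer-token auth scheme, the example bundle id, and the printed response keys are assumptions; only the path comes from the changelog.

```python
# Sketch: fetch ingestion diagnostics for one bundle. Auth scheme and response keys are assumed.
import os
import httpx

API_URL = os.environ["API_URL"]          # loaded from .env by the eval tooling
API_KEY = os.environ["API_KEY"]
bundle_id = "example-bundle"             # hypothetical id

resp = httpx.get(
    f"{API_URL}/v1/bundles/{bundle_id}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
health = resp.json()
print(health.get("chunk_size_distribution"))
print(health.get("metadata_coverage"))
print(health.get("duplicates"))
```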
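The changelog does not show the answer_choice parsing logic. Below is a minimal sketch of choice extraction that stays independent of the LLM's output format, including stripping <think> chain-of-thought first; the regexes are illustrative, not SMEAGLE's implementation.

```python
# Sketch only: extract an explicit A/B/C answer choice from free-form model output.
import re
from typing import Optional

def parse_answer_choice(text: str) -> Optional[str]:
    # Drop chain-of-thought wrapped in <think> tags before looking for the choice.
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Accept forms like "Answer: B", "choice (B)", or a bare "B" on its own line.
    m = re.search(r"\b(?:answer|choice)\s*[:\-]?\s*\(?([ABC])\)?", visible, re.IGNORECASE)
    if not m:
        m = re.search(r"^\s*\(?([ABC])\)?\s*$", visible, re.MULTILINE)
    return m.group(1).upper() if m else None

print(parse_answer_choice("<think>weighing options...</think>\nAnswer: B"))  # -> "B"
```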
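The task field maps to the text prefixes nomic-embed-text-v1.5 expects for asymmetric retrieval. A minimal sketch with sentence-transformers follows (placeholder texts, not SMEAGLE's ingestion code).

```python
# Sketch: asymmetric retrieval prefixes for nomic-embed-text-v1.5.
# Documents are embedded with "search_document: ", queries with "search_query: ".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

chunks = ["First source passage ...", "Second source passage ..."]   # placeholder texts
question = "What does the source claim about X?"                      # placeholder query

doc_vecs = model.encode(["search_document: " + c for c in chunks])    # ingest time
query_vec = model.encode("search_query: " + question)                 # query time
```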
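For A/B testing, skip_rag and system_prompt can be toggled per request, and /v1/config confirms what the server is actually running. In this sketch the /v1/ask path, the auth scheme, and the payload fields other than skip_rag and system_prompt are assumptions; only /v1/config and the two parameters come from the changelog.

```python
# Sketch: read the deployed RAG config, then ask the same question with retrieval skipped.
import os
import httpx

API_URL = os.environ["API_URL"]
headers = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

config = httpx.get(f"{API_URL}/v1/config", headers=headers, timeout=30).json()
print(config)   # rerank_enabled, max_context_tokens, top_k_*, llm_model, ...

payload = {
    "question": "Which option best matches the source's worldview?",
    "skip_rag": True,                                    # A/B test: bypass retrieval
    "system_prompt": "Answer with a single letter: A, B, or C.",
}
no_rag = httpx.post(f"{API_URL}/v1/ask", headers=headers, json=payload, timeout=120).json()
print(no_rag.get("answer_choice"))
```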