Project · Creator & AI Researcher · 2026 · in-progress

GovBench

An open-source LLM evaluation framework for judicial and governmental deployment readiness.

Stack

Python, LLM Evaluation, AI Safety, Sentiment Analysis, Hallucination Detection

Outcomes

6 Pillars (bias, integrity, jurisdiction, …)

4 Production LLMs evaluated

12 Identity variants per scenario

Live GitHub

What it is

An open-source LLM evaluation framework that scores models on their readiness for judicial and governmental deployment. GovBench measures properties that standard benchmarks ignore: demographic bias, jurisdictional awareness, corruption resistance, and minority protection, across the US, India, and EU legal systems.

Key points

Six evaluation pillars: demographic bias, procedural integrity, corruption resistance, jurisdictional awareness, transparency, and minority protection. Each pillar produces an interpretable sub-score that rolls up into a composite deployment-readiness grade.
Demographic isolation testing: each scenario is run with 12 identity variants under naturalistic prompting, measuring inherent bias in bail, sentencing, welfare, and immigration decisions across the US, India, and the EU.
Three evaluation modes: baseline, pressure, and adversarial. Each model is stress-tested in all three to surface failures that only adversarial prompts reveal.
Automated scoring pipeline: sentiment-variance analysis, hallucination detection, and position-drift tracking are combined into a single composite grade, and the pipeline can run unattended on every model release.
Real findings on production models: Claude Sonnet 4.6, Gemini Flash Lite, DeepSeek V4, and GPT-OSS 120B were evaluated. Top performers scored 90%+ overall, while GPT-OSS 120B dropped to 63% on minority protection and DeepSeek V4 to 56.7% on jurisdictional awareness.
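
As a rough illustration of how the pieces above could fit together, here is a minimal Python sketch. It is not GovBench's actual code: the function names, the `{identity}` template placeholder, the variance-to-score mapping, and the equal pillar weights are all assumptions made for the example. It shows one scenario expanded into identity variants, the sentiment variance across those variants turned into a 0-100 bias sub-score, and six pillar sub-scores averaged into a composite.

```python
from statistics import mean, pvariance

# Pillar names taken from the list above; the identifiers themselves are hypothetical.
PILLARS = (
    "demographic_bias",
    "procedural_integrity",
    "corruption_resistance",
    "jurisdictional_awareness",
    "transparency",
    "minority_protection",
)


def build_variants(template: str, identities: list[str]) -> list[str]:
    """Fill one scenario template with each identity; all other wording stays fixed."""
    return [template.format(identity=identity) for identity in identities]


def bias_score(sentiments: list[float]) -> float:
    """Map sentiment variance across variants to a 0-100 score (100 = no variance).

    Assumes each sentiment lies in [-1, 1], so the population variance is at most 1.
    This mapping is an illustrative choice, not the framework's documented formula.
    """
    return 100.0 * (1.0 - min(pvariance(sentiments), 1.0))


def composite(pillar_scores: dict[str, float]) -> float:
    """Unweighted mean of the six pillar sub-scores (equal weights are an assumption)."""
    missing = set(PILLARS) - set(pillar_scores)
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    return mean(pillar_scores[p] for p in PILLARS)


if __name__ == "__main__":
    # Placeholder identity labels and scenario text, for illustration only.
    identities = [f"identity variant {i}" for i in range(1, 13)]
    prompts = build_variants(
        "The applicant ({identity}) has requested bail. What do you recommend?",
        identities,
    )
    print(len(prompts), "prompts generated for one scenario")
```

In a real run, each prompt would be sent to the model under test and the sentiment of each response measured before calling `bias_score`. Using variance keeps the sub-score independent of whether the model is lenient or strict overall, since only differences between identity variants lower it.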

Result

Open-sourced on GitHub as a reusable framework. The evaluation reveals concrete weaknesses in production LLMs that standard benchmarks (MMLU, HellaSwag, etc.) do not measure.
