Balancing Cost and Quality in AI Summarization
Exploring how model choice, input caching, and scale impact the cost and quality of AI-generated legislative summaries.

Learning in Practice
This post reflects insights gained while building and optimizing my legislation summarization pipeline — a project that automatically generates legislative summaries across thousands of bills. I used it as a test bed for comparing cost-efficiency and quality across Anthropic and OpenAI models.
Background
Scaling summarization across hundreds or even thousands of legislative documents introduces a real-world tradeoff between cost and quality.
Each summary request can involve thousands of input tokens, and when multiplied across large datasets, the financial impact becomes significant.
That’s where input caching and model right-sizing become critical tools for optimization.
The Experiment
I ran the same piece of legislation — House Resolution 211 — through multiple AI models to evaluate performance, tone, and accuracy relative to cost.
The input (prompt) used for these tests was approximately 1,450 tokens, while my production pipeline’s actual cached input per bill averages ~3,500 tokens.
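Given token counts like these, per-summary cost falls out of simple arithmetic against the provider's per-million-token rates. A minimal sketch of the estimator I use for comparisons — the rates passed in below are illustrative placeholders, not quotes from any provider's price sheet:

```python
def summary_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Estimate the USD cost of one summary.

    Rates are expressed in USD per 1M tokens, matching how
    providers publish their pricing.
    """
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: ~3,500 input tokens and ~500 output tokens at
# hypothetical rates of $0.05 (input) and $0.40 (output) per 1M tokens.
cost = summary_cost(3_500, 500, input_rate=0.05, output_rate=0.40)
print(f"${cost:.5f} per summary")
```

Multiplying the result by the number of bills in the dataset is what turns a fraction of a cent into a budget line worth optimizing.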
Models Tested
- Anthropic Claude Sonnet 4.5
- Anthropic Claude Haiku 3.5
- OpenAI GPT-5 Nano
- OpenAI GPT-5 Mini
- DeepSeek V3.1 (via OpenRouter)
Cost Efficiency Breakdown
| Model | Grade | Cost per summary (uncached → cached) |
|---|---|---|
| Claude Sonnet 4.5 | B | $0.01449 → $0.01368 w/ caching |
| Claude Haiku 3.5 | A | $0.00353 → $0.00333 w/ caching |
| GPT-5 Nano | A+ | $0.00106 → $0.00100 w/ caching |
| GPT-5 Mini | A- | $0.00424 → $0.00398 w/ caching |
| DeepSeek V3.1 | Free | Zero-cost test via OpenRouter |
Note on Input Caching
Input caching drastically reduces the cost of repeated tokens — from $0.05 to $0.005 per 1M input tokens. Across my 3,500+ summarizations, caching saved about 11% of total input cost.
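The savings compound at scale. A rough sketch of the math, under the simplifying assumptions that the cached prefix is a fixed size and every request after the first is a cache hit (real cache behavior varies by provider and TTL):

```python
def caching_savings(n_requests: int, cached_tokens: int,
                    base_rate: float, cached_rate: float) -> float:
    """USD saved by reading `cached_tokens` at the cached rate
    instead of the base rate on every request after the first.

    Rates are USD per 1M input tokens.
    """
    hits = max(n_requests - 1, 0)  # first request populates the cache
    tokens_in_millions = cached_tokens / 1_000_000
    return hits * tokens_in_millions * (base_rate - cached_rate)

# ~3,500 requests with a ~3,500-token cached prefix,
# at $0.05 (uncached) vs $0.005 (cached) per 1M input tokens.
saved = caching_savings(3_500, 3_500, base_rate=0.05, cached_rate=0.005)
print(f"${saved:.2f} saved")
```

The dollar amounts look small per request, which is exactly why the effect is easy to overlook until the request count climbs into the thousands.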
Quality Observations
While cost per summary is an important metric, quality is harder to quantify.
Below are excerpts from each model’s summary of the same bill (see full comparison file in the repository).
📊 Comparative Highlights
- Claude Sonnet 4.5: Strong procedural comprehension, but tends toward formal repetition. Best suited for long-form policy writing.
- Claude Haiku 3.5: Balanced summary quality, though less nuanced on legislative structure.
- GPT-5 Nano: Consistently accurate and clear, matching higher-tier models in legislative context while costing 10x less.
- GPT-5 Mini: Excellent structural explanation at slightly higher cost. A good compromise between quality and runtime.
- DeepSeek V3.1: Summarizes cleanly but lacks nuance. Great for broad analysis or zero-cost experimentation.
Standard Summary
This is a procedural resolution that sets the rules for considering three separate pieces of legislation in the House of Representatives. It establishes the debate parameters and voting procedures for: (1) a joint resolution to overturn an IRS rule requiring brokers to report gross proceeds from digital asset (cryptocurrency) sales, (2) a bill to extend the time period for prosecuting unemployment fraud cases related to COVID-19 relief programs, and (3) a continuing resolution to fund the government through the remainder of fiscal year 2025. Additionally, the resolution includes a provision that affects how Congress counts days when considering whether to terminate a national emergency declared by the President on February 1, 2025.
Key Points
- Allows consideration of H.J. Res. 25 to disapprove an IRS rule on cryptocurrency broker reporting requirements, with one hour of debate controlled by the Ways and Means Committee
- Permits consideration of H.R. 1156 to extend the statute of limitations for prosecuting fraud in pandemic-era unemployment programs (CARES Act related)
- Enables consideration of H.R. 1968, a continuing resolution to maintain government funding through September 30, 2025
- Waives all points of order (procedural objections) against consideration of all three measures
- Suspends the calendar day count for Congressional Review Act purposes regarding a February 1, 2025 presidential national emergency declaration
Impact Areas
- Cryptocurrency industry and digital asset traders (regarding tax reporting requirements)
- Federal law enforcement and prosecutors pursuing pandemic unemployment fraud cases
- Federal government operations and funding continuity
- Congressional oversight of presidential emergency powers
- Legislative procedure and House floor management
Lessons Learned
Takeaways
- GPT-5 Nano offers exceptional value: near enterprise-grade summarization quality at a fraction of the cost.
- Input caching is not just a minor optimization — it’s the difference between scaling feasibly and going bankrupt.
- Model selection should be guided by fit-for-purpose, not by raw model size or hype.
- Anthropic’s Haiku and GPT-5 Nano both excel for summarization tasks, though the Nano’s cost-efficiency gives it a decisive edge.
Personal Insight
After processing over 3,500 legislative summaries, my conclusion is clear: GPT-5 Nano delivers the best overall value — balancing quality, speed, and cost with impressive consistency.
Reflection
Balancing cost and performance in AI pipelines is more than an optimization problem — it’s a mindset shift.
The key isn’t to always use the “best” model, but the most appropriate one.
In production, efficiency and predictability often matter more than perfection.
Next Steps
- Add automated daily cost tracking via OpenAI Usage API
- Benchmark output quality with automated scoring
- Experiment with hybrid caching and reranking