
How to Compact Data Before Feeding to AI Coding Assistants to Save Tokens

Problem

The number one source of token waste in AI coding assistants is raw data. At roughly 4 characters per token, a 1MB CSV log costs about 250K tokens to process. A 10-line summary costs about 200 tokens. In a session where I feed the AI 50+ such data reads, that difference compounds into millions of tokens.

I found this out the hard way after wondering why my token usage was so high even though the amount of work hadn’t changed. The AI didn’t need the whole haystack; it needed a map to the needle.

The core principle

Replace full raw data with compact “working views” before the AI reads it. The AI charges by token for everything it sees. If you give it noise, you pay for noise.

How to compact each data type

I wrote small preprocessing scripts for each data type. Here’s what works.

Logs

Filter by keyword or log level, then collapse duplicate lines and rank them by frequency:

compact_logs.sh
# Keep lines that mention errors, collapse exact duplicates, show the 50 most frequent.
grep -iE "error|exception|fail" huge-app.log | sort | uniq -c | sort -rn | head -50

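This single line turns a 50MB log file into at most 50 lines of distinct error messages ranked by frequency. Note that uniq -c only collapses lines that are identical, so if every line starts with a timestamp, strip it first or nothing will merge.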
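
When every log line carries its own timestamp, identical messages never collapse under a plain sort and uniq. Here is a minimal Python sketch of the same filter that removes a leading timestamp before counting. The timestamp pattern, the keyword list, and the function name compact_log are my assumptions for illustration; adjust them to your log format.

```python
import re
from collections import Counter

# Assumed timestamp shape at line start, e.g. "2024-05-01 12:00:00"
# or "2024-05-01T12:00:00.123Z". Change this to match your logs.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?\s*")
KEYWORDS = re.compile(r"error|exception|fail", re.IGNORECASE)


def compact_log(lines, top=50):
    """Return (count, message) pairs for the most frequent matching lines."""
    counts = Counter()
    for line in lines:
        if KEYWORDS.search(line):
            # Drop the leading timestamp so repeated messages collapse together.
            counts[TIMESTAMP.sub("", line.strip())] += 1
    return [(n, msg) for msg, n in counts.most_common(top)]
```

Passing an open file object works because it iterates line by line, so even a large log is never loaded into memory at once.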

JSON

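Select only the relevant fields, flatten nested structures, and limit array length. jq works well for one-off queries; the small Python helper below does the field selection and the array limit: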

compact_json.py
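import json, sys
# Read from the file named on the command line, or from stdin if none is given.
if len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        data = json.load(f)
else:
    data = json.load(sys.stdin)
# Keep only id, status, and error from the first 20 items of a top-level array.
if isinstance(data, list):
    print(json.dumps([{k: item[k] for k in ('id', 'status', 'error')
                      if k in item} for item in data[:20]], indent=2))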

This extracts only id, status, and error from the first 20 items, ignoring all other fields.
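
The flattening step mentioned above can be sketched as a small recursive function. This is one possible approach, not the only one: the dotted key format, the max_items default, and the __truncated__ marker are my own illustrative choices.

```python
def flatten(obj, prefix="", max_items=3):
    """Flatten nested dicts and lists into {"a.b[0]": value}, truncating long lists."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, child, max_items))
    elif isinstance(obj, list):
        for i, value in enumerate(obj[:max_items]):
            flat.update(flatten(value, f"{prefix}[{i}]", max_items))
        if len(obj) > max_items:
            # Record how many items were cut so the reader knows the view is partial.
            flat[f"{prefix}.__truncated__"] = len(obj) - max_items
    else:
        flat[prefix] = obj
    return flat
```

Recording the number of dropped items matters: without it, the AI may assume the truncated list is the whole list.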

CSVs

Show row count, column names, and a few sample rows instead of the full table:

csv_summary.sh
# Row count excludes the header line; assumes no quoted newlines inside fields.
echo "Rows: $(( $(wc -l < data.csv) - 1 ))" && head -1 data.csv && echo "..." && tail -3 data.csv

This gives the AI enough context to understand the data structure without reading 10,000 rows.
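
Line-based tools miscount rows when a quoted field contains a newline. A minimal sketch using Python's csv module avoids that. The function name summarize_csv and the output layout are assumptions of mine, and it samples the first rows instead of the last ones so the file is read in a single pass.

```python
import csv


def summarize_csv(path, sample=3):
    """Return a short text summary: row count, column names, first few rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty file -> no columns
        samples, rows = [], 0
        for row in reader:
            rows += 1
            if len(samples) < sample:
                samples.append(row)
    lines = [f"Rows: {rows}", "Columns: " + ", ".join(header)]
    lines += [", ".join(r) for r in samples]
    return "\n".join(lines)
```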

File trees

Summarize by directory and file count instead of listing every file:

file_tree_summary.sh
find src -type f | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -30
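
The same directory summary can be sketched in Python, which makes it easy to skip directories you never want the AI to see. The function name summarize_tree and the default skip list are my assumptions; note that os.walk also counts symlinks and other non-regular files, so the numbers can differ slightly from find -type f.

```python
import os
from collections import Counter


def summarize_tree(root, top=30, skip=(".git", "node_modules")):
    """Return (file_count, directory) pairs, largest directories first."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in skip]
        if filenames:
            counts[dirpath] = len(filenames)
    return [(n, d) for d, n in counts.most_common(top)]
```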

Source code

Extract function names, class names, and imports instead of reading entire files:

extract_signatures.sh
# Allow leading whitespace so indented methods and nested classes are included.
grep -nE "^[[:space:]]*(async def|def|class|import|from) " src/app.py | head -50
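
A grep pattern can match lines inside docstrings or multi-line strings by accident. For Python sources, a sketch based on the standard ast module avoids that by reading the parsed structure. The function name extract_signatures and the output format are my assumptions, and ast.unparse requires Python 3.9 or newer.

```python
import ast


def extract_signatures(source):
    """List imports, classes, and functions (including methods) with line numbers."""
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            out.append((node.lineno, ast.unparse(node)))
        elif isinstance(node, ast.ClassDef):
            out.append((node.lineno, f"class {node.name}"))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            # Positional parameter names only; defaults and annotations are dropped.
            args = ", ".join(a.arg for a in node.args.args)
            out.append((node.lineno, f"{keyword} {node.name}({args})"))
    # ast.walk is breadth-first, so sort to restore source order.
    return [f"{line}: {text}" for line, text in sorted(out)]
```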

Why this matters

Token cost grows linearly with input size. By my estimate, a raw find . -type f listing of 8,000 files costs 10K tokens or more, while a summarized tree with directory counts costs about 200. Over 100 such reads in a day, that’s close to 1M tokens saved by a single technique.
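
The arithmetic behind that figure can be checked with a short sketch. Both function names are hypothetical, and the 4 characters per token ratio is a common rule of thumb, not the output of a real tokenizer, so treat the results as order-of-magnitude estimates.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; 4 characters per token is a rule of thumb only."""
    return max(1, len(text) // chars_per_token)


def daily_savings(raw_tokens, compact_tokens, reads_per_day):
    """Tokens saved per day by sending the compact view instead of the raw data."""
    return (raw_tokens - compact_tokens) * reads_per_day


# Numbers from this article: 10K raw vs. 200 compact, 100 reads per day.
print(daily_savings(10_000, 200, 100))  # 980000
```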

This strategy was a major contributor to cutting my usage from 138M to 20M tokens per day.

Common mistakes

  • Feeding full error stack traces without filtering unique frames
  • Letting the AI read entire source files when you only need one function
  • Passing raw find . -type f output listing 5000+ files
  • Not deduplicating repeated log entries before analysis
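
For the first mistake in the list, here is a minimal sketch that reduces a stack trace to its distinct frames. It assumes the standard Python traceback format (File "...", line N, in func); the function name unique_frames and the output format are my own, and other languages would need a different pattern.

```python
import re
from collections import Counter

# Matches frame lines in a standard Python traceback.
FRAME = re.compile(r'^\s*File "(?P<file>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def unique_frames(trace_text):
    """Return each distinct frame once, in first-seen order, with its repeat count."""
    counts = Counter()
    for line in trace_text.splitlines():
        m = FRAME.match(line)
        if m:
            counts[f'{m["file"]}:{m["line"]} in {m["func"]}'] += 1
    return [f"{n}x {frame}" for frame, n in counts.items()]
```

This helps most with deep recursion errors, where hundreds of identical frames collapse into one line.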

Summary

In this post, I showed how to compact data before feeding it to AI coding assistants. The key point is that preprocessing data into working views is the single highest-leverage token-saving technique. A 50-line Python helper can save 250K+ tokens per read.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
