How to Compact Data Before Feeding It to AI Coding Assistants to Save Tokens
Problem
The number one source of token waste in AI coding assistants is raw data. A 1MB CSV log costs roughly 250K tokens to process, at about 4 characters per token. A 10-line summary costs about 200 tokens. In a session where I feed the AI 50+ such data reads, that difference compounds into millions of tokens.
I found this out the hard way after wondering why my token usage was so high despite doing the same amount of work. The AI didn’t need the whole haystack — I should have given it the needle map.
The core principle
Replace full raw data with compact “working views” before the AI reads it. The AI charges by token for everything it sees. If you give it noise, you pay for noise.
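To get a feel for the difference before sending anything, a rough estimate is enough. Here is a minimal sketch; the 4-characters-per-token ratio is an assumed rule of thumb rather than a real tokenizer, and the function names and sample data are made up for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic, not a tokenizer."""
    return round(len(text) / chars_per_token)


def savings(raw: str, view: str) -> int:
    """Estimated tokens saved by sending the working view instead of the raw data."""
    return estimate_tokens(raw) - estimate_tokens(view)


if __name__ == "__main__":
    # Hypothetical data: a repetitive raw log versus a one-line working view.
    raw = "INFO heartbeat ok\n" * 10_000
    view = "10000 x INFO heartbeat ok\n"
    print("raw:", estimate_tokens(raw), "view:", estimate_tokens(view))
    print("saved:", savings(raw, view))
```

Actual counts depend on the model's tokenizer, so treat the result as an order-of-magnitude check only.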
How to compact each data type
I wrote small preprocessing scripts for each data type. Here’s what works.
Logs
Filter by timestamp or keyword, deduplicate consecutive lines, extract only ERROR/WARN levels:
grep -i "error\|exception\|fail" huge-app.log | sort | uniq -c | sort -rn | head -50
This single line turns a 50MB log file into 50 lines of ranked errors.
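One catch with the shell version is that lines differing only by timestamp or request id never collapse into one entry. A possible Python variant normalizes those parts before counting. This is a sketch under assumptions: the timestamp pattern (ISO-like) and the `<ts>` / `<n>` placeholders are my own choices here, so adjust them to your log format.

```python
import re
from collections import Counter

# Assumed patterns: ISO-like timestamps and any remaining digit runs (ids, durations).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?")
NUMBER = re.compile(r"\d+")
KEYWORDS = re.compile(r"error|exception|fail", re.IGNORECASE)


def compact_log(lines, top=50):
    """Return (count, normalized_line) pairs for the most frequent matching lines."""
    counts = Counter()
    for line in lines:
        if not KEYWORDS.search(line):
            continue  # skip lines that mention none of the keywords
        normalized = TIMESTAMP.sub("<ts>", line.strip())
        normalized = NUMBER.sub("<n>", normalized)
        counts[normalized] += 1
    return [(n, msg) for msg, n in counts.most_common(top)]


if __name__ == "__main__":
    import sys

    # Usage: python compact_log.py huge-app.log
    with open(sys.argv[1], errors="replace") as f:
        for n, msg in compact_log(f):
            print(f"{n:6d} {msg}")
```

Replacing every digit run is deliberately aggressive. If the numbers matter, such as HTTP status codes, narrow the `NUMBER` pattern.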
JSON
Use jq or a short Python script to select only the relevant fields. Flatten nested structures. Limit array length:
import json, sys

# Read JSON from the file given on the command line, or from stdin if none is given
if len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        data = json.load(f)
else:
    data = json.load(sys.stdin)

# Keep only id, status, and error from the first 20 items
if isinstance(data, list):
    slim = [{k: item[k] for k in ("id", "status", "error") if k in item} for item in data[:20]]
    print(json.dumps(slim, indent=2))
This extracts only id, status, and error from the first 20 items, ignoring all other fields.
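The snippet above covers field selection and array limits. For the "flatten nested structures" part, something like the following could work. It is a sketch: the dotted-path key format, the `max_items` default, and the `_omitted` marker are conventions I picked for illustration.

```python
import json


def flatten(obj, prefix="", max_items=3):
    """Flatten nested dicts/lists into {dotted.path: scalar}.

    Lists are truncated to max_items; the number of dropped elements is
    recorded under '<path>._omitted' so the reader knows data was cut.
    """
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}.", max_items))
    elif isinstance(obj, list):
        for index, value in enumerate(obj[:max_items]):
            flat.update(flatten(value, f"{prefix}{index}.", max_items))
        if len(obj) > max_items:
            flat[f"{prefix}_omitted"] = len(obj) - max_items
    else:
        flat[prefix[:-1]] = obj  # drop the trailing dot from the path
    return flat


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = {"id": 1, "user": {"name": "a"}, "tags": [1, 2, 3, 4, 5]}
    print(json.dumps(flatten(record), indent=2))
```

Flat key-value pairs are also easy to grep or diff afterwards, which helps when you compare two large payloads.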
CSVs
Show row count, column names, and a few sample rows instead of the full table:
echo "Rows: $(wc -l < data.csv)" && head -1 data.csv && echo "..." && tail -3 data.csv
This gives the AI enough context to understand the data structure without reading 10,000 rows.
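If you prefer a version that parses the CSV properly (quoted fields, embedded newlines), the same summary can be sketched with Python's csv module. The function name, the `sample` parameter, and the output layout are illustrative choices.

```python
import csv
import io
from collections import deque


def summarize_csv(stream, sample=3):
    """Summarize a CSV: column names, data-row count, first and last few rows."""
    reader = csv.reader(stream)
    header = next(reader, [])
    head = []
    tail = deque(maxlen=sample)  # keeps only the last `sample` rows seen
    count = 0
    for row in reader:
        count += 1
        if len(head) < sample:
            head.append(row)
        else:
            tail.append(row)
    lines = [f"Columns ({len(header)}): {', '.join(header)}", f"Data rows: {count}"]
    # Rows are re-joined with plain commas; quoting is not preserved in the preview.
    lines += [",".join(r) for r in head]
    omitted = count - len(head) - len(tail)
    if omitted > 0:
        lines.append(f"... ({omitted} rows omitted)")
    lines += [",".join(r) for r in tail]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical in-memory data; for a real file use open(path, newline="").
    data = "id,status\n" + "".join(f"{i},ok\n" for i in range(10))
    print(summarize_csv(io.StringIO(data), sample=2))
```

Unlike `wc -l`, this counts data rows only, so the header is not included in the total.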
File trees
Summarize by directory and file count instead of listing every file:
find src -type f | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -30
Source code
Extract function names, class names, and imports instead of reading entire files:
grep -n "^def \|^class \|^import \|^from " src/app.py | head -50
Why this matters
Token costs are linear with input size. A raw find . -type f output listing 8000 files costs roughly 10K tokens. A summarized tree with directory counts costs about 200 tokens. Over 100 such reads in a day, that’s 1M tokens saved on a single technique.
This single strategy contributed significantly to cutting my usage from 138M to 20M tokens per day.
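For the file tree case, the directory-count summary can also be produced without shell tools. The sketch below roughly mirrors the find pipeline from the File trees section using os.walk; the function name and the `top` default are illustrative.

```python
import os
from collections import Counter


def summarize_tree(root, top=30):
    """Count files per directory under root; return (directory, count), largest first."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            # Paths are reported relative to root to keep the output short.
            counts[os.path.relpath(dirpath, root)] += len(filenames)
    return counts.most_common(top)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_tree.py src
    for directory, n in summarize_tree(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{n:6d} {directory}")
```

Each directory counts only the files directly inside it, the same as the sed-based version.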
Common mistakes
- Feeding full error stack traces without filtering unique frames
- Letting the AI read entire source files when you only need one function
- Passing raw find . -type f output listing 5000+ files
- Not deduplicating repeated log entries before analysis
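For the second mistake, reading an entire source file when only one function matters, here is one way to pull out a single definition from a Python file. It is a sketch using the standard ast module (Python 3.8+ for `ast.get_source_segment`); the helper name `extract_function` is hypothetical.

```python
import ast


def extract_function(source: str, name: str):
    """Return the source text of the first function or class named `name`, or None.

    Note: decorators above the definition are not included in the returned text.
    """
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if isinstance(node, wanted) and node.name == name:
            return ast.get_source_segment(source, node)
    return None


if __name__ == "__main__":
    import sys

    # Usage: python extract_function.py src/app.py some_function
    with open(sys.argv[1]) as f:
        print(extract_function(f.read(), sys.argv[2]) or "not found")
```

This only handles Python sources. For other languages, a grep with a few lines of context is a cruder but workable substitute.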
Summary
In this post, I showed how to compact data before feeding it to AI coding assistants. The key point is that preprocessing data into working views is the single highest-leverage token-saving technique. A 50-line Python helper can save 250K+ tokens per read.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!