How to Cite PDF Sources in RAG Answers Using Bounding Boxes with OpenDataLoader
Problem
When I built a RAG chatbot for internal documentation, users kept asking the same question: “Where did that answer come from?” My system could only respond with “according to the document” — which is not useful for compliance, audit, or fact-checking.
Researchers, analysts, and legal professionals need to see the exact source. They want to click on an answer and jump to the right page, right paragraph, right table.
    Q: What is the approval limit for purchase orders over $50,000?
    A: The approval limit is the regional director level.
    Source: employee_handbook.pdf

That “Source: employee_handbook.pdf” is useless. Which page? Which paragraph? Users end up searching the PDF manually.
The data structure that makes it possible
OpenDataLoader PDF attaches a bounding box to every element by default; according to its FAQ, it is the only open-source parser that does. Each JSON element looks like this:
    {
      "type": "heading",
      "id": 42,
      "page number": 1,
      "bounding box": [72.0, 700.0, 540.0, 730.0],
      "content": "Introduction"
    }

The bounding box is [left, bottom, right, top] in PDF points (72pt = 1 inch). The page number is 1-indexed. Together, they uniquely identify exactly where this content appears in the source PDF.
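To make the coordinate order concrete, here is a minimal sketch that unpacks one element into a small typed record. The `SourceLocation` class and its property names are my own hypothetical helpers, not part of OpenDataLoader; only the `"page number"` and `"bounding box"` keys come from the JSON shown above.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF points: 72 pt = 1 inch


@dataclass(frozen=True)
class SourceLocation:
    """Where an element sits on a page (hypothetical helper, not a library type)."""
    page: int  # 1-indexed page number
    left: float
    bottom: float
    right: float
    top: float

    @classmethod
    def from_element(cls, element: dict) -> "SourceLocation":
        # The bounding box order is [left, bottom, right, top].
        left, bottom, right, top = element["bounding box"]
        return cls(element["page number"], left, bottom, right, top)

    @property
    def width_pt(self) -> float:
        return self.right - self.left

    @property
    def height_pt(self) -> float:
        return self.top - self.bottom


element = {
    "type": "heading",
    "id": 42,
    "page number": 1,
    "bounding box": [72.0, 700.0, 540.0, 730.0],
    "content": "Introduction",
}
loc = SourceLocation.from_element(element)
print(loc.width_pt, loc.height_pt)     # 468.0 30.0
print(loc.width_pt / POINTS_PER_INCH)  # 6.5 (inches wide)
```

So the sample heading is a 6.5 inch wide, 30 pt tall box starting one inch from the left edge of page 1.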
The workflow
The idea is straightforward: when you chunk the JSON output, store the bounding box and page number as metadata on each chunk. When a chunk is retrieved for an answer, use that metadata to highlight the source location.
Step 1: Extract PDF with bounding boxes
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=["report.pdf"],
        output_dir="output/",
        format="json",
    )

Step 2: Chunk with citation metadata
    import json

    with open("output/report.json") as f:
        doc = json.load(f)

    chunks = []
    for element in doc.get("kids", []):
        if element.get("type") in ("paragraph", "heading", "table"):
            chunks.append({
                "text": element.get("content", ""),
                "metadata": {
                    "source": doc.get("file name"),
                    "page": element.get("page number"),
                    "bbox": element.get("bounding box"),
                    "type": element.get("type"),
                },
            })

Each chunk now has source, page, and bbox: everything you need for citation.
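The loop above only visits top-level elements. If your JSON nests elements (for example, paragraphs inside a container), a recursive walk may be needed. The sketch below assumes that nested children also live under a `"kids"` key, which I have not verified for every element type, so check it against your own output. The helper names `iter_elements` and `build_chunks`, the `"list"` container type, and the sample data are made up for illustration.

```python
from typing import Iterator

CITABLE_TYPES = ("paragraph", "heading", "table")


def iter_elements(node: dict) -> Iterator[dict]:
    """Yield every descendant of `node`, depth-first.

    Assumption: nested elements are stored under a "kids" key at every level.
    """
    for child in node.get("kids", []):
        yield child
        yield from iter_elements(child)


def build_chunks(doc: dict) -> list[dict]:
    """Turn citable elements into chunks that carry page and bbox metadata."""
    chunks = []
    for element in iter_elements(doc):
        text = element.get("content", "")
        # Skip non-citable types and elements with no visible text.
        if element.get("type") in CITABLE_TYPES and text.strip():
            chunks.append({
                "text": text,
                "metadata": {
                    "source": doc.get("file name"),
                    "page": element.get("page number"),
                    "bbox": element.get("bounding box"),
                    "type": element.get("type"),
                },
            })
    return chunks


# Invented sample document, shaped like the element JSON shown earlier.
sample = {
    "file name": "report.pdf",
    "kids": [
        {"type": "heading", "page number": 1,
         "bounding box": [72.0, 700.0, 540.0, 730.0], "content": "Introduction"},
        {"type": "list", "page number": 1,
         "bounding box": [72.0, 600.0, 540.0, 680.0],
         "kids": [
             {"type": "paragraph", "page number": 1,
              "bounding box": [90.0, 600.0, 540.0, 620.0], "content": "Nested item"},
         ]},
        {"type": "paragraph", "page number": 2,
         "bounding box": [72.0, 500.0, 540.0, 520.0], "content": "   "},
    ],
}
chunks = build_chunks(sample)
print([c["text"] for c in chunks])  # ['Introduction', 'Nested item']
```

Skipping whitespace-only elements keeps empty chunks out of the index, where they would otherwise produce citations that point at nothing.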
Step 3: Store in vector database with metadata
When you embed and store these chunks, make sure your vector DB preserves the metadata fields. Most vector databases (Pinecone, Qdrant, Weaviate, Chroma) support metadata filtering.
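One practical wrinkle: some vector stores accept only scalar metadata values (strings, numbers, booleans), so a four-number `bbox` list may be rejected or dropped. A possible workaround, sketched here with hypothetical helper names and field names of my own choosing, is to flatten the box into four scalar fields before storing and rebuild it after retrieval. Check your database's documentation before assuming this is needed.

```python
def flatten_metadata(metadata: dict) -> dict:
    """Replace the "bbox" list with four scalar fields (hypothetical field names)."""
    flat = {key: value for key, value in metadata.items() if key != "bbox"}
    bbox = metadata.get("bbox")
    if bbox is not None:
        left, bottom, right, top = bbox
        flat.update(bbox_left=left, bbox_bottom=bottom, bbox_right=right, bbox_top=top)
    return flat


def restore_bbox(flat: dict) -> list[float]:
    """Rebuild [left, bottom, right, top] from flattened metadata."""
    return [flat["bbox_left"], flat["bbox_bottom"], flat["bbox_right"], flat["bbox_top"]]


meta = {
    "source": "report.pdf",
    "page": 3,
    "bbox": [72.0, 700.0, 540.0, 730.0],
    "type": "paragraph",
}
flat = flatten_metadata(meta)
print(flat)
print(restore_bbox(flat))  # [72.0, 700.0, 540.0, 730.0]
```

Keeping `page` as a plain integer also means you can filter on it later, for example to restrict retrieval to a page range.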
Step 4: Render citations in the answer
When a chunk is retrieved for an answer, extract its metadata:
    Source: report.pdf, Page 3, Position (72, 700)

For a “click to source” UX, render the PDF with a highlight overlay:
    Q: What was the total revenue in Q3 2025?
    A: The total revenue was $4.2M, driven by growth in the APAC region.
    📄 Source: report.pdf, Page 3 (click to view highlighted)

When the user clicks, open the PDF at page 3 and draw a rectangle overlay at the bounding box coordinates to highlight the exact paragraph.
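Most on-screen viewers position overlays from the top-left corner, while the bounding box here is measured from the bottom of the page, so the vertical coordinate usually has to be flipped. The following is a rough sketch of that conversion plus a citation label. The function names, the 792 pt page height (US Letter), and the `scale` factor are my assumptions; a real viewer should supply the actual page height and zoom, and rotated or cropped pages would need extra handling.

```python
def citation_label(meta: dict) -> str:
    """Build the human-readable citation shown under the answer."""
    return f"Source: {meta['source']}, Page {meta['page']}"


def overlay_rect(bbox: list[float], page_height_pt: float, scale: float = 1.0) -> dict:
    """Convert [left, bottom, right, top] in PDF points to a top-left-origin rectangle.

    `scale` is viewer pixels per PDF point (1.0 means one pixel per point).
    """
    left, bottom, right, top = bbox
    return {
        "x": left * scale,
        "y": (page_height_pt - top) * scale,  # flip: distance from the top edge
        "width": (right - left) * scale,
        "height": (top - bottom) * scale,
    }


meta = {
    "source": "report.pdf",
    "page": 3,
    "bbox": [72.0, 700.0, 540.0, 730.0],
    "type": "paragraph",
}
rect = overlay_rect(meta["bbox"], page_height_pt=792.0)  # 792 pt is an assumed page height
print(citation_label(meta))  # Source: report.pdf, Page 3
print(rect)                  # {'x': 72.0, 'y': 62.0, 'width': 468.0, 'height': 30.0}
```

The resulting rectangle can be handed to whatever drawing layer your viewer offers, such as an absolutely positioned element over the rendered page.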
    PDF viewer opens at page 3
    Draw highlight rect at: left=72pt, bottom=700pt, right=540pt, top=730pt

This highlights the paragraph the answer came from.

The full pipeline
    PDF file
        │
        ▼
    OpenDataLoader convert(format="json")
        │
        ▼
    JSON with elements → each element has type, content, page number, bounding box
        │
        ▼
    Chunk by element → store bbox + page as metadata
        │
        ▼
    Embed chunks → store in vector DB with metadata
        │
        ▼
    User asks question → retrieve chunks → extract metadata
        │
        ▼
    Render answer with "Source: filename.pdf, Page N" + clickable PDF highlight

Why OpenDataLoader is unique here
“No other open-source parser provides bounding boxes for every element by default.” This is a direct quote from the OpenDataLoader FAQ.
- Docling — outputs Markdown/JSON without coordinates
- Marker — no bounding box output
- PyMuPDF4LLM — no element-level coordinates
- Unstructured.io — partial bounding box support, not for every element
Without bounding boxes, you can only cite at the page level (“See page 3”). With bounding boxes, you can cite at the element level (“See the third paragraph on page 3 — right here”).
Summary
In this post, I showed how to implement “click to source” citations in RAG answers using OpenDataLoader PDF’s bounding boxes. The key point is storing each element’s page number and bounding box as chunk metadata, then using those coordinates to highlight the exact source location in the PDF viewer.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!