
How to Cite PDF Sources in RAG Answers Using Bounding Boxes with OpenDataLoader

Problem

When I built a RAG chatbot for internal documentation, users kept asking the same question: “Where did that answer come from?” My system could only respond with “according to the document” — which is not useful for compliance, audit, or fact-checking.

Researchers, analysts, and legal professionals need to see the exact source. They want to click on an answer and jump to the right page, right paragraph, right table.

Typical RAG answer without citation
Q: What is the approval limit for purchase orders over $50,000?
A: The approval limit is the regional director level.
Source: employee_handbook.pdf

That “Source: employee_handbook.pdf” is useless. Which page? Which paragraph? Users end up searching the PDF manually.

The data structure that makes it possible

According to its FAQ, OpenDataLoader PDF is the only open-source parser that provides bounding boxes for every element by default. Each JSON element looks like this:

JSON element with bounding box
{
  "type": "heading",
  "id": 42,
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "content": "Introduction"
}

The bounding box is [left, bottom, right, top] in PDF points (72pt = 1 inch). The page number is 1-indexed. Together, they uniquely identify exactly where this content appears in the source PDF.
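To make the coordinate convention concrete, here is a small sketch that turns a bounding box into a width and height in inches. The helper name `bbox_size_inches` is my own and not part of OpenDataLoader; the element is the sample from above.

```python
POINTS_PER_INCH = 72.0  # 72pt = 1 inch


def bbox_size_inches(bbox):
    """Return (width, height) in inches for a [left, bottom, right, top] box in PDF points."""
    left, bottom, right, top = bbox
    return ((right - left) / POINTS_PER_INCH, (top - bottom) / POINTS_PER_INCH)


# Sample element using the keys shown in the JSON above.
element = {"page number": 1, "bounding box": [72.0, 700.0, 540.0, 730.0]}
width_in, height_in = bbox_size_inches(element["bounding box"])
print(f"{width_in:.2f} in wide, {height_in:.2f} in tall")  # 6.50 in wide, 0.42 in tall
```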

The workflow

The idea is straightforward: when you chunk the JSON output, store the bounding box and page number as metadata on each chunk. When a chunk is retrieved for an answer, use that metadata to highlight the source location.

Step 1: Extract PDF with bounding boxes

Extract PDF to JSON
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json"
)

Step 2: Chunk with citation metadata

Chunk JSON elements with source metadata
import json

# Load the JSON that Step 1 wrote for report.pdf.
with open("output/report.json", encoding="utf-8") as f:
    doc = json.load(f)

chunks = []
for element in doc.get("kids", []):
    # Keep only the element types worth retrieving.
    if element.get("type") in ("paragraph", "heading", "table"):
        chunks.append({
            "text": element.get("content", ""),
            "metadata": {
                "source": doc.get("file name"),
                "page": element.get("page number"),
                "bbox": element.get("bounding box"),
                "type": element.get("type"),
            },
        })

Each chunk now has source, page, and bbox — everything you need for citation.
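One caveat: the loop above only visits the top-level entries in "kids". If your JSON nests elements deeper, a recursive walk picks those up too. I am assuming here that containers such as lists or tables can hold child elements, so check this against your own output. The helper name `iter_elements` is mine, and the sample document is made up for illustration.

```python
def iter_elements(node):
    """Yield every dict that has a "type" key, depth first, wherever it is nested."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


# Made-up document: a heading plus a container holding one nested paragraph.
doc = {
    "file name": "report.pdf",
    "kids": [
        {"type": "heading", "page number": 1,
         "bounding box": [72.0, 700.0, 540.0, 730.0], "content": "Introduction"},
        {"type": "list", "page number": 1,
         "bounding box": [72.0, 600.0, 540.0, 690.0],
         "kids": [
             {"type": "paragraph", "page number": 1,
              "bounding box": [90.0, 660.0, 540.0, 690.0], "content": "Nested item"},
         ]},
    ],
}

# Only elements that carry text of their own become chunks.
chunks = [
    {
        "text": el["content"],
        "metadata": {
            "source": doc.get("file name"),
            "page": el.get("page number"),
            "bbox": el.get("bounding box"),
            "type": el.get("type"),
        },
    }
    for el in iter_elements(doc)
    if el.get("content")
]
```

Containers without text of their own are skipped, while their children keep their own page and bbox, so each citation still points at the smallest element that holds the text.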

Step 3: Store in vector database with metadata

When you embed and store these chunks, make sure your vector DB preserves the metadata fields. Most vector databases (Pinecone, Qdrant, Weaviate, Chroma) can store metadata alongside each vector and return it with the search results.
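One practical wrinkle: some vector stores only accept scalar metadata values (strings, numbers, booleans), so a four-number list like bbox may be rejected. Check your database's documentation for its exact rules. A minimal sketch of a workaround is to flatten the box into four scalar fields before storing and rebuild it after retrieval. The helper and field names (`flatten_metadata`, `restore_bbox`, `bbox_left`, and so on) are my own choices, not part of any library.

```python
BBOX_KEYS = ("bbox_left", "bbox_bottom", "bbox_right", "bbox_top")


def flatten_metadata(metadata):
    """Return a copy of chunk metadata where the bbox list becomes four scalar fields."""
    flat = {key: value for key, value in metadata.items() if key != "bbox"}
    bbox = metadata.get("bbox")
    if bbox is not None:
        flat.update(zip(BBOX_KEYS, bbox))
    return flat


def restore_bbox(flat):
    """Rebuild [left, bottom, right, top] from flattened metadata, or None if absent."""
    if not all(key in flat for key in BBOX_KEYS):
        return None
    return [flat[key] for key in BBOX_KEYS]


flat = flatten_metadata({
    "source": "report.pdf",
    "page": 3,
    "bbox": [72.0, 700.0, 540.0, 730.0],
    "type": "paragraph",
})
print(flat["bbox_left"], restore_bbox(flat))
```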

Step 4: Render citations in the answer

When a chunk is retrieved for an answer, extract its metadata:

Source citation format
Source: report.pdf, Page 3, Position (72, 700)
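A citation line in this format can be built directly from the chunk metadata. This is a minimal sketch, assuming the metadata keys from Step 2 (source, page, bbox); the function name `format_citation` is hypothetical.

```python
def format_citation(metadata):
    """Build a one-line citation from chunk metadata, skipping fields that are missing."""
    parts = [f"Source: {metadata.get('source') or 'unknown source'}"]
    page = metadata.get("page")
    if page is not None:
        parts.append(f"Page {page}")
    bbox = metadata.get("bbox")
    if bbox:
        left, bottom = bbox[0], bbox[1]
        # :g drops the trailing ".0" so 72.0 prints as 72.
        parts.append(f"Position ({left:g}, {bottom:g})")
    return ", ".join(parts)


print(format_citation({
    "source": "report.pdf",
    "page": 3,
    "bbox": [72.0, 700.0, 540.0, 730.0],
}))
```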

For a “click to source” UX, render the PDF with a highlight overlay:

RAG answer with clickable citation
Q: What was the total revenue in Q3 2025?
A: The total revenue was $4.2M, driven by growth in the APAC region.
📄 Source: report.pdf — Page 3 (click to view highlighted)

When the user clicks, open the PDF at page 3 and draw a rectangle overlay at the bounding box coordinates to highlight the exact paragraph.

Highlight overlay logic
PDF viewer opens at page 3
Draw highlight rect at:
left=72pt, bottom=700pt, right=540pt, top=730pt
This highlights the paragraph the answer came from.
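Most browser-based viewers draw from the top-left corner with y growing downward, while a [left, bottom, right, top] box measures y upward from the bottom of the page. The sketch below converts between the two. It assumes an unrotated page whose origin is the bottom-left corner, and the page height (792pt, US Letter) and the scale factor are example values you would read from your own viewer. The function name `pdf_bbox_to_viewer_rect` is hypothetical.

```python
def pdf_bbox_to_viewer_rect(bbox, page_height_pt, scale=1.0):
    """Convert a [left, bottom, right, top] box in PDF points to a top-left-origin rect.

    scale is the viewer's pixels-per-point zoom factor.
    """
    left, bottom, right, top = bbox
    return {
        "x": left * scale,
        "y": (page_height_pt - top) * scale,  # flip the y axis
        "width": (right - left) * scale,
        "height": (top - bottom) * scale,
    }


# Example values only: US Letter page height, 2x zoom.
rect = pdf_bbox_to_viewer_rect([72.0, 700.0, 540.0, 730.0], page_height_pt=792.0, scale=2.0)
print(rect)
```

The resulting x, y, width, and height can be passed to whatever overlay mechanism your viewer offers, for example an absolutely positioned element on top of the rendered page.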

The full pipeline

PDF to clickable citation pipeline
PDF file
OpenDataLoader convert(format="json")
JSON with elements → each element has type, content, page number, bounding box
Chunk by element → store bbox + page as metadata
Embed chunks → store in vector DB with metadata
User asks question → retrieve chunks → extract metadata
Render answer with "Source: filename.pdf, Page N" + clickable PDF highlight
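The retrieval end of the pipeline can be sketched without any external service. In this toy version a word-overlap score stands in for real embedding similarity, and the two chunks are made-up sample data; the function names `score` and `retrieve` are mine. The point is only that the metadata travels with the chunk and is available when the answer is rendered.

```python
def score(query, text):
    """Toy relevance score: count of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, chunks, k=1):
    """Return the k chunks that score highest against the query."""
    return sorted(chunks, key=lambda c: score(query, c["text"]), reverse=True)[:k]


# Made-up chunks in the shape produced by Step 2.
chunks = [
    {"text": "Total revenue in Q3 2025 was $4.2M.",
     "metadata": {"source": "report.pdf", "page": 3, "bbox": [72.0, 700.0, 540.0, 730.0]}},
    {"text": "Headcount grew by 12 people.",
     "metadata": {"source": "report.pdf", "page": 5, "bbox": [72.0, 400.0, 540.0, 430.0]}},
]

top = retrieve("What was the total revenue in Q3 2025?", chunks)[0]
meta = top["metadata"]
citation = f"Source: {meta['source']}, Page {meta['page']}"
print(citation, meta["bbox"])
```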

Why OpenDataLoader is unique here

“No other open-source parser provides bounding boxes for every element by default.” This is a direct quote from the OpenDataLoader FAQ.

  • Docling — outputs Markdown/JSON without coordinates
  • Marker — no bounding box output
  • PyMuPDF4LLM — no element-level coordinates
  • Unstructured.io — partial bounding box support, not for every element

Without bounding boxes, you can only cite at the page level (“See page 3”). With bounding boxes, you can cite at the element level (“See the third paragraph on page 3 — right here”).

Summary

In this post, I showed how to implement “click to source” citations in RAG answers using OpenDataLoader PDF’s bounding boxes. The key point is storing each element’s page number and bounding box as chunk metadata, then using those coordinates to highlight the exact source location in the PDF viewer.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
