How /last30days Merges the Same Story Across Reddit, X, and YouTube: Cross-Source Cluster Merging

Problem

When a major story breaks, it appears simultaneously across multiple platforms: a Reddit thread, an X post from the artist, a YouTube reaction video, a TikTok clip, and a Hacker News discussion, all covering the same event. A naive multi-source search shows five separate entries for the same story, and I’d have to read each one to realize they’re all the same thing.

The Solution

/last30days v3 uses entity-based overlap detection to merge these into one cluster. When Kanye’s Wireless Festival cancellation was announced on Reddit, discussed on X, and reported by YouTube reaction channels, the clustering engine recognized them as the same story and showed one cluster instead of three separate items.

Here’s the before-and-after:

v2 vs v3 cluster merging

v2 behavior (no clustering), 3 separate items:
  1. [reddit] Kanye Wireless Festival Cancelled (r/hiphopheads, 2.3k upvotes)
  2. [x] @kanyewest: Wireless Festival cancelled, venue issues
  3. [youtube] Kanye West Wireless Festival Cancelled Reaction

v3 behavior (clustering), 1 merged cluster:
  Kanye's Wireless Festival Cancellation
  Sources: Reddit (2.3k upvotes), X (@kanyewest), YouTube (reaction videos)

How It Works

After all sources return results, the pipeline passes them through the cluster module:

  1. Named Entity Extraction: Each result is analyzed for core entities — people, companies, events, locations mentioned
  2. Overlap Detection: Results that share the same core entities are grouped together, even when the titles use completely different language
  3. Score Aggregation: Each cluster gets a combined score from its constituent items
  4. Synthesis: The synthesizer treats each cluster as one story, citing sources from multiple platforms within the same narrative
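To make steps 1 to 3 concrete, here is a minimal Python sketch. It is not the actual /last30days code: the names `Item`, `Cluster`, `extract_entities`, `cluster_by_entities`, and `KNOWN_ENTITIES` are hypothetical, the entity extraction is a plain keyword lookup standing in for a real NER step, and summing item scores is only one possible reading of "combined score". The X and YouTube scores are placeholder numbers for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical entity vocabulary; a real pipeline would likely use an NER model.
KNOWN_ENTITIES = {"kanye", "wireless festival"}


@dataclass
class Item:
    source: str
    title: str
    score: int


@dataclass
class Cluster:
    entities: set
    items: list = field(default_factory=list)

    @property
    def score(self) -> int:
        # Assumption: the combined score is the sum of the item scores.
        return sum(item.score for item in self.items)


def extract_entities(title: str) -> frozenset:
    """Step 1: return the known entities mentioned in a title."""
    lowered = title.lower()
    return frozenset(e for e in KNOWN_ENTITIES if e in lowered)


def cluster_by_entities(items, min_shared=2):
    """Steps 2 and 3: group items sharing entities, then rank by score."""
    clusters = []
    for item in items:
        entities = extract_entities(item.title)
        for cluster in clusters:
            if len(cluster.entities & entities) >= min_shared:
                cluster.items.append(item)
                cluster.entities |= entities
                break
        else:
            # No existing cluster overlaps enough, so start a new one.
            clusters.append(Cluster(entities=set(entities), items=[item]))
    return sorted(clusters, key=lambda c: c.score, reverse=True)


items = [
    Item("reddit", "Kanye Wireless Festival Cancelled", 2300),
    # Placeholder scores below; only the Reddit figure comes from the example.
    Item("x", "@kanyewest: Wireless Festival cancelled, venue issues", 500),
    Item("youtube", "Kanye West Wireless Festival Cancelled Reaction", 800),
    Item("hackernews", "An unrelated discussion thread", 100),
]
clusters = cluster_by_entities(items)
for c in clusters:
    print(c.score, sorted({i.source for i in c.items}))
```

The three Kanye items share both entities and land in one cluster, while the unrelated item stays on its own. Requiring two shared entities (`min_shared=2`) is a design choice in this sketch to avoid merging stories that only share one common name.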

The pipeline imports cluster.py and dedupe.py from the lib package to handle this. The cluster module (cluster_candidates) identifies matching candidates, and the dedupe module removes redundant entries.
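The post does not spell out what counts as a redundant entry, so the following is only a sketch under one assumption: two results are redundant when they point at the same URL after normalization. The functions `normalize_url` and `dedupe_results`, the dictionary fields, and the example.com URLs are all hypothetical and are not taken from dedupe.py.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path, ignoring scheme, 'www.', query and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path.rstrip("/")


def dedupe_results(results):
    """Keep one result per normalized URL, preferring the higher score."""
    best = {}
    for result in results:
        key = normalize_url(result["url"])
        if key not in best or result["score"] > best[key]["score"]:
            best[key] = result
    return list(best.values())


results = [
    {"url": "https://www.example.com/post/1/", "score": 10},
    {"url": "https://example.com/post/1?utm_source=x", "score": 20},
    {"url": "https://example.com/post/2", "score": 5},
]
unique = dedupe_results(results)
print([r["url"] for r in unique])
```

In this sketch the first two results collapse into one entry because they differ only in the `www.` prefix, the trailing slash, and a tracking parameter.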

Why This Matters

Cross-source clustering is what makes the research brief readable instead of a firehose. One cluster per story, with the strongest evidence from each platform woven together. When you read the synthesis, you see the complete picture — not 5 separate tabs you need to reconcile yourself.

Without clustering, researching a trending topic across 10 sources returns a wall of seemingly unrelated items. With clustering, you get a structured brief where each story appears once with all its supporting evidence from every platform.

Summary

In this post, I explained how /last30days’s cross-source cluster merging works. The key point is that entity-based overlap detection turns a multi-platform firehose into a structured brief — one story, one cluster, multiple source perspectives woven together.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
