How World Monitor Aggregates 435+ News Feeds with AI-Powered Analysis
The Problem with Manual News Monitoring
I used to spend hours each morning opening dozens of tabs: BBC for world news, Al Jazeera for Middle East coverage, Reuters for wire reports, Defense One for military analysis, and on and on. By the time I’d scanned everything, half my morning was gone and I still felt like I was missing important stories.
The real problem wasn’t the time—it was the signal-to-noise ratio. Every source had its own biases, its own editorial priorities, and its own blind spots. I needed a way to aggregate multiple sources, deduplicate related stories, and focus on what actually mattered.
World Monitor solves this by aggregating 435+ curated RSS feeds across 15 categories, clustering similar stories together, and applying AI-powered analysis to surface what’s important. Here’s how the news aggregation system works.
The Feed Architecture
The first decision I made was to curate sources rather than crawl everything. Quality over quantity. The feeds are organized into 15 categories:
World / Geopolitical → BBC, Reuters, AP, Guardian, NPR, Politico
Middle East / MENA → Al Jazeera, BBC ME, Guardian ME, Al Arabiya
Africa → BBC Africa, News24, Google News aggregation
Latin America → BBC Latin America, Guardian Americas
Asia-Pacific → BBC Asia, South China Morning Post
Energy & Resources → Google News (oil/gas, nuclear, mining)
Technology → Hacker News, Ars Technica, The Verge, MIT Tech Review
AI / ML → ArXiv, VentureBeat AI, MIT Tech Review
Finance → CNBC, MarketWatch, Financial Times, Yahoo Finance
Government → White House, State Dept, Pentagon, Treasury, Fed
Intel Feed → Defense One, Breaking Defense, Bellingcat
Think Tanks → Foreign Policy, Atlantic Council, CSIS, RAND
Crisis Watch → International Crisis Group, IAEA, WHO, UNHCR
Regional Sources → Xinhua, TASS, Kyiv Independent, Moscow Times

Each category serves a specific intelligence purpose. The Intel Feed category, for example, pulls from defense-focused sources that wouldn’t appear in mainstream news. The Think Tanks category captures analysis that takes days or weeks to produce, not hours.
Source Filtering: The Feature I Didn’t Know I Needed
After using the system for a few weeks, I realized I was ignoring certain sources. Some were too sensational. Others had paywalls. A few just didn’t align with my focus areas.
I added a source filtering system that lets me toggle individual sources on or off:
┌─────────────────────────────────────────────────────────────┐
│ SOURCES                                        [Select All] │
│                                               [Select None] │
├─────────────────────────────────────────────────────────────┤
│ Search: [________________________]                          │
├─────────────────────────────────────────────────────────────┤
│ [x] BBC News                                                │
│ [x] Reuters                                                 │
│ [ ] Daily Mail ← disabled                                   │
│ [x] Al Jazeera                                              │
│ [x] Defense One                                             │
│ ...                                                         │
├─────────────────────────────────────────────────────────────┤
│ 45/77 sources enabled                                       │
└─────────────────────────────────────────────────────────────┘

The key implementation detail: disabled sources are filtered at fetch time, not display time. This means I’m not wasting bandwidth or processing power on sources I don’t want. Settings persist to localStorage, so my preferences survive page refreshes.
When a panel has all its sources disabled, it shows a message instead of an empty state. This prevents confusion about whether the system is broken or just filtered.
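To make the fetch-time idea concrete, here is a minimal sketch. The type names, the `disabledSources` storage key, and the helper functions are assumptions for illustration, not World Monitor's actual code. Storage is passed in as a small interface that the browser's `localStorage` satisfies, so the logic can also run outside a browser.

```typescript
// Hypothetical shapes: the article does not show World Monitor's real types.
interface FeedSource {
  id: string;
  url: string;
}

// Minimal storage contract; the browser's localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const DISABLED_KEY = "disabledSources"; // assumed storage key

// Read the set of disabled source ids, tolerating missing or corrupted data.
function loadDisabled(store: KeyValueStore): Set<string> {
  const raw = store.getItem(DISABLED_KEY);
  if (raw === null) return new Set<string>();
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return new Set<string>();
    return new Set<string>(parsed.filter((x): x is string => typeof x === "string"));
  } catch {
    return new Set<string>(); // corrupted entry: fall back to "everything enabled"
  }
}

function saveDisabled(store: KeyValueStore, disabled: Set<string>): void {
  store.setItem(DISABLED_KEY, JSON.stringify(Array.from(disabled)));
}

// Filter BEFORE any network request is made, so disabled feeds cost nothing.
function sourcesToFetch(all: FeedSource[], disabled: Set<string>): FeedSource[] {
  return all.filter((source) => !disabled.has(source.id));
}
```

In a browser, `loadDisabled(localStorage)` would be called once before each refresh cycle, and only the result of `sourcesToFetch` would be handed to the fetcher.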
The Clustering Algorithm: Deduplication That Actually Works
The biggest problem with aggregating 435+ feeds is duplication. The same story appears in BBC, Reuters, AP, and Guardian within hours of each other. I needed a way to group related articles without creating false positives.
I implemented Jaccard similarity clustering with a 0.6 threshold:
Raw Headlines from 435+ Feeds
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│ Preprocessing                                               │
│ - Lowercase normalization                                   │
│ - Remove common stop words                                  │
│ - Tokenize into word sets                                   │
└─────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│ Jaccard Similarity Calculation                              │
│ J(A,B) = |A ∩ B| / |A ∪ B|                                  │
│ Threshold: 0.6 (tuned for news headlines)                   │
└─────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────┐
│ Cluster Formation                                           │
│ - Group headlines with similarity > 0.6                     │
│ - Select canonical headline (earliest or most complete)     │
│ - Store cluster metadata for UI display                     │
└─────────────────────────────────────────────────────────────┘

The 0.6 threshold was the result of trial and error. Lower values created false clusters (unrelated stories grouped together). Higher values missed legitimate duplicates (the same story with slightly different headlines).
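The pipeline above can be sketched in a few functions. This is a simplified illustration, not the code in analysis.worker.ts: the stop-word list is a placeholder, the tokenizer only keeps ASCII letters and digits, and the greedy "compare against the first member of each cluster" strategy is an assumption, since the article does not say how clusters are formed or how the canonical headline is chosen.

```typescript
// Illustrative stop-word list; the article does not say which words it removes.
const STOP_WORDS = new Set<string>([
  "a", "an", "the", "of", "in", "on", "to", "for", "and", "as", "at", "by", "with",
]);

// Lowercase, split into alphanumeric tokens, drop stop words, return a word set.
function tokenize(headline: string): Set<string> {
  const words = headline.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set<string>(words.filter((w) => !STOP_WORDS.has(w)));
}

// J(A,B) = |A ∩ B| / |A ∪ B|, with two empty sets treated as dissimilar.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((w) => {
    if (b.has(w)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Greedy single pass: each headline joins the first cluster whose
// representative (its first member) is similar enough, else starts a new one.
function clusterHeadlines(headlines: string[], threshold: number = 0.6): string[][] {
  const clusters: { rep: Set<string>; members: string[] }[] = [];
  for (const headline of headlines) {
    const tokens = tokenize(headline);
    const home = clusters.find((c) => jaccard(c.rep, tokens) > threshold);
    if (home) {
      home.members.push(headline);
    } else {
      clusters.push({ rep: tokens, members: [headline] });
    }
  }
  return clusters.map((c) => c.members);
}
```

For example, "Fed raises interest rates by quarter point" and "Fed raises interest rates by quarter point again" share 6 of 7 distinct tokens (J ≈ 0.86) and land in one cluster, while "Earthquake strikes northern Japan" starts its own.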
The clustering runs in analysis.worker.ts as a Web Worker, keeping the main thread responsive. Cross-domain correlation detection identifies when the same story appears across different categories—for example, a tech story about AI that also has financial implications.
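One plausible way to detect cross-domain stories, sketched under assumptions: once headlines are clustered, a cluster whose members come from more than one feed category is flagged. The `ClusteredItem` shape and the function name are hypothetical; the article does not describe how the real detection works.

```typescript
// Hypothetical shape for an article that has already been assigned to a cluster.
interface ClusteredItem {
  headline: string;
  category: string;
}

interface CrossDomainHit {
  headline: string;     // first headline in the cluster, used as a label
  categories: string[]; // distinct categories, sorted alphabetically
}

// Flag every cluster whose members span more than one category.
function crossDomainClusters(clusters: ClusteredItem[][]): CrossDomainHit[] {
  const flagged: CrossDomainHit[] = [];
  for (const cluster of clusters) {
    const first = cluster[0];
    const categories = Array.from(new Set<string>(cluster.map((item) => item.category))).sort();
    if (first !== undefined && categories.length > 1) {
      flagged.push({ headline: first.headline, categories });
    }
  }
  return flagged;
}
```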
Why Clustering Matters for AI Processing
Here’s something I didn’t anticipate: clustering dramatically reduces AI costs. Before clustering, I was sending hundreds of duplicate headlines to the LLM for summarization. After clustering, the prompt size typically dropped by 20-40%. The numbers below are an illustrative example with heavier duplication than that typical range:
Before Clustering: 500 headlines × ~50 tokens = 25,000 tokens per summary
After Clustering: 150 unique clusters × ~50 tokens = 7,500 tokens per summary
Savings: 70% reduction in prompt tokens
This isn’t just about cost; it’s also about quality. When the LLM sees the same story five times, it tends to over-weight that story in the summary. Clustering ensures each story gets equal consideration.
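The token arithmetic in the example above can be checked with a couple of helper functions. The function names are made up for this sketch, and the 50 tokens per headline figure is the article's rough estimate rather than a measured value.

```typescript
// ~50 tokens per headline is the article's rough estimate, not a measurement.
const TOKENS_PER_HEADLINE = 50;

// Estimated prompt size for a given number of headlines or clusters.
function promptTokens(itemCount: number, tokensPerItem: number = TOKENS_PER_HEADLINE): number {
  return itemCount * tokensPerItem;
}

// Percentage of prompt tokens saved, rounded to the nearest whole percent.
function reductionPercent(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return Math.round(((before - after) / before) * 100);
}

const beforeTokens = promptTokens(500); // 25,000 tokens
const afterTokens = promptTokens(150);  // 7,500 tokens
const savedPercent = reductionPercent(beforeTokens, afterTokens); // 70
```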
Custom Monitors: Personalized Keyword Alerts
The source filtering handles which sources I trust. But I also needed a way to track specific topics across all sources. I implemented custom monitors:
Monitor: "nvidia, gpu, chip shortage"
    │
    ├── Assigned unique color (auto-generated)
    │
    ├── Scans all incoming headlines for matches
    │
    ├── Highlights matching articles in Monitor panel
    │
    └── Matching articles in clusters inherit monitor color

I have monitors set up for:
- Specific companies I’m tracking
- Geographic regions I’m focused on
- Technical topics I’m researching
- People I’m following
The monitors persist via localStorage, so they survive page refreshes. Each monitor gets a unique color, making it easy to spot relevant articles at a glance.
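A monitor like the one above could be modeled roughly as follows. This is a sketch with assumed names: the article says colors are auto-generated but not how, so the hash-to-hue scheme here is purely illustrative, and the matching is a plain case-insensitive substring check.

```typescript
// Hypothetical monitor shape; the real one may carry more fields.
interface KeywordMonitor {
  keywords: string[];
  color: string;
}

// Parse a comma-separated definition such as "nvidia, gpu, chip shortage".
function parseKeywords(input: string): string[] {
  return input
    .split(",")
    .map((k) => k.trim().toLowerCase())
    .filter((k) => k.length > 0);
}

// Derive a stable hue from the keywords so a monitor keeps its color across
// reloads. Assumption: the article does not describe its color scheme.
function colorFor(keywords: string[]): string {
  const text = keywords.join(",");
  let hue = 0;
  for (let i = 0; i < text.length; i++) {
    hue = (hue * 31 + text.charCodeAt(i)) % 360;
  }
  return `hsl(${hue}, 70%, 50%)`;
}

function createMonitor(input: string): KeywordMonitor {
  const keywords = parseKeywords(input);
  return { keywords, color: colorFor(keywords) };
}

// Case-insensitive substring match: simple, but a short keyword such as "gpu"
// will also match inside longer words.
function matchesMonitor(headline: string, monitor: KeywordMonitor): boolean {
  const text = headline.toLowerCase();
  return monitor.keywords.some((k) => text.includes(k));
}
```

Substring matching keeps the sketch short. Word-boundary matching would cut false positives at the cost of a slightly more involved tokenizer.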
Live News Streams: Television in the Browser
Sometimes I want background news while I’m working. I embedded YouTube live streams with channel switching:
Bloomberg → Business & financial news
Sky News → UK & international
Euronews → European perspective
DW News → German international
France 24 → French global news
Al Arabiya → Middle East (Arabic)
Al Jazeera → Middle East & international

The implementation uses the YouTube IFrame Player API rather than raw iframes. This gives me programmatic control over playback:
    // Persistent player: no reload on mute/play/channel change
    player.mute();
    player.setVolume(50);
    player.loadVideoById(channelVideoId);

    // Pause when the tab is hidden (the 5-minute idle timer is not shown here)
    document.addEventListener('visibilitychange', () => {
      if (document.hidden) {
        player.pauseVideo();
      }
    });

The player persists across channel changes, avoiding the jarring reload that comes with raw iframe embeds. When the tab is hidden or I’ve been idle for 5 minutes, playback pauses automatically.
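The 5-minute idle rule is only described in prose, so here is one way it might be modeled. The `IdleTracker` class is hypothetical; it takes timestamps as arguments instead of reading the clock, which keeps the decision logic separate from the browser events that feed it.

```typescript
const IDLE_LIMIT_MS = 5 * 60 * 1000; // 5 minutes, as stated in the article

// Hypothetical helper: tracks the last user activity and decides when to pause.
class IdleTracker {
  private lastActivity: number;
  private readonly limitMs: number;

  constructor(now: number, limitMs: number = IDLE_LIMIT_MS) {
    this.lastActivity = now;
    this.limitMs = limitMs;
  }

  // Call on mousemove, keydown, touchstart, and similar events.
  recordActivity(now: number): void {
    this.lastActivity = now;
  }

  // Pause if the tab is hidden or the user has been idle for the full limit.
  shouldPause(now: number, tabHidden: boolean): boolean {
    return tabHidden || now - this.lastActivity >= this.limitMs;
  }
}

// Browser wiring (not executed here): call tracker.recordActivity(Date.now())
// from input listeners, then check
// tracker.shouldPause(Date.now(), document.hidden) on a periodic timer and on
// "visibilitychange" before calling player.pauseVideo().
```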
Activity Tracking: What’s New vs. What I’ve Seen
With hundreds of headlines flowing through the system, I needed a way to track what I’d already viewed. I implemented a three-tier activity system:
┌─────────────────────────────────────────────────────────────┐
│ NEW Badge      │ 2 minutes    │ Bright badge on new items   │
├─────────────────────────────────────────────────────────────┤
│ Glow Highlight │ 30 seconds   │ Animation draws attention   │
├─────────────────────────────────────────────────────────────┤
│ Panel Badge    │ Until viewed │ Count in collapsed panels   │
└─────────────────────────────────────────────────────────────┘

The “seen” detection uses IntersectionObserver:
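    // Pending "seen" timers, keyed by the observed element
    const pending = new Map();

    const observer = new IntersectionObserver((entries) => {
      entries.forEach((entry) => {
        if (entry.intersectionRatio > 0.5) {
          // Item is >50% visible: mark it as seen if it stays that way for 500ms
          if (!pending.has(entry.target)) {
            pending.set(entry.target, setTimeout(() => {
              pending.delete(entry.target);
              markAsSeen(entry.target);
            }, 500));
          }
        } else {
          // Item dropped below 50% before the timer fired: cancel it
          clearTimeout(pending.get(entry.target));
          pending.delete(entry.target);
        }
      });
    }, { threshold: 0.5 });

An item needs to be more than 50% visible for at least 500ms to be marked as seen. This prevents accidental marks from scrolling past quickly. Each panel maintains independent activity state.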
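The three tiers in the table above could be derived from two timestamps and a seen flag. This sketch assumes the glow and NEW badge depend only on an item's age, and that the panel count depends only on whether the item has been viewed. The article does not spell out how the tiers interact, and all names here are hypothetical.

```typescript
const GLOW_MS = 30 * 1000;          // glow highlight: 30 seconds
const NEW_BADGE_MS = 2 * 60 * 1000; // NEW badge: 2 minutes

interface ActivityFlags {
  glow: boolean;
  newBadge: boolean;
  countsTowardPanelBadge: boolean;
}

// Compute which indicators apply to one item at time `now` (milliseconds).
function activityFlags(arrivedAt: number, now: number, seen: boolean): ActivityFlags {
  const age = now - arrivedAt;
  return {
    glow: age < GLOW_MS,
    newBadge: age < NEW_BADGE_MS,
    countsTowardPanelBadge: !seen,
  };
}
```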
Regional Intelligence Panels
The 15 feed categories map to regional intelligence panels:
┌─────────────────────────────────────────────────────────────┐
│ Middle East   │ MENA region                                 │
│               │ Israel-Gaza, Iran, Gulf states, Red Sea     │
├─────────────────────────────────────────────────────────────┤
│ Africa        │ Sub-Saharan focus                           │
│               │ Sahel instability, coups, insurgencies      │
├─────────────────────────────────────────────────────────────┤
│ Latin America │ Central & South America                     │
│               │ Venezuela, drug trafficking                 │
├─────────────────────────────────────────────────────────────┤
│ Asia-Pacific  │ East & Southeast Asia                       │
│               │ China-Taiwan, Korean peninsula              │
├─────────────────────────────────────────────────────────────┤
│ Energy        │ Global energy markets                       │
│               │ Oil markets, nuclear, mining                │
└─────────────────────────────────────────────────────────────┘

Each panel shows headlines from its assigned sources, with clustering applied within the panel. I can expand a panel to see more detail or collapse it to just see the count badge.
Data Export: Taking Intelligence Offline
Sometimes I need to analyze the data outside the browser. The export feature generates CSV and JSON snapshots:
CSV Export:
- Headline, source, timestamp, category, threat level
- Importable into Excel, Google Sheets, Python pandas

JSON Export:
- Full article metadata
- Cluster relationships
- AI-generated summaries
- Entity extractions

There’s also a historical playback feature that loads snapshots from the past 7 days. The system automatically cleans up snapshots older than 7 days to manage storage.
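As a rough sketch of the CSV export and the 7-day cleanup, assuming the five columns listed above: the `ExportRow` shape, the header names, and the `takenAt` field are hypothetical. The quoting follows the usual CSV convention of wrapping fields that contain commas, quotes, or line breaks and doubling embedded quotes.

```typescript
// Hypothetical row shape based on the columns the article lists.
interface ExportRow {
  headline: string;
  source: string;
  timestamp: string;
  category: string;
  threatLevel: string;
}

// Quote a field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(rows: ExportRow[]): string {
  const header = ["headline", "source", "timestamp", "category", "threat_level"].join(",");
  const lines = rows.map((r) =>
    [r.headline, r.source, r.timestamp, r.category, r.threatLevel].map(csvField).join(","),
  );
  return [header, ...lines].join("\r\n");
}

const RETENTION_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// Keep only snapshots taken within the retention window.
function pruneSnapshots<T extends { takenAt: number }>(snapshots: T[], now: number): T[] {
  return snapshots.filter((s) => now - s.takenAt <= RETENTION_MS);
}
```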
What I Learned Building This
After running World Monitor for several months, a few insights stand out:
- Curation beats crawling. 435 well-chosen sources provide better signal than 10,000 random feeds. I spent more time curating sources than I expected, but the quality improvement is worth it.
- Clustering is essential for AI. The 20-40% token reduction from deduplication isn’t just cost savings; it also improves summary quality by preventing over-weighting of duplicate stories.
- Source filtering needs to be granular. The ability to disable individual sources at fetch time, not display time, matters for both bandwidth and processing efficiency.
- Activity tracking reduces anxiety. Knowing what I’ve seen versus what’s new eliminates the “did I already read this?” mental overhead.
- Regional panels enable focus. I can monitor the Middle East closely while keeping an eye on other regions. The panel structure matches how I actually think about global events.
In This Post
In this post, I showed how World Monitor aggregates 435+ RSS feeds into actionable intelligence. The key components are: curated source selection across 15 categories, Jaccard similarity clustering for deduplication, granular source filtering at fetch time, custom keyword monitors for personalized tracking, and activity tracking to manage information overload. The clustering algorithm reduces AI token usage by 20-40% while improving summary quality.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 World Monitor
- 👨‍💻 Jaccard Similarity
- 👨‍💻 RSS Specification
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!