RAGMail

2026-01-07 17:09:35

RAGMail Development Journey

A point-by-point chronicle of building a local-first RAG email chatbot, from concept to working application.


Table of Contents

  1. Project Vision
  2. Core Architecture Decisions
  3. Phase 1: Foundation
  4. Phase 2: Email Ingestion Pipeline
  5. Phase 3: Vector Search & Retrieval
  6. Phase 4: Chatbot Interface
  7. Phase 5: Mail.app Integration
  8. Phase 6: Web Interface
  9. Phase 7: Performance Optimization
  10. Phase 8: Date-Aware Search
  11. Key Challenges & Solutions
  12. Lessons Learned

1. Project Vision

The Problem

The Goal

Build a completely local, privacy-first chatbot that:

Privacy Principles


2. Core Architecture Decisions

Technology Stack Selection

Component         Technology                      Rationale
Vector Database   Qdrant (Docker)                 Fast, local, excellent Python SDK
Embeddings        OpenAI text-embedding-3-large   High quality; reduced to 1536 dimensions
Chat Model        OpenAI GPT-4                    Best reasoning capability
Backend           Python 3.11                     Rich ecosystem, easy prototyping
Web Framework     Flask                           Lightweight, minimal overhead
Email Storage     EML files                       Standard format, works with Mail.app

Data Flow Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Mail.app   │────▶│  EML Files   │────▶│   Parser    │
│(AppleScript)│     │data/incoming/│     │(parse_mbox) │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                    ┌──────────────┐     ┌──────▼──────┐
                    │   Chunker    │◀────│ Email Dict  │
                    │ (512 tokens) │     │  (subject,  │
                    └──────┬───────┘     │   body...)  │
                           │             └─────────────┘
                    ┌──────▼───────┐
                    │  Embedder    │
                    │  (OpenAI)    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │   Qdrant     │
                    │ (vector DB)  │
                    └──────────────┘

3. Phase 1: Foundation

Step 1: Project Structure

Created a modular Python project with clear separation of concerns:

scripts/
├── config.py          # Configuration loading
├── utils.py           # Shared utilities
├── qdrant_db.py       # Database connection
├── parse_mbox.py      # Email parsing
├── chunker.py         # Text chunking
├── embedder.py        # Vector generation
├── indexer.py         # Main pipeline
├── retriever.py       # Search logic
└── chatbot.py         # User interface

Step 2: Configuration System

Step 3: Docker Setup

Created docker-compose.yml for Qdrant:

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./data/qdrant_storage:/qdrant/storage

Step 4: Dependencies

Key packages in requirements.txt:


4. Phase 2: Email Ingestion Pipeline

Step 1: Email Parser (parse_mbox.py)

Built a robust EML file parser that extracts:

Key Challenges:

Step 2: Text Chunker (chunker.py)

Implemented intelligent text splitting:
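
The chunker itself isn't reproduced here; a minimal sketch of a sliding-window splitter, assuming the ~512-token chunk size from the diagram and approximating tokens by whitespace words, might look like:

```python
def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping chunks, approximating tokens by words."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # advance less than a full chunk to overlap
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

The overlap (64 words here, an illustrative value) keeps sentences that straddle a chunk boundary retrievable from both sides.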

Step 3: Embedder (embedder.py)

OpenAI embedding generation with production considerations:
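
A hedged sketch of the batching-and-retry pattern this implies; `embed_fn` stands in for the real OpenAI call (e.g. a wrapper around `client.embeddings.create`), so the backoff logic is independent of the API client:

```python
import time

def embed_with_retry(embed_fn, texts, batch_size=100, max_retries=5):
    """Embed texts in batches; retry each batch with exponential backoff."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between retries
    return vectors
```

Batching keeps the request count low; backoff absorbs transient rate-limit errors instead of aborting a long indexing run.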

Step 4: Indexer (indexer.py)

Main orchestration pipeline:

from uuid import uuid4
from qdrant_client.models import PointStruct

def index_email(eml_file):
    # 1. Parse the email into a dict (subject, from, date, body, ...)
    email = parse_eml(eml_file)

    # 2. Chunk the body into ~512-token pieces
    chunks = chunk_text(email['body'])

    # 3. Generate one embedding per chunk
    vectors = embed_batch(chunks)

    # 4. Store one point per chunk in Qdrant
    points = [
        PointStruct(
            id=str(uuid4()),
            vector=vector,
            payload={
                'email_id': email['id'],
                'subject': email['subject'],
                'from': email['from'],
                'date': email['date'],
                'chunk_index': i,
                'chunk_text': chunk,
            },
        )
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
    qdrant.upsert(collection_name='emails', points=points)

5. Phase 3: Vector Search & Retrieval

Qdrant Collection Schema

Created collection with optimized settings:

collection_config = {
    'name': 'emails',
    'vector_size': 1536,
    'distance': 'Cosine',
    'payload_schema': {
        'email_id': 'keyword',
        'subject': 'text',
        'from': 'keyword',
        'date': 'integer',  # Unix timestamp for filtering
        'chunk_text': 'text'
    }
}
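
Assuming the standard qdrant-client API, the collection and a payload index on the date field could be created like this (a sketch against a local server, not the project's actual qdrant_db.py):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PayloadSchemaType

client = QdrantClient(url='http://localhost:6333')

# 1536-dim vectors with cosine distance, matching the embedding config
client.create_collection(
    collection_name='emails',
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Index the Unix-timestamp date field so range filters stay fast
client.create_payload_index(
    collection_name='emails',
    field_name='date',
    field_schema=PayloadSchemaType.INTEGER,
)
```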

Retriever Implementation (retriever.py)

Semantic search with smart result handling:

  1. Query embedding - Convert user question to vector
  2. Vector search - Find top-k similar chunks
  3. Deduplication - Group by email_id, keep best chunk per email
  4. Filtering - Support date ranges, sender filters
  5. Result formatting - Return structured email metadata
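
Step 3, deduplication, can be sketched as follows; `hits` is assumed to be a list of scored results with `score` and `payload` fields, mirroring what Qdrant returns:

```python
def dedupe_by_email(hits):
    """Keep only the best-scoring chunk per email, ordered by score."""
    best = {}
    for hit in hits:
        eid = hit['payload']['email_id']
        # First chunk seen for this email, or a better-scoring one: keep it
        if eid not in best or hit['score'] > best[eid]['score']:
            best[eid] = hit
    return sorted(best.values(), key=lambda h: h['score'], reverse=True)
```

Without this step, a long email whose chunks all match the query would crowd every other email out of the top-k results.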

Search Parameters


6. Phase 4: Chatbot Interface

CLI Chatbot (chatbot.py)

Interactive terminal interface with features:

System Prompt Engineering

Crafted a prompt that instructs GPT-4 to:

Context Window Management
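
One common approach, sketched here as an assumption rather than the project's recorded implementation, is greedy packing of retrieved chunks under a fixed budget (characters as a crude token proxy; the 12,000 figure is illustrative):

```python
def fit_context(chunks, max_chars=12000):
    """Greedily pack retrieved chunks into a rough character budget."""
    selected, used = [], 0
    for chunk in chunks:  # chunks arrive sorted by relevance
        if used + len(chunk) > max_chars:
            break  # stop before overflowing the model's context window
        selected.append(chunk)
        used += len(chunk)
    return selected
```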


7. Phase 5: Mail.app Integration

Challenge: Opening Emails

Initial approach: Use AppleScript to search Mail.app by Message-ID

Problem: Searching 70,000+ emails took 20+ seconds, often timing out

Solution: Direct EML File Opening

Changed strategy:

  1. Store the EML filename as the email_id in Qdrant
  2. When user requests /open, locate the EML file
  3. Use open -a Mail <filename>.eml to open directly

Result: Opens in <1 second, no Mail.app database search
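
The /open handler then reduces to a couple of lines; the function names below are illustrative, but `open -a Mail` is the actual macOS command described above:

```python
import subprocess

def mail_open_command(eml_path):
    """Build the macOS command: open -a Mail <file>.eml."""
    return ['open', '-a', 'Mail', eml_path]

def open_in_mail(eml_path):
    # Hands the file straight to Mail.app; no mailbox search involved
    subprocess.run(mail_open_command(eml_path), check=True)
```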

AppleScript Export

Created AppleScript to export emails from Mail.app:

tell application "Mail"
    set theMessages to messages of mailbox "INBOX"
    repeat with theMessage in theMessages
        set emlContent to source of theMessage
        -- Save to data/incoming_emails/
    end repeat
end tell

Launcher Apps

Created double-clickable .app bundles:


8. Phase 6: Web Interface

Why Add Web UI?

Architecture Decision

Keep CLI and Web completely separate:

Flask Backend (web/app.py)

REST API endpoints:

Endpoint             Method   Purpose
/api/chat            POST     Send message, get AI response
/api/stats           GET      Collection statistics
/api/export          POST     Trigger Mail.app export
/api/index           POST     Start background indexing
/api/index/status    GET      Poll indexing progress
/api/tracker/stats   GET      Indexed files statistics

Frontend

Single-page app with:


9. Phase 7: Performance Optimization

Problem: Slow Re-indexing

Initial indexing checked Qdrant for every file:

Solution: SQLite File Tracker

Created indexed_files_tracker.py:

import sqlite3
from datetime import datetime

class IndexedFilesTracker:
    """Track which files have been indexed locally."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self._create_table()

    def _create_table(self):
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS indexed_files "
            "(filename TEXT PRIMARY KEY, email_id TEXT, indexed_at TEXT)"
        )
        self.conn.commit()

    def is_indexed(self, filename):
        """Check if a file is already indexed (O(1) primary-key lookup)."""
        cursor = self.conn.execute(
            "SELECT 1 FROM indexed_files WHERE filename = ?",
            (filename,)
        )
        return cursor.fetchone() is not None

    def mark_indexed(self, filename, email_id):
        """Record that a file has been indexed."""
        self.conn.execute(
            "INSERT OR REPLACE INTO indexed_files VALUES (?, ?, ?)",
            (filename, email_id, datetime.now().isoformat())
        )
        self.conn.commit()

Result: The already-indexed check across 9,518 files dropped from 5+ minutes to under 1 second

Migration Script

Created populate_tracker.py to backfill tracker with existing indexed emails.


10. Phase 8: Date-Aware Search


Problem: Natural Language Dates

Users ask questions like:

The semantic search found relevant content but didn't filter by date.

Solution: Date Parser (date_parser.py)

Natural language date extraction:

from datetime import datetime, timedelta

def parse_date_query(query: str) -> tuple[datetime, datetime] | None:
    """Extract a date range from a natural language query."""
    now = datetime.now()
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    first_of_month = today.replace(day=1)
    patterns = {
        'today': (today, now),
        'yesterday': (today - timedelta(days=1), today),
        'last week': (now - timedelta(days=7), now),
        'last month': ((first_of_month - timedelta(days=1)).replace(day=1),
                       first_of_month),
        'this week': (today - timedelta(days=today.weekday()), now),
        # ... more patterns
    }
    query = query.lower()
    for phrase, date_range in patterns.items():
        if phrase in query:
            return date_range
    return None

Modified retriever to accept date filters:

from qdrant_client.models import Filter, FieldCondition, Range

def search_emails(query, date_from=None, date_to=None):
    # Parse a date range out of the query text if none was given explicitly
    if date_from is None:
        parsed = parse_date_query(query)
        if parsed:
            date_from, date_to = parsed

    # Build a Qdrant range filter on the indexed Unix timestamp
    conditions = []
    if date_from:
        conditions.append(
            FieldCondition(key='date', range=Range(gte=date_from.timestamp()))
        )
    if date_to:
        conditions.append(
            FieldCondition(key='date', range=Range(lte=date_to.timestamp()))
        )
    query_filter = Filter(must=conditions) if conditions else None

    # Search with the filter applied server-side
    query_vector = embed_batch([query])[0]
    return qdrant.search(
        collection_name='emails',
        query_vector=query_vector,
        query_filter=query_filter,
    )

11. Key Challenges & Solutions

Challenge 1: Email ID Mismatch

Problem: Mail.app's internal message ID differs from the Message-ID header

Solution: Use EML filename as the canonical ID. All systems (Qdrant, tracker, opener) use the same filename-based ID.

Challenge 2: API Version Compatibility

Problem: Qdrant client v1.16 warned about server v1.12.6 compatibility

Solution: Pin specific compatible versions, accept warnings for minor mismatches, test critical operations.

Challenge 3: Rate Limiting

Problem: OpenAI embedding API has rate limits

Solution:

Challenge 4: Large Email Archives

Problem: 70,000+ emails in Mail.app, slow operations

Solution:

Challenge 5: Privacy in Git

Problem: Need to share code without exposing personal emails

Solution: Comprehensive .gitignore:

data/
*.eml
*.mbox
*.db
*.sqlite
logs/
.env

12. Lessons Learned

Technical Lessons

  1. Local-first is complex but worth it

    • Docker simplifies local database deployment
    • Trade-off: More setup steps for users
  2. Batch operations are essential

    • Individual API calls don't scale
    • Always design for bulk processing
  3. Track state locally

    • SQLite tracker dramatically improved performance
    • Don't rely on remote queries for local state
  4. Date handling is surprisingly hard

    • Timezone issues everywhere
    • Unix timestamps are the safest interchange format
  5. Test with real data early

    • Synthetic test emails hide encoding issues
    • Real mailboxes have weird edge cases

Architecture Lessons

  1. Modularity pays off

    • Each script does one thing
    • Easy to test, debug, and extend
  2. Keep CLI and Web separate

    • Different use cases, different interfaces
    • Shared backend logic via imports
  3. Design for resume/retry

    • Long operations will fail
    • Track progress, allow continuation

Product Lessons

  1. Privacy is a feature

    • Users care about where their data goes
    • Local-first opens doors cloud solutions can't
  2. Semantic search changes everything

    • "Find that email about the thing" actually works
    • Natural language is more intuitive than filters
  3. Integration matters

    • Opening emails in Mail.app is the magic moment
    • Search is only useful if you can act on results

Final Statistics

Metric                     Value
Total emails indexed       9,518
Vector points in Qdrant    23,866
Average chunks per email   ~2.5
Index time (full)          ~25 minutes
Index time (incremental)   <1 second
Search latency             ~300ms
End-to-end response        ~2 seconds

Future Possibilities