RAGMail

2026-01-07 17:09:35

RAGMail Development Journey

A point-by-point chronicle of building a local-first RAG email chatbot, from concept to working application.


Table of Contents

  1. Project Vision
  2. Core Architecture Decisions
  3. Phase 1: Foundation
  4. Phase 2: Email Ingestion Pipeline
  5. Phase 3: Vector Search & Retrieval
  6. Phase 4: Chatbot Interface
  7. Phase 5: Mail.app Integration
  8. Phase 6: Web Interface
  9. Phase 7: Performance Optimization
  10. Phase 8: Date-Aware Search
  11. Key Challenges & Solutions
  12. Lessons Learned

1. Project Vision

The Problem

The Goal

Build a completely local, privacy-first chatbot that:

Privacy Principles


2. Core Architecture Decisions

Technology Stack Selection

Component         Technology                      Rationale
Vector Database   Qdrant (Docker)                 Fast, local, excellent Python SDK
Embeddings        OpenAI text-embedding-3-large   High quality; reduced to 1536 dimensions
Chat Model        OpenAI GPT-4                    Best reasoning capability
Backend           Python 3.11                     Rich ecosystem, easy prototyping
Web Framework     Flask                           Lightweight, minimal overhead
Email Storage     EML files                       Standard format, works with Mail.app

Data Flow Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Mail.app   │────▶│  EML Files   │────▶│   Parser    │
│(AppleScript)│     │data/incoming/│     │(parse_mbox) │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                    ┌──────────────┐     ┌──────▼──────┐
                    │   Chunker    │◀────│ Email Dict  │
                    │ (512 tokens) │     │  (subject,  │
                    └──────┬───────┘     │   body...)  │
                           │             └─────────────┘
                    ┌──────▼───────┐
                    │  Embedder    │
                    │  (OpenAI)    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │   Qdrant     │
                    │ (vector DB)  │
                    └──────────────┘

3. Phase 1: Foundation

Step 1: Project Structure

Created a modular Python project with clear separation of concerns:

scripts/
├── config.py          # Configuration loading
├── utils.py           # Shared utilities
├── qdrant_db.py       # Database connection
├── parse_mbox.py      # Email parsing
├── chunker.py         # Text chunking
├── embedder.py        # Vector generation
├── indexer.py         # Main pipeline
├── retriever.py       # Search logic
└── chatbot.py         # User interface

Step 2: Configuration System

Step 3: Docker Setup

Created docker-compose.yml for Qdrant:

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./data/qdrant_storage:/qdrant/storage

Step 4: Dependencies

Key packages in requirements.txt:


4. Phase 2: Email Ingestion Pipeline

Step 1: Email Parser (parse_mbox.py)

Built a robust EML file parser that extracts:

Key Challenges:

Step 2: Text Chunker (chunker.py)

Implemented intelligent text splitting:
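
The chunker itself isn't reproduced here; a minimal sketch of a sliding-window splitter, assuming the ~512-token chunk size from the diagram and approximating tokens by whitespace words, might look like:

```python
def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping chunks, approximating tokens by words."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # advance less than a full chunk to overlap
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

The overlap (64 words here, an illustrative value) keeps sentences that straddle a chunk boundary retrievable from both sides.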

Step 3: Embedder (embedder.py)

OpenAI embedding generation with production considerations:
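
A hedged sketch of the batching-and-retry pattern this implies; `embed_fn` stands in for the real OpenAI call (e.g. a wrapper around `client.embeddings.create`), so the backoff logic is independent of the API client:

```python
import time

def embed_with_retry(embed_fn, texts, batch_size=100, max_retries=5):
    """Embed texts in batches; retry each batch with exponential backoff."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between retries
    return vectors
```

Batching keeps the request count low; backoff absorbs transient rate-limit errors instead of aborting a long indexing run.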

Step 4: Indexer (indexer.py)

Main orchestration pipeline:

from uuid import uuid4
from qdrant_client.models import PointStruct

def index_email(eml_file):
    # 1. Parse the email into a dict (subject, from, date, body, ...)
    email = parse_eml(eml_file)

    # 2. Chunk the body into ~512-token pieces
    chunks = chunk_text(email['body'])

    # 3. Generate one embedding per chunk
    vectors = embed_batch(chunks)

    # 4. Store one point per chunk in Qdrant
    points = [
        PointStruct(
            id=str(uuid4()),
            vector=vector,
            payload={
                'email_id': email['id'],
                'subject': email['subject'],
                'from': email['from'],
                'date': email['date'],
                'chunk_index': i,
                'chunk_text': chunk,
            },
        )
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
    qdrant.upsert(collection_name='emails', points=points)

5. Phase 3: Vector Search & Retrieval

Qdrant Collection Schema

Created collection with optimized settings:

collection_config = {
    'name': 'emails',
    'vector_size': 1536,
    'distance': 'Cosine',
    'payload_schema': {
        'email_id': 'keyword',
        'subject': 'text',
        'from': 'keyword',
        'date': 'integer',  # Unix timestamp for filtering
        'chunk_text': 'text'
    }
}
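
Assuming the standard qdrant-client API, the collection and a payload index on the date field could be created like this (a sketch against a local server, not the project's actual qdrant_db.py):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PayloadSchemaType

client = QdrantClient(url='http://localhost:6333')

# 1536-dim vectors with cosine distance, matching the embedding config
client.create_collection(
    collection_name='emails',
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Index the Unix-timestamp date field so range filters stay fast
client.create_payload_index(
    collection_name='emails',
    field_name='date',
    field_schema=PayloadSchemaType.INTEGER,
)
```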

Retriever Implementation (retriever.py)

Semantic search with smart result handling:

  1. Query embedding - Convert user question to vector
  2. Vector search - Find top-k similar chunks
  3. Deduplication - Group by email_id, keep best chunk per email
  4. Filtering - Support date ranges, sender filters
  5. Result formatting - Return structured email metadata
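
Step 3, deduplication, can be sketched as follows; `hits` is assumed to be a list of scored results with `score` and `payload` fields, mirroring what Qdrant returns:

```python
def dedupe_by_email(hits):
    """Keep only the best-scoring chunk per email, ordered by score."""
    best = {}
    for hit in hits:
        eid = hit['payload']['email_id']
        # First chunk seen for this email, or a better-scoring one: keep it
        if eid not in best or hit['score'] > best[eid]['score']:
            best[eid] = hit
    return sorted(best.values(), key=lambda h: h['score'], reverse=True)
```

Without this step, a long email whose chunks all match the query would crowd every other email out of the top-k results.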

Search Parameters


6. Phase 4: Chatbot Interface

CLI Chatbot (chatbot.py)

Interactive terminal interface with features:

System Prompt Engineering

Crafted a prompt that instructs GPT-4 to:

Context Window Management
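
One common approach, sketched here as an assumption rather than the project's recorded implementation, is greedy packing of retrieved chunks under a fixed budget (characters as a crude token proxy; the 12,000 figure is illustrative):

```python
def fit_context(chunks, max_chars=12000):
    """Greedily pack retrieved chunks into a rough character budget."""
    selected, used = [], 0
    for chunk in chunks:  # chunks arrive sorted by relevance
        if used + len(chunk) > max_chars:
            break  # stop before overflowing the model's context window
        selected.append(chunk)
        used += len(chunk)
    return selected
```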


7. Phase 5: Mail.app Integration

Challenge: Opening Emails

Initial approach: Use AppleScript to search Mail.app by Message-ID

Problem: Searching 70,000+ emails took 20+ seconds, often timing out

Solution: Direct EML File Opening

Changed strategy:

  1. Store the EML filename as the email_id in Qdrant
  2. When user requests /open, locate the EML file
  3. Use open -a Mail <filename>.eml to open directly

Result: Opens in <1 second, no Mail.app database search
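
The /open handler then reduces to a couple of lines; the function names below are illustrative, but `open -a Mail` is the actual macOS command described above:

```python
import subprocess

def mail_open_command(eml_path):
    """Build the macOS command: open -a Mail <file>.eml."""
    return ['open', '-a', 'Mail', eml_path]

def open_in_mail(eml_path):
    # Hands the file straight to Mail.app; no mailbox search involved
    subprocess.run(mail_open_command(eml_path), check=True)
```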

AppleScript Export

Created AppleScript to export emails from Mail.app:

tell application "Mail"
    set theMessages to messages of mailbox "INBOX"
    repeat with theMessage in theMessages
        set emlContent to source of theMessage
        -- Save to data/incoming_emails/
    end repeat
end tell

Launcher Apps

Created double-clickable .app bundles:


8. Phase 6: Web Interface

Why Add Web UI?

Architecture Decision

Keep CLI and Web completely separate:

Flask Backend (web/app.py)

REST API endpoints:

Endpoint             Method   Purpose
/api/chat            POST     Send message, get AI response
/api/stats           GET      Collection statistics
/api/export          POST     Trigger Mail.app export
/api/index           POST     Start background indexing
/api/index/status    GET      Poll indexing progress
/api/tracker/stats   GET      Indexed files statistics

Frontend

Single-page app with:


9. Phase 7: Performance Optimization

Problem: Slow Re-indexing

Initial indexing checked Qdrant for every file:

Solution: SQLite File Tracker

Created indexed_files_tracker.py:

import sqlite3
from datetime import datetime

class IndexedFilesTracker:
    """Track which files have been indexed locally."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self._create_table()

    def _create_table(self):
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS indexed_files "
            "(filename TEXT PRIMARY KEY, email_id TEXT, indexed_at TEXT)"
        )
        self.conn.commit()

    def is_indexed(self, filename):
        """Check if a file is already indexed (O(1) primary-key lookup)."""
        cursor = self.conn.execute(
            "SELECT 1 FROM indexed_files WHERE filename = ?",
            (filename,)
        )
        return cursor.fetchone() is not None

    def mark_indexed(self, filename, email_id):
        """Record that a file has been indexed."""
        self.conn.execute(
            "INSERT OR REPLACE INTO indexed_files VALUES (?, ?, ?)",
            (filename, email_id, datetime.now().isoformat())
        )
        self.conn.commit()

Result: The already-indexed check across 9,518 files dropped from 5+ minutes to under 1 second

Migration Script

Created populate_tracker.py to backfill tracker with existing indexed emails.


10. Phase 8: Date-Aware Search


Problem: Natural Language Dates

Users ask questions like:

The semantic search found relevant content but didn't filter by date.

Solution: Date Parser (date_parser.py)

Natural language date extraction:

from datetime import datetime, timedelta

def parse_date_query(query: str) -> tuple[datetime, datetime] | None:
    """Extract a date range from a natural language query."""
    now = datetime.now()
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    first_of_month = today.replace(day=1)
    patterns = {
        'today': (today, now),
        'yesterday': (today - timedelta(days=1), today),
        'last week': (now - timedelta(days=7), now),
        'last month': ((first_of_month - timedelta(days=1)).replace(day=1),
                       first_of_month),
        'this week': (today - timedelta(days=today.weekday()), now),
        # ... more patterns
    }
    query = query.lower()
    for phrase, date_range in patterns.items():
        if phrase in query:
            return date_range
    return None

Modified retriever to accept date filters:

from qdrant_client.models import Filter, FieldCondition, Range

def search_emails(query, date_from=None, date_to=None):
    # Parse a date range out of the query text if none was given explicitly
    if date_from is None:
        parsed = parse_date_query(query)
        if parsed:
            date_from, date_to = parsed

    # Build a Qdrant range filter on the indexed Unix timestamp
    conditions = []
    if date_from:
        conditions.append(
            FieldCondition(key='date', range=Range(gte=date_from.timestamp()))
        )
    if date_to:
        conditions.append(
            FieldCondition(key='date', range=Range(lte=date_to.timestamp()))
        )
    query_filter = Filter(must=conditions) if conditions else None

    # Search with the filter applied server-side
    query_vector = embed_batch([query])[0]
    return qdrant.search(
        collection_name='emails',
        query_vector=query_vector,
        query_filter=query_filter,
    )

11. Key Challenges & Solutions

Challenge 1: Email ID Mismatch

Problem: Mail.app's internal message ID differs from the Message-ID header

Solution: Use EML filename as the canonical ID. All systems (Qdrant, tracker, opener) use the same filename-based ID.

Challenge 2: API Version Compatibility

Problem: Qdrant client v1.16 warned about server v1.12.6 compatibility

Solution: Pin specific compatible versions, accept warnings for minor mismatches, test critical operations.

Challenge 3: Rate Limiting

Problem: OpenAI embedding API has rate limits

Solution:

Challenge 4: Large Email Archives

Problem: 70,000+ emails in Mail.app, slow operations

Solution:

Challenge 5: Privacy in Git

Problem: Need to share code without exposing personal emails

Solution: Comprehensive .gitignore:

data/
*.eml
*.mbox
*.db
*.sqlite
logs/
.env

12. Lessons Learned

Technical Lessons

  1. Local-first is complex but worth it

    • Docker simplifies local database deployment
    • Trade-off: More setup steps for users
  2. Batch operations are essential

    • Individual API calls don't scale
    • Always design for bulk processing
  3. Track state locally

    • SQLite tracker dramatically improved performance
    • Don't rely on remote queries for local state
  4. Date handling is surprisingly hard

    • Timezone issues everywhere
    • Unix timestamps are the safest interchange format
  5. Test with real data early

    • Synthetic test emails hide encoding issues
    • Real mailboxes have weird edge cases

Architecture Lessons

  1. Modularity pays off

    • Each script does one thing
    • Easy to test, debug, and extend
  2. Keep CLI and Web separate

    • Different use cases, different interfaces
    • Shared backend logic via imports
  3. Design for resume/retry

    • Long operations will fail
    • Track progress, allow continuation

Product Lessons

  1. Privacy is a feature

    • Users care about where their data goes
    • Local-first opens doors cloud solutions can't
  2. Semantic search changes everything

    • "Find that email about the thing" actually works
    • Natural language is more intuitive than filters
  3. Integration matters

    • Opening emails in Mail.app is the magic moment
    • Search is only useful if you can act on results

Final Statistics

Metric                     Value
Total emails indexed       9,518
Vector points in Qdrant    23,866
Average chunks per email   ~2.5
Index time (full)          ~25 minutes
Index time (incremental)   <1 second
Search latency             ~300ms
End-to-end response        ~2 seconds

Future Possibilities