RAGMail Development Journey
2026-01-07
A point-by-point chronicle of building a local-first RAG email chatbot, from concept to working application.
Table of Contents
- Project Vision
- Core Architecture Decisions
- Phase 1: Foundation
- Phase 2: Email Ingestion Pipeline
- Phase 3: Vector Search & Retrieval
- Phase 4: Chatbot Interface
- Phase 5: Mail.app Integration
- Phase 6: Web Interface
- Phase 7: Performance Optimization
- Phase 8: Date-Aware Search
- Key Challenges & Solutions
- Lessons Learned
1. Project Vision
The Problem
- Searching through years of email is painful
- Existing email search is keyword-based, not semantic
- Cloud-based AI solutions require uploading personal data
- No good local-first solution existed for conversational email search
The Goal
Build a completely local, privacy-first chatbot that:
- Keeps all email data on your machine
- Uses AI for semantic understanding, not just keyword matching
- Answers natural language questions about your email history
- Integrates seamlessly with Apple Mail.app
Privacy Principles
- Email content never leaves the local filesystem
- Only embeddings (numerical vectors) stored in local vector database
- Only minimal context sent to OpenAI for chat responses
- No cloud databases, no external storage
2. Core Architecture Decisions
Technology Stack Selection
| Component | Technology | Rationale |
|---|---|---|
| Vector Database | Qdrant (Docker) | Fast, local, excellent Python SDK |
| Embeddings | OpenAI text-embedding-3-large | High quality; reduced to 1536 dimensions via the API's dimensions parameter |
| Chat Model | OpenAI GPT-4 | Best reasoning capability |
| Backend | Python 3.11 | Rich ecosystem, easy prototyping |
| Web Framework | Flask | Lightweight, minimal overhead |
| Email Storage | EML files | Standard format, works with Mail.app |
Data Flow Architecture
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Mail.app │────▶│ EML Files │────▶│ Parser │
│ (AppleScript)│ │ (data/incoming)│ │(parse_mbox) │
└─────────────┘ └──────────────┘ └──────┬──────┘
│
┌──────────────┐ ┌──────▼──────┐
│ Chunker │◀────│ Email Dict │
│ (512 tokens) │ │ (subject, │
└──────┬───────┘ │ body...) │
│ └─────────────┘
┌──────▼───────┐
│ Embedder │
│ (OpenAI) │
└──────┬───────┘
│
┌──────▼───────┐
│ Qdrant │
│ (vector DB) │
└──────────────┘
3. Phase 1: Foundation
Step 1: Project Structure
Created a modular Python project with clear separation of concerns:
scripts/
├── config.py # Configuration loading
├── utils.py # Shared utilities
├── qdrant_db.py # Database connection
├── parse_mbox.py # Email parsing
├── chunker.py # Text chunking
├── embedder.py # Vector generation
├── indexer.py # Main pipeline
├── retriever.py # Search logic
└── chatbot.py # User interface
Step 2: Configuration System
- config.yaml for application settings
- .env for secrets (API keys)
- Centralized config loader with sensible defaults
Step 3: Docker Setup
Created docker-compose.yml for Qdrant:
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./data/qdrant_storage:/qdrant/storage
Step 4: Dependencies
Key packages in requirements.txt:
- qdrant-client - Vector database client
- openai - Embeddings and chat
- beautifulsoup4 - HTML email parsing
- tiktoken - Token counting
- tenacity - Retry logic
- rich - Terminal UI
- flask - Web interface (added later)
4. Phase 2: Email Ingestion Pipeline
Step 1: Email Parser (parse_mbox.py)
Built a robust EML file parser that extracts:
- Message-ID (unique identifier)
- Subject, From, To, CC
- Date (parsed to Unix timestamp)
- Body (plain text preferred, HTML converted if needed)
- Attachments (metadata only)
Key Challenges:
- MIME multipart messages with nested parts
- Various character encodings (UTF-8, ISO-8859-1, etc.)
- HTML emails needing conversion to plain text
- Malformed headers in real-world emails
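The core of the parsing step can be sketched with Python's standard-library email package (the real parse_mbox.py handles multipart nesting and encodings more defensively; the sample message below is made up for illustration):

```python
from email import policy
from email.parser import BytesParser

def parse_eml_bytes(raw):
    """Parse raw EML bytes into a flat dict of the fields we index."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    # Prefer the plain-text body; fall back to HTML if that's all there is
    body_part = msg.get_body(preferencelist=('plain', 'html'))
    return {
        'id': msg['Message-ID'],
        'subject': msg['Subject'],
        'from': msg['From'],
        'date': msg['Date'],
        'body': body_part.get_content() if body_part else '',
    }

# Minimal single-part message for demonstration
raw = (b"Message-ID: <abc@example.com>\r\n"
       b"Subject: Hello\r\n"
       b"From: alice@example.com\r\n"
       b"Date: Mon, 06 Jan 2025 10:00:00 +0000\r\n"
       b"Content-Type: text/plain; charset=utf-8\r\n\r\n"
       b"Hi there.\r\n")
email_dict = parse_eml_bytes(raw)
```

The policy=policy.default argument is what enables get_body() and decoded header access; the legacy compat32 policy does not support them.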
Step 2: Text Chunker (chunker.py)
Implemented intelligent text splitting:
- 512 tokens per chunk (optimal for embeddings)
- 50 token overlap (maintains context across chunks)
- Respects sentence boundaries where possible
- Uses tiktoken for accurate token counting
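The windowing logic above can be sketched without tiktoken by treating whitespace-separated words as tokens (the real chunker counts BPE tokens and also tries to respect sentence boundaries):

```python
def chunk_text(text, chunk_size=512, overlap=50, tokenize=str.split):
    """Split text into overlapping windows of roughly chunk_size tokens."""
    tokens = tokenize(text)
    chunks = []
    step = chunk_size - overlap  # each window starts (chunk_size - overlap) tokens after the last
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(' '.join(window))
        if start + chunk_size >= len(tokens):
            break  # last window reached the end of the text
    return chunks

# A 1,000-word body yields 3 overlapping chunks with these defaults
chunks = chunk_text(' '.join(f'w{i}' for i in range(1000)))
```

Swapping tokenize for a tiktoken encoder (and joining token IDs back through its decode) recovers the accurate-count behavior described above.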
Step 3: Embedder (embedder.py)
OpenAI embedding generation with production considerations:
- Batch processing (up to 2048 texts per API call)
- Rate limit handling with exponential backoff
- Cost tracking (log token usage)
- Retry logic using tenacity
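The backoff pattern that tenacity automates looks roughly like this (with_backoff and the flaky demo are illustrative, not the project's actual code):

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff plus jitter until it succeeds."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Delays of 1s, 2s, 4s, ... plus a little jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demo: a call that fails twice, then succeeds (sleep stubbed out for speed)
calls = {'n': 0}
def flaky_embed():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError("rate limited")
    return "vector"

result = with_backoff(flaky_embed, sleep=lambda s: None)
```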
Step 4: Indexer (indexer.py)
Main orchestration pipeline:
def index_email(eml_file):
    # 1. Parse email
    email = parse_eml(eml_file)

    # 2. Chunk the body
    chunks = chunk_text(email['body'])

    # 3. Generate embeddings
    vectors = embed_batch(chunks)

    # 4. Store in Qdrant
    points = [
        {
            'id': generate_uuid(),
            'vector': vector,
            'payload': {
                'email_id': email['id'],
                'subject': email['subject'],
                'from': email['from'],
                'date': email['date'],
                'chunk_index': i,
                'chunk_text': chunk,
            },
        }
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
    qdrant.upsert(points)
5. Phase 3: Vector Search & Retrieval
Qdrant Collection Schema
Created collection with optimized settings:
collection_config = {
    'name': 'emails',
    'vector_size': 1536,
    'distance': 'Cosine',
    'payload_schema': {
        'email_id': 'keyword',
        'subject': 'text',
        'from': 'keyword',
        'date': 'integer',  # Unix timestamp for filtering
        'chunk_text': 'text',
    },
}
Retriever Implementation (retriever.py)
Semantic search with smart result handling:
1. Query embedding - Convert user question to vector
2. Vector search - Find top-k similar chunks
3. Deduplication - Group by email_id, keep best chunk per email
4. Filtering - Support date ranges, sender filters
5. Result formatting - Return structured email metadata
Search Parameters
- Default top-k: 8 chunks
- Score threshold: 0.3 (filter low-quality matches)
- Deduplication: Best chunk per email
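The deduplication step can be sketched as a best-chunk-per-email pass over the raw hits (the hit shape below mirrors the payload fields named above, but is simplified):

```python
def deduplicate(hits):
    """Keep only the highest-scoring chunk per email, best-first."""
    best = {}
    for hit in hits:
        email_id = hit['payload']['email_id']
        if email_id not in best or hit['score'] > best[email_id]['score']:
            best[email_id] = hit
    return sorted(best.values(), key=lambda h: h['score'], reverse=True)

hits = [
    {'score': 0.62, 'payload': {'email_id': 'a.eml', 'chunk_index': 0}},
    {'score': 0.81, 'payload': {'email_id': 'a.eml', 'chunk_index': 2}},
    {'score': 0.74, 'payload': {'email_id': 'b.eml', 'chunk_index': 1}},
]
top = deduplicate(hits)  # one entry per email, a.eml's best chunk first
```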
6. Phase 4: Chatbot Interface
CLI Chatbot (chatbot.py)
Interactive terminal interface with features:
- Natural language queries - "What did John say about the project?"
- Special commands:
  - /open <n> - Open email in Mail.app
  - /last - Show last search results
  - /help - Show available commands
  - /quit - Exit chatbot
System Prompt Engineering
Crafted a prompt that instructs GPT-4 to:
- Reference specific emails with numbers [1], [2], etc.
- Include dates and senders in responses
- Admit when information isn't in the retrieved context
- Stay focused on email content
Context Window Management
- Maximum 4000 tokens for email context
- Truncate older/less relevant chunks if over limit
- Always include system prompt and user query
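The truncation rule above reduces to a greedy token budget (word count stands in here for the real tokenizer, and the demo budget is tiny for illustration):

```python
def build_context(chunks, max_tokens=4000, count_tokens=lambda t: len(t.split())):
    """Add chunks in relevance order until the token budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # drop less relevant chunks that no longer fit
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept)

context = build_context(
    ["five words in this chunk",
     "another five word chunk here",
     "this one will not fit"],
    max_tokens=10,
)
```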
7. Phase 5: Mail.app Integration
Challenge: Opening Emails
Initial approach: Use AppleScript to search Mail.app by Message-ID
Problem: Searching 70,000+ emails took 20+ seconds, often timing out
Solution: Direct EML File Opening
Changed strategy:
- Store the EML filename as the email_id in Qdrant
- When the user requests /open, locate the EML file
- Use open -a Mail <filename>.eml to open it directly
Result: Opens in <1 second, no Mail.app database search
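The opener then reduces to building one shell command (EML_DIR below is an assumed location; the real path would come from config):

```python
import subprocess
from pathlib import Path

EML_DIR = Path("data/incoming_emails")  # assumed export directory

def open_command(email_id):
    """Build the macOS command that opens an EML file directly in Mail.app."""
    return ["open", "-a", "Mail", str(EML_DIR / f"{email_id}.eml")]

def open_in_mail(email_id):
    # macOS only; no Mail.app database search involved
    subprocess.run(open_command(email_id), check=True)

cmd = open_command("12345")
```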
AppleScript Export
Created AppleScript to export emails from Mail.app:
tell application "Mail"
    set theMessages to messages of mailbox "INBOX"
    repeat with theMessage in theMessages
        set emlContent to source of theMessage
        -- Save to data/incoming_emails/
    end repeat
end tell
Launcher Apps
Created double-clickable .app bundles:
- RAGMail.app - Launches CLI chatbot
- RAGMail Web.app - Launches web interface
8. Phase 6: Web Interface
Why Add Web UI?
- More accessible than terminal for casual use
- Better display of email metadata
- Easier to show clickable links
- Visual management of indexing/export
Architecture Decision
Keep CLI and Web completely separate:
- scripts/chatbot.py - Standalone CLI
- web/app.py - Flask backend
- Both share the same Qdrant database and config
Flask Backend (web/app.py)
REST API endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /api/chat | POST | Send message, get AI response |
| /api/stats | GET | Collection statistics |
| /api/export | POST | Trigger Mail.app export |
| /api/index | POST | Start background indexing |
| /api/index/status | GET | Poll indexing progress |
| /api/tracker/stats | GET | Indexed files statistics |
Frontend
Single-page app with:
- Chat message display with sources
- Clickable email links (opens in Mail.app)
- Management panel for export/indexing
- Real-time statistics display
9. Phase 7: Performance Optimization
Problem: Slow Re-indexing
Initial indexing checked Qdrant for every file:
- 9,518 files × 1 API call each = 5+ minutes just to determine what's new
Solution: SQLite File Tracker
Created indexed_files_tracker.py:
import sqlite3
from datetime import datetime

class IndexedFilesTracker:
    """Track which files have been indexed locally."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self._create_table()

    def _create_table(self):
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS indexed_files "
            "(filename TEXT PRIMARY KEY, email_id TEXT, indexed_at TEXT)"
        )

    def is_indexed(self, filename):
        """Check if file already indexed (O(1) lookup)."""
        cursor = self.conn.execute(
            "SELECT 1 FROM indexed_files WHERE filename = ?",
            (filename,)
        )
        return cursor.fetchone() is not None

    def mark_indexed(self, filename, email_id):
        """Record that file has been indexed."""
        self.conn.execute(
            "INSERT OR REPLACE INTO indexed_files VALUES (?, ?, ?)",
            (filename, email_id, datetime.now().isoformat())
        )
        self.conn.commit()
Result: Skip check for 9,518 files drops from 5+ minutes to <1 second
Migration Script
Created populate_tracker.py to backfill tracker with existing indexed emails.
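The incremental skip loop the tracker enables looks roughly like this (an in-memory SQLite database stands in for the on-disk tracker file):

```python
import sqlite3

# Stand-in for the tracker database; real code opens a file path
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indexed_files (filename TEXT PRIMARY KEY, email_id TEXT, indexed_at TEXT)"
)
conn.execute("INSERT INTO indexed_files VALUES ('a.eml', 'a', '2025-01-01T00:00:00')")

def is_indexed(filename):
    """Local primary-key lookup instead of a round trip to Qdrant."""
    row = conn.execute(
        "SELECT 1 FROM indexed_files WHERE filename = ?", (filename,)
    ).fetchone()
    return row is not None

incoming = ["a.eml", "b.eml", "c.eml"]
to_index = [f for f in incoming if not is_indexed(f)]  # only the new files
```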
10. Phase 8: Date-Aware Search
Problem: Natural Language Dates
Users ask questions like:
- "What emails did I get last week?"
- "Find messages from November 2024"
- "Show me yesterday's emails"
The semantic search found relevant content but didn't filter by date.
Solution: Date Parser (date_parser.py)
Natural language date extraction:
from datetime import datetime, timedelta

def parse_date_query(query: str) -> tuple[datetime, datetime]:
    """Extract date range from natural language query."""
    now = datetime.now()
    start_of_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    first_of_month = start_of_today.replace(day=1)
    patterns = {
        'today': (start_of_today, now),
        'yesterday': (start_of_today - timedelta(days=1), start_of_today),
        'last week': (now - timedelta(days=7), now),
        'last month': ((first_of_month - timedelta(days=1)).replace(day=1), first_of_month),
        'this week': (start_of_today - timedelta(days=start_of_today.weekday()), now),
        # ... more patterns
    }
    for phrase, date_range in patterns.items():
        if phrase in query.lower():
            return date_range
    return None, None
Integration with Search
Modified retriever to accept date filters:
def search_emails(query, date_from=None, date_to=None):
    # Parse dates from query
    if date_from is None:
        date_from, date_to = parse_date_query(query)

    # Build Qdrant filter
    filters = []
    if date_from:
        filters.append({'key': 'date', 'range': {'gte': date_from.timestamp()}})
    if date_to:
        filters.append({'key': 'date', 'range': {'lte': date_to.timestamp()}})

    # Search with filter
    results = qdrant.search(query_vector, filter=filters)
11. Key Challenges & Solutions
Challenge 1: Email ID Mismatch
Problem: Mail.app's internal message ID differs from the Message-ID header
Solution: Use EML filename as the canonical ID. All systems (Qdrant, tracker, opener) use the same filename-based ID.
Challenge 2: API Version Compatibility
Problem: Qdrant client v1.16 warned about server v1.12.6 compatibility
Solution: Pin specific compatible versions, accept warnings for minor mismatches, test critical operations.
Challenge 3: Rate Limiting
Problem: OpenAI embedding API has rate limits
Solution:
- Batch requests (up to 2048 texts per call)
- Exponential backoff with tenacity
- Progress bars for long operations
- Resume capability for interrupted jobs
Challenge 4: Large Email Archives
Problem: 70,000+ emails in Mail.app, slow operations
Solution:
- File-level tracking (skip already-indexed)
- Batch processing with progress reporting
- Background indexing in web UI
- Direct file opening instead of Mail.app search
Challenge 5: Privacy in Git
Problem: Need to share code without exposing personal emails
Solution: Comprehensive .gitignore:
data/
*.eml
*.mbox
*.db
*.sqlite
logs/
.env
12. Lessons Learned
Technical Lessons
- Local-first is complex but worth it
  - Docker simplifies local database deployment
  - Trade-off: more setup steps for users
- Batch operations are essential
  - Individual API calls don't scale
  - Always design for bulk processing
- Track state locally
  - SQLite tracker dramatically improved performance
  - Don't rely on remote queries for local state
- Date handling is surprisingly hard
  - Timezone issues everywhere
  - Unix timestamps are the safest interchange format
- Test with real data early
  - Synthetic test emails hide encoding issues
  - Real mailboxes have weird edge cases
Architecture Lessons
- Modularity pays off
  - Each script does one thing
  - Easy to test, debug, and extend
- Keep CLI and Web separate
  - Different use cases, different interfaces
  - Shared backend logic via imports
- Design for resume/retry
  - Long operations will fail
  - Track progress, allow continuation
Product Lessons
- Privacy is a feature
  - Users care about where their data goes
  - Local-first opens doors cloud solutions can't
- Semantic search changes everything
  - "Find that email about the thing" actually works
  - Natural language is more intuitive than filters
- Integration matters
  - Opening emails in Mail.app is the magic moment
  - Search is only useful if you can act on results
Final Statistics
| Metric | Value |
|---|---|
| Total emails indexed | 9,518 |
| Vector points in Qdrant | 23,866 |
| Average chunks per email | ~2.5 |
| Index time (full) | ~25 minutes |
| Index time (incremental) | <1 second |
| Search latency | ~300ms |
| End-to-end response | ~2 seconds |
Future Possibilities
- [ ] Thread-aware search (group related emails)
- [ ] Attachment content indexing (PDFs, docs)
- [ ] Multi-account support
- [ ] Local LLM option (Ollama integration)
- [ ] Email summarization
- [ ] Scheduled background indexing