
RAG Systems in Production: Lessons Learned

AI Team
Nov 10, 2024
10 min read

Retrieval-Augmented Generation (RAG) has become the go-to architecture for building LLM applications that need access to proprietary or up-to-date information. After deploying RAG systems for multiple clients, we've learned valuable lessons about what works in production and what doesn't.


What is RAG?


RAG combines the generative capabilities of large language models with a retrieval system that fetches relevant context from a knowledge base. Instead of relying solely on the LLM's training data, RAG systems:


1. Take a user query

2. Retrieve relevant documents from a vector database

3. Pass those documents as context to the LLM

4. Generate a response grounded in the retrieved information
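The four steps above can be sketched as a minimal pipeline. Here `embed`, `vector_search`, and `llm_complete` are hypothetical stand-ins for your embedding model, vector database client, and LLM API; any real system would swap in its own clients.

```python
def answer_query(query, embed, vector_search, llm_complete, top_k=5):
    """Minimal RAG loop: retrieve context, then generate a grounded answer."""
    # 1. Embed the user query.
    query_vec = embed(query)
    # 2. Retrieve the most relevant chunks from the vector store.
    docs = vector_search(query_vec, top_k=top_k)
    # 3. Assemble the retrieved chunks into the prompt as context.
    context = "\n\n".join(d["text"] for d in docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    # 4. Generate a response grounded in that context.
    return llm_complete(prompt)
```

Everything that follows in this post is about making each of these four steps production-grade.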


Our Production Use Cases


We've deployed RAG systems for:

  • **Customer support chatbots** answering product questions
  • **Internal knowledge bases** helping employees find information
  • **Code documentation assistants** for developer tools
  • **Legal document analysis** for contract review

Challenge 1: Retrieval Quality


    **The Problem:** The LLM is only as good as the documents you retrieve. Poor retrieval = poor answers, no matter how capable your LLM is.


    What We Learned:


    **Chunk Size Matters:** We experimented with chunk sizes from 128 to 2048 tokens. The sweet spot for most applications was 512-768 tokens with a 128-token overlap. Smaller chunks lose context; larger chunks dilute relevance.
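A sliding-window chunker with overlap is straightforward to sketch. This version operates on a pre-tokenized list (a stand-in for whatever tokenizer your embedding model uses):

```python
def chunk_tokens(tokens, size=512, overlap=128):
    """Split a token list into fixed-size chunks, overlapping neighbors by `overlap`."""
    step = size - overlap  # each new chunk starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # final window reached the end of the document
            break
    return chunks
```

The overlap means a sentence split at a chunk boundary still appears whole in at least one chunk.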


    **Hybrid Search Wins:** Combining semantic search (embeddings) with keyword search (BM25) consistently outperformed either approach alone. We use a weighted combination: 70% semantic, 30% keyword.
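One simple way to implement the weighted combination is to min-max normalize each ranker's scores (embedding similarity and BM25 live on different scales) and blend them 70/30. The score dictionaries here are illustrative inputs, keyed by document ID:

```python
def hybrid_scores(semantic, keyword, w_sem=0.7, w_kw=0.3):
    """Blend normalized semantic and keyword (BM25) scores per document id."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    # A document found by only one ranker gets 0 from the other.
    return {d: w_sem * sem.get(d, 0.0) + w_kw * kw.get(d, 0.0)
            for d in set(sem) | set(kw)}
```

Other fusion schemes (e.g., reciprocal rank fusion) work too; weighted score fusion is just the one described above.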


    **Metadata Filtering is Critical:** Adding metadata (date, category, source, author) and allowing filtered retrieval dramatically improved results. Example: "Show me documentation from 2024" or "Search only engineering docs."
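Most vector databases support attaching such filters to the query itself; conceptually the filter is just an exact match on each requested metadata field. A minimal in-memory sketch (document shape is an assumption):

```python
def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every given criterion."""
    return [d for d in docs
            if all(d["meta"].get(k) == v for k, v in criteria.items())]
```

In production you would push this filter down into the vector database query rather than filtering after retrieval, so the top-k results are drawn only from matching documents.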


    **Reranking Improves Precision:** After initial retrieval, we use a cross-encoder reranker (e.g., Cohere rerank) to reorder results. This typically improves relevance by 20-30%.
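The reranking step itself is just a sort by a query-document relevance score. The sketch below abstracts the cross-encoder behind a `score_fn(query, text)` callable (a stand-in for whichever reranker API you use):

```python
def rerank(query, docs, score_fn, top_n=5):
    """Reorder retrieved docs by cross-encoder relevance score; keep the top_n."""
    scored = sorted(docs, key=lambda d: score_fn(query, d["text"]), reverse=True)
    return scored[:top_n]
```

Cross-encoders score the query and document together, which is slower but more accurate than the bi-encoder similarity used for first-stage retrieval, so you only run them on the small candidate set.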


    Challenge 2: Context Window Management


    **The Problem:** LLMs have token limits. With multiple retrieved documents, you can quickly exceed the context window.


    Our Solution:


    1. **Retrieve More, Use Less:** Retrieve 20 documents but pass only the top 5 to the LLM after reranking

    2. **Dynamic Context:** Adjust number of documents based on query complexity

    3. **Compression:** Use summarization for long documents before passing to LLM

    4. **Streaming Context:** For long conversations, we implemented "rolling context" that maintains relevance
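The "retrieve more, use less" idea combines naturally with a hard token budget: take reranked documents in order and stop before the context window overflows. The `count_tokens` default here is a character count, a deliberate stand-in for a real tokenizer:

```python
def select_context(reranked_docs, max_docs=5, token_budget=3000, count_tokens=len):
    """Take top reranked docs in order, stopping before the token budget is exceeded."""
    chosen, used = [], 0
    for doc in reranked_docs[:max_docs]:
        cost = count_tokens(doc["text"])
        if used + cost > token_budget:
            break  # adding this doc would overflow the context window
        chosen.append(doc)
        used += cost
    return chosen
```

Because the input is already reranked, truncating from the tail drops the least relevant documents first.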


    Challenge 3: Hallucination Prevention


    **The Problem:** Even with RAG, LLMs sometimes generate information not in the retrieved documents.


    Our Mitigation Strategies:


    **Explicit Instructions:** System prompts that emphasize "only use information from provided documents" and "say 'I don't know' if information isn't available."
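One possible wording of such a system prompt (this exact text is illustrative, not the one we ship):

```python
GROUNDED_SYSTEM_PROMPT = (
    "You are a support assistant. Answer using ONLY the information in the "
    "provided documents. If the documents do not contain the answer, reply "
    "exactly: \"I don't know.\" Never guess or use outside knowledge. "
    "Cite the source document ID, e.g. [doc-123], for every claim."
)
```

Giving the model a specific fallback phrase ("I don't know") matters: without an explicit escape hatch, models tend to answer anyway.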


    **Citation Tracking:** We return source document IDs with each response, allowing users to verify claims. Example:

    "According to the pricing page [doc-123], our Enterprise plan starts at $500/month."


    **Confidence Scoring:** We implemented a post-processing step that scores whether the response is supported by retrieved documents. Low-confidence responses trigger human review.
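A post-processing support check can be as simple as measuring how much of the response is actually present in the retrieved documents. The lexical-overlap heuristic below is a deliberately crude sketch; an NLI or LLM-based grader would be more robust:

```python
def support_score(response, docs):
    """Crude support check: share of response content words found in retrieved docs."""
    doc_words = set()
    for d in docs:
        doc_words.update(d.lower().split())
    # Skip very short words as a rough stand-in for stopword removal.
    words = [w for w in response.lower().split() if len(w) > 3]
    if not words:
        return 0.0
    return sum(1 for w in words if w in doc_words) / len(words)
```

Responses scoring below a chosen threshold (the threshold itself is tuned per application) are routed to human review.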


    **Answer Verification:** For critical applications (e.g., medical, legal), we use a second LLM call to verify the response against source documents.


    Challenge 4: Cold Start and Latency


    **The Problem:** Production RAG systems need to be fast. Users won't wait 10+ seconds for a response.


    Optimizations:


    **Vector Database Selection:** We tested Pinecone, Weaviate, and Qdrant. Pinecone provided the best p99 latency (<100ms for most queries).


    **Caching:** Aggressive caching of embeddings and common queries. Cache hit rate >60% for typical support chatbots.


    **Parallel Processing:** Run retrieval and reranking in parallel when possible.


    **Streaming Response:** When using multiple retrieval sources, start streaming the LLM response as soon as the fastest sources return, rather than waiting for every source to finish.


    Our median response time: 1.2 seconds (retrieval + generation + streaming).


    Challenge 5: Keeping Knowledge Fresh


    **The Problem:** Documentation and knowledge bases change frequently. Stale information leads to incorrect answers.


    Our Update Strategy:


    **Incremental Updates:** When a document changes, we only re-embed and re-index that document, not the entire knowledge base.


    **Change Detection:** Monitor source systems (CMS, documentation sites, databases) for updates. Trigger re-indexing automatically.


    **Version Control:** Keep multiple versions of documents in the vector database with timestamp metadata. Retrieve from the most recent version by default.
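With timestamp metadata on every indexed chunk, "retrieve from the most recent version" is a simple reduction over the candidates. The chunk shape (`doc_id`, `updated_at`) is an assumption for illustration:

```python
def latest_versions(chunks):
    """Keep only the newest version of each document, by timestamp metadata."""
    newest = {}
    for c in chunks:
        doc_id = c["doc_id"]
        # ISO-8601 timestamps compare correctly as strings.
        if doc_id not in newest or c["updated_at"] > newest[doc_id]["updated_at"]:
            newest[doc_id] = c
    return list(newest.values())
```

In practice this filter runs over the retrieval results (or is expressed as a metadata filter in the vector database query), while older versions remain indexed for auditing or explicit historical queries.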


    **Manual Verification:** For critical updates (e.g., pricing changes), human verification before indexing.


    Challenge 6: Cost Management


    **The Problem:** RAG systems can get expensive: vector database hosting, embedding costs, and LLM inference costs all add up.


    Cost Optimization:


    **Semantic Caching:** Cache embeddings for common queries. Reduced embedding costs by 65%.
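The simplest version of this cache keys on a normalized query string, so repeated or trivially reworded queries skip the embedding call. (True semantic caching also matches paraphrases via embedding similarity; this exact-match sketch is a simplification.)

```python
class EmbeddingCache:
    """Cache embeddings keyed by a normalized query string."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the real (paid) embedding call
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, query):
        key = " ".join(query.lower().split())  # normalize case and whitespace
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(key)
        return self.store[key]
```

Tracking hit/miss counts is worth the two extra lines: the cache hit rate is what tells you whether the cache is paying for itself.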


    **Smaller Models Where Possible:** Use GPT-3.5 for simple queries, GPT-4 only when needed. We built a classifier to route queries.
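Our production router is a trained classifier; the keyword-and-length heuristic below is only a toy illustration of the routing idea:

```python
def route_model(query, cheap="gpt-3.5-turbo", strong="gpt-4"):
    """Toy router: send short, FAQ-like queries to the cheap model."""
    complex_markers = ("compare", "why", "explain", "analyze")
    is_complex = (len(query.split()) > 20
                  or any(m in query.lower() for m in complex_markers))
    return strong if is_complex else cheap
```

Even a crude router helps, because query difficulty in support traffic is heavily skewed toward simple lookups.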


    **Batch Processing:** Batch embed documents rather than one-at-a-time. 5-10× faster and cheaper.


    **Smart Retrieval:** Don't retrieve from the vector database for every query. Simple FAQs can be handled with a cached lookup.


    Best Practices Summary


    1. **Invest in data quality:** Clean, well-structured, properly chunked documents are more important than model size

    2. **Measure everything:** Track retrieval precision, answer quality, latency, and user satisfaction

    3. **Human-in-the-loop:** Especially for high-stakes applications, have a human review or approval step

    4. **Start simple:** Begin with basic semantic search before adding reranking, hybrid search, etc.

    5. **Continuous improvement:** Collect feedback, analyze failures, and retrain/refine regularly


    Conclusion


    RAG is powerful but non-trivial to implement well. The challenges are mostly in the retrieval and data pipeline, not the LLM itself. Focus on getting high-quality, relevant context to the LLM, and the generated responses will be significantly better.


    After deploying RAG systems for over a year, we're convinced it's the right architecture for most LLM applications that need access to proprietary or current information. But success requires careful attention to retrieval quality, latency, and ongoing maintenance.


    AI Team

    The AI Team at Senpai Software shares insights and best practices from real-world software development projects.