This project implements a configurable, extensible Retrieval-Augmented Generation (RAG) pipeline for question answering over custom PDF documents. It combines BM25, vector search, and ensemble retrieval with an optional cross-encoder reranker to improve the relevance of retrieved contexts. An evaluation framework measures performance with BLEU, ROUGE, and reference-accuracy metrics.
Detailed information: see `RAG report.pdf`.
- Configurable RAG Pipeline: Easily customize LLM, embedding models, chunking strategies, and retrieval parameters via command-line arguments.
- Hybrid Retrieval: Combines BM25 (sparse retrieval) and Vector Search (dense retrieval) using an Ensemble Retriever for comprehensive document recall.
- Context Reranking: Integrates a Cross-Encoder Reranker (e.g., `BAAI/bge-reranker-base` or `bge-reranker-v2-m3`) to re-rank retrieved documents, prioritizing the most relevant chunks for the LLM.
- Persistent Vector Store: Utilizes ChromaDB to store document embeddings, allowing for efficient reloading and avoiding redundant embedding generation.
- Comprehensive Evaluation: Automatically evaluates the RAG pipeline's performance using:
  - BLEU Score: Measures n-gram overlap (precision-oriented) between generated answers and gold-standard answers.
  - ROUGE-1 & ROUGE-2 Recall: Measures how much of the reference answer's unigrams and bigrams appear in the generated answer.
  - Reference Accuracy: Checks whether the retrieved context's source page matches the gold reference page.
- Jieba Integration: Supports Chinese text processing by integrating Jieba for tokenization in document loading, splitting, and NLP evaluation metrics.
- Detailed Logging: Saves hyperparameters and evaluation results for each run in a timestamped JSON file, facilitating experiment tracking and comparison.
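The hybrid retrieval feature can be illustrated with weighted reciprocal rank fusion, one common way to merge a sparse ranking and a dense ranking. This is a minimal sketch, not the project's own code: the function name `weighted_rrf`, the smoothing constant `c=60`, and the chunk IDs are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (c + rank) for every list it appears in,
    with rank starting at 1. `c` is a smoothing constant (assumed value).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs returned by each retriever.
bm25_hits = ["chunk_a", "chunk_b", "chunk_c"]
vector_hits = ["chunk_b", "chunk_d", "chunk_a"]
fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.5, 0.5])
print(fused)
```

Chunks found by both retrievers rise to the top of the fused list. A cross-encoder reranker would then rescore these candidates against the question and keep only the best few.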
Follow these steps to set up and run the RAG pipeline.
- Python 3.9+
- `pip` package manager
- Clone the repository:
    git clone https://github.com/your-username/your-rag-project.git
    cd your-rag-project
- Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required dependencies:
    pip install -r requirements.txt
- Download NLTK data:
  Run the following in your Python environment or add it to your setup script:
    import nltk
    nltk.download('punkt')  # For tokenization in BLEU
- Environment variables: Create a `.env` file in the root directory to store your OpenAI API key and base URL (if using a self-hosted or different endpoint for `ChatOpenAI`).
    OPENAI_API_KEY="your_openai_api_key_here"
    OPENAI_BASE_URL="https://your_llm_api_base_url_here/v1"  # e.g., for self-hosted LLM
  If you're using OpenAI's default API, `OPENAI_BASE_URL` might not be strictly necessary, but it's good practice to include it if your `ChatOpenAI` setup relies on it.
- Prepare your data:
  - Place your PDF document in the project directory. The default is `./汽车介绍手册.pdf`.
  - Create a JSON file with your QA pairs for evaluation. The default is `./QA_pairs.json`. The format should be a list of dictionaries, like this:
        [
          {
            "question": "操作多媒体娱乐功能需要注意什么?",
            "answer": "确保将车辆停驻在安全地点,将挡位切换至驻车挡(P)并使用驻车制动",
            "reference": "page_5"
          },
          {
            "question": "Lynk & Co App 要多少钱?",
            "answer": "结合给定的资料,无法回答问题。",
            "reference": "page_unknown"
          }
        ]
    In English, the first pair asks what to watch out for when operating the multimedia entertainment functions (answer: park the vehicle in a safe place, shift to park (P), and apply the parking brake). The second asks how much the Lynk & Co App costs, and its gold answer states that the question cannot be answered from the given material.
    - `question`: The query to be posed to the RAG system.
    - `answer`: The gold-standard answer for the question.
    - `reference`: The page number(s) where the answer can be found in the original PDF (e.g., "page_5", "page_10,page_12"). Use "page_unknown" if the answer is not in the document.
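The `reference` convention described above lends itself to a simple hit check. The following is a sketch under assumptions: the helper names `parse_reference`, `reference_hit`, and `reference_accuracy` are hypothetical, retrieved pages are assumed to carry the same `page_N` labels as the gold reference, and the handling of `page_unknown` is one possible interpretation. The project's own matching logic may differ.

```python
def parse_reference(reference):
    """Split a gold reference such as "page_10,page_12" into a set of labels."""
    return {part.strip() for part in reference.split(",") if part.strip()}


def reference_hit(gold_reference, retrieved_pages):
    """Return True if any retrieved page label is among the gold pages.

    Assumption: for "page_unknown" (answer not in the document), a hit
    means the system also reported no known source page.
    """
    gold = parse_reference(gold_reference)
    retrieved = set(retrieved_pages)
    if gold == {"page_unknown"}:
        return not retrieved or retrieved == {"page_unknown"}
    return bool(gold & retrieved)


def reference_accuracy(pairs):
    """Fraction of (gold_reference, retrieved_pages) pairs that hit."""
    if not pairs:
        return 0.0
    return sum(reference_hit(gold, pages) for gold, pages in pairs) / len(pairs)


examples = [
    ("page_5", ["page_5", "page_7"]),   # hit
    ("page_10,page_12", ["page_12"]),   # hit on one of two gold pages
    ("page_unknown", ["page_3"]),       # miss: a page was cited anyway
    ("page_unknown", []),               # hit: nothing was cited
]
accuracy = reference_accuracy(examples)
print(f"{accuracy:.2%}")
```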
You can run the main.py script directly from the command line, customizing parameters as needed. A test.sh script is provided for convenience.
    bash test.sh

This script executes `main.py` with predefined arguments, as shown in `test.sh`:
python main.py \
--pdf_path "./汽车介绍手册.pdf" \
--questions_file "./QA_pairs.json" \
--llm_model "c101-qwen25-72b" \
--embedding_model "Qwen3-Embedding-4B" \
--chunk_size 300 \
--chunk_overlap 100 \
--bm25_k 3 \
--vector_k 3 \
--ensemble_weights "0.5,0.5" \
--reranker_model "bge-reranker-v2-m3" \
--reranker_top_n 3

You can also run `main.py` directly and specify your own parameters:
python main.py \
--pdf_path "./your_document.pdf" \
--questions_file "./your_qa_data.json" \
--llm_model "gpt-3.5-turbo" \
--embedding_model "all-MiniLM-L6-v2" \
--chunk_size 500 \
--chunk_overlap 50 \
--bm25_k 5 \
--vector_k 10 \
--ensemble_weights "0.3,0.7" \
--reranker_model "cross-encoder/ms-marco-TinyBERT-L-2" \
--reranker_top_n 5

.
├── config.py # Configuration class for hyperparameters
├── rag_component.py # Core RAG pipeline components (document loading, splitting, vector store, retrievers, RAG chain)
├── evaluator.py # Evaluation logic (batch answering, NLP metrics, reference accuracy, result saving)
├── main.py # Main entry point, orchestrates the RAG pipeline and evaluation
├── test.sh # Example shell script to run the pipeline with specific arguments
├── .env # Environment variables (e.g., API keys - add to .gitignore)
├── requirements.txt # Python dependencies
├── QA_pairs.json # Example QA pairs for evaluation
├── 汽车介绍手册.pdf # Example PDF document
├── chroma_db/ # Directory for persistent ChromaDB vector stores (automatically generated)
│ └── ...
└── rag_results/ # Directory for evaluation results (detailed answers and summary JSONs)
    ├── summary_YYYYMMDD_HHMMSS.json
    └── answers_detailed_YYYYMMDD_HHMMSS.json
The Config class centralizes all hyperparameters, making it easy to manage and experiment with different settings.
- `LLM_MODEL_NAME`: Name of the large language model to use (e.g., `c101-qwen25-72b`, `gpt-4`).
- `EMBEDDINGS_MODEL_NAME`: Name of the HuggingFace embedding model (e.g., `m3e-base`, `Qwen3-Embedding-4B`).
- `CHUNK_SIZE`: Maximum size of text chunks after document splitting.
- `CHUNK_OVERLAP`: Overlap between consecutive text chunks.
- `PDF_FILE_PATH`: Path to the input PDF document.
- `PERSIST_DIRECTORY`: Directory where ChromaDB will store vector embeddings. Dynamically generated based on embedding model, chunk size, and overlap.
- `BM25_K`: Number of documents to retrieve using BM25.
- `VECTOR_K`: Number of documents to retrieve using Vector Search.
- `ENSEMBLE_WEIGHTS`: Weights for the Ensemble Retriever, e.g., `[0.5, 0.5]` for equal weighting.
- `RERANKER_MODEL_PATH`: Path or name of the Cross-Encoder reranker model (e.g., `BAAI/bge-reranker-base`).
- `RERANKER_TOP_N`: Number of top documents to keep after reranking.
- `QUESTIONS_FILE`: Path to the JSON file containing evaluation questions and gold answers.
- `ANSWERS_OUTPUT_FILE`: Temporary path for detailed answers. The actual output files in `rag_results/` will be timestamped.
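To make the derived `PERSIST_DIRECTORY` and the comma-separated weights concrete, here is a small sketch. It is not the project's `Config` class: `SketchConfig` and `parse_weights` are hypothetical names, the directory pattern is inferred from the sample console output (`./chroma_db/Qwen3-Embedding-4B_cs300_co100`), and replacing `/` in model names is an added assumption.

```python
from dataclasses import dataclass, field
from typing import List


def parse_weights(text):
    """Turn a command-line string such as "0.3,0.7" into [0.3, 0.7]."""
    weights = [float(part) for part in text.split(",")]
    if len(weights) != 2:
        raise ValueError("expected exactly two weights")
    return weights


@dataclass
class SketchConfig:
    """Illustrative stand-in covering a subset of the documented fields."""

    embeddings_model_name: str = "Qwen3-Embedding-4B"
    chunk_size: int = 300
    chunk_overlap: int = 100
    ensemble_weights: List[float] = field(default_factory=lambda: [0.5, 0.5])

    @property
    def persist_directory(self):
        # Pattern inferred from the sample console output. Replacing "/" is
        # an assumption so that names like "BAAI/..." stay one folder.
        safe_name = self.embeddings_model_name.replace("/", "_")
        return f"./chroma_db/{safe_name}_cs{self.chunk_size}_co{self.chunk_overlap}"


config = SketchConfig(ensemble_weights=parse_weights("0.5,0.5"))
print(config.persist_directory)
```

Keying the directory on the embedding model and chunking parameters means that changing any of them points to a different store, so embeddings built with other settings are not reused by mistake.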
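After each run, the system prints the evaluation results to the console and saves detailed and summary JSON files in the `rag_results/` directory. Some console labels are printed in Chinese: `RAG 管道配置` (RAG pipeline configuration), `平均 BLEU 分数` (average BLEU score), `平均 ROUGE-1 Recall 分数` / `平均 ROUGE-2 Recall 分数` (average ROUGE-1 / ROUGE-2 recall score), and `引用页码命中率` (reference page hit rate). A sample run looks like this: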
--- RAG 管道配置 ---
LLM_MODEL_NAME: c101-qwen25-72b
EMBEDDINGS_MODEL_NAME: Qwen3-Embedding-4B
CHUNK_SIZE: 300
CHUNK_OVERLAP: 100
PDF_FILE_PATH: ./汽车介绍手册.pdf
PERSIST_DIRECTORY: ./chroma_db/Qwen3-Embedding-4B_cs300_co100
BM25_K: 3
VECTOR_K: 3
ENSEMBLE_WEIGHTS: [0.5, 0.5]
RERANKER_MODEL_PATH: bge-reranker-v2-m3
RERANKER_TOP_N: 3
QUESTIONS_FILE: ./QA_pairs.json
ANSWERS_OUTPUT_FILE: ./answers_temp.json
----------------------------------
Loaded 15 pages from 汽车介绍手册.pdf
Split documents into 78 chunks with chunk_size=300, overlap=100.
Creating and persisting new vector store to ./chroma_db/Qwen3-Embedding-4B_cs300_co100
Setting up retrievers...
BM25 Retriever initialized with k=3
Vector Retriever initialized with k=3
Ensemble Retriever initialized with weights=[0.5, 0.5]
Initializing Reranker model: bge-reranker-v2-m3
Retriever with reranker set up. Top N after reranking: 3
RAG chain set up.
Starting batch answering for 20 questions...
Processing question 1/20: 操作多媒体娱乐功能需要注意什么?...
Processing question 10/20: 什么是主动安全?...
Processing question 20/20: 如何开启巡航控制?...
Batch answering complete. Results saved to rag_results/answers_detailed_20240723_103045.json
--- Evaluation Results ---
平均 BLEU 分数: 0.2543
平均 ROUGE-1 Recall 分数: 0.3876
平均 ROUGE-2 Recall 分数: 0.1522
引用页码命中率: 85.00%
Run summary and hyperparameters saved to: rag_results/summary_20240723_103045.json
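The ROUGE recall figures in the output above can be read as the share of reference n-grams that also appear in the generated answer. The sketch below is illustrative rather than the project's implementation: `rouge_n_recall` is a hypothetical helper, and it takes pre-tokenized lists (the project tokenizes Chinese text with Jieba, while the tokens here are hand-split English stand-ins).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate_tokens, reference_tokens, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    reference_counts = ngrams(reference_tokens, n)
    if not reference_counts:
        return 0.0
    candidate_counts = ngrams(candidate_tokens, n)
    overlap = sum(
        min(count, candidate_counts[gram])
        for gram, count in reference_counts.items()
    )
    return overlap / sum(reference_counts.values())


# Hand-tokenized stand-ins for a gold answer and a generated answer.
reference = ["shift", "to", "park", "and", "apply", "the", "parking", "brake"]
candidate = ["shift", "to", "park", "then", "apply", "the", "brake"]
r1 = rouge_n_recall(candidate, reference, n=1)
r2 = rouge_n_recall(candidate, reference, n=2)
print(r1, r2)
```

Here six of the eight reference unigrams and three of the seven reference bigrams are covered, which is why ROUGE-2 recall is typically lower than ROUGE-1 recall for the same answer.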
The `rag_results/` directory will contain JSON files like:
- `summary_YYYYMMDD_HHMMSS.json`: Contains the hyperparameters used for the run and the aggregated evaluation scores (BLEU, ROUGE-1 Recall, ROUGE-2 Recall, Reference Accuracy).
- `answers_detailed_YYYYMMDD_HHMMSS.json`: Contains each question, its gold answer, the LLM's generated answer, and the retrieved reference page, allowing for detailed analysis.
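The timestamped file naming can be sketched as follows. This is an assumption-laden illustration: the function `save_summary`, the JSON keys (`hyperparameters`, `scores`, `bleu`, `rouge1_recall`), and the temporary directory are hypothetical, and only the `summary_YYYYMMDD_HHMMSS.json` pattern and the example values come from this README.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_summary(hyperparameters, scores, results_dir, now=None):
    """Write one run's settings and scores to summary_<timestamp>.json."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"summary_{stamp}.json"
    payload = {"hyperparameters": hyperparameters, "scores": scores}
    # ensure_ascii=False keeps Chinese text readable in the saved file.
    path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path


# A temporary directory stands in for rag_results/ in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_summary(
        {"CHUNK_SIZE": 300, "CHUNK_OVERLAP": 100},
        {"bleu": 0.2543, "rouge1_recall": 0.3876},
        results_dir=tmp,
        now=datetime(2024, 7, 23, 10, 30, 45),
    )
    saved_name = saved.name
    loaded = json.loads(saved.read_text(encoding="utf-8"))
print(saved_name)
```

Because each run writes to its own timestamped file, earlier results are not overwritten and runs with different hyperparameters can be compared side by side.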
Contributions are welcome! If you have suggestions for improvements, new features, or bug fixes, please open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain for providing the framework for building LLM applications.
- Hugging Face Transformers for access to various pre-trained models.
- Chroma for the vector database.
- Jieba for Chinese text segmentation.
- ROUGE-Chinese for Chinese ROUGE evaluation.