RAG Pipeline for Document Q&A

This project implements a configurable Retrieval-Augmented Generation (RAG) pipeline for question answering over custom PDF documents. It combines BM25, vector search, and ensemble retrieval with an optional Cross-Encoder reranker to improve the relevance of the retrieved context. An evaluation framework measures performance with BLEU, ROUGE, and reference-accuracy metrics.

Detailed information: RAG report.pdf

✨ Features

Configurable RAG Pipeline: Easily customize LLM, embedding models, chunking strategies, and retrieval parameters via command-line arguments.
Hybrid Retrieval: Combines BM25 (sparse retrieval) and Vector Search (dense retrieval) using an Ensemble Retriever for comprehensive document recall.
Context Reranking: Integrates a Cross-Encoder Reranker (e.g., BAAI/bge-reranker-base or bge-reranker-v2-m3) to re-rank retrieved documents, prioritizing the most relevant chunks for the LLM.
Persistent Vector Store: Utilizes ChromaDB to store document embeddings, allowing for efficient reloading and avoiding redundant embedding generation.
Comprehensive Evaluation: Automatically evaluates the RAG pipeline's performance using:
- BLEU Score: Measures n-gram overlap (precision) between generated answers and the gold-standard answers.
- ROUGE-1 & ROUGE-2 Recall: Evaluates content overlap (recall) between generated and reference answers.
- Reference Accuracy: Checks if the retrieved context's source page matches the gold reference page.
Jieba Integration: Supports Chinese text processing by integrating Jieba for tokenization in document loading, splitting, and NLP evaluation metrics.
Detailed Logging: Saves hyperparameters and evaluation results for each run in a timestamped JSON file, facilitating experiment tracking and comparison.
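
To make the hybrid retrieval idea concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion, a common rule for merging a sparse ranking and a dense ranking. Whether this project's Ensemble Retriever uses exactly this rule is an assumption; the function name `weighted_rrf`, the constant `c=60`, and the chunk IDs are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Each document scores weight / (c + rank) in every ranking that contains it;
    the per-ranking scores are summed. `c` is a smoothing constant (assumed 60).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by ID so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results (best hit first) from the two retrievers.
bm25_hits = ["chunk_7", "chunk_2", "chunk_9"]
vector_hits = ["chunk_2", "chunk_4", "chunk_7"]
fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.5, 0.5])
print(fused)
```

Chunks found by both retrievers rise to the top, which is why the fused list is a reasonable candidate set to hand to a reranker.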

🚀 Getting Started

Follow these steps to set up and run the RAG pipeline.

Prerequisites

Python 3.9+
pip package manager

Installation

Clone the repository:

git clone https://github.com/your-username/your-rag-project.git
cd your-rag-project

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

Install the required dependencies:
```
pip install -r requirements.txt
```
Download NLTK data: Run the following in your Python environment or add it to your setup script:
```
import nltk
nltk.download('punkt') # For tokenization in BLEU
```

Configuration

Environment Variables: Create a .env file in the root directory to store your OpenAI API key and base URL (if using a self-hosted or different endpoint for ChatOpenAI).
```
OPENAI_API_KEY="your_openai_api_key_here"
OPENAI_BASE_URL="https://your_llm_api_base_url_here/v1" # e.g., for self-hosted LLM
```
If you use OpenAI's default API endpoint, OPENAI_BASE_URL can usually be omitted; set it only when ChatOpenAI should talk to a different endpoint, such as a self-hosted LLM.
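
As a rough sketch of how these two variables might be consumed, the snippet below reads them from the environment and treats the base URL as optional. The helper name `load_llm_settings` is hypothetical and not taken from the project code; it assumes the .env file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`).

```python
import os


def load_llm_settings(env=None):
    """Return the API key and the optional base URL from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    # A missing or empty base URL means "use the client's default endpoint".
    base_url = env.get("OPENAI_BASE_URL") or None
    return {"api_key": api_key, "base_url": base_url}


# Passing a plain dict instead of os.environ keeps the example self-contained.
settings = load_llm_settings({"OPENAI_API_KEY": "your_openai_api_key_here"})
print(settings)
```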
Prepare your data:
- Place your PDF document in the project directory. The default is ./汽车介绍手册.pdf.
- Create a JSON file with your QA pairs for evaluation. The default is ./QA_pairs.json. The format should be a list of dictionaries, like this:
```
[
  {
    "question": "操作多媒体娱乐功能需要注意什么？",
    "answer": "确保将车辆停驻在安全地点，将挡位切换至驻车挡（P）并使用驻车制动",
    "reference": "page_5"
  },
  {
    "question": "Lynk & Co App 要多少钱？",
    "answer": "结合给定的资料，无法回答问题。",
    "reference": "page_unknown"
  }
]
```
  - question: The query to be posed to the RAG system.
  - answer: The gold standard answer for the question.
  - reference: The page number(s) where the answer can be found in the original PDF (e.g., "page_5", "page_10,page_12"). Use "page_unknown" if the answer is not in the document.
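
The format above can be checked before a run with a small validator. This is only a sketch: `load_qa_pairs` and `parse_reference` are hypothetical helpers, not functions from the project, and mapping "page_unknown" to an empty list is an assumption.

```python
import json

REQUIRED_KEYS = {"question", "answer", "reference"}


def parse_reference(reference):
    """Turn "page_10,page_12" into [10, 12]; "page_unknown" yields []."""
    pages = []
    for part in reference.split(","):
        part = part.strip()
        if not part.startswith("page_"):
            raise ValueError(f"unexpected reference label: {part!r}")
        label = part[len("page_"):]
        if label == "unknown":
            continue  # the answer is not in the document
        pages.append(int(label))
    return pages


def load_qa_pairs(text):
    """Parse the JSON text and check that every entry has the required keys."""
    pairs = json.loads(text)
    for index, pair in enumerate(pairs):
        missing = REQUIRED_KEYS - pair.keys()
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
    return pairs


sample = '[{"question": "q1", "answer": "a1", "reference": "page_10,page_12"}]'
pairs = load_qa_pairs(sample)
print(parse_reference(pairs[0]["reference"]))
```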

Running the Pipeline

You can run the main.py script directly from the command line, customizing parameters as needed. A test.sh script is provided for convenience.

Using `test.sh` (Recommended for default runs)

bash test.sh

This script runs main.py with the following predefined arguments:

python main.py \
    --pdf_path "./汽车介绍手册.pdf" \
    --questions_file "./QA_pairs.json" \
    --llm_model "c101-qwen25-72b" \
    --embedding_model "Qwen3-Embedding-4B" \
    --chunk_size 300 \
    --chunk_overlap 100 \
    --bm25_k 3 \
    --vector_k 3 \
    --ensemble_weights "0.5,0.5" \
    --reranker_model "bge-reranker-v2-m3" \
    --reranker_top_n 3

Manual Execution with Custom Arguments

You can also run main.py directly and specify your own parameters:

python main.py \
    --pdf_path "./your_document.pdf" \
    --questions_file "./your_qa_data.json" \
    --llm_model "gpt-3.5-turbo" \
    --embedding_model "all-MiniLM-L6-v2" \
    --chunk_size 500 \
    --chunk_overlap 50 \
    --bm25_k 5 \
    --vector_k 10 \
    --ensemble_weights "0.3,0.7" \
    --reranker_model "cross-encoder/ms-marco-TinyBERT-L-2" \
    --reranker_top_n 5
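
For readers curious how flags like these could be parsed, here is a minimal argparse sketch covering a subset of them. It is not the project's actual main.py: the defaults are assumptions copied from the test.sh values, and `parse_weights` and `build_parser` are hypothetical names.

```python
import argparse


def parse_weights(text):
    """Convert a string such as "0.3,0.7" into a list of two floats."""
    weights = [float(part) for part in text.split(",")]
    if len(weights) != 2:
        raise argparse.ArgumentTypeError(
            "expected two comma-separated weights, e.g. 0.5,0.5"
        )
    return weights


def build_parser():
    parser = argparse.ArgumentParser(description="RAG pipeline (argument sketch)")
    parser.add_argument("--chunk_size", type=int, default=300)
    parser.add_argument("--chunk_overlap", type=int, default=100)
    parser.add_argument("--bm25_k", type=int, default=3)
    parser.add_argument("--vector_k", type=int, default=3)
    parser.add_argument("--ensemble_weights", type=parse_weights, default=[0.5, 0.5])
    parser.add_argument("--reranker_top_n", type=int, default=3)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--ensemble_weights", "0.3,0.7", "--vector_k", "10"])
print(args.ensemble_weights, args.vector_k, args.chunk_size)
```

Parsing the weights in a `type=` callable means a malformed value is reported as a normal command-line error instead of failing later in the pipeline.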

📁 Project Structure

.
├── config.py             # Configuration class for hyperparameters
├── rag_component.py      # Core RAG pipeline components (document loading, splitting, vector store, retrievers, RAG chain)
├── evaluator.py          # Evaluation logic (batch answering, NLP metrics, reference accuracy, result saving)
├── main.py               # Main entry point, orchestrates the RAG pipeline and evaluation
├── test.sh               # Example shell script to run the pipeline with specific arguments
├── .env                  # Environment variables (e.g., API keys - add to .gitignore)
├── requirements.txt      # Python dependencies
├── QA_pairs.json         # Example QA pairs for evaluation
├── 汽车介绍手册.pdf      # Example PDF document
├── chroma_db/            # Directory for persistent ChromaDB vector stores (automatically generated)
│   └── ...
└── rag_results/          # Directory for evaluation results (detailed answers and summary JSONs)
    ├── summary_YYYYMMDD_HHMMSS.json
    └── answers_detailed_YYYYMMDD_HHMMSS.json

⚙️ Configuration (`config.py`)

The Config class centralizes all hyperparameters, making it easy to manage and experiment with different settings.

LLM_MODEL_NAME: Name of the large language model to use (e.g., c101-qwen25-72b, gpt-4).
EMBEDDINGS_MODEL_NAME: Name of the HuggingFace embedding model (e.g., m3e-base, Qwen3-Embedding-4B).
CHUNK_SIZE: Maximum size of text chunks after document splitting.
CHUNK_OVERLAP: Overlap between consecutive text chunks.
PDF_FILE_PATH: Path to the input PDF document.
PERSIST_DIRECTORY: Directory where ChromaDB will store vector embeddings. Dynamically generated based on embedding model, chunk size, and overlap.
BM25_K: Number of documents to retrieve using BM25.
VECTOR_K: Number of documents to retrieve using Vector Search.
ENSEMBLE_WEIGHTS: Weights for the Ensemble Retriever, e.g., [0.5, 0.5] for equal weighting.
RERANKER_MODEL_PATH: Path or name of the Cross-Encoder reranker model (e.g., BAAI/bge-reranker-base).
RERANKER_TOP_N: Number of top documents to keep after reranking.
QUESTIONS_FILE: Path to the JSON file containing evaluation questions and gold answers.
ANSWERS_OUTPUT_FILE: Temporary path for detailed answers. The actual output files in rag_results/ will be timestamped.
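
The dynamically generated PERSIST_DIRECTORY can be illustrated with a small dataclass. This is a sketch, not the real Config class: the class name `ConfigSketch` and its lowercase field names are hypothetical, and the directory pattern is inferred from the example output (`./chroma_db/Qwen3-Embedding-4B_cs300_co100`).

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for a few of the Config hyperparameters."""

    embeddings_model_name: str = "Qwen3-Embedding-4B"
    chunk_size: int = 300
    chunk_overlap: int = 100

    @property
    def persist_directory(self) -> str:
        # One directory per (model, chunk size, overlap) combination, so a
        # stored index is only reused when all three settings match.
        return (
            f"./chroma_db/{self.embeddings_model_name}"
            f"_cs{self.chunk_size}_co{self.chunk_overlap}"
        )


print(ConfigSketch().persist_directory)
print(ConfigSketch(chunk_size=500, chunk_overlap=50).persist_directory)
```

Deriving the path from the settings keeps embeddings built with one chunking setup from being silently reused with another.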

📊 Evaluation Results

After each run, the system will output the evaluation results to the console and save detailed and summary JSON files in the rag_results/ directory.
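
A timestamped summary file of this kind could be written roughly as follows. The helpers `summary_filename` and `save_summary` and the payload keys are hypothetical; only the `summary_YYYYMMDD_HHMMSS.json` naming pattern and the rag_results/ directory come from this README.

```python
import json
from datetime import datetime
from pathlib import Path


def summary_filename(now):
    """Build a name such as summary_20240723_103045.json from a datetime."""
    return f"summary_{now.strftime('%Y%m%d_%H%M%S')}.json"


def save_summary(hyperparams, scores, out_dir="rag_results", now=None):
    """Write hyperparameters and aggregate scores to a timestamped JSON file."""
    now = now or datetime.now()
    out_path = Path(out_dir) / summary_filename(now)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"hyperparameters": hyperparams, "scores": scores}
    # ensure_ascii=False keeps Chinese text readable in the saved file.
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


print(summary_filename(datetime(2024, 7, 23, 10, 30, 45)))
```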

Example Output

--- RAG 管道配置 ---
   LLM_MODEL_NAME: c101-qwen25-72b
   EMBEDDINGS_MODEL_NAME: Qwen3-Embedding-4B
   CHUNK_SIZE: 300
   CHUNK_OVERLAP: 100
   PDF_FILE_PATH: ./汽车介绍手册.pdf
   PERSIST_DIRECTORY: ./chroma_db/Qwen3-Embedding-4B_cs300_co100
   BM25_K: 3
   VECTOR_K: 3
   ENSEMBLE_WEIGHTS: [0.5, 0.5]
   RERANKER_MODEL_PATH: bge-reranker-v2-m3
   RERANKER_TOP_N: 3
   QUESTIONS_FILE: ./QA_pairs.json
   ANSWERS_OUTPUT_FILE: ./answers_temp.json
----------------------------------
Loaded 15 pages from 汽车介绍手册.pdf
Split documents into 78 chunks with chunk_size=300, overlap=100.
Creating and persisting new vector store to ./chroma_db/Qwen3-Embedding-4B_cs300_co100
Setting up retrievers...
  BM25 Retriever initialized with k=3
  Vector Retriever initialized with k=3
  Ensemble Retriever initialized with weights=[0.5, 0.5]
  Initializing Reranker model: bge-reranker-v2-m3
Retriever with reranker set up. Top N after reranking: 3
RAG chain set up.

Starting batch answering for 20 questions...
  Processing question 1/20: 操作多媒体娱乐功能需要注意什么？...
  Processing question 10/20: 什么是主动安全？...
  Processing question 20/20: 如何开启巡航控制？...
Batch answering complete. Results saved to rag_results/answers_detailed_20240723_103045.json

--- Evaluation Results ---
平均 BLEU 分数: 0.2543
平均 ROUGE-1 Recall 分数: 0.3876
平均 ROUGE-2 Recall 分数: 0.1522
引用页码命中率: 85.00%

Run summary and hyperparameters saved to: rag_results/summary_20240723_103045.json

The rag_results/ directory will contain JSON files like:

summary_YYYYMMDD_HHMMSS.json: Contains the hyperparameters used for the run and the aggregated evaluation scores (BLEU, ROUGE-1 Recall, ROUGE-2 Recall, Reference Accuracy).
answers_detailed_YYYYMMDD_HHMMSS.json: Contains each question, its gold answer, the LLM's generated answer, and the retrieved reference page, allowing for detailed analysis.
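
The aggregated scores can be understood through a simplified sketch of two of the metrics. This is not the project's evaluator: it assumes the answers are already tokenized (for Chinese, for example with Jieba), it counts a question as a reference hit when any retrieved page label matches a gold label, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference_tokens, candidate_tokens, n=1):
    """Share of reference n-grams that also appear in the candidate."""
    ref = ngrams(reference_tokens, n)
    cand = ngrams(candidate_tokens, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each n-gram to its minimum count.
    return sum((ref & cand).values()) / total


def reference_accuracy(gold_refs, retrieved_refs):
    """Fraction of questions whose retrieved pages include a gold page."""
    if not gold_refs:
        return 0.0
    hits = sum(1 for gold, got in zip(gold_refs, retrieved_refs) if set(gold) & set(got))
    return hits / len(gold_refs)


# Hand-tokenized toy answers; real runs would tokenize with Jieba.
gold = ["将", "挡位", "切换", "至", "驻车挡"]
generated = ["挡位", "切换", "至", "P", "挡"]
r1 = rouge_n_recall(gold, generated, n=1)
r2 = rouge_n_recall(gold, generated, n=2)
acc = reference_accuracy(
    [["page_5"], ["page_10", "page_12"]],
    [["page_5"], ["page_3"]],
)
print(r1, r2, acc)
```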

🤝 Contributing

Contributions are welcome! If you have suggestions for improvements, new features, or bug fixes, please open an issue or submit a pull request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements

LangChain for providing the framework for building LLM applications.
Hugging Face Transformers for access to various pre-trained models.
Chroma for the vector database.
Jieba for Chinese text segmentation.
ROUGE-Chinese for Chinese ROUGE evaluation.

