INSIDE: Internalization-aware LLM Serving In Dual-Speed Edge-Cloud Cache Evolution

INSIDE is an edge-cloud collaborative inference framework that enables small language models on resource-constrained edge devices to continuously internalize knowledge through a dual-path learning mechanism, achieving near-cloud accuracy at a fraction of the cost.

Project Structure

INSIDE/
├── codes/
│   ├── core/                        
│   │   ├── index.py                 
│   │   ├── router.py                
│   │   ├── retriever.py             
│   │   ├── assembler.py             
│   │   ├── learner.py               
│   │   ├── cloud_client.py         
│   │   ├── gemini_client.py         
│   │   ├── sql_prompt_generator.py  
│   │   └── pipeline.py             
│   │
│   ├── test/                        
│   │   ├── run_experiment.py        
│   │   ├── debug_experiment.py      
│   │   ├── generate_cloud_cache.py  
│   │   ├── generate_popqa_hotspot.py       
│   │   ├── repair_popqa_hotspot.py         
│   │   └── sql_prompt_generator.py  
│   │
│   └── utils/                       
│       ├── analyze_cloud_cache.py   
│       ├── count_cache_tokens.py    
│       ├── count_tokens.py          
│       ├── inspect_data.py          
│       ├── print_index.py           
│       ├── rejudge_prediction_dump.py  
│       └── tojsonl.py               
│
└── data/                        
    └── popqa/
        ├── test.tsv  
        └── popqa_hotspot.jsonl

🌍 Datasets and Tasks

We evaluate INSIDE across diverse workloads. As summarized in the following table, the tasks span multiple representative domains, including general question answering, long-form QA, mathematical reasoning, and structured code generation (Text-to-SQL).

Task	Dataset	# training samples	# test samples	Description
General QA	MS MARCO	808,731	101,093	Large-scale QA derived from Bing search logs
General QA	GooAQ	3,112,679	2,500	Large-scale QA mined from Google search logs
General QA	PopQA	11,267	3,000	QA benchmark focused on long-tail entities
General QA	PopQA_Hotspot	70,000	10,765	Synthetic benchmark reflecting realistic hotspot workloads
Long-form QA	ELI5	216,147	10,000	Long-form QA dataset to evaluate token overhead
Math Problem Solving	GSM8K	7,473	1,319	Grade-school math word problems
Text-to-SQL	Spider	7,000	1,034	Cross-domain semantic parsing and Text-to-SQL

📊 Baselines

We evaluate INSIDE against representative retrieval, routing, and caching systems, as well as pure edge-only and cloud-only execution strategies. The baselines are listed below:

Baseline	Year	Conference / Journal	Paper
Self-RAG	2024	ICLR	Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
RouteLLM	2025	ICLR	RouteLLM: Learning to Route LLMs with Preference Data
GPTCache	2023	NLP-OSS	GPTCache: An Open-Source Semantic Cache for LLM Applications
IC-Cache	2025	SOSP	IC-Cache: Efficient Large Language Model Serving via In-Context Caching
All-Edge	-	-	All queries are processed locally by Qwen2.5-7B without retrieval
All-ICL	-	-	All queries use retrieval-augmented few-shot prompting on the edge
All-Cloud	-	-	All queries are processed by the cloud LLM (DeepSeek-V3.2)

Quick Start

Step 1: Environment Setup

Make sure you have Python 3.10+ and CUDA (for GPU acceleration) installed, then install the required dependencies:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=4.36" peft numpy spacy openai tqdm datasets
python -m spacy download en_core_web_sm

Step 2: Download the Model

Download Qwen2.5-7B-Instruct from Hugging Face and place it under the models/ directory.
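
One way to fetch the weights is the `huggingface_hub` package (installed as a dependency of transformers), sketched below. It assumes the Hub repository ID `Qwen/Qwen2.5-7B-Instruct` and a target folder of models/Qwen2.5-7B-Instruct. The exact folder name the code expects is not stated here, so check the model path used under codes/ before relying on this layout; both helper functions are hypothetical.

```python
from pathlib import Path

REPO_ID = "Qwen/Qwen2.5-7B-Instruct"  # Hugging Face Hub repository ID


def model_dir(models_root: str = "models") -> Path:
    """Return the assumed local folder for the model (hypothetical layout)."""
    return Path(models_root) / REPO_ID.split("/")[-1]


def download_model(models_root: str = "models") -> Path:
    """Download all model files into <models_root>/Qwen2.5-7B-Instruct."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = model_dir(models_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_model()` from the repository root would place the files under models/Qwen2.5-7B-Instruct. The download is large, so make sure enough disk space is available first.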

Step 3: Download the Datasets

Download the datasets and place them under data/, following the layout shown in Project Structure (for example, data/popqa/).
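
Before launching a run, it can help to confirm that the files are where the code will look for them. The sketch below checks only the two PopQA files listed in the Project Structure tree; `missing_dataset_files` is a hypothetical helper, and paths for the other datasets are not specified here, so they are left out.

```python
from pathlib import Path

# Files listed under data/ in the Project Structure section.
EXPECTED_FILES = [
    "popqa/test.tsv",
    "popqa/popqa_hotspot.jsonl",
]


def missing_dataset_files(data_root: str = "data") -> list[str]:
    """Return the expected dataset files that are absent under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_dataset_files("data")` means both PopQA files are in place.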

Step 4: Configure the DeepSeek API Key

Open codes/core/cloud_client.py and set your DeepSeek API key.
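
Hard-coding a key in a source file makes it easy to commit by accident. A common alternative, sketched here, is to read it from an environment variable. The variable name `DEEPSEEK_API_KEY` and both helper functions are hypothetical and are not part of cloud_client.py; the sketch relies on the `openai` package installed in Step 1 and on DeepSeek's OpenAI-compatible endpoint.

```python
import os

DEEPSEEK_BASE_URL = "https://api.deepseek.com"  # OpenAI-compatible endpoint


def load_api_key(env=None) -> str:
    """Read the key from the (hypothetical) DEEPSEEK_API_KEY variable."""
    env = os.environ if env is None else env
    key = env.get("DEEPSEEK_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set DEEPSEEK_API_KEY before running experiments.")
    return key


def make_client():
    """Build an OpenAI-style client pointed at the DeepSeek API."""
    from openai import OpenAI  # imported lazily; installed in Step 1

    return OpenAI(api_key=load_api_key(), base_url=DEEPSEEK_BASE_URL)
```

With this approach the key is exported in the shell before running, and the source file only needs to call `load_api_key()` where the literal key would otherwise go.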

Step 5: Configure and Run Experiments

Open codes/test/run_experiment.py and adjust the experiment configuration at the top of the file:

CURRENT_DATASET = "popqa_hotspot"              # Dataset: popqa_hotspot, MS, gooaq, eli5, gsm8k, spider
TEST_DAYS = 1                       # Number of simulation days
SAMPLES_PER_DAY = 1000              # Test samples per day
INIT_INDEX_SIZE = 70000             # Initial index size from training data
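
To make the relationship between these settings concrete: if each simulated day consumes SAMPLES_PER_DAY consecutive test samples, a run covers TEST_DAYS * SAMPLES_PER_DAY queries in total (1 * 1000 = 1000 with the values above). The sketch below illustrates that reading. `split_into_days` is a hypothetical helper, and the real run_experiment.py may sample or order the data differently.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_into_days(samples: Sequence[T], test_days: int,
                    samples_per_day: int) -> list[list[T]]:
    """Cut the first test_days * samples_per_day samples into per-day batches."""
    needed = test_days * samples_per_day
    if len(samples) < needed:
        raise ValueError(f"need {needed} samples, got {len(samples)}")
    return [
        list(samples[d * samples_per_day:(d + 1) * samples_per_day])
        for d in range(test_days)
    ]


# 10,765 is the PopQA_Hotspot test-set size from the dataset table.
days = split_into_days(range(10_765), test_days=1, samples_per_day=1000)
print(len(days), len(days[0]))  # 1 1000
```

Under this reading, TEST_DAYS * SAMPLES_PER_DAY must not exceed the number of test samples in the chosen dataset.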

Then run the experiment:

cd codes/test
python run_experiment.py

Note: On first run, the system will build the cluster index from the training data, which may take a while. The built index is automatically saved to saved_indices/ and will be reused on subsequent runs.
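
The build-once, reuse-later behavior described in the note follows a familiar caching pattern, sketched below with `pickle`. Everything here is illustrative: the file naming scheme, the `load_or_build_index` helper, and the use of pickle are assumptions, not a description of how INSIDE serializes its cluster index.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build_index(dataset: str, index_size: int,
                        build: Callable[[], Any],
                        cache_dir: str = "saved_indices") -> Any:
    """Load a cached index if present; otherwise build it and cache it."""
    # Key the file on the settings that change the index (hypothetical naming).
    path = Path(cache_dir) / f"{dataset}_{index_size}.pkl"
    if path.is_file():
        with path.open("rb") as f:
            return pickle.load(f)
    index = build()  # the slow step on the first run
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(index, f)
    return index
```

In this sketch, changing the dataset or the index size produces a different cache file, so a stale index is never loaded for a new configuration.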

Step 6: Check Results

After the experiment completes, results are stored in the log/ directory.
