INSIDE is an edge-cloud collaborative inference framework that enables small language models on resource-constrained edge devices to continuously internalize knowledge through a dual-path learning mechanism, achieving near-cloud accuracy at a fraction of the cost.
INSIDE/
├── codes/
│   ├── core/
│   │   ├── index.py
│   │   ├── router.py
│   │   ├── retriever.py
│   │   ├── assembler.py
│   │   ├── learner.py
│   │   ├── cloud_client.py
│   │   ├── gemini_client.py
│   │   ├── sql_prompt_generator.py
│   │   └── pipeline.py
│   │
│   ├── test/
│   │   ├── run_experiment.py
│   │   ├── debug_experiment.py
│   │   ├── generate_cloud_cache.py
│   │   ├── generate_popqa_hotspot.py
│   │   ├── repair_popqa_hotspot.py
│   │   └── sql_prompt_generator.py
│   │
│   └── utils/
│       ├── analyze_cloud_cache.py
│       ├── count_cache_tokens.py
│       ├── count_tokens.py
│       ├── inspect_data.py
│       ├── print_index.py
│       ├── rejudge_prediction_dump.py
│       └── tojsonl.py
│
└── data/
    └── popqa/
        ├── test.tsv
        └── popqa_hotspot.jsonl
We evaluate INSIDE across diverse workloads. As summarized in the following table, the tasks span multiple representative domains, including general question answering, long-form QA, mathematical reasoning, and structured code generation (Text-to-SQL).
| Task | Dataset | # training samples | # test samples | Description |
|---|---|---|---|---|
| General QA | MS MARCO | 808,731 | 101,093 | Large-scale QA derived from Bing search logs |
| General QA | GooAQ | 3,112,679 | 2,500 | Large-scale QA mined from Google search logs |
| General QA | PopQA | 11,267 | 3,000 | QA benchmark focused on long-tail entities |
| General QA | PopQA_Hotspot | 70,000 | 10,765 | Synthetic benchmark reflecting realistic hotspot workloads |
| Long-form QA | ELI5 | 216,147 | 10,000 | Long-form QA dataset to evaluate token overhead |
| Math Problem Solving | GSM8K | 7,473 | 1,319 | Grade-school math word problems |
| Text-to-SQL | Spider | 7,000 | 1,034 | Cross-domain semantic parsing and Text-to-SQL |
We evaluate INSIDE against a suite of representative retrieval, routing, and caching systems, and also compare it with pure edge and pure cloud execution strategies. The baselines are listed below:
| Baseline | Year | Conference / Journal | Paper / Description |
|---|---|---|---|
| Self-RAG | 2024 | ICLR | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection |
| RouteLLM | 2025 | ICLR | RouteLLM: Learning to Route LLMs with Preference Data |
| GPTCache | 2023 | NLP-OSS | GPTCache: An Open-Source Semantic Cache for LLM Applications |
| IC-Cache | 2025 | SOSP | IC-Cache: Efficient Large Language Model Serving via In-Context Caching |
| All-Edge | - | - | All queries are processed locally by Qwen2.5-7B without retrieval |
| All-ICL | - | - | All queries use retrieval-augmented few-shot prompting on the edge |
| All-Cloud | - | - | All queries are processed by the cloud LLM (DeepSeek-V3.2) |
Make sure you have Python 3.10+ and CUDA (for GPU acceleration) installed, then install the required dependencies:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "transformers>=4.36" peft numpy spacy openai tqdm datasets
python -m spacy download en_core_web_sm
Download Qwen2.5-7B-Instruct from Hugging Face and place it under the models/ directory.
Download datasets and place them under data/.
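Before launching an experiment, it can help to confirm that the files are where the code expects them. The following is a minimal sketch, not part of the repository: the data paths come from the directory tree above, while the helper name `missing_paths` and the assumption that models/ sits at the repository root are illustrative only.

```python
from pathlib import Path

# Data paths follow the repository tree; placing models/ at the
# repository root is an assumption made for this sketch.
REQUIRED_FILES = [
    Path("data/popqa/test.tsv"),
    Path("data/popqa/popqa_hotspot.jsonl"),
]
REQUIRED_DIRS = [Path("models")]


def missing_paths(root: Path) -> list[str]:
    """Return required paths that are absent (or empty directories) under root."""
    missing = []
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            missing.append(rel.as_posix())
    for rel in REQUIRED_DIRS:
        folder = root / rel
        # An empty models/ directory means no model has been downloaded yet.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel.as_posix())
    return missing


if __name__ == "__main__":
    problems = missing_paths(Path("."))
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Layout looks complete.")
```

Run it from the repository root; an empty result means the PopQA files and a non-empty models/ directory were found.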
Open codes/core/cloud_client.py and set your DeepSeek API key.
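If you prefer not to hard-code the key in a source file, one common alternative is to read it from the environment. The sketch below is only an illustration: the function `resolve_api_key` and the variable name `DEEPSEEK_API_KEY` are hypothetical, and how cloud_client.py actually stores the key is not shown here.

```python
import os


def resolve_api_key(env_var: str = "DEEPSEEK_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is unset.

    The default variable name is an assumption for this sketch, not a
    setting defined by the repository.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

Failing at startup with a clear message is usually easier to diagnose than an authentication error partway through an experiment.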
Open codes/test/run_experiment.py and adjust the experiment configuration at the top of the file:
CURRENT_DATASET = "popqa_hotspot" # Dataset: popqa_hotspot, MS, gooaq, eli5, gsm8k, spider
TEST_DAYS = 1 # Number of simulation days
SAMPLES_PER_DAY = 1000 # Test samples per day
INIT_INDEX_SIZE = 70000 # Initial index size from training data
Then run the experiment:
cd codes/test
python run_experiment.py
Note: On first run, the system builds the cluster index from the training data, which may take a while. The built index is automatically saved to saved_indices/ and reused on subsequent runs.
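The build-once, reuse-later behavior described in the note follows a standard load-or-build pattern. The sketch below illustrates that pattern under stated assumptions: `load_or_build_index`, the use of `pickle`, and the cache file name are hypothetical, and the repository's own serialization format may differ.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build_index(cache_file: Path, build: Callable[[], Any]) -> Any:
    """Return the cached index if present, otherwise build it and save it."""
    if cache_file.is_file():
        # Later runs: skip the expensive build and load from disk.
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    # First run: build, then persist for next time.
    index = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(index, fh)
    return index
```

A call such as `load_or_build_index(Path("saved_indices/example.pkl"), build_fn)` would run `build_fn` only when the file is missing. With a scheme like this, deleting the cached file forces a rebuild, which may be useful after changing the training data or the index size.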
After the experiment completes, results are stored in the log/ directory.
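The configuration at the top of run_experiment.py sets the simulated workload to TEST_DAYS × SAMPLES_PER_DAY queries. The sketch below shows one plausible way such a schedule could be cut from a test set; `daily_batches` is a hypothetical helper, and whether run_experiment.py slices consecutively or samples differently is an assumption.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Values mirror the defaults shown in the configuration above.
TEST_DAYS = 1
SAMPLES_PER_DAY = 1000


def daily_batches(samples: Sequence[T], days: int, per_day: int) -> Iterator[Sequence[T]]:
    """Yield one consecutive slice of test samples per simulated day."""
    needed = days * per_day
    if needed > len(samples):
        raise ValueError(f"Need {needed} samples but only {len(samples)} are available.")
    for day in range(days):
        start = day * per_day
        yield samples[start:start + per_day]
```

Under this no-reuse assumption, the 10,765 test samples listed for PopQA_Hotspot would support at most 10 simulated days at 1000 samples per day.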