Implementation for the paper:
3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
Hao Tang*†, Ting Huang*, Zeyu Zhang*
*Equal contribution. †Corresponding author.
If you use this repo, please cite:
@article{tang20263d,
title={3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence},
author={Tang, Hao and Huang, Ting and Zhang, Zeyu},
journal={arXiv preprint arXiv:2601.06496},
year={2026}
}
Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments. 3D captioning aims to describe 3D scenes in natural language but remains challenging due to sparse point clouds and weak grounding across diverse environments.
3D CoCa v2 addresses this by unifying contrastive vision-language learning with 3D caption generation and improving robustness via test-time search (TTS). It builds on frozen CLIP priors, a spatially-aware 3D encoder, and a multimodal decoder optimized with contrastive and captioning objectives. Experiments show gains over 3D CoCa on ScanRefer and Nr3D, and strong zero-shot OOD results on TOD3Cap.
Key components:
- EPCL encoder for 3D geometry
- Frozen CLIP priors
- CoCa-style multimodal decoder with contrastive + captioning loss
- Optional TTS with heuristic or external LLM judge
Create a clean conda environment and install dependencies:
conda create --name 3DCoCa python=3.8
conda activate 3DCoCa
conda install ipython pip
pip install matplotlib opencv-python plyfile "trimesh>=2.35.39,<2.35.40" \
"networkx>=2.2,<2.3" scipy cython transformers h5py
# pointnet2
cd third_party/pointnet2
python setup.py install
cd utils
python cython_compile.py build_ext --inplaceSet dataset paths in the dataset files:
datasets/scene_scanrefer.py:DATASET_ROOT_DIR,DATASET_METADATA_DIRdatasets/scene_nr3d.py:DATASET_ROOT_DIR,DATASET_METADATA_DIR
Expected files in data/ (examples):
ScanRefer_filtered_train.json,ScanRefer_filtered_val.jsonScanRefer_filtered_train.txt,ScanRefer_filtered_val.txtnr3d_train.json,nr3d_val.json,nr3d_train.txt,nr3d_val.txt- Vocabulary JSONs (ScanRefer / Nr3D)
Set environment variables (or place data under data/tod3cap):
export TOD3CAP_ROOT=/path/to/tod3cap
export TOD3CAP_ANN=/path/to/tod3cap/annotations.json
export TOD3CAP_SPLIT=/path/to/tod3cap/splits.jsonRequired fields in annotations.json:
{
"scenes": [
{
"scene_id": "scene_0001",
"pointcloud": "pointclouds/scene_0001.npy",
"objects": [
{
"object_id": 0,
"category": "car",
"bbox": { "center": [0, 0, 0], "size": [1, 1, 1] },
"captions": ["a small car", "a vehicle on the road"]
}
]
}
],
"splits": { "train": ["scene_0001"], "val": ["scene_0001"] }
}bbox can also be { "min": [...], "max": [...] } or corners: [[...],[...]].
Point clouds are expected as .npy or .npz arrays (NxC).
ScanRefer (w/o 2D input):
python main.py --dataset scene_scanrefer --use_color --use_normal \
--num_points 40000 --detector 3D_coca \
--checkpoint_dir pretrained/3D_coca_XYZ_COLOR_NORMALNr3D (w/o 2D input):
python main.py --dataset scene_nr3d --use_color --use_normal \
--num_points 40000 --detector 3D_coca \
--checkpoint_dir pretrained/3D_coca_XYZ_COLOR_NORMAL_NR3DHelper script:
bash scripts/train_scannet.sh# w/o 2D input
python main.py --use_color --use_normal \
--test_ckpt ckpt/3DCoCa/checkpoint_best.pth --test_caption
# w/ 2D input (multiview)
python main.py --use_color --use_multiview \
--test_ckpt ckpt_2D/3DCoCa/checkpoint_best.pth --test_captionpython main.py --use_color --use_normal \
--test_ckpt ckpt/3DCoCa/checkpoint_best.pth --test_detectionEnable TTS (sample multiple captions):
python main.py --dataset scene_scanrefer --test_caption --test_ckpt ckpt/3DCoCa/checkpoint_best.pth \
--use_tts --tts_candidates 8 --tts_top_k 50 --tts_temperature 1.0Set API config (OpenAI-compatible Chat Completions):
export TTS_JUDGE_API_KEY=your_key
export TTS_JUDGE_BASE_URL=https://api.openai.com/v1/chat/completions
export TTS_JUDGE_MODEL=gpt-4o-miniRun with LLM judge:
python main.py --dataset tod3cap --test_caption --test_ckpt ckpt/3DCoCa/checkpoint_best.pth \
--use_tts --tts_judge llm --tts_judge_api_key "$TTS_JUDGE_API_KEY" \
--tts_judge_base_url "$TTS_JUDGE_BASE_URL" --tts_judge_model "$TTS_JUDGE_MODEL"We thank the authors of the following open-source repositories for their valuable code and inspiration: EPCL, CoCa, and Vote2Cap-DETR.
