Official repository for the paper "Exploring the Potential of Encoder-free Architectures in 3D LMMs".
[📄 Paper](https://arxiv.org/abs/2502.09620) [🤗 HF Checkpoints for Stage 1](https://huggingface.co/IvanTang/ENEL/tree/main)
We introduce ENEL, an Encoder-free 3D Large Language Model capable of overcoming the challenges posed by encoder-based architectures, including the inability to adapt to varying point cloud resolutions and the failure of encoder-extracted point features to meet the semantic needs of Large Language Models. Building upon PointLLM, we conduct a comprehensive investigation into how the LLM can assume the role of the 3D encoder. Based on the PointLLM dataset, our 7B model is evaluated across three benchmark tasks: generative 3D object classification, 3D object captioning, and 3D VQA, with assessments performed using GPT-4 scoring and traditional metrics.
- [2025-02-13] We release the training code for the pre-training stage, the corresponding checkpoints, and the evaluation code.
- [2025-02-13] We release the ENEL paper.
- 💬 Dialogue Examples
- 🔍 Overview
- 📦 Training and Evaluation
- 📋 TODO List
- 📝 Citation
- 📄 License
- 📚 Acknowledgements
Dialogue 1 (example dialogue image; see the repository for the full figure).
Please refer to our paper for more results.
For the checkpoints at https://huggingface.co/IvanTang/ENEL/tree/main, adapt them to your local paths by modifying the `_name_or_path` attribute in the config.json file and the `special_tokens_map_file` attribute in the tokenizer_config.json file, as in the sketch below.
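For reference, a minimal sketch of how those two attributes could be rewritten programmatically; the checkpoint directory path and the `special_tokens_map.json` file name are assumptions to replace with your local layout:

```python
import json

# Hypothetical local checkpoint directory -- replace with your own path.
ckpt_dir = "/path/to/ENEL_checkpoint"

# Point `_name_or_path` in config.json at the local checkpoint directory.
with open(f"{ckpt_dir}/config.json") as f:
    config = json.load(f)
config["_name_or_path"] = ckpt_dir
with open(f"{ckpt_dir}/config.json", "w") as f:
    json.dump(config, f, indent=2)

# Point `special_tokens_map_file` in tokenizer_config.json at the local file
# (assumes the standard Hugging Face file name; adjust if yours differs).
with open(f"{ckpt_dir}/tokenizer_config.json") as f:
    tok_config = json.load(f)
tok_config["special_tokens_map_file"] = f"{ckpt_dir}/special_tokens_map.json"
with open(f"{ckpt_dir}/tokenizer_config.json", "w") as f:
    json.dump(tok_config, f, indent=2)
```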
To start:
- Clone this repository.
git clone https://github.com/Ivan-Tang-3D/ENEL.git
cd ENEL
- Install packages
conda create -n ENEL python=3.10 -y
conda activate ENEL
pip install --upgrade pip # enable PEP 660 support
pip install -e .
# * for training
pip install ninja
pip install flash-attn
# * for chamfer_dist
git clone https://github.com/Pang-Yatian/Point-MAE.git
cd Point-MAE/extensions/chamfer_dist
python setup.py install --user
- Download the two compressed files of 660K Objaverse colored point clouds here. They require about 77GB of storage space.
- Run the following command to merge the two files into one and uncompress it. This will produce a folder named `8192_npy` containing 660K point cloud files named `{Objaverse_ID}_8192.npy`. Each file is a numpy array with dimensions (8192, 6), where the first three dimensions are `xyz` and the last three dimensions are `rgb` in the [0, 1] range (a loading sketch follows the commands below).
cat Objaverse_660K_8192_npy_split_a* > Objaverse_660K_8192_npy.tar.gz
tar -xvf Objaverse_660K_8192_npy.tar.gz
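To sanity-check the extracted data, a minimal loading sketch; the file name below is a hypothetical placeholder for a real `{Objaverse_ID}`:

```python
import numpy as np

# Hypothetical file name -- substitute a real {Objaverse_ID}.
pc = np.load("8192_npy/xxxxxxxx_8192.npy")  # expected shape: (8192, 6)

xyz = pc[:, :3]   # point coordinates
rgb = pc[:, 3:]   # colors, expected in the [0, 1] range

print(pc.shape)
print("xyz range:", xyz.min(axis=0), xyz.max(axis=0))
print("rgb range:", rgb.min(), rgb.max())
```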
- In the `ENEL` folder, create a folder `data` and create a soft link to the uncompressed folder inside it.
cd ENEL
mkdir data
ln -s /path/to/8192_npy data/objaverse_data
- In the `ENEL/data` folder, create a directory named `anno_data`.
- Our instruction-following data, including both the simple-description and complex instructions, can be downloaded here. If you have difficulty downloading the data (e.g., network issues), please email the authors.
- The simple-description data has 660K samples and the complex instructions have 70K samples.
- Both training sets are based on the Objaverse dataset.
- The complex instructions are generated with GPT-4.
- Put the data files in the `anno_data` directory. The directory should look like this:
ENEL/data/anno_data
├── PointLLM_brief_description_660K_filtered.json
├── PointLLM_brief_description_660K.json
└── PointLLM_complex_instruction_70K.json
- Note that `PointLLM_brief_description_660K_filtered.json` is filtered from `PointLLM_brief_description_660K.json` by removing the 3000 objects we reserved as the validation set.
- Download the reference GT file `PointLLM_brief_description_val_200_GT.json` that we use for the benchmarks on the Objaverse dataset here, and put it in `ENEL/data/anno_data` (a quick sanity-check sketch follows below).
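A small sanity-check sketch, assuming each annotation file is a JSON array of samples (adjust if the actual schema wraps the samples differently):

```python
import json
import os

anno_dir = "data/anno_data"  # relative to the ENEL folder
files = [
    "PointLLM_brief_description_660K_filtered.json",
    "PointLLM_brief_description_660K.json",
    "PointLLM_complex_instruction_70K.json",
    "PointLLM_brief_description_val_200_GT.json",
]

for name in files:
    path = os.path.join(anno_dir, name)
    if not os.path.exists(path):
        print(f"missing: {name}")
        continue
    with open(path) as f:
        data = json.load(f)
    # Assumes the top level is a list of samples; len() still works for a dict.
    print(f"{name}: {len(data)} entries")
```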
- In the `ENEL` folder, create a directory named `checkpoints`.
- Download the pre-trained LLM `PointLLM_7B_v1.1_init` and put it in the `checkpoints` directory (a hedged download sketch follows below).
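If you prefer to fetch checkpoints from the Hugging Face Hub programmatically, a hedged sketch using `huggingface_hub` (install it with pip if needed); the repo id for `PointLLM_7B_v1.1_init` is a placeholder to replace with the repository linked above:

```python
from huggingface_hub import snapshot_download

# Stage-1 ENEL checkpoints (repo id taken from the link above).
snapshot_download(repo_id="IvanTang/ENEL", local_dir="checkpoints/ENEL_stage1")

# Pre-trained LLM weights: replace this placeholder repo id with the one
# linked in the instructions above for PointLLM_7B_v1.1_init.
snapshot_download(
    repo_id="<org>/PointLLM_7B_v1.1_init",
    local_dir="checkpoints/PointLLM_7B_v1.1_init",
)
```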
- For stage-1 training, simply run:
cd ENEL
bash scripts/ENEL_train_stage1.sh
- Run the following commands to infer the results.
- Use different commands for inference on different benchmarks:
MODEL_NAME=
LOG_SUFFIX=
LOG_DIR="/ENEL/new_eval_logs"
LOG_EDIR="/ENEL/new_eval_logs"
export PYTHONPATH="/ENEL:$PYTHONPATH"
# Object captioning on Objaverse
CUDA_VISIBLE_DEVICES=1 python pointllm/eval/eval_objaverse.py --model_name $MODEL_NAME --task_type captioning --prompt_index 2 > $LOG_EDIR/try_obj_${LOG_SUFFIX}.log 2>&1 &
# Open Vocabulary Classification on Objaverse
CUDA_VISIBLE_DEVICES=2 python pointllm/eval/eval_objaverse.py --model_name $MODEL_NAME --task_type classification --prompt_index 0 > $LOG_EDIR/try_objcls_${LOG_SUFFIX}.log 2>&1 &
- Please check the default command-line arguments of these two scripts. You can specify different prompts, data paths, and other parameters.
- After inference, the results will be saved in `{model_name}/evaluation` as a dict with the following format (a sketch for inspecting the outputs follows the example):
{
"prompt": "",
"results": [
{
"object_id": "",
"ground_truth": "",
"model_output": "",
"label_name": "" # only for classification on modelnet40
}
]
}
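Before paying for the GPT-based evaluation, you may want to peek at a few raw outputs; a minimal sketch, assuming the results file follows the format above (the path is a placeholder):

```python
import json

# Placeholder path -- point this at the actual file under {model_name}/evaluation.
with open("path/to/evaluation/results.json") as f:
    results = json.load(f)

print("prompt:", results["prompt"])
for r in results["results"][:3]:
    print(r["object_id"])
    print("  GT    :", r["ground_truth"])
    print("  output:", r["model_output"])
```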
- Get your OpenAI API key at https://platform.openai.com/api-keys.
- Please set the OpenAI API key on line 40 of https://github.com/Ivan-Tang-3D/ENEL/blob/main/pointllm/eval/utils.py.
- Run the following commands to evaluate the model outputs in parallel with ChatGPT/GPT-4 (which costs approximately $1.5 to $2.2 USD).
export PYTHONPATH="/ENEL:$PYTHONPATH"
# Open Vocabulary Classification on Objaverse
python pointllm/eval/evaluator.py --results_path /path/to/model_output --model_type gpt-4-0613 --eval_type open-free-form-classification --parallel --num_workers 15
# Object captioning on Objaverse
python pointllm/eval/evaluator.py --results_path /path/to/model_output --model_type gpt-4-0613 --eval_type object-captioning --parallel --num_workers 15
- The evaluation script supports interruption and resumption. You can interrupt the evaluation process at any time with `Ctrl+C`; this will save the temporary results. If an error occurs during the evaluation, the script will also save the current state. You can resume the evaluation from where it left off by running the same command again.
- The evaluation results will be saved in `{model_name}/evaluation` as another dict. Some of the metrics are explained as follows (see the reading sketch after this list):
"average_score": The GPT-evaluated captioning score we report in our paper.
"accuracy": The classification accuracy we report in our paper, including random choices made by ChatGPT when model outputs are vague or ambiguous and ChatGPT outputs "INVALID".
"clean_accuracy": The classification accuracy after removing those "INVALID" outputs.
"total_predictions": The number of predictions.
"correct_predictions": The number of correct predictions.
"invalid_responses": The number of "INVALID" outputs by ChatGPT.
# Some other statistics for calling OpenAI API
"prompt_tokens": The total number of tokens of the prompts for ChatGPT/GPT-4.
"completion_tokens": The total number of tokens of the completion results from ChatGPT/GPT-4.
"GPT_cost": The API cost of the whole evaluation process, in US Dollars π΅.
- For the object captioning task, run the following command to evaluate model outputs with traditional metrics including BLEU, ROUGE, METEOR, Sentence-BERT, and SimCSE.
export PYTHONPATH="/ENEL:$PYTHONPATH"
CUDA_VISIBLE_DEVICES=0 python pointllm/eval/traditional_evaluator.py --results_path /path/to/model_captioning_output
- Add training code for stage 1 with checkpoints.
- Add evaluation & inference code.
- Add training code for stage 2.
If you find our work and this codebase helpful, please consider starring this repo ⭐ and citing:
@misc{tang2025exploringpotentialencoderfreearchitectures,
title={Exploring the Potential of Encoder-free Architectures in 3D LMMs},
author={Yiwen Tang and Zoey Guo and Zhuhao Wang and Ray Zhang and Qizhi Chen and Junli Liu and Delin Qu and Zhigang Wang and Dong Wang and Xuelong Li and Bin Zhao},
year={2025},
eprint={2502.09620},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.09620},
}
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.