LLM_BP

Overview

This is the official implementation of the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.

Repository Structure

The repository structure is as follows:

LLM_BP (Root Directory)
│── dataset/              # Contains dataset files
│── model/                # Model implementations of LLM-BP and LLM-BP (appr.)
│── results/              # Generated results from GPT-4o (predictions on the test set) and GPT-4o-mini (predictions of the homophily ratio)
│── zero_shot.py          # Zero-shot inference
│── few_shot.py           # Few-shot inference
│── run_gpt.py            # Run OpenAI GPT to predict labels from raw node texts
│── pred_h.py             # Predict the homophily ratio by sampling edges
│── generate_llm.py       # Generate embeddings with vanilla LLM2Vec or the task-adaptive encoder
│── generate_lm.py        # Generate embeddings with SBERT or RoBERTa
│── generate_llm_gpt.py   # Generate embeddings with text-embedding-3-large
│── README.md             # Documentation file

STEP 0.1 Environment Setup

To set up the environment, follow these steps:

conda create -n llmbp python==3.8.18 
conda activate llmbp
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install pyg_lib==0.3.1+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_sparse==0.6.18+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_cluster==1.6.3+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_spline_conv==1.2.2+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install transformers==4.46.3 
pip install sentence_transformers==2.2.2
pip install dgl==2.4.0+cu121 -f https://data.dgl.ai/wheels/torch-2.1/cu121/repo.html 
pip install openai 
pip install torch_geometric==2.5.0 
pip install protobuf 
pip install accelerate
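After installation, a quick sanity check like the one below (a minimal sketch, not part of the repository; the expected versions simply mirror the pins above) can confirm that PyTorch sees the GPU and that the PyG and DGL wheels were installed correctly:

# env_check.py -- minimal environment sanity check (not part of the repository)
import torch
import torch_geometric
import dgl

print("torch:", torch.__version__)                       # expected: 2.1.2
print("CUDA available:", torch.cuda.is_available())      # expected: True
print("torch_geometric:", torch_geometric.__version__)   # expected: 2.5.0
print("dgl:", dgl.__version__)                           # expected: 2.4.0+cu121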

STEP 0.2 Dataset Preparation

The dataset structure should be organized as follows:

/dataset/
│── [dataset_name]/
│   │── processed_data.pt    # Contains labels and graph information
│   │── [encoder]_x.pt       # Features extracted by different encoders
│   │── categories.csv       # Raw texts of the label names
│   │── raw_texts.pt         # Raw text of each node

File Descriptions

  • processed_data.pt: A PyTorch file storing the processed dataset, including the graph structure and node labels. Note that in the heterophilic datasets, this file is named [Dataset].pt, where [Dataset] can be Cornell, etc., and it should be loaded with DGL.
  • [encoder]_x.pt: Feature matrices extracted using different encoders, where [encoder] is the encoder name.
  • categories.csv: Raw label names.
  • raw_texts.pt: Raw node texts. Note that in the heterophilic datasets, this file is named [Dataset].csv, where [Dataset] can be Cornell, etc.
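As an illustration only (the paths and variable names below are assumptions, not code from the repository), the homophilous datasets can be read with torch.load, while the heterophilic [Dataset].pt files hold DGL graphs:

import torch
import dgl

# Homophilous datasets (e.g., cora): plain PyTorch checkpoints
data = torch.load("dataset/cora/processed_data.pt")       # graph structure and node labels
x = torch.load("dataset/cora/llmicl_class_aware_x.pt")    # features from one encoder

# Heterophilic datasets (e.g., cornell): the graph is stored for DGL;
# dgl.load_graphs is one possibility, assuming the file was written with dgl.save_graphs
graphs, _ = dgl.load_graphs("dataset/cornell/Cornell.pt")
g = graphs[0]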

Dataset Naming Convention

[dataset_name] should be one of the following:

  • cora
  • citeseer
  • pubmed
  • bookhis
  • bookchild
  • sportsfit
  • wikics
  • cornell
  • texas
  • wisconsin
  • washington

Encoder Naming Convention

[encoder] can be one of the following:

  • sbert (the Sentence-BERT encoder)
  • roberta (the RoBERTa encoder)
  • llmicl_primary (vanilla LLM2Vec)
  • llmicl_class_aware (the task-adaptive encoder)
  • llmgpt_text-embedding-3-large (OpenAI's text-embedding-3-large embedding API)

Ensure the datasets are placed correctly for smooth execution.

Download Pre-Calculated Embeddings and Datasets

They can be found in the huggingface repository; you can download them directly and place them under the /dataset/ folder.

STEP 1: Generating Dataset Embeddings

python generate_llm.py --dataset [DATASET] --version [VERSION]

Example:

CUDA_VISIBLE_DEVICES=0,1 python generate_llm.py --dataset cora --version class_aware

Parameters:

  • [DATASET]: The name of the dataset.
  • [VERSION]:
    • primary → Vanilla LLM2Vec
    • class_aware → Task-adaptive encoding

Ensure that the appropriate CUDA devices are set before running the script.

Download pre-calculated embeddings

We have enclosed the pre-calculated embeddings for the encoders in the huggingface repository; you may download them directly and put them under the /dataset folder.


STEP 2: Generate the predictions from GPT-4o

python run_gpt.py --mode [MODE] --model [MODEL] --dataset [DATASET]

Parameters:

  • [MODEL]: The model selection (e.g., 4o for GPT-4o).
  • [DATASET]: The name of the dataset.
  • [MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it evaluates the saved results of the model (see the example below).
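Example (flag values taken from the parameter descriptions above, with cora as an illustrative dataset):

python run_gpt.py --mode inference --model 4o --dataset cora
python run_gpt.py --mode evaluate --model 4o --dataset cora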

Download pre-calculated predictions

We have enclosed the pre-calculated predictions from GPT-4o in the huggingface repository; you may download them directly and put them under the /results folder.


STEP 3: Predict the homophily ratio of the dataset

python pred_r.py --mode [MODE] --dataset [DATASET] --model [MODEL]

Parameters:

  • [DATASET]: The name of the dataset.
  • [MODEL]: The model selection (e.g., 4o_mini).
  • [MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it evaluates the saved predictions (see the example below).
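Example (again with cora as an illustrative dataset):

python pred_r.py --mode inference --dataset cora --model 4o_mini
python pred_r.py --mode evaluate --dataset cora --model 4o_mini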

Fill in the value

Fill the predicted homophily ratio into H_dict in zero_shot.py or few_shot.py, as sketched below.
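A minimal sketch of what this might look like (the actual structure of H_dict is defined in zero_shot.py / few_shot.py; the keys and numbers below are placeholders, not real predictions):

# Illustrative placeholder only -- replace the values with the ratios predicted in STEP 3
H_dict = {
    "cora": 0.80,      # predicted homophily ratio for cora
    "cornell": 0.30,   # predicted homophily ratio for cornell
    # ... one entry per dataset you run
}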

Download pre-calculated predictions

We have enclosed the pre-calculated predictions from GPT-4o-mini in the huggingface repository; you may download them directly and put them under the /results folder.

STEP 4: Zero-shot Inference

python zero_shot.py --dataset [DATASET] --encoder [ENCODER] --model 4o

Parameters:

  • [DATASET]: The name of the dataset.
  • [ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
  • 4o: Specifies that the GPT-4o predictions are used to form the averaged class embeddings.
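Example (with cora and the task-adaptive encoder chosen for illustration):

python zero_shot.py --dataset cora --encoder llmicl_class_aware --model 4o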

STEP 5: Few-shot Inference

python few_shot.py --dataset [DATASET] --encoder [ENCODER]

Parameters:

  • [DATASET]: The name of the dataset.
  • [ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
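Example (same illustrative choices as in STEP 4):

python few_shot.py --dataset cora --encoder llmicl_class_aware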

Acknowledgements

The dataset pre-processing, formats, and code implementations are inspired by or built upon GLBench, Text-space graph foundation model, and LLaGA.

Citation

If you find our work helpful, please consider citing:

@article{wang2025model,
title={Model Generalization on Text Attribute Graphs: Principles with Large Language Models},
author={Wang, Haoyu and Liu, Shikun and Wei, Rongzhe and Li, Pan},
journal={arXiv preprint arXiv:2502.11836},
year={2025}
}
