LLM_BP

Overview

This is the official implementation of the paper 'Model Generalization on Text Attribute Graphs: Principles with Large Language Models' by Haoyu Wang, Shikun Liu, Rongzhe Wei, and Pan Li.

Repository Structure

The repository structure is as follows:

LLM_BP (Root Directory)
│── dataset/              # Contains dataset files
│── model/                # Model implementations of LLM-BP and LLM-BP (appr.)
│── results/              # Generated results from GPT-4o (predictions on the test set) and GPT-4o-mini (predictions of the homophily ratio)
│── zero_shot.py          # Zero-shot inference
│── few_shot.py           # Few-shot inference
│── run_gpt.py            # Run OpenAI GPT to predict labels from raw node texts
│── pred_h.py             # Predict the homophily ratio by sampling edges
│── generate_llm.py       # Generate embeddings with vanilla LLM2Vec or the task-adaptive encoder
│── generate_lm.py        # Generate embeddings with SBERT or RoBERTa
│── generate_llm_gpt.py   # Generate embeddings with text-embedding-3-large
│── README.md             # Documentation file

STEP 0.1 Environment Setup

To set up the environment, follow these steps:

conda create -n llmbp python==3.8.18 
conda activate llmbp
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install pyg_lib==0.3.1+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_sparse==0.6.18+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_cluster==1.6.3+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html 
pip install torch_spline_conv==1.2.2+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install transformers==4.46.3 
pip install sentence_transformers==2.2.2
pip install dgl==2.4.0+cu121 -f https://data.dgl.ai/wheels/torch-2.1/cu121/repo.html 
pip install openai 
pip install torch_geometric==2.5.0 
pip install protobuf 
pip install accelerate
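After installation, a quick sanity check like the one below (a minimal sketch, not part of the repository; the expected versions simply mirror the pins above) can confirm that PyTorch sees the GPU and that the PyG and DGL wheels were installed correctly:

# env_check.py -- minimal environment sanity check (not part of the repository)
import torch
import torch_geometric
import dgl

print("torch:", torch.__version__)                       # expected: 2.1.2
print("CUDA available:", torch.cuda.is_available())      # expected: True
print("torch_geometric:", torch_geometric.__version__)   # expected: 2.5.0
print("dgl:", dgl.__version__)                           # expected: 2.4.0+cu121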

STEP 0.2 Dataset Preparation

The dataset structure should be organized as follows:

/dataset/
│── [dataset_name]/
│   │── processed_data.pt    # Contains labels and graph information
│   │── [encoder]_x.pt       # Features extracted by different encoders
│   │── categories.csv       # Raw texts of the label names
│   │── raw_texts.pt         # Raw text of each node

File Descriptions

  • processed_data.pt: A PyTorch file storing the processed dataset, including the graph structure and node labels. Note that in the heterophilic datasets, this file is named [Dataset].pt, where [Dataset] can be Cornell, etc., and it should be loaded with DGL.
  • [encoder]_x.pt: Feature matrices extracted using different encoders, where [encoder] is the encoder name.
  • categories.csv: Raw label names.
  • raw_texts.pt: Raw node texts. Note that in the heterophilic datasets, this file is named [Dataset].csv, where [Dataset] can be Cornell, etc.
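As an illustration only (the paths and variable names below are assumptions, not code from the repository), the homophilous datasets can be read with torch.load, while the heterophilic [Dataset].pt files hold DGL graphs:

import torch
import dgl

# Homophilous datasets (e.g., cora): plain PyTorch checkpoints
data = torch.load("dataset/cora/processed_data.pt")       # graph structure and node labels
x = torch.load("dataset/cora/llmicl_class_aware_x.pt")    # features from one encoder

# Heterophilic datasets (e.g., cornell): the graph is stored for DGL;
# dgl.load_graphs is one possibility, assuming the file was written with dgl.save_graphs
graphs, _ = dgl.load_graphs("dataset/cornell/Cornell.pt")
g = graphs[0]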

Dataset Naming Convention

[dataset_name] should be one of the following:

  • cora
  • citeseer
  • pubmed
  • bookhis
  • bookchild
  • sportsfit
  • wikics
  • cornell
  • texas
  • wisconsin
  • washington

Encoder Naming Convention

[encoder] can be one of the following:

  • sbert (the Sentence-BERT encoder)
  • roberta (the RoBERTa encoder)
  • llmicl_primary (vanilla LLM2Vec)
  • llmicl_class_aware (the task-adaptive encoder)
  • llmgpt_text-embedding-3-large (OpenAI's text-embedding-3-large embedding API)

Ensure the datasets are placed correctly for smooth execution.

Download Pre-Calculated Embeddings and Datasets

They can be found in the huggingface repository; you can download them directly and place them under the /dataset/ folder.

STEP 1: Generating Dataset Embeddings

python generate_llm.py --dataset [DATASET] --version [VERSION]

Example:

CUDA_VISIBLE_DEVICES=0,1 python generate_llm.py --dataset cora --version class_aware

Parameters:

  • [DATASET]: The name of the dataset.
  • [VERSION]:
    • primary → Vanilla LLM2Vec
    • class_aware → Task-adaptive encoding

Ensure that the appropriate CUDA devices are set before running the script.

Download pre-calculated embeddings

We have enclosed the pre-calculated embeddings for the encoders in the huggingface repository; you may download them directly and put them under the /dataset folder.


STEP 2: Generate the predictions from GPT-4o

python run_gpt.py --mode [MODE] --model [MODEL] --dataset [DATASET]

Parameters:

  • [MODEL]: The model selection (e.g., 4o for GPT-4o).
  • [DATASET]: The name of the dataset.
  • [MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it evaluates the saved results of the model (see the example below).
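Example (flag values taken from the parameter descriptions above, with cora as an illustrative dataset):

python run_gpt.py --mode inference --model 4o --dataset cora
python run_gpt.py --mode evaluate --model 4o --dataset cora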

Download pre-calculated predictions

We have enclosed the pre-calculated predictions from GPT-4o in the huggingface repository; you may download them directly and put them under the /results folder.


STEP 3: Predict the homophily ratio of the dataset

python pred_r.py --mode [MODE] --dataset [DATASET] --model [MODEL]

Parameters:

  • [DATASET]: The name of the dataset.
  • [MODEL]: The model selection (e.g., 4o_mini).
  • [MODE]: When set to inference, the script runs inference and saves the results; when set to evaluate, it evaluates the saved predictions (see the example below).
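Example (again with cora as an illustrative dataset):

python pred_r.py --mode inference --dataset cora --model 4o_mini
python pred_r.py --mode evaluate --dataset cora --model 4o_mini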

Fill in the value

Fill the predicted homophily ratio into H_dict in zero_shot.py or few_shot.py, as sketched below.
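A minimal sketch of what this might look like (the actual structure of H_dict is defined in zero_shot.py / few_shot.py; the keys and numbers below are placeholders, not real predictions):

# Illustrative placeholder only -- replace the values with the ratios predicted in STEP 3
H_dict = {
    "cora": 0.80,      # predicted homophily ratio for cora
    "cornell": 0.30,   # predicted homophily ratio for cornell
    # ... one entry per dataset you run
}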

Download pre-calculated predictions

We have enclosed the pre-calculated predictions from GPT-4o-mini in the huggingface repository; you may download them directly and put them under the /results folder.

STEP 4: Zero-shot Inference

python zero_shot.py --dataset [DATASET] --encoder [ENCODER] --model 4o

Parameters:

  • [DATASET]: The name of the dataset.
  • [ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
  • 4o: Specifies that the GPT-4o predictions are used to form the averaged class embeddings.
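Example (with cora and the task-adaptive encoder chosen for illustration):

python zero_shot.py --dataset cora --encoder llmicl_class_aware --model 4o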

STEP 5: Few-shot Inference

python few_shot.py --dataset [DATASET] --encoder [ENCODER]

Parameters:

  • [DATASET]: The name of the dataset.
  • [ENCODER]: The encoder model (e.g., sbert, roberta, llmicl_primary, llmicl_class_aware, llmicl_text-embedding-3-large, etc.).
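Example (same illustrative choices as in STEP 4):

python few_shot.py --dataset cora --encoder llmicl_class_aware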

Acknowledgements

The dataset pre-processing, formats, and code implementations are inspired by or built upon GLBench, Text-space graph foundation model, and LLaGA.

Citation

If you find our work helpful, please consider citing:

@article{wang2025model,
title={Model Generalization on Text Attribute Graphs: Principles with Large Language Models},
author={Wang, Haoyu and Liu, Shikun and Wei, Rongzhe and Li, Pan},
journal={arXiv preprint arXiv:2502.11836},
year={2025}
}
