[Bug]: Indexing Pipeline #1677

MarkusGutjahr · 2025-02-05T08:15:58Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

I was running a normal indexing run to create a new graph from around 1000 documents. The last 7 summarize calls returned errors, but each succeeded after 1 retry. The pipeline then created the entities and nodes files without any problems. Next, the creation of the communities file began and the pipeline stopped. I got no error; it just stopped.

Since this ran for about 8 hours, it would be nice if it were possible to re-use the already created cache and files.

Created files are:

create_final_documents.parquet
create_final_entities.parquet
create_final_nodes.parquet
the cache/entity_extraction folder with files
the cache/summarize_descriptions folder with files
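
A quick way to confirm which artifacts a stopped run left behind is to compare the output folder against the files one expects. The following is only a minimal sketch: the helper name `missing_artifacts` is hypothetical, and `create_final_communities.parquet` is an assumed file name inferred from the `create_final_communities` workflow in the logs, not something confirmed by this report.

```python
from pathlib import Path

# The first three names come from this report. The last one is an ASSUMPTION
# based on the create_final_communities workflow name seen in the logs.
EXPECTED_ARTIFACTS = [
    "create_final_documents.parquet",
    "create_final_entities.parquet",
    "create_final_nodes.parquet",
    "create_final_communities.parquet",
]


def missing_artifacts(artifact_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected file names that are not present in artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Pointing it at the configured storage `base_dir` would list whatever the run did not get to write.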

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: ${LLM_TYPE}
  model: ${GRAPHRAG_LLM_MODEL}
  model_supports_json: true # recommended if this is available for your model.

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: ${EMBEDDING_TYPE}
    model: ${GRAPHRAG_EMBEDDING_MODEL}
  vector_store:
    type: lancedb
    db_uri: "graph_data/${BASE_DIR}/output/lancedb"

chunks:
  size: "${CHUNK_SIZE}"
  overlap: "${CHUNK_OVERLAP}"
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "graph_data/${BASE_DIR}/input"
  file_encoding: utf-8
  file_pattern: "."
  #file_pattern: ".\.(txt|md)$"
  #file_pattern: ".\.txt$"
  #file_pattern: ".\.md$"

cache:
  type: file # or blob
  base_dir: "graph_data/${BASE_DIR}/cache"
  #base_dir: "/tmp/graph/cache"

storage:
  type: file # or blob
  #base_dir: "output/${timestamp}/artifacts"
  base_dir: "graph_data/${BASE_DIR}/output/graph/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "graph_data/${BASE_DIR}/output/graph/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false # if true, will generate covariates
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:

global_search:

Logs and screenshots

Last logs of the indexing-engine.log:
15:52:29,976 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:30,211 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 0 retries took 23.568345429001056. input_tokens=169, output_tokens=40
15:52:34,617 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:34,626 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 0 retries took 24.33390350300033. input_tokens=177, output_tokens=57
15:52:42,533 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,535 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["STOCK CAR BRASIL", "CURVELO"]\nDescription List: ["Stock Car Brasil races may take place at various international circuits, including Curvelo", "Stock Car Brasil races may take place in various international circuits, including those in Curvelo"]\n#######\nOutput:\n'}
15:52:42,572 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,573 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["FOXTEL", "CHANNEL 10"]\nDescription List: ["Channel 10 has secured a new agreement with Foxtel to broadcast Formula 1", "Channel 10 has secured a new agreement with Foxtel to broadcast Formula 1 and MotoGP events"]\n#######\nOutput:\n'}
15:52:42,594 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,595 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["F\u00c9D\u00c9RATION INTERNATIONALE DU SPORT AUTOMOBILE", "PIERRE UGEUX"]\nDescription List: ["Pierre Ugeux oversaw racing regulations for Formula One while serving at the Commission Sportive Internationale, which became FISA", "Pierre Ugeux was the last president of the Commission Sportive Internationale before FISA was dissolved"]\n#######\nOutput:\n'}
15:52:42,596 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,597 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["PETER WHITEHEAD", "GRAHAM WHITEHEAD"]\nDescription List: ["Both Graham Whitehead and Peter Whitehead are racing drivers from the United Kingdom who participated in Formula One.", "Graham Whitehead is related to Peter Whitehead and may have participated in similar racing events", "Graham Whitehead is related to Peter Whitehead, who is also a racing driver", "Peter Whitehead and Graham Whitehead are related as family members and both are racing drivers"]\n#######\nOutput:\n'}
15:52:42,598 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,599 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["SPEED", "FOX SPORTS 1"]\nDescription List: ["Fox Sports 1 was launched as a rebranding of Speed, requiring new carriage deals with television providers", "Speed was rebranded and replaced by Fox Sports 1, inheriting its NASCAR coverage and expanding its sports rights", "Speed was replaced by Fox Sports 1 in the U.S. on August 17, 2013", "Speed's live event programming was carried over to Fox Sports 1, indicating a direct relationship between the two networks"]\n#######\nOutput:\n'}
15:52:42,624 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,625 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["DAVID MURRAY", "TONY ROLT"]\nDescription List: ["Both David Murray and Tony Rolt were racing drivers in the Formula One seasons during the early 1950s", "Both David Murray and Tony Rolt were racing drivers who participated in the Formula One seasons during the early 1950s."]\n#######\nOutput:\n'}
15:52:42,656 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 502 Bad Gateway"
15:52:42,657 graphrag.callbacks.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["NINO VACCARELLA", "SCUDERIA SSS REPUBLICA DI VENEZIA"]\nDescription List: ["Nino Vaccarella drove for Scuderia SSS Republica di Venezia, showcasing his talent in motorsport.", "Nino Vaccarella raced for Scuderia SSS Republica di Venezia in Formula One events"]\n#######\nOutput:\n'}
15:52:45,493 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:45,496 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 1.4465712899982464. input_tokens=188, output_tokens=59
15:52:45,589 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:45,607 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 1.7494399290008005. input_tokens=179, output_tokens=45
15:52:45,931 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:45,934 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 2.2555921010025486. input_tokens=222, output_tokens=90
15:52:46,49 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:46,50 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 1.6440738099990995. input_tokens=167, output_tokens=49
15:52:46,166 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:46,169 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 1.973351044998708. input_tokens=197, output_tokens=93
15:52:46,833 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:46,836 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 2.929334182997991. input_tokens=166, output_tokens=71
15:52:48,856 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
15:52:48,864 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "summarize" with 1 retries took 5.223053334000724. input_tokens=208, output_tokens=55
15:54:23,875 graphrag.index.run.workflow INFO dependencies for create_final_entities: ['create_base_entity_graph']
15:54:23,875 graphrag.index.run.workflow WARNING Dependency table create_base_entity_graph not found in storage: it may be a runtime-only in-memory table. If you see further errors, this may be an actual problem.
15:54:23,877 datashaper.workflow.workflow INFO executing verb create_final_entities
15:55:14,297 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_entities.parquet
15:55:14,672 graphrag.index.run.workflow INFO dependencies for create_final_nodes: ['create_base_entity_graph']
15:55:14,673 graphrag.index.run.workflow WARNING Dependency table create_base_entity_graph not found in storage: it may be a runtime-only in-memory table. If you see further errors, this may be an actual problem.
15:55:14,674 datashaper.workflow.workflow INFO executing verb create_final_nodes
15:59:08,134 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_nodes.parquet
15:59:08,435 graphrag.index.run.workflow INFO dependencies for create_final_communities: ['create_base_entity_graph']
15:59:08,435 graphrag.index.run.workflow WARNING Dependency table create_base_entity_graph not found in storage: it may be a runtime-only in-memory table. If you see further errors, this may be an actual problem.
15:59:08,436 datashaper.workflow.workflow INFO executing verb create_final_communities
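
To see at a glance where the run stopped, the log can be scanned for verbs that were started but never followed by a parquet emit. This is a rough sketch, not part of GraphRAG: the function name `unfinished_verbs` is hypothetical, and it assumes that a verb and the table it emits share the same name, which holds for the lines shown above.

```python
import re

# Patterns taken from the log lines in this report.
VERB_RE = re.compile(r"executing verb (\S+)")
EMIT_RE = re.compile(r"emitting parquet table (\S+)\.parquet")


def unfinished_verbs(log_lines):
    """Return verbs that were started but have no matching emitted table."""
    started = []
    emitted = set()
    for line in log_lines:
        verb = VERB_RE.search(line)
        if verb:
            started.append(verb.group(1))
        table = EMIT_RE.search(line)
        if table:
            emitted.add(table.group(1))
    # ASSUMPTION: verb name == emitted table name, as in the excerpt above.
    return [name for name in started if name not in emitted]
```

Run over the excerpt above, it should report only `create_final_communities`, which matches the point where the log ends.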

The console logs of the indexing process:
🚀 create_base_entity_graph
Empty DataFrame
Columns: []
Index: []
🚀 create_final_entities
id ...
text_unit_ids
0 343608ad7d7c4fa39da9a88fd2a3b8ca ... [02319d80bb713c9f6c640f773c708741,
03dd802a1d7...
1 3db036391c0d4087ac9207cacfc5ec1f ... [01e4ec4919af52049b9d64dbacddb928,
023c53a01d5...
2 b6892a22015642818ac9cbbb9177a2d7 ... [01e4ec4919af52049b9d64dbacddb928,
1004f29fb1b...
3 1c506f44d0b241389a448c05da95b4a9 ... [01e4ec4919af52049b9d64dbacddb928,
1004f29fb1b...
4 59b4120d9d8f47deb01a48b15f6ee21d ... [01e4ec4919af52049b9d64dbacddb928,
1004f29fb1b...
... ... ...
...
51794 3ef51aa2da8e43d183fa475ba9b297b1 ...
[8bbf36533491cf03d437b764c34c2ff3]
51795 86ce6a224d0a417a915175b4e0db9996 ...
[8bbf36533491cf03d437b764c34c2ff3]
51796 5a2f78535b374c80a3122f66122dd7d0 ...
[8bbf36533491cf03d437b764c34c2ff3]
51797 bb090c45763e42ee9cebbe9a791f20ab ...
51798 c03c344d0c654f3eb185ae0e796b2542 ...

[51799 rows x 6 columns]
🚀 create_final_nodes
id human_readable_id ... x y
0 343608ad7d7c4fa39da9a88fd2a3b8ca 0 ... 0 0
1 3db036391c0d4087ac9207cacfc5ec1f 1 ... 0 0
2 b6892a22015642818ac9cbbb9177a2d7 2 ... 0 0
3 1c506f44d0b241389a448c05da95b4a9 3 ... 0 0
4 59b4120d9d8f47deb01a48b15f6ee21d 4 ... 0 0
... ... ... ... .. ..
424637 393e72eb2f3446bc9d297d00a45ead34 48149 ... 0 0
424638 e6fe4be9ed9c44b19f17315e12dc6b62 48150 ... 0 0
424639 f4ad76c0185844e4af7691cd71566350 48151 ... 0 0
424640 03c512ac03ee4ca3b144e2daed4b9316 48152 ... 0 0
424641 aeb44e155e6d440e9ec32fc0ec21f926 48153 ... 0 0

[149548 rows x 8 columns]
2025-02-04 16:01:22,190 - INFO - Finished the graph creation.

The line "2025-02-04 16:01:22,190 - INFO - Finished the graph creation." is my own log message, which gets sent after the process stops.
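
Because the process stopped without any error, it may help to record how it ended. The sketch below assumes the indexer is launched as a child process from a wrapper script, which this report does not confirm; `describe_exit` and `run_and_report` are hypothetical helper names, and the command to run is left to the caller. On POSIX systems a negative return code from `subprocess.run` means the child was terminated by a signal, for example when the operating system kills a process that ran out of memory.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable message."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # POSIX: a negative value is the number of the terminating signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (returncode, description)."""
    proc = subprocess.run(cmd)
    return proc.returncode, describe_exit(proc.returncode)
```

Logging the returned description next to the "Finished the graph creation." message would show whether the indexer ended on its own or was terminated from outside.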

Additional Information

GraphRAG Version:
Operating System:
Python Version:
Related Issues:

MarkusGutjahr added the "bug" (Something isn't working) and "triage" (Default label assignment, indicates new issue needs reviewed by a maintainer) labels on Feb 5, 2025
