
[Issue]: <title> Always stuck here create_base_entity_graph #1602

Open
3 tasks done
Redhair957 opened this issue Jan 9, 2025 · 2 comments
Labels
triage Default label assignment, indicates new issue needs reviewed by a maintainer

Comments

@Redhair957

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

⠸ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_final_documents
└── create_base_entity_graph

Steps to reproduce

No response

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_CHAT_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: ${GRAPHRAG_CHAT_MODEL}
  model_supports_json: false # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 1 # temperature for sampling
  # top_p: 0.8 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  # batch_size: 16 # the number of documents to send in a single request
  # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default # A prefix for the vector store to create embedding containers. Default: 'default'.
    overwrite: true
  # vector_store: # configuration for AI Search
    # type: azure_ai_search
    # url: <ai_search_endpoint>
    # api_key: <api_key> # if not set, will attempt to use managed identity. Expects the `Search Index Data Contributor` RBAC role in this case.
    # audience: <optional> # if using managed identity, the audience to use for the token
    # overwrite: true # or false. Only applicable at index creation time
    # container_name: default # A prefix for the AzureAISearch to create indexes. Default: 'default'.
  llm:
    api_key: ${GRAPHRAG_CHAT_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: ${GRAPHRAG_API_BASE}
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made

chunks:
  size: 600
  overlap: 50
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

#update_index_storage: # Storage to save an updated index (for incremental indexing). Enabling this performs an incremental index run
#   type: file # or blob
#   base_dir: "update_output"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "logs"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## strategy: fully override the entity extraction strategy.
  ##   type: one of graph_intelligence, graph_intelligence_json and nltk
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [业务办理项编码, 实施主体, 事项名称, 实施主体编码, 权力来源] # business item code, implementing body, item name, implementing body code, source of authority
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 200

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 1000
  max_input_length: 2000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

No response

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
Redhair957 added the triage label on Jan 9, 2025
@It-Is-Ishank

Same issue. These are the errors I get in the indexing log:

  1. INFO Error Invoking LLM
  2. ERROR error extracting graph
     Traceback (most recent call last):
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\operations\extract_entities\graph_extractor.py", line 127, in __call__
         result = await self._process_document(text, prompt_variables)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\operations\extract_entities\graph_extractor.py", line 155, in _process_document
         response = await self._llm(
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\openai\llm\chat.py", line 83, in __call__
         return await self._text_chat_llm(prompt, **kwargs)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\openai\llm\features\tools_parsing.py", line 120, in __call__
         return await self._delegate(prompt, **kwargs)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\base\base.py", line 112, in __call__
         return await self._invoke(prompt, **kwargs)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\base\base.py", line 128, in _invoke
         return await self._decorated_target(prompt, **kwargs)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\services\json.py", line 71, in invoke
         return await delegate(prompt, **kwargs)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\services\retryer.py", line 109, in invoke
         result = await execute_with_retry()
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\fnllm\services\retryer.py", line 106, in execute_with_retry
         raise RetriesExhaustedError(name, self._max_retries)
     fnllm.services.errors.RetriesExhaustedError: Operation 'chat' failed - 10 retries exhausted.
  3. INFO Entity Extraction Error
     ERROR error running workflow extract_graph
     Traceback (most recent call last):
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\run\run_workflows.py", line 166, in _run_workflows
         result = await run_workflow(
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\workflows\extract_graph.py", line 45, in run_workflow
         base_entity_nodes, base_relationship_edges = await extract_graph(
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\flows\extract_graph.py", line 33, in extract_graph
         entities, relationships = await extract_entities(
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\operations\extract_entities\extract_entities.py", line 136, in extract_entities
         entities = _merge_entities(entity_dfs)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\graphrag\index\operations\extract_entities\extract_entities.py", line 168, in _merge_entities
         all_entities.groupby(["title", "type"], sort=False)
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\pandas\core\frame.py", line 9183, in groupby
         return DataFrameGroupBy(
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\pandas\core\groupby\groupby.py", line 1329, in __init__
         grouper, exclusions, obj = get_grouper(
       File "c:\Users\ikapur\Desktop\testing\.venv\lib\site-packages\pandas\core\groupby\grouper.py", line 1043, in get_grouper
         raise KeyError(gpr)
     KeyError: 'title'
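A plausible reading of the two errors together: every extraction call fails with RetriesExhaustedError, so the per-document entity DataFrames arrive empty with no columns, and the subsequent `groupby(["title", "type"])` in `_merge_entities` raises the `KeyError: 'title'` seen above. A minimal sketch of that grouping step (hypothetical reproduction, not GraphRAG's actual code beyond the `groupby` call quoted in the traceback):

```python
import pandas as pd

# Simulate the case where no LLM extraction succeeded: the list of
# per-document entity DataFrames contains only empty frames with no columns.
entity_dfs = [pd.DataFrame(), pd.DataFrame()]
all_entities = pd.concat(entity_dfs, ignore_index=True)

try:
    # The merged frame has no "title"/"type" columns, so pandas raises
    # the same KeyError shown at the bottom of the traceback.
    all_entities.groupby(["title", "type"], sort=False)
except KeyError as exc:
    print(f"KeyError: {exc}")
```

If this reading is right, the KeyError is a symptom: the underlying problem is the LLM endpoint failing until retries are exhausted, which is worth verifying first (API key, api_base, model name).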

@Lyzin

Lyzin commented Feb 8, 2025

Same issue. Is there a solution?
