Can't import Wikidata - Either becomes idle without finishing or using resources, or throws a DEADLOCK IMMINENT error #695

@brett--anderson

Description

I'm trying to import the full Wikidata dump (possibly restricting to just the English attributes) into the KGTK format for further analysis. The process runs for a few hours. I can see from the terminal that some of the processes have reached about 1.4 million lines processed (I'm not sure how many there are in total). While it runs I watch the system resources and see several kgtk processes using most of the machine's memory between them. The number of kgtk processes drops over time. Now there are only two kgtk processes, neither using even 1% of memory, and there is no CPU activity. The import has effectively stopped, yet the command still displays its last progress output. So it seems it's still running but has ceased to do anything, and it has been in this state for at least an hour.

To Reproduce
1. Install KGTK under Python 3.9.15 in a local conda env.
2. Download the Wikidata JSON dump (~70 GB compressed; mine is less than 12 months old).
3. Activate the conda env.
4. Run the command:

kgtk --debug --timing --progress import-wikidata \
        -i latest-all.json.bz2 \
        --node nodefile.tsv \
        --edge edgefile.tsv \
        --qual qualfile.tsv \
        --use-mgzip-for-input True \
        --use-mgzip-for-output True \
        --use-shm True \
        --procs 6 \
        --mapper-batch-size 5 \
        --max-size-per-mapper-queue 3 \
        --single-mapper-queue True \
        --collector-batch-size 10 \
        --collector-queue-per-proc-size 3 \
        --progress-interval 50000 --fail-if-missing False
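A hedged suggestion for narrowing the problem down (file names and the line count are just examples, not KGTK tooling): slice a small prefix of the dump into its own .bz2 and run the same import command against it, so a hang or crash reproduces in minutes instead of hours. This assumes the dump's one-entity-per-line JSON layout survives a line-wise prefix well enough for a smoke test.

```python
import bz2

def sample_bz2(src: str, dst: str, max_lines: int = 100_000) -> int:
    """Stream-copy up to max_lines lines from src to dst (both bz2)."""
    n = 0
    with bz2.open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
        for line in fin:
            fout.write(line)
            n += 1
            if n >= max_lines:
                break
    return n

if __name__ == "__main__":
    # The Wikidata dump is roughly one JSON entity per line, so a
    # line-wise prefix gives import-wikidata something representative.
    print(sample_bz2("latest-all.json.bz2", "sample.json.bz2"))
```

Then point `-i` at `sample.json.bz2` with the same flags and see whether the stall still occurs.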

Expected behavior
The process should keep running and using system resources (indicating that it's doing something) until all the Wikidata has been converted to TSV format, or it should throw a useful error.

  • OS: AWS EC2 t3.2xlarge instance: Ubuntu 22.04 LTS (32 GB memory, 10 GB swap, 2 TB HD, 8 vCPUs)
  • KGTK 1.5.0 (installed in a local conda env)
  • Python 3.9.15

Additional context
Not sure if the problem is caused by memory leaks or a deadlock issue 🤷‍♂️
After manually killing the process (Ctrl+C), the following traceback is printed three times:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs) 
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/kgtk/cli/import_wikidata.py", line 1897, in run
    action, nrows, erows, qrows, invalid_erows, invalid_qrows, header = collector_q.get()
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 809, in get
    src_pid, msg_id, block_id, total_chunks, next_chunk_block_id = self.next_readable_msg(block, remaining_timeout) # This call might raise Empty.
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 602, in next_readable_msg
    block_id: typing.Optional[int] = self.get_first_msg(block=block, timeout=remaining_timeout)
  File "/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/site-packages/pyrallel/queue.py", line 474, in get_first_msg
    self.msg_list_semaphore.acquire(block=block, timeout=timeout)
KeyboardInterrupt
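The bottom frames show the collector blocked in `semaphore.acquire` inside `collector_q.get()`. As a minimal sketch of why such a consumer can hang silently (this is generic `multiprocessing` code, not KGTK's): if a worker dies, e.g. is OOM-killed, before sending its "done" sentinel, a blocking `get()` never returns, whereas a `get()` with a timeout turns the hang into a detectable condition.

```python
import multiprocessing as mp
import queue
import time

def producer(q: mp.Queue) -> None:
    q.put("batch-1")
    time.sleep(0.2)        # let the queue's feeder thread flush the item
    raise SystemExit(1)    # dies before sending the None sentinel

def consume(q: mp.Queue, timeout: float = 2.0) -> list:
    """Drain q; the timeout converts a dead-producer hang into a marker."""
    out = []
    while True:
        try:
            msg = q.get(timeout=timeout)  # raises queue.Empty, never hangs
        except queue.Empty:
            out.append("TIMED OUT: producer likely dead")
            break
        if msg is None:                   # sentinel: clean shutdown
            break
        out.append(msg)
    return out

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q,))
    p.start()
    print(consume(q))   # ['batch-1', 'TIMED OUT: producer likely dead']
    p.join()
```

A plain `q.get()` here would block forever, which matches the observed symptom: live processes, no CPU, no progress.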

The tracebacks are followed by this warning:
/home/ubuntu/anaconda3/envs/kgtk-env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 76 leaked shared_memory objects to clean up at shutdown
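For context on that warning (a general Python note, not KGTK-specific): each `multiprocessing.shared_memory` segment must be `unlink()`ed by exactly one process. A worker that hangs or is killed never reaches its `unlink()`, so the segment lingers in `/dev/shm` until the resource tracker, or a manual cleanup, reclaims it, which is presumably what the "76 leaked shared_memory objects" message is counting.

```python
from multiprocessing import shared_memory

# Create a segment; on Linux this is backed by a file /dev/shm/<name>.
seg = shared_memory.SharedMemory(create=True, size=1024)
seg.buf[:5] = b"hello"   # workers would attach by seg.name and read this

# Clean shutdown: close this process's mapping, then unlink the segment.
# A worker killed before reaching unlink() is what "leaks" a segment.
seg.close()
seg.unlink()
```

With `--use-shm True`, checking `/dev/shm` after a run like this may show the leftover segments.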
