
RuntimeError: append_alleles_batch #267

Open
cindywen96 opened this issue Dec 20, 2024 · 5 comments · May be fixed by #268

@cindywen96

Hi, I'm using simgenotype to generate N=10k admixed samples from the 1KG reference. The breakpoint simulation completes immediately, but writing the admix.pgen file takes a lot of memory, and I keep getting the following error. Also, the number of written variants seems to change with the number of reference variants. Could you advise on how much memory this takes, and whether there is a limit on the number of variants pgen.append_alleles_batch can handle? Thank you!

[    INFO] Using seed 1234 (sim_genotype.py:359)
[    INFO] Simulating generation 1 (sim_genotype.py:457)
[    INFO] Simulating generation 2 (sim_genotype.py:463)
[    INFO] Simulating generation 3 (sim_genotype.py:463)
[    INFO] Simulating generation 4 (sim_genotype.py:463)
[    INFO] Simulating generation 5 (sim_genotype.py:463)
[    INFO] Simulating generation 6 (sim_genotype.py:463)
[    INFO] Simulating generation 7 (sim_genotype.py:463)
[    INFO] Simulating generation 8 (sim_genotype.py:463)
[    INFO] Simulating generation 9 (sim_genotype.py:463)
[    INFO] Simulating generation 10 (sim_genotype.py:463)
[    INFO] Outputting breakpoint file ./admix/admix.1.bp (sim_genotype.py:503)
[    INFO] Outputting file ./admix/admix.1.pgen (sim_genotype.py:75)
[    INFO] Reading genotypes from 2575 samples and 209824 variants in chunks of size 209824 variants (genotypes.py:1314)
[    INFO] Writing PVAR records (genotypes.py:1477)
[    INFO] Writing genotypes from 10000 samples and 209824 variants in chunks of size 209824 variants (genotypes.py:1553)
Traceback (most recent call last):
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/data/genotypes.py", line 1582, in write
    pgen.append_alleles_batch(
  File "src/pgenlib/pgenlib.pyx", line 1891, in pgenlib.PgenWriter.append_alleles_batch
  File "src/pgenlib/pgenlib.pyx", line 1940, in pgenlib.PgenWriter.append_alleles_batch
RuntimeError: append_alleles_batch called with allele codes >= allele_ct_limit; you may need to construct the PgenWriter with a higher allele_ct_limit setting

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/unix/zwen/.local/bin/haptools", line 8, in <module>
    sys.exit(main())
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/__main__.py", line 298, in simgenotype
    output_vcf(
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/sim_genotype.py", line 223, in output_vcf
    gts.write()
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/data/genotypes.py", line 1598, in write
    gc.collect()
  File "src/pgenlib/pgenlib.pyx", line 2061, in pgenlib.PgenWriter.__exit__
  File "src/pgenlib/pgenlib.pyx", line 2062, in pgenlib.PgenWriter.__exit__
  File "src/pgenlib/pgenlib.pyx", line 2050, in pgenlib.PgenWriter.close
RuntimeError: PgenWriter.close() called when number of written variants (209381) unequal to initially declared value (209824).
aryarm linked pull request #268 on Dec 20, 2024 that will close this issue
@aryarm

aryarm commented Dec 20, 2024

Hi @cindywen96,

First, thank you so much for reporting this issue! A few thoughts:

Regarding your issue about memory:

- Internally, haptools must store a genotype matrix (a numpy array) of size num_variants x num_samples x 3 bytes. So writing out 10000 samples and 209824 variants requires 209824 x 10000 x 3 bytes ≈ 6.3 GB.
- An unfortunate limitation of the pgenlib library that we use for writing PGEN files is that the bytes in the array must be converted to a higher-memory data type before they can be written (see chrchang/plink-ng#234). The higher-memory data type requires four times the original memory, so 6.3 x 4 = 25.2 GB :(
- To work around this, we've implemented a --chunk-size N parameter which can save you memory (at the expense of time) by performing the conversion to the higher-memory data type for only N variants at a time. The --chunk-size parameter is currently available in other haptools commands but not simgenotype, so we've drafted code that adds it to simgenotype: #268
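To make the chunking idea concrete, here is a minimal standalone sketch of the approach described above: convert an int8 genotype matrix to the wider dtype only chunk_size variants at a time, so the 4x-larger buffer never holds more than one chunk. The `write_in_chunks` function and the writer callback are hypothetical illustrations, not haptools' actual implementation.

```python
import numpy as np

def write_in_chunks(genotypes: np.ndarray, chunk_size: int, writer):
    """Promote and write the genotype matrix chunk_size variants at a time."""
    num_variants = genotypes.shape[0]
    for start in range(0, num_variants, chunk_size):
        end = min(start + chunk_size, num_variants)
        # only this slice is promoted to the higher-memory dtype;
        # peak extra memory is chunk_size x num_samples x 2 x 4 bytes
        chunk = genotypes[start:end].astype(np.int32)
        writer(chunk)

# toy demo: 7 variants x 4 samples x 2 alleles, written in chunks of 3
gts = np.zeros((7, 4, 2), dtype=np.int8)
written = []
write_in_chunks(gts, 3, lambda chunk: written.append(chunk.shape[0]))
print(written)  # chunk sizes: [3, 3, 1]
```

The last chunk is smaller than chunk_size when the variant count is not an even multiple, which is why the loop recomputes `end` with `min`.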
Can you try out the new --chunk-size parameter and let us know if it resolves the issue? You can install the draft code like this:

pip install --upgrade --force-reinstall git+https://github.com/cast-genomics/haptools.git@feat/chunk-simgts

Regarding your issue with allele codes >= allele_ct_limit:

I think this was fixed in a recent version of haptools; perhaps this is occurring because pip has installed a newer version of the pgenlib library alongside an older version of haptools. Which versions of haptools and pgenlib are you using right now?

pip list | grep -E 'pgenlib|haptools'

In any case, I think this issue will also automatically get resolved when you try the draft code, which will have the newest changes to haptools anyway.

@cindywen96

Thank you so much for your prompt reply. It's very helpful.

I was using 0.4.2. I've now updated to 0.5.0 and added --chunk-size 10000, but I got a new error:

[    INFO] Reading genotypes from 2575 samples and 188321 variants in chunks of size 188321 variants (genotypes.py:1334)
[    INFO] Writing PVAR records (genotypes.py:1497)
[    INFO] Writing genotypes from 10000 samples and 188321 variants in chunks of size 10000 variants (genotypes.py:1584)
Traceback (most recent call last):
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/data/genotypes.py", line 1613, in write
    pgen.append_alleles_batch(
  File "src/pgenlib/pgenlib.pyx", line 1891, in pgenlib.PgenWriter.append_alleles_batch
  File "src/pgenlib/pgenlib.pyx", line 1942, in pgenlib.PgenWriter.append_alleles_batch
IndexError: index 10000 is out of bounds for axis 0 with size 10000
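For readers following along: an IndexError of this shape often comes from a chunk loop that indexes a chunk-sized buffer with an absolute position instead of a chunk-relative one. This is purely an illustrative sketch of that pattern (hypothetical code, not haptools' or pgenlib's actual internals):

```python
import numpy as np

chunk_size = 4
buf = np.empty(chunk_size, dtype=np.int32)  # buffer sized for one chunk

def buggy_fill(start, end):
    # bug: absolute variant index used against the chunk-sized buffer
    for i in range(start, end):
        buf[i] = i  # raises IndexError once i >= chunk_size

def fixed_fill(start, end):
    # fix: index relative to the chunk's start position
    for i in range(start, end):
        buf[i - start] = i

fixed_fill(4, 8)  # second chunk: variants 4..7 land in buf[0..3]
print(buf.tolist())  # [4, 5, 6, 7]
try:
    buggy_fill(4, 8)
except IndexError as err:
    print("IndexError:", err)
```

Whether this particular pattern is the cause here would need to be confirmed against the pgenlib and haptools sources.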

@aryarm

aryarm commented Dec 21, 2024

Ah! I'm sorry about that. Hmm...

Can you help me replicate the error? Can you share your model file and the version of the 1000 Genomes reference that you're using? Also, can you share the command that you're running?

@cindywen96

Sure, here is the command. The 1KG ref is a subset of variants on chr1 (hg38):

haptools simgenotype \
    --ref_vcf hg38.unrelate.1.subset.pgen \
    --sample_info sampleSuperPop.tsv \
    --model model.dat \
    --mapdir ../genetic_map/ \
    --out admix.1.pgen \
    --seed 1234 \
    --chroms 1 \
    --chunk-size 10000

And here is the test data
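As a side note for anyone sizing jobs like this one: a rough peak-memory estimate follows from the formula aryarm described above (variants x samples x 3 bytes for the base matrix, plus a 4x-wider conversion buffer that is only chunk_size variants deep). The `estimate_peak_gb` helper and its exact accounting are a simplification for illustration, not part of haptools:

```python
def estimate_peak_gb(n_variants, n_samples, chunk_size=None):
    """Rough sketch: base int8 matrix plus the 4x-wider conversion buffer."""
    base = n_variants * n_samples * 3       # bytes for the genotype matrix
    chunk = chunk_size if chunk_size is not None else n_variants
    convert = chunk * n_samples * 3 * 4     # chunk promoted to a 4x-wider dtype
    return (base + convert) / 1e9

# 188321 variants x 10000 samples, as in the log above
print(round(estimate_peak_gb(188321, 10000), 1))         # whole matrix at once: 28.2
print(round(estimate_peak_gb(188321, 10000, 10000), 1))  # with --chunk-size 10000: 6.8
```

Under these assumptions, --chunk-size 10000 drops the conversion overhead from ~22.6 GB to ~1.2 GB, leaving the ~5.6 GB base matrix as the dominant cost.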

@aryarm

aryarm commented Dec 22, 2024

Can you also share the pvar and psam files that correspond with the pgen file? Those are also used by haptools whenever a pgen file is provided to it.
