Issue of vcf_to_zarr()'s .zarr output #785

Closed
LiangdeLI opened this issue Dec 15, 2021 · 12 comments · Fixed by #852
Labels
bug Something isn't working

Comments

@LiangdeLI

The .zarr output of vcf_to_zarr() is now much smaller than before: for a 201MB 1000 Genomes chromosome 21 VCF file, the new result is a 50MB zarr store, where it used to be 321MB. However, I found some problems with this new compressed version of the .zarr output.

For example, if I simply read the data and write it back out:

ds = sg.load_dataset("/mydata/chr21.zarr")
ds[['variant_id']].to_zarr("/mydata/output.zarr", compute=True)

This works for the old 321MB version, but the new 50MB version raises an error:

/users/Liangde/.local/lib/python3.8/site-packages/xarray/conventions.py:201: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 2031, in to_zarr
    return to_zarr(
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/api.py", line 1414, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/api.py", line 1124, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/zarr.py", line 555, in store
    self.set_variables(
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/zarr.py", line 634, in set_variables
    writer.add(v.data, zarr_array, region)
  File "/users/Liangde/.local/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in add
    target[region] = source
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1213, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1308, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1599, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1651, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1888, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1893, in _chunk_setitem_nosync
    cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 1952, in _process_for_setitem
    return self._encode_chunk(chunk)
  File "/users/Liangde/.local/lib/python3.8/site-packages/zarr/core.py", line 2001, in _encode_chunk
    chunk = f.encode(chunk)
  File "numcodecs/vlen.pyx", line 106, in numcodecs.vlen.VLenUTF8.encode
TypeError: expected unicode string, found 16
@percyfal

I have encountered this situation as well, in a case where I read a VCF, convert it to zarr, load the zarr dataset to perform some operations (LD pruning in this case), and then try to save, which throws the error above. I did manage to identify a solution, which I post here for reference. I'm pretty new to the xarray data structures, so let me know whether this makes sense.

Upon reading the zarr dataset (converted from VCF), the dtypes of the Dataset's DataArrays 'variant_id', 'variant_allele' and 'sample_id' are 'O'; converting these to e.g. dtype=str does the trick. Apparently sgkit.io.vcf.vcf_to_zarr sets dtype="O" on these DataArray objects, which somehow causes problems for xarray encoding.

An alternative solution is to note that vcf_to_zarr sets the encoding in an elaborate way; printing this encoding and using it with sgkit.save_dataset also works.

I'm unsure whether resetting the dtype is the proper solution, but I note that when simulating a dataset with sgkit.simulate_genotype_call_dataset the dtypes are |S1 for 'variant_allele' and <U4 for 'sample_id'.
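
A minimal sketch of the workaround described above (the path is illustrative; the variable names follow the sgkit VCF schema): load the object-dtype string variables into memory so that astype(str) can pick a suitable fixed width, then save.

import sgkit as sg

ds = sg.load_dataset("data.vcf.gz.zarr")  # illustrative path

# Coerce the variable-length (object dtype) string variables to a
# fixed-width numpy unicode dtype before writing.
for var in ("variant_id", "variant_allele", "sample_id"):
    if var in ds and ds[var].dtype == object:
        ds[var] = ds[var].load().astype(str)

sg.save_dataset(ds, "output.zarr", mode="w")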

@jeromekelleher
Collaborator

Thanks for digging in here @percyfal. Numpy strings are always a pain 😢

Any thoughts here on what's going on @tomwhite?

@percyfal

No problem @jeromekelleher, I have some more info if needed. For some reason the encoding is given two passes in zarr.core.Array._encode_chunk:

def _encode_chunk(self, chunk):
    # apply filters
    if self._filters:
        for f in self._filters:
            chunk = f.encode(chunk)

such that chunk is of type bytearray the second time around.
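
For what it's worth, the symptom can be reproduced in isolation; this is just an illustration of the failure mode, not sgkit code. Passing an already-encoded byte buffer back through numcodecs' VLenUTF8 makes it iterate over raw byte values instead of strings, which is exactly the "expected unicode string, found 16"-style error above:

import numpy as np
from numcodecs import VLenUTF8

codec = VLenUTF8()
ids = np.array(["rs1", "rs2"], dtype=object)

encoded = codec.encode(ids)        # first pass: object array of str -> bytes
codec.encode(bytearray(encoded))   # second pass over the byte buffer raises
# TypeError: expected unicode string, found <an integer byte value>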

@tomwhite
Collaborator

I agree that strings are hard to get right with NumPy - adding Zarr and Xarray makes things even more constrained.

In general, the approach we take is to store strings with dtype O and Zarr's numcodecs.VLenUTF8 encoding for variable-length strings. See #665 and https://github.com/pystatgen/sgkit/blob/f39e1b1bc3b16d05c5043ab5d445076424dad229/sgkit/io/vcfzarr_reader.py#L360-L362

#643 and pydata/xarray#5769 are related too.

Quoting @percyfal: "An alternative solution is to note that vcf_to_zarr sets the encoding in an elaborate way; printing this encoding and using it with sgkit.save_dataset also works."

Perhaps we could make this easier to do, or automate it somehow?
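
For readers unfamiliar with that pattern, here is a small standalone illustration (plain zarr v2 API, not sgkit code) of storing variable-length strings with dtype O and a VLenUTF8 object codec; the array name and values are made up:

import numpy as np
import zarr
from numcodecs import VLenUTF8

# Object-dtype zarr array whose elements are variable-length UTF-8 strings,
# encoded on disk by the VLenUTF8 object codec.
z = zarr.open(
    "variant_id_example.zarr",
    mode="w",
    shape=(3,),
    chunks=(3,),
    dtype=object,
    object_codec=VLenUTF8(),
)
z[:] = np.array(["rs123", "rs4", "rs56789"], dtype=object)
print(z[:])  # ['rs123' 'rs4' 'rs56789'], dtype=object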

@percyfal

Hi @tomwhite, thanks for the heads up! I'm revisiting this now that I have another dataset I wish to run a PCA on and then save. However, even loading a VCF and then trying to save it fails:

import sgkit as sg
import sgkit.io.vcf as sgvcf
sgvcf.vcf_to_zarr("data.vcf.gz", "data.vcf.gz.zarr")
ds = sg.load_dataset("data.vcf.gz.zarr")
sg.save_dataset(ds, "foo.zarr", mode="w")

...
File numcodecs/vlen.pyx:106, in numcodecs.vlen.VLenUTF8.encode()

TypeError: expected unicode string, found 16
Full trace

NB: there are some added print statements here and there

<xarray.Variable (variants: 8968, alleles: 4)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_allele, shape=(8968, 4), dtype=object, chunksize=(8968, 4), chunktype=numpy.ndarray>
Attributes:
    comment:  The possible alleles for the variant.
<xarray.Variable (samples: 15)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dsample_id, shape=(15,), dtype=object, chunksize=(15,), chunktype=numpy.ndarray>
Attributes:
    comment:  The unique identifier of the sample.
<xarray.Variable (variants: 8968)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_id, shape=(8968,), dtype=object, chunksize=(8968,), chunktype=numpy.ndarray>
Attributes:
    comment:  The unique identifier of the variant.
/home/peru/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/conventions.py:206: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  "variable {} has data in the form of a dask array with "

 sg.save_dataset(ds, "foo.zarr", mode="w")
<xarray.Variable (variants: 8968, alleles: 4)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_allele, shape=(8968, 4), dtype=object, chunksize=(8968, 4), chunktype=numpy.ndarray>
Attributes:
    comment:  The possible alleles for the variant.
<xarray.Variable (samples: 15)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dsample_id, shape=(15,), dtype=object, chunksize=(15,), chunktype=numpy.ndarray>
Attributes:
    comment:  The unique identifier of the sample.
<xarray.Variable (variants: 8968)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_id, shape=(8968,), dtype=object, chunksize=(8968,), chunktype=numpy.ndarray>
Attributes:
    comment:  The unique identifier of the variant.
[VLenUTF8(), VLenUTF8()] VLenUTF8() <class 'bytearray'>
Weird chunk:  <class 'bytearray'>
/home/peru/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/conventions.py:206: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  "variable {} has data in the form of a dask array with "
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [73], in <cell line: 1>()
----> 1 sg.save_dataset(ds, "foo.zarr", mode="w")

File ~/dev/python/sgkit/sgkit/io/dataset.py:41, in save_dataset(ds, store, storage_options, **kwargs)
     38 for v in ds:
     39     # Workaround for https://github.com/pydata/xarray/issues/4380
     40     ds[v].encoding.pop("chunks", None)
---> 41 ds.to_zarr(store, **kwargs)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/core/dataset.py:2036, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   2033 if encoding is None:
   2034     encoding = {}
-> 2036 return to_zarr(
   2037     self,
   2038     store=store,
   2039     chunk_store=chunk_store,
   2040     storage_options=storage_options,
   2041     mode=mode,
   2042     synchronizer=synchronizer,
   2043     group=group,
   2044     encoding=encoding,
   2045     compute=compute,
   2046     consolidated=consolidated,
   2047     append_dim=append_dim,
   2048     region=region,
   2049     safe_chunks=safe_chunks,
   2050 )

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/api.py:1431, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   1429 writer = ArrayWriter()
   1430 # TODO: figure out how to properly handle unlimited_dims
-> 1431 dump_to_store(dataset, zstore, writer, encoding=encoding)
   1432 writes = writer.sync(compute=compute)
   1434 if compute:

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/api.py:1119, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1116 if encoder:
   1117     variables, attrs = encoder(variables, attrs)
-> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/zarr.py:533, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    530     self.set_attributes(attributes)
    531     self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
--> 533 self.set_variables(
    534     variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
    535 )
    536 if self._consolidate_on_close:
    537     zarr.consolidate_metadata(self.zarr_group.store)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/zarr.py:611, in ZarrStore.set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    608     zarr_array.resize(new_shape)
    610 region = tuple(write_region[dim] for dim in dims)
--> 611 writer.add(v.data, zarr_array, region)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/common.py:154, in ArrayWriter.add(self, source, target, region)
    152 else:
    153     if region:
--> 154         target[region] = source
    155     else:
    156         target[...] = source

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1224, in Array.__setitem__(self, selection, value)
   1145 """Modify data for an item or region of the array.
   1146 
   1147 Parameters
   (...)
   1220 
   1221 """
   1223 fields, selection = pop_fields(selection)
-> 1224 self.set_basic_selection(selection, value, fields=fields)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1319, in Array.set_basic_selection(self, selection, value, fields)
   1317     return self._set_basic_selection_zd(selection, value, fields=fields)
   1318 else:
-> 1319     return self._set_basic_selection_nd(selection, value, fields=fields)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1610, in Array._set_basic_selection_nd(self, selection, value, fields)
   1604 def _set_basic_selection_nd(self, selection, value, fields=None):
   1605     # implementation of __setitem__ for array with at least one dimension
   1606 
   1607     # setup indexer
   1608     indexer = BasicIndexer(selection, self)
-> 1610     self._set_selection(indexer, value, fields=fields)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1682, in Array._set_selection(self, indexer, value, fields)
   1679             cv = chunk_value[item]
   1680         chunk_values.append(cv)
-> 1682 self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
   1683                      fields=fields)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1871, in Array._chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields)
   1869 def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
   1870     ckeys = [self._chunk_key(co) for co in lchunk_coords]
-> 1871     cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
   1872               for key, sel, val in zip(ckeys, lchunk_selection, values)]
   1873     values = {k: v for k, v in zip(ckeys, cdatas)}
   1874     self.chunk_store.setitems(values)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1871, in <listcomp>(.0)
   1869 def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
   1870     ckeys = [self._chunk_key(co) for co in lchunk_coords]
-> 1871     cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
   1872               for key, sel, val in zip(ckeys, lchunk_selection, values)]
   1873     values = {k: v for k, v in zip(ckeys, cdatas)}
   1874     self.chunk_store.setitems(values)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1963, in Array._process_for_setitem(self, ckey, chunk_selection, value, fields)
   1960         chunk[chunk_selection] = value
   1962 # encode chunk
-> 1963 return self._encode_chunk(chunk)

File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:2014, in Array._encode_chunk(self, chunk)
   2012             print(self._filters, f, type(chunk))
   2013             print("Weird chunk: ", type(chunk))
-> 2014         chunk = f.encode(chunk)
   2016 # check object encoding
   2017 if ensure_ndarray(chunk).dtype == object:

File numcodecs/vlen.pyx:106, in numcodecs.vlen.VLenUTF8.encode()

TypeError: expected unicode string, found 16

I can get around this by making an encoding function (draft) based on the code in vcf_reader:

import warnings
from typing import Hashable, Optional

from numcodecs import PackBits
from xarray import Dataset

try:
    from numcodecs import Blosc

    DEFAULT_COMPRESSOR = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
except ImportError:  # pragma: no cover
    warnings.warn("Cannot import Blosc, falling back to no compression", RuntimeWarning)
    DEFAULT_COMPRESSOR = None

def encode(ds: Dataset, encoding: Optional[dict] = None):
    chunk_width = 1_000
    chunk_length = 10_000
    compressor = DEFAULT_COMPRESSOR

    def get_chunk_size(dim: Hashable, size: int) -> int:
        if dim == "variants":
            return chunk_length
        elif dim == "samples":
            return chunk_width
        else:
            return size

    default_encoding = {}
    for var in ds.data_vars:
        var_chunks = tuple(
            get_chunk_size(dim, size)
            for (dim, size) in zip(ds[var].dims, ds[var].shape)
        )
        default_encoding[var] = dict(
            chunks=var_chunks, compressor=compressor
        )
        if ds[var].dtype.kind == "b":
            # ensure that booleans are not stored as int8 by xarray https://github.com/pydata/xarray/issues/4386
            ds[var].attrs["dtype"] = "bool"
            default_encoding[var]["filters"] = [PackBits()]

    # values from function args (encoding) take precedence over default_encoding
    encoding = encoding or {}
    merged_encoding = {**default_encoding, **encoding}
    return merged_encoding

Saving the dataset above then would go along these lines:

sg.save_dataset(ds, "foo.zarr", encoding=encode(ds), mode="w")
/home/peru/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/conventions.py:206: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  "variable {} has data in the form of a dask array with "

There is still the warning, and I haven't figured out where the None variable comes from. Would a helper function like this be something to consider?

@tomwhite
Collaborator

@percyfal that looks like a decent approach to get things working. I'd like to understand this better though, as your original code should work as-is, and the fact that it doesn't looks like a bug to me. Are you able to share the VCF or extract a cutdown/anonymised version as a minimal reproducible example?

tomwhite added the bug label on Apr 28, 2022
@percyfal

The input VCF is actually simulated using stdpopsim and tskit:

stdpopsim HomSap -d OutOfAfrica_3G09 -c chr21  -o ooa.chr21.ts -l 0.1 -s 42 10 10 10
tskit vcf --ploidy 2 ooa.chr21.ts | sed -e "s/contig=<ID=1/contig=<ID=chr21/g" | sed -e "s/^1/chr21/g" |  bcftools view -O z -o ooa.chr21.vcf.gz

An MWE in Python is then:

import sgkit as sg
import sgkit.io.vcf as sgvcf
sgvcf.vcf_to_zarr("ooa.chr21.vcf.gz", "ooa.chr21.vcf.gz.zarr")
ds = sg.load_dataset("ooa.chr21.vcf.gz.zarr")
sg.save_dataset(ds, "foo.zarr", mode="w")

which fails with the error above. The simulated output encodes REF/ALT as 0/1, which e.g. gatk ValidateVariants complains about. I also tried the code on a "regular" VCF, with the same error.

@tomwhite
Collaborator

tomwhite commented May 5, 2022

Thanks for the MWE @percyfal! I have been trying to get to the bottom of this, and I think there are at least two issues. (The ref/alt 0/1 encoding is not the problem, as you point out, since the same thing happens with any VCF.)

The root cause seems to be this long-standing issue in xarray: pydata/xarray#3476. I think we can solve this in sgkit.save_dataset, see https://github.com/tomwhite/sgkit/tree/save-dataset-double-encoding-fix for a fix. (Does this work for you?)
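
For reference, a sketch of the general shape of that kind of fix in user code (I'm not claiming this is the exact change on the branch above): drop the per-variable encoding that xarray carried over from the opened zarr store, so the VLenUTF8 filter is not applied a second time on write.

import sgkit as sg

ds = sg.load_dataset("ooa.chr21.vcf.gz.zarr")

# Clear the encoding (filters, compressor, chunks) inherited from the source
# store so xarray computes a fresh encoding when writing, instead of applying
# the VLenUTF8 filter on top of already-filtered data.
for var in ds.variables:
    ds[var].encoding = {}

sg.save_dataset(ds, "foo.zarr", mode="w")

The trade-off is that the output store no longer inherits the compressor and chunk settings of the original, which is where a helper like the encode() function above would come in.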

The warning is still a problem though, and I'm not sure how to fix it yet. The warning is coming from xarray:

xarray.coding.variables.SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.

We don't get this warning in the initial vcf_to_zarr code since the arrays are not Dask arrays. However, when we load a dataset, the arrays are Dask arrays, so we get the warning. I opened this issue a while back, but still don't have a good workaround or fix. Since it is a warning, it shouldn't stop you as long as you are not working with very large datasets.

@percyfal

percyfal commented May 5, 2022

@tomwhite your proposed fix solves my problem, thanks, and presumably also the originally posted issue.

Cheers,

Per

@tomwhite
Collaborator

tomwhite commented May 5, 2022

Great - thanks for trying it out.

mergify bot closed this as completed in #852 on May 5, 2022
@hammer
Contributor

hammer commented May 5, 2022

Great detective work @tomwhite! Should we file an issue to track the upstream Xarray + Zarr issue?

@tomwhite
Collaborator

tomwhite commented May 6, 2022

Yes, I opened #853 for the remaining problem.
