# Issue of vcf_to_zarr()'s .zarr output #785
I have encountered this situation as well, in a case where I read a VCF, convert it to Zarr, load the Zarr dataset to perform some operations (LD pruning in this case), and then try to save, which throws the error above. I did manage to identify a workaround, which I post here for reference; I'm pretty new to the xarray data structures, so let me know whether this makes sense. Upon reading the Zarr dataset (converted from VCF), the dtype of the DataArrays `variant_id`, `variant_allele` and `sample_id` is `'O'` (object); converting these to e.g. `dtype=str` does the trick, as in the sketch below. I'm unsure whether resetting the dtype is the proper solution, though.
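A minimal sketch of that workaround, assuming the variable names above (the paths are illustrative):

```python
import sgkit as sg

ds = sg.load_dataset("data.vcf.gz.zarr")  # illustrative path

# Coerce the object-dtype string variables to fixed-width unicode.
# .load() pulls these (small) arrays into memory so NumPy can pick a width.
for var in ["variant_id", "variant_allele", "sample_id"]:
    ds[var] = ds[var].load().astype(str)

sg.save_dataset(ds, "pruned.zarr", mode="w")  # illustrative output path
```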
No problem @jeromekelleher, I have some more info if needed. For some reason the encoding is given two passes in `zarr.core.Array._encode_chunk`:

```python
def _encode_chunk(self, chunk):
    # apply filters
    if self._filters:
        for f in self._filters:
            chunk = f.encode(chunk)
```

such that the second `VLenUTF8` filter receives the already-encoded `bytearray` from the first pass and fails.
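The failure is easy to reproduce directly with numcodecs; a sketch (the integer in the error message is just the first byte value the codec encounters):

```python
import numpy as np
from numcodecs import VLenUTF8

codec = VLenUTF8()
arr = np.array(["A", "C", "G", "T"], dtype=object)

encoded = codec.encode(arr)  # first pass is fine: returns encoded bytes
# Second pass: the bytearray is viewed as a sequence of ints, so the codec
# raises TypeError: expected unicode string, found <int>.
codec.encode(bytearray(encoded))
```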
I agree that strings are hard to get right with NumPy, and adding Zarr and Xarray makes things even more constrained. In general, the approach we take is to store strings with a fixed-width dtype; #643 and pydata/xarray#5769 are related too.

Perhaps we could make this easier to do, or automate it somehow?
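For instance, a small helper along these lines might do it (a hypothetical sketch; `coerce_strings` is not an existing sgkit or xarray function, and it assumes the object-dtype variables hold strings):

```python
import xarray as xr


def coerce_strings(ds: xr.Dataset) -> xr.Dataset:
    """Coerce object-dtype variables to fixed-width unicode for Zarr."""
    for name in list(ds.variables):
        if ds[name].dtype == object:
            # .load() so NumPy can compute the maximum string width.
            ds[name] = ds[name].load().astype(str)
    return ds
```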
Hi @tomwhite, thanks for the heads up! I'm revisiting this now that I have another dataset I wish to perform a PCA on and then save. However, even just loading a VCF and then trying to save it fails:

```python
import sgkit as sg
import sgkit.io.vcf as sgvcf

sgvcf.vcf_to_zarr("data.vcf.gz", "data.vcf.gz.zarr")
ds = sg.load_dataset("data.vcf.gz.zarr")
sg.save_dataset(ds, "foo.zarr", mode="w")
```

```
...
File numcodecs/vlen.pyx:106, in numcodecs.vlen.VLenUTF8.encode()
TypeError: expected unicode string, found 16
```

Full trace (NB: there are some added print statements here and there):

```
<xarray.Variable (variants: 8968, alleles: 4)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_allele, shape=(8968, 4), dtype=object, chunksize=(8968, 4), chunktype=numpy.ndarray>
Attributes:
comment: The possible alleles for the variant.
<xarray.Variable (samples: 15)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dsample_id, shape=(15,), dtype=object, chunksize=(15,), chunktype=numpy.ndarray>
Attributes:
comment: The unique identifier of the sample.
<xarray.Variable (variants: 8968)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_id, shape=(8968,), dtype=object, chunksize=(8968,), chunktype=numpy.ndarray>
Attributes:
comment: The unique identifier of the variant.
/home/peru/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/conventions.py:206: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
"variable {} has data in the form of a dask array with "
sg.save_dataset(ds, "foo.zarr", mode="w")
<xarray.Variable (variants: 8968, alleles: 4)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_allele, shape=(8968, 4), dtype=object, chunksize=(8968, 4), chunktype=numpy.ndarray>
Attributes:
comment: The possible alleles for the variant.
<xarray.Variable (samples: 15)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dsample_id, shape=(15,), dtype=object, chunksize=(15,), chunktype=numpy.ndarray>
Attributes:
comment: The unique identifier of the sample.
<xarray.Variable (variants: 8968)>
dask.array<open_dataset-cd5ecb3964c37c4e0530eb3bdfe5306dvariant_id, shape=(8968,), dtype=object, chunksize=(8968,), chunktype=numpy.ndarray>
Attributes:
comment: The unique identifier of the variant.
[VLenUTF8(), VLenUTF8()] VLenUTF8() <class 'bytearray'>
Weird chunk: <class 'bytearray'>
/home/peru/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/conventions.py:206: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
"variable {} has data in the form of a dask array with "
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [73], in <cell line: 1>()
----> 1 sg.save_dataset(ds, "foo.zarr", mode="w")
File ~/dev/python/sgkit/sgkit/io/dataset.py:41, in save_dataset(ds, store, storage_options, **kwargs)
38 for v in ds:
     39 # Workaround for https://github.com/pydata/xarray/issues/4380
40 ds[v].encoding.pop("chunks", None)
---> 41 ds.to_zarr(store, **kwargs)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/core/dataset.py:2036, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
2033 if encoding is None:
2034 encoding = {}
-> 2036 return to_zarr(
2037 self,
2038 store=store,
2039 chunk_store=chunk_store,
2040 storage_options=storage_options,
2041 mode=mode,
2042 synchronizer=synchronizer,
2043 group=group,
2044 encoding=encoding,
2045 compute=compute,
2046 consolidated=consolidated,
2047 append_dim=append_dim,
2048 region=region,
2049 safe_chunks=safe_chunks,
2050 )
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/api.py:1431, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
1429 writer = ArrayWriter()
1430 # TODO: figure out how to properly handle unlimited_dims
-> 1431 dump_to_store(dataset, zstore, writer, encoding=encoding)
1432 writes = writer.sync(compute=compute)
1434 if compute:
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/api.py:1119, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
1116 if encoder:
1117 variables, attrs = encoder(variables, attrs)
-> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/zarr.py:533, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
530 self.set_attributes(attributes)
531 self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
--> 533 self.set_variables(
534 variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
535 )
536 if self._consolidate_on_close:
537 zarr.consolidate_metadata(self.zarr_group.store)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/zarr.py:611, in ZarrStore.set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
608 zarr_array.resize(new_shape)
610 region = tuple(write_region[dim] for dim in dims)
--> 611 writer.add(v.data, zarr_array, region)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/backends/common.py:154, in ArrayWriter.add(self, source, target, region)
152 else:
153 if region:
--> 154 target[region] = source
155 else:
156 target[...] = source
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1224, in Array.__setitem__(self, selection, value)
1145 """Modify data for an item or region of the array.
1146
1147 Parameters
(...)
1220
1221 """
1223 fields, selection = pop_fields(selection)
-> 1224 self.set_basic_selection(selection, value, fields=fields)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1319, in Array.set_basic_selection(self, selection, value, fields)
1317 return self._set_basic_selection_zd(selection, value, fields=fields)
1318 else:
-> 1319 return self._set_basic_selection_nd(selection, value, fields=fields)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1610, in Array._set_basic_selection_nd(self, selection, value, fields)
1604 def _set_basic_selection_nd(self, selection, value, fields=None):
1605 # implementation of __setitem__ for array with at least one dimension
1606
1607 # setup indexer
1608 indexer = BasicIndexer(selection, self)
-> 1610 self._set_selection(indexer, value, fields=fields)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1682, in Array._set_selection(self, indexer, value, fields)
1679 cv = chunk_value[item]
1680 chunk_values.append(cv)
-> 1682 self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
1683 fields=fields)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1871, in Array._chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields)
1869 def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
1870 ckeys = [self._chunk_key(co) for co in lchunk_coords]
-> 1871 cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
1872 for key, sel, val in zip(ckeys, lchunk_selection, values)]
1873 values = {k: v for k, v in zip(ckeys, cdatas)}
1874 self.chunk_store.setitems(values)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1871, in <listcomp>(.0)
1869 def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
1870 ckeys = [self._chunk_key(co) for co in lchunk_coords]
-> 1871 cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
1872 for key, sel, val in zip(ckeys, lchunk_selection, values)]
1873 values = {k: v for k, v in zip(ckeys, cdatas)}
1874 self.chunk_store.setitems(values)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:1963, in Array._process_for_setitem(self, ckey, chunk_selection, value, fields)
1960 chunk[chunk_selection] = value
1962 # encode chunk
-> 1963 return self._encode_chunk(chunk)
File ~/miniconda3/envs/pgtk/lib/python3.9/site-packages/zarr/core.py:2014, in Array._encode_chunk(self, chunk)
2012 print(self._filters, f, type(chunk))
2013 print("Weird chunk: ", type(chunk))
-> 2014 chunk = f.encode(chunk)
2016 # check object encoding
2017 if ensure_ndarray(chunk).dtype == object:
File numcodecs/vlen.pyx:106, in numcodecs.vlen.VLenUTF8.encode()
TypeError: expected unicode string, found 16
```

I can get around this by making an encoding function (draft) based on the code in the VCF reader:

```python
import warnings
from typing import Hashable, Optional

from numcodecs import PackBits
from xarray import Dataset

try:
    from numcodecs import Blosc

    DEFAULT_COMPRESSOR = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
except ImportError:  # pragma: no cover
    warnings.warn("Cannot import Blosc, falling back to no compression", RuntimeWarning)
    DEFAULT_COMPRESSOR = None


def encode(ds: Dataset, encoding: Optional[dict] = None) -> dict:
    chunk_width = 1_000
    chunk_length = 10_000
    compressor = DEFAULT_COMPRESSOR

    def get_chunk_size(dim: Hashable, size: int) -> int:
        if dim == "variants":
            return chunk_length
        elif dim == "samples":
            return chunk_width
        else:
            return size

    default_encoding = {}
    for var in ds.data_vars:
        var_chunks = tuple(
            get_chunk_size(dim, size) for (dim, size) in zip(ds[var].dims, ds[var].shape)
        )
        default_encoding[var] = dict(chunks=var_chunks, compressor=compressor)
        if ds[var].dtype.kind == "b":
            # ensure that booleans are not stored as int8 by xarray
            # https://github.com/pydata/xarray/issues/4386
            ds[var].attrs["dtype"] = "bool"
            default_encoding[var]["filters"] = [PackBits()]

    # values from the function argument take precedence over default_encoding
    encoding = encoding or {}
    merged_encoding = {**default_encoding, **encoding}
    return merged_encoding
```
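For the dataset above, the merged encoding returned by `encode(ds)` would look roughly like this (illustrative values; the chunk sizes come from the defaults in the function, and Zarr accepts chunks larger than the array):

```python
encoding = encode(ds)
# e.g. {'variant_allele': {'chunks': (10000, 4), 'compressor': Blosc(cname='zstd', ...)},
#       'sample_id': {'chunks': (1000,), 'compressor': Blosc(cname='zstd', ...)},
#       ...}
```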
Saving the dataset above then goes along these lines:

```python
sg.save_dataset(ds, "foo.zarr", encoding=encode(ds), mode="w")
```

```
/home/peru/miniconda3/envs/pgtk/lib/python3.9/site-packages/xarray/conventions.py:206: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  "variable {} has data in the form of a dask array with "
```

There is still the warning, though, and I haven't figured out where the `None` variable comes from. Would a helper function like this be something to consider?
@percyfal that looks like a decent approach to get things working. I'd like to understand this better though, as your original code should work as-is, and the fact that it doesn't looks like a bug to me. Are you able to share the VCF, or extract a cut-down/anonymised version as a minimal reproducible example?
The input VCF is actually simulated using stdpopsim and tskit:

```sh
stdpopsim HomSap -d OutOfAfrica_3G09 -c chr21 -o ooa.chr21.ts -l 0.1 -s 42 10 10 10
tskit vcf --ploidy 2 ooa.chr21.ts | sed -e "s/contig=<ID=1/contig=<ID=chr21/g" | sed -e "s/^1/chr21/g" | bcftools view -O z -o ooa.chr21.vcf.gz
```

A MWE in Python is then:

```python
import sgkit as sg
import sgkit.io.vcf as sgvcf

sgvcf.vcf_to_zarr("ooa.chr21.vcf.gz", "ooa.chr21.vcf.gz.zarr")
ds = sg.load_dataset("ooa.chr21.vcf.gz.zarr")
sg.save_dataset(ds, "foo.zarr", mode="w")
```

which fails with the error above. The simulated output encodes ref/alt as 0/1, which e.g. gatk ValidateVariants complains about. I also tried the code on a "regular" VCF, with the same error.
Thanks for the MWE @percyfal! I have been trying to get to the bottom of this, and I think there are at least two issues. (The ref/alt 0/1 encoding is not the problem, as you point out, since the same thing happens with any VCF.) The root cause seems to be this long-standing issue in xarray: pydata/xarray#3476. I think we can solve this in sgkit. The warning is still a problem though, and I'm not sure how to fix it yet. The warning is coming from xarray (the `SerializationWarning` shown above); we don't get it in the initial `vcf_to_zarr` conversion.
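The proposed fix isn't shown in this thread, but a sketch of the direction described — coercing object-dtype string variables to a fixed-width dtype inside `save_dataset`, so the Zarr writer never needs the `VLenUTF8` object codec — might look like this (hypothetical; sgkit's actual implementation may differ):

```python
from xarray import Dataset


def save_dataset(ds: Dataset, store: str, **kwargs) -> None:
    """Hypothetical sketch, not sgkit's actual code."""
    for v in ds:
        if ds[v].dtype == object:
            # Fixed-width unicode avoids the object codec entirely.
            ds[v] = ds[v].load().astype(str)
            # Drop the stale VLenUTF8 filter carried over from load_dataset.
            ds[v].encoding.pop("filters", None)
        # Workaround for https://github.com/pydata/xarray/issues/4380
        ds[v].encoding.pop("chunks", None)
    ds.to_zarr(store, **kwargs)
```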
@tomwhite your proposed fix solves my problem, thanks, and presumably also the originally posted issue. Cheers, Per
Great - thanks for trying it out.
Great detective work @tomwhite! Should we file an issue to track the upstream Xarray + Zarr issue?
Yes, I opened #853 for the remaining problem. |
The `.zarr` output of `vcf_to_zarr()` is smaller than before: for a 201 MB 1000 Genomes chromosome 21 VCF file, the new result is a 50 MB Zarr store, where it was previously 321 MB. However, I found some problems with this new compressed version. For example, if I simply read the data back and write it out again (a sketch follows below), it works for the old 321 MB version, but the new 50 MB version raises the `TypeError: expected unicode string` error discussed in the comments above.
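A round trip along these lines reproduces it (paths illustrative; the sketch assumes sgkit's `load_dataset`/`save_dataset` helpers):

```python
import sgkit as sg

# Load the Zarr store produced by vcf_to_zarr and write it back out.
ds = sg.load_dataset("chr21.vcf.gz.zarr")  # the new 50 MB Zarr store
sg.save_dataset(ds, "chr21.copy.zarr", mode="w")
```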