
How to increase the max paths to avoid the OSError: [Errno 5] 500 request size exceeded, max paths is set to 1000 #284

Closed
josephnowak opened this issue Jul 14, 2024 · 11 comments · Fixed by #285
Labels
bug Something isn't working triage Requires initial review (is duplicate, reproduce bug, severity or priority)

Comments

@josephnowak

josephnowak commented Jul 14, 2024

What happened?

I have been trying to use lakeFS in conjunction with Xarray and Zarr, but I'm getting the following error when I try to write a Zarr file with many chunks:
OSError: [Errno 5] 500 request size exceeded, max paths is set to 1000.

I would like to know how I can remove that limitation; I need it to be able to write Zarr files with many chunks (every chunk is stored as an individual file).

Additionally, I would like to know whether you think that lakeFS is a good option to use with Zarr. I'm asking because this format can create many files to represent a single array. In my particular case I have more than 300 data fields, and all of them have more than 10K chunks, which is equivalent to 10K files, so I'm not sure if it can affect the performance of lakeFS.

import fsspec
import xarray as xr
import dask.array as da

from lakefs_spec import LakeFSFileSystem

lfs = LakeFSFileSystem(
    host="127.0.0.1:8000",
    username="AKIAIOSFOLQUICKSTART",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    access_key_id="AKIAIOSFOLQUICKSTART",
    secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    # endpoint_url="http://127.0.0.1:8000",
    use_listings_cache=False
)

for folder in ["test-zarr", "test-zarr"]:
    # The first execution is going to work
    path = f"lakefs://quickstart/main/{folder}"
    print(path)
    arr = da.zeros(shape=(100, 30), chunks=(2, 1))
    arr = xr.DataArray(
        arr, 
        dims=["a", "b"], 
        coords={
            "a": list(range(arr.shape[0])), 
            "b": list(range(arr.shape[1]))
        }
    ).to_dataset(name="data")
    
    fs_map = fsspec.FSMap(root=path, fs=lfs)
    
    # The error comes when it tries to clean the whole directory to rewrite the data
    arr.to_zarr(fs_map, mode="w")

    print(xr.open_zarr(fs_map).compute())

I deployed LakeFS using the quickstart docker command:
docker run --name lakefs --pull always --rm --publish 8000:8000 treeverse/lakefs:latest run --quickstart
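For scale: Zarr stores one object per chunk, so the object count per variable is the product of ceil(shape/chunk) over the dimensions. A quick check with the shapes from the snippet above shows the example array alone produces 1500 chunk objects, already past the 1000-path limit:

```python
import math

# Shapes from the reproduction snippet above.
shape, chunks = (100, 30), (2, 1)

# Zarr writes one object per chunk: ceil(shape/chunk) along each dimension.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
print(n_chunks)  # 1500 -- a single overwrite already exceeds the 1000-path limit
```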

What did you expect to happen?

I would expect there to be an environment variable that allows modifying max_paths to lift the limit on the number of paths per request.

lakeFS-spec version

0.9.0

lakeFS version

1.28.2

@josephnowak josephnowak added bug Something isn't working triage Requires initial review (is duplicate, reproduce bug, severity or priority) labels Jul 14, 2024
@nicholasjng
Collaborator

Thanks for the report. Could you share a stack trace or something similar? Just to be sure that this is a limitation on our end (I'm a bit skeptical because of the OSError tbh).

@josephnowak
Author

Hi, thanks for your reply. Here is the complete traceback:

---------------------------------------------------------------------------
ServiceException                          Traceback (most recent call last)
File ~\.conda\envs\tensordb\Lib\site-packages\lakefs\exceptions.py:141, in api_exception_handler(custom_handler)
    140 try:
--> 141     yield
    142 except lakefs_sdk.ApiException as e:

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs\branch.py:90, in _BaseBranch.delete_objects(self, object_paths)
     89 with api_exception_handler():
---> 90     return self._client.sdk_client.objects_api.delete_objects(
     91         self._repo_id,
     92         self._id,
     93         lakefs_sdk.PathList(paths=object_paths)
     94     )

File ~\.conda\envs\tensordb\Lib\site-packages\pydantic\v1\decorator.py:40, in validate_arguments.<locals>.validate.<locals>.wrapper_function(*args, **kwargs)
     38 @wraps(_func)
     39 def wrapper_function(*args: Any, **kwargs: Any) -> Any:
---> 40     return vd.call(*args, **kwargs)

File ~\.conda\envs\tensordb\Lib\site-packages\pydantic\v1\decorator.py:134, in ValidatedFunction.call(self, *args, **kwargs)
    133 m = self.init_model_instance(*args, **kwargs)
--> 134 return self.execute(m)

File ~\.conda\envs\tensordb\Lib\site-packages\pydantic\v1\decorator.py:206, in ValidatedFunction.execute(self, m)
    205 else:
--> 206     return self.raw_function(**d, **var_kwargs)

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\api\objects_api.py:424, in ObjectsApi.delete_objects(self, repository, branch, path_list, force, **kwargs)
    423     raise ValueError("Error! Please call the delete_objects_with_http_info method with `_preload_content` instead and obtain raw data from ApiResponse.raw_data")
--> 424 return self.delete_objects_with_http_info(repository, branch, path_list, force, **kwargs)

File ~\.conda\envs\tensordb\Lib\site-packages\pydantic\v1\decorator.py:40, in validate_arguments.<locals>.validate.<locals>.wrapper_function(*args, **kwargs)
     38 @wraps(_func)
     39 def wrapper_function(*args: Any, **kwargs: Any) -> Any:
---> 40     return vd.call(*args, **kwargs)

File ~\.conda\envs\tensordb\Lib\site-packages\pydantic\v1\decorator.py:134, in ValidatedFunction.call(self, *args, **kwargs)
    133 m = self.init_model_instance(*args, **kwargs)
--> 134 return self.execute(m)

File ~\.conda\envs\tensordb\Lib\site-packages\pydantic\v1\decorator.py:206, in ValidatedFunction.execute(self, m)
    205 else:
--> 206     return self.raw_function(**d, **var_kwargs)

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\api\objects_api.py:547, in ObjectsApi.delete_objects_with_http_info(self, repository, branch, path_list, force, **kwargs)
    539 _response_types_map = {
    540     '200': "ObjectErrorList",
    541     '401': "Error",
   (...)
    544     '420': None,
    545 }
--> 547 return self.api_client.call_api(
    548     '/repositories/{repository}/branches/{branch}/objects/delete', 'POST',
    549     _path_params,
    550     _query_params,
    551     _header_params,
    552     body=_body_params,
    553     post_params=_form_params,
    554     files=_files,
    555     response_types_map=_response_types_map,
    556     auth_settings=_auth_settings,
    557     async_req=_params.get('async_req'),
    558     _return_http_data_only=_params.get('_return_http_data_only'),  # noqa: E501
    559     _preload_content=_params.get('_preload_content', True),
    560     _request_timeout=_params.get('_request_timeout'),
    561     collection_formats=_collection_formats,
    562     _request_auth=_params.get('_request_auth'))

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\api_client.py:407, in ApiClient.call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_types_map, auth_settings, async_req, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host, _request_auth)
    406 if not async_req:
--> 407     return self.__call_api(resource_path, method,
    408                            path_params, query_params, header_params,
    409                            body, post_params, files,
    410                            response_types_map, auth_settings,
    411                            _return_http_data_only, collection_formats,
    412                            _preload_content, _request_timeout, _host,
    413                            _request_auth)
    415 return self.pool.apply_async(self.__call_api, (resource_path,
    416                                                method, path_params,
    417                                                query_params,
   (...)
    425                                                _request_timeout,
    426                                                _host, _request_auth))

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\api_client.py:222, in ApiClient.__call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_types_map, auth_settings, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host, _request_auth)
    221         e.body = e.body.decode('utf-8')
--> 222     raise e
    224 self.last_response = response_data

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\api_client.py:212, in ApiClient.__call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_types_map, auth_settings, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host, _request_auth)
    210 try:
    211     # perform request and return response
--> 212     response_data = self.request(
    213         method, url,
    214         query_params=query_params,
    215         headers=header_params,
    216         post_params=post_params, body=body,
    217         _preload_content=_preload_content,
    218         _request_timeout=_request_timeout)
    219 except ApiException as e:

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\api_client.py:451, in ApiClient.request(self, method, url, query_params, headers, post_params, body, _preload_content, _request_timeout)
    450 elif method == "POST":
--> 451     return self.rest_client.post_request(url,
    452                                  query_params=query_params,
    453                                  headers=headers,
    454                                  post_params=post_params,
    455                                  _preload_content=_preload_content,
    456                                  _request_timeout=_request_timeout,
    457                                  body=body)
    458 elif method == "PUT":

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\rest.py:278, in RESTClientObject.post_request(self, url, headers, query_params, post_params, body, _preload_content, _request_timeout)
    276 def post_request(self, url, headers=None, query_params=None, post_params=None,
    277          body=None, _preload_content=True, _request_timeout=None):
--> 278     return self.request("POST", url,
    279                         headers=headers,
    280                         query_params=query_params,
    281                         post_params=post_params,
    282                         _preload_content=_preload_content,
    283                         _request_timeout=_request_timeout,
    284                         body=body)

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_sdk\rest.py:235, in RESTClientObject.request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
    234 if 500 <= r.status <= 599:
--> 235     raise ServiceException(http_resp=r)
    237 raise ApiException(http_resp=r)

ServiceException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Request-Id': '586c5cf7-a698-4d84-97af-d4e6de8e6eb2', 'Date': 'Sun, 14 Jul 2024 17:56:29 GMT', 'Content-Length': '62'})
HTTP response body: {"message":"request size exceeded, max paths is set to 1000"}



The above exception was the direct cause of the following exception:

ServerException                           Traceback (most recent call last)
File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_spec\spec.py:168, in LakeFSFileSystem.wrapped_api_call(self, rpath, message, set_cause)
    167 try:
--> 168     yield
    169 except ServerException as e:

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_spec\spec.py:718, in LakeFSFileSystem.rm(self, path, recursive, maxdepth)
    717 if maxdepth is None:
--> 718     branch.delete_objects(obj.path for obj in objgen)
    719 else:
    720     # nesting level is just the amount of "/"s in the path, no leading "/".

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs\branch.py:89, in _BaseBranch.delete_objects(self, object_paths)
     88     object_paths = [o.path if isinstance(o, StoredObject) else o for o in object_paths]
---> 89 with api_exception_handler():
     90     return self._client.sdk_client.objects_api.delete_objects(
     91         self._repo_id,
     92         self._id,
     93         lakefs_sdk.PathList(paths=object_paths)
     94     )

File ~\.conda\envs\tensordb\Lib\contextlib.py:155, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    154 try:
--> 155     self.gen.throw(typ, value, traceback)
    156 except StopIteration as exc:
    157     # Suppress StopIteration *unless* it's the same exception that
    158     # was passed to throw().  This prevents a StopIteration
    159     # raised inside the "with" statement from being suppressed.

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs\exceptions.py:148, in api_exception_handler(custom_handler)
    147 if lakefs_ex is not None:
--> 148     raise lakefs_ex from e

ServerException: code: 500, reason: Internal Server Error, body: {'message': 'request size exceeded, max paths is set to 1000'}

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
Cell In[56], line 23
     20 fs_map = fsspec.FSMap(root=path, fs=lfs)
     22 # The error comes when it tries to clean the whole directory to rewrite the data
---> 23 arr.to_zarr(fs_map, mode="w")
     25 print(xr.open_zarr(fs_map).compute())
     27 time.sleep(5)

File ~\.conda\envs\tensordb\Lib\site-packages\xarray\core\dataset.py:2549, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs)
   2404 """Write dataset contents to a zarr group.
   2405 
   2406 Zarr chunks are determined in the following way:
   (...)
   2545     The I/O user guide, with more details and examples.
   2546 """
   2547 from xarray.backends.api import to_zarr
-> 2549 return to_zarr(  # type: ignore[call-overload,misc]
   2550     self,
   2551     store=store,
   2552     chunk_store=chunk_store,
   2553     storage_options=storage_options,
   2554     mode=mode,
   2555     synchronizer=synchronizer,
   2556     group=group,
   2557     encoding=encoding,
   2558     compute=compute,
   2559     consolidated=consolidated,
   2560     append_dim=append_dim,
   2561     region=region,
   2562     safe_chunks=safe_chunks,
   2563     zarr_version=zarr_version,
   2564     write_empty_chunks=write_empty_chunks,
   2565     chunkmanager_store_kwargs=chunkmanager_store_kwargs,
   2566 )

File ~\.conda\envs\tensordb\Lib\site-packages\xarray\backends\api.py:1661, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs)
   1659     already_consolidated = False
   1660     consolidate_on_close = consolidated or consolidated is None
-> 1661 zstore = backends.ZarrStore.open_group(
   1662     store=mapper,
   1663     mode=mode,
   1664     synchronizer=synchronizer,
   1665     group=group,
   1666     consolidated=already_consolidated,
   1667     consolidate_on_close=consolidate_on_close,
   1668     chunk_store=chunk_mapper,
   1669     append_dim=append_dim,
   1670     write_region=region,
   1671     safe_chunks=safe_chunks,
   1672     stacklevel=4,  # for Dataset.to_zarr()
   1673     zarr_version=zarr_version,
   1674     write_empty=write_empty_chunks,
   1675 )
   1677 if region is not None:
   1678     zstore._validate_and_autodetect_region(dataset)

File ~\.conda\envs\tensordb\Lib\site-packages\xarray\backends\zarr.py:483, in ZarrStore.open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel, zarr_version, write_empty)
    464 @classmethod
    465 def open_group(
    466     cls,
   (...)
    480     write_empty: bool | None = None,
    481 ):
--> 483     zarr_group, consolidate_on_close, close_store_on_close = _get_open_params(
    484         store=store,
    485         mode=mode,
    486         synchronizer=synchronizer,
    487         group=group,
    488         consolidated=consolidated,
    489         consolidate_on_close=consolidate_on_close,
    490         chunk_store=chunk_store,
    491         storage_options=storage_options,
    492         stacklevel=stacklevel,
    493         zarr_version=zarr_version,
    494     )
    496     return cls(
    497         zarr_group,
    498         mode,
   (...)
    504         close_store_on_close,
    505     )

File ~\.conda\envs\tensordb\Lib\site-packages\xarray\backends\zarr.py:1332, in _get_open_params(store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, stacklevel, zarr_version)
   1330     zarr_group = zarr.open_consolidated(store, **open_kwargs)
   1331 else:
-> 1332     zarr_group = zarr.open_group(store, **open_kwargs)
   1333 close_store_on_close = zarr_group.store is not store
   1334 return zarr_group, consolidate_on_close, close_store_on_close

File ~\.conda\envs\tensordb\Lib\site-packages\zarr\hierarchy.py:1581, in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options, zarr_version, meta_array)
   1578         raise GroupNotFoundError(path)
   1580 elif mode == "w":
-> 1581     init_group(store, overwrite=True, path=path, chunk_store=chunk_store)
   1583 elif mode == "a":
   1584     if not contains_group(store, path=path):

File ~\.conda\envs\tensordb\Lib\site-packages\zarr\storage.py:682, in init_group(store, overwrite, path, chunk_store)
    679     store["zarr.json"] = store._metadata_class.encode_hierarchy_metadata(None)  # type: ignore
    681 # initialise metadata
--> 682 _init_group_metadata(store=store, overwrite=overwrite, path=path, chunk_store=chunk_store)
    684 if store_version == 3:
    685     # TODO: Should initializing a v3 group also create a corresponding
    686     #       empty folder under data/root/? I think probably not until there
    687     #       is actual data written there.
    688     pass

File ~\.conda\envs\tensordb\Lib\site-packages\zarr\storage.py:704, in _init_group_metadata(store, overwrite, path, chunk_store)
    701 if overwrite:
    702     if store_version == 2:
    703         # attempt to delete any pre-existing items in store
--> 704         rmdir(store, path)
    705         if chunk_store is not None:
    706             rmdir(chunk_store, path)

File ~\.conda\envs\tensordb\Lib\site-packages\zarr\storage.py:212, in rmdir(store, path)
    209 store_version = getattr(store, "_store_version", 2)
    210 if hasattr(store, "rmdir") and store.is_erasable():  # type: ignore
    211     # pass through
--> 212     store.rmdir(path)
    213 else:
    214     # slow version, delete one key at a time
    215     if store_version == 2:

File ~\.conda\envs\tensordb\Lib\site-packages\zarr\storage.py:1549, in FSStore.rmdir(self, path)
   1547 store_path = self.dir_path(path)
   1548 if self.fs.isdir(store_path):
-> 1549     self.fs.rm(store_path, recursive=True)

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_spec\spec.py:714, in LakeFSFileSystem.rm(self, path, recursive, maxdepth)
    711 path = stringify_path(path)
    712 repository, ref, prefix = parse(path)
--> 714 with self.wrapped_api_call(rpath=path):
    715     branch = lakefs.Branch(repository, ref, client=self.client)
    716     objgen = branch.objects(prefix=prefix, delimiter="" if recursive else "/")

File ~\.conda\envs\tensordb\Lib\contextlib.py:155, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    153     value = typ()
    154 try:
--> 155     self.gen.throw(typ, value, traceback)
    156 except StopIteration as exc:
    157     # Suppress StopIteration *unless* it's the same exception that
    158     # was passed to throw().  This prevents a StopIteration
    159     # raised inside the "with" statement from being suppressed.
    160     return exc is not value

File ~\.conda\envs\tensordb\Lib\site-packages\lakefs_spec\spec.py:170, in LakeFSFileSystem.wrapped_api_call(self, rpath, message, set_cause)
    168     yield
    169 except ServerException as e:
--> 170     raise translate_lakefs_error(e, rpath=rpath, message=message, set_cause=set_cause)

OSError: [Errno 5] 500 request size exceeded, max paths is set to 1000: 'quickstart/main/test-zarr'

@nicholasjng
Collaborator

Thanks. As you can see at the start of the trace, the rm call (a POST to '/repositories/{repository}/branches/{branch}/objects/delete' under the hood) fails, presumably because of a max-paths limitation on the lakeFS server.

I don't know the details about how to increase the max paths number (specifically, whether it is set at compile time or if it is a configurable attribute on the server), so I'd raise that question on the lakeFS repo.

But tl;dr: this is a lakeFS server-side issue, not a lakefs-spec one.

@josephnowak
Author

Thanks, I will raise the question on the LakeFS repo.

@arielshaqed

Hi, I'm from the lakeFS team, here because of treeverse/lakeFS#7992. This error is indeed generated by lakeFS when you attempt to call deleteObjects on more than 1000 objects. Note that AWS S3 has the exact same limitation in its DeleteObjects call.

I agree that this is a limitation, but it is unavoidable on the lakeFS side. We might be able to make it configurable. But the run-time behaviour will not be good. I believe that this line from the stacktrace should actually call deleteObjects in "chunks".

Please consider re-opening this issue.

@nicholasjng nicholasjng reopened this Jul 16, 2024
@nicholasjng
Collaborator

Thanks for the source @arielshaqed! Should be easily doable with an itertools slice or something similar.
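The chunked-deletion approach mentioned above can be sketched roughly as follows; `delete_in_chunks` and the `MAX_DELETE_PATHS` constant are illustrative, not the actual lakefs-spec implementation (the server limit of 1000 is taken from the error message):

```python
from itertools import islice

MAX_DELETE_PATHS = 1000  # server-side limit observed in the error message

def batched(iterable, n):
    """Yield successive lists of at most n items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

def delete_in_chunks(branch, object_paths):
    """Hypothetical helper: one delete_objects request per batch of paths."""
    for chunk in batched(object_paths, MAX_DELETE_PATHS):
        branch.delete_objects(chunk)
```

With this, an rm over 1500 chunk objects would issue two delete requests (1000 + 500) instead of one oversized request.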

@josephnowak
Author

@nicholasjng Sorry for bothering you, but I was reading the following lakeFS forum thread: https://forum.lakefs.io/t/15556053/hello-i-m-new-to-lakefs-and-would-like-to-ask-does-anyone-ha. It mentions writing directly to S3 using s3fs instead of lakefs-spec, so I would like to know whether that is the recommended approach or whether I should use lakefs-spec. What is strange to me is how lakeFS can handle/update the metadata automatically if I write directly to the bucket using the code from the forum. (If needed, I can open a separate issue for this question.)

@nicholasjng
Collaborator

Regarding metadata, I would assume that lakeFS tracks the bucket state very closely in its metadata model, so they will pick up changes you write by hand.

The s3fs approach is just another way of accessing the underlying storage at this point, I think. We used to have a fallback that read and wrote files directly to/from lakeFS via s3fs when the pre-signed URL feature was selected, but that was removed when the new lakefs wrapper library was introduced.

So you should go with whatever you find most convenient. Going directly through S3 is probably also fine, provided you don't plan to switch to a different cloud provider anytime soon. (It also makes it a little harder to spin up working local instances.)

@arielshaqed

Sorry for the delayed reply from the lakeFS side; @nicholasjng moved faster than I could track this issue! tl;dr: Prefer lakefs-spec if at all possible.

Everything Nicholas wrote about accessing lakeFS is of course accurate. To give context from the lakeFS side to your questions:

The sample use of S3FS predates lakefs-spec AFAIR; I would prefer to use lakefs-spec.

This setup does not go directly to the underlying S3 bucket! The configuration option endpoint_url points to the S3-compatible endpoint which lakeFS provides. Your upload goes through the lakeFS server, and that's how it tracks metadata from that upload. lakeFS provides this endpoint even when your backing store is not S3 compatible, so it will work.

But it will typically be slower than using lakefs-spec! Depending on lakeFS configuration, this fsspec implementation can upload data using a presigned URL and then notify lakeFS about the new object. This way lakeFS only handles metadata and never data, which is considerably more efficient and reliable.

@nicholasjng
Collaborator

lakefs-spec v0.10.0 is out with the fix for #284 (merged in #285), please let us know if there are any further issues.

@josephnowak
Author

That was fast, I tested, and everything worked as expected, thanks!

3 participants