
Support reading from Azure with adlsv2 managed identity #2005

Closed
jaychia opened this issue Mar 12, 2024 · 32 comments · Fixed by #2333 or #2349

@jaychia
Contributor

jaychia commented Mar 12, 2024

Is your feature request related to a problem? Please describe.

In Azure, sometimes users may use Azure managed identity instead of just pure credentials.

Daft should support this. This may require some API changes on the AzureConfig.

Resources:

(What are managed identities for Azure resources?) https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview
(How to use it from databricks) https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/azure-managed-identities

@jaychia jaychia added the p1 Important to tackle soon, but preemptable by p0 label Mar 12, 2024
@jaychia jaychia added p2 Nice to have features and removed p1 Important to tackle soon, but preemptable by p0 labels Apr 22, 2024
@djouallah
Contributor

I tried using Daft to read an Iceberg table through a catalog. It worked well with S3, but I got a "not implemented" error when using Azure ADLS Gen2. Is this planned?

@jaychia
Contributor Author

jaychia commented May 5, 2024

Hi @djouallah !

This is planned, but requires some work on our end to correctly set up the Azure environment to test it. Let me bump up the priority on this :)

@jaychia jaychia added p0 Priority 0 - to be addressed immediately and removed p2 Nice to have features labels May 5, 2024
@jaychia
Contributor Author

jaychia commented May 5, 2024

@djouallah could you provide an example of your setup to help guide development of this feature?

@djouallah
Contributor

I am getting this error:

DaftCoreException: DaftError::External Source not yet implemented: abfss

@jaychia
Contributor Author

jaychia commented May 6, 2024

Source not yet implemented:

Thanks! Actually this error might be because you're using abfss:// instead of abfs://.

Daft currently recognizes only az:// and abfs:// as valid ABFS URLs. Could you try using abfs:// instead and let me know if that works/how it fails?

@jaychia jaychia removed the p0 Priority 0 - to be addressed immediately label May 6, 2024
@samster25
Member

It might be a simple fix to just map abfss:// to our Azure reader. It would be great if you can try it out @djouallah!

@djouallah
Contributor

Can you map it please? Changing from abfss to abfs will break other engines :(

@samster25
Member

@djouallah Yeah happy to map it on our end, we just want to verify that it indeed fixes your issue.
If you could run this on your end

df = daft.read_parquet("abfs://PATH_TO_PARQUET_FILE_THAT_IS_ABFSS")

and verify that it works

@djouallah
Contributor

djouallah commented May 7, 2024

I understand, I am reading from a catalog, I can't change the URL
(screenshot)

@samster25
Member

Actually, I just took a look at what Azure's fsspec implementation does, and they just map it. I'll push up a PR!

@samster25
Member

PR UP: #2244

@samster25
Member

samster25 commented May 9, 2024

@djouallah The latest version of daft should now have the fix for reading abfss!

pip install getdaft==0.2.24

@djouallah
Contributor

@samster25 thanks, now I get this
(screenshot)

although the credentials are already defined in the catalog:


from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.io.fsspec import FsspecFileIO
catalog = SqlCatalog(
    "default",
    **{
        "uri"                : postgresql_db ,
        "adlfs.account-name" : account_name ,
        "adlfs.account-key"  : AZURE_STORAGE_ACCOUNT_KEY,
        "adlfs.tenant-id"    : azure_storage_tenant_id,
        "py-io-impl"         : "pyiceberg.io.fsspec.FsspecFileIO"
    },
)
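A rough sketch of how the table is then loaded and read (the namespace and table name are placeholders):

calendar = catalog.load_table("my_namespace.calendar")

import daft
df = daft.read_iceberg(calendar)
df.show()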

polars works fine
(screenshot)

@samster25
Member

Ah, we might have missed an option when converting the Iceberg credentials to our IOConfig. I'll make a fix ASAP!

Are you currently passing anything in storage_options for Polars?

@djouallah
Contributor

No, Polars just figures it out by default; I did not add anything.

@kevinzwang kevinzwang self-assigned this May 15, 2024
@kevinzwang kevinzwang linked a pull request Jun 4, 2024 that will close this issue
@kevinzwang
Member

kevinzwang commented Jun 4, 2024

@djouallah we recently added additional methods for Azure authentication in the latest release of Daft (v0.2.26)

You can now use a managed identity by passing tenant_id, client_id, and client_secret into daft.io.AzureConfig. If credentials are not provided, it will also try to pull them from environment variables or the Azure CLI.
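For illustration, a minimal sketch of that (the storage account and credential values are placeholders, and the parameter names follow the description above):

import daft
from daft.io import AzureConfig, IOConfig

# Placeholder values; substitute your storage account and managed identity / service principal details.
io_config = IOConfig(
    azure=AzureConfig(
        storage_account="my-storage-account",
        tenant_id="my-tenant-id",
        client_id="my-client-id",
        client_secret="my-client-secret",
    )
)

df = daft.read_parquet("abfss://container/path/to/file.parquet", io_config=io_config)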

Please check it out and let me know how it goes!

@djouallah
Contributor

does not work

AttributeError                            Traceback (most recent call last)
[<ipython-input-11-2f1f1c3a12d5>](https://localhost:8080/#) in <cell line: 4>()
      2 import daft
      3 
----> 4 df = daft.read_iceberg(scada)
      5 df.show()

    [... skipping hidden 2 frame]

[/usr/local/lib/python3.10/dist-packages/daft/io/_iceberg.py](https://localhost:8080/#) in read_iceberg(pyiceberg_table, io_config)
    117 
    118     io_config = (
--> 119         _convert_iceberg_file_io_properties_to_io_config(pyiceberg_table.io.properties)
    120         if io_config is None
    121         else io_config

AttributeError: 'LazyFrame' object has no attribute 'io'

@samster25
Member

hi @djouallah, we don't have a concept of a LazyFrame, maybe you passed in a Polars LazyFrame instead of a PyIceberg table?

@djouallah
Contributor

This is very embarrassing :( this is the real error:

ScanWithTask-LocalLimit-LocalLimit [Stage:5]:   0%
 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
DaftCoreException                         Traceback (most recent call last)
[<ipython-input-27-eb9e045261aa>](https://localhost:8080/#) in <cell line: 3>()
      1 import daft
      2 df = daft.read_iceberg(calendar)
----> 3 df.show()

13 frames
[/usr/local/lib/python3.10/dist-packages/daft/table/micropartition.py](https://localhost:8080/#) in _from_scan_task(scan_task)
     78     def _from_scan_task(scan_task: _ScanTask) -> MicroPartition:
     79         assert isinstance(scan_task, _ScanTask)
---> 80         return MicroPartition._from_pymicropartition(_PyMicroPartition.from_scan_task(scan_task))
     81 
     82     @staticmethod

DaftCoreException: DaftError::External Unable to open file abfss://yyyy.dfs.core.windows.net/data/cccc/calendar/data/00000-0-a16c15a5-9a43-4f1d-b156-b5d733a698c7.parquet: Error { context: Full(Custom { kind: HttpResponse { status: BadRequest, error_code: Some("InvalidResourceName") }, error: HttpError { status: BadRequest, details: ErrorDetails { code: Some("InvalidResourceName"), message: None }, headers: {"server": "Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0", "x-ms-version": "2020-10-02", "x-ms-error-code": "InvalidResourceName", "x-ms-request-id": "bdcf-0055-6ce7-b6826b000000", "transfer-encoding": "chunked", "date": "Wed, 05 Jun 2024 01:28:12 GMT"}, body: b"" } }, "server returned error status which will not be retried: 400") }

@kevinzwang
Member

Hey @djouallah, thanks for the response! Could you please share the entire block of code that you're running?

@djouallah
Contributor

@kevinzwang
Copy link
Member

Hi @djouallah, I believe we've determined a fix for your issue, and I'm currently working on testing this on an Azure deployed Iceberg table. Should be out soon!

kevinzwang added a commit that referenced this issue Jun 7, 2024
In this PR:
- `pyarrow.dataset.write_dataset` does not properly write Parquet metadata in version 12.0.0, so the requirement for it is set to >=12.0.1
- The Azure fsspec filesystem is now initialized with IOConfig values
- Azure URIs that look like `PROTOCOL://account.dfs.core.windows.net/container/path-part/file` are now properly parsed; URI parsing was also cleaned up and unified
- Fixed small discrepancies for AzureConfig in `daft.pyi`
- Added a public test Iceberg table on Azure, a SQLite catalog that points to the table, and a test for those tables.
  - More tests should be written - #2348

Should resolve #2005
@samster25 samster25 reopened this Jun 7, 2024
@kevinzwang
Member

Oops I think the merged PR also accidentally closed this issue. @djouallah we just cut a release for v0.2.27 which will have the fix for the issue you've been seeing. Hoping it works this time 🤞 and thank you for all of your patience :)

@djouallah
Contributor

Thank you, it is working

(screenshot)

@kevinzwang
Member

Amazing, so glad to hear

@djouallah
Contributor

Sorry to be annoying, but it does not work with the Tabular catalog:

DaftCoreException: DaftError::External Unauthorized to access store: gcs for file: gs://xxxxxxxxxx5-47a9-9eac-f3f5b12a8131.parquet
You may need to set valid Credentials
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).

@samster25
Member

@djouallah Ah, it looks like we're in anonymous mode for GCS! This usually happens when we don't detect any credentials on the machine. Is the Tabular catalog performing some kind of credential vending?

@djouallah
Contributor

@samster25 Yes, and Polars works fine, so it is probably something you have to do on your end.

@samster25
Member

Okay, I think I triaged the issue! It looks like Tabular is vending credentials via tokens with an expiry, while Daft instead looks for credentials on the user's machines/workers, since the GCP libraries strongly discourage this.

@kevinzwang We should forward the following variables into our IOConfig just like pyiceberg does for pyarrow. The key things we have to do are add the options to our GCSConfig and create a FixedCredentialsProvider that returns the provided credentials by implementing the traits required by the ClientConfig here.

Do you think you could take a look tomorrow @kevinzwang?

@kevinzwang
Member

Sorry about the delay! Working on this rn

@kevinzwang
Member

@djouallah With the latest release (v0.2.28), Daft should now work with Tabular on Google Cloud by default. You can also pass a credential file or access token manually into daft.io.GCSConfig (docs here). Let me know how it works for you!
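For illustration, a minimal sketch of passing an access token manually (the token value is a placeholder, and the parameter name is an assumption based on the description above):

import daft
from daft.io import GCSConfig, IOConfig

# Placeholder token; in practice this would be an OAuth2 access token, e.g. one vended by the catalog.
io_config = IOConfig(gcs=GCSConfig(token="MY_ACCESS_TOKEN"))

df = daft.read_parquet("gs://bucket/path/to/file.parquet", io_config=io_config)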

@kevinzwang
Member

Gonna actually close this since the original issue has been resolved. Feel free to open up a new issue if you encounter anything new though!
