
Support reading from Azure with adlsv2 managed identity #2005

Closed
jaychia opened this issue Mar 12, 2024 · 32 comments · Fixed by #2333 or #2349

@jaychia
Contributor

jaychia commented Mar 12, 2024

Is your feature request related to a problem? Please describe.

In Azure, sometimes users may use Azure managed identity instead of just pure credentials.

Daft should support this. This may require some API changes on the AzureConfig.

Resources:

(What are managed identities for Azure resources?) https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview
(How to use it from databricks) https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/azure-managed-identities

@jaychia jaychia added the p1 Important to tackle soon, but preemptable by p0 label Mar 12, 2024
@jaychia jaychia added p2 Nice to have features and removed p1 Important to tackle soon, but preemptable by p0 labels Apr 22, 2024
@djouallah
Contributor

I tried using Daft to read an Iceberg table through a catalog. It worked well with S3, but I got a "not implemented" error when using Azure ADLS Gen2. Is this planned?

@jaychia
Contributor Author

jaychia commented May 5, 2024

Hi @djouallah !

This is planned, but requires some work on our end to correctly set up the Azure environment to test it. Let me bump up the priority on this :)

@jaychia jaychia added p0 Priority 0 - to be addressed immediately and removed p2 Nice to have features labels May 5, 2024
@jaychia
Contributor Author

jaychia commented May 5, 2024

@djouallah could you provide an example of your setup to help guide development of this feature?

@djouallah
Contributor

I am getting this error:

DaftCoreException: DaftError::External Source not yet implemented: abfss

@jaychia
Contributor Author

jaychia commented May 6, 2024

Source not yet implemented:

Thanks! Actually this error might be because you're using abfss:// instead of abfs://.

Daft currently recognizes only az:// and abfs:// as valid ABFS URLs. Could you try using abfs:// instead and let me know if that works/how it fails?

@jaychia jaychia removed the p0 Priority 0 - to be addressed immediately label May 6, 2024
@samster25
Member

It might be a simple fix to just map abfss:// to our Azure reader. It would be great if you can try it out @djouallah!

@djouallah
Contributor

Can you map it please? Changing from abfss to abfs will break other engines :(

@samster25
Member

@djouallah Yeah happy to map it on our end, we just want to verify that it indeed fixes your issue.
If you could run this on your end

df = daft.read_parquet("abfs://PATH_TO_PARQUET_FILE_THAT_IS_ABFSS")

and verify that it works

@djouallah
Contributor

djouallah commented May 7, 2024

I understand, I am reading from a catalog, I can't change the URL
(screenshot)

@samster25
Member

Actually, I just took a look at what Azure's fsspec implementation does, and they just map it. I'll push up a PR!

@samster25
Member

PR UP: #2244

@samster25
Member

samster25 commented May 9, 2024

@djouallah The latest version of daft should now have the fix for reading abfss!

pip install getdaft==0.2.24

@djouallah
Contributor

@samster25 thanks, now I get this
(screenshot)

although the credentials are already defined in the catalog:


from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.io.fsspec import FsspecFileIO
catalog = SqlCatalog(
    "default",
    **{
        "uri"                : postgresql_db ,
        "adlfs.account-name" : account_name ,
        "adlfs.account-key"  : AZURE_STORAGE_ACCOUNT_KEY,
        "adlfs.tenant-id"    : azure_storage_tenant_id,
        "py-io-impl"         : "pyiceberg.io.fsspec.FsspecFileIO"
    },
)
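A rough sketch of how the table is then loaded and read (the namespace and table name are placeholders):

calendar = catalog.load_table("my_namespace.calendar")

import daft
df = daft.read_iceberg(calendar)
df.show()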

polars works fine
(screenshot)

@samster25
Member

Ah, we might have missed an option when converting the Iceberg credentials to our IOConfig. I'll make a fix ASAP!

Are you currently passing anything in storage_options for Polars?

@djouallah
Contributor

No, Polars just figures it out by default; I did not add anything.

@kevinzwang kevinzwang self-assigned this May 15, 2024
@kevinzwang kevinzwang linked a pull request Jun 4, 2024 that will close this issue
@kevinzwang
Member

kevinzwang commented Jun 4, 2024

@djouallah we recently added additional methods for Azure authentication in the latest release of Daft (v0.2.26)

You can now use a managed identity by passing tenant_id, client_id, and client_secret into daft.io.AzureConfig. If credentials are not provided, it will also try to pull them from environment variables or the Azure CLI.
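For illustration, a minimal sketch of that (the storage account and credential values are placeholders, and the parameter names follow the description above):

import daft
from daft.io import AzureConfig, IOConfig

# Placeholder values; substitute your storage account and managed identity / service principal details.
io_config = IOConfig(
    azure=AzureConfig(
        storage_account="my-storage-account",
        tenant_id="my-tenant-id",
        client_id="my-client-id",
        client_secret="my-client-secret",
    )
)

df = daft.read_parquet("abfss://container/path/to/file.parquet", io_config=io_config)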

Please check it out and let me know how it goes!

@djouallah
Contributor

does not work

AttributeError                            Traceback (most recent call last)
[<ipython-input-11-2f1f1c3a12d5>](https://localhost:8080/#) in <cell line: 4>()
      2 import daft
      3 
----> 4 df = daft.read_iceberg(scada)
      5 df.show()

    [... skipping hidden 2 frame]

[/usr/local/lib/python3.10/dist-packages/daft/io/_iceberg.py](https://localhost:8080/#) in read_iceberg(pyiceberg_table, io_config)
    117 
    118     io_config = (
--> 119         _convert_iceberg_file_io_properties_to_io_config(pyiceberg_table.io.properties)
    120         if io_config is None
    121         else io_config

AttributeError: 'LazyFrame' object has no attribute 'io'

@samster25
Member

hi @djouallah, we don't have a concept of a LazyFrame, maybe you passed in a Polars LazyFrame instead of a PyIceberg table?

@djouallah
Contributor

This is very embarrassing :( this is the real error:

ScanWithTask-LocalLimit-LocalLimit [Stage:5]:   0%
 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
DaftCoreException                         Traceback (most recent call last)
[<ipython-input-27-eb9e045261aa>](https://localhost:8080/#) in <cell line: 3>()
      1 import daft
      2 df = daft.read_iceberg(calendar)
----> 3 df.show()

13 frames
[/usr/local/lib/python3.10/dist-packages/daft/table/micropartition.py](https://localhost:8080/#) in _from_scan_task(scan_task)
     78     def _from_scan_task(scan_task: _ScanTask) -> MicroPartition:
     79         assert isinstance(scan_task, _ScanTask)
---> 80         return MicroPartition._from_pymicropartition(_PyMicroPartition.from_scan_task(scan_task))
     81 
     82     @staticmethod

DaftCoreException: DaftError::External Unable to open file abfss://yyyy.dfs.core.windows.net/data/cccc/calendar/data/00000-0-a16c15a5-9a43-4f1d-b156-b5d733a698c7.parquet: Error { context: Full(Custom { kind: HttpResponse { status: BadRequest, error_code: Some("InvalidResourceName") }, error: HttpError { status: BadRequest, details: ErrorDetails { code: Some("InvalidResourceName"), message: None }, headers: {"server": "Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0", "x-ms-version": "2020-10-02", "x-ms-error-code": "InvalidResourceName", "x-ms-request-id": "bdcf-0055-6ce7-b6826b000000", "transfer-encoding": "chunked", "date": "Wed, 05 Jun 2024 01:28:12 GMT"}, body: b"" } }, "server returned error status which will not be retried: 400") }

@kevinzwang
Member

Hey @djouallah, thanks for the response! Could you please share the entire block of code that you're running?

@djouallah
Contributor

@kevinzwang
Copy link
Member

Hi @djouallah, I believe we've determined a fix for your issue, and I'm currently working on testing this on an Azure deployed Iceberg table. Should be out soon!

kevinzwang added a commit that referenced this issue Jun 7, 2024
In this PR:
- `pyarrow.dataset.write_dataset` does not properly write Parquet metadata in version 12.0.0, so the requirement for it is set to >=12.0.1
- The Azure fsspec filesystem is now initialized with IOConfig values
- Azure URIs that look like `PROTOCOL://account.dfs.core.windows.net/container/path-part/file` are now properly parsed; URI parsing was also cleaned up and unified
- Fixed small discrepancies for AzureConfig in `daft.pyi`
- Added a public test Iceberg table on Azure, a SQLite catalog that points to the table, and a test for those tables.
  - More tests should be written - #2348

Should resolve #2005
@samster25 samster25 reopened this Jun 7, 2024
@kevinzwang
Member

Oops I think the merged PR also accidentally closed this issue. @djouallah we just cut a release for v0.2.27 which will have the fix for the issue you've been seeing. Hoping it works this time 🤞 and thank you for all of your patience :)

@djouallah
Contributor

Thank you, it is working

(screenshot)

@kevinzwang
Member

Amazing, so glad to hear

@djouallah
Contributor

Sorry to be annoying, but it does not work with the Tabular catalog:

DaftCoreException: DaftError::External Unauthorized to access store: gcs for file: gs://xxxxxxxxxx5-47a9-9eac-f3f5b12a8131.parquet
You may need to set valid Credentials
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).

@samster25
Member

@djouallah Ah, it looks like we're in anonymous mode for GCS! This usually happens when we don't detect any credentials on the machine. Is the Tabular catalog performing some kind of credential vending?

@djouallah
Contributor

@samster25 Yes, and Polars works fine, so it is probably something you have to do on your end.

@samster25
Member

Okay, I think I triaged the issue! It looks like Tabular is vending credentials via tokens with an expiry, while Daft instead looks for credentials on the user's machines/workers, since the GCP libraries strongly discourage this.

@kevinzwang We should forward the following variables into our IOConfig just like pyiceberg does for pyarrow. The key things we have to do are add the options to our GCSConfig and create a FixedCredentialsProvider that returns the provided credentials by implementing the traits required by the ClientConfig here.

Do you think you could take a look tomorrow @kevinzwang?

@kevinzwang
Member

Sorry about the delay! Working on this rn

@kevinzwang
Member

@djouallah With the latest release (v0.2.28), Daft should now work with Tabular on Google Cloud by default. You can also pass a credential file or access token manually into daft.io.GCSConfig (docs here). Let me know how it works for you!
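For illustration, a minimal sketch of passing an access token manually (the token value is a placeholder, and the parameter name is an assumption based on the description above):

import daft
from daft.io import GCSConfig, IOConfig

# Placeholder token; in practice this would be an OAuth2 access token, e.g. one vended by the catalog.
io_config = IOConfig(gcs=GCSConfig(token="MY_ACCESS_TOKEN"))

df = daft.read_parquet("gs://bucket/path/to/file.parquet", io_config=io_config)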

@kevinzwang
Member

Gonna actually close this since the original issue has been resolved. Feel free to open up a new issue if you encounter anything new though!
