Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data loaders to user agent when downloading #215

Merged
merged 6 commits into from
Jan 29, 2025

Conversation

goeffthomas
Copy link
Contributor

Logs show the values coming through as expected when testing locally. We'll use these values to track usage.

http://b/392143732

[Logs](https://screenshot.googleplex.com/7EjawSq7aC7Gbt9) show the values coming through as expected when testing locally. We'll use these values to track usage.

http://b/392143732
Copy link
Contributor

@jplotts jplotts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

path: Optional[str] = None,
*,
force_download: Optional[bool] = False,
referrer: Optional[str] = None, # noqa: ARG002
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance of fixing this exception noqa: ARG002?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd have to use the referrer value in the method, but there's only a use for it in the HTTP resolver. I had to add it to all of these to get the types to work with the Resolver that all of these inherit from. I'm still pretty new to developing in python, so there may be something simple that I missed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like you can use _referrer as the parameter name which is a way to note that an argument is unused. You might need to do this everywhere the __call__ method is implemented and its not being used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I didn't try that but I'll give it a try, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That didn't seem to work:

Signature of "__call__" incompatible with supertype "Resolver"Mypy[override](https://mypy.readthedocs.io/en/latest/_refs.html#code-override)
     Superclass:
         def __call__(self, handle: NotebookHandle, path: str | None, *, force_download: bool | None = ..., referrer: str | None = ...) -> str
     Subclass:
         def __call__(self, h: NotebookHandle, path: str | None = ..., *, force_download: bool | None = ..., _referrer: str | None = ...) -> str Mypy

I'm not sure if you're able to use the _ approach with classes like this 🤷

@@ -87,7 +88,8 @@ def load_pandas_dataset(
read_function = _validate_read_function(file_extension, sql_query)

# Now that everything has been validated, we can start downloading and processing
filepath = dataset_download(handle, path)
# This method is called by the HF data loader, so respect that as the referrer if provided
filepath = internal_dataset_download(handle, path, referrer=referrer if referrer else DATASET_DOWNLOAD_REFERRER)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I normally don't see the tenary statements while assigning an argument in python. It looks like too much is going on.

This is the normal convention:

if referrer is None: 
   referrer = DATASET_DOWNLOAD_REFERRER
internal_dataset_download(handle, path, referrer)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that.

@@ -84,6 +84,10 @@ def test_error_message_with_mismatch(self) -> None:
def test_get_user_agent(self) -> None:
self.assertEqual(clients.get_user_agent(), f"kagglehub/{kagglehub.__version__}")

def test_get_user_agent_referrer(self) -> None:
referrer = "pandas_data_loader"
self.assertEqual(clients.get_user_agent(referrer), f"kagglehub/{kagglehub.__version__} {referrer}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does our server side handle a space separated string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MT basically just does a bunch of .Contains checks in looking at the user agent string. AFAICT, they've been space-separated since the additional user agents were first added: https://github.com/Kaggle/kagglehub/pull/215/files#diff-fbfea7858926863a65c33f0de87573d84dbff809b8d50d93f6d8f1c928d41c44L84

@neshdev
Copy link
Member

neshdev commented Jan 28, 2025

I think you are doing too much in the PR. I just made a sample PR

The basic premise is that by the time the function gets to get_user_agent(), the stack frame should contain the call hierarchy. We just need to ensure that stackframe contains an entry for the function of interest.

Checkout the unit tests here. Its probably simpler to extend this PR or make a new PR then modify this one.

@goeffthomas
Copy link
Contributor Author

I think you are doing too much in the PR. I just made a sample PR

The basic premise is that by the time the function gets to get_user_agent(), the stack frame should contain the call hierarchy. We just need to ensure that stackframe contains an entry for the function of interest.

Checkout the unit tests here. Its probably simpler to extend this PR or make a new PR then modify this one.

@neshdev I remember trying that but not being satisfied with it for some reason. I'll try again and report back.

@goeffthomas goeffthomas requested a review from neshdev January 29, 2025 07:36
@goeffthomas
Copy link
Contributor Author

@neshdev Please review again. I've reverted the explicit referrer and have an inspect-based approach like what you shared. The part that I didn't like was that we're doing string matching on method names which felt very brittle. I've mitigated that by including unit tests for both data loaders to ensure the right values are added to the request headers.

Copy link
Member

@neshdev neshdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@goeffthomas goeffthomas merged commit 8c1e558 into main Jan 29, 2025
6 checks passed
@goeffthomas goeffthomas deleted the add-data-loader-referrer branch January 29, 2025 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants