-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data loaders to user agent when downloading #215
Conversation
[Logs](https://screenshot.googleplex.com/7EjawSq7aC7Gbt9) show the values coming through as expected when testing locally. We'll use these values to track usage. http://b/392143732
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
path: Optional[str] = None, | ||
*, | ||
force_download: Optional[bool] = False, | ||
referrer: Optional[str] = None, # noqa: ARG002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance of fixing this exception noqa: ARG002
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd have to use the referrer
value in the method, but there's only a use for it in the HTTP resolver. I had to add it to all of these to get the types to work with the Resolver
that all of these inherit from. I'm still pretty new to developing in python, so there may be something simple that I missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like you can use _referrer
as the parameter name which is a way to note that an argument is unused. You might need to do this everywhere the __call__
method is implemented and its not being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, I didn't try that but I'll give it a try, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That didn't seem to work:
Signature of "__call__" incompatible with supertype "Resolver"Mypy[override](https://mypy.readthedocs.io/en/latest/_refs.html#code-override)
Superclass:
def __call__(self, handle: NotebookHandle, path: str | None, *, force_download: bool | None = ..., referrer: str | None = ...) -> str
Subclass:
def __call__(self, h: NotebookHandle, path: str | None = ..., *, force_download: bool | None = ..., _referrer: str | None = ...) -> str Mypy
I'm not sure if you're able to use the _
approach with classes like this 🤷
src/kagglehub/pandas_datasets.py
Outdated
@@ -87,7 +88,8 @@ def load_pandas_dataset( | |||
read_function = _validate_read_function(file_extension, sql_query) | |||
|
|||
# Now that everything has been validated, we can start downloading and processing | |||
filepath = dataset_download(handle, path) | |||
# This method is called by the HF data loader, so respect that as the referrer if provided | |||
filepath = internal_dataset_download(handle, path, referrer=referrer if referrer else DATASET_DOWNLOAD_REFERRER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I normally don't see the tenary statements while assigning an argument in python. It looks like too much is going on.
This is the normal convention:
if referrer is None:
referrer = DATASET_DOWNLOAD_REFERRER
internal_dataset_download(handle, path, referrer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can do that.
tests/test_kaggle_api_client.py
Outdated
@@ -84,6 +84,10 @@ def test_error_message_with_mismatch(self) -> None: | |||
def test_get_user_agent(self) -> None: | |||
self.assertEqual(clients.get_user_agent(), f"kagglehub/{kagglehub.__version__}") | |||
|
|||
def test_get_user_agent_referrer(self) -> None: | |||
referrer = "pandas_data_loader" | |||
self.assertEqual(clients.get_user_agent(referrer), f"kagglehub/{kagglehub.__version__} {referrer}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does our server side handle a space separated string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MT basically just does a bunch of .Contains
checks in looking at the user agent string. AFAICT, they've been space-separated since the additional user agents were first added: https://github.com/Kaggle/kagglehub/pull/215/files#diff-fbfea7858926863a65c33f0de87573d84dbff809b8d50d93f6d8f1c928d41c44L84
I think you are doing too much in the PR. I just made a sample PR The basic premise is that by the time the function gets to Checkout the unit tests here. Its probably simpler to extend this PR or make a new PR then modify this one. |
@neshdev I remember trying that but not being satisfied with it for some reason. I'll try again and report back. |
@neshdev Please review again. I've reverted the explicit referrer and have an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Logs show the values coming through as expected when testing locally. We'll use these values to track usage.
http://b/392143732