Add support for Webpage Transform #507
Conversation
lgtm
)
output_df = pd.DataFrame.from_records(outputs)
final_df = pd.concat([dataset.df, output_df], axis=1)
dataset = AutolabelDataset(final_df, self.config)
We should probably add this as a util function in AutolabelDataset so we can change it if needed in the future. I missed this when creating the base transform.
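A minimal sketch of what such a util could look like (the method name `with_transform_outputs` and the stripped-down class below are hypothetical illustrations, not the actual autolabel API):

```python
import pandas as pd

class AutolabelDataset:
    """Stripped-down stand-in for the real class; only df/config are modeled."""

    def __init__(self, df, config=None):
        self.df = df
        self.config = config

    def with_transform_outputs(self, outputs):
        # Hypothetical helper: concatenate transform output records
        # column-wise and return a new dataset, so the merge logic
        # lives in one place instead of in each transform.
        output_df = pd.DataFrame.from_records(outputs)
        final_df = pd.concat([self.df, output_df], axis=1)
        return AutolabelDataset(final_df, self.config)
```

With a helper like this, every transform can call the same method and a future change to the merge strategy only touches one place.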
)

def name(self) -> str:
    return TransformType.WEBPAGE_TRANSFORM
Consider whether adding parameters to the name makes sense, so that the name of a transform is more unique.
Do you have any specific parameters in mind?
Adding the timeout and url column? Alternatively, we could create an _id function that hashes the transform for caching.
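A hedged sketch of the `_id` idea, assuming the hash covers the transform name plus its parameters (the function name and JSON-based canonicalization are illustrative, not the PR's implementation):

```python
import hashlib
import json

def transform_id(name: str, params: dict) -> str:
    """Hypothetical _id helper: hash the transform name plus its
    parameters so cached results are invalidated when either changes."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps({"name": name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same parameters always yield the same id, so the id can double as a cache key, while changing e.g. the timeout produces a new id and a cache miss.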
        metadata[meta.get("name")] = meta.get("content")
    elif meta.get("property") and meta.get("content"):
        metadata[meta.get("property")] = meta.get("content")
return metadata
metadata will be missing some columns in case we don't find them in the page, right? How will this be handled by _apply?
Right now, we are returning all the meta fields that are present on the webpage. An alternative could be to always look for a specific set of fields, such as the description, encoding, social media tags, etc.
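A standard-library sketch of that fixed-field alternative (the PR itself uses BeautifulSoup; the field list and function name here are assumptions for illustration). Every field is present in the output, defaulting to None, so downstream columns stay stable:

```python
from html.parser import HTMLParser

# Fields we always look for, regardless of what the page contains;
# this list is illustrative, not the transform's actual schema.
FIXED_FIELDS = ("description", "og:title", "og:description")

class _MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {field: None for field in FIXED_FIELDS}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Meta tags carry their key in either "name" or "property".
        key = attrs.get("name") or attrs.get("property")
        if key in self.metadata and attrs.get("content"):
            self.metadata[key] = attrs["content"]

def extract_fixed_metadata(html: str) -> dict:
    parser = _MetaCollector()
    parser.feed(html)
    return parser.metadata
```

Because the dict is pre-seeded with every field, pages that lack a tag still produce a value (None), avoiding ragged columns downstream.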
Makes sense. In case one webpage has a field that another webpage doesn't, pandas will put NaN for the missing value under that column. This is being handled by pandas. Tried it out using:

import pandas as pd
l = [{'a': 1, 'b': 2}, {'a': 3, 'c': 4}]
pd.DataFrame(l)
   a    b    c
0  1  2.0  NaN
1  3  NaN  4.0
response = await client.get(url, headers=headers)

# TODO: Add support for other parsers
soup = self.beautiful_soup(response.text, HTML_PARSER)
It makes sense to add parsers as part of the config, with the init of the transform handling instantiation of the parser.
However, in the future we might want a parsing class which can parse the webpage however you want, with the usual parsers predefined, i.e. text parser, url parser, table parser, etc. In that case even this might be passed as a class. That is difficult to do through the config, though, and registering parsers seems out of scope for now.
Edit: (Adding parsing as a separate transform is also possible; however, we need HTML in order to transform, and extracting urls may not be possible from the text format.)
I agree we can keep them as two separate sets of transforms. The webpage transform just loads a given url and returns the loaded html, text or soup. The parser in this context refers to the BeautifulSoup parsers such as html.parser, lxml and html5lib.
The other transforms support extracting content/urls from raw html, bytes or soup.
That way each of these transforms has a clear separation of concerns and can be extended independently of the others.
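A minimal sketch of the config-driven parser selection discussed above (the `parser` config key and its default are assumptions; the parser names themselves are BeautifulSoup's real ones):

```python
# BeautifulSoup's documented parser names; the config key "parser"
# and the default below are illustrative assumptions, not the PR's API.
SUPPORTED_PARSERS = {"html.parser", "lxml", "html5lib"}

def resolve_parser(params: dict) -> str:
    """Pick the BeautifulSoup parser from transform params,
    failing fast on unsupported values."""
    parser = params.get("parser", "html.parser")
    if parser not in SUPPORTED_PARSERS:
        raise ValueError(f"Unsupported parser: {parser}")
    return parser
```

The transform's __init__ could call this once and stash the result, so _apply never has to re-validate the config per row.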
src/autolabel/utils.py
Outdated
with progress:
    progress_task = progress.add_task(description, total=total)
    tasks = [_task_with_tracker(task, progress, progress_task) for task in tasks]
    import asyncio
Any reason for importing asyncio inside this function?
good catch! Can move it to the top.
"sentence_transformers",
"bs4",
"httpx",
"fake_useragent"
do we need to add asyncio to the pyproject file?
I think it is a built-in module, so it doesn't need to be listed in pyproject.
Adds #506