
Add support for Webpage Transform #507

Merged 6 commits from webpage_transform into main on Aug 7, 2023
Conversation

yadavsahil197 (Contributor, Author):

Adds #506

rajasbansal (Contributor) left a comment:

lgtm

```python
output_df = pd.DataFrame.from_records(outputs)
final_df = pd.concat([dataset.df, output_df], axis=1)
dataset = AutolabelDataset(final_df, self.config)
```
rajasbansal (Contributor):

We should probably add this as a util function on AutolabelDataset so we can change it if needed in the future. I missed this when creating the base transform.
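
(For illustration, a minimal sketch of what such a util could look like; the method name `with_new_columns` and its exact signature are assumptions, not the actual API:)

```python
from typing import Dict, List

import pandas as pd

# Hypothetical helper on AutolabelDataset; name and signature are assumptions.
def with_new_columns(self, records: List[Dict]) -> "AutolabelDataset":
    # Centralize how transform outputs are joined back onto the dataset,
    # so the join strategy can change in one place later.
    output_df = pd.DataFrame.from_records(records)
    final_df = pd.concat([self.df, output_df], axis=1)
    return AutolabelDataset(final_df, self.config)
```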


```python
def name(self) -> str:
    return TransformType.WEBPAGE_TRANSFORM
```
rajasbansal (Contributor):

Consider whether adding parameters to the name makes sense, so that the name of a transform is more unique.

yadavsahil197 (Contributor, Author):

Do you have any specific parameters in mind?

rajasbansal (Contributor):

Adding the timeout and URL column? Alternatively, we can create an `_id` function which can be used to hash the transform for caching.
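
(A minimal sketch of such an `_id`, assuming the transform keeps its URL column and timeout as attributes; the attribute names are hypothetical:)

```python
import hashlib
import json

# Hypothetical method on the transform class; attribute names are assumptions.
def _id(self) -> str:
    # Serialize the name and parameters deterministically, then hash them
    # into a stable key the cache can use to identify this transform.
    params = {
        "name": str(self.name()),
        "url_column": self.url_column,
        "timeout": self.timeout,
    }
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
```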

```python
for meta in soup.find_all("meta"):
    if meta.get("name") and meta.get("content"):
        metadata[meta.get("name")] = meta.get("content")
    elif meta.get("property") and meta.get("content"):
        metadata[meta.get("property")] = meta.get("content")
return metadata
```
rajasbansal (Contributor):

metadata will be missing some columns in case we don't find them on the page, right? How will this be handled by _apply?

yadavsahil197 (Contributor, Author):

Right now, we are returning all the meta fields that are present on the webpage. An alternative could be to always look for specific fields, such as the description, encoding, social media tags, etc.

rajasbansal (Contributor):

Makes sense. In case one webpage has a title but the other doesn't, we will put NaN for the second page under that column. This is handled by pandas. Tried it out using:

```python
import pandas as pd
l = [{'a': 1, 'b': 2}, {'a': 3, 'c': 4}]
pd.DataFrame(l)
#    a    b    c
# 0  1  2.0  NaN
# 1  3  NaN  4.0
```

```python
response = await client.get(url, headers=headers)

# TODO: Add support for other parsers
soup = self.beautiful_soup(response.text, HTML_PARSER)
```
rajasbansal (Contributor):

It makes sense to add parsers as part of the config; the init of the transform would then handle instantiating the parser.

However, in the future we might want a parsing class which can parse the webpage however you want, and we define the usual parsers, i.e. text parser, URL parser, table parser, etc. In that case even this might be passed as a class. However, that is difficult to do in the config way of doing things, and registering parsers seems to be out of scope for now.

Edit: (Adding parsing as a separate transform is also possible; however, we need HTML in order to transform, and extracting URLs may not be possible with the text format.)
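
(A minimal sketch of the config-driven option; the `"parser"` config key and this class shape are assumptions for illustration:)

```python
from bs4 import BeautifulSoup

class WebpageTransform:
    # Hypothetical init; the "parser" config key is an assumption.
    def __init__(self, config: dict):
        # Fall back to the stdlib parser when none is configured;
        # "lxml" and "html5lib" require separate installs.
        self.parser = config.get("parser", "html.parser")

    def parse(self, html: str) -> BeautifulSoup:
        return BeautifulSoup(html, self.parser)
```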

yadavsahil197 (Contributor, Author):

I agree we can keep them as two separate sets of transforms. The webpage transform just loads a given URL and returns the loaded HTML, text, or soup. The parser in this context refers to the BeautifulSoup parsers such as html.parser, lxml, and html5lib.
The other transforms support extracting content/URLs from raw HTML, bytes, or soup.
That way each of these transforms has a clear separation of concerns and can be extended independently of the others.
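
(For reference, the BeautifulSoup parsers mentioned are selected by name at parse time; html.parser ships with Python, while lxml and html5lib need separate installs:)

```python
from bs4 import BeautifulSoup

html = "<html><body><a href='https://example.com'>link</a></body></html>"

# The same markup through each backend; they differ mainly in speed
# and in how leniently they handle malformed HTML.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print(parser, soup.a["href"])
```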

```python
with progress:
    progress_task = progress.add_task(description, total=total)
    tasks = [_task_with_tracker(task, progress, progress_task) for task in tasks]
import asyncio
```
rajasbansal (Contributor):

Any reason for importing asyncio inside this function?

yadavsahil197 (Contributor, Author):

Good catch! Can move it to the top.

"sentence_transformers",
"bs4",
"httpx",
"fake_useragent"
rajasbansal (Contributor):

do we need to add asyncio to the pyproject file?

yadavsahil197 (Contributor, Author):

I think it is built in; asyncio has been part of the Python standard library since 3.4, so it doesn't need to be listed in pyproject.

yadavsahil197 merged commit 7c447aa into main on Aug 7, 2023. 2 checks passed.
yadavsahil197 deleted the webpage_transform branch on August 7, 2023 at 17:31.