
Add support for Webpage Transform #507

Merged 6 commits from webpage_transform into main on Aug 7, 2023
Conversation

yadavsahil197 (Contributor, Author):

Adds #506

rajasbansal (Contributor) left a comment:

lgtm

```python
output_df = pd.DataFrame.from_records(outputs)
final_df = pd.concat([dataset.df, output_df], axis=1)
dataset = AutolabelDataset(final_df, self.config)
```
rajasbansal (Contributor):

We should probably add this as a util function on AutolabelDataset so we can change it if needed in the future. I missed this when creating the base transform.
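
(For illustration, a minimal sketch of what such a util could look like; the method name `with_new_columns` and its exact signature are assumptions, not the actual API:)

```python
from typing import Dict, List

import pandas as pd

# Hypothetical helper on AutolabelDataset; name and signature are assumptions.
def with_new_columns(self, records: List[Dict]) -> "AutolabelDataset":
    # Centralize how transform outputs are joined back onto the dataset,
    # so the join strategy can change in one place later.
    output_df = pd.DataFrame.from_records(records)
    final_df = pd.concat([self.df, output_df], axis=1)
    return AutolabelDataset(final_df, self.config)
```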


```python
def name(self) -> str:
    return TransformType.WEBPAGE_TRANSFORM
```
rajasbansal (Contributor):

Consider whether adding parameters to the name makes sense, so that the name of a transform is more unique.

yadavsahil197 (Contributor, Author):

Do you have any specific parameters in mind?

rajasbansal (Contributor):

Adding the timeout and URL column? Alternatively, we can create an `_id` function which can be used to hash the transform for caching.
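
(A minimal sketch of such an `_id`, assuming the transform keeps its URL column and timeout as attributes; the attribute names are hypothetical:)

```python
import hashlib
import json

# Hypothetical method on the transform class; attribute names are assumptions.
def _id(self) -> str:
    # Serialize the name and parameters deterministically, then hash them
    # into a stable key the cache can use to identify this transform.
    params = {
        "name": str(self.name()),
        "url_column": self.url_column,
        "timeout": self.timeout,
    }
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
```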

```python
for meta in soup.find_all("meta"):
    if meta.get("name") and meta.get("content"):
        metadata[meta.get("name")] = meta.get("content")
    elif meta.get("property") and meta.get("content"):
        metadata[meta.get("property")] = meta.get("content")
return metadata
```
rajasbansal (Contributor):

metadata will be missing some columns in case we don't find them on the page, right? How will this be handled by _apply?

yadavsahil197 (Contributor, Author):

Right now, we are returning all the meta fields that are present on the webpage. An alternative could be to always look for specific fields, such as the description, encoding, social media tags, etc.

rajasbansal (Contributor):

Makes sense. In case one webpage has a title but the other doesn't, we will put NaN for the second page under that column. This is handled by pandas. Tried it out using:

```python
import pandas as pd
l = [{'a': 1, 'b': 2}, {'a': 3, 'c': 4}]
pd.DataFrame(l)
#    a    b    c
# 0  1  2.0  NaN
# 1  3  NaN  4.0
```

```python
response = await client.get(url, headers=headers)

# TODO: Add support for other parsers
soup = self.beautiful_soup(response.text, HTML_PARSER)
```
rajasbansal (Contributor):

It makes sense to add parsers as part of the config; the init of the transform would then handle instantiating the parser.

However, in the future we might want a parsing class which can parse the webpage however you want, and we define the usual parsers, i.e. text parser, URL parser, table parser, etc. In that case even this might be passed as a class. However, that is difficult to do in the config way of doing things, and registering parsers seems to be out of scope for now.

Edit: (Adding parsing as a separate transform is also possible; however, we need HTML in order to transform, and extracting URLs may not be possible with the text format.)
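
(A minimal sketch of the config-driven option; the `"parser"` config key and this class shape are assumptions for illustration:)

```python
from bs4 import BeautifulSoup

class WebpageTransform:
    # Hypothetical init; the "parser" config key is an assumption.
    def __init__(self, config: dict):
        # Fall back to the stdlib parser when none is configured;
        # "lxml" and "html5lib" require separate installs.
        self.parser = config.get("parser", "html.parser")

    def parse(self, html: str) -> BeautifulSoup:
        return BeautifulSoup(html, self.parser)
```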

yadavsahil197 (Contributor, Author):

I agree we can keep them as two separate sets of transforms. The webpage transform just loads a given URL and returns the loaded HTML, text, or soup. The parser in this context refers to the BeautifulSoup parsers such as html.parser, lxml, and html5lib.
The other transforms support extracting content/URLs from raw HTML, bytes, or soup.
That way each of these transforms has a clear separation of concerns and can be extended independently of the others.
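
(For reference, the BeautifulSoup parsers mentioned are selected by name at parse time; html.parser ships with Python, while lxml and html5lib need separate installs:)

```python
from bs4 import BeautifulSoup

html = "<html><body><a href='https://example.com'>link</a></body></html>"

# The same markup through each backend; they differ mainly in speed
# and in how leniently they handle malformed HTML.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print(parser, soup.a["href"])
```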

```python
with progress:
    progress_task = progress.add_task(description, total=total)
    tasks = [_task_with_tracker(task, progress, progress_task) for task in tasks]
import asyncio
```
rajasbansal (Contributor):

Any reason for importing asyncio inside this function?

yadavsahil197 (Contributor, Author):

Good catch! Can move it to the top.

"sentence_transformers",
"bs4",
"httpx",
"fake_useragent"
rajasbansal (Contributor):

do we need to add asyncio to the pyproject file?

yadavsahil197 (Contributor, Author):

I think it is built in; asyncio has been part of the Python standard library since 3.4, so it doesn't need to be listed in pyproject.

yadavsahil197 merged commit 7c447aa into main on Aug 7, 2023. 2 checks passed.
yadavsahil197 deleted the webpage_transform branch on August 7, 2023 at 17:31.