Preprocessing #39

brasfb · 2023-04-03T07:34:33Z


Maybe this is more of a general scraping question, and I may be in over my head since I'm new to this. After preprocessing with an XPath/CSS selector, what gets sent to OpenAI?

Is it less helpful to just scrape all the plain text on a page and then use the auto splitter?

Answered by jamesturk


jamesturk · 2023-04-05T00:46:01Z

Maintainer

The matching HTML is sent to the LLM. So if you had HTML like:

<html>
<div class="sidebar"> ... </div>
<main><a href="#">Something</a><div>Something Else</div></main>
<div class="footer"> ... </div>
</html>

And passed the selector main, all of the innerHTML of that element would get sent. (If there are multiple matches, they are appended together.)
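To make the selector behaviour concrete, here is a minimal sketch using only Python's standard library. It is not this library's implementation: `select_html` and `inner_html` are hypothetical helper names, and `xml.etree.ElementTree` is an XML parser that only copes with well-formed markup like the sample above, so a real page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# The sample page from the answer above (well-formed, so an XML parser can read it).
PAGE = """<html>
<div class="sidebar"> ... </div>
<main><a href="#">Something</a><div>Something Else</div></main>
<div class="footer"> ... </div>
</html>"""


def inner_html(element):
    """Serialize an element's contents without its own opening/closing tag."""
    parts = [element.text or ""]
    for child in element:
        # ET.tostring includes the child's tail text, which belongs to the parent.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def select_html(page, path):
    """Hypothetical helper: append the inner HTML of every match together."""
    root = ET.fromstring(page)
    return "".join(inner_html(match) for match in root.findall(path))


# ".//main" is ElementTree's limited XPath for "any <main> descendant".
selected = select_html(PAGE, ".//main")
print(selected)
```

Under these assumptions, only the links and divs inside main are kept; the sidebar and footer never reach the model.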

Just extracting text would tend to lose context, and some data (URLs, phone numbers, etc.) might not be in the text at all, but in attributes instead. Further refinement of this library will likely entail figuring out how minimal the HTML sent can be without affecting results. (I'm currently working on building a test corpus of sorts, since these sorts of questions need to be answered in probabilities.)
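As a small illustration of the attribute point, the sketch below compares a text-only extraction with the markup it came from. The snippet, the `tel:` link, and the phone number are made up for the example, and the standard-library XML parser is again standing in for a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: the phone number lives only in the href attribute.
SNIPPET = (
    '<main><a href="tel:+15555550100">Call us</a>'
    "<div>Something Else</div></main>"
)

root = ET.fromstring(SNIPPET)

# Text-only extraction: join every non-empty text node.
text_only = " ".join(t.strip() for t in root.itertext() if t.strip())

# Attribute values that the text-only version never sees.
hrefs = [a.get("href") for a in root.iter("a")]

print(text_only)
print(hrefs)
```

Here the plain text keeps the link label but drops the number itself, which is the kind of loss that sending the HTML avoids.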
