Maybe more of a general scraping question, and maybe I'm in over my head since I'm new to this. After preprocessing and applying the XPath/CSS selector, what gets sent to OpenAI? Is it less helpful to just scrape all the plain text on a page and then use the auto splitter?
Replies: 1 comment
The matching HTML is sent to the LLM. So if your page had a `<main>` element and you passed the selector `main`, all of the innerHTML of that element would get sent. (If there are multiple matches, they are appended together.)

Just extracting text would tend to lose context, and some data (URLs, phone numbers, etc.) might not be in the text, but instead in attributes. Further refinement of this library will likely entail figuring out how minimal the HTML sent can be without affecting results. (I'm currently working on building a test corpus of sorts, since these sorts of questions need to be answered in probabilities.)
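To make the difference concrete, here is a minimal sketch of the two options discussed in the reply: sending the innerHTML of the matched element versus sending only its text. This is not the library's actual code. The page content, the phone number, and the helper names (`inner_html`, `select_html`, `select_text`) are all made up for illustration, and a plain tag-name match from the standard library stands in for a real CSS/XPath selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so the stdlib XML parser can read it.
PAGE = """<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Contact</h1><p>Call <a href="tel:+15550100">our office</a></p></main>
</body></html>"""


def inner_html(el):
    """Serialize an element's contents, excluding its own opening/closing tag."""
    parts = [el.text or ""]
    for child in el:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def select_html(page, tag):
    """Append together the innerHTML of every element with the given tag name."""
    root = ET.fromstring(page)
    return "".join(inner_html(el) for el in root.iter(tag))


def select_text(page, tag):
    """Plain-text alternative: keep only the text nodes of the matches."""
    root = ET.fromstring(page)
    return " ".join("".join(el.itertext()) for el in root.iter(tag))


html_payload = select_html(PAGE, "main")  # markup and attributes survive
text_payload = select_text(PAGE, "main")  # attributes such as href are gone

print(html_payload)
print(text_payload)
```

In the HTML payload the `tel:` link is still present, while the text-only payload keeps just the words "our office", which is the kind of attribute-only data the reply warns about losing. Content outside the matched element (the nav link here) is not included in either payload.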