validate html #36

kcmcleod · 2019-08-27T09:21:16Z

JSoup does very basic HTML fixes (for example moving the closing HTML tag so that everything is inside). However, it does not fix or identify more serious issues like missing closing tags.

If the markup is JSON, this is not a problem. If it is RDFa, it is a massive problem: RDFa is carried by attributes on the HTML elements themselves, so broken structure changes the triples Any23 generates. The output from the scraper is then junk.

The problem is that many pages use both JSON and RDFa. One solution would be to use the JSON and ignore the RDFa, but there is no mechanism for that.
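As a rough illustration of the "use the JSON and ignore the RDFa" idea, the sketch below pulls the JSON blocks out of a page so that only they need to be parsed. It assumes the JSON is embedded in `<script type="application/ld+json">` elements, which this issue does not state. The class and method names are hypothetical, and a regular expression is used only to keep the example dependency-free; it is not how the scraper currently works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the scraper.
final class JsonLdExtractor {

    // Assumption: structured JSON lives in <script type="application/ld+json"> elements.
    private static final Pattern JSON_SCRIPT = Pattern.compile(
            "<script\\b[^>]*type\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed body of every matching script element, in document order.
    static List<String> extractJsonBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = JSON_SCRIPT.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The RDFa <p> is never closed, but the JSON block is still recovered intact.
        String html = "<html><body><div vocab=\"https://schema.org/\">"
                + "<p property=\"name\">Unclosed paragraph"
                + "<script type=\"application/ld+json\">{\"@type\":\"Dataset\"}</script>"
                + "</body></html>";
        System.out.println(extractJsonBlocks(html));
    }
}
```

The extracted strings could then be handed to the triple extractor on their own, so that the surrounding HTML never reaches the RDFa parser. How that hand-off would look depends on the scraper code and is left open here.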

Possible solutions:

For the service/batch scraper, write a method to extract the JSON and pass only that to Any23, thus discarding the bad HTML/RDFa. Also record the presence of errors in the status record (in MySQL).
For the on-demand scraper, validate the HTML and tell the user if there is an issue. As this output is not saved to the crawl, the triples can be shown even if they are junk.
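The validation step for the on-demand scraper could start from something like the following sketch, which flags missing and stray closing tags with a simple stack. It is a heuristic under stated assumptions, not a real HTML validator: it ignores comments and script contents, and it reports tags such as `<p>` or `<li>` as unclosed even though HTML allows their end tags to be omitted. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the scraper.
final class TagBalanceChecker {

    // Void elements never take a closing tag, so they are skipped.
    private static final Set<String> VOID_ELEMENTS = Set.of(
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr");

    // Matches opening, closing and self-closing tags; group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*?(/?)>");

    static List<String> findProblems(String html) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (VOID_ELEMENTS.contains(name) || selfClosing) {
                continue;
            }
            if (!closing) {
                open.push(name);
                continue;
            }
            if (!open.contains(name)) {
                problems.add("stray closing tag: " + name);
                continue;
            }
            // Everything opened after the matching tag was never closed.
            while (!open.peek().equals(name)) {
                problems.add("missing closing tag: " + open.pop());
            }
            open.pop();
        }
        // Anything still open at the end of the document was never closed.
        while (!open.isEmpty()) {
            problems.add("missing closing tag: " + open.pop());
        }
        return problems;
    }

    public static void main(String[] args) {
        // prints [missing closing tag: p]
        System.out.println(findProblems("<div><p>text</div>"));
    }
}
```

A non-empty result could be shown to the user of the on-demand scraper as a warning, or recorded in the status record by the batch scraper.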

AlasdairGray · 2022-07-29T09:57:20Z

We would need to develop tests for this issue to understand what is happening.

kcmcleod added the enhancement label Aug 27, 2019

kcmcleod self-assigned this Aug 27, 2019

AlasdairGray unassigned kcmcleod Jul 29, 2022
