JSoup applies only very basic HTML fixes (for example, moving a misplaced closing HTML tag so that everything sits inside the document). However, it does not fix, or even flag, more serious issues such as missing closing tags.
If the markup is JSON, this is not a problem. If it is RDFa, it is a massive problem, because the broken HTML changes the triples Any23 generates, so the output from the scraper is junk.
The problem is that many pages use both JSON and RDFa. One solution would be to use the JSON and ignore the RDFa, but there is currently no mechanism for that.
Possible solutions:
For the service/batch scraper, write a method that pulls the JSON out of the page and passes just that to Any23, thus removing the bad HTML/RDFa from the pipeline. Also record the presence of errors in the status record (in MySQL).
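A minimal sketch of the extraction step, in Python for illustration (the actual scraper is Java, where the jsoup DOM could be queried with a selector such as `script[type=application/ld+json]` instead). It pulls the contents of JSON-LD `<script>` blocks out of the page so that only those payloads, not the broken RDFa markup, are handed on for triple extraction. The class and function names here are hypothetical, not part of any existing codebase:

```python
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # extracted JSON-LD payloads
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buf).strip())

    def handle_data(self, data):
        # Script content arrives here as raw character data.
        if self._in_jsonld:
            self._buf.append(data)


def extract_json_ld(html: str) -> list[str]:
    """Return the JSON-LD payloads found in an HTML page, in document order."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Each returned block could then be fed to Any23's JSON-LD extraction on its own, so malformed HTML elsewhere in the page never influences the generated triples.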
For the on-demand scraper, validate the HTML and tell the user if there is an issue. As the result is not being saved to the crawl, the triples can still be shown even if they are junk.
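The validation step could be as simple as a tag-balance check that reports unclosed or mismatched tags, which is exactly the class of error JSoup silently papers over. A rough sketch, again in Python with hypothetical names (a production version would need to handle implicitly closed elements such as `<li>` and `<p>`):

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are excluded from the check.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags and records mismatched or unclosed ones."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, line_number) of currently open elements
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
        else:
            self.errors.append(
                f"unexpected </{tag}> at line {self.getpos()[0]}")


def check_html(html: str) -> list[str]:
    """Return a list of human-readable tag-balance problems (empty if clean)."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for tag, line in checker.stack:
        checker.errors.append(f"<{tag}> opened at line {line} was never closed")
    return checker.errors
```

The returned messages could be shown to the on-demand user alongside the (possibly junk) triples, so they know the RDFa output may be unreliable.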