validate html #36

Open
kcmcleod opened this issue Aug 27, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

@kcmcleod
Contributor

JSoup does very basic HTML fixes (for example moving the closing HTML tag so that everything is inside). However, it does not fix or identify more serious issues like missing closing tags.

If the markup is JSON, this is not a problem. If it is RDFa, it is a massive problem, because the broken HTML changes the triples Any23 generates, so the output from the scraper is junk.

The problem is that many pages use both JSON and RDFa. One solution would be to use the JSON and ignore the RDFa, but there is no mechanism for that.
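A minimal sketch of what such a mechanism might look like, assuming "JSON" here means JSON-LD embedded in `<script type="application/ld+json">` elements. `JsonLdExtractor` and its methods are hypothetical names, not part of the scraper, JSoup or Any23, and the regex is a stand-in for whatever parser-based selection the scraper would really use:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls JSON-LD blocks out of a page so that the
// surrounding (possibly broken) HTML/RDFa can be ignored.
class JsonLdExtractor {

    // Matches <script type="application/ld+json"> ... </script>,
    // case-insensitively, with a body that may span several lines.
    private static final Pattern JSON_LD_SCRIPT = Pattern.compile(
            "<script\\b[^>]*\\btype\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed body of every JSON-LD script element, in page order.
    static List<String> extract(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = JSON_LD_SCRIPT.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    // Wraps the blocks in a minimal, well-formed page that contains no RDFa.
    static String toMinimalHtml(List<String> blocks) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE html><html><head>");
        for (String block : blocks) {
            sb.append("<script type=\"application/ld+json\">")
              .append(block)
              .append("</script>");
        }
        return sb.append("</head><body></body></html>").toString();
    }

    public static void main(String[] args) {
        // The <span> is never closed, which is the kind of error described above.
        String page = "<html><body><div vocab=\"http://schema.org/\" typeof=\"Dataset\">"
                + "<span property=\"name\">Example"
                + "<script type=\"application/ld+json\">{\"@type\":\"Dataset\"}</script>"
                + "</div></body></html>";
        System.out.println(toMinimalHtml(extract(page)));
    }
}
```

Whether the result is handed to Any23 as the rebuilt page or as raw JSON-LD is left open here; the point is only that the RDFa never reaches the extractor.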

Possible solutions:

  1. For the service/batch scraper, write a method that extracts the JSON and passes only that to Any23, thus dropping the bad HTML/RDFa. Also record the presence of errors in the status record (in MySQL).
  2. For the on-demand scraper, validate the HTML and tell the user if there is an issue. As this is not being saved to the crawl, the triples can be shown even if they are junk.
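The validation step in both options could start from something as simple as a tag-balance check. The sketch below is an assumption about what "validate" might mean here, not a full HTML validator: `TagBalanceChecker` is a hypothetical name, it uses a regex rather than a real parser, it does not skip `<` characters inside scripts, and it reports end tags that HTML allows to be omitted (such as `</p>` or `</li>`) as missing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports start tags that are never closed and
// end tags that have no matching start tag.
class TagBalanceChecker {

    // Void elements never take a closing tag, so they are ignored.
    private static final Set<String> VOID = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"));

    // Group 1: "/" for an end tag; group 2: tag name; group 3: "/" if self-closing.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*?(/?)>");

    static List<String> findProblems(String html) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (VOID.contains(name) || selfClosing) {
                continue;
            }
            if (!closing) {
                open.push(name);
                continue;
            }
            if (!open.contains(name)) {
                problems.add("unexpected </" + name + ">");
                continue;
            }
            // Everything opened after the matching start tag was left unclosed.
            while (!open.peek().equals(name)) {
                problems.add("missing </" + open.pop() + ">");
            }
            open.pop();
        }
        // Anything still open at the end of the page was never closed.
        while (!open.isEmpty()) {
            problems.add("missing </" + open.pop() + ">");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(findProblems("<div><span>text</div>"));
    }
}
```

A non-empty result could be stored in the status record by the batch scraper, or shown to the user next to the triples by the on-demand scraper.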
@kcmcleod kcmcleod added the enhancement New feature or request label Aug 27, 2019
@kcmcleod kcmcleod self-assigned this Aug 27, 2019
@AlasdairGray
Member

We would need to develop tests for this issue to understand what is happening.
