forked from markusmobius/go-domdistiller
-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
948d5e3
commit e59c202
Showing
3 changed files
with
45 additions
and
8 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# Improvements | ||
|
||
After using both Readability.js and DOM Distiller, we found that there are several improvements that can be implemented into this port. Besides that, from our experiments we also found some possible bugs that we decided to fix. | ||
|
||
These so-called improvements are listed here as historical documentation and to explain the difference between the main branch and stable branch. | ||
|
||
## From Readability | ||
|
||
- Implement function to check if a HTML element is probably visible or not. This is especially useful since one of the DOM Distiller strategy is to exclude invisible elements by computing the stylesheets (which is impossible to do in Go). | ||
- Exclude form and input element, since in distilled mode we only want to read. | ||
- Skip byline, empty div and unlikely elements by checking its class name, id and role attributes. | ||
- Convert anchors with Javascript URL into an ordinary text node. | ||
- Convert font to span elements. This is done because the font elements is usually only used for styling, so Readability.js decided to convert it. | ||
- Exclude identification and presentational attributes (eg. `id`, `class` and `style`) from each elements. | ||
|
||
## From our own experiments | ||
|
||
- Make sure figure's caption doesn't contains noscript elements. This is done because noscript in Go is a bit weird, sometimes it detected as HTML element while the other times it detected as plain text, so we need additional schecks to clean it. | ||
- Mark large blocks around main content's tag level as content as well. In original DOM Distiller, they are looking for the most likely main content, then they mark text blocks that exist in the same tag level of the main content as content as well. Unfortunately, we found out that in some sites parts of the article are omitted by DOM Distiller. To fix this, we decided to make the filter more tolerant by checking text blocks in lower and upper tag levels as well. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters