Update readme
RadhiFadlillah committed Nov 2, 2020
1 parent 948d5e3 commit e59c202
Showing 3 changed files with 45 additions and 8 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,9 @@
# Changelog

### 31 October 2020

- Separate stable version to its own branch.

### 30 October 2020

- From Readability: strip identification and presentational attributes from each node.
19 changes: 19 additions & 0 deletions IMPROVEMENTS.md
@@ -0,0 +1,19 @@
# Improvements

After using both Readability.js and DOM Distiller, we found several improvements that can be implemented in this port. In addition, our experiments uncovered some possible bugs that we decided to fix.

These improvements are listed here as historical documentation and to explain the difference between the main branch and the stable branch.

## From Readability

- Implement a function to check whether an HTML element is probably visible. This is especially useful since one of DOM Distiller's strategies is to exclude invisible elements by computing the stylesheets (which is impossible to do in Go).
- Exclude form and input elements, since in distilled mode we only want to read.
- Skip bylines, empty divs and unlikely elements by checking their class name, id and role attributes.
- Convert anchors with JavaScript URLs into ordinary text nodes.
- Convert font elements to span elements. This is done because font elements are usually only used for styling, so Readability.js converts them.
- Exclude identification and presentational attributes (e.g. `id`, `class` and `style`) from each element.
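
To make a few of these items concrete, here is a minimal sketch of the visibility check, the font-to-span conversion and the attribute stripping. It is not the code used in this package: the `Node` type and all function names are hypothetical, and the visibility heuristic only looks at inline attributes (`hidden`, `aria-hidden` and the `style` attribute), which is an assumption about what "probably visible" might reasonably mean without computed stylesheets.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, minimal stand-in for an HTML element.
// A real implementation would work on a parsed DOM tree instead.
type Node struct {
	Tag   string
	Attrs map[string]string
}

// isProbablyVisible is a simplified heuristic (an assumption, not the
// package's actual logic). Without computed styles it can only inspect
// attributes that are set directly on the element.
func isProbablyVisible(n *Node) bool {
	if _, hidden := n.Attrs["hidden"]; hidden {
		return false
	}
	if strings.EqualFold(strings.TrimSpace(n.Attrs["aria-hidden"]), "true") {
		return false
	}
	// Normalise the inline style so "display: none" and "DISPLAY:NONE" both match.
	style := strings.ToLower(strings.ReplaceAll(n.Attrs["style"], " ", ""))
	return !strings.Contains(style, "display:none") &&
		!strings.Contains(style, "visibility:hidden")
}

// fontToSpan renames a font element to span and leaves other tags alone.
func fontToSpan(n *Node) {
	if n.Tag == "font" {
		n.Tag = "span"
	}
}

// stripPresentationalAttrs removes the id, class and style attributes.
func stripPresentationalAttrs(n *Node) {
	for _, name := range []string{"id", "class", "style"} {
		delete(n.Attrs, name)
	}
}

func main() {
	hidden := &Node{Tag: "div", Attrs: map[string]string{"style": "display: none"}}
	shown := &Node{Tag: "font", Attrs: map[string]string{"id": "intro", "class": "lead", "lang": "en"}}

	fmt.Println(isProbablyVisible(hidden), isProbablyVisible(shown))

	fontToSpan(shown)
	stripPresentationalAttrs(shown)
	fmt.Println(shown.Tag, shown.Attrs)
}
```

Because the check never sees external stylesheets, an element hidden by a CSS class would still pass it, which is why the result is only "probably" visible.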

## From our own experiments

- Make sure a figure's caption doesn't contain noscript elements. This is done because noscript in Go is a bit weird: sometimes it is detected as an HTML element and other times as plain text, so we need additional checks to clean it.
- Mark large blocks around the main content's tag level as content as well. The original DOM Distiller looks for the most likely main content, then marks text blocks at the same tag level as the main content as content too. Unfortunately, we found that on some sites DOM Distiller omits parts of the article. To fix this, we made the filter more tolerant by also checking text blocks in lower and upper tag levels.
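
The tag-level tolerance described in the last item can be sketched as follows. This is only an illustration under stated assumptions: the `TextBlock` type, the `markLargeBlocksNearLevel` function and the `tolerance` and `minWords` parameters are hypothetical names, and the values used in `main` are made up for the example rather than taken from this package.

```go
package main

import "fmt"

// TextBlock is a hypothetical, simplified text block. TagLevel is the
// depth of the block in the DOM tree.
type TextBlock struct {
	Text      string
	TagLevel  int
	NumWords  int
	IsContent bool
}

// abs returns the absolute value of x.
func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// markLargeBlocksNearLevel marks every block that has at least minWords
// words and whose tag level is within tolerance levels of mainLevel.
// A tolerance of 0 only accepts blocks at exactly the same level.
// It returns the number of blocks that were newly marked as content.
func markLargeBlocksNearLevel(blocks []TextBlock, mainLevel, tolerance, minWords int) int {
	changed := 0
	for i := range blocks {
		b := &blocks[i]
		if b.IsContent || b.NumWords < minWords {
			continue
		}
		if abs(b.TagLevel-mainLevel) <= tolerance {
			b.IsContent = true
			changed++
		}
	}
	return changed
}

func main() {
	blocks := []TextBlock{
		{Text: "main article body", TagLevel: 5, NumWords: 400, IsContent: true},
		{Text: "continuation in a nested div", TagLevel: 6, NumWords: 150},
		{Text: "short caption", TagLevel: 5, NumWords: 8},
		{Text: "footer text", TagLevel: 2, NumWords: 120},
	}

	// Illustrative values only: main content at level 5, one level of
	// tolerance in each direction, at least 100 words per block.
	n := markLargeBlocksNearLevel(blocks, 5, 1, 100)
	fmt.Println("newly marked blocks:", n)
	for _, b := range blocks {
		fmt.Printf("%-30s level=%d content=%v\n", b.Text, b.TagLevel, b.IsContent)
	}
}
```

With a tolerance of 0 the nested continuation block would be skipped, which mirrors the omitted article parts mentioned above; widening the tolerance trades a little precision for better recall.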
30 changes: 22 additions & 8 deletions README.md
@@ -1,14 +1,18 @@
# Go-DomDistiller

> This main branch is the development version for Go-DomDistiller. Check the [stable branch][5] for the stable version.
Go-DomDistiller is a Go package that finds the main readable content and the metadata from an HTML page. It works by removing clutter like buttons, ads, background images, scripts, etc.

This package is based on [DOM Distiller][0], which is part of the Chromium project and is written in Java. The structure of this package follows the structure of the original Java code. This way, any improvements from Chromium can (hopefully) be implemented easily here.

The port has been [completed][6] and we have used it to process millions of web pages, so it should be stable enough to use.

## Motivations

We are doing computational social science research on news consumption, so we collect a lot of web pages and extract the articles inside them using headless Chrome running Readability.js and DOM Distiller. This works fine, but it is unbearably slow.

After looking around, we found out that [Readability.js][1] has been [ported to Go][2] and that the port has impressive performance. With that said, we decided to port DOM Distiller to Go as well.
After looking around, we found out that [Readability.js][1] has been [ported to Go][2] by [@RadhiFadlillah] and that the port has impressive performance. With that said, we decided to ask him to port DOM Distiller to Go as well.

## Limitations

@@ -18,6 +22,12 @@ Unfortunately it's impossible to do that on the server side, and we don't want t

Fortunately, the [research][4] by Mohammad Ghasemisharif et al. (2018) expects this modification to have minimal effect on extraction results, so we feel confident going forward with the port.

## Comparison with the stable branch

The stable branch is a faithful port of the original DOM Distiller that only receives bug fixes, while this main branch adds some [insights][7] from Go-Readability.

Both should be stable enough to use, but you may prefer the stable branch if you want the version that is as close as possible to the original DOM Distiller.

## Comparison with Go-Readability

Since Readability and DOM Distiller work using different algorithms, their results are a bit different. In general both give satisfactory results, but we found that there are some cases where DOM Distiller is better and vice versa. In practice we use both of them, then use some kind of scoring to find out which extraction result is more suitable for our use case.
@@ -37,18 +47,18 @@ The pros of Readability :
Here is the benchmark result between DOM Distiller and Readability :

```
BenchmarkReadability-8 1 22270423614 ns/op 5134614848 B/op 21071083 allocs/op
BenchmarkDistillerWithoutPagination-8 1 24248745284 ns/op 7987711256 B/op 30309028 allocs/op
BenchmarkDistillerPageNumberPagination-8 1 33292305569 ns/op 8080458848 B/op 32918938 allocs/op
BenchmarkDistillerPrevNextPagination-8 1 47737605918 ns/op 8378848776 B/op 36243299 allocs/op
BenchmarkReadability-8 1 22270423614 ns/op 5134614848 B/op 21071083 allocs/op
BenchmarkDistillerWithoutPagination-8 1 24248745284 ns/op 7987711256 B/op 30309028 allocs/op
BenchmarkDistillerPageNumberPagination-8 1 33292305569 ns/op 8080458848 B/op 32918938 allocs/op
BenchmarkDistillerPrevNextPagination-8 1 47737605918 ns/op 8378848776 B/op 36243299 allocs/op
```

## Installation

To install this package, just run `go get` :
To install the development version of this package, just run `go get` for the main branch :

```
go get -u -v github.com/markusmobius/go-domdistiller
go get -u -v github.com/markusmobius/go-domdistiller@main
```

## API Documentation
@@ -248,4 +258,8 @@ Go-DomDistiller is distributed under [MIT license](https://choosealicense.com/li
[1]: https://github.com/mozilla/readability
[2]: https://github.com/go-shiori/go-readability
[3]: https://github.com/markusmobius/go-domdistiller/search?q=NEED-COMPUTE-CSS
[4]: https://arxiv.org/abs/1811.03661
[4]: https://arxiv.org/abs/1811.03661
[5]: https://github.com/markusmobius/go-domdistiller/tree/stable
[6]: https://github.com/markusmobius/go-domdistiller/blob/main/CHANGELOG.md
[7]: https://github.com/markusmobius/go-domdistiller/blob/main/IMPROVEMENTS.md
[@RadhiFadlillah]: https://github.com/RadhiFadlillah
