
Releases: markusmobius/go-trafilatura

v1.12.2

31 Oct 22:24
b16c8f7
  • Update dependencies and compile one regex with re2go in #10
  • Update dependencies and add benchmark in #11
  • Catch up to Python's Trafilatura v1.12.2 and end CLI output with a newline, following the POSIX standard, in #12

Full Changelog: v1.11.1...v1.12.2

v1.11.1

07 Jul 06:26
  • Catch up to v1.11.0 of Python's Trafilatura.
  • Add a new config property that disables the HtmlDate extractor, which is useful when you don't need a precise publish date.

v1.10.0

01 Jul 12:48

At this point we've caught up with Trafilatura v1.10.0. Our previous release matched Trafilatura v1.5.1, so a lot has changed upstream. As a result, there are three breaking changes on our side:

1. Change in how to specify extraction focus

Back in v1.5.1, the extraction focus was specified by setting the boolean properties FavorPrecision and FavorRecall in the Options struct. For example:

optsRecall := trafilatura.Options{FavorRecall: true}
optsPrecision := trafilatura.Options{FavorPrecision: true}

In v1.10.0, these properties are replaced by the ExtractionFocus enum, which is set through the Focus property of the Options struct. For example:

optsRecall := trafilatura.Options{Focus: trafilatura.FavorRecall}
optsPrecision := trafilatura.Options{Focus: trafilatura.FavorPrecision}
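
If you have many call sites that still carry the two old booleans, the mapping onto the new enum can be sketched as below. This is only an illustration with local stand-in types, not the library's own code: the name Balanced for the "neither flag set" case, the helper focusFromLegacy, and the rule that precision wins when both flags are set are all assumptions made for this example.

```go
package main

import "fmt"

// ExtractionFocus is a local stand-in for the enum named in these notes.
// It is not the library's own type.
type ExtractionFocus int

const (
	Balanced       ExtractionFocus = iota // assumed name for "neither flag set"
	FavorRecall                           // name taken from the notes above
	FavorPrecision                        // name taken from the notes above
)

// focusFromLegacy is a hypothetical helper that folds the two v1.5.1
// booleans into a single focus value. Letting precision win when both
// flags are set is an assumption of this sketch.
func focusFromLegacy(favorRecall, favorPrecision bool) ExtractionFocus {
	switch {
	case favorPrecision:
		return FavorPrecision
	case favorRecall:
		return FavorRecall
	default:
		return Balanced
	}
}

func main() {
	fmt.Println(focusFromLegacy(true, false) == FavorRecall)
	fmt.Println(focusFromLegacy(false, true) == FavorPrecision)
	fmt.Println(focusFromLegacy(false, false) == Balanced)
}
```

A single enum also removes the ambiguous state where both booleans are true, which the two-flag form could not rule out.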

2. Change in how to enable fallback extractors

Back in v1.5.1, enabling fallback extractors required putting at least an empty FallbackConfig in the FallbackCandidates property of the Options struct. For example:

optsNoFallback := trafilatura.Options{FallbackCandidates: nil}
optsWithFallback := trafilatura.Options{FallbackCandidates: &trafilatura.FallbackConfig{}}

As you can see, this is unclear and confusing. So in v1.10.0 it is replaced by a simple boolean property, EnableFallback, in the Options struct. For example:

optsNoFallback := trafilatura.Options{EnableFallback: false}
optsWithFallback := trafilatura.Options{EnableFallback: true}

3. Change in how to specify custom fallback candidates

This breaking change is related to point 2.

In another project, we collect web pages and extract their main content using three extractors: Go-Readability, Go-DomDistiller and Go-Trafilatura with fallback enabled.

As you might know, Go-Trafilatura generates its fallback by running its own instances of Readability and Dom Distiller.

Since we already run Readability and Dom Distiller before running Trafilatura, it makes sense to pass their extraction results to Trafilatura rather than have Trafilatura run them again.

So, since v1.5.1 we have supported this by passing the extraction results through a FallbackConfig struct in the FallbackCandidates property of the Options struct:

readabilityResult := runReadability()
distillerResult := runDomDistiller()

opts := trafilatura.Options{
	FallbackCandidates: &trafilatura.FallbackConfig{
		HasReadability:      true,
		ReadabilityFallback: readabilityResult,
		HasDistiller:        true,
		DistillerFallback:   distillerResult,
	},
}

As you can see, this is verbose and could be simpler. So in v1.10.0, we changed it to this:

readabilityResult := runReadability()
distillerResult := runDomDistiller()

opts := trafilatura.Options{
	EnableFallback: true,
	FallbackCandidates: &trafilatura.FallbackCandidates{
		Readability: readabilityResult,
		Distiller:   distillerResult,
	},
}
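
Taken together, changes 2 and 3 split the old FallbackCandidates field into an on/off switch plus optional precomputed results. The sketch below shows one way that migration could be expressed. It uses local stand-in types only: Node is a placeholder for whatever the extractors return, and LegacyFallbackConfig and migrateFallback are hypothetical names invented for this example, not part of the library.

```go
package main

import "fmt"

// Node is a placeholder for an extractor result; the real type is not
// stated in these notes.
type Node struct{ Text string }

// LegacyFallbackConfig mirrors the v1.5.1 FallbackConfig fields shown above.
type LegacyFallbackConfig struct {
	HasReadability      bool
	ReadabilityFallback *Node
	HasDistiller        bool
	DistillerFallback   *Node
}

// FallbackCandidates mirrors the v1.10.0 struct shown above.
type FallbackCandidates struct {
	Readability *Node
	Distiller   *Node
}

// migrateFallback is a hypothetical helper that maps the old single field
// onto the two new ones: nil meant "fallback off", and any non-nil config
// (even an empty one) meant "fallback on".
func migrateFallback(old *LegacyFallbackConfig) (bool, *FallbackCandidates) {
	if old == nil {
		return false, nil
	}
	cands := &FallbackCandidates{}
	if old.HasReadability {
		cands.Readability = old.ReadabilityFallback
	}
	if old.HasDistiller {
		cands.Distiller = old.DistillerFallback
	}
	if cands.Readability == nil && cands.Distiller == nil {
		// Fallback stays enabled, but no precomputed results are supplied.
		return true, nil
	}
	return true, cands
}

func main() {
	old := &LegacyFallbackConfig{
		HasReadability:      true,
		ReadabilityFallback: &Node{Text: "readability result"},
	}
	enable, cands := migrateFallback(old)
	fmt.Println(enable, cands.Readability.Text, cands.Distiller == nil)
}
```

The Has* flags disappear in the new shape because a nil candidate already says "not provided", which is what makes the v1.10.0 form shorter.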

Caught up with trafilatura 1.5.0 and compiles under Windows with cgo option

26 Jun 06:03
v1.5.1

Allow compilation with cgo under Windows.