Releases: markusmobius/go-trafilatura
v1.12.2
v1.11.1
- Catch up to v1.11.0 of Python's Trafilatura.
- Add a new config property that allows disabling the HtmlDate extractor, which is useful when you don't really need a precise publish date.
v1.10.0
At this point we've caught up with Trafilatura v1.10.0. Our last release was on par with Trafilatura v1.5.1, so there have been a lot of changes on their side. As a result, there are three breaking changes on our side:
1. Change in how to specify extraction focus
Back in v1.5.1, we specified the extraction focus by setting the boolean properties FavorPrecision and FavorRecall in the Options struct. For example:
optsRecall := trafilatura.Options{FavorRecall: true}
optsPrecision := trafilatura.Options{FavorPrecision: true}
In v1.10.0, these properties are replaced with the enum ExtractionFocus, which is set through the Focus property of the Options struct. For example:
optsRecall := trafilatura.Options{Focus: trafilatura.FavorRecall}
optsPrecision := trafilatura.Options{Focus: trafilatura.FavorPrecision}
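When migrating existing call sites, the two old booleans can be mapped onto a single focus value. The following is only a minimal sketch built on local stand-in types, not on the library's own: the `Balanced` constant, the `legacyOptions` struct, the `focusFromLegacy` helper, and the decision to let precision win when both flags are set are all assumptions made for illustration.

```go
package main

import "fmt"

// ExtractionFocus is a local stand-in for the enum described above.
// It is NOT imported from go-trafilatura.
type ExtractionFocus uint8

const (
	Balanced ExtractionFocus = iota // assumed default when no flag is set
	FavorRecall
	FavorPrecision
)

// legacyOptions mimics only the two v1.5.1 boolean properties.
type legacyOptions struct {
	FavorPrecision bool
	FavorRecall    bool
}

// focusFromLegacy maps the old booleans to a single focus value.
// Letting precision win when both flags are set is an arbitrary choice.
func focusFromLegacy(old legacyOptions) ExtractionFocus {
	switch {
	case old.FavorPrecision:
		return FavorPrecision
	case old.FavorRecall:
		return FavorRecall
	default:
		return Balanced
	}
}

func main() {
	focus := focusFromLegacy(legacyOptions{FavorRecall: true})
	fmt.Println(focus == FavorRecall)
}
```

A single enum makes the "both flags set" case unrepresentable, which is presumably part of the appeal of the new design.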
2. Change in how to enable fallback extractors
Back in v1.5.1, enabling fallback extractors required putting at least an empty FallbackConfig in the FallbackCandidates property of the Options struct. For example:
optsNoFallback := trafilatura.Options{FallbackCandidates: nil}
optsWithFallback := trafilatura.Options{FallbackCandidates: &trafilatura.FallbackConfig{}}
As you can see, this is unclear and quite confusing. So in v1.10.0, it's replaced with a simple boolean property EnableFallback in the Options struct. For example:
optsNoFallback := trafilatura.Options{EnableFallback: false}
optsWithFallback := trafilatura.Options{EnableFallback: true}
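The old "nil means disabled" convention translates directly into the new boolean. A rough sketch of that translation follows, again using hypothetical local types (`legacyFallbackConfig`, `legacyOptions`, `newOptions`) and a hypothetical helper `migrateFallbackFlag` rather than anything exported by the library:

```go
package main

import "fmt"

// legacyFallbackConfig stands in for the v1.5.1 FallbackConfig struct.
// Its fields are omitted because only nil-ness matters here.
type legacyFallbackConfig struct{}

// legacyOptions mimics the v1.5.1 way of enabling fallback extractors.
type legacyOptions struct {
	FallbackCandidates *legacyFallbackConfig
}

// newOptions mimics the v1.10.0 boolean switch.
type newOptions struct {
	EnableFallback bool
}

// migrateFallbackFlag turns "non-nil config" into an explicit boolean.
func migrateFallbackFlag(old legacyOptions) newOptions {
	return newOptions{EnableFallback: old.FallbackCandidates != nil}
}

func main() {
	off := migrateFallbackFlag(legacyOptions{FallbackCandidates: nil})
	on := migrateFallbackFlag(legacyOptions{FallbackCandidates: &legacyFallbackConfig{}})
	fmt.Println(off.EnableFallback, on.EnableFallback)
}
```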
3. Change in how to specify custom fallback candidates
This breaking change is related to point number 2.
In another project, we collect several web pages and extract their main content using three extractors: Go-Readability, Go-DomDistiller and Go-Trafilatura with fallback enabled.
As you might know, Go-Trafilatura generates its fallback candidates by running its own instances of Readability and Dom Distiller.
Since we already run Readability and Dom Distiller before running Trafilatura, it makes sense to just pass their extraction results to Trafilatura rather than making Trafilatura run them again.
So, since v1.5.1 we have supported this by passing the extraction results through a FallbackConfig struct in the FallbackCandidates property of the Options struct:
readabilityResult := runReadability()
distillerResult := runDomDistiller()
opts := trafilatura.Options{
	FallbackCandidates: &trafilatura.FallbackConfig{
		HasReadability:      true,
		ReadabilityFallback: readabilityResult,
		HasDistiller:        true,
		DistillerFallback:   distillerResult,
	},
}
As you can see, it's really verbose and could be simpler. So in v1.10.0, we changed it to this:
readabilityResult := runReadability()
distillerResult := runDomDistiller()
opts := trafilatura.Options{
	EnableFallback: true,
	FallbackCandidates: &trafilatura.FallbackCandidates{
		Readability: readabilityResult,
		Distiller:   distillerResult,
	},
}
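For code that still builds the old configuration, a small conversion helper may ease the transition. The sketch below mirrors the field names from the two examples above on local stand-in types. The `Node` placeholder, the `migrateFallback` helper, and the rule that a candidate is carried over only when its `Has...` flag was set are assumptions for illustration, not the library's documented behavior.

```go
package main

import "fmt"

// Node is a placeholder for whatever document type the extractors return.
type Node struct{ Text string }

// legacyFallbackConfig mirrors the field names of the v1.5.1 example.
type legacyFallbackConfig struct {
	HasReadability      bool
	ReadabilityFallback *Node
	HasDistiller        bool
	DistillerFallback   *Node
}

// fallbackCandidates mirrors the field names of the v1.10.0 example.
type fallbackCandidates struct {
	Readability *Node
	Distiller   *Node
}

// newOptions holds only the two properties relevant to this migration.
type newOptions struct {
	EnableFallback     bool
	FallbackCandidates *fallbackCandidates
}

// migrateFallback converts the old pointer-based config to the new shape.
func migrateFallback(old *legacyFallbackConfig) newOptions {
	if old == nil {
		return newOptions{EnableFallback: false}
	}
	opts := newOptions{EnableFallback: true}
	if !old.HasReadability && !old.HasDistiller {
		return opts // empty legacy config: fallback on, no custom candidates
	}
	candidates := &fallbackCandidates{}
	if old.HasReadability {
		candidates.Readability = old.ReadabilityFallback
	}
	if old.HasDistiller {
		candidates.Distiller = old.DistillerFallback
	}
	opts.FallbackCandidates = candidates
	return opts
}

func main() {
	old := &legacyFallbackConfig{
		HasReadability:      true,
		ReadabilityFallback: &Node{Text: "readability result"},
	}
	opts := migrateFallback(old)
	fmt.Println(opts.EnableFallback, opts.FallbackCandidates.Readability.Text)
}
```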
Caught up with Trafilatura 1.5.0 and compiles under Windows with the cgo option
v1.5.1 allow compilation with cgo under Windows