Releases: microlinkhq/metascraper
v5.5.4
v5.5.3
v5.5.2
v5.5.1
v5.5.0
v5.4.7
v5.4.6
v5.0.0
Breaking Changes
Rules Bundles processed in parallel
Until now, rules bundles were processed sequentially, which made it possible to pass meta
between rules:
({ htmlDom: $, meta, url: baseUrl }) => wrap($ => $('meta[property="og:logo"]').attr('content')),
Now the rules bundles are processed in parallel, so rules can no longer share information and meta
is no longer passed.
The only official rules bundle affected by this is metascraper-lang-detector.
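As a rough illustration of what parallel processing implies, here is a minimal sketch. It is not the metascraper implementation: `titleBundle`, `langBundle` and `runBundles` are hypothetical names, and the regex-based rules stand in for real DOM queries. The point is only that each rule receives the page input and nothing else, so no rule can depend on a value produced by another.

```javascript
// Hypothetical rules bundles: each maps a property name to a rule function.
// A rule receives only the page input ({ html, url }); there is no shared `meta`.
const titleBundle = {
  title: async ({ html }) => {
    const match = /<title>([^<]*)<\/title>/i.exec(html)
    return match ? match[1].trim() : undefined
  }
}

const langBundle = {
  lang: async ({ html }) => {
    const match = /<html[^>]*\slang="([^"]+)"/i.exec(html)
    return match ? match[1] : undefined
  }
}

// Start every rule of every bundle at once and merge the results when all settle.
async function runBundles (bundles, input) {
  const entries = bundles.flatMap(bundle => Object.entries(bundle))
  const results = await Promise.all(
    entries.map(async ([name, rule]) => [name, await rule(input)])
  )
  return Object.fromEntries(results)
}
```

A bundle that previously read `meta` from an earlier rule would need to compute that value itself from the page input.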
Improvements
Add metascraper-readability
The metascraper-readability bundle (http://npm.im/metascraper-readability) is based on https://github.com/mozilla/readability.
v4.9.0
Remove sanitize-html
The dependency was introducing a bug related to malformed URLs: apostrophecms/sanitize-html#274
In fact, I found it is no longer necessary, since htmlparser2 is already used as part of the cheerio load method.
Result: Smaller bundle, less parsing time.
Set Up Case-Insensitive CSS Rules
One of the things sanitize-html did was normalize some common aspects of the HTML markup.
Because it is no longer a dependency, and after discovering that CSS rules can be case-insensitive, I enabled case-insensitive matching wherever possible.
Result: Better data detection, less initial parsing time.
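To make the idea concrete: CSS attribute selectors accept an `i` flag, as in `meta[name="dc.date.issued" i]`, which matches the attribute value regardless of case. The sketch below emulates that matching in plain JavaScript so it runs without any dependency. The `metaTags` data and the `findMetaContent` helper are hypothetical and are not part of metascraper.

```javascript
// Hypothetical parsed <meta> tags, represented as plain objects.
const metaTags = [
  { name: 'DC.Date.Issued', content: '2019-02-01' },
  { name: 'author', content: 'Jane Doe' }
]

// Emulates `meta[name="<value>" i]`: compare attribute values ignoring case
// and return the `content` of the first matching tag.
function findMetaContent (tags, name) {
  const wanted = name.toLowerCase()
  const tag = tags.find(
    t => typeof t.name === 'string' && t.name.toLowerCase() === wanted
  )
  return tag ? tag.content : undefined
}
```

With a case-sensitive comparison, a page that writes `DC.Date.Issued` would be missed by a rule looking for `dc.date.issued`.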
Improve Date Rules
Building on the case-insensitive CSS rules improvement, I re-checked the rules in the metascraper-date bundle.
I found some opportunities for improvement: some rules could be merged into one, and others could be made more generic, which improves data accuracy.
I also prioritized the modification date over the creation date, so the output reflects the last time the content was updated.
Result: Better date accuracy, more values detected.
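The "update over create" priority can be sketched as an ordered lookup. This is an assumption-laden illustration, not the actual metascraper-date rule set: the field names in `DATE_FIELDS` and the `pickDate` helper are hypothetical, chosen only to show modification fields being tried before publication fields.

```javascript
// Hypothetical candidate fields in priority order: modification dates first.
const DATE_FIELDS = [
  'dateModified',
  'article:modified_time',
  'datePublished',
  'article:published_time'
]

// Return the first candidate that parses as a valid date, as an ISO 8601 string.
function pickDate (candidates) {
  for (const field of DATE_FIELDS) {
    const value = candidates[field]
    if (!value) continue
    const date = new Date(value)
    if (!Number.isNaN(date.getTime())) return date.toISOString()
  }
  return undefined
}
```

When a page exposes both a publication and a modification date, the modification date wins. An unparseable value is skipped, and the next field in the list is tried.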
Improve URL detection
URL detection has been improved so that more kinds of URLs can be detected.
A URL is a subtype of URI, and the goal is to detect as much data as possible.
The URL-related helpers in metascraper-helpers can now detect URIs, such as base64-encoded data image URIs or magnet URIs.
The challenge is doing this while still supporting the original functionality, so I added a lot of tests to ensure that.
Result: Better URL detection, with support for URIs.
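As a hedged sketch of the difference between accepting only web URLs and accepting URIs in general, the snippet below uses the standard WHATWG `URL` class available in Node.js and browsers. The `uriScheme` and `isUri` helpers are hypothetical and are not the functions exported by metascraper-helpers.

```javascript
// Return the scheme of an absolute URI (without the trailing colon),
// or undefined when the value cannot be parsed as an absolute URI.
function uriScheme (value) {
  if (typeof value !== 'string' || value.trim() === '') return undefined
  try {
    return new URL(value.trim()).protocol.slice(0, -1)
  } catch (_) {
    return undefined
  }
}

// Accept any absolute URI, not only http(s) URLs.
function isUri (value) {
  return uriScheme(value) !== undefined
}
```

Under this sketch, `https://example.com/logo.png`, `data:image/png;base64,iVBORw0KGgo=` and `magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a` are all accepted, while a relative path such as `/logo.png` is rejected because it has no scheme.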