feat: extract custom types with extend option (#313)
* feat: extract custom types with extend option
Adds an `extend` option that lets you add custom types to be extracted
and returned alongside the defaults, either in a call to `parse()` or in
a custom extractor.
```
// Options, including `extend`, go in an object passed as the second argument.
Mercury.parse(url, {
  extend: {
    last_edited: { selectors: ['#last-edited'], defaultCleaner: false },
  },
});
```
* chore: use Reflect.ownKeys
* feat: add CLI options
* doc: add extend param to cli help
* refactor: extract selectExtendedTypes
* feat: only overwrite null extended results
* feat: add allowMultiple extraction option
* feat: accept extendList CLI args
* feat: allow attribute selectors in extends on CLI
* test: update extend tests
* fix: don't invoke cleaner for custom types
* feat: always return array if allowMultiple
* test: add test for array of single result
* refactor: extract extractHtml
* refactor: destructure allowMultiple
* fix: wrap multiple matches in $ for cheerio shim
* fix: find extended types before any other munging
* feat: absolutize all links
* fix: clean content more directly
* doc: Update CLI docs in README
* chore: update dist
* doc: Document extend in custom extractor README
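The selection behavior described in the commits above (`Reflect.ownKeys`, `allowMultiple`, always returning an array when `allowMultiple` is set) can be pictured with a small standalone sketch. This is not the library's implementation: `selectExtended` and the `page` lookup map are hypothetical names used only for illustration, the map stands in for a real DOM query, and returning `null` when nothing matches is an assumption based on the README's note that missing fields return `null`.

```javascript
// Hypothetical stand-in for a parsed page: selector -> matched text values.
// The real parser queries a DOM; this map only keeps the sketch self-contained.
const page = {
  '#last-edited': ['2019-03-01'],
  '.tag': ['news', 'tech'],
};

// Resolve every custom type listed in `extend` against the page stand-in.
function selectExtended(extend, matchesBySelector) {
  const results = {};
  for (const type of Reflect.ownKeys(extend)) {
    const { selectors = [], allowMultiple = false } = extend[type];
    // Use the first selector that matches anything at all.
    const hit = selectors
      .map(selector => matchesBySelector[selector] || [])
      .find(found => found.length > 0);
    if (!hit) {
      results[type] = null; // assumption: unmatched types resolve to null
    } else if (allowMultiple) {
      results[type] = [...hit]; // always an array, even for a single match
    } else {
      results[type] = hit[0]; // single value: first match only
    }
  }
  return results;
}

const extended = selectExtended(
  {
    last_edited: { selectors: ['#last-edited'] },
    tags: { selectors: ['.tag'], allowMultiple: true },
    missing: { selectors: ['#nope'] },
  },
  page
);
console.log(extended);
```

In this sketch `last_edited` resolves to a single string, `tags` to an array of both matches, and `missing` to `null`.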
README.md (+12 -1)
@@ -67,7 +67,9 @@ If Mercury is unable to find a field, that field will return `null`.
By default, Mercury Parser returns the `content` field as HTML. However, you can override this behavior by passing in options to the `parse` function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are `'html'`, `'markdown'`, and `'text'`). For example: