Commit b3e2a0f

droob authored and adampash committed
feat: extract custom types with extend option (#313)
* feat: extract custom types with extend option

  Adds an `extend` option that lets you add custom types to be extracted and returned alongside the defaults, either in a call to `parse()` or in a custom extractor.

  ```
  Mercury.parse(url, {
    extend: {
      last_edited: { selectors: ['#last-edited'], defaultCleaner: false },
    },
  });
  ```

* chore: use Reflect.ownKeys
* feat: add CLI options
* doc: add extend param to cli help
* refactor: extract selectExtendedTypes
* feat: only overwrite null extended results
* feat: add allowMultiple extraction option
* feat: accept extendList CLI args
* feat: allow attribute selectors in extends on CLI
* test: update extend tests
* fix: don't invoke cleaner for custom types
* feat: always return array if allowMultiple
* test: add test for array of single result
* refactor: extract extractHtml
* refactor: destructure allowMultiple
* fix: wrap multiple matches in $ for cheerio shim
* fix: find extended types before any other munging
* feat: absolutize all links
* fix: clean content more directly
* doc: Update CLI docs in README
* chore: update dist
* doc: Document extend in custom extractor README
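The commit message describes three pieces of the `extend` option: plain selectors, `allowMultiple`, and attribute selectors. A minimal sketch of how they might be combined in one options object follows. The `categories` and `links` entries reuse selectors from the README's CLI examples, `parseWithCustomTypes` is a hypothetical wrapper (not part of the commit), and the package name in the comment is an assumption.

```javascript
// Options object for Mercury.parse(). Only `last_edited` comes from the
// commit message; `categories` and `links` are illustrative.
const options = {
  extend: {
    // Custom type from the commit message, with the default cleaner disabled.
    last_edited: { selectors: ['#last-edited'], defaultCleaner: false },
    // With allowMultiple, the result for this type is always an array.
    categories: { selectors: ['.meta__tags-list a'], allowMultiple: true },
    // A [selector, attribute] pair asks for an attribute value instead.
    links: { selectors: [['.body a', 'href']], allowMultiple: true },
  },
};

// Hypothetical wrapper: `parser` is anything exposing parse(url, options),
// e.g. require('@postlight/mercury-parser') (package name assumed).
async function parseWithCustomTypes(parser, url) {
  const result = await parser.parse(url, options);
  // Custom types are returned alongside the default fields.
  const { last_edited, categories, links } = result;
  return { last_edited, categories, links };
}
```

Passing the parser in as an argument keeps the sketch runnable without network access; in real use the wrapper would simply be called with the Mercury module and a URL.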
1 parent 136d6df commit b3e2a0f

9 files changed, +549 -101 lines changed

Diff for: README.md

+12 -1
@@ -67,7 +67,9 @@ If Mercury is unable to find a field, that field will return `null`.
 By default, Mercury Parser returns the `content` field as HTML. However, you can override this behavior by passing in options to the `parse` function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are `'html'`, `'markdown'`, and `'text'`). For example:
 
 ```javascript
-Mercury.parse(url, { contentType: 'markdown' }).then(result => console.log(result));
+Mercury.parse(url, { contentType: 'markdown' }).then(result =>
+  console.log(result)
+);
 ```
 
 This returns the page's `content` as GitHub-flavored Markdown:
@@ -94,6 +96,15 @@ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source
 
 # Pass optional --format argument to set content type (html|markdown|text)
 mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown
+
+# Pass optional --extend argument to add a custom type to the response
+mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"
+
+# Pass optional --extend-list argument to add a custom type with multiple matches
+mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"
+
+# Get the value of attributes by adding a pipe to --extend or --extend-list
+mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"
 ```
 
 ## License

Diff for: cli.js

+24 -3
@@ -8,16 +8,20 @@ const {
   _: [url],
   format,
   f,
+  extend,
+  e,
+  extendList,
+  l,
 } = argv;
-(async (urlToParse, contentType) => {
+(async (urlToParse, contentType, extendedTypes, extendedListTypes) => {
   if (!urlToParse) {
     console.log(
       '\n\
 mercury-parser\n\n\
 The Mercury Parser extracts semantic content from any url\n\n\
 Usage:\n\
 \n\
-$ mercury-parser url-to-parse [--format=html|text|markdown]\n\
+$ mercury-parser url-to-parse [--format=html|text|markdown] [--extend type=selector]... [--extend-list type=selector]... \n\
 \n\
 '
     );
@@ -31,8 +35,25 @@ Usage:\n\
       text: 'text',
       txt: 'text',
     };
+    const extensions = {};
+    [].concat(extendedTypes || []).forEach(t => {
+      const [name, selector] = t.split('=');
+      const fullSelector =
+        selector.indexOf('|') > 0 ? selector.split('|') : selector;
+      extensions[name] = { selectors: [fullSelector] };
+    });
+    [].concat(extendedListTypes || []).forEach(t => {
+      const [name, selector] = t.split('=');
+      const fullSelector =
+        selector.indexOf('|') > 0 ? selector.split('|') : selector;
+      extensions[name] = {
+        selectors: [fullSelector],
+        allowMultiple: true,
+      };
+    });
     const result = await Mercury.parse(urlToParse, {
       contentType: contentTypeMap[contentType],
+      extend: extensions,
     });
     console.log(JSON.stringify(result, null, 2));
   } catch (e) {
@@ -51,4 +72,4 @@ Usage:\n\
     console.error(`\n${reportBug}\n`);
     process.exit(1);
   }
-})(url, format || f);
+})(url, format || f, extend || e, extendList || l);
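
To make the mapping from CLI arguments to the `extend` option easier to follow, here is a rough standalone sketch of the two parsing loops added to cli.js. The helper name `buildExtensions` is hypothetical and does not appear in the commit. The inputs mirror the README examples, on the assumption that the argument parser hands over a string for a flag given once and an array for a repeated flag.

```javascript
// Hypothetical helper that folds the two forEach loops from cli.js into one
// function; `buildExtensions` is not a name used by the commit.
function buildExtensions(extendedTypes, extendedListTypes) {
  const extensions = {};
  const add = (args, allowMultiple) => {
    // [].concat normalizes undefined, a single string, or an array of strings.
    [].concat(args || []).forEach(t => {
      const [name, selector] = t.split('=');
      // "selector|attr" becomes a [selector, attr] pair.
      const fullSelector =
        selector.indexOf('|') > 0 ? selector.split('|') : selector;
      extensions[name] = allowMultiple
        ? { selectors: [fullSelector], allowMultiple: true }
        : { selectors: [fullSelector] };
    });
  };
  add(extendedTypes, false); // --extend
  add(extendedListTypes, true); // --extend-list
  return extensions;
}

const extensions = buildExtensions('credit=p:last-child em', [
  'categories=.meta__tags-list a',
  'links=.body a|href',
]);
console.log(JSON.stringify(extensions, null, 2));
```

One behavior worth noting: because `t.split('=')` keeps only the first two pieces, a selector that itself contains `=` (for example an attribute selector such as `a[rel=author]`) would be cut off at that second `=`.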
