Showing 10 changed files with 749 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
|
||
# Comparison | ||
|
||
To give you an idea of how accurate **Metascraper** is, here is a comparison of similar libraries: | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **95.54%** | **74.56%** | **61.16%** | **66.52%** | **70.90%** | | ||
| Incorrect | 1.79% | 1.79% | 0.89% | 6.70% | 10.27% | | ||
| Missed | 2.68% | 23.67% | 37.95% | 26.34% | 8.95% | | ||
|
||
A big part of the reason for **Metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly used, spec-compliant sources, like Open Graph. Note, however, that **Metascraper** specifically targets article information, so it can be tuned more closely to that purpose than the more general libraries. | ||
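
The fallback idea can be sketched as an ordered list of rules, where the first rule that yields a usable value wins. This is only an illustration: `titleRules`, `firstMatch`, and the shape of the `meta` object are hypothetical and are not taken from Metascraper's source.

```javascript
// Hypothetical rule list for `title`: each rule looks at one source and
// returns a string or undefined. Names and order are illustrative only.
const titleRules = [
  meta => meta['og:title'],
  meta => meta['twitter:title'],
  meta => meta['title']
]

// Return the first non-empty, trimmed value produced by any rule,
// or null when every rule comes up empty.
function firstMatch (rules, meta) {
  for (const rule of rules) {
    const value = rule(meta)
    if (typeof value === 'string' && value.trim() !== '') return value.trim()
  }
  return null
}

// A page without Open Graph tags still yields a title via the last rule.
const meta = { title: ' A better timer for JavaScript ' }
console.log(firstMatch(titleRules, meta))
```

A library that only checks the first source would report such a page as "Missed", which is the gap the fallbacks are meant to close.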
|
||
_Note: this comparison was run against [32 sites](/support/comparison/urls.js) and last updated on June 1, 2016. If you're interested, you can check out the [full results for each library](/support/comparison/results)._ | ||
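
As a rough sketch of how such percentages could be derived, each scraped value can be classified as correct, incorrect, or missed, and the counts divided by the total. The functions below are hypothetical, and the counts in the usage line (214, 4, 6 out of 32 sites × 7 fields = 224 values) are inferred from the published `metascraper` percentages rather than taken from the source.

```javascript
// Classify one scraped value against the list of acceptable answers.
// A null or empty value counts as "missed" (an assumption of this sketch).
function classify (actual, accepted) {
  if (actual == null || actual === '') return 'missed'
  return accepted.includes(actual) ? 'correct' : 'incorrect'
}

// Turn per-outcome counts into percentages with two decimals.
function toPercentages (counts) {
  const total = counts.correct + counts.incorrect + counts.missed
  const pct = n => ((n / total) * 100).toFixed(2) + '%'
  return {
    correct: pct(counts.correct),
    incorrect: pct(counts.incorrect),
    missed: pct(counts.missed)
  }
}

// Inferred counts that reproduce the overall `metascraper` column.
console.log(toPercentages({ correct: 214, incorrect: 4, missed: 6 }))
// → { correct: '95.54%', incorrect: '1.79%', missed: '2.68%' }
```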
|
||
And here is the accuracy of each individual piece of metadata: | ||
|
||
###### `author` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **87.50%** | **31.25%** | **31.25%** | **0.00%** | **34.38%** | | ||
| Incorrect | 9.38% | 3.13% | 3.13% | 31.25% | 50.00% | | ||
| Missed | 3.13% | 65.63% | 65.63% | 68.75% | 15.63% | | ||
|
||
_An `author` is incorrect if it's not in the format of `First Last`, or has extra junk information in the string._ | ||
|
||
###### `date` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **87.50%** | **21.88%** | **0.00%** | **0.00%** | **59.38%** | | ||
| Incorrect | 0.00% | 3.13% | 0.00% | 0.00% | 18.75% | | ||
| Missed | 12.50% | 75.00% | 100.00% | 100.00% | 15.63% | | ||
|
||
_A `date` is correct if it's the correct date, regardless of time. A `date` is incorrect if it's not in the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format._ | ||
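
One way to read the `date` rule is sketched below. `classifyDate` and the `ISO_8601` pattern are hypothetical helpers: the pattern only accepts full date-time strings like those in the results files, and "regardless of time" is interpreted here as comparing the UTC calendar day.

```javascript
// Simplified ISO 8601 date-time pattern (assumption: date-only strings
// and other valid ISO 8601 variants are not handled by this sketch).
const ISO_8601 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/

function classifyDate (actual, expected) {
  if (actual == null) return 'missed'
  if (!ISO_8601.test(actual)) return 'incorrect'

  const a = new Date(actual)
  const e = new Date(expected)
  if (Number.isNaN(a.getTime())) return 'incorrect'

  // Compare only the UTC calendar day, ignoring the time of day.
  const sameDay = a.toISOString().slice(0, 10) === e.toISOString().slice(0, 10)
  return sameDay ? 'correct' : 'incorrect'
}

// Two timestamps for the same article, written with different offsets.
console.log(classifyDate('2016-01-20T23:50:25+02:00', '2016-01-20T21:50:25.000Z'))
```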
|
||
###### `description` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **96.88%** | **90.63%** | **96.88%** | **93.75%** | **90.63%** | | ||
| Incorrect | 3.13% | 3.13% | 3.13% | 3.13% | 3.13% | | ||
| Missed | 0.00% | 6.25% | 0.00% | 3.13% | 6.25% | | ||
|
||
_A `description` is correct if it's either the description the publisher chose, or the first paragraph of the article._ | ||
|
||
###### `image` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **100.00%** | **100.00%** | **100.00%** | **100.00%** | **100.00%** | | ||
| Incorrect | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | | ||
| Missed | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | | ||
|
||
_An `image` is correct if it's either the image the publisher chose, or the first image on the page._ | ||
|
||
###### `publisher` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **96.88%** | **81.25%** | **0.00%** | **78.13%** | **81.25%** | | ||
| Incorrect | 0.00% | 0.00% | 0.00% | 12.50% | 0.00% | | ||
| Missed | 3.13% | 18.75% | 100.00% | 6.25% | 18.75% | | ||
|
||
_A `publisher` is correct if it's the publisher's proper name, or the publisher's domain name._ | ||
|
||
###### `title` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **100.00%** | **100.00%** | **100.00%** | **100.00%** | **100.00%** | | ||
| Incorrect | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | | ||
| Missed | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | | ||
|
||
_A `title` is correct if it's the title of the article, or the title of the page._ | ||
|
||
###### `url` | ||
|
||
| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` | | ||
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- | | ||
| Correct | **100.00%** | **96.88%** | **100.00%** | **93.75%** | **93.75%** | | ||
| Incorrect | 0.00% | 3.13% | 0.00% | 0.00% | 0.00% | | ||
| Missed | 0.00% | 0.00% | 0.00% | 6.25% | 6.25% | | ||
|
||
_A `url` is correct if it resolves back to the original article._ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
'use strict' | ||
|
||
const mkdirp = require('mkdirp') | ||
const rimraf = require('rimraf') | ||
const got = require('got') | ||
const fs = require('fs') | ||
const path = require('path') | ||
|
||
const SCRAPERS = require('./scrapers') | ||
const URLS = require('./urls') | ||
|
||
/** | ||
* Run. | ||
*/ | ||
|
||
console.log('Fetching the results for each scraper...') | ||
|
||
getResults() | ||
.then(results => { | ||
console.log('Removing the old results...') | ||
const dir = path.resolve(__dirname, 'results') | ||
rimraf.sync(dir) | ||
// Recreate the directory once, before writing any result files. | ||
mkdirp.sync(dir) | ||
|
||
results.forEach(result => { | ||
const file = path.resolve(dir, `${result.name}.json`) | ||
const string = JSON.stringify(result.results, null, 2) | ||
fs.writeFileSync(file, string) | ||
}) | ||
|
||
console.log('Success! The results have been compiled for each scraper.') | ||
}) | ||
.catch(err => { | ||
// Report on stderr and signal failure to the calling shell. | ||
console.error('An error occurred:') | ||
console.error() | ||
console.error(err.stack || err) | ||
process.exitCode = 1 | ||
}) | ||
|
||
/** | ||
* Get the metadata results. | ||
* | ||
* @return {Promise} results | ||
*/ | ||
|
||
function getResults () { | ||
// Fetch every page once, then hand the same HTML to each scraper. | ||
return getHtmls(URLS).then(htmls => getScrapersResults(SCRAPERS, URLS, htmls)) | ||
} | ||
|
||
/** | ||
* Get the metadata results from all `SCRAPERS` and `urls` and `htmls`. | ||
* | ||
* @param {Array} SCRAPERS | ||
* @param {Array} urls | ||
* @param {Array} htmls | ||
* @return {Promise} results | ||
*/ | ||
|
||
function getScrapersResults (SCRAPERS, urls, htmls) { | ||
return Promise.all( | ||
SCRAPERS.map(SCRAPER => { | ||
return getScraperResults(SCRAPER, urls, htmls).then(results => { | ||
return { | ||
name: SCRAPER.name, | ||
results: results | ||
} | ||
}) | ||
}) | ||
) | ||
} | ||
|
||
/** | ||
* Get metadata results from a single `SCRAPER` and `urls` and `htmls`. | ||
* | ||
* @param {Object} SCRAPER | ||
* @param {Array} urls | ||
* @param {Array} htmls | ||
* @return {Promise} results | ||
*/ | ||
|
||
function getScraperResults (SCRAPER, urls, htmls) { | ||
return Promise.all( | ||
urls.map((url, i) => { | ||
const html = htmls[i] | ||
const name = SCRAPER.name | ||
const Module = name === 'metascraper' ? require('..') : require(name) | ||
return SCRAPER.scrape(Module, url, html).then(metadata => { | ||
return SCRAPER.normalize(metadata) | ||
}) | ||
}) | ||
) | ||
} | ||
|
||
/** | ||
* Get html from a list of `urls`. | ||
* | ||
* @param {Array} urls | ||
* @return {Promise} htmls | ||
*/ | ||
|
||
function getHtmls (urls) { | ||
return Promise.all(urls.map(url => got(url).then(res => res.body))) | ||
} |
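
The runner above expects every entry in `SCRAPERS` to expose a `name`, a promise-returning `scrape(Module, url, html)`, and a `normalize(metadata)` function. The `./scrapers` file is not shown in this part of the diff, so the entry below is a hypothetical sketch: `exampleScraper`, `fakeModule`, and its `parse` method are invented for illustration, and `FIELDS` is taken from the headings in the comparison document.

```javascript
// The seven fields scored in the comparison document.
const FIELDS = ['author', 'date', 'description', 'image', 'publisher', 'title', 'url']

// Hypothetical entry in the `SCRAPERS` array.
const exampleScraper = {
  name: 'example-scraper',
  // Must return a promise, because the runner chains `.then()` on it.
  scrape: (Module, url, html) => Promise.resolve(Module.parse(html, url)),
  // Map the library's output onto the shared field list, using null for gaps.
  normalize: raw => {
    const out = {}
    for (const field of FIELDS) out[field] = raw[field] == null ? null : raw[field]
    return out
  }
}

// Stand-in for the library that `require(name)` would load in the runner.
const fakeModule = {
  parse: (html, url) => ({ title: 'Example', url, extra: 'ignored' })
}

exampleScraper
  .scrape(fakeModule, 'http://example.com/', '<html></html>')
  .then(raw => console.log(exampleScraper.normalize(raw)))
```

Normalizing every library to the same keys is what makes the per-field tables comparable; whether the real `normalize` functions drop extra keys, as this sketch does, is not confirmed by the source.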
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
[ | ||
{ | ||
"author": null, | ||
"date": "2016-01-12T22:34:00.000Z", | ||
"description": "HackerRank, a tech recruiting company, has launched HackerRank Jobs, to bridge the gap between the applicant and recruiter.", | ||
"image": "https://img.etimg.com/thumb/msid-50551900,width-672,resizemode-4,imgsize-26854/hackerrank-launches-job-search-platform-hackerrank-jobs.jpg", | ||
"publisher": "The Economic Times", | ||
"title": "HackerRank launches job search platform HackerRank Jobs", | ||
"url": "https://economictimes.indiatimes.com/jobs/hackerrank-launches-job-search-platform-hackerrank-jobs/articleshow/50551900.cms" | ||
}, | ||
{ | ||
"author": null, | ||
"date": null, | ||
"description": "HackerRank, a four-year-old startup, is changing the way companies find and evaluate programmers. ", | ||
"image": "https://fortunedotcom.files.wordpress.com/2015/09/gettyimages-1852881881.jpg", | ||
"publisher": "Fortune", | ||
"title": "Why your next job search may involve solving online puzzles", | ||
"url": "http://fortune.com/2015/10/05/hackerrank-recruiting-tool/" | ||
}, | ||
{ | ||
"author": null, | ||
"date": null, | ||
"description": "Browser performance guru, Nat Duca, has introduced high resolution timing to JavaScript. Ready to try in Chrome 20, the experimental window...", | ||
"image": "http://4.bp.blogspot.com/-f2MG3BXeZxY/T8sN8avF8MI/AAAAAAAAJxE/gJmx_aUEK-s/w1200-h630-p-k-no-nu/Screen%2BShot%2B2012-06-03%2Bat%2B12.09.18%2BAM.png", | ||
"publisher": null, | ||
"title": "A better timer for JavaScript", | ||
"url": "http://gent.ilcore.com/2012/06/better-timer-for-javascript.html" | ||
}, | ||
{ | ||
"author": null, | ||
"date": "2016-01-20T23:50:25+02:00", | ||
"description": "The Netanya base company is transforming how developers and DevOps team manage binary artifacts: JFrog’s total capital raised to date is $62 million.", | ||
"image": "http://jewishbusinessnews.com/wp-content/uploads/2016/01/Shlomi-Ben-Haim-JFrog-Co-Founder-and-CEO-Source-JFrog-e1453326404147.jpg", | ||
"publisher": "Jewish Business News", | ||
"title": "Israeli startup JFrog raises $50 million in C round - Jewish Business News", | ||
"url": "http://jewishbusinessnews.com/2016/01/20/israeli-startup-jfrog-raises-50-million-in-c-round/" | ||
}, | ||
{ | ||
"author": "Stacey Bishop", | ||
"date": null, | ||
"description": "Analytics open the door to predictive selling and ensure that your sales team is doing more than hoarding data that it doesn't know how to act on. ", | ||
"image": "https://i.amz.mshcdn.com/jw9czJj9h-4ClVBegbu982-Zmb8=/1200x627/2015%2F05%2F13%2Fc6%2Frevenue_ana.65b72.jpg", | ||
"publisher": "Mashable", | ||
"title": "The sales cycle on steroids: 3 ways analytics power up revenue", | ||
"url": "http://mashable.com/2015/05/13/analytics-power-up-revenue/" | ||
}, | ||
{ | ||
"author": "Frederic Lardinois", | ||
"date": null, | ||
"description": "Recruiting software engineers is a massive headache for both startups and established companies. For a while now, HackerRank has tried to make both applying..", | ||
"image": "https://tctechcrunch2011.files.wordpress.com/2015/08/10-interviewed.png", | ||
"publisher": "TechCrunch", | ||
"title": "HackerRank Makes Technical Recruiting More Transparent", | ||
"url": "http://social.techcrunch.com/2016/01/12/hackerrank-jobs-takes-the-mystery-out-of-technical-recruiting/" | ||
}, | ||
{ | ||
"author": null, | ||
"date": null, | ||
"description": "The HR startups go to war.", | ||
"image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v3/1200x800.jpg", | ||
"publisher": "Bloomberg.com", | ||
"title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance", | ||
"url": "https://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance" | ||
} | ||
] |
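
In a results file like the one above, a `null` value is a field the scraper did not return. A minimal sketch for counting those gaps per field follows; `countMissed` is a hypothetical helper and the `sample` array is made up for illustration, not copied from the results.

```javascript
// Count, for each field, how many entries have a null value.
function countMissed (results) {
  const counts = {}
  for (const entry of results) {
    for (const [field, value] of Object.entries(entry)) {
      counts[field] = (counts[field] || 0) + (value === null ? 1 : 0)
    }
  }
  return counts
}

// Made-up sample with the same shape as a results file.
const sample = [
  { author: null, date: '2016-01-12T22:34:00.000Z', title: 'A' },
  { author: 'Stacey Bishop', date: null, title: 'B' }
]

console.log(countMissed(sample))
// → { author: 1, date: 1, title: 0 }
```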
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
[ | ||
{ | ||
"author": "J Vignesh", | ||
"date": "2016-01-12T22:34:00.000Z", | ||
"description": "HackerRank, a tech recruiting company, has launched HackerRank Jobs, to bridge the gap between the applicant and recruiter.", | ||
"image": "https://img.etimg.com/thumb/msid-50551900,width-672,resizemode-4,imgsize-26854/hackerrank-launches-job-search-platform-hackerrank-jobs.jpg", | ||
"logo": "http://economictimes.indiatimes.com/icons/etfavicon.ico", | ||
"publisher": "The Economic Times", | ||
"title": "HackerRank launches job search platform HackerRank Jobs", | ||
"url": "https://economictimes.indiatimes.com/jobs/hackerrank-launches-job-search-platform-hackerrank-jobs/articleshow/50551900.cms" | ||
}, | ||
{ | ||
"author": "Kia Kokalitcheva", | ||
"date": "2015-10-04T22:00:00.000Z", | ||
"description": "HackerRank, a four-year-old startup, is changing the way companies find and evaluate programmers.", | ||
"image": "https://fortunedotcom.files.wordpress.com/2015/09/gettyimages-1852881881.jpg", | ||
"logo": "http://fortune.com/img/favicons/favicon-192.png", | ||
"publisher": "Fortune", | ||
"title": "Why your next job search may involve solving online puzzles", | ||
"url": "http://fortune.com/2015/10/05/hackerrank-recruiting-tool" | ||
}, | ||
{ | ||
"author": "The Nerdbirder", | ||
"date": "2012-06-13T10:00:00.000Z", | ||
"description": "Browser performance guru, Nat Duca, has introduced high resolution timing to JavaScript. Ready to try in Chrome 20, the experimental window...", | ||
"image": "http://4.bp.blogspot.com/-f2MG3BXeZxY/T8sN8avF8MI/AAAAAAAAJxE/gJmx_aUEK-s/w1200-h630-p-k-no-nu/Screen%2BShot%2B2012-06-03%2Bat%2B12.09.18%2BAM.png", | ||
"logo": "http://gent.ilcore.com/favicon.ico", | ||
"publisher": "Fastersite", | ||
"title": "A better timer for JavaScript", | ||
"url": "http://gent.ilcore.com/2012/06/better-timer-for-javascript.html" | ||
}, | ||
{ | ||
"author": "Jewish Business News Correspondent", | ||
"date": "2016-01-20T21:50:25.000Z", | ||
"description": "The Netanya base company is transforming how developers and DevOps team manage binary artifacts: JFrog’s total capital raised to date is $62 million.", | ||
"image": "http://jewishbusinessnews.com/wp-content/uploads/2016/01/Shlomi-Ben-Haim-JFrog-Co-Founder-and-CEO-Source-JFrog-e1453326404147.jpg", | ||
"logo": "http://2xkcvt35vyxycuy7x23e0em1a5g.wpengine.netdna-cdn.com/favicon-196x196.png", | ||
"publisher": "Jewish Business News", | ||
"title": "Israeli startup JFrog raises $50 million in C round - Jewish Business News", | ||
"url": "http://jewishbusinessnews.com/2016/01/20/israeli-startup-jfrog-raises-50-million-in-c-round" | ||
}, | ||
{ | ||
"author": "Stacey Bishop", | ||
"date": "2015-05-13T16:37:41.000Z", | ||
"description": "Analytics open the door to predictive selling and ensure that your sales team is doing more than hoarding data that it doesn’t know how to act on.", | ||
"image": "https://i.amz.mshcdn.com/jw9czJj9h-4ClVBegbu982-Zmb8=/1200x627/2015%2F05%2F13%2Fc6%2Frevenue_ana.65b72.jpg", | ||
"logo": "https://mashable.com/android-chrome-192x192.png?v=m2Pmw8zNwl", | ||
"publisher": "Mashable", | ||
"title": "The sales cycle on steroids: 3 ways analytics power up revenue", | ||
"url": "http://mashable.com/2015/05/13/analytics-power-up-revenue" | ||
}, | ||
{ | ||
"author": "Frederic Lardinois", | ||
"date": "2016-01-12T08:00:44.000Z", | ||
"description": "Recruiting software engineers is a massive headache for both startups and established companies. For a while now, HackerRank has tried to make both applying..", | ||
"image": "https://tctechcrunch2011.files.wordpress.com/2015/08/10-interviewed.png", | ||
"logo": "https://s0.wp.com/wp-content/themes/vip/techcrunch-2013/assets/images/homescreen_TCIcon_ipad_2x.png", | ||
"publisher": "TechCrunch", | ||
"title": "HackerRank Makes Technical Recruiting More Transparent", | ||
"url": "http://social.techcrunch.com/2016/01/12/hackerrank-jobs-takes-the-mystery-out-of-technical-recruiting" | ||
}, | ||
{ | ||
"author": "Ellen Huet", | ||
"date": "2016-05-24T18:00:03.894Z", | ||
"description": "The HR startups go to war.", | ||
"image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v3/1200x800.jpg", | ||
"logo": "https://assets.bwbx.io/s3/javelin/public/javelin/images/favicon-technology-c079867d2c.png", | ||
"publisher": "Bloomberg.com", | ||
"title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance", | ||
"url": "https://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance" | ||
} | ||
] |