Add bench
Kikobeats committed Dec 10, 2017
1 parent e2138bd commit 57ddb65
Showing 10 changed files with 749 additions and 1 deletion.
87 changes: 87 additions & 0 deletions bench/README.md
@@ -0,0 +1,87 @@

# Comparison

To give you an idea of how accurate **Metascraper** is, here is a comparison with similar libraries:

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **95.54%** | **74.56%** | **61.16%** | **66.52%** | **70.90%** |
| Incorrect | 1.79% | 1.79% | 0.89% | 6.70% | 10.27% |
| Missed | 2.68% | 23.67% | 37.95% | 26.34% | 8.95% |

A big part of the reason for **Metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly used, spec-compliant sources, like Open Graph. **Metascraper** is also specifically targeted at parsing article information, which is why it can be tuned more tightly for that purpose than the other libraries.
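
The fallback idea can be sketched roughly as follows. This is an illustration only, not Metascraper's actual rule set: the `titleRules` list, the `firstMatch` helper, and the flat `tags` object are all hypothetical names chosen to keep the example dependency-free.

```javascript
'use strict'

// Hypothetical rule list for `title`. Each rule reads a plain object of
// already-extracted tags (an assumption made for this sketch).
const titleRules = [
  tags => tags['og:title'],
  tags => tags['twitter:title'],
  tags => tags.title
]

// Return the first non-empty string produced by the rules, trimmed,
// or null when no rule yields a usable value.
function firstMatch (rules, tags) {
  for (const rule of rules) {
    const value = rule(tags)
    if (typeof value === 'string' && value.trim() !== '') return value.trim()
  }
  return null
}

// Only the plain <title> value is present here, so the last rule wins.
console.log(firstMatch(titleRules, { title: ' A better timer for JavaScript ' }))
```

A library that only checks the first rule would report this page as a miss, whereas the chain still produces a title.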

_Note: this comparison was run against [32 sites](/support/comparison/urls.js) and last updated on June 1, 2016. If you're interested, you can check out the [full results for each library](/support/comparison/results)._
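
For reference, percentages like those in the summary table can be derived from per-field judgements as in the sketch below. The `tally` helper is hypothetical, and the counts (214 correct, 4 incorrect, 6 missed out of 32 sites × 7 fields = 224) are an assumption picked because they reproduce the `metascraper` column; the raw counts are not stated in this document.

```javascript
'use strict'

// Each judgement is 'correct', 'incorrect' or 'missed': one per site per field.
function tally (judgements) {
  const counts = { correct: 0, incorrect: 0, missed: 0 }
  judgements.forEach(judgement => {
    counts[judgement] += 1
  })
  // Express a count as a percentage of all judgements, with two decimals.
  const pct = n => ((n / judgements.length) * 100).toFixed(2) + '%'
  return {
    correct: pct(counts.correct),
    incorrect: pct(counts.incorrect),
    missed: pct(counts.missed)
  }
}

// Assumed counts: 32 sites x 7 fields = 224 judgements in total.
const judgements = [].concat(
  Array(214).fill('correct'),
  Array(4).fill('incorrect'),
  Array(6).fill('missed')
)

console.log(tally(judgements))
```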

And here is the accuracy of each individual piece of metadata:

###### `author`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **87.50%** | **31.25%** | **31.25%** | **0.00%** | **34.38%** |
| Incorrect | 9.38% | 3.13% | 3.13% | 31.25% | 50.00% |
| Missed | 3.13% | 65.63% | 65.63% | 68.75% | 15.63% |

_An `author` is incorrect if it's not in the format of `First Last`, or has extra junk information in the string._

###### `date`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **87.50%** | **21.86%** | **0.00%** | **0.00%** | **59.38%** |
| Incorrect | 0.00% | 3.13% | 0.00% | 0.00% | 18.75% |
| Missed | 12.50% | 75.00% | 100.00% | 100.00% | 15.63% |

_A `date` is correct if it's the correct date, regardless of time. A `date` is incorrect if it's not in the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format._
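
One way to apply this rule mechanically is sketched below. The `judgeDate` helper and the `ISO_8601` pattern are hypothetical; the pattern covers only the full date-time subset of ISO 8601, and the day comparison uses the date as written, without converting time zones.

```javascript
'use strict'

// Matches a subset of ISO 8601: full date and time with `Z` or a numeric offset.
const ISO_8601 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/

// `actual` is the scraped value; `expected` is a known-good `YYYY-MM-DD` day.
function judgeDate (actual, expected) {
  if (actual == null) return 'missed'
  if (!ISO_8601.test(actual)) return 'incorrect'
  // Ignore the time of day: compare only the calendar-day prefix.
  return actual.slice(0, 10) === expected.slice(0, 10) ? 'correct' : 'incorrect'
}

console.log(judgeDate('2016-01-12T22:34:00.000Z', '2016-01-12'))
console.log(judgeDate('Jan 12, 2016', '2016-01-12'))
console.log(judgeDate(null, '2016-01-12'))
```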

###### `description`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **96.88%** | **90.63%** | **96.88%** | **93.75%** | **90.63%** |
| Incorrect | 3.13% | 3.13% | 3.13% | 3.13% | 3.13% |
| Missed | 0.00% | 6.25% | 0.00% | 3.13% | 6.25% |

_A `description` is correct if it's either the description the publisher chose, or the first paragraph of the article._

###### `image`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **100.00%** | **100.00%** | **100.00%** | **100.00%** | **100.00%** |
| Incorrect | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Missed | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |

_An `image` is correct if it's either the image the publisher chose, or the first image on the page._

###### `publisher`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **96.88%** | **81.25%** | **0.00%** | **78.13%** | **81.25%** |
| Incorrect | 0.00% | 0.00% | 0.00% | 12.50% | 0.00% |
| Missed | 3.13% | 18.75% | 100.00% | 6.25% | 18.75% |

_A `publisher` is correct if it's the publisher's proper name, or the publisher's domain name._

###### `title`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **100.00%** | **100.00%** | **100.00%** | **100.00%** | **100.00%** |
| Incorrect | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Missed | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |

_A `title` is correct if it's the title of the article, or the title of the page._

###### `url`

| Library | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff` |
| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
| Correct | **100.00%** | **96.88%** | **100.00%** | **93.75%** | **93.75%** |
| Incorrect | 0.00% | 3.13% | 0.00% | 0.00% | 0.00% |
| Missed | 0.00% | 0.00% | 0.00% | 6.25% | 6.25% |

_A `url` is correct if it resolves back to the original article._
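
Checking that a URL truly resolves back to the article requires a network request. As a cheaper first pass, two URLs can be compared after light normalization, as in this sketch. The `looksLikeSameArticle` helper is hypothetical, and ignoring the scheme, a leading `www.`, trailing slashes, and the query string is an assumption that will not hold for every site.

```javascript
'use strict'

// Reduce a URL to host + path, dropping `www.` and any trailing slashes.
function normalize (input) {
  const { host, pathname } = new URL(input)
  return host.replace(/^www\./, '') + pathname.replace(/\/+$/, '')
}

// True when both URLs normalize to the same host and path.
function looksLikeSameArticle (a, b) {
  return normalize(a) === normalize(b)
}

console.log(
  looksLikeSameArticle(
    'http://fortune.com/2015/10/05/hackerrank-recruiting-tool/',
    'http://fortune.com/2015/10/05/hackerrank-recruiting-tool'
  )
)
```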

106 changes: 106 additions & 0 deletions bench/index.js
@@ -0,0 +1,106 @@
'use strict'

const mkdirp = require('mkdirp')
const rimraf = require('rimraf')
const got = require('got')
const fs = require('fs')
const path = require('path')

const SCRAPERS = require('./scrapers')
const URLS = require('./urls')

/**
 * Run.
 */

console.log('Fetching the results for each scraper...')

getResults()
  .then(results => {
    console.log('Removing the old results...')
    const dir = path.resolve(__dirname, 'results')
    rimraf.sync(dir)
    // Recreate the output directory once, before writing any file.
    mkdirp.sync(dir)

    results.forEach(result => {
      const file = path.resolve(dir, `${result.name}.json`)
      const string = JSON.stringify(result.results, null, 2)
      fs.writeFileSync(file, string)
    })

    console.log('Success! The results have been compiled for each scraper.')
  })
  .catch(err => {
    console.error('An error occurred:')
    console.error()
    console.error(err)
    console.error()
    console.error(err.stack)
    // Signal failure to the calling shell instead of exiting with status 0.
    process.exitCode = 1
  })

/**
 * Get the metadata results.
 *
 * @return {Promise} results
 */

function getResults () {
  // Fetch every page once, then hand the same HTML to each scraper.
  return getHtmls(URLS).then(htmls => getScrapersResults(SCRAPERS, URLS, htmls))
}

/**
 * Get the metadata results from all `SCRAPERS` and `urls` and `htmls`.
 *
 * @param {Array} SCRAPERS
 * @param {Array} urls
 * @param {Array} htmls
 * @return {Promise} results
 */

function getScrapersResults (SCRAPERS, urls, htmls) {
  return Promise.all(
    SCRAPERS.map(SCRAPER => {
      return getScraperResults(SCRAPER, urls, htmls).then(results => {
        return {
          name: SCRAPER.name,
          results: results
        }
      })
    })
  )
}

/**
 * Get metadata results from a single `SCRAPER` and `urls` and `htmls`.
 *
 * @param {Object} SCRAPER
 * @param {Array} urls
 * @param {Array} htmls
 * @return {Promise} results
 */

function getScraperResults (SCRAPER, urls, htmls) {
  // Resolve the library once per scraper; `metascraper` is the local package.
  const name = SCRAPER.name
  const Module = name === 'metascraper' ? require('..') : require(name)

  return Promise.all(
    urls.map((url, i) => {
      // `htmls` is index-aligned with `urls`.
      const html = htmls[i]
      return SCRAPER.scrape(Module, url, html).then(metadata => {
        return SCRAPER.normalize(metadata)
      })
    })
  )
}

/**
 * Get html from a list of `urls`.
 *
 * @param {Array} urls
 * @return {Promise} htmls
 */

function getHtmls (urls) {
  return Promise.all(urls.map(url => got(url).then(res => res.body)))
}
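
`bench/index.js` expects each entry of `./scrapers` to expose `name`, `scrape(Module, url, html)` and `normalize(metadata)`, but that module is not shown in this part of the diff. A minimal sketch of what such an entry might look like follows. Everything in it is hypothetical: the `exampleScraper` object, the assumption that the required module is a function taking `{ url, html }`, and the `fakeModule` stand-in used to exercise it.

```javascript
'use strict'

// The seven fields compared in bench/README.md.
const FIELDS = ['author', 'date', 'description', 'image', 'publisher', 'title', 'url']

// Hypothetical scraper entry; the real `bench/scrapers` module may differ.
const exampleScraper = {
  name: 'example-scraper',
  // `Module` is whatever `require(name)` returned. Here it is assumed to be a
  // function that takes `{ url, html }` and returns a promise of raw metadata.
  scrape: (Module, url, html) => Module({ url, html }),
  // Map raw output onto the compared fields, using null for anything missing.
  normalize: raw => {
    const out = {}
    FIELDS.forEach(key => {
      out[key] = raw[key] == null ? null : raw[key]
    })
    return out
  }
}

// Stand-in for a scraping library, used only to run the sketch offline.
const fakeModule = ({ url }) => Promise.resolve({ title: 'Example', url })

exampleScraper
  .scrape(fakeModule, 'http://example.com/', '<html></html>')
  .then(raw => console.log(exampleScraper.normalize(raw)))
```

Normalizing every library's output to the same keys, with `null` for misses, is what makes the per-scraper result files directly comparable.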
65 changes: 65 additions & 0 deletions bench/results/html-metadata.json
@@ -0,0 +1,65 @@
[
{
"author": null,
"date": "2016-01-12T22:34:00.000Z",
"description": "HackerRank, a tech recruiting company, has launched HackerRank Jobs, to bridge the gap between the applicant and recruiter.",
"image": "https://img.etimg.com/thumb/msid-50551900,width-672,resizemode-4,imgsize-26854/hackerrank-launches-job-search-platform-hackerrank-jobs.jpg",
"publisher": "The Economic Times",
"title": "HackerRank launches job search platform HackerRank Jobs",
"url": "https://economictimes.indiatimes.com/jobs/hackerrank-launches-job-search-platform-hackerrank-jobs/articleshow/50551900.cms"
},
{
"author": null,
"date": null,
"description": "HackerRank, a four-year-old startup, is changing the way companies find and evaluate programmers. ",
"image": "https://fortunedotcom.files.wordpress.com/2015/09/gettyimages-1852881881.jpg",
"publisher": "Fortune",
"title": "Why your next job search may involve solving online puzzles",
"url": "http://fortune.com/2015/10/05/hackerrank-recruiting-tool/"
},
{
"author": null,
"date": null,
"description": "Browser performance guru, Nat Duca, has introduced high resolution timing to JavaScript. Ready to try in Chrome 20, the experimental window...",
"image": "http://4.bp.blogspot.com/-f2MG3BXeZxY/T8sN8avF8MI/AAAAAAAAJxE/gJmx_aUEK-s/w1200-h630-p-k-no-nu/Screen%2BShot%2B2012-06-03%2Bat%2B12.09.18%2BAM.png",
"publisher": null,
"title": "A better timer for JavaScript",
"url": "http://gent.ilcore.com/2012/06/better-timer-for-javascript.html"
},
{
"author": null,
"date": "2016-01-20T23:50:25+02:00",
"description": "The Netanya base company is transforming how developers and DevOps team manage binary artifacts: JFrog’s total capital raised to date is $62 million.",
"image": "http://jewishbusinessnews.com/wp-content/uploads/2016/01/Shlomi-Ben-Haim-JFrog-Co-Founder-and-CEO-Source-JFrog-e1453326404147.jpg",
"publisher": "Jewish Business News",
"title": "Israeli startup JFrog raises $50 million in C round - Jewish Business News",
"url": "http://jewishbusinessnews.com/2016/01/20/israeli-startup-jfrog-raises-50-million-in-c-round/"
},
{
"author": "Stacey Bishop",
"date": null,
"description": "Analytics open the door to predictive selling and ensure that your sales team is doing more than hoarding data that it doesn't know how to act on. ",
"image": "https://i.amz.mshcdn.com/jw9czJj9h-4ClVBegbu982-Zmb8=/1200x627/2015%2F05%2F13%2Fc6%2Frevenue_ana.65b72.jpg",
"publisher": "Mashable",
"title": "The sales cycle on steroids: 3 ways analytics power up revenue",
"url": "http://mashable.com/2015/05/13/analytics-power-up-revenue/"
},
{
"author": "Frederic Lardinois",
"date": null,
"description": "Recruiting software engineers is a massive headache for both startups and established companies. For a while now, HackerRank has tried to make both applying..",
"image": "https://tctechcrunch2011.files.wordpress.com/2015/08/10-interviewed.png",
"publisher": "TechCrunch",
"title": "HackerRank Makes Technical Recruiting More Transparent",
"url": "http://social.techcrunch.com/2016/01/12/hackerrank-jobs-takes-the-mystery-out-of-technical-recruiting/"
},
{
"author": null,
"date": null,
"description": "The HR startups go to war.",
"image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v3/1200x800.jpg",
"publisher": "Bloomberg.com",
"title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
"url": "https://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}
]
72 changes: 72 additions & 0 deletions bench/results/metascraper.json
@@ -0,0 +1,72 @@
[
{
"author": "J Vignesh",
"date": "2016-01-12T22:34:00.000Z",
"description": "HackerRank, a tech recruiting company, has launched HackerRank Jobs, to bridge the gap between the applicant and recruiter.",
"image": "https://img.etimg.com/thumb/msid-50551900,width-672,resizemode-4,imgsize-26854/hackerrank-launches-job-search-platform-hackerrank-jobs.jpg",
"logo": "http://economictimes.indiatimes.com/icons/etfavicon.ico",
"publisher": "The Economic Times",
"title": "HackerRank launches job search platform HackerRank Jobs",
"url": "https://economictimes.indiatimes.com/jobs/hackerrank-launches-job-search-platform-hackerrank-jobs/articleshow/50551900.cms"
},
{
"author": "Kia Kokalitcheva",
"date": "2015-10-04T22:00:00.000Z",
"description": "HackerRank, a four-year-old startup, is changing the way companies find and evaluate programmers.",
"image": "https://fortunedotcom.files.wordpress.com/2015/09/gettyimages-1852881881.jpg",
"logo": "http://fortune.com/img/favicons/favicon-192.png",
"publisher": "Fortune",
"title": "Why your next job search may involve solving online puzzles",
"url": "http://fortune.com/2015/10/05/hackerrank-recruiting-tool"
},
{
"author": "The Nerdbirder",
"date": "2012-06-13T10:00:00.000Z",
"description": "Browser performance guru, Nat Duca, has introduced high resolution timing to JavaScript. Ready to try in Chrome 20, the experimental window...",
"image": "http://4.bp.blogspot.com/-f2MG3BXeZxY/T8sN8avF8MI/AAAAAAAAJxE/gJmx_aUEK-s/w1200-h630-p-k-no-nu/Screen%2BShot%2B2012-06-03%2Bat%2B12.09.18%2BAM.png",
"logo": "http://gent.ilcore.com/favicon.ico",
"publisher": "Fastersite",
"title": "A better timer for JavaScript",
"url": "http://gent.ilcore.com/2012/06/better-timer-for-javascript.html"
},
{
"author": "Jewish Business News Correspondent",
"date": "2016-01-20T21:50:25.000Z",
"description": "The Netanya base company is transforming how developers and DevOps team manage binary artifacts: JFrog’s total capital raised to date is $62 million.",
"image": "http://jewishbusinessnews.com/wp-content/uploads/2016/01/Shlomi-Ben-Haim-JFrog-Co-Founder-and-CEO-Source-JFrog-e1453326404147.jpg",
"logo": "http://2xkcvt35vyxycuy7x23e0em1a5g.wpengine.netdna-cdn.com/favicon-196x196.png",
"publisher": "Jewish Business News",
"title": "Israeli startup JFrog raises $50 million in C round - Jewish Business News",
"url": "http://jewishbusinessnews.com/2016/01/20/israeli-startup-jfrog-raises-50-million-in-c-round"
},
{
"author": "Stacey Bishop",
"date": "2015-05-13T16:37:41.000Z",
"description": "Analytics open the door to predictive selling and ensure that your sales team is doing more than hoarding data that it doesn’t know how to act on.",
"image": "https://i.amz.mshcdn.com/jw9czJj9h-4ClVBegbu982-Zmb8=/1200x627/2015%2F05%2F13%2Fc6%2Frevenue_ana.65b72.jpg",
"logo": "https://mashable.com/android-chrome-192x192.png?v=m2Pmw8zNwl",
"publisher": "Mashable",
"title": "The sales cycle on steroids: 3 ways analytics power up revenue",
"url": "http://mashable.com/2015/05/13/analytics-power-up-revenue"
},
{
"author": "Frederic Lardinois",
"date": "2016-01-12T08:00:44.000Z",
"description": "Recruiting software engineers is a massive headache for both startups and established companies. For a while now, HackerRank has tried to make both applying..",
"image": "https://tctechcrunch2011.files.wordpress.com/2015/08/10-interviewed.png",
"logo": "https://s0.wp.com/wp-content/themes/vip/techcrunch-2013/assets/images/homescreen_TCIcon_ipad_2x.png",
"publisher": "TechCrunch",
"title": "HackerRank Makes Technical Recruiting More Transparent",
"url": "http://social.techcrunch.com/2016/01/12/hackerrank-jobs-takes-the-mystery-out-of-technical-recruiting"
},
{
"author": "Ellen Huet",
"date": "2016-05-24T18:00:03.894Z",
"description": "The HR startups go to war.",
"image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v3/1200x800.jpg",
"logo": "https://assets.bwbx.io/s3/javelin/public/javelin/images/favicon-technology-c079867d2c.png",
"publisher": "Bloomberg.com",
"title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
"url": "https://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}
]