Add bench

microlinkhq · Dec 10, 2017 · 57ddb65
1 parent e2138bd

Showing 10 changed files with 749 additions and 1 deletion.
diff --git a/bench/README.md b/bench/README.md
@@ -0,0 +1,87 @@
+
+# Comparison
+
+To give you an idea of how accurate **Metascraper** is, here is a comparison of similar libraries:
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **95.54%**    | **74.56%**      | **61.16%**           | **66.52%**           | **70.90%**  |
+| Incorrect | 1.79%         | 1.79%           | 0.89%                | 6.70%                | 10.27%      |
+| Missed    | 2.68%         | 23.67%          | 37.95%               | 26.34%               | 8.95%       |
+
+A big part of the reason for **Metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly used, spec-compliant pieces of metadata, like Open Graph. However, **Metascraper** is specifically targeted at parsing article information, which is why it can be more highly tuned for that purpose than the other libraries.
+
+_Note: this comparison was run against [32 sites](/support/comparison/urls.js) and last updated on June 1, 2016. If you're interested, you can check out the [full results for each library](/support/comparison/results)._
+
+And here is the accuracy of each individual piece of metadata:
+
+###### `author`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **87.50%**    | **31.25%**      | **31.25%**           | **0.00%**            | **34.38%**  |
+| Incorrect | 9.38%         | 3.13%           | 3.13%                | 31.25%               | 50.00%      |
+| Missed    | 3.13%         | 65.63%          | 65.63%               | 68.75%               | 15.63%      |
+
+_An `author` is incorrect if it's not in the format of `First Last`, or has extra junk information in the string._
+
+###### `date`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **87.50%**    | **21.86%**      | **0.00%**            | **0.00%**            | **59.38%**  |
+| Incorrect | 0.00%         | 3.13%           | 0.00%                | 0.00%                | 18.75%      |
+| Missed    | 12.50%        | 75.00%          | 100.00%              | 100.00%              | 15.63%      |
+
+_A `date` is correct if it's the correct date, regardless of time. A `date` is incorrect if it's not in the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format._
+
+###### `description`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **96.88%**    | **90.63%**      | **96.88%**           | **93.75%**           | **90.63%**  |
+| Incorrect | 3.13%         | 3.13%           | 3.13%                | 3.13%                | 3.13%       |
+| Missed    | 0.00%         | 6.25%           | 0.00%                | 3.13%                | 6.25%       |
+
+_A `description` is correct if it's either the description the publisher chose, or the first paragraph of the article._
+
+###### `image`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **100.00%**   | **100.00%**     | **100.00%**          | **100.00%**          | **100.00%** |
+| Incorrect | 0.00%         | 0.00%           | 0.00%                | 0.00%                | 0.00%       |
+| Missed    | 0.00%         | 0.00%           | 0.00%                | 0.00%                | 0.00%       |
+
+_An `image` is correct if it's either the image the publisher chose, or the first image on the page._
+
+###### `publisher`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **96.88%**    | **81.25%**      | **0.00%**            | **78.13%**           | **81.25%**  |
+| Incorrect | 0.00%         | 0.00%           | 0.00%                | 12.50%               | 0.00%       |
+| Missed    | 3.13%         | 18.75%          | 100.00%              | 6.25%                | 18.75%      |
+
+_A `publisher` is correct if it's the publisher's proper name, or the publisher's domain name._
+
+###### `title`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **100.00%**   | **100.00%**     | **100.00%**          | **100.00%**          | **100.00%** |
+| Incorrect | 0.00%         | 0.00%           | 0.00%                | 0.00%                | 0.00%       |
+| Missed    | 0.00%         | 0.00%           | 0.00%                | 0.00%                | 0.00%       |
+
+_A `title` is correct if it's the title of the article, or the title of the page._
+
+###### `url`
+
+| Library   | `metascraper` | `html-metadata` | `node-metainspector` | `open-graph-scraper` | `unfluff`   |
+| :-------- | :------------ | :-------------- | :------------------- | :------------------- | :---------- |
+| Correct   | **100.00%**   | **96.88%**      | **100.00%**          | **93.75%**           | **93.75%**  |
+| Incorrect | 0.00%         | 3.13%           | 0.00%                | 0.00%                | 0.00%       |
+| Missed    | 0.00%         | 0.00%           | 0.00%                | 6.25%                | 6.25%       |
+
+_A `url` is correct if it resolves back to the original article._
+
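The README attributes the accuracy gap to per-field fallbacks. A minimal sketch of that idea follows, assuming a hypothetical `page` object and rule list; neither `titleRules` nor `firstMatch` comes from the metascraper source, and the real rules may look quite different.

```javascript
'use strict'

// Hypothetical rule list: each rule reads a bag of raw page values and
// returns a candidate string or nothing. These names are illustrative only.
const titleRules = [
  page => page.meta['og:title'],
  page => page.meta['twitter:title'],
  page => page.titleTag
]

// Return the first non-empty string produced by the rules, in order.
// When every rule comes up empty the field counts as "missed" (null).
function firstMatch (rules, page) {
  for (const rule of rules) {
    const value = rule(page)
    if (typeof value === 'string' && value.trim() !== '') return value.trim()
  }
  return null
}

// No Open Graph title here, so the second rule supplies the value.
const page = {
  meta: { 'twitter:title': ' A better timer for JavaScript ' },
  titleTag: 'Fastersite'
}

console.log(firstMatch(titleRules, page)) // prints "A better timer for JavaScript"
```

A library that only checks the first rule would report this page as missed, which is the kind of difference the "Missed" rows above would capture.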
diff --git a/bench/index.js b/bench/index.js
@@ -0,0 +1,106 @@
+'use strict'
+
+const mkdirp = require('mkdirp')
+const rimraf = require('rimraf')
+const got = require('got')
+const fs = require('fs')
+const path = require('path')
+
+const SCRAPERS = require('./scrapers')
+const URLS = require('./urls')
+
+/**
+ * Run.
+ */
+
+console.log('Fetching the results for each scraper...')
+
+getResults()
+  .then(results => {
+    console.log('Removing the old results...')
+    const dir = path.resolve(__dirname, 'results')
+    rimraf.sync(dir)
+
+    results.forEach(result => {
+      const file = path.resolve(dir, `${result.name}.json`)
+      const string = JSON.stringify(result.results, null, 2)
+      mkdirp.sync(dir)
+      fs.writeFileSync(file, string)
+    })
+
+    console.log('Success! The results have been compiled for each scraper.')
+  })
+  .catch(err => {
+    console.log('An error occurred:')
+    console.log()
+    console.log(err)
+    console.log()
+    console.log(err.stack)
+  })
+
+/**
+ * Get the metadata results.
+ *
+ * @return {Promise} results
+ */
+
+function getResults () {
+  return getHtmls(URLS).then(htmls => {
+    return getScrapersResults(SCRAPERS, URLS, htmls)
+  })
+}
+
+/**
+ * Get the metadata results from all `SCRAPERS` and `urls` and `htmls`.
+ *
+ * @param {Array} SCRAPERS
+ * @param {Array} urls
+ * @param {Array} htmls
+ * @return {Promise} results
+ */
+
+function getScrapersResults (SCRAPERS, urls, htmls) {
+  return Promise.all(
+    SCRAPERS.map(SCRAPER => {
+      return getScraperResults(SCRAPER, urls, htmls).then(results => {
+        return {
+          name: SCRAPER.name,
+          results: results
+        }
+      })
+    })
+  )
+}
+
+/**
+ * Get metadata results from a single `SCRAPER` and `urls` and `htmls`.
+ *
+ * @param {Object} SCRAPER
+ * @param {Array} urls
+ * @param {Array} htmls
+ * @return {Promise} results
+ */
+
+function getScraperResults (SCRAPER, urls, htmls) {
+  return Promise.all(
+    urls.map((url, i) => {
+      const html = htmls[i]
+      const name = SCRAPER.name
+      const Module = name === 'metascraper' ? require('..') : require(name)
+      return SCRAPER.scrape(Module, url, html).then(metadata => {
+        return SCRAPER.normalize(metadata)
+      })
+    })
+  )
+}
+
+/**
+ * Get html from a list of `urls`.
+ *
+ * @param {Array} urls
+ * @return {Promise} htmls
+ */
+
+function getHtmls (urls) {
+  return Promise.all(urls.map(url => got(url).then(res => res.body)))
+}
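`getScraperResults` expects every entry in `./scrapers` to expose a `name`, a `scrape(Module, url, html)` function that returns a promise, and a `normalize(metadata)` function. The sketch below shows one way such an entry could look. The entry itself, the raw key names (`byline`, `headline`, `canonical`) and the stand-in module are all hypothetical, since `bench/scrapers.js` is not shown in this part of the diff.

```javascript
'use strict'

// Hypothetical scraper entry. In bench/index.js, `name` is also the module
// name passed to require(); `scrape` and `normalize` are the two hooks called.
const exampleScraper = {
  name: 'example-scraper',
  // Receives the loaded module plus url and html; resolves raw metadata.
  scrape: (Module, url, html) => Promise.resolve(Module(url, html)),
  // Maps library-specific keys onto a common shape, with null for
  // anything the library did not return.
  normalize: raw => ({
    author: raw.byline || null,
    title: raw.headline || null,
    url: raw.canonical || null
  })
}

// Stand-in for a real library so the sketch runs without dependencies.
const fakeModule = (url, html) => ({ headline: 'Example', canonical: url })

// Same scrape-then-normalize chain that getScraperResults uses per url.
function runScraper (scraper, Module, url, html) {
  return scraper.scrape(Module, url, html).then(raw => scraper.normalize(raw))
}

runScraper(exampleScraper, fakeModule, 'http://example.com/post', '<html></html>')
  .then(result => console.log(result))
```

Normalizing to a shared set of keys is what lets the result files be compared field by field across libraries.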
diff --git a/bench/results/html-metadata.json b/bench/results/html-metadata.json
@@ -0,0 +1,65 @@
+[
+  {
+    "author": null,
+    "date": "2016-01-12T22:34:00.000Z",
+    "description": "HackerRank, a tech recruiting company, has launched HackerRank Jobs, to bridge the gap between the applicant and recruiter.",
+    "image": "https://img.etimg.com/thumb/msid-50551900,width-672,resizemode-4,imgsize-26854/hackerrank-launches-job-search-platform-hackerrank-jobs.jpg",
+    "publisher": "The Economic Times",
+    "title": "HackerRank launches job search platform HackerRank Jobs",
+    "url": "https://economictimes.indiatimes.com/jobs/hackerrank-launches-job-search-platform-hackerrank-jobs/articleshow/50551900.cms"
+  },
+  {
+    "author": null,
+    "date": null,
+    "description": "HackerRank, a four-year-old startup, is changing the way companies find and evaluate programmers.  ",
+    "image": "https://fortunedotcom.files.wordpress.com/2015/09/gettyimages-1852881881.jpg",
+    "publisher": "Fortune",
+    "title": "Why your next job search may involve solving online puzzles",
+    "url": "http://fortune.com/2015/10/05/hackerrank-recruiting-tool/"
+  },
+  {
+    "author": null,
+    "date": null,
+    "description": "Browser performance guru, Nat Duca, has introduced  high resolution timing to JavaScript. Ready to try in Chrome 20, the experimental window...",
+    "image": "http://4.bp.blogspot.com/-f2MG3BXeZxY/T8sN8avF8MI/AAAAAAAAJxE/gJmx_aUEK-s/w1200-h630-p-k-no-nu/Screen%2BShot%2B2012-06-03%2Bat%2B12.09.18%2BAM.png",
+    "publisher": null,
+    "title": "A better timer for JavaScript",
+    "url": "http://gent.ilcore.com/2012/06/better-timer-for-javascript.html"
+  },
+  {
+    "author": null,
+    "date": "2016-01-20T23:50:25+02:00",
+    "description": "The Netanya base company is transforming how developers and DevOps team manage binary artifacts: JFrog’s total capital raised to date is $62 million.",
+    "image": "http://jewishbusinessnews.com/wp-content/uploads/2016/01/Shlomi-Ben-Haim-JFrog-Co-Founder-and-CEO-Source-JFrog-e1453326404147.jpg",
+    "publisher": "Jewish Business News",
+    "title": "Israeli startup JFrog raises $50 million in C round - Jewish Business News",
+    "url": "http://jewishbusinessnews.com/2016/01/20/israeli-startup-jfrog-raises-50-million-in-c-round/"
+  },
+  {
+    "author": "Stacey Bishop",
+    "date": null,
+    "description": "Analytics open the door to predictive selling and ensure that your sales team is doing more than hoarding data that it doesn't know how to act on. ",
+    "image": "https://i.amz.mshcdn.com/jw9czJj9h-4ClVBegbu982-Zmb8=/1200x627/2015%2F05%2F13%2Fc6%2Frevenue_ana.65b72.jpg",
+    "publisher": "Mashable",
+    "title": "The sales cycle on steroids: 3 ways analytics power up revenue",
+    "url": "http://mashable.com/2015/05/13/analytics-power-up-revenue/"
+  },
+  {
+    "author": "Frederic Lardinois",
+    "date": null,
+    "description": "Recruiting software engineers is a massive headache for both startups and established companies. For a while now, HackerRank has tried to make both applying..",
+    "image": "https://tctechcrunch2011.files.wordpress.com/2015/08/10-interviewed.png",
+    "publisher": "TechCrunch",
+    "title": "HackerRank Makes Technical Recruiting More Transparent",
+    "url": "http://social.techcrunch.com/2016/01/12/hackerrank-jobs-takes-the-mystery-out-of-technical-recruiting/"
+  },
+  {
+    "author": null,
+    "date": null,
+    "description": "The HR startups go to war.",
+    "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v3/1200x800.jpg",
+    "publisher": "Bloomberg.com",
+    "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
+    "url": "https://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
+  }
+]
diff --git a/bench/results/metascraper.json b/bench/results/metascraper.json
@@ -0,0 +1,72 @@
+[
+  {
+    "author": "J Vignesh",
+    "date": "2016-01-12T22:34:00.000Z",
+    "description": "HackerRank, a tech recruiting company, has launched HackerRank Jobs, to bridge the gap between the applicant and recruiter.",
+    "image": "https://img.etimg.com/thumb/msid-50551900,width-672,resizemode-4,imgsize-26854/hackerrank-launches-job-search-platform-hackerrank-jobs.jpg",
+    "logo": "http://economictimes.indiatimes.com/icons/etfavicon.ico",
+    "publisher": "The Economic Times",
+    "title": "HackerRank launches job search platform HackerRank Jobs",
+    "url": "https://economictimes.indiatimes.com/jobs/hackerrank-launches-job-search-platform-hackerrank-jobs/articleshow/50551900.cms"
+  },
+  {
+    "author": "Kia Kokalitcheva",
+    "date": "2015-10-04T22:00:00.000Z",
+    "description": "HackerRank, a four-year-old startup, is changing the way companies find and evaluate programmers.",
+    "image": "https://fortunedotcom.files.wordpress.com/2015/09/gettyimages-1852881881.jpg",
+    "logo": "http://fortune.com/img/favicons/favicon-192.png",
+    "publisher": "Fortune",
+    "title": "Why your next job search may involve solving online puzzles",
+    "url": "http://fortune.com/2015/10/05/hackerrank-recruiting-tool"
+  },
+  {
+    "author": "The Nerdbirder",
+    "date": "2012-06-13T10:00:00.000Z",
+    "description": "Browser performance guru, Nat Duca, has introduced high resolution timing to JavaScript. Ready to try in Chrome 20, the experimental window...",
+    "image": "http://4.bp.blogspot.com/-f2MG3BXeZxY/T8sN8avF8MI/AAAAAAAAJxE/gJmx_aUEK-s/w1200-h630-p-k-no-nu/Screen%2BShot%2B2012-06-03%2Bat%2B12.09.18%2BAM.png",
+    "logo": "http://gent.ilcore.com/favicon.ico",
+    "publisher": "Fastersite",
+    "title": "A better timer for JavaScript",
+    "url": "http://gent.ilcore.com/2012/06/better-timer-for-javascript.html"
+  },
+  {
+    "author": "Jewish Business News Correspondent",
+    "date": "2016-01-20T21:50:25.000Z",
+    "description": "The Netanya base company is transforming how developers and DevOps team manage binary artifacts: JFrog’s total capital raised to date is $62 million.",
+    "image": "http://jewishbusinessnews.com/wp-content/uploads/2016/01/Shlomi-Ben-Haim-JFrog-Co-Founder-and-CEO-Source-JFrog-e1453326404147.jpg",
+    "logo": "http://2xkcvt35vyxycuy7x23e0em1a5g.wpengine.netdna-cdn.com/favicon-196x196.png",
+    "publisher": "Jewish Business News",
+    "title": "Israeli startup JFrog raises $50 million in C round - Jewish Business News",
+    "url": "http://jewishbusinessnews.com/2016/01/20/israeli-startup-jfrog-raises-50-million-in-c-round"
+  },
+  {
+    "author": "Stacey Bishop",
+    "date": "2015-05-13T16:37:41.000Z",
+    "description": "Analytics open the door to predictive selling and ensure that your sales team is doing more than hoarding data that it doesn’t know how to act on.",
+    "image": "https://i.amz.mshcdn.com/jw9czJj9h-4ClVBegbu982-Zmb8=/1200x627/2015%2F05%2F13%2Fc6%2Frevenue_ana.65b72.jpg",
+    "logo": "https://mashable.com/android-chrome-192x192.png?v=m2Pmw8zNwl",
+    "publisher": "Mashable",
+    "title": "The sales cycle on steroids: 3 ways analytics power up revenue",
+    "url": "http://mashable.com/2015/05/13/analytics-power-up-revenue"
+  },
+  {
+    "author": "Frederic Lardinois",
+    "date": "2016-01-12T08:00:44.000Z",
+    "description": "Recruiting software engineers is a massive headache for both startups and established companies. For a while now, HackerRank has tried to make both applying..",
+    "image": "https://tctechcrunch2011.files.wordpress.com/2015/08/10-interviewed.png",
+    "logo": "https://s0.wp.com/wp-content/themes/vip/techcrunch-2013/assets/images/homescreen_TCIcon_ipad_2x.png",
+    "publisher": "TechCrunch",
+    "title": "HackerRank Makes Technical Recruiting More Transparent",
+    "url": "http://social.techcrunch.com/2016/01/12/hackerrank-jobs-takes-the-mystery-out-of-technical-recruiting"
+  },
+  {
+    "author": "Ellen Huet",
+    "date": "2016-05-24T18:00:03.894Z",
+    "description": "The HR startups go to war.",
+    "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v3/1200x800.jpg",
+    "logo": "https://assets.bwbx.io/s3/javelin/public/javelin/images/favicon-technology-c079867d2c.png",
+    "publisher": "Bloomberg.com",
+    "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
+    "url": "https://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
+  }
+]
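The result files make the "Missed" rows easy to approximate. The sketch below counts `null` values per field, assuming that `null` is what the README calls "missed"; that mapping, and the `missedRates` helper, are assumptions rather than the script the author used. The sample copies only the `author`, `date` and `publisher` values of the seven `html-metadata.json` entries in this diff, so its figures will not match the README tables, which describe a 32-site run.

```javascript
'use strict'

// Percentage of entries, per field, whose value is null or undefined.
// Treating null as "missed" is an assumption about the README's method.
function missedRates (results, fields) {
  const rates = {}
  for (const field of fields) {
    const missed = results.filter(entry => entry[field] == null).length
    rates[field] = (missed / results.length * 100).toFixed(2) + '%'
  }
  return rates
}

// author, date and publisher from the seven html-metadata.json entries;
// the other fields are omitted for brevity.
const htmlMetadata = [
  { author: null, date: '2016-01-12T22:34:00.000Z', publisher: 'The Economic Times' },
  { author: null, date: null, publisher: 'Fortune' },
  { author: null, date: null, publisher: null },
  { author: null, date: '2016-01-20T23:50:25+02:00', publisher: 'Jewish Business News' },
  { author: 'Stacey Bishop', date: null, publisher: 'Mashable' },
  { author: 'Frederic Lardinois', date: null, publisher: 'TechCrunch' },
  { author: null, date: null, publisher: 'Bloomberg.com' }
]

console.log(missedRates(htmlMetadata, ['author', 'date', 'publisher']))
// prints { author: '71.43%', date: '71.43%', publisher: '14.29%' }
```

Running the same helper over the seven `metascraper.json` entries above would give 0.00% for all three fields, since none of them is `null`. Telling "correct" from "incorrect" still needs a human check against the criteria in the README.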