All notable changes to this project will be documented in this file.
- Avoid overwriting extracted metadata with similarly named keys (like
og:image
andog:image:width
) - Imported new
getSiteName()
feature from JS version as of 21 Dec 2018 - Added getFirstElementChild function to NodeTrait + test case (Issue #83)
- Reworked the test suit to use TestPage objects and give more hints about what failed
- Removed getWordThreshold and setWordThreshold configuration functions
- Added NodeUtility::filterTextNodes and deprecated NodeTrait getChildren()
- Added new DOMNodeList fake class that mimics the original DOMNodeList class but allows to add new nodes to the list
- Added new Dockerfiles that pulls different versions of PHP and libxml. Now we are supporting 4 versions of PHP and 6 versions of libxml!
- Fixed small issue that prevented the main image from showing up in the results
- [BREAKING CHANGE] Bumped the minimum supported version of PHP to 7.0
- Clean
<aside>
tags duringprepArticle()
. - Merged PR #58: Fix notice non-object on $parentOfTopCandidate for tumblr.com
- Fixed issue #63: Division by zero
- Housekeeping:
- Removed $parseSuccessful flag that wasn't needed anymore
- Rename wordThreshold to charThreshold and throw deprecation notices. WordThreshold will be removed in version 3.0.
- Added "-ad-" as unlikely candidate
- Added Docker containers with PHP 7.0, 7.1, and 7.2 and makefile to trigger the tests.
- Imported new code from the JS version as of 19 Nov 2018, which includes the following changes:
- Move phrasing contents into paragraphs
- Improved the title detection
- Remove single cell tables
- Improved the detection of video related elements
- New test cases
- Various minor fixes
- Merged PR#49 (Missing object when calling
->getContent()
) - Imported all changes from Readability.js as of 2 March 2018 (8525c6a):
- Check for
<base>
elements before converting URLs to absolute. - Clean
<link>
tags onprepArticle()
- Attempt to return at least some text if all the algorithm runs fail (Check PR #423 on JS version)
- Add new test cases for the previous changes
- And all other changes reflected in this diff
- Check for
- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
- Added a safe check to avoid sending the DOMDocument as a node when scanning for node ancestors.
- Fix issue #45: Small mistake in documentation
- Fix issue #46: Added
data-src
as a image source path - Fixed bug when extracting all the image of the article (Was extracting images from the original DOM instead of the parsed one)
- Added the
->getDOMDocument()
getter to retrieve the fully parsed DOMDocument - Merged PR #48 that allows passing an array as configuration (@topotru)
- Added 'data-orig' as an URL source for images
- Removed 'modal' as a negative property from classes
- Added option to inject a logger
- Removed all references to the
data-readability
tags that don't apply anymore to the new structure - Merged PR #38 (Missing DOMEntityReference)
- Node encapsulation is gone. Pre v1 all nodes where encapsulated in a Readability class, which created lots of trouble with dependencies, responsibilities, and properties. Now all the encapsulation is gone: all the DOMNodes inside the Readability class are extensions of the original DOM classes, which allows the system to take advantage of the functions and properties of DOMDocument.
- HTMLParser is gone, Readability is the new main class. Switched things a bit for this release. Pre v1 you had to create an HTMLParser class to parse the HTML. Now you have to create a Readability class, feed it the text, and check the result.
- No more dumb arrays as a result. If you want to get the title, content, images, or anything else you'll have to use the getters of the Readability class.
- Environment class is gone. Now you have to create a configuration class and use setters to set your configuration options.
- Exceptions. Make sure you wrap your Readability class in a try catch block, because if it fails to parse your HTML, it will throw a
ParseException
. - Minimum PHP version bumped to 5.6.
- Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
- Fix issue when converting divs to p nodes and never rating them (issue #29)
- Fix "Unsupported operand types" (PR #31)
- Fix division by zero when no title was found (issue #32)
- New function to retrieve all images at once (PR #30)
- Get the title from the
<title>
tag before searching on the<meta>
tags
- Merged PR #24. Fixes notice when trying to extract
og:image
- Up to date to commit eb221c5 (2017-10-16), which includes the following changes:
- New tags added to the unlikelyCandidates regex
- Detection and removal of hierarchical separators in titles
- Added more tags to clean after parsing the article (
button
,textarea
,select
, etc.) - New way to detect empty nodes (including a edge case where a node with a
&nsbp;
was detected as a node with content) - Better approach to find a top candidate (specially when a top candidate is the only child of a parent node, which allows a more accurate joining of sibling elements)
- Detect text direction (
ltr
orrtl
) - Detect and mark data tables to avoid removing them during final clean up
- Major fixes when scanning and deleting nodes (no need to traverse backwards anymore)
- Node cleaning via regex matches
- Clean table attributes during final clean up.
- Added license
Next release after this one will be v1 and will be a major refactor around Readability and HTMLParser methods and responsibilities.
- Added a safecheck for really nasty HTML
- Added summonCthulhu option, to remove all script tags via regex
- Added
normalizeEntities
flag to convert UTF-8 characters to its HTML Entity equivalent. Fixes bugs on htmls with mixed encoding. - Added more information to the readme.md file
- New way to create a backup DOM: not creating a backup. In the previous version, the system cloned the $this->dom object to keep it as a backup in order to restart the algorithm with other flags, if needed. This seemed to work until I realized that sometimes the backup changes even if we are not touching it. Seems that the
dom
andbackupdom
objects are linked and some changes on the dom object reach the bakcupdom object. The new approach consists in deleting the backupdom object and recreating from scratch the dom object. Of course this has a performance impact, but seems to be quite low.
100% complete port of Readability.js!
- Every test unit passes
- Readability.php produces the same exact output as Readability.js
- I'm happy :)
- Lots of bugs
- Merged PR by DavidFricker to avoid exceptions while grabbing the document content
- substituteEntities flag, to avoid replacing especial characters with HTML entities. There's nothing we can do about
, that entity is replaced by libxml and there's no way to disable it. - Named data sets so it's easier to detect which test case is failing.
- Couple of test cases that involved broken JS. There's nothing we can do about JS spilling onto the text.
We are getting closer to be a 100% complete port of Readability.js!
- Added prepArticle to remove junk after selecting the top candidates.
- Added a function to restore score after selecting top candidates. This basically works by scanning the data-readability tag and restoring the score to the contentScore variable. This is an horrible hack and should be removed once we ditch the Element interface of html-to-markdown and start extending the DOMDocument object.
- Switched all strlen functions to mb_strlen
- Fixed lots of bugs and pretty sure that introduced a bunch of new ones.
- Last version I'm using master as the main development branch. All unreleased changes and main development will happen in the develop branch.
- Initial release