Releases: andreskrey/readability.php
v2.1.0 > The one where I realized that libxml didn't die on version v2.9.4
Thanks to issue #86 I realized that there are modern versions of libxml2. I always wondered why the bundled version of libxml was so old (2.9.4 was released in 2016). Turns out I was checking the wrong website. What seems to be the official website lists a really old version as the latest one, while on GitLab the latest version was released months ago!
So there are newer versions, and from 2.9.5 onward the default behavior changed, breaking all our tests. Luckily the change is "cosmetic" (whitespace differences with 2.9.4) so the tests are still "valid", but PHPUnit will complain anyway. If you know a way to compare HTML documents while ignoring whitespace, let me know.
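One possible approach to whitespace-insensitive comparison, as a rough sketch rather than what the test suite actually does: parse both strings with DOMDocument, drop whitespace-only text nodes, collapse whitespace runs inside the remaining text, and compare the serialized results. The helper name `normalizeHtml` is hypothetical and not part of readability.php.

```php
<?php

// Hypothetical helper, not part of readability.php: parse the markup and
// serialize it again with whitespace-only text nodes dropped and inner
// whitespace runs collapsed to a single space.
function normalizeHtml(string $html): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the node list to an array first, since the tree is modified in the loop.
    foreach (iterator_to_array($xpath->query('//text()')) as $text) {
        $collapsed = trim(preg_replace('/\s+/', ' ', $text->nodeValue));
        if ($collapsed === '') {
            $text->parentNode->removeChild($text);
        } else {
            $text->nodeValue = $collapsed;
        }
    }

    return $dom->saveHTML();
}

$a = "<div>\n  <p>Hello   world</p>\n</div>";
$b = "<div><p>Hello world</p></div>";

var_dump(normalizeHtml($a) === normalizeHtml($b));
```

This normalization is deliberately aggressive: it also trims spaces around inline elements and inside `<pre>` blocks, so it only suits tests where whitespace carries no meaning.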
Anyway, the following changes are included in this release:
- Avoid overwriting extracted metadata with similarly named keys (like `og:image` and `og:image:width`)
- Imported new `getSiteName()` feature from JS version as of 21 Dec 2018
- Added getFirstElementChild function to NodeTrait + test case (Issue #83)
- Reworked the test suite to use TestPage objects and give more hints about what failed
- Removed getWordThreshold and setWordThreshold configuration functions
- Added NodeUtility::filterTextNodes and deprecated NodeTrait getChildren()
- Added new DOMNodeList fake class that mimics the original DOMNodeList class but allows adding new nodes to the list
- Added new Dockerfiles that pull different versions of PHP and libxml. We now support 4 versions of PHP and 6 versions of libxml!
I reworked the 4 Dockerfiles we had before and created a dedicated repo for PHP with custom libxml versions. Here it is: https://github.com/andreskrey/php-libxml-docker-images
Each PR will be tested against 4 versions of PHP and 6 versions of libxml, which means that Travis will run 24 virtual machines every time there are changes in the repo. Let's see for how long we can abuse their free resources.
And that's it. Let me know if something is broken for you. Tell your mom you love her. Don't forget to call your father.
v2.0.1 > Oopsie
Oopsie. Noticed that the main image was always missing from the results? That's because I screwed it up. But fear not, it's fixed.
I also updated the tests to be a little more strict so this, IN THEORY, should not happen again.
v2.0.0 > Up to date with Readability.js again + docker containers
Hello everyone,
Guess you weren't expecting a new release of your favorite dependency written in the wrong language, huh!?
We are up-to-date with the JS version as of 19 Nov 2018 which includes the following changes:
- Move phrasing contents into paragraphs
- Improved the title detection
- Remove single cell tables
- Improved the detection of video related elements
- New test cases
- Various minor fixes
The following changes were also added:
- Clean tags during prepArticle().
- Merged PR #58: Fix notice non-object on $parentOfTopCandidate for tumblr.com
- Fixed issue #63: Division by zero
- Housekeeping:
- Removed $parseSuccessful flag that wasn't needed anymore
- Rename wordThreshold to charThreshold and throw deprecation notices. WordThreshold will be removed in version 3.0.
- Added "-ad-" as unlikely candidate
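The rename with deprecation notices mentioned above can be pictured roughly as follows. This is a minimal sketch under assumptions: the class name `ThresholdConfig`, its method bodies and the default value are hypothetical, and the real configuration class in readability.php may look different.

```php
<?php

// Hypothetical stand-in for a configuration class; the names, default value
// and notice text are illustrative only.
class ThresholdConfig
{
    private $charThreshold = 500; // arbitrary default for this sketch

    public function getCharThreshold(): int
    {
        return $this->charThreshold;
    }

    public function setCharThreshold(int $charThreshold): self
    {
        $this->charThreshold = $charThreshold;

        return $this;
    }

    // The old name stays as a thin wrapper that warns and forwards to the new one.
    public function setWordThreshold(int $wordThreshold): self
    {
        trigger_error(
            'setWordThreshold() is deprecated, use setCharThreshold() instead.',
            E_USER_DEPRECATED
        );

        return $this->setCharThreshold($wordThreshold);
    }
}

// Print deprecation notices in a readable form instead of PHP's default output.
set_error_handler(function (int $errno, string $errstr): bool {
    echo "Deprecated: {$errstr}\n";

    return true;
}, E_USER_DEPRECATED);

$config = new ThresholdConfig();
$config->setWordThreshold(300);
echo $config->getCharThreshold(), "\n";
```

Callers using the old name keep working until the wrapper is removed, and the notice tells them what to switch to.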
And finally a docker container was added so you can easily test in all the supported PHP versions by simply typing `make test-all` in your console. You'll need docker and docker-compose if you want to see some really flashy stuff in your console and not just a silly error message.
If for some reason you're still reading this, you might be wondering why this version is 2.0 and not 1.3.something. I know you are. I know you're making a confused face right now.
The reason is that PHP 5.6 support is GONE.
YES
IT'S GONE
So make sure you run this code in a somewhat modern version of PHP. A version that starts with 7.
That's it. Take care. Call your mother.
v1.2.0 > Up to date with Readability.js
Hi all,
Version 1.2.0 is here. We are up to date with our JS big brother. Here's the full changelog:
- Merged PR #49 (Missing object when calling `->getContent()`)
- Imported all changes from Readability.js as of 2 March 2018 (8525c6a):
  - Check for `<base>` elements before converting URLs to absolute.
  - Clean `<link>` tags on `prepArticle()`
  - Attempt to return at least some text if all the algorithm runs fail (Check PR #423 on JS version)
  - Add new test cases for the previous changes
  - And all other changes reflected in this diff
v1.1.1 > The one with small changes
Hello pretty people of the PHP world.
It's Monday night on this side of the world, I've just had a lovely dinner with the girlfriend and everything in my world is at peace. This is a great opportunity to release a new version of Readability and maybe get some of those ~~github stars~~ feedback from the users.
This version includes the following changes:
- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
- Added a safety check to avoid sending the DOMDocument as a node when scanning for node ancestors.
- Fix issue #45: Small mistake in documentation
- Fix issue #46: Added `data-src` as an image source path
- Fixed bug when extracting all the images of the article (was extracting images from the original DOM instead of the parsed one)
- Added the `->getDOMDocument()` getter to retrieve the fully parsed DOMDocument
- Merged PR #48 that allows passing an array as configuration (@topotru)
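To illustrate why the switch from assertEquals to assertSame matters: for scalar values, assertEquals behaves roughly like PHP's loose `==`, while assertSame behaves like the strict `===`. The snippet below is only a plain-PHP sketch of that difference and does not use PHPUnit itself; the sample values are made up.

```php
<?php

// Loose comparison applies type juggling; strict comparison also checks the type.
$expected = '1';
$actual = 1;

var_dump($expected == $actual);  // loose: string and int are considered equal
var_dump($expected === $actual); // strict: different types, so not identical

// A missing value and an empty string are also loosely equal, which can hide
// a result that is null when a string was expected.
var_dump(null == '');
var_dump(null === '');
```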
Don't forget to update your dependencies, wash your hands after going to the bathroom, and do I have to tell you again to call your mom? That woman loves you. Call her right now. That `composer update andreskrey/readability.php` can wait a couple of minutes.
v1.1.0 > Say hello to optional logging
Hi all!
Happy 2018! Hope you had an excellent 2017!
With this new release you'll be able to log everything that happens inside readability. Of course this is optional and nothing else is required from you if you don't care about logs.
Here's the full changelog:
- Added 'data-orig' as a URL source for images
- Removed 'modal' as a negative property from classes
- Added option to inject a logger
- Removed all references to the `data-readability` tags that don't apply anymore to the new structure
- Merged PR #38 (Missing DOMEntityReference)
'til next time!
v1
Hi all!
Finally v1 is here. The project changed drastically from v0, mainly because the HTMLParser is gone and the Readability class replaces it. I know, confusing, but this change aligns us with Readability.js and makes everything easier to port.
Another huge change that I had wanted to make since version 0.0.1 was getting rid of the node encapsulation. v0 used the league\html-to-markdown NodeElement class to encapsulate the nodes and act as a middle man between your code and the DOMDocument. This caused lots of trouble because when you encapsulate nodes, you are actually severing the relation between the original DOM and the encapsulated node, forcing you to keep track of the changes between them instead of letting the system do it. Instead of encapsulating nodes, this version extends the original class, solving all these issues.
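As a rough illustration of extending instead of wrapping, PHP's DOM extension lets a document instantiate a subclass wherever it would normally create a DOMElement, via `DOMDocument::registerNodeClass()`. The subclass `CountingElement` and its `wordCount()` method are hypothetical names for this sketch; the node classes in readability.php are more elaborate.

```php
<?php

// Hypothetical subclass for illustration only.
class CountingElement extends DOMElement
{
    // Extra behaviour lives directly on the node, so no wrapper object is needed.
    public function wordCount(): int
    {
        return str_word_count($this->textContent);
    }
}

$dom = new DOMDocument();
// Ask the document to create CountingElement objects in place of DOMElement.
$dom->registerNodeClass('DOMElement', 'CountingElement');
$dom->loadHTML('<p>Call your mom today</p>');

$p = $dom->getElementsByTagName('p')->item(0);
var_dump($p instanceof CountingElement);
echo $p->wordCount(), "\n";

// The object is still the real DOM node, so changes go straight to the document.
$p->setAttribute('class', 'reminder');
echo $dom->saveHTML($p), "\n";
```

Because the extended object is the node itself, there is no second copy to keep in sync with the document.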
Check the readme file to understand how to port your v0 code to v1 and the changelog to read about all the other changes.
Enjoy!
v0.3.1
Hi all, happy Friday.
I'm releasing this version just to clean the Unreleased section of the changelog and prepare everything for the v1 version. Changes for this release are:
- Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
- Fix issue when converting divs to p nodes and never rating them (issue #29)
- Fix "Unsupported operand types" (PR #31)
- Fix division by zero when no title was found (issue #32)
- New function to retrieve all images at once (PR #30)
- Get the title from the `<title>` tag before searching on the `<meta>` tags
Next release will be v1. For real this time.
v0.3.0
Happy November everyone. Took me longer than I expected but finally we are up to date with Readability.js, at least at the moment of writing these lines.
Here is the changelist for this release.
- Merged PR #24. Fixes notice when trying to extract `og:image`
- Up to date to commit eb221c5 (2017-10-16), which includes the following changes:
- New tags added to the unlikelyCandidates regex
- Detection and removal of hierarchical separators in titles
- Added more tags to clean after parsing the article (`button`, `textarea`, `select`, etc.)
- New way to detect empty nodes (including an edge case where a node with a was detected as a node with content)
- Better approach to find a top candidate (especially when a top candidate is the only child of a parent node, which allows a more accurate joining of sibling elements)
- Detect text direction (`ltr` or `rtl`)
- Detect and mark data tables to avoid removing them during final clean up
- Major fixes when scanning and deleting nodes (no need to traverse backwards anymore)
- Node cleaning via regex matches
- Clean table attributes during final clean up.
- Added license
Hopefully you'll find this release useful. Next release will be 1.0.
Don't forget to like, comment, subscribe, follow my Patreon, hit the gym, and call your mom. Make sure you tell your significant other you love him/her/it and if you are alone right now, install Tinder, because that's something I would really like to do but I was already in a relationship when that app was released.
Enjoy!
v0.2.2
Happy September everyone, here's another release of readability.php:
- Added a safety check for really nasty HTML
- Added summonCthulhu option, to remove all script tags via regex
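Removing script tags via regex, as the summonCthulhu option is described above, can be sketched like this. The function name `stripScriptTags` and the exact pattern are assumptions made for illustration; the pattern readability.php actually uses may differ.

```php
<?php

// Hypothetical helper showing the general idea only.
function stripScriptTags(string $html): string
{
    // Match an opening <script ...> tag, everything up to the nearest closing
    // tag (non-greedy, across newlines, case-insensitive), and the closing tag.
    // preg_replace() returns null on failure, so fall back to the input.
    return preg_replace('/<script\b[^>]*>.*?<\/script\s*>/is', '', $html) ?? $html;
}

$html = "<p>Before</p><script type=\"text/javascript\">alert('hi');\n</script><p>After</p>";
echo stripScriptTags($html), "\n";
```

Regex-based cleanup like this runs before any DOM parsing, which is why it can help with markup that trips up the parser, but it is easily fooled by unusual input such as a closing script tag inside a string literal.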
Enjoy!