Reasonable handler for unversioned/ill-versioned RSS feeds #186

Closed
kzn opened this issue Jun 15, 2014 · 9 comments
Comments

@kzn

kzn commented Jun 15, 2014

Currently, when ROME parses an RSS feed, it chooses a parser based on the version attribute of the rss tag, but some feeds don't provide one, or provide a non-standard version.

Maybe add a reasonable default parser for such feeds? For example, if the version of the feed can't be determined, try the parser for the latest version.

@ChristianBusch

Maybe one could think of a heuristic to determine which version the feed uses?

@kzn
Author

kzn commented Jun 16, 2014

I simply wrote a small fixup DOM procedure that appends the latest version to the rss element if this attribute is absent.
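For readers who want to try the same workaround, a minimal sketch of such a fixup is below. It is not the code referred to above: it uses the JDK's built-in DOM API, and the class name `RssVersionFixup`, the method names, and the choice of "2.0" as the "latest version" are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RssVersionFixup {

    // Assumption: "2.0" stands in for "the latest version"; the thread does not name one.
    public static final String DEFAULT_VERSION = "2.0";

    /**
     * Adds a version attribute to a root rss element that has none (or a blank one).
     * Returns true if the document was modified.
     */
    public static boolean addMissingVersion(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null || !"rss".equals(root.getTagName())) {
            return false; // not an <rss> document (e.g. Atom or RDF), leave it alone
        }
        if (!root.getAttribute("version").trim().isEmpty()) {
            return false; // a version is already declared, do not override it
        }
        root.setAttribute("version", DEFAULT_VERSION);
        return true;
    }

    /** Parses an XML string into a DOM document with the JDK's default parser. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<rss><channel><title>Example</title></channel></rss>");
        boolean changed = addMissingVersion(doc);
        // prints "true 2.0"
        System.out.println(changed + " " + doc.getDocumentElement().getAttribute("version"));
    }
}
```

The patched document could then be handed to ROME. As far as I know, `SyndFeedInput` has a `build` overload that accepts an `org.w3c.dom.Document`, but check that against the ROME version you use. Note that this sketch only covers a missing version, not an incorrect one.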

@PatrickGotthard
Member

Hi @kzn,

ROME was designed not to handle broken feeds. I don't know whether we really should implement such a fallback mechanism.

If we implement such functionality, it should not be enabled by default. Either the user has to handle such errors explicitly, or they have to enable this mechanism with a setting, configuration flag or something similar.

Regards,
Patrick

@mishako
Member

mishako commented Jan 9, 2016

@PatrickGotthard

ROME was designed not to handle broken feeds.

I'm inclined to disagree. Some parts of Rome have intentionally been designed to handle brokenness. Namely, XmlReader is lenient by default, happily accepting a file with a UTF-8 BOM followed by <?xml encoding="{insert your favorite encoding}". Another such part is Rome's DateParser, which accepts ISO 8601 in spite of what the RSS spec says.
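To illustrate the kind of date leniency meant here, the sketch below tries the RFC 822/1123 style that RSS specifies and then falls back to ISO 8601. It is not Rome's DateParser: it uses only `java.time`, and the class name `LenientFeedDate` and its `parse` method are hypothetical.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

public class LenientFeedDate {

    // Tried in order: the RFC 1123 style that RSS asks for, then ISO 8601 with an offset.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.RFC_1123_DATE_TIME,
            DateTimeFormatter.ISO_OFFSET_DATE_TIME);

    /** Returns the parsed instant, or an empty Optional if no known format matches. */
    public static Optional<Instant> parse(String text) {
        if (text == null) {
            return Optional.empty();
        }
        String trimmed = text.trim();
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(OffsetDateTime.parse(trimmed, format).toInstant());
            } catch (DateTimeParseException e) {
                // This format did not match; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("Sun, 15 Jun 2014 10:00:00 GMT")); // spec-conformant
        System.out.println(parse("2014-06-15T10:00:00Z"));          // ISO 8601, accepted anyway
        System.out.println(parse("yesterday"));                     // unparseable
    }
}
```

Returning an empty `Optional` rather than throwing means a single bad date does not stop the rest of the feed from being read.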

In general I think Rome should be as lenient as possible, because our users usually have no control over the feeds they parse, unlike with databases or compilers, where the user has no excuse to keep things broken.

@kzn
Author

kzn commented Jan 9, 2016

I think that the state of RSS is a bit similar to that of HTML. There is a huge number of HTML pages that don't conform to the standard. Nevertheless, browsers are able to handle them and show them to the user.

I had a similar experience with RSS feeds. Some of them are handled perfectly by RSS readers but won't parse with ROME.
I think that ROME should be able to handle broken feeds, within reasonable limits of course, the way the tagsoup/jsoup libraries handle HTML pages. Otherwise the user is forced to dig into the RSS/Atom standards to fix broken feeds, and IMHO that is comparable to writing a custom RSS parser.

@ChristianBusch

Imho ROME should be as lenient as possible for the reasons @mishako and @kzn stated.

I'd rather have ROME extract as much information as possible out of a feed instead of giving up, or even withholding information it could parse but decides not to because the feed does not conform to the standard.

If you want to make a user aware of a broken feed, I'd prefer it be done in a way that still allows parsing.

@mishako
Member

mishako commented Jan 24, 2016

@PatrickGotthard Any thoughts?

@mishako
Member

mishako commented Mar 1, 2016

This is now partially fixed by #274. I say partially because the fix doesn't cover feeds with an incorrect version. Anyway, it's something.

@mishako
Member

mishako commented Mar 25, 2016

@kzn Rome 1.6.0 has been released with this issue fixed (or, as I said earlier, partially fixed).
