Support different (older) PAGE namespaces #14

Open
chris1010010 opened this issue Jul 18, 2019 · 20 comments · May be fixed by #15

Comments

@chris1010010
Contributor

chris1010010 commented Jul 18, 2019

Copied from github.com/OCR-D/core/issues/67

I would like to share some ideas on this. This is both an issue of how good our implementation can be, and what can be done in the schema itself to make life easier for applications.

problem

Currently, we start from a very bad situation: the pagecontent schema changes its xs:targetNamespace on a fixed release schedule (once per year), usually without even introducing breaking changes. And our implementation can only parse (or produce) instances of the one version it was "built with" (i.e. the one on which generateDS was run). This forces us both to release a new version of core each year and (worse) to recreate (or map) all documents to the new version, too. If we forget either of these in even a single case, processing breaks down.

We can do better. In the following, I will frequently refer to concepts and recommendations in this excellent guide on XML schema versioning.

First and foremost, I do not think it is good practice to use targetNamespace versioning when the changes themselves are not breaking.

For example, right now, a new release http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 is under way, but all it changes w.r.t. the previous version http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15 are new elements and attributes which are optional. Even in previous years, most of the changes were compatible extensions. The rest were bugfixes (so no instance could possibly become invalid or wrong after the change if it has not been before) and minor semantic changes that would only invalidate a tiny fraction of documents that used the respective features heavily.

solution

targetNamespace for major, version for minor releases

As long as we have purely backwards compatible changes (as we do now), we could just stay in the namespace (even if it happens to be named after some particular year), and add an (internal) version attribute to its declaration (i.e. /xsd:schema/@version), which we increase with each release (e.g. 2.0 etc). That way, old documents can stay as they were, and only applications have to be updated.

(They could be updated manually on a fixed release schedule, or even be built in such a way that they update automatically: they merely have to check at the schema location whether new versions have been published, then download and incorporate them accordingly. Or they could check the schema location and just show a warning that they are outdated.)

And if, in the future, needs do arise that require breaking changes, then we can still start a new namespace (but again with version="1.0"). So we would have:

  • /schema/@targetNamespace changes for major releases, and
  • /schema/@version changes for minor releases.
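The two-level scheme in the list above could be checked mechanically. The following is only a minimal sketch: it assumes a schema that already carries the proposed version attribute (the published schema does not, as noted in this thread), and the inline schema stub and the helper names schema_release and is_breaking are made up for illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema stub: the namespace URI is the 2018 one quoted in this
# thread, the version attribute is the proposal and does not exist yet.
SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15"
    version="2.0"/>"""


def schema_release(xsd_text):
    """Return (targetNamespace, version), i.e. the major and minor release."""
    root = ET.fromstring(xsd_text)
    if root.tag != "{%s}schema" % XSD_NS:
        raise ValueError("not an XML Schema document")
    # A schema without @version is treated as the first minor release.
    return root.get("targetNamespace"), root.get("version", "1.0")


def is_breaking(old_xsd, new_xsd):
    """Under the proposal, only a targetNamespace change signals a breaking release."""
    return schema_release(old_xsd)[0] != schema_release(new_xsd)[0]
```

With this convention an application only needs to compare two strings to decide whether existing documents have to be touched at all.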

How does that apply to the situation we face now, how do we introduce this? For the current schema does not yet have a version, and existing applications do not yet have an updating mechanism in place. So if tomorrow new documents appear with the extended features (new elements and attributes), applications will show them as invalid as long as they themselves have not been updated.

But remember: this is not any worse than what we already faced year by year! On the contrary: we used to have the dilemma of either updating our application and breaking all the old documents, or not updating and breaking some of the new documents. Now at least we can safely promote updating applications!

schema/@version for releases vs PcGts/@compatibleVersion for documents

However, it gets better. We could also introduce a new (external) attribute in the schema definition that informs the application which version(s) the document is compatible with, say /pc:PcGts/@compatibleVersion, with a xs:default="1.0". Now an application (updated with the new schema versioning mechanism) can first pre-parse the document and loo

@kba
Contributor

kba commented Jul 18, 2019

Just to chime in, I think that this is a great proposal that would make life much easier for us @OCR-D.

Since release of the 2019 version was just two days ago, I'd find it best if we stuck with the 2018 targetNamespace with added version mechanics as proposed by @bertsky

@chris1010010
Contributor Author

That's the question. If we have a required version attribute (which I think we should), it's a non-compatible change. In that case I'd say use the 2019 namespace.

@bertsky
Contributor

bertsky commented Jul 18, 2019

> That's the question. If we have a required version attribute (which I think we should), it's a non-compatible change. In that case I'd say use the 2019 namespace.

The idea was to give it a default value 1.0. So existing documents would be interpreted (by new applications) as being 2018, while new documents can say 2.0 if they need 2019 features. The whole thing should be non-breaking.
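To illustrate the default-value idea, here is a rough sketch of how an updated application might interpret documents. It assumes the attribute name compatibleVersion from the original proposal, emulates the schema default in code (ElementTree does not apply XSD defaults), and uses a made-up SUPPORTED constant and made-up function names.

```python
import xml.etree.ElementTree as ET

# Highest minor version this (hypothetical) application was built against.
SUPPORTED = (2, 0)


def declared_version(page_xml):
    """Read the proposed PcGts/@compatibleVersion, defaulting to 1.0 when absent."""
    root = ET.fromstring(page_xml)
    text = root.get("compatibleVersion", "1.0")
    # "2.0" -> (2, 0), so that versions compare numerically, not as strings.
    return tuple(int(part) for part in text.split("."))


def can_process(page_xml):
    """A document is processable if it asks for no more than we support."""
    return declared_version(page_xml) <= SUPPORTED
```

An existing 2018 document without the attribute is read as 1.0 and stays processable, while a document declaring a later minor version than the application knows is rejected up front.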

@chris1010010
Contributor Author

But old applications will not work. They see a 2018 XML file, use the old 2018 schema to validate, and that fails because of the version attribute or any other new attribute/element.
To me, changing the version approach is a major change :-)

@bertsky
Contributor

bertsky commented Jul 18, 2019

Applications need to update anyway. But documents can stay as they are – this time from now on.

> But remember: this is not any worse than what we already faced year by year! On the contrary: we used to have the dilemma of either updating our application and breaking all the old documents, or not updating and breaking some of the new documents. Now at least we can safely promote updating applications!

@chris1010010
Contributor Author

Yes, applications need updating, but there are many copies out there already that would break. At the moment they would say "not supported" for newer versions (namespaces). I really think it has to be 2019-07-15, and from now on we have the option for minor versions. I know this is inconvenient.

@bertsky
Contributor

bertsky commented Jul 18, 2019

Sorry, but I am still not convinced this is necessary at all. If we adopt /schema/@version="2.0" and /PcGts/@schemaVersion/@default="1.0" in the new release, and update applications accordingly, then old documents using only 2018 features will not break. (And obviously, old applications with old documents won't, either.) It is only new documents with old applications that will not work (as would also have happened with a namespace change).

@chris1010010
Contributor Author

chris1010010 commented Jul 18, 2019

No, the difference is that, with a new namespace, old applications know they don't support that XML file. If we don't change the namespace, old applications think they support new files, but they don't.
(That was one of the main flaws of the old ALTO versioning also)
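The behaviour described here, where an old application recognises from the namespace alone that it cannot handle a file, might look roughly like the sketch below. The exception class, the KNOWN_RELEASES set and the function name are hypothetical and not taken from any existing tool.

```python
import xml.etree.ElementTree as ET

PREFIX = "http://schema.primaresearch.org/PAGE/gts/pagecontent/"
# Releases this hypothetical old application was built for.
KNOWN_RELEASES = {"2018-07-15"}


class UnsupportedSchemaVersion(Exception):
    """Raised when the root namespace is not one the application knows."""


def page_release(page_xml):
    """Return the release date encoded in the namespace, or refuse the file."""
    root = ET.fromstring(page_xml)
    if not root.tag.startswith("{"):
        raise UnsupportedSchemaVersion("no namespace")
    namespace, local = root.tag[1:].split("}", 1)
    if local != "PcGts" or not namespace.startswith(PREFIX):
        raise UnsupportedSchemaVersion(namespace)
    release = namespace[len(PREFIX):]
    if release not in KNOWN_RELEASES:
        # The file is refused before any content is read or rewritten.
        raise UnsupportedSchemaVersion(release)
    return release
```

The check runs before any element is interpreted, so a file from a newer release is never half-understood.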

@bertsky
Contributor

bertsky commented Jul 18, 2019

> with a new namespace, old applications know they don't support that XML file

True. I understand now. You want them to say "unsupported" and not "PcGts/@schemaVersion is invalid".

@chris1010010
Contributor Author

Yes, that's one scenario. Other tools (and I admit we might have some of those) might silently remove the new things when saving an XML file with the old namespace.
It might sound minor, but from a usability point of view I think it's important.
How big is the impact of a new namespace on OCR-D?

@bertsky
Contributor

bertsky commented Jul 18, 2019

> Other tools (and I admit we might have some of those) might silently remove the new things when saving an XML file with the old namespace.

Good point!

It's better to be very careful here. I don't think it's that much of a problem to once again have everyone move to a new namespace.

@chris1010010
Contributor Author

Just had a brief discussion with Stefan (Mr PAGE) and he had some concerns with the change.
@bertsky Can you explain a little bit more why a version number is better for you (you mentioned generateDS)? Can you not support multiple namespaces/versions?
Perhaps we need a pros/cons table

@bertsky
Contributor

bertsky commented Jul 18, 2019

Sorry, I do not know enough about our generateDS integration. My impression was that it is difficult for applications in general to support multiple namespaces. @kba?

@kba
Contributor

kba commented Jul 18, 2019

generateDS is a code generation tool we use to create an API to PAGE from the schema. We ship only one version of that generated code, based on the latest schema. Currently, this means breaking backwards compatibility whenever a new version is released, because the namespace changes: the code won't work with documents in older namespaces and can only generate documents in the new namespace.

Now, we could version our code generation and devise some sort of selection mechanism based on the data. But then we'd also need to upgrade documents dynamically. E.g. adding @orientation to a pc:Page in a 2018 document entails first converting it to 2019 which is non-trivial with XML tools if the namespace changes but would be easy with an in-line versioning scheme.

Another reason for non-namespace versioning is XSLT scripts. We use a lot of those, most of them using only features that have been part of PAGE-XML for years. Here we employ different workarounds to support older versions, like using local-name() = "Page" or pg2017:Page | pg2018:Page | pg2019:Page, which defeat the purpose of namespaces IMHO.
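For comparison, the same kind of workaround outside XSLT: a small sketch that selects Page elements by local name only, so that it works across the yearly namespaces. The helper names are made up, and like the local-name() trick it deliberately ignores the namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' part of an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_pages(page_xml):
    """Return all Page elements regardless of which PAGE namespace they are in."""
    root = ET.fromstring(page_xml)
    return [el for el in root.iter() if local_name(el.tag) == "Page"]
```

This matches a Page element in any namespace at all, including unrelated ones, which is exactly the loss of precision complained about above.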

@chris1010010
Contributor Author

I started a document with some of the mentioned points. Feel free to add.
Google Doc

@wrznr
Contributor

wrznr commented Jul 31, 2019

The Google Doc states "No changes in existing software required (with regard to XML handling)" as a pro for the current approach. However, JPageViewer refuses to open files referring to the 2019 schema:

org.primaresearch.io.xml.XmlModelAndValidatorProvider$UnsupportedSchemaVersionException: 2019-07-15

I.e.,

<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
    <Metadata>
        <Creator>OCR-D/core 1.0.0b11</Creator>
        <Created>2019-07-31T10:44:05.838637</Created>
        <LastChange>2019-07-31T10:44:05.838637</LastChange>
    </Metadata>
    <Page imageFilename="https://digital.slub-dresden.de/data/kitodo/Brsfded_39946221X-18750125/Brsfded_39946221X-18750125_tif/jpegs/00000001.tif.original.jpg" imageWidth="1992" imageHeight="2450">
        <Border>
            <Coords points="83,89 1917,89 1917,2361 83,2361"/>
        </Border>
    </Page>
</PcGts>

@chris1010010
Contributor Author

Yes, by that I meant we don't have to change the XML handling procedure. New versions of the XML still have to be supported by the readers and writers. But no changes are required in terms of handling the version number, the storage of the schema files (online and offline) etc.

@bertsky
Contributor

bertsky commented Jul 31, 2019

@chris1010010 your draft already captures all the issues at hand IMV. I have 3 comments though:

I would like to point out that on the downsides of the current approach, it is not necessarily only software which needs to be updated when a new release comes out, but (under certain circumstances) also the data (PAGE instances) themselves. That is because certain kinds of PAGE-processing software (like generateDS based parsers or XSLT scripts) cannot handle multiple namespace versions – so as soon as they are updated, all documents must be made up to date as well. (Of course, for this kind of software, changing the data would still be necessary after major updates even in the proposed approach. This is merely a question of how often one is forced to do that.)

Moreover, regarding your last point on the downsides of the proposed approach, namely that changes in the schema would now have to be examined for backward compatibility: I don't think this can be counted as a downside at all. This is really a question of who carries the burden: Having a conscious, consensual decision once per release about whether or not the changes break existing semantics, which is then made fully visible by an increase in either version or namespace, is actually an upside for implementors and data providers. It might be considered a minor downside for standardizers, but usually they will be implementors themselves, so they have that burden already!

Regarding a revised namespace URI hosting scheme (i.e. namespace document), I don't think there is much of an industry standard in that area. There is a good discussion in this XMLVS proposal BTW.

@chris1010010
Contributor Author

Updated the Google Doc. I'll talk to Stefan

@chris1010010
Contributor Author

We discussed this again at length and came to the conclusion that we would like to keep the current versioning scheme. We really appreciate the interesting discussion and we understand the frustration this might cause. This is our reasoning:

  • PAGE was never intended to change very often. We think the (recently yearly) updates will slow down again.
  • We favour the more explicit approach of having one namespace per version (it leaves no room for interpretation). No reasoning is necessary when creating a new version and no extra responsibility is shifted to software tools.
  • When an XML file does not validate, it is always because it is in the wrong format, never because there is a new version the validator is not aware of.
  • Software can be updated to support multiple versions. For minor changes, XML conversion to new versions is trivial (see below).
  • We want to support a multitude of software tools and not just a couple of selected ones. The new versioning might break XPath solutions or similar that use indexed access.

We don't intend to make many additions ourselves (most recent additions came from the OCR-D side). So hopefully we can provide a stable format that works for most users.

> Now, we could version our code generation and devise some sort of selection mechanism based on the data. But then we'd also need to upgrade documents dynamically. E.g. adding @orientation to a pc:Page in a 2018 document entails first converting it to 2019 which is non-trivial with XML tools if the namespace changes but would be easy with an in-line versioning scheme.

Moving XMLs to a new namespace should be straightforward via a stylesheet, see for example: https://stackoverflow.com/questions/46533579/copying-elements-to-a-new-namespace-with-xslt
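For tools that do not use XSLT, the same namespace move can be sketched in a few lines of Python. This is only an illustration under the assumption that the newer schema is a pure extension of the older one, so no content needs converting; the function name is made up, and ElementTree drops comments and processing instructions when parsing.

```python
import xml.etree.ElementTree as ET

OLD = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15"
NEW = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def migrate_namespace(page_xml, old=OLD, new=NEW):
    """Move every element from namespace `old` to `new`, leaving all else untouched."""
    root = ET.fromstring(page_xml)
    old_prefix = "{%s}" % old
    for el in root.iter():
        if el.tag.startswith(old_prefix):
            el.tag = "{%s}%s" % (new, el.tag[len(old_prefix):])
    # Serialize the new namespace as the default one.
    # Note: register_namespace changes module-global state.
    ET.register_namespace("", new)
    return ET.tostring(root, encoding="unicode")
```

Attributes are unqualified in PAGE and therefore pass through unchanged; metadata such as LastChange is not updated by this sketch.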

4 participants