[MNG-7830] Switch from plexus-xml to stax / woodstox #1185

Merged
merged 1 commit into from
Jun 29, 2023

@gnodet gnodet commented Jun 26, 2023

JIRA issue: https://issues.apache.org/jira/browse/MNG-7830
IT PR: apache/maven-integration-testing#274

Switch the underlying parser from the plexus-xml Xpp3 parser to the StAX API with the Woodstox parser.
The Woodstox StAX parser is 20% faster than the Xpp3 parser.

The PR also moves away from Modello-generated code for the repository metadata and core extensions XML models.

Some POMs are not valid XML files, for example:

Unable to parse using stax: /Users/gnodet/.m2/repository/org/eclipse/org.eclipse.osgi/3.8.0.v20120529-1548/org.eclipse.osgi-3.8.0.v20120529-1548.pom (Undeclared namespace prefix "xsi" (for attribute "schemaLocation")
 at [row,col,system-id]: [1,108,"/Users/gnodet/.m2/repository/org/eclipse/org.eclipse.osgi/3.8.0.v20120529-1548/org.eclipse.osgi-3.8.0.v20120529-1548.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/org/eclipse/equinox/org.eclipse.equinox.coordinator/1.1.0.v20120522-1841/org.eclipse.equinox.coordinator-1.1.0.v20120522-1841.pom (Undeclared namespace prefix "xsi" (for attribute "schemaLocation")
 at [row,col,system-id]: [1,108,"/Users/gnodet/.m2/repository/org/eclipse/equinox/org.eclipse.equinox.coordinator/1.1.0.v20120522-1841/org.eclipse.equinox.coordinator-1.1.0.v20120522-1841.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/org/eclipse/equinox/org.eclipse.equinox.region/1.2.101.v20150831-1342/org.eclipse.equinox.region-1.2.101.v20150831-1342.pom (Undeclared namespace prefix "xsi" (for attribute "schemaLocation")
 at [row,col,system-id]: [1,108,"/Users/gnodet/.m2/repository/org/eclipse/equinox/org.eclipse.equinox.region/1.2.101.v20150831-1342/org.eclipse.equinox.region-1.2.101.v20150831-1342.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/org/apache/hadoop/hadoop-project/3.2.2/hadoop-project-3.2.2.pom (Unexpected character '-' (code 45) (expected a name start character)
 at [row,col,system-id]: [1942,12,"/Users/gnodet/.m2/repository/org/apache/hadoop/hadoop-project/3.2.2/hadoop-project-3.2.2.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/activation/jakarta.activation/2.0.0/jakarta.activation-2.0.0.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [76,16,"/Users/gnodet/.m2/repository/com/sun/activation/jakarta.activation/2.0.0/jakarta.activation-2.0.0.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/activation/jakarta.activation/2.0.0-RC3/jakarta.activation-2.0.0-RC3.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [76,16,"/Users/gnodet/.m2/repository/com/sun/activation/jakarta.activation/2.0.0-RC3/jakarta.activation-2.0.0-RC3.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0-M5/jaxb-parent-3.0.0-M5.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [436,45,"/Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0-M5/jaxb-parent-3.0.0-M5.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0-M3/jaxb-parent-3.0.0-M3.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [434,45,"/Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0-M3/jaxb-parent-3.0.0-M3.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0-M4/jaxb-parent-3.0.0-M4.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [434,45,"/Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0-M4/jaxb-parent-3.0.0-M4.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0/jaxb-parent-3.0.0.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [436,45,"/Users/gnodet/.m2/repository/com/sun/xml/bind/mvn/jaxb-parent/3.0.0/jaxb-parent-3.0.0.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/istack/istack-commons/4.0.0-M2/istack-commons-4.0.0-M2.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [418,37,"/Users/gnodet/.m2/repository/com/sun/istack/istack-commons/4.0.0-M2/istack-commons-4.0.0-M2.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/istack/istack-commons/4.0.0-M3/istack-commons-4.0.0-M3.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [426,45,"/Users/gnodet/.m2/repository/com/sun/istack/istack-commons/4.0.0-M3/istack-commons-4.0.0-M3.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/istack/istack-commons/3.0.10/istack-commons-3.0.10.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [408,37,"/Users/gnodet/.m2/repository/com/sun/istack/istack-commons/3.0.10/istack-commons-3.0.10.pom"])
Unable to parse using stax: /Users/gnodet/.m2/repository/com/sun/istack/istack-commons/3.0.11/istack-commons-3.0.11.pom (Undeclared namespace prefix "Xlint"
 at [row,col,system-id]: [408,37,"/Users/gnodet/.m2/repository/com/sun/istack/istack-commons/3.0.11/istack-commons-3.0.11.pom"])

The StaxTest unit test (disabled by default) reads all POMs in the local repository.
We could disable namespace support when reading the POMs, but I'm not sure that's a good idea.
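
For illustration, here is a minimal sketch of what turning off namespace support looks like with the plain StAX API. It is not the code from this PR: the class name `NamespaceToggleSketch` and the `parses` helper are hypothetical, the sample POM is a made-up reduction of the `xsi:schemaLocation` case above, and it assumes the StAX implementation in use accepts `IS_NAMESPACE_AWARE=false` (support for that setting is optional in the StAX specification).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class NamespaceToggleSketch {

    /** Returns true if the whole document can be read without a parse error. */
    public static boolean parses(String xml, boolean namespaceAware) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: the implementation supports switching namespace processing off.
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, namespaceAware);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // an undeclared prefix surfaces here as XMLStreamException
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical POM fragment: the "xsi" prefix is used but never declared.
        String pom = "<project xsi:schemaLocation=\"http://maven.apache.org/POM/4.0.0 "
                + "http://maven.apache.org/xsd/maven-4.0.0.xsd\">"
                + "<modelVersion>4.0.0</modelVersion></project>";
        System.out.println("namespace-aware:   " + parses(pom, true));
        System.out.println("namespace-unaware: " + parses(pom, false));
    }
}
```

The trade-off is that a namespace-unaware reader treats `xsi:schemaLocation` as an ordinary attribute name, so any code that relies on namespace URIs or prefixes no longer gets them.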

@gnodet gnodet force-pushed the xml-experiments branch 2 times, most recently from 7c9420a to 04d6bd0 Compare June 28, 2023 11:30
@gnodet gnodet changed the title from "Move XML to STAX api" to "[MNG-7830] Switch from plexus-xml to stax / woodstox" Jun 28, 2023
@gnodet gnodet marked this pull request as ready for review June 28, 2023 11:31
@gnodet gnodet requested a review from elharo June 28, 2023 11:42

@elharo elharo left a comment

Wow. This looks great. Definitely a big leap forward.

I might have missed it, but I didn't find any explicit use of the Stax2 API or Woodstox classes. Could this be done with JDK classes only?

@@ -32,6 +32,14 @@ under the License.
<groupId>org.codehaus.plexus</groupId>
<artifactId>plexus-xml</artifactId>
</dependency>
<dependency>
<groupId>org.codehaus.woodstox</groupId>
<artifactId>stax2-api</artifactId>

Do we need stax2 or could we get away with the stax version in the JDK?


Yes, the round tripping (i.e. read the xml using stax and write it back to a string without any difference) is only supported by woodstox (the stax2-api is a mandatory dependency of woodstox anyway).

gnodet commented Jun 28, 2023

> Wow. This looks great. Definitely a big leap forward.
>
> I might have missed it, but I didn't find any explicit use of the Stax2 API or Woodstox classes. Could this be done with JDK classes only?

Stax2 / Woodstox is actually needed for the consumer POM transformation: the StAX API alone is not sufficient, because it does not allow full round-tripping of XML, as whitespace in the prolog is not reported. Even aalto-xml does not support it. The effect is that when using another implementation, the line breaks before the first element of the POM are removed, so the generated POM will usually contain the XML declaration, the license and the <project> element on a single line. Most importantly, this breaks tests :-)

Apart from this use case, changing the implementation leads to various small issues, as implementations sometimes differ slightly in the specific events they generate. Writing namespaces is particularly challenging. There is certainly a way to resolve those discrepancies, but again, it breaks a few ITs that are particularly sensitive to the exact XML generated.

Also, woodstox is 50% faster than the JDK implementation, so I definitely think we should use it.
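
To make the prolog point concrete, the following sketch lists the events a StAX reader reports before the root element. It is only an illustration under stated assumptions: `PrologEventsSketch` and `prologEvents` are hypothetical names, the sample document is invented, and the code uses whichever StAX implementation `XMLInputFactory.newInstance()` finds. Whether a `SPACE` event shows up between the declaration, the comment and the root element depends on that implementation and its configuration, which is exactly what the snippet lets you check.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PrologEventsSketch {

    /** Lists the event types reported before the root element starts. */
    public static List<String> prologEvents(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int type = reader.next();
                if (type == XMLStreamConstants.START_ELEMENT) {
                    break; // the prolog ends where the root element begins
                }
                events.add(name(type));
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read prolog", e);
        }
        return events;
    }

    private static String name(int type) {
        switch (type) {
            case XMLStreamConstants.SPACE: return "SPACE";
            case XMLStreamConstants.CHARACTERS: return "CHARACTERS";
            case XMLStreamConstants.COMMENT: return "COMMENT";
            case XMLStreamConstants.PROCESSING_INSTRUCTION: return "PROCESSING_INSTRUCTION";
            case XMLStreamConstants.DTD: return "DTD";
            default: return "OTHER(" + type + ")";
        }
    }

    public static void main(String[] args) {
        // Hypothetical POM header with line breaks between declaration, license and root.
        String pom = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!-- license -->\n<project/>";
        System.out.println(prologEvents(pom));
    }
}
```

If no whitespace event is listed, a reader-to-writer copy has nothing to write between those constructs, which is how they end up on a single line.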

elharo commented Jun 28, 2023

Depending on spaces in the prolog is a bug. Tests should be comparing XML to XML, not XML to strings. The latter is extremely brittle and can cause tests to break even in minor upgrades of a library. Alternately, we can canonicalize documents before comparing them. If you file issues on specific tests that do that, I can take a look.
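
As a rough illustration of comparing XML to XML rather than strings, the sketch below parses both documents with the JDK DOM parser and compares the root element trees. The names `XmlCompareSketch` and `sameRootTree` are hypothetical and this is not taken from the Maven test suite. It ignores everything outside the root element (prolog whitespace and comments), while whitespace inside the root element still counts; a real test would more likely canonicalize the documents or use a dedicated library such as XMLUnit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlCompareSketch {

    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            doc.normalize(); // merge adjacent text nodes so the comparison is stable
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Compares the root element trees of two documents, ignoring the prolog. */
    public static boolean sameRootTree(String left, String right) {
        return parse(left).getDocumentElement()
                .isEqualNode(parse(right).getDocumentElement());
    }

    public static void main(String[] args) {
        String multiLine = "<?xml version=\"1.0\"?>\n<!-- license -->\n"
                + "<project><modelVersion>4.0.0</modelVersion></project>";
        String singleLine = "<?xml version=\"1.0\"?><!-- license -->"
                + "<project><modelVersion>4.0.0</modelVersion></project>";
        System.out.println(sameRootTree(multiLine, singleLine));
    }
}
```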

gnodet commented Jun 28, 2023

> Depending on spaces in the prolog is a bug. Tests should be comparing XML to XML, not XML to strings. The latter is extremely brittle and can cause tests to break even in minor upgrades of a library. Alternately, we can canonicalize documents before comparing them. If you file issues on specific tests that do that, I can take a look.

That's not really the problem. I could hack the tests. However, the resulting XML is what gets uploaded to Central, and it is ugly (because of the missing line breaks). So I think this is important to keep.

@gnodet gnodet force-pushed the xml-experiments branch 2 times, most recently from d232c27 to 543ea16 Compare June 28, 2023 16:11

elharo commented Jun 28, 2023

I am sure there are ways to fix that.

elharo commented Jun 28, 2023

Might or might not be related:

[INFO]
Error: Errors:
Error: DefaultMavenProjectBuilderTest.rereadPom_mng7063:334 » FileSystem C:\Users\RUNNER~1\AppData\Local\Temp\junit5723635972946138167\pom.xml: The process cannot access the file because it is being used by another process.

Error: ProjectBuilderTest.testReadModifiedPoms(Path) » IO Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit3720659451607426598. The following paths could not be deleted (see suppressed exceptions for details): , child
[INFO]
Error: Tests run: 428, Failures: 0, Errors: 2, Skipped: 1
[INFO]

gnodet commented Jun 28, 2023

I plan to provide additional support for xml namespaces, but I think a follow-up PR may be better. I'll stop on that one.

@gnodet gnodet force-pushed the xml-experiments branch from 144c8e9 to bcefec1 Compare June 28, 2023 20:22
@gnodet gnodet merged commit e39142b into apache:master Jun 29, 2023
@gnodet gnodet deleted the xml-experiments branch November 18, 2023 21:19