Enable reading non-utf-8 encodings for java pom.xml files #2047

Merged: wagoodman merged 3 commits into main from fix-pom-xml-decoding on Aug 23, 2023
Conversation

@wagoodman
Contributor

@wagoodman wagoodman commented Aug 21, 2023

Currently, decoding a pom.xml returns an error when the document has no encoding="..." attribute in its XML declaration and contains bytes that are not valid UTF-8. This PR adds encoding detection ahead of the XML decoder so that the reader can be wrapped with an adapter that transforms the input to UTF-8 during Read() calls.

Fixes: #2044

CC: @westonsteimel

@wagoodman wagoodman requested a review from a team August 21, 2023 20:12
@wagoodman wagoodman self-assigned this Aug 21, 2023
@github-actions

github-actions Bot commented Aug 21, 2023

Benchmark Test Results

Benchmark results from the latest changes vs base branch
goos: linux
goarch: amd64
pkg: github.com/anchore/syft/test/integration
cpu: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz

Results file: ./.tmp/benchmark-101466e.txt

| Benchmark | sec/op | B/op | allocs/op |
| --- | --- | --- | --- |
| ImagePackageCatalogers/alpmdb-cataloger-2 | 12.54m ± 3% | 5.138Mi ± 0% | 88.07k ± 0% |
| ImagePackageCatalogers/apkdb-cataloger-2 | 713.1µ ± 22% | 184.3Ki ± 0% | 4.034k ± 0% |
| ImagePackageCatalogers/binary-cataloger-2 | 206.2µ ± 2% | 30.47Ki ± 0% | 848.0 ± 0% |
| ImagePackageCatalogers/dpkgdb-cataloger-2 | 613.4µ ± 2% | 141.5Ki ± 0% | 2.911k ± 0% |
| ImagePackageCatalogers/dotnet-portable-executable-cataloger-2 | 22.61µ ± 2% | 3.695Ki ± 0% | 132.0 ± 0% |
| ImagePackageCatalogers/go-module-binary-cataloger-2 | 98.90µ ± 2% | 9.906Ki ± 0% | 281.0 ± 0% |
| ImagePackageCatalogers/java-cataloger-2 | 18.46m ± 2% | 3.064Mi ± 0% | 40.61k ± 0% |
| ImagePackageCatalogers/graalvm-native-image-cataloger-2 | 96.26µ ± 1% | 8.594Ki ± 0% | 228.0 ± 0% |
| ImagePackageCatalogers/javascript-package-cataloger-2 | 386.9µ ± 1% | 83.80Ki ± 0% | 1.264k ± 0% |
| ImagePackageCatalogers/nix-store-cataloger-2 | 280.7µ ± 1% | 38.94Ki ± 0% | 820.0 ± 0% |
| ImagePackageCatalogers/php-composer-installed-cataloger-2 | 806.4µ ± 2% | 155.2Ki ± 0% | 3.846k ± 0% |
| ImagePackageCatalogers/portage-cataloger-2 | 489.7µ ± 1% | 109.8Ki ± 0% | 2.194k ± 0% |
| ImagePackageCatalogers/python-package-cataloger-2 | 3.381m ± 3% | 986.1Ki ± 0% | 16.13k ± 0% |
| ImagePackageCatalogers/r-package-cataloger-2 | 204.3µ ± 1% | 42.91Ki ± 0% | 851.0 ± 0% |
| ImagePackageCatalogers/rpm-db-cataloger-2 | 563.7µ ± 3% | 170.9Ki ± 0% | 3.914k ± 0% |
| ImagePackageCatalogers/ruby-gemspec-cataloger-2 | 927.5µ ± 0% | 123.3Ki ± 0% | 2.291k ± 0% |
| ImagePackageCatalogers/sbom-cataloger-2 | 121.4µ ± 6% | 14.20Ki ± 0% | 394.0 ± 0% |
| geomean | 503.0µ | 92.99Ki | 1.997k |

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman force-pushed the fix-pom-xml-decoding branch from 1ac86cf to 7566b29 Compare August 22, 2023 00:27
Comment thread syft/pkg/cataloger/java/parse_pom_xml.go Outdated
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
…test unknown encoding

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman merged commit 17d4203 into main Aug 23, 2023
@wagoodman wagoodman deleted the fix-pom-xml-decoding branch August 23, 2023 14:06
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* fix reading non utf8 encodings

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* in cases where we cant tell the encoding use the UTF8 replacement char

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* decompose the xml decoding func to get a valid utf8 reader first and test unknown encoding

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

---------

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Syft seems unable to parse non UTF-8 pom.xml files