Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

150: don't use regex to clean up XML following pretty-printing #681

Merged
merged 1 commit into from
Feb 1, 2024

Conversation

lindsay-stevens
Copy link
Contributor

Closes #150

Default xml.minidom pretty-printer treats each node in a mixed-content element as a separate node. As a result it adds newlines and indents for each text node and output element, which is ugly and wrong. Then survey.py cleans that up with some regex.

This change copies and customises the default writexml implementation from minidom, and skips newline/indent for text nodes. It also includes conditional whitespace to match the previous processing, in case Collect or Enketo expected it to be exactly that way.

Added a test to specifically enumerate various output/text mixture permutations that appear in a variety of existing tests.

Why is this the best possible solution? Were any other approaches considered?

I think not outputting the undesirable whitespace in the first place is better than cleaning it up afterwards. Copying writexml adds a bit of complexity to the code base but I think the trade off vs the regex is worth it. Being able to do this without a new dependency is preferable. It may also allow more flexibility around users inputting whitespace in their question labels, without pyxform mangling it.

What are the regression risks?

While the added test should provide coverage, it's possible that some combination of text and references doesn't produce exactly the same white-spaced output as before, and assuming Collect or Enketo are sensitive to that it might change the output. But that risk seems pretty small. I'm not sure that the whitespace is actually significant in the first place.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run nosetests and verified all tests pass
  • run black pyxform tests to format code
  • verified that any code or assets from external sources are properly credited in comments

Copy link
Contributor

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Pretty printing isn't done by default so it's probably not very common but good to get rid of some regex.

Feel free to merge after addressing the conflict.

- default xml.minidom pretty-printer treats each node in a mixed-content
  element as a separate node. As a result it adds newlines and indents
  for each text node and output element, which is ugly and wrong. Then
  survey.py cleans that up with some regex.
- this change copies and customises the default writexml implementation
  from minidom, and skips newline/indent for text nodes. It also
  includes conditional whitespace to match the previous processing, in
  case Collect or Enketo expected it to be exactly that way.
- added a test to specifically enumerate various output/text mixture
  permutations that appear in a variety of existing tests.
@lindsay-stevens lindsay-stevens merged commit 8f75605 into XLSForm:master Feb 1, 2024
10 checks passed
@lindsay-stevens lindsay-stevens deleted the pyxform-150 branch February 1, 2024 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't use regex to clean up XML
2 participants