Understanding empty tag behavior #741

Open
phdavis1027 opened this issue Apr 26, 2024 · 5 comments
@phdavis1027
Contributor

First of all, I want to thank everyone involved in this project for the excellent work they've done. It's absurdly fast and fits great in my project.

I have a question about expected behavior for empty tags. I have some XML that looks like this:

...
<value></value>
<value></value>
...
<value></value>
...

That is being parsed by this code:

match (state, reader.read_event()?) {
    (State::ResultsInnerValueInner, Event::Text(e)) => {
        column.push(e.unescape_with(irods_unescapes)?.to_string());
        State::ResultsInnerValue
    }
}

When I later print this value out, it has the value "\n". Is this expected behavior? I think I've seen it a couple other times. I would have guessed that the output would be the empty &str.

@Mingun
Collaborator

Mingun commented Apr 26, 2024

I can't say what the cause is without the full code, but I believe you're getting the text between </value> and the next <value>. You should check that your state management is correct.

It would also be good to use dbg!(state, reader.read_event()?) to see exactly what you are matching on.

@phdavis1027
Contributor Author

Oh interesting. I suppose I assumed that Text events only occurred in the context of something like <tag>...</tag>, but debugging does seem to show that they're appearing in </tag><tag> contexts and I've just gotten lucky so far. Thanks for the lead.
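To make the point above concrete, here is a minimal sketch using only the standard library. It does not use quick-xml: `Ev`, `tokenize`, and `collect_values` are hypothetical names invented for illustration, and the toy tokenizer handles only bare `<name>` / `</name>` tags and raw text. It shows that whitespace between `</value>` and the next `<value>` arrives as a text event, and one way to track state so that only text inside `<value>` is collected.

```rust
/// Simplified stand-in for parser events (hypothetical; not quick-xml's `Event`).
#[derive(Debug, Clone, PartialEq)]
enum Ev {
    Start(String),
    End(String),
    Text(String),
}

/// Toy tokenizer: handles only `<name>`, `</name>` and raw text.
/// No attributes, comments, CDATA, or entities.
fn tokenize(xml: &str) -> Vec<Ev> {
    let mut events = Vec::new();
    let mut rest = xml;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Markup runs up to the next '>'.
            let end = after_lt.find('>').unwrap_or(after_lt.len());
            let name = &after_lt[..end];
            match name.strip_prefix('/') {
                Some(n) => events.push(Ev::End(n.to_string())),
                None => events.push(Ev::Start(name.to_string())),
            }
            rest = after_lt.get(end + 1..).unwrap_or("");
        } else {
            // Everything up to the next '<' is text, including bare whitespace.
            let end = rest.find('<').unwrap_or(rest.len());
            events.push(Ev::Text(rest[..end].to_string()));
            rest = &rest[end..];
        }
    }
    events
}

/// Collects one string per `<value>` element, ignoring text outside of it.
fn collect_values(events: &[Ev]) -> Vec<String> {
    let mut column = Vec::new();
    // `Some(buf)` while inside <value>, `None` otherwise.
    let mut current: Option<String> = None;
    for ev in events {
        match ev {
            Ev::Start(name) if name == "value" => current = Some(String::new()),
            Ev::Text(t) => {
                // Text between elements is dropped because `current` is `None` there.
                if let Some(buf) = current.as_mut() {
                    buf.push_str(t);
                }
            }
            Ev::End(name) if name == "value" => {
                // An empty element never saw a text event, so this pushes "".
                if let Some(buf) = current.take() {
                    column.push(buf);
                }
            }
            _ => {}
        }
    }
    column
}
```

In this model, `tokenize("<value></value>\n<value>a</value>")` yields a `Text("\n")` event between the first end tag and the second start tag, while the empty element itself produces no text event. Pushing the value on the end tag rather than on the text event is what lets `collect_values` return an empty string for it.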

@Mingun
Collaborator

Mingun commented Apr 26, 2024

Also, consuming only Event::Text events is error-prone. In XML, all text events should be concatenated together with CDATA content, and any comments between them should be dropped. The code that takes all of these nuances into account is quite large, and unfortunately quick-xml has no good out-of-the-box API for this (note self.drain_text(...)):

quick-xml/src/de/mod.rs

Lines 2222 to 2243 in e8ae020

fn next(&mut self) -> Result<DeEvent<'i>, DeError> {
    loop {
        return match self.next_impl()? {
            PayloadEvent::Start(e) => Ok(DeEvent::Start(e)),
            PayloadEvent::End(e) => Ok(DeEvent::End(e)),
            PayloadEvent::Text(mut e) => {
                if self.need_trim_end() && e.inplace_trim_end() {
                    continue;
                }
                self.drain_text(e.unescape_with(|entity| self.entity_resolver.resolve(entity))?)
            }
            PayloadEvent::CData(e) => self.drain_text(e.decode()?),
            PayloadEvent::DocType(e) => {
                self.entity_resolver
                    .capture(e)
                    .map_err(|err| DeError::Custom(format!("cannot parse DTD: {}", err)))?;
                continue;
            }
            PayloadEvent::Eof => Ok(DeEvent::Eof),
        };
    }
}
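The merging idea described above can be sketched without quick-xml at all. The following is a simplified model, not the library's implementation: `RawEv` and `merge_text` are hypothetical names, and the sketch ignores trimming, unescaping, and DTD handling, all of which the real deserializer code deals with.

```rust
/// Simplified stand-in for low-level parser events
/// (hypothetical; not quick-xml's types).
#[derive(Debug, Clone, PartialEq)]
enum RawEv {
    Start(String),
    End(String),
    Text(String),
    CData(String),
    Comment(String),
}

/// Merges each run of Text/CData events into a single Text event and drops
/// comments, so a comment in the middle of a run does not split the text.
fn merge_text(events: Vec<RawEv>) -> Vec<RawEv> {
    let mut out = Vec::new();
    // Text accumulated so far for the current run, if any.
    let mut pending: Option<String> = None;
    for ev in events {
        match ev {
            RawEv::Text(s) | RawEv::CData(s) => {
                pending.get_or_insert_with(String::new).push_str(&s);
            }
            // Comments are skipped and do not end the current run.
            RawEv::Comment(_) => {}
            other => {
                // Any other event ends the run: flush the merged text first.
                if let Some(text) = pending.take() {
                    out.push(RawEv::Text(text));
                }
                out.push(other);
            }
        }
    }
    // Flush trailing text at the end of input.
    if let Some(text) = pending.take() {
        out.push(RawEv::Text(text));
    }
    out
}
```

With this model, the sequence Start, Text("a"), Comment, CData("b"), End collapses to Start, Text("ab"), End, so a consumer that only looks at text events sees the whole element content at once.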

@dralley
Collaborator

dralley commented Jul 1, 2024

@Mingun presumably the Reader / RawReader distinction will also handle the concatenation of CDATA and Text?

@Mingun
Collaborator

Mingun commented Jul 1, 2024

Yes, I plan to merge text events in the new Reader. I think the average user does not need the degree of control that access to each individual text event provides.
