This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Description
I love this package, but I often run into a scenario where a DataFrame has several columns, one of which contains an XML string that I would like to parse. Because this package only reads from files, parsing that column means selecting it, saving it to disk, and then reading it back with this library.
I'd love a UDF that parses the column in place. For example, a new function parseXML that takes the XML string and returns a struct whose fields you could reference in the normal way. Maybe something along the lines of the following.
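from pyspark.sql.functions import col

# parseXML is the proposed function; it does not exist yet
(
df
.withColumn("parsed_XML", parseXML("xml_column"))
.withColumn("field1", col("parsed_XML.field1"))
.withColumn("array0", col("parsed_XML.array").getItem(0))
)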
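To make the idea concrete, here is a rough sketch of the per-row parsing such a function might do, written as plain Python with the standard library's `xml.etree.ElementTree`. The name `parse_xml_record`, the `<record>` layout, and the `field1` / `array` element names are assumptions made up for illustration; this is not how the package works today, and a real implementation would need to handle arbitrary schemas.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_xml_record(xml_string: Optional[str]) -> Optional[dict]:
    """Parse one XML string into a dict shaped like the desired struct.

    Hypothetical layout: a single <field1> element and repeated <array>
    elements directly under the root. Returns None for null or malformed
    input so that one bad row does not fail the whole job.
    """
    if xml_string is None:
        return None
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        return None
    return {
        "field1": root.findtext("field1"),
        "array": [el.text for el in root.findall("array")],
    }


sample = "<record><field1>a</field1><array>x</array><array>y</array></record>"
parsed = parse_xml_record(sample)

# Wrapping it as a PySpark UDF could then look roughly like this
# (left commented out so the sketch runs without a Spark session):
#
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StringType, StructField, StructType
#
# schema = StructType([
#     StructField("field1", StringType()),
#     StructField("array", ArrayType(StringType())),
# ])
# parseXML = udf(parse_xml_record, schema)
```

A Python UDF like this hardcodes the schema and pays the usual Python serialization cost per row, which is part of why I think support inside the package itself would be the better home for it.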
I'm happy to try to implement this, but I'm hoping the core devs can provide some early feedback. Is this doable? Is it worthwhile? Any suggestions on the right approach?