This repository was archived by the owner on Mar 24, 2025. It is now read-only.

UDF to parse a column with an XML string #322

@stevenmanton


I love this package, but I often run into a scenario where I have a DataFrame with several columns, one of which contains an XML string that I would like to parse. Since this package only works with files, parsing that column means selecting it, saving it to disk, and then reading it back with this library.

I'd love a UDF that would parse the column in place: for example, a new function parseXML that takes the XML string and returns a struct whose fields you could reference in the normal way. Maybe something along the lines of the following.

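from pyspark.sql.functions import col

# parseXML is the proposed function; it does not exist yet
(
    df
    .withColumn("parsed_XML", parseXML("xml_column"))
    .withColumn("field1", col("parsed_XML.field1"))
    .withColumn("array0", col("parsed_XML.array").getItem(0))
)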
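
To make the idea more concrete, here is a rough sketch of what the per-row parsing step might look like in plain Python, using only the standard library. The name parse_xml_fields and the flat "child tag to text" output shape are assumptions made for illustration; this is not how the package parses XML, and it ignores attributes and nested elements.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_xml_fields(xml_string: Optional[str]) -> Optional[dict]:
    """Hypothetical helper: map each child tag of the root to its text.

    Repeated child tags are collected into a list, loosely mimicking
    how an array field might appear in the parsed struct.
    """
    if xml_string is None:
        return None  # keep nulls as nulls
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        return None  # treat malformed XML as a null result

    result = {}
    for child in root:
        if child.tag in result:
            existing = result[child.tag]
            if isinstance(existing, list):
                existing.append(child.text)
            else:
                # second occurrence of a tag: promote the value to a list
                result[child.tag] = [existing, child.text]
        else:
            result[child.tag] = child.text
    return result


row = parse_xml_fields(
    "<rec><field1>a</field1><array>1</array><array>2</array></rec>"
)
print(row)  # {'field1': 'a', 'array': ['1', '2']}
```

A function like this could presumably be wrapped with pyspark.sql.functions.udf, but Spark would need an explicit return type, so the struct schema would have to be supplied or inferred up front. That seems like the main design question for doing this properly.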

I'm happy to try to implement this, but I'm hoping the core devs can provide some early feedback. Is this doable? Is it worthwhile? Any suggestions on the right approach?
