Challenge #12: Convert an XML file to Excel

slawo · February 10, 2020, 5:22pm

For this challenge, we’re looking at a book catalog given to us in the form of books.xml (23.3 KB) .

Data Source

It’s an XML file containing entries that look like this:

    <book isbn="439554934" original_pub_year="1997">
        <series entry="1">Harry Potter</series>
        <title original="Harry Potter and the Philosopher's Stone">Harry Potter and the Sorcerer's Stone</title>
        <authors>
            <author>J.K. Rowling</author>
            <author>Mary GrandPré</author>
        </authors>
    </book>

The format is as follows:

We get the isbn and original year of publication as attributes on the book tag
If the book is part of a series we get its name in the series tag. It has an entry attribute telling us where the book belongs in the series.
The title of the book is given in the title tag. There is an optional attribute that holds the original title, in case the book appears as a translation or the title was edited for our edition.
The authors of the book come as a list of entries in the authors tag.

Transformation Target

We need to convert this file into an Excel sheet preserving all information. If there are multiple authors, we put them into a single field, and separate them by a comma.

The output file should have the following structure:

Grab the books.xml (23.3 KB) source file and create a data flow that produces the required file.

Hint: use the xml.select function to convert the xml to structured data. Note that the :nested_tags and :nested_text options can make it easier to consistently handle author tag multiplicity and optional attributes respectively.

slawo · February 18, 2020, 2:40pm

The solution.dfl (37.1 KB) strategy is:

read the XML file into a string
use xml.select to parse it into a data structure
put each book’s data into a row
write all rows into an Excel file

Screenshot 2020-02-18 at 15.27.07

To read all contents of a file into a single row, we use the Read File step. There’s nothing remarkable about the configuration.

We use the Get Items step to iterate over the contained book tags. To do that, we need to extract the book tags from the xml into a list.
The function xml.select allows us to do just that.

We specify that we’d like authors to always parse into a list, as sometimes there are multiple, and we want a consistent layout. Similarly, the title tag sometimes has attributes, so we always want its content in a separate text node.

Enabling these options gives us consistency in the parsed data structures.

Extracting the data is then a simple matter of accessing the record in the book field, and pulling the data into fields.

We can rely on book[:authors][:author] always being a list, so we can use strings.join to create a comma-separated list of authors.

At this point, we have all book data in individual fields.

The only thing left to do is to write out these fields into an Excel file using the Excel Output step.

Grab the solution.dfl (37.1 KB) file to try it yourself.