For this challenge, we’re looking at a book catalog given to us in the form of books.xml (23.3 KB) .
Data Source
It’s an XML file containing entries that look like this:
<book isbn="439554934" original_pub_year="1997">
<series entry="1">Harry Potter</series>
<title original="Harry Potter and the Philosopher's Stone">Harry Potter and the Sorcerer's Stone</title>
<authors>
<author>J.K. Rowling</author>
<author>Mary GrandPré</author>
</authors>
</book>
The format is as follows:
- We get the isbn and original year of publication as attributes on the
book
tag - If the book is part of a series we get its name in the
series
tag. It has anentry
attribute telling us where the book belongs in the series. - The title of the book is given in the
title
tag. There is an optional attribute that holds the original title, in case the book appears as a translation or the title was edited for our edition. - The authors of the book come as a list of entries in the
authors
tag.
Transformation Target
We need to convert this file into an Excel sheet preserving all information. If there are multiple authors, we put them into a single field, and separate them by a comma.
The output file should have the following structure:
Grab the books.xml (23.3 KB) source file and create a data flow that produces the required file.
Hint: use the xml.select function to convert the xml to structured data. Note that the
:nested_tags
and:nested_text
options can make it easier to consistently handle author tag multiplicity and optional attributes respectively.