For this challenge, we’re looking at a book catalog given to us in the form of books.xml (23.3 KB) .
It’s an XML file containing entries that look like this:
<book isbn="439554934" original_pub_year="1997"> <series entry="1">Harry Potter</series> <title original="Harry Potter and the Philosopher's Stone">Harry Potter and the Sorcerer's Stone</title> <authors> <author>J.K. Rowling</author> <author>Mary GrandPré</author> </authors> </book>
The format is as follows:
- We get the isbn and original year of publication as attributes on the
- If the book is part of a series we get its name in the
seriestag. It has an
entryattribute telling us where the book belongs in the series.
- The title of the book is given in the
titletag. There is an optional attribute that holds the original title, in case the book appears as a translation or the title was edited for our edition.
- The authors of the book come as a list of entries in the
We need to convert this file into an Excel sheet preserving all information. If there are multiple authors, we put them into a single field, and separate them by a comma.
The output file should have the following structure:
Grab the books.xml (23.3 KB) source file and create a data flow that produces the required file.
Hint: use the xml.select function to convert the xml to structured data. Note that the
:nested_textoptions can make it easier to consistently handle author tag multiplicity and optional attributes respectively.