Challenge #10: Scrape information from a HTML table

For this challenge, we work with a legacy system that we can only access by reading a HTML page from it. The HTML contains a table that looks like this:

Date Order ID Name Price Quantity Total
2019-09-29 200398 Tech Gadget 87X $999.00 1 $999.00
2019-09-28 200397 Samsung HD Black $756.00 1 $756.00
2019-09-26 200396 Game Controller $22.00 2 $44.00
2019-09-25 200392 USB 3.0 Cable $10.00 3 $30.00
2019-09-24 200391 Smartwatch 4.0 LTE Wifi $199.00 6 $1494.00
2019-09-23 200390 Camera C430W 4k $699.00 1 $699.00
2019-09-22 200389 Macbook Pro $2199.00 1 $2199.00
2019-09-19 200387 Super SmartPhone 7 $999.00 1 $999.00
2019-09-18 200386 Super SmartPhone 7 $999.00 1 $999.00
2019-09-22 200389 Macbook Pro $2199.00 1 $2199.00
2019-09-21 200388 Game Console Controller $999.00 1 $999.00
2019-09-19 200387 Super SmartPhone 7 $1299.00 1 $1299.00
2019-09-18 200386 Super SmartPhone 7 $1299.00 1 $1299.00

Our task is to extract the information from the table, so we can put it into a database. We should be able to extract information into fields like this:

Begin with start.dfl (13.1 KB) . It includes the HTML table.

Hint: use the html.select function to select all rows, and extract the text from child cells.

The solution.dfl (25.2 KB) for this challenge uses the following strategy:

  • extract all rows from the html table
  • iterate over the rows
  • output cell values as fields

The Starting Point

We’re starting from a HTML string that that looks something like like this:

The data rows we’re looking for are the children of the tbody tag. The corresponding CSS selector is tbody > tr so we go ahead and use that in a call to html.select.

The call returns us a list of dicts, each representing a tr tag. Each row’s cells appear as a list under the :td key.

Our rows field is a list, so we can use it as in input to the Get Items step to iterate over the row items.

While we’re at it, we access the current row record, and collect the text content of each td cell in the step variable cells.

As a bonus, we also set up a number parser to properly extract monetary values from the cell text.

The output fields just reference our cells in order. Cells containing monetary values are first passed through our money parser.

As a last step we remove temporary fields from the row stream and arrive at the desired output: