Challenge #11: Joining data

For our next challenge, we’re looking at historical environmental data.

We get the average temperature and water level of a deep lake for every year between 1950 and 1970.

Use start.dfl (31.6 KB) as a starting point.

The inputs

The temperature is given in Fahrenheit like so:

1950	51.4
1951	52.7
...
1970	51.9

The water level is given in meters like so:

1950	578.12
1951	579.75
...
1970	579.31

The task

For any given year measurement data may be missing. We want an unbroken sequence of years from 1950-1970 with the associated measurements. If a measurement is missing, we use a nil value.

In addition, we need an new field giving the temperature in ºC.

Desired output

The desired output of our data flow should look like this:

Screenshot 2020-02-04 at 14.30.07

Here’s the reference solution.dfl (46.2 KB) for this challenge:

The strategy is as follows:

  • calculate the ºC value on the temp input
  • generate a sequence of years 1950-1970
  • outer join water level
  • outer join temperature
  • clean up fields

Temperature Conversion ºF to ºC

The formula to convert degrees fahrenheit to degrees celcius is: c = (f-32) * 5/9
We can easily put that into a formula in a calculator step. For bonus consistency, we can make the resulting ºC value a decimal with one fractional digit using the decimals.with_scale function.

Generating a sequence of years 1950-1970

We can use the Generate Rows step to generate 21 rows, and use the result value to populate an output field. I’ve prepared a configuration that stores the start and end year in entry variables, and generates the values automatically.

Outer Join on Year

Both the water level and the temperature streams are joined using an outer join on the generated year. The Join on Sorted Keys step does this for us. Performing an outer join ensures that we keep all years in the stream, even if there is no corresponding measurements.

The only thing left to do is to use the Pick Fields step to clean up the fields in our stream, and we’re done. The final output looks like requested: