At my previous day job, we were often tasked with producing data extracts for distribution to outside companies that sold our products. To accomplish this, we used a well-known reporting platform to produce either Excel or CSV files. While it worked, it was like using a hammer to drive in a screw: it's just not the best tool for the job. So during a rare period of calm, I dissected a couple of our extract reports and attempted to convert them to Pentaho Data Integration (aka Kettle).
Because we were combining data from multiple sources into one unified output file, each extract was made up of several queries. The first query would pull the bulk of the data, and specific fields from those results were then used as filters to pull additional fields from tables in other data sources. The same concept applies in Pentaho, making it unnecessary to temporarily store the data from the first query and join it later with data from subsequent ones. An additional benefit is that processing this way tends to be much quicker.
This type of processing can also be used to denormalize data when building a data warehouse, and in many other ETL scenarios.
As an example, I'll combine a US population-by-zip-code data set with an edited 2010 crime statistics data set from the city of Austin, TX. If the two data sets lived in the same database, it would be easy enough to join the two tables and combine them that way. Instead, I loaded the data into two different MySQL databases, though they could just as easily have been loaded into two different RDBMSs.
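For reference, here is what the single-database case would reduce to: a plain join on zip code. This is a minimal sketch using sqlite3 as a stand-in for MySQL, with illustrative column names and made-up sample rows:

```python
import sqlite3

# sqlite3 stands in for MySQL here; columns and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Crime2010 (ZipCode TEXT, OffenseCount INTEGER);
    CREATE TABLE Austin    (Zip TEXT, Population INTEGER);
    INSERT INTO Crime2010 VALUES ('78701', 120), ('78702', 85);
    INSERT INTO Austin    VALUES ('78701', 6841), ('78702', 21334);
""")

# With both tables in one database, the lookup collapses to a join.
rows = conn.execute("""
    SELECT c.ZipCode, c.OffenseCount, a.Population
    FROM Crime2010 c
    JOIN Austin a ON a.Zip = c.ZipCode
    ORDER BY c.ZipCode
""").fetchall()
print(rows)  # [('78701', 120, 6841), ('78702', 85, 21334)]
```

The rest of the article shows how Pentaho achieves the same result when the tables are in separate databases and no such join is possible.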
If you would like to follow along, here are the statements to create the databases and tables in MySQL:
-- Column definitions below are illustrative; adjust them to match your data sets.
CREATE DATABASE CRIME;
CREATE TABLE CRIME.Crime2010 (ZipCode VARCHAR(10), OffenseType VARCHAR(64), OffenseCount INT);

CREATE DATABASE POPULATION;
CREATE TABLE POPULATION.Austin (Zip VARCHAR(10), Population INT);

CREATE DATABASE CRIME_BY_POPULATION;
CREATE TABLE CRIME_BY_POPULATION.Combined_Crime2010 (ZipCode VARCHAR(10), OffenseType VARCHAR(64), OffenseCount INT, Population INT);
The demographic data set includes only zip codes and the population of each zip code from the 2010 US Census. To set up this extract, I could write either data set to a temporary table in the other database. While that approach would work, it adds an unnecessary step to the process, and in real-world situations you may not have the luxury of creating temporary tables.
So instead, I am going to pull data from both databases and write the combined data to a third database. To begin, start a new Pentaho transformation and add a Table Input step to the workspace (it's under Input in the Design palette). Open the step, make a connection to your database, and construct a query to pull the bulk of the data: in this instance, the crime data for various zip codes in the Austin area.
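Conceptually, the Table Input step just runs one SELECT and emits the results as the stream passed to the next step. A rough sketch, with sqlite3 standing in for MySQL and assumed column names:

```python
import sqlite3

# sqlite3 stands in for the MySQL source; columns and the sample row are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Crime2010 (ZipCode TEXT, OffenseType TEXT, OffenseCount INTEGER)")
conn.execute("INSERT INTO Crime2010 VALUES ('78701', 'Theft', 120)")

# The "bulk" query: everything it returns becomes the processing stream.
bulk = conn.execute(
    "SELECT ZipCode, OffenseType, OffenseCount FROM Crime2010"
).fetchall()
print(bulk)  # [('78701', 'Theft', 120)]
```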
Add a Database Value Lookup step from the Lookup node in the Design panel, and connect the Table Input step to it. The data from the Table Input step is passed to the lookup step, where the database connection and the table that holds the data we want to look up are defined. To make it work, key fields are defined (essentially a join between the two tables).
In this instance, the Table Input step gives us a ZipCode field, and the population table in the other database has a similar field called Zip. In the keys to look up section, you can click a drop-down in the Table Field area to choose fields from the lookup table, then select a Comparator operator in the next field. The equals operator is the most common, but others such as LIKE, BETWEEN, IS NULL, IS NOT NULL, and various greater-than and less-than combinations are also available.
In the Values to return from the lookup table section, you can again choose fields from your lookup table from drop-downs. You can rename a field if you like, set a default value, and specify the data type.
Finally, at the bottom of the window, there are two options for fine-tuning the Database Value Lookup and filtering its results. Check the first option if you don't want a row passed on when the lookup returns no results. Check the second option if you want the lookup to fail when it returns more than one result.
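A rough Python analogue of what the lookup step does per row may make these two options concrete. The function name, the in-memory lookup table, and the sample rows are all hypothetical, not Pentaho's implementation:

```python
# Zip -> matching Population rows; a stand-in for the lookup table (made-up data).
lookup_table = {"78701": [6841], "78702": [21334]}

def database_lookup(rows, drop_on_no_match=True, fail_on_multiple=True):
    for row in rows:
        matches = lookup_table.get(row["ZipCode"], [])
        if not matches:
            if drop_on_no_match:      # option 1: do not pass the row on a failed lookup
                continue
            row["Population"] = None  # otherwise pass the row with the default value
        elif len(matches) > 1 and fail_on_multiple:
            # option 2: fail when the lookup returns more than one result
            raise ValueError("multiple lookup rows for " + row["ZipCode"])
        else:
            row["Population"] = matches[0]
        yield row

stream = [{"ZipCode": "78701"}, {"ZipCode": "99999"}]
enriched = list(database_lookup(stream))
print(enriched)  # [{'ZipCode': '78701', 'Population': 6841}]
```

Note how the unmatched row for zip 99999 is silently dropped when the first option is on; with it off, the row would pass through with the default value instead.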
Drag a Table Output step onto the canvas and connect the lookup step to it. Open the step and define your destination connection information. In this window, I generally use the “Specify database fields” option to have more control over the mapping process. Check the “Truncate table” option if you want to remove anything in your destination table before writing to it.
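A minimal sketch of the “Truncate table” behavior, using sqlite3 and an illustrative two-column destination table (the function name and data are assumptions):

```python
import sqlite3

# sqlite3 stands in for the destination MySQL database; data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Combined_Crime2010 (ZipCode TEXT, Population INTEGER)")
conn.execute("INSERT INTO Combined_Crime2010 VALUES ('00000', 0)")  # stale row from a prior run

def write_rows(rows, truncate=True):
    if truncate:  # clear the destination first, as the option does
        conn.execute("DELETE FROM Combined_Crime2010")
    conn.executemany("INSERT INTO Combined_Crime2010 VALUES (?, ?)", rows)

write_rows([("78701", 6841)])
final = conn.execute("SELECT * FROM Combined_Crime2010").fetchall()
print(final)  # [('78701', 6841)]
```

Without truncation, each run would append to whatever the destination table already holds.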
If you chose the “Specify database fields” option, click the Database fields tab, then click the Get fields button on the right. The “Fields to insert” table should populate, showing the fields from the processing stream on the right and the destination table fields on the left. If you want to make any changes, click the Enter field mapping button and adjust the mappings.
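The field mapping amounts to renaming stream fields to the destination table's column names before the insert. A hypothetical sketch (the mapping, field names, and helper are illustrative, not Pentaho's API):

```python
# stream field name -> destination column name (hypothetical example)
field_mapping = {"Zip": "ZipCode", "CrimeCount": "OffenseCount"}

def map_fields(row, mapping):
    # Build the row to insert, keyed by destination column names.
    return {table_col: row[stream_field] for stream_field, table_col in mapping.items()}

mapped = map_fields({"Zip": "78701", "CrimeCount": 120}, field_mapping)
print(mapped)  # {'ZipCode': '78701', 'OffenseCount': 120}
```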
Click OK to close the window, and save your transformation. Run it, and your third database should now contain a combined data set of crime by zip code along with the population of each zip code.
Pentaho and Kettle are products of Pentaho Inc.