When using an ETL tool, you often create a workflow that will run multiple times, picking up new and changed records. Getting just those changes is pretty easy using Kettle’s Insert/Update tool. (BTW – Kettle is one component in the Pentaho Data Integration application – PDI for short).
Assumptions and requirements
For this tutorial, I am assuming you have access to a MySQL (or MariaDB) database server. We’ll be creating a a sample database based on one originally created by Fusheng Wang and Carlo Zaniolo at Siemens Corporate Research. For our purposes we only need one table with a small amount of data. Copy the script below and save it as a SQL file on your system. Run it in MySQL to create the database and populate the table (Yes Production is spelling incorrectly).
DROP DATABASE IF EXISTS sample;
CREATE DATABASE IF NOT EXISTS sample;
CREATE TABLE departments
( dept_no CHAR(4) NOT NULL,
dept_name VARCHAR(40) NOT NULL,
PRIMARY KEY (dept_no),
UNIQUE KEY (dept_name) );
INSERT INTO `departments`
Once you have it loaded, do a select on your departments table, and it should return five records.
Now copy the lines below into a text file using a basic text editor (Notepad, TextEdit, Notepad++, Gedit, etc) and save it as changes.txt to a location you can reach from within Pentaho. Notice we are adding four departments and correcting the spelling on the Production department.
- Open up Pentaho and create a new database connection to your sample database. Be sure to test the connection to make sure you can access the data.
- Create a new transformation, and from the Design tab, under the Input node, drag a Text File Input step onto the canvas.
- From under the Output node, drag an Insert/Update step onto the canvas.
- Connect the two, by holding down your Shift key and click on the Input step. Drag over to the output step and release your mouse. The results should look like this:
5. Double click the Text file in
put step, and on the first tab – File, click the Browse button and navigate to where you saved the changes.txt file, choosing OK on the file location window to be returned to the File tab. Click the Add button to move the file down to the Selected Files table, like this.
6. At this point, you can click the Show File Content button at the bottom to see what is in the text file. It should look like this:
7. Close the Preview data window, and double click on the Insert/Update icon. We need to populate the information here to tell Pentaho how to handle our incoming data. We need to specify:
- the database connection: “Sample”
- the target table: “departments”
- whether to perform updates or not. Check the box to only add new records to your database.
- the key(s) to look up the value(s): dept_no = dept_no. Because we are using the same field names in the input and output streams we can have the same values here. You can use multiple input fields to define your table field, and you can choose different comparison operators as well. Click the Get fields button if you would like to have the fields populated for you. Warning – this will get all of the fields not just primary key ones.
- In the bottom table, you specify where data coming in from your source gets written to in your destination. If you have a lot of fields in your tables, use the Get update fields button to populate the table. You can always delete items. If you want fields to be updated when changed data comes in, make sure the Update column in this table is set to Y. Change it to N for those fields you don’t want to change.
- The results should look like this:
8. Click OK to return to the canvas. Click the run button to run your transformation, and if you did everything correctly, you should see these results:
Notice that the Input was five rows, but the Output only says 4. The difference is in the Updated field, where there is one record. Pentaho added four new records to your table, and updated one.
Verify the results by switching back to MySQL and doing a Select on the table. You should see the four new departments, and the updated Production department with the correct spelling.
If your source database has a timestamp field, cast it as a date or a time field to be able to update it in your destination table