Emailing files from Kettle

A common task I encounter when working with ETL tools is sending output files somewhere. I often have to FTP files, but just as often I need to email them. I'll cover how to FTP in a future post, but this time I'll walk through how to set up a job in Pentaho Kettle (aka Pentaho Data Integration, or PDI) to email data files.

Unlike the "Put FTP" step in PDI, where you can specify the file or files to upload directly in the job entry, sending files via email requires a transformation step that defines the files you want to send and then pipes that information into the Email step. This is similar to how variables work in Pentaho: you define them in one step before you can use them in another.

If this is something you need to do, read on!

At its most basic level, this kind of task in PDI is very simple, building on the task of creating files in PDI, whether they are text, Excel, or some other format. Once the output files are created, sending them via email involves only a couple of steps.
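
Once the job works in Spoon, you can also run it headless from the shell with Kitchen, PDI's command-line job runner, which is handy for scheduling. A minimal sketch, assuming the job was saved as email_output_files.kjb and PDI lives in /opt/pentaho/data-integration (both paths are made up):

    #!/bin/sh
    # Run the emailing job without the GUI, using Kitchen.
    # Both paths are assumptions; point them at your own install and job.
    cd /opt/pentaho/data-integration
    ./kitchen.sh -file=/home/etl/jobs/email_output_files.kjb -level=Basic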

Continue reading

Debugging ETL with Error Output


When developing new ETL flows, you should include error-output steps at an early stage so you can more easily locate and fix problems, especially when moving data between different platforms. As an example, I have recently been working on transformations that move records on a regular basis between PostgreSQL, Microsoft SQL Server, and a DB2 mainframe system. All of these systems use similar data types, but not always the same ones, and the way they handle those types can vary as well.

Generally what I do is create an On Error step at the destination points in the transformation and dump any bad records to a text file. This lets me see whether the error affects all of the records (which would indicate a problem with the configuration) or only some of them (which may indicate a problem with how the step is coded). In some cases the error output even reveals an issue with the source data that wasn't foreseen! The biggest benefit, however, is that the complete sample of records runs through your workflow, letting you see how well it handles different values rather than failing on the first error.
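
A quick way to triage the dump is to compare record counts between the input and the error file; a minimal shell sketch, with made-up file paths:

    # Both paths are made-up examples; point them at your own files.
    # Matching counts mean every record failed (suspect the configuration);
    # a small count points at specific bad values in the data.
    wc -l /tmp/input_extract.txt /tmp/error_rows.txt
    head -n 5 /tmp/error_rows.txt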

Continue reading

Pentaho access to SQL Server with AD Authentication

Back in June, I posted on how to Use SQL Server with Pentaho Data Integrator. That article covered using Microsoft SQL Server (MSSQL) authentication to connect to your MSSQL databases. This time around, we'll look at the other type of authentication you can encounter when working with MSSQL: Active Directory authentication (AD for short).

If you are not familiar with AD, it's a centralized authentication mechanism that controls access to the various hardware and services on a network. By centralizing authentication, the same user account can be used to access multiple resources, which eliminates some of the setup needed to enable those users on individual systems. Most DBAs prefer AD authentication for those reasons, and if you will be using PDI to access multiple MSSQL systems, you'll probably want to become familiar with setting it up.
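
As a taste of what the setup involves: PDI's bundled jTDS driver accepts a domain option for AD logins, added on the connection's Options tab in Spoon. Very roughly, and with every name below a placeholder rather than a real server:

    # Sketch only: sqlhost, SalesDB, and CORP are placeholders.
    # Adding domain=CORP under the connection's Options tab makes
    # jTDS authenticate the supplied user against AD, producing a
    # JDBC URL of roughly this shape:
    #   jdbc:jtds:sqlserver://sqlhost:1433/SalesDB;domain=CORP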

Continue reading

Launch Pentaho Spoon from the Desktop

Coming from a Windows/Mac background, I got into the habit of keeping shortcuts to frequently used applications on my desktop or in a task bar, and I've continued that practice since switching to Linux. Unfortunately, Pentaho Spoon, the GUI tool for designing transformations and jobs for Pentaho Data Integration (aka Kettle), is started from the command line. When I tried to create a desktop launcher on my CentOS laptop, all I got was an error message:

Unable to access jarfile launcher/pentaho-launcher-5.1.0.0-752.jar

I searched the Internet, and apparently this was a common question, so I decided to come up with a quick and easy solution.
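
The error suggests spoon.sh was run with a working directory other than the install directory, so the relative path to launcher/pentaho-launcher-5.1.0.0-752.jar doesn't resolve. One common fix is a small wrapper script that changes into the install directory first; a minimal sketch, with the install path an assumption:

    #!/bin/sh
    # Wrapper for a desktop launcher: cd into the PDI install directory
    # so spoon.sh can find its launcher jar via relative paths.
    # /opt/pentaho/data-integration is an assumed install location.
    cd /opt/pentaho/data-integration && exec ./spoon.sh "$@"

Make the script executable with chmod +x and point the desktop launcher's command at it.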

Continue reading