Running Kettle (Pentaho Data Integration) on Mac OSX 10.12 Sierra

A new version of Mac OSX and a new version of Pentaho Data Integration (aka Kettle), but the same old problem: getting Kettle to run. Apple tries to keep its operating system locked down and secure, so if you download applications from the Internet that aren’t from the Apple App Store, the files are quarantined.

With the update to Sierra, the quarantine process has been “improved”. Keep reading to see how to get Kettle running anyway!
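As a rough illustration of what lifting the quarantine can look like, here is a minimal sketch that clears the `com.apple.quarantine` extended attribute with the macOS `xattr` tool. It is one common approach and not necessarily the exact steps from the full post. The function names and the application path in the usage note are hypothetical.

```python
import platform
import subprocess
from typing import List


def quarantine_removal_command(app_path: str) -> List[str]:
    """Build the xattr call that recursively deletes the quarantine attribute."""
    # -r recurses into the application bundle, -d deletes the named attribute.
    return ["xattr", "-r", "-d", "com.apple.quarantine", app_path]


def remove_quarantine(app_path: str) -> bool:
    """Run the command on macOS; return True only if xattr exits cleanly."""
    if platform.system() != "Darwin":
        # xattr as used here is a macOS tool, so do nothing on other systems.
        return False
    result = subprocess.run(quarantine_removal_command(app_path))
    # xattr may exit non-zero when the attribute is not present on a file.
    return result.returncode == 0
```

Calling `remove_quarantine("/Applications/data-integration/Data Integration.app")` would be the typical use, assuming that is where you unpacked Kettle; adjust the path to your own install location.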

Continue reading

Install the Saiku Analytics plugin in Pentaho BIServer CE

I’ve been working with Mondrian and Pentaho’s Schema Workbench lately and attempted to add Meteorite Consulting’s Saiku Analytics plugin to my installation of Pentaho BI Server community edition, to process some MDX queries. MDX is a query language similar to SQL that is used for querying database cubes. Mondrian is an OLAP engine that implements the MDX language and is incorporated into the Saiku Analytics software. It differs from other OLAP engines in that the cubes are built on the fly as the query runs, rather than having the cube data stored on a server. For simpler cubes, the slightly slower build time is a negligible price for the disk space saved.

Here is the process I followed to get Saiku enabled in my BI Server:

Continue reading

Pentaho Data Integration’s Fuzzy Match

When cleansing data, one of the biggest challenges is determining if one record is the same as another in the absence of a unique identifier. For example, if your database has a record for Terri Lee Duffy, and you get a new record for Terry Lee Duffy, is it the same person? If you have a government ID number, then it’s possible to tell definitively that it’s the same person. But what if you don’t have that to distinguish the record? You could check other related data if you have it, like street address, but what if one record has 100 South Ave and the other is 100 South Road? A human looking at the two records could say yes or no that this is the same person.

We don’t want to have to check every discrepancy by hand, especially if we are moving millions of rows at a time. To automate this process, we can use a component in Pentaho called Fuzzy Match. (For a longer discussion of Fuzzy Matching, Melissa Data Corporation has a good overview.) While the results of a Fuzzy Match process are not perfect, you can set an allowance threshold so that similarities have to be within a certain range, or you can show only the closest match as a result of your Fuzzy Match. Finally, the Fuzzy Match component can use one of several algorithms to determine if one field is a match for another. The Pentaho Wiki discusses the nuances of these algorithms and has some discussion on the best times to use them.
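To make the threshold and closest-match ideas concrete, here is a minimal sketch in plain Python using Levenshtein edit distance, one of the edit-distance style algorithms commonly used for this kind of comparison. It is not Pentaho's implementation, and the function names, the lowercasing step, and the `max_distance` parameter are my own assumptions for illustration.

```python
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def closest_match(value: str, candidates: List[str],
                  max_distance: int) -> Optional[Tuple[int, str]]:
    """Return (distance, candidate) for the nearest candidate within the threshold."""
    # Lowercasing is an assumption here: it makes the comparison case-insensitive.
    scored = [(levenshtein(value.lower(), c.lower()), c) for c in candidates]
    within = [pair for pair in scored if pair[0] <= max_distance]
    return min(within, default=None)
```

With the example from above, `levenshtein("Terri Lee Duffy", "Terry Lee Duffy")` is 1, so a threshold of 1 or 2 would pair the two records, while a threshold of 0 would demand an exact match. Raising the threshold catches more typos but also lets more false matches through.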

Continue reading

Use Gmail with Pentaho BI-Server Community

Email servers are fairly common in a lot of organizations, but many smaller companies elect to use outside hosting for their email. There are myriad reasons for doing so, and the choice obviously makes sense for them; otherwise they wouldn’t do it. If your organization is using Google Gmail for its email, you can set up Pentaho’s BI Server to use it as your email server. In this brief article, I’ll walk you through the settings you’ll need to make it work.

To access the email settings in the Pentaho User Console, log in to your BI-Server website with an Administrator account. Click on the large HOME menu item, and at the bottom of the menu that appears, click on Administration. When the Administration screen appears, click on the Email Server option in the menu on the left. The screen will update with the Email Server settings fields. Below are instructions for what information to use to populate the fields depending on the hosting service you are using.
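Before typing values into the Email Server screen, it can help to confirm them outside Pentaho. The sketch below is an assumption-laden illustration: it uses Gmail's public SMTP endpoint (`smtp.gmail.com`, port 587 with STARTTLS) and Python's standard `smtplib`. The class and field names are hypothetical and only loosely mirror the kind of fields the settings screen asks for, so they are not Pentaho's own names.

```python
import smtplib
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailServerSettings:
    """Hypothetical container for the values you would enter on the settings screen."""
    host: str
    port: int
    use_start_tls: bool
    use_authentication: bool
    user_name: str
    from_address: str


def gmail_settings(address: str) -> EmailServerSettings:
    """Settings for Gmail's SMTP submission endpoint using STARTTLS."""
    return EmailServerSettings(
        host="smtp.gmail.com",
        port=587,                 # submission port used with STARTTLS
        use_start_tls=True,
        use_authentication=True,
        user_name=address,        # Gmail logs in with the full address
        from_address=address,
    )


def can_log_in(settings: EmailServerSettings, password: str,
               timeout: float = 10.0) -> bool:
    """Try to connect and authenticate; return False on any SMTP or network error."""
    try:
        with smtplib.SMTP(settings.host, settings.port, timeout=timeout) as server:
            if settings.use_start_tls:
                server.starttls()
            if settings.use_authentication:
                server.login(settings.user_name, password)
        return True
    except (smtplib.SMTPException, OSError):
        return False
```

If `can_log_in(gmail_settings("you@example.com"), password)` returns `False` with a password you know is right, the account may need an app-specific password; that depends on how the Google account is configured.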

Continue reading