Pentaho Data Integration’s Fuzzy Match

When cleansing data, one of the biggest challenges is determining whether one record is the same as another in the absence of a unique identifier. For example, if your database has a record for Terri Lee Duffy, and you get a new record for Terry Lee Duffy, is it the same person? If you have a government ID number, then it’s possible to tell definitively that it’s the same person. But what if you don’t have that to distinguish the records? You could check other related data if you have it, like street address, but what if one record has 100 South Ave and the other has 100 South Road? A human looking at the two records could make a judgment call on whether this is the same person.

We don’t want to have to check every discrepancy by hand, especially if we are moving millions of rows at a time. To automate this process, we can use a component in Pentaho called Fuzzy Match. (For a longer discussion of Fuzzy Matching, Melissa Data Corporation has a good overview.) While the results of a Fuzzy Match process are not 100% perfect, you can set an allowance threshold so that similarities have to be within a certain range, or you can show only the closest match as a result of your Fuzzy Match. Finally, the Fuzzy Match component can use one of several algorithms to determine if one field is a match for another. The Pentaho Wiki discusses the nuances of these algorithms and has some discussion on the best times to use them.
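To make the threshold and closest-match ideas concrete, here is a minimal sketch in Python. It is not Pentaho's implementation. It assumes Levenshtein edit distance as the algorithm, and the function names `levenshtein` and `closest_match`, along with the `max_distance` parameter, are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def closest_match(value, candidates, max_distance):
    """Return (distance, candidate) for the nearest candidate whose
    distance is within max_distance, or None if nothing qualifies.
    Comparison is case-insensitive (an assumption of this sketch)."""
    scored = [(levenshtein(value.lower(), c.lower()), c) for c in candidates]
    within = [pair for pair in scored if pair[0] <= max_distance]
    return min(within, default=None)


if __name__ == "__main__":
    existing = ["Terri Lee Duffy", "Tom Duffy", "Teresa Duff"]
    print(closest_match("Terry Lee Duffy", existing, max_distance=2))
```

With a threshold of 2, the two spellings of the name are treated as a match because they differ by a single character, while the two street addresses from the example above differ by four edits and would need a looser threshold or a different algorithm.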

Use GMail with Pentaho BI-Server Community

Email servers are fairly common in a lot of organizations, but many smaller companies elect to use outside hosting for their email. There are myriad reasons for doing so, and for those companies the choice makes sense. If your organization is using Google Gmail for its email, you can set up Pentaho’s BI Server to use it as your email server. In this brief article, I’ll walk you through the settings you’ll need to make it work.

To access the email settings in the Pentaho User Console, log in to your BI-Server website with an Administrator account. Click on the large HOME menu item, and at the bottom of the menu that appears, click on Administration. When the Administration screen appears, click on the Email Server option in the menu on the left. The screen will update with the Email Server settings fields. Below are instructions for what information to use to populate the fields depending on the hosting service you are using.
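Before typing credentials into the Email Server screen, it can help to confirm that they work outside of Pentaho. The sketch below does that with Python's standard `smtplib`. It assumes Gmail's usual SMTP endpoint (`smtp.gmail.com` on port 587 with STARTTLS) and an account that permits SMTP sign-in, for example with an app password. The `GMAIL_SMTP` dictionary and both function names are hypothetical and do not correspond to Pentaho field names.

```python
import smtplib
from email.message import EmailMessage

# Assumed values: Gmail's commonly documented SMTP submission endpoint.
GMAIL_SMTP = {"host": "smtp.gmail.com", "port": 587, "use_starttls": True}


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a small plain-text message for a connectivity check."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Pentaho BI Server email test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_message(username: str, password: str, recipient: str) -> None:
    """Send the test message through Gmail (requires network access)."""
    msg = build_test_message(username, recipient)
    with smtplib.SMTP(GMAIL_SMTP["host"], GMAIL_SMTP["port"], timeout=30) as server:
        if GMAIL_SMTP["use_starttls"]:
            server.starttls()  # upgrade the connection to TLS before login
        server.login(username, password)
        server.send_message(msg)
```

If `send_test_message` succeeds from the machine that hosts the BI Server, the same host, port and credentials should be reasonable starting values for the settings screen.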

Remove evaluator login from Pentaho BI server

When you first install the Pentaho BI server, the login screen includes an option to Login as an Evaluator, either as an Administrator (Admin) or a Power User (Suzy). While this is handy if you just want to check the software out, it’s a huge security hole if you plan to move to production mode. The good news is that removing that functionality involves editing one configuration file to change a couple of settings.

Open a terminal and navigate to where the BI-Server was installed. On my system that is /opt/pentaho/biserver-ce. Drill down into the pentaho-solutions folder, and then into the system folder.

Using a text editor, open the pentaho.xml file.

Using Chrome with Pentaho Report Designer 6

Pentaho’s Report Designer (PRD) is a full-featured application that allows you to define reports that can be used within the Pentaho BI suite or as stand-alone documents. Output can be in a number of formats: PDF, Excel (XLS or XLSX versions), CSV/TXT, RTF or HTML. If you would like to preview your report in HTML format and you don’t have one of the default supported browsers installed (like me on my Mint laptop), or you would like to use a different default browser, you can tell PRD which browser to use.

Open Report Designer, and from the main menu, click on EDIT, and then click the Preferences option at the bottom of the screen.