Running Kettle (Pentaho Data Integration) on Mac OSX 10.12 Sierra

A new version of Mac OSX and a new version of Pentaho Data Integration (aka Kettle), but the same old problem: getting Kettle to run. Apple tries to keep its operating system locked down and secure, so if you download applications from the Internet that aren’t from the Apple App Store, the files are quarantined.

With the update to Sierra, the quarantine process has been “improved”. Keep reading to see how to get Kettle running anyway!


Access a MySQL Server Remotely

A quick one today: While working on a project, I couldn’t access the MySQL server (version 5.7.12) that was on another system. I was in a development environment on a local network with just me on it, so the MySQL server did not have a firewall running. Here is what I did to get my connection to work.

  1. Add an Administrator user account with permissions to connect from any host:
    CREATE USER 'edpflager'@'%' IDENTIFIED BY 'my_password';
    GRANT ALL PRIVILEGES ON *.* TO 'edpflager'@'%' WITH GRANT OPTION;
  2. Next, open a terminal prompt on the MySQL server and navigate to /etc/mysql/mysql.conf.d
  3. Open a text editor as superuser and edit mysqld.cnf
    sudo nano ./mysqld.cnf
  4. Find the following line and add a # to the beginning to comment it out:
    bind-address = 127.0.0.1
  5. Save, exit, and restart MySQL for the change to take effect.

You should now be able to access MySQL as the admin account you created previously.
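
If the connection still fails, it can help to check from the client machine whether the server is accepting TCP connections at all. The following is a minimal sketch in Python, assuming MySQL is listening on its default port 3306; the function name `can_reach` is my own and not part of any MySQL tooling.

```python
import socket


def can_reach(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only checks that something is listening on the port. It does not
    log in, so it says nothing about user accounts or privileges.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `can_reach("192.168.1.50")` (a placeholder address; substitute your own server's) should return True once bind-address is commented out and MySQL has been restarted. If it returns False, the server is probably still bound to 127.0.0.1 or something on the network is blocking the port.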

 

Pentaho Data Integration’s Fuzzy Match

When cleansing data, one of the biggest challenges is determining if one record is the same as another in the absence of a unique identifier. For example, if your database has a record for Terri Lee Duffy, and you get a new record for Terry Lee Duffy, is it the same person? If you have a government ID number, then it’s possible to tell definitively that it’s the same person. But what if you don’t have that to distinguish the record? You could check other related data if you have it, like street address, but what if one record has 100 South Ave and the other is 100 South Road? A human looking at the two records could decide whether this is the same person.

We don’t want to have to check every discrepancy, especially if we are moving millions of rows at a time. In order to automate this process, we can use a component in Pentaho called Fuzzy Match. (For a longer discussion of Fuzzy Matching, Melissa Data Corporation has a good overview.) While the results of a Fuzzy Match process are not 100% perfect, you can set an allowance threshold so that similarities have to be within a certain range or you can show only the closest match as a result of your Fuzzy Match. Finally, the Fuzzy Match component can use one of several algorithms to determine if one field is a match for another.  The Pentaho Wiki discusses the nuances of these algorithms and has some discussion on the best times to use them.
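
To make the threshold and closest-match ideas concrete, here is a small sketch in Python using Levenshtein edit distance, one common measure for this kind of comparison. It is only an illustration of the concept, not how the Pentaho component is implemented, and the function names and the `max_distance` parameter are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def closest_match(name, candidates, max_distance):
    """Return the closest candidate within max_distance edits, else None."""
    best, best_distance = None, None
    for candidate in candidates:
        distance = levenshtein(name.lower(), candidate.lower())
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    if best_distance is None or best_distance > max_distance:
        return None
    return best
```

With this sketch, "Terri Lee Duffy" and "Terry Lee Duffy" are one edit apart, so a threshold of 2 would treat them as a match, while an unrelated name would be rejected.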


Get MongoDB data with Pentaho Data Integration

Pentaho Data Integration (aka PDI or Kettle) is one of the most fully-featured tools for extracting data from a MongoDB environment. MongoDB stores information in documents instead of records, with data for a distinct subject instance stored in a single document where a traditional database might use multiple tables linked via primary and foreign keys and joins. This paradigm generally makes retrieval quicker than with a comparable relational database system. If you are using PDI to connect to MongoDB, it will probably be the initial source or final destination for the data. For this article, I’ll cover how to use PDI to extract information from a MongoDB collection and save it to a text file. It could just as easily be passed on to another database or manipulated for other processing downstream.
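
To picture the difference between a document and a set of joined tables, here is a rough sketch in plain Python (no MongoDB or PDI involved). The customer document, its field names, and the pipe delimiter are all invented for illustration; the point is only that one nested document has to be flattened into rows before it can be written to a delimited text file.

```python
import csv
import io

# Hypothetical document: customer, address, and orders live together,
# where a relational design might use three tables and two joins.
customer_doc = {
    "_id": 1,
    "name": "Terri Lee Duffy",
    "address": {"street": "100 South Ave", "city": "Springfield"},
    "orders": [
        {"order_id": 101, "total": 25.00},
        {"order_id": 102, "total": 40.50},
    ],
}

FIELDS = ["customer_id", "name", "city", "order_id", "total"]


def flatten_orders(doc):
    """Yield one flat row per order, repeating the parent customer fields."""
    for order in doc.get("orders", []):
        yield {
            "customer_id": doc["_id"],
            "name": doc["name"],
            "city": doc["address"]["city"],
            "order_id": order["order_id"],
            "total": order["total"],
        }


def to_delimited_text(docs, delimiter="|"):
    """Render the flattened rows as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for doc in docs:
        writer.writerows(flatten_orders(doc))
    return buffer.getvalue()
```

Calling `to_delimited_text([customer_doc])` produces a header line followed by one line per order, which is roughly the shape of output a text file destination expects.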

I do make one assumption: You have a development MongoDB environment set up and running.