Pentaho Fuzzy Match

When cleansing data, one of the biggest challenges is determining whether one record is the same as another in the absence of a unique identifier. For example, if your database has a record for Terri Lee Duffy, and you get a new record for Terry Lee Duffy, is it the same person? If you have a government ID number, then it’s possible to tell definitively that it’s the same person. But what if you don’t have that to distinguish the records? You could check other related data if you have it, like street address, but what if one record has 100 South Ave and the other has 100 South Road? A human looking at the two records could make a yes-or-no call on whether this is the same person.

We don’t want to have to check every discrepancy by hand, especially if we are moving millions of rows at a time. To automate this process, we can use a component in Pentaho called Fuzzy Match. (For a longer discussion of Fuzzy Matching, Melissa Data Corporation has a good overview.) While the results of a Fuzzy Match process are never perfect, you can set an allowance threshold so that similarities have to be within a certain range, or you can show only the closest match as the result of your Fuzzy Match. Finally, the Fuzzy Match component can use one of several algorithms to determine if one field is a match for another. The Pentaho Wiki discusses the nuances of these algorithms and has some discussion on the best times to use them.
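To make the threshold idea concrete, here is a minimal shell sketch. It is not Pentaho’s implementation: it computes the Levenshtein edit distance (one common fuzzy-matching algorithm) with awk and treats two values as a match when the distance is at or below a cutoff. The function names levenshtein and is_fuzzy_match and the default cutoff of 2 are illustrative assumptions.

```shell
#!/usr/bin/env bash

# levenshtein A B: print the edit distance between two strings.
levenshtein() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    la = length(a); lb = length(b)
    # prev[j] is the distance between the first i-1 chars of a and the first j chars of b
    for (j = 0; j <= lb; j++) prev[j] = j
    for (i = 1; i <= la; i++) {
      cur[0] = i
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        best = prev[j] + 1                                    # deletion
        if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1      # insertion
        if (prev[j - 1] + cost < best) best = prev[j - 1] + cost  # substitution
        cur[j] = best
      }
      for (j = 0; j <= lb; j++) prev[j] = cur[j]
    }
    print prev[lb]
  }'
}

# is_fuzzy_match A B [MAX]: succeed when the distance is at most MAX (default 2).
is_fuzzy_match() {
  local max="${3:-2}"
  [ "$(levenshtein "$1" "$2")" -le "$max" ]
}
```

For example, is_fuzzy_match "Terri Lee Duffy" "Terry Lee Duffy" succeeds because the two names differ by a single character, while the two street addresses above fall outside a cutoff of 2. A real matching job would tune the cutoff per field rather than use one number everywhere.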


Using Docker on demand with Linux Mint

If you are like me and work on multiple things on your development system, you don’t always want everything running when you start your PC. I’ve previously covered starting other services on demand, and this time around I’ll cover running Docker as needed.

Docker has essentially two separate components. There is the Docker daemon (or service), which is configured to start when the system boots, and there is the Docker CLI, which you interact with and which passes your commands to the daemon. The CLI only runs when you specifically call it from the terminal prompt with the docker command. For my purposes, I didn’t need or want the daemon running all the time because it’s a laptop. (If I were using a production system or even a full-blown development box, I would prefer to have the daemon always running.) So after installing Docker, I needed to configure it not to start every time the system starts, and then come up with an easy way to start it as needed.
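A rough sketch of one way to do this follows. It assumes a Linux Mint release that uses systemd and a Docker package that installs units named docker.service and docker.socket; releases with a different init system need a different mechanism. The function names are made up for illustration.

```shell
#!/usr/bin/env bash

# Keep Docker from starting at boot
# (assumes systemd units named docker.service and docker.socket).
docker_disable_at_boot() {
  sudo systemctl disable docker.service docker.socket
}

# Start the daemon only when you need it.
docker_up() {
  sudo systemctl start docker.service
}

# Stop the daemon, and its socket so it is not re-activated, when you are done.
docker_down() {
  sudo systemctl stop docker.service docker.socket
}
```

Dropping these functions into a shell startup file such as ~/.bashrc would make docker_up and docker_down available in every terminal session.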

Docker Admin Cheat Sheet part 1

Here is part one of a personal Docker administration cheat sheet I have been putting together. I know there are a number of sites that provide similar tools, but for my own purposes, it’s easier to remember different commands when I organize them myself.

Docker administration cheat sheet

A Docker image is a source file (like an .ISO) that is used to start a container (an installed system). Periodic maintenance is necessary because a Docker container remains on your system even after it exits.

What’s running?

Show all running local containers with container id, image it’s based on, any open ports, commands that it runs on startup, and when it was created.

  • docker ps

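Show all local containers (running or not) with short container id, image it’s based on, any open ports, commands that run on startup, and when it was created. The -l option instead shows only the most recently created container.

  • docker ps -a (or docker ps -l for the latest container only)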

Show only the container id of all local containers (running or not running).

  • docker ps -aq (or docker ps -lq for the latest container only)

Show the same info as -a or -l but with the full container ID rather than the shortened one.

  • docker ps -a --no-trunc

Pause the container with the specified ID (which you can get with the ps command)

  • docker pause <container id>

Resume a previously paused container with the specified ID (which you can get with the ps command). Note that start is for stopped containers; a paused container needs unpause.

  • docker unpause <container id>
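The ps and pause commands combine nicely. As a small sketch (the function names are hypothetical, and it relies on the status filter of docker ps and on docker unpause, which is the command that resumes a paused container), these helpers pause everything that is running and resume it later:

```shell
#!/usr/bin/env bash

# list_ids [STATUS]: print the IDs of containers in the given state (default: running).
list_ids() {
  docker ps -aq --filter "status=${1:-running}"
}

# pause_all: pause every running container, if there are any.
pause_all() {
  local ids
  ids=$(list_ids running)
  if [ -n "$ids" ]; then
    # Word splitting is intended here: one argument per container ID.
    docker pause $ids
  fi
}

# unpause_all: resume every paused container, if there are any.
unpause_all() {
  local ids
  ids=$(list_ids paused)
  if [ -n "$ids" ]; then
    docker unpause $ids
  fi
}
```

The empty-list check matters because docker pause and docker unpause report an error when called with no container at all.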

Local Images

Show all locally stored image files, with a common name (referred to as a repository), a 12-character ID, the size of the image file and how long ago the image was created. There may also be a tag indicating the image file version.

  • docker images

Shutdown containers

Gracefully shut down an active container with the id or short name (which you can get with the ps command). This is preferable to kill.

  • docker stop <container ID or name>

If stop doesn’t work, you can forcefully shut down an active container with the id or short name (which you can get with the ps command).

  • docker kill <container ID or name>
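The stop-then-kill sequence is easy to wrap in one helper. This is only a sketch: stop_or_kill is a made-up name and the 10 second default grace period is an assumption. It uses the -t option of docker stop to set how long Docker waits before giving up on a graceful shutdown.

```shell
#!/usr/bin/env bash

# stop_or_kill CONTAINER [SECONDS]: try a graceful stop first, fall back to kill.
stop_or_kill() {
  local target="$1" grace="${2:-10}"
  if docker stop -t "$grace" "$target" >/dev/null; then
    echo "stopped $target"
  elif docker kill "$target" >/dev/null; then
    echo "killed $target"
  else
    echo "could not stop $target" >&2
    return 1
  fi
}
```

Call it as stop_or_kill <container ID or name>, optionally followed by the number of seconds to wait.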

To force a shutdown of a running container and delete it

  • docker rm -f <container id>


Delete the container identified with the specified ID (which you can get with the ps command)

  • docker rm <container id>

Delete a Docker image file (a source file for containers) identified by the specified name or ID (which you can get with the images command)

  • docker rmi <image name or ID>

To delete all of the containers on your local system (be careful!), use this command

  • docker rm $(docker ps -a -q)
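A more cautious variant is sketched below. It removes only containers that have already exited, using the status filter of docker ps, and does nothing when the list is empty. The function name remove_exited is made up for illustration.

```shell
#!/usr/bin/env bash

# remove_exited: delete only the containers that have already exited.
remove_exited() {
  local ids
  ids=$(docker ps -aq --filter status=exited)
  if [ -z "$ids" ]; then
    echo "nothing to remove"
    return 0
  fi
  # Word splitting is intended here: one argument per container ID.
  docker rm $ids
}
```

Because running and paused containers never appear in the filtered list, this is safer to run routinely than removing everything that docker ps -a returns.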

That’s all for now! Coming up I have posts on running containers, and how to attach the host file system as a volume in a container.

Photo Break – Mansfield Reformatory


Above are a couple of pictures I took during a photography workshop at the Mansfield Reformatory in Mansfield, Ohio. Erected in the late 1800s, the prison was closed a few decades ago but is maintained as an historic site. In the intervening time it’s been used as the setting for a number of movies, most notably The Shawshank Redemption. It was a fascinating experience, and if you are near Mansfield, Ohio, I recommend stopping in for a tour.

BTW – my apologies for the dearth of posts lately, but some recent changes have left me with a serious shortage of free time.