I’ve been working with Mondrian and Pentaho’s Schema Workbench lately and attempted to add Meteorite Consulting’s Saiku Analytic plugin to my installation of Pentaho BI Server community edition, to process some MDX queries. MDX is a query language similar to SQL that is used for processing database cubes. Mondrian is a OLAP engine that implements the MDX language and is incorporated into the Saiku Analytic software. It differs from other OLAP engines in that the cubes are built on the fly as the query processes, rather than having the cube data stored on a server. For simpler cubes, the trade off between a slightly slower build time and disk space is negligible.
Here is the process I followed to get Saiku enabled in my BI Server:
Continue reading “Install the Saiku Analytics plugin in Pentaho BIServer CE”
A quick one today: While working on a project, I couldn’t access the MySQL server (version 5.7.12) that was on another system. I was in a development environment on a local network with just me on in, so the MySQL server did not have a firewall running. Here is what I did to get my connection to work.
- Add an Administrator user account with permissions to connect from any host:
CREATE USER 'edpflager'@'%' IDENTIFIED BY 'my_password';
GRANT ALL PRIVILEGES ON *.* TO 'edpflager'@'%'
WITH GRANT OPTION;
- Next open a terminal prompt on the MySQL server, and navigate to /etc/mysql/mysql.conf.d
- Open a text editor as superuser and edit mysqld.cnf
sudo nano ./mysqld.cnf
- Find the following line and add a # to the beginning to comment it out:
bind-address = 127.0.0.1
- Save, exit, and restart MySQL to make it take effect.
You should now be able to access MySQL as the admin account you created previously.
When cleansing data, one of the biggest challenges is determining if one record is the same as another in the absence of a unique identifier. For example, if your database has a record for Terri Lee Duffy, and you get a new record for Terry Lee Duffy, is it the same person? If you have a government ID number then its possible to tell definitively, that its the same person. But what if you don’t have that to distinguish the record? You could check other related data if you have it, like street address, but what if one record has 100 South Ave and the other is 100 South Road? A human looking could say yes or no that this is the same person.
We don’t want to have to check every discrepancy, especially if we are moving millions of rows at a time. In order to automate this process, we can use a component in Pentaho called Fuzzy Match. (For a longer discussion of Fuzzy Matching, Melissa Data Corporation has a good overview.) While the results of a Fuzzy Match process are not 100% perfect, you can set an allowance threshold so that similarities have to be within a certain range or you can show only the closest match as a result of your Fuzzy Match. Finally, the Fuzzy Match component can use one of several algorithms to determine if one field is a match for another. The Pentaho Wiki discusses the nuances of these algorithms and has some discussion on the best times to use them.
Continue reading “Pentaho Data Integration’s Fuzzy Match”
Adobe’s PDF file format is a wonderful tool, allowing users on disparate operating systems to share documents easily. Because Adobe made the file format an open-standard in 2008, applications to create and read PDF files readers can be found on pretty much every operating system you can think of – Linux distros, Windows, Mac OS X and BSD, just to name a few, from no cost to several hundred dollars. And in most cases, the original document, if not identical to the PDF, is close enough to identical to make the differences irrelevant.
In my work as a BI developer, occasionally I have to extract data from PDF documents and get it into a database. While reading the file is no problem, getting the data out in a usable format where I am not having to retype or reformat the output excessively is often times not so easy. Luckily I have come across an open-source tool, called Tabula, that makes extracting data from a PDF much easier. It doesn’t work for every PDF, only on text-based PDFs. That means reports and data sets that were exported to a PDF file, rather than documents that were scanned into a computer and saved as a PDF file. (The latter tends to be image type files rather than text based documents.)
Continue reading “Extract PDF data with Tabula”