Extract PDF data with Tabula

Adobe’s PDF file format is a wonderful tool, allowing users on disparate operating systems to share documents easily. Because Adobe made the file format an open standard in 2008, applications to create and read PDF files can be found on pretty much every operating system you can think of (Linux distros, Windows, Mac OS X and BSD, just to name a few), at prices ranging from free to several hundred dollars. And in most cases the PDF, if not identical to the original document, is close enough to make the differences irrelevant.

In my work as a BI developer, I occasionally have to extract data from PDF documents and get it into a database. While reading the file is no problem, getting the data out in a usable format, without excessive retyping or reformatting, is often not so easy. Luckily, I have come across an open-source tool called Tabula that makes extracting data from a PDF much easier. It doesn’t work for every PDF, only text-based ones: reports and data sets that were exported to a PDF file, rather than documents that were scanned into a computer and saved as a PDF file. (Scanned files tend to be images rather than text-based documents.)
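As a rough illustration of the "get it into a database" half of that workflow, here is a minimal sketch in Python. It assumes the table has already been saved from Tabula as CSV; the function name `load_tabula_csv`, the table name `pdf_extract`, and the sample data are all made up for the example, and SQLite simply stands in for whatever database you actually load into.

```python
import csv
import io
import sqlite3


def load_tabula_csv(csv_text, conn, table="pdf_extract"):
    """Load CSV text (e.g. saved from Tabula) into a SQLite table; return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote column names so headers containing spaces stay valid SQL.
    columns = ", ".join(
        '"{}" TEXT'.format(name.strip().replace('"', '""')) for name in header
    )
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
    # Skip rows whose field count does not match the header (blank or ragged lines).
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows
    )
    conn.commit()
    return len(rows)


# Hypothetical sample standing in for a CSV file exported from a PDF report.
sample = 'Region,Units Sold,Revenue\nNorth,120,"1,450.00"\nSouth,95,"1,102.50"\n'
conn = sqlite3.connect(":memory:")
loaded = load_tabula_csv(sample, conn)
print(loaded, conn.execute('SELECT "Region", "Revenue" FROM pdf_extract').fetchall())
```

Everything is loaded as text here on purpose; values such as "1,450.00" usually need cleaning before they can be cast to numeric types, and that step depends on the report.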

Continue reading

SQuirreL SQL Client for accessing different databases – Part 1

It’s been my experience that if you work on ETL projects, you eventually accumulate client software for a number of database systems on your development PC. The reason is pretty straightforward: you need to be able to access the systems you are working with to determine data types and schema structures, and occasionally to check that a user account and password you have been given actually work.

One problem I’ve run into, though, is that database vendors don’t support every operating system with their tools. While Windows has the largest installed base, Mac OS X and Linux are also used for ETL development, but Microsoft’s SQL Server management tool only works on Windows machines. Apple’s FileMaker software is similar, running on Mac OS X and Windows but not Linux (since version 7). The examples go on and on. Also, because each tool is laid out differently, it can be difficult to find what you need quickly when you only work on a specific platform infrequently. Often, remembering where I need to go in a specific tool takes me longer than getting the actual information I was looking for.

All of this leads to the point of this post: using a free, open-source product called SQuirreL SQL Client to access multiple database platforms via one application, regardless of whether you are running Windows, Mac OS X or any of a large variety of Linux distributions.

Continue reading

Add Pentaho to your CentOS Application menu

A while back, I posted on how to get Pentaho Data Integration to launch from a desktop shortcut. Recently, I installed CentOS 7 with GNOME and wanted to add a menu item for PDI. While not too difficult, the process isn’t exactly streamlined either, so I thought it would be good to document it.

Start with the same process I’ve covered before for setting up a “start-pentaho.sh” bash script. Once you have created it, copy the file (using the root account) to the folder where you installed Pentaho on your system. In my case, it is under /opt/pentaho/data-integration, so I copied the “start-pentaho.sh” file to the /opt/pentaho folder and renamed it to “start-spoon.sh”.

Continue reading

Install MySQL Workbench 6.2 on CentOS

The world of computers is constantly evolving, and that means having to upgrade your software periodically if you want to stay current. The GA version of MySQL Workbench, the GUI tool for interacting with the MySQL database engine, was recently updated. For information on the changes, you can check out the official documentation, but a couple of the biggest changes revolve around Microsoft products:

  • you can now migrate Microsoft Access databases, and
  • 64-bit Windows binaries are now provided to go along with the 32-bit ones.

I use MySQL as a test bed for a lot of Pentaho development, so I like to keep the related tools up to date. Although this version does work with CentOS 6.6 (the version of the RHEL-based distribution I am using), it’s not as easy to install as it should be.

Continue reading