Extract PDF data with Tabula

Adobe’s PDF file format is a wonderful tool, allowing users on disparate operating systems to share documents easily. Because Adobe made the file format an open standard in 2008, applications to create and read PDF files can be found on pretty much every operating system you can think of – Linux distros, Windows, Mac OS X and BSD, just to name a few – at prices ranging from free to several hundred dollars. And in most cases the PDF, if not identical to the original document, is close enough to make the differences irrelevant.

In my work as a BI developer, I occasionally have to extract data from PDF documents and load it into a database. While reading the file is no problem, getting the data out in a usable format, without excessive retyping or reformatting, is often not so easy. Luckily, I have come across an open-source tool called Tabula that makes extracting data from a PDF much easier. It doesn’t work for every PDF, only for text-based ones. That means reports and data sets that were exported to a PDF file, rather than documents that were scanned into a computer and saved as a PDF file. (The latter tend to be image files rather than text-based documents.)


Linux Partitioning

When installing a Linux distro, one of the things you have to decide is how to partition your hard drive to store the various components of the Linux system. If you are new to Linux, you can let the installer decide for you. The default layout generally defines a boot partition, a swap location, and a root partition for everything else. As with most default settings, the outcome is not optimal, but it will work.

Once you’ve worked with Linux for a while, and have installed a few distros or upgraded, you realize that those default partitions can cause some problems. Specifically, when you reinstall or upgrade, the personal files in your home folder can be overwritten, and you may lose any personalized configuration settings stored in the hidden files and folders of your home directory. But defining a partition scheme can be a daunting task. So here are some suggestions on how to partition your drive, using my current setup as an example.


Greetings! It’s been almost a month since I posted anything. That’s not because I’ve been inordinately busy at work or at home; I just didn’t have anything ready to post. As a teaser, I’ve been experimenting with Docker, using Pentaho and MongoDB together, and switching my Dell Inspiron from Ubuntu to Linux Mint.

So hopefully, I’ll have something useful to post soon.