Adobe’s PDF file format is a wonderful tool, allowing users on disparate operating systems to share documents easily. Because Adobe made the file format an open-standard in 2008, applications to create and read PDF files readers can be found on pretty much every operating system you can think of – Linux distros, Windows, Mac OS X and BSD, just to name a few, from no cost to several hundred dollars. And in most cases, the original document, if not identical to the PDF, is close enough to identical to make the differences irrelevant.
In my work as a BI developer, occasionally I have to extract data from PDF documents and get it into a database. While reading the file is no problem, getting the data out in a usable format where I am not having to retype or reformat the output excessively is often times not so easy. Luckily I have come across an open-source tool, called Tabula, that makes extracting data from a PDF much easier. It doesn’t work for every PDF, only on text-based PDFs. That means reports and data sets that were exported to a PDF file, rather than documents that were scanned into a computer and saved as a PDF file. (The latter tends to be image type files rather than text based documents.)