When creating ETL workflows, it's useful to store them in a database repository rather than as individual files on your workstation. A repository gives multiple users access to the same work (why reinvent the wheel?), lets you pull existing pieces into your jobs quickly and easily, and can be backed up and restored if necessary.
Pentaho Data Integration (aka Kettle) is an open source ETL tool with a repository feature that lets you store your transformations and jobs either in local files or in a central repository database. The file option is pretty easy to implement, so I won’t cover it here. Because of my work experience, I prefer a repository based on a database server. Unfortunately, the documentation for setting up a DB repository is sorely lacking (a common problem with a lot of open source projects). After some experimenting, I did figure out how to create a MySQL-based repository and how to connect to it from a Linux-based installation of PDI. Here is a walk-through of the process: