Connecting Kettle to Cloudera Hadoop Impala

hadoop-elephantAs Big Data platforms like Hadoop and its ecosystem of related applications has matured, they have moved beyond the original key-value model to embrace data processing of more traditional structured data. But a big problem for DBAs and Data Analysts wanting to use the power of these new platforms to analyze data from RDBMS systems like MySQL, SQL Server, is getting data moved between them. Using CSV or flat files is one way, but it adds additional processing. Data has to be extracted from the source system to an intermediary format and then imported into the destination. Its far more efficient and less prone to error if the data can be passed without that middle step.

In this first article of a series where I’ll be looking at interactivity between Hadoop and other database systems, I’ll cover setting up a database connection to Hadoop via Cloudera’s Impala JDBC driver to Pentaho’s Kettle ETL system.

Continue reading

Install MySQL Workbench 6.2 on Centos

workbenchThe world of computers is constantly evolving, and that means having to upgrade your software periodically if you want to stay current. The GA version of MySQL Workbench, the GUI tool for interacting with the MySQL database engine was recently updated. For information on changes, you can check out the official documentation at this link, but a couple of the biggest changes revolve around Microsoft products:

  • you can now migrate Microsoft Access databases, and
  • 64-bit Windows binaries are now provided to go along with the 32-bit ones.

I use MySQL as a test bed for a lot of Pentaho development, so I like to keep the related tools up to date. Although this version does work with Centos 6.6 (the version I am using of the RHEL  distribution), its not as easy as it should be to install.

Continue reading

Photo Break!

bulb05Once in a while I like to post a picture I’ve taken if I like how it turned out. The one above is an example. :)

It was taken at 1/8000 of a second, f/11 with ISO 4000 using a 100mm lens and an off camera flash to the left with a blue gel.  If none of that makes any sense to you, don’t worry. I’ll be back to posting about ETL/Big Data soon.


Pentaho – Using Database Lookups

KettleAt my previous day job, we were often tasked with producing data extracts for distribution to outside companies who sold our products. To accomplish this we used a well known reporting platform to produce either Excel or CSV format files. While it worked, it was like using a hammer to drive in a screw. Its just not the best way to accomplish your goal. So during a rare period of calm, I dissected a couple of our extract reports and attempted to convert them to Pentaho Data Integration (aka Kettle).

Because we were pulling data from multiple sources and combining it into one unified output file,  the extracts were made up of several queries gathering data from multiple sources. The first query would pull the bulk of the data, and then specific fields from those results were used as filters to pull additional fields from tables in other data sources.  This concept can be used in Pentaho as well, making it unnecessary to temporarily store the data from the first extract and join it later with data from subsequent extracts. An additional benefit is that processing in this manner tends to be much quicker.

This type of processing can also be used for denormalizing data when building a data warehouse and in many other scenarios when working with ETL processes.

Continue reading