As Big Data platforms like Hadoop and their ecosystems of related applications have matured, they have moved beyond the original key-value model to embrace processing of more traditional structured data. But a big problem for DBAs and data analysts who want to use the power of these new platforms to analyze data from relational databases like MySQL or SQL Server is getting data moved between them. Using CSV or flat files is one way, but it adds extra processing: data has to be extracted from the source system into an intermediary format and then imported into the destination. It's far more efficient, and less error-prone, if the data can be passed without that middle step.
In this first article of a series on interoperability between Hadoop and other database systems, I'll cover connecting Pentaho's Kettle ETL system to Hadoop through Cloudera's Impala JDBC driver.