A TALE OF TWO IMPALA BOOKS
A couple of years ago, Cloudera released an open source application to query Hadoop-stored data with much of the familiar SQL syntax used by database professionals. Cloudera seems to have positioned Impala as a replacement for Hive and Pig and has taken some hits for it. Regardless of corporate motivations, because my day-to-day work over the past 8 years has revolved around using SQL to develop and administer various database systems, I have taken a keen interest in Impala and how it might be useful. (I'm also interested in Hortonworks' Stinger initiative to improve Hive, but that will be a different post.)
One of the biggest issues with open source applications, as I have noted before, is the lack of documentation and training materials for people trying to use them. Those of us who work in the corporate world don't have the luxury of figuring stuff out on our own at our day jobs, so we often look beyond the supplied documentation for better resources for learning new applications. In the past year, two publishers have released books on Cloudera Impala; I will compare and contrast them and tell you which one I think is better.
As Big Data platforms like Hadoop and its ecosystem of related applications have matured, they have moved beyond the original key-value model to embrace processing of more traditional structured data. But a big problem for DBAs and data analysts who want to use the power of these new platforms to analyze data from RDBMS systems like MySQL and SQL Server is getting data moved between them. Using CSV or other flat files is one way, but it adds extra processing: data has to be extracted from the source system to an intermediary format and then imported into the destination. It's far more efficient and less prone to error if the data can be passed without that middle step.
In this first article of a series on interoperability between Hadoop and other database systems, I'll cover connecting Pentaho's Kettle ETL system to Hadoop via Cloudera's Impala JDBC driver.
This is part 4 of a series about setting up a single-node Hadoop Yarn system for sandbox use. Part 1 was here, part 2 here, and part 3 here. I have another series for using MapReduceV1, which is here. I'm hoping to keep this series in a similar order to the original set of articles, and will deviate only when necessary. All the content here is based on the Cloudera documentation, but I've modified it to be easier to follow for setting up a pseudo cluster and added additional content where necessary.
Please be careful when copying lines from these articles to paste into Hadoop config files or a terminal window. I have found that the double hyphen characters used in the comment lines may copy over as a long hyphen instead. This is likely to cause issues when attempting to run the various components.
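One way to catch stray characters after pasting is to scan the file for any non-ASCII bytes. A minimal sketch is below; the config file path is only an example (substitute your own), and the `-P` flag assumes GNU grep:

```shell
# List any lines containing non-ASCII characters (such as an en/em dash
# pasted in place of a double hyphen). The path is an example; substitute
# the file you just edited. -P (Perl-style regex) requires GNU grep.
grep -nP '[^\x00-\x7F]' /etc/hadoop/conf/core-site.xml
```

No output means the file contains only plain ASCII; any matches are printed with their line numbers so you can retype the offending hyphens by hand.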
Before starting, make sure that Python 2.6 or 2.7 is installed on the server. This is easy to check: open a terminal window and, from the command line, enter: python
If Python is installed, it will load up and display the version of the software. On my test PC, it responded with Python 2.6.6. Return to the command line by entering quit() at the Python prompt.
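If you prefer not to enter the interactive interpreter, the same check can be done as a one-liner. This assumes the interpreter is on the PATH as `python`; on newer distributions the binary may be `python3` instead:

```shell
# Print the interpreter version without entering the interactive shell.
# Assumes `python` is on PATH; substitute `python3` if needed.
python -c 'import sys; print("Python %d.%d" % sys.version_info[:2])'
```

The printed major.minor pair should be 2.6 or 2.7 for the setup described here.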