Interesting Docker Containers (and some tips on running them)

I’ve been learning how to use Docker for the last couple of months. Part of that experience has been downloading and working with various freely available containers from the Docker Hub. Since an ever-increasing number of applications are web based (i.e., using a website as the UI), porting many open source projects to Docker is becoming less complex. While not all open source applications can benefit yet from containerization, I think it’s only a matter of time before this technology really gains in popularity.
My area of interest tends more toward Business Intelligence and related technologies, so the containers I’ve been experimenting with fall in that realm. This time around, I’ll discuss these containers and my experience getting them running on a native Docker installation on Ubuntu:
  • RStudio
  • Cloudera Hadoop
  • Odoo with PostgreSQL

Continue reading “Interesting Docker Containers (and some tips on running them)”


REVIEW: IMPALA Books

A couple of years ago, Cloudera released Impala, an open source application to query Hadoop-stored data with much of the familiar SQL syntax used by database professionals. Cloudera seems to have positioned Impala as a replacement for Hive and Pig and has taken some hits for it. Regardless of corporate motivations, because my day-to-day work over the past 8 years has revolved around using SQL to develop and administer various DB systems, I have taken a keen interest in Impala and how it might be useful. (I’m also interested in Hortonworks’ Stinger initiative to improve Hive, but that will be a different post.)

One of the biggest issues with open source applications, as I have noted before, is the lack of documentation and training materials for people trying to use them. Those of us who work in the corporate world don’t have the luxury of figuring things out on our own at our day jobs, so we often look beyond the supplied documentation for better resources for learning new applications. In the past year, two publishers have released books on Cloudera Impala. I will look at both, compare and contrast them, and tell you which one I think is better.

Continue reading “REVIEW: IMPALA Books”

Connecting Kettle to Cloudera Hadoop Impala

As Big Data platforms like Hadoop and its ecosystem of related applications have matured, they have moved beyond the original key-value model to embrace processing of more traditional structured data. But a big problem for DBAs and Data Analysts wanting to use the power of these new platforms to analyze data from RDBMS systems like MySQL and SQL Server is getting data moved between them. Using CSV or flat files is one way, but it adds additional processing: data has to be extracted from the source system to an intermediary format and then imported into the destination. It’s far more efficient and less prone to error if the data can be passed without that middle step.

In this first article of a series looking at interactivity between Hadoop and other database systems, I’ll cover setting up a database connection from Pentaho’s Kettle ETL tool to Hadoop via Cloudera’s Impala JDBC driver.
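For reference, Impala listens for JDBC clients on port 21050 by default, and on an unsecured cluster the Hive2-protocol driver is typically pointed at a URL of this shape. This is a sketch of the general form only (the hostname is a placeholder, and your port and auth settings may differ):

```
# Hive2-protocol JDBC URL for Impala -- assumes default port 21050 and no SASL auth
jdbc:hive2://impala-host.example.com:21050/;auth=noSasl
```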

Continue reading “Connecting Kettle to Cloudera Hadoop Impala”

Setup a Single-node Hadoop Yarn machine using CDH5 – Part 4

This is part 4 of a series about setting up a single-node Hadoop Yarn system for sandbox use. Part 1 was here, part 2 here, and part 3 here. I have another series for using MapReduceV1, which is here. I’m hoping to keep this series in the same order as the original set of articles, and will deviate only when necessary. All the content here is based on the Cloudera documentation, but I’ve modified it to be easier to follow for setting up a pseudo cluster and added additional content where necessary.

Please be careful when copying lines from these articles to paste into Hadoop config files or a terminal window. I have found that the double-hyphen characters used in comment lines may copy over as a single long hyphen (an en or em dash) instead. This is likely to cause issues when attempting to run the various components.
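If you want to double-check a pasted line before saving it, a few lines of Python can flag those long-hyphen characters. This is just a small helper sketch, not part of the Cloudera setup itself:

```python
# Characters that word processors and web pages often substitute for "--":
SUSPECT_HYPHENS = {u"\u2013": "en dash", u"\u2014": "em dash"}

def find_bad_hyphens(line):
    """Return (index, name) pairs for any long-hyphen characters in line."""
    return [(i, SUSPECT_HYPHENS[ch]) for i, ch in enumerate(line)
            if ch in SUSPECT_HYPHENS]

# A comment line pasted with an em dash instead of a double hyphen:
print(find_bad_hyphens(u"# \u2014 configure the NameNode here"))  # [(2, 'em dash')]
print(find_bad_hyphens(u"# -- plain double hyphen"))              # []
```

An empty result means the line contains only plain ASCII hyphens and is safe to paste.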

Before starting, make sure that Python 2.6 or 2.7 is installed on the server. This is easy to check: open a terminal window and, at the command line, enter: python

If Python is installed, the interpreter will load and display its version. On my test PC, it reported Python 2.6.6. Return to the command line by entering quit() at the Python prompt.
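The same check can be done non-interactively. A minimal sketch (it also runs under Python 3, in which case it simply prints the warning):

```python
import sys

# Report the running interpreter's version; CDH5 components expect 2.6 or 2.7.
major, minor = sys.version_info[:2]
print("Python %d.%d" % (major, minor))
if (major, minor) not in ((2, 6), (2, 7)):
    print("Warning: CDH5 components expect Python 2.6 or 2.7")
```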

Continue reading “Setup a Single-node Hadoop Yarn machine using CDH5 – Part 4”