Get MongoDB data with Pentaho Data Integration

Pentaho Data Integration (aka PDI or Kettle) is one of the most fully featured tools for extracting data from a MongoDB environment. MongoDB stores information in documents instead of records: the data for a distinct subject instance lives in a single document, where a traditional database might spread it across multiple tables linked via primary and foreign keys and joins. This paradigm generally makes retrieval quicker than with a comparable relational database system. If you are using PDI to connect to MongoDB, it will probably be the initial source or final destination for the data. For this article, I’ll cover how to use PDI to extract information from a MongoDB collection and save it to a text file. It could just as easily be passed on to another database or manipulated for other processing downstream.
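To make the document-to-text-file idea concrete, here is a minimal sketch in plain Python rather than PDI itself. It assumes a couple of made-up documents shaped like what a MongoDB collection might return (the field names and the `flatten` and `documents_to_text` helpers are hypothetical, purely for illustration). It shows the two things any such extract has to deal with: nested fields, and documents that do not all carry the same fields.

```python
import csv
import io

# Hypothetical documents, standing in for what a MongoDB collection might hold.
documents = [
    {"_id": 1, "name": "Ada",
     "address": {"city": "London", "country": "UK"},
     "tags": ["admin", "dev"]},
    {"_id": 2, "name": "Linus", "address": {"city": "Helsinki"}},
]


def flatten(doc, parent=""):
    """Flatten nested dicts into dotted keys; join lists into a single cell."""
    row = {}
    for key, value in doc.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            row.update(flatten(value, path))
        elif isinstance(value, list):
            row[path] = "|".join(str(item) for item in value)
        else:
            row[path] = value
    return row


def documents_to_text(docs, delimiter=";"):
    """Render documents as delimited text with one header line."""
    rows = [flatten(doc) for doc in docs]
    # Documents need not share fields, so use the union of all keys,
    # in the order they are first seen, as the column list.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter=delimiter,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(documents_to_text(documents))
```

The second document has no country or tags, so those cells come out empty. A flat file needs a fixed set of columns while a collection does not, and that mismatch is the main thing to plan for in an extract like this.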

I do make one assumption: you have a development MongoDB environment set up and running.

Start MongoDB on Demand – Ubuntu

I use my laptop for development and testing on a number of different database platforms. And for a host of reasons, I like to keep programs installed locally rather than having to connect to another box whenever possible. That can present a challenge, however, because most database platforms want to start up when the computer starts up. It makes sense (because typically these are server-based applications), but I usually don’t want them to start up when I start the laptop. My machine has enough to contend with when I am working without having to manage the overhead of several databases, especially since I don’t use every platform every day.

Recently, I started a project using MongoDB, and installed it via the Ubuntu Software Center. Because of some incompatibilities between Ubuntu 15.04 and MongoDB 3.0.4, the Software Center installed version 2.6.3. After restarting my laptop the next day, I noticed that Mongo starts when the laptop starts. Not good! So I poked around a bit to figure out how to disable this, and start it only when I want to start it.




A couple of years ago, Cloudera released an open source application to query data stored in Hadoop with much of the familiar SQL syntax used by database professionals. Cloudera seemed to have positioned Impala as a replacement for Hive and Pig and has taken some hits for it. Regardless of corporate motivations, because my day-to-day work over the past 8 years has revolved around using SQL to develop and administer various DB systems, I have taken a keen interest in Impala and how it might be useful. (I’m also interested in Hortonworks’ Stinger initiative to improve Hive, but that will be a different post.)

One of the biggest issues with open source applications, as I have noted before, is the lack of documentation and training materials for people trying to use them. Those of us who work in the corporate world don’t have the luxury of figuring stuff out on our own at our day jobs, so we often look beyond the supplied documentation for better resources for learning new applications. In the past year, two publishers have released books on Cloudera Impala. I will look at both, compare and contrast them, and tell you which one I think is better.


Connecting Kettle to Cloudera Hadoop Impala

As Big Data platforms like Hadoop and its ecosystem of related applications have matured, they have moved beyond the original key-value model to embrace processing of more traditional structured data. But a big problem for DBAs and data analysts who want to use the power of these new platforms to analyze data from RDBMS systems like MySQL and SQL Server is getting data moved between them. Using CSV or flat files is one way, but it adds processing: data has to be extracted from the source system to an intermediary format and then imported into the destination. It’s far more efficient and less prone to error if the data can be passed without that middle step.
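The cost of that middle step is easy to see in a small sketch. The example below is an illustration only: it uses two in-memory SQLite databases as stand-ins for the source RDBMS and the destination (not MySQL, SQL Server, or Hadoop), and the `orders` table and function names are hypothetical. It copies the same rows twice, once through an intermediate CSV and once by passing rows straight from one connection to the other.

```python
import csv
import io
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)"


def make_source():
    """Stand-in source database with two sample rows."""
    src = sqlite3.connect(":memory:")
    src.execute(SCHEMA)
    src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    [(1, "acme", 19.5), (2, "globex", 7.0)])
    return src


def make_destination():
    """Stand-in destination database with an empty table."""
    dst = sqlite3.connect(":memory:")
    dst.execute(SCHEMA)
    return dst


def copy_via_flat_file(src, dst):
    # Step 1: extract the rows to an intermediate CSV.
    buffer = io.StringIO()
    csv.writer(buffer).writerows(
        src.execute("SELECT id, customer, total FROM orders"))
    # Step 2: parse the CSV and import it. Every value comes back as a
    # string, so the types have to be restored by hand.
    buffer.seek(0)
    rows = [(int(i), customer, float(total))
            for i, customer, total in csv.reader(buffer)]
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


def copy_direct(src, dst):
    # Rows flow from one connection to the other with their types intact.
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    src.execute("SELECT id, customer, total FROM orders"))
```

Both functions end with the same rows in the destination, but the flat-file route needs an extra serialize-and-parse pass plus manual type handling, which is where the additional processing and the room for error come from.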

In this first article of a series looking at interactivity between Hadoop and other database systems, I’ll cover setting up a database connection from Pentaho’s Kettle ETL system to Hadoop via Cloudera’s Impala JDBC driver.
