In part 1 of this series, I walked through the beginning steps for setting up a single-node Hadoop pseudo-cluster. In that article, I showed you how to configure CentOS 6.5 with the necessary prerequisites for installation, download Cloudera’s Hadoop distribution (CDH5) and install it. In this article, I’ll explain how to install:
- HBase – an open source non-relational database that runs on top of the Hadoop Distributed File System (HDFS),
- ZooKeeper – a service that provides distributed configuration, synchronization and naming services for Hadoop,
- and Snappy – a fast data compression and decompression library that integrates with many different components in the Hadoop ecosystem.
I’ve been waiting for the open source release of Cloudera’s Hadoop distribution, CDH 5, so I could try installing a pseudo-distributed cluster with the Hue GUI. (If you are not familiar with the terminology, pseudo-distributed mode allows you to run Hadoop on one machine, with the various daemons each running in a separate JVM.) By setting up a pseudo-distributed cluster, I could free up two other machines for other projects I’m working on.
When working on ETL flows, it’s sometimes useful to store information in temporary files, as long as you clean those files up. Pentaho Data Integration (aka Kettle or PDI) has two steps for deleting files: one handles a single file, and one handles multiple files. Both are in the File Management section of the Design node in the Job designer.
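The same write, use, then clean up pattern can be sketched outside PDI. The Python below is only an illustration of the idea, not how the PDI delete steps work; the function name `run_with_temp_file` and the CSV layout are made up for the example.

```python
import os
import tempfile


def run_with_temp_file(rows):
    """Write rows to a temporary CSV file, read them back, and always delete the file.

    Hypothetical helper for illustration; returns (row_count, temp_path).
    """
    # mkstemp creates the file and returns an open descriptor plus its path.
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w") as handle:
            for row in rows:
                handle.write(",".join(row) + "\n")
        # Stand-in for the real work an ETL flow would do with the file.
        with open(path) as handle:
            count = sum(1 for _ in handle)
    finally:
        # Clean up even if the processing above raised an error.
        os.remove(path)
    return count, path
```

Putting the delete in a `finally` block plays the same role as ending a job with a delete-file step: the temporary file goes away whether or not the work in between succeeded.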
Spoon is the graphical front end for designing ETL workflows for Pentaho Data Integration, also known as Kettle. The latest community edition (5.0.1) was released in November of 2013, with versions for Windows, Linux and Mac. On the first two platforms it works very well as soon as you extract the archive, but unfortunately on Mac OS X 10.9 (Mavericks) there are some issues. It is possible to get it to run, but it’s not easy.
I’ll assume that you have Data Integration downloaded and extracted on your system, and Java 1.6 installed. The instructions from Pentaho say you can run the Data Integration.app to launch Kettle, but on the systems I’ve tried this on, I get an error message that the app is damaged. If you are experiencing this, don’t click “Move to Trash!” There are a couple of ways to get it working.
The first method is pretty straightforward.
- Hold down the Control key on your keyboard and click on the Data Integration application. A menu will appear and you can then click on Open near the top. You’ll then see an Are You Sure warning window, where you can click Open again. The application will then start. Simple!
Just a quick note today. A couple of weeks ago, the Spark project over at Apache graduated to a top-level project and it can now be integrated into your Cloudera environment very easily!
Spark is an in-memory data analytics framework that integrates with Hadoop and can read its data from HDFS (the Hadoop file system). The project claims it runs programs up to 100x faster than MapReduce when working in memory. On disk the claimed speedup is smaller, about 10x faster than MapReduce. It supports a number of different programming languages (Python, Java, Scala), can be used with UC Berkeley’s Shark application to see similar speed increases with Hive, and it can read from HBase and Cassandra data sources as well.
If you’d like to add Spark to your existing Cloudera cluster, head on over to Cloudera’s website for instructions on how to install it.
Sysadmin purists often look down on people who use a GUI to handle tasks on their servers, but after working for several years on Novell NetWare at the beginning of my career (shudder), I say give me a GUI over a command line every time! On my home CentOS servers, I have the GNOME desktop environment loaded, and it makes me a lot more productive, because I don’t have to remember the locations of many scripts, or the various command line switches to run various applications.
Recently, I was installing a replacement server for my Hadoop cluster, and I found that the Services GUI option was not present under the System – Administration menu. A little hunting turned up that the system-config-services package wasn’t installed. If this happens to you, here’s a quick way to get it back. Open up a terminal – kidding!!!
- Start the Add/Remove Software application from the Administration menu under System.
- Search for system-config-services and check the two options that should appear. One is the application, the other the documentation.
- Click Apply down in the lower right corner, and authenticate as root.
- Wait a few seconds, then check under System – Administration. Services should be back, right above Software Update.
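For readers who do end up in a terminal anyway, the check and install above can be sketched in Python. This is a rough sketch under assumptions the post does not state: Python 3.5 or later is available (CentOS 6 ships an older Python by default), and the standard `rpm` and `yum` tools are on the path. The helper names `is_installed` and `install_command` are made up for illustration.

```python
import shutil
import subprocess

PACKAGE = "system-config-services"


def is_installed(package):
    """Return True/False from `rpm -q`, or None when rpm is not available."""
    if shutil.which("rpm") is None:
        return None
    # `rpm -q <name>` exits with 0 only when the package is installed.
    result = subprocess.run(
        ["rpm", "-q", package],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def install_command(package):
    """Build (but do not run) the yum command that installs the package."""
    return ["yum", "install", "-y", package]


if __name__ == "__main__":
    status = is_installed(PACKAGE)
    if status is None:
        print("rpm not found; this does not look like an RPM-based system.")
    elif status:
        print(PACKAGE + " is already installed.")
    else:
        # Only print the command; installing needs root privileges.
        print("Run as root: " + " ".join(install_command(PACKAGE)))
```

The script only reports what it finds and prints the install command rather than running it, since the install itself needs root, just as the GUI route asks you to authenticate.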