One of the gotchas I’ve encountered when installing Cloudera on CentOS Linux (and I’m assuming it holds true for RHEL) is the presence of SELinux. SELinux is a kernel security module that enforces some very strong access control measures. Disabling it won’t leave your system completely open to outside attacks, but it does make them easier.
If you plan to expose your Hadoop system to the Internet in any way, you probably do not want to do what I am going to show you here. Or if you do, re-enable SELinux once you are finished. Use at your own risk.
To disable SELinux, open a terminal and switch to the superuser (root). Then, in your favorite editor, open the file /etc/sysconfig/selinux. The file itself is very small, and you only need to make one change. About halfway down you will see a line like this:
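The line in question is the SELINUX setting. On a stock CentOS install, the file looks something like this (comments included for context):

```
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#   enforcing - SELinux security policy is enforced.
#   permissive - SELinux prints warnings instead of enforcing.
#   disabled - No SELinux policy is loaded.
SELINUX=enforcing
```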
Change the word “enforcing” to “disabled” and save the file. Restart your box; SELinux will be disabled, and you can install Cloudera.
When using an ETL tool, you often create a workflow that will run multiple times, picking up new and changed records. Getting just those changes is fairly easy using Kettle’s Insert/Update step. (By the way, Kettle is one component of the Pentaho Data Integration application, PDI for short.)
Assumptions and requirements
For this tutorial, I am assuming you have access to a MySQL (or MariaDB) database server. We’ll be creating a sample database based on one originally created by Fusheng Wang and Carlo Zaniolo at Siemens Corporate Research. For our purposes we only need one table with a small amount of data. Copy the script below and save it as a SQL file on your system. Run it in MySQL to create the database and populate the table (and yes, “Production” is spelled incorrectly). Continue reading
I haven’t posted any new pictures in a while, so here is one from this past weekend. I was in Ashtabula County, Ohio, near the township of Geneva-on-the-Lake. There is a state park there with a hotel (The Lodge) that sits right on the edge of Lake Erie.
This time of year there isn’t much to see other than frozen water and snow, but I did happen upon this gazebo at the Lodge. I got a few pictures of it during the day, with the sun up and the skies a pale blue, but they weren’t too great.
It’s often said that when you work closely with something, you sometimes lose track of the big picture (you can’t see the forest for the trees). But I also find that, a lot of the time, people who aren’t familiar with something focus too much on the details and miss the big picture. When I discuss my Hadoop cluster project with people, they don’t understand the point of it. “Why set up four PCs to work on something? Can’t you just get one really powerful PC and do the work with that?” they ask.
DIVIDE THE WORK
The innovation that Hadoop provides is in HOW it does the work. For many years, in IT as well as other areas, the solution when faced with bigger and bigger workloads has been to get a bigger, faster, stronger tool. Here are some examples: Continue reading
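To make the divide-the-work idea concrete, here is a rough sketch of the same map/reduce pattern using nothing but standard shell tools. The file names (input.txt, counts.txt) and the sample data are made up for illustration; real Hadoop distributes the chunks across machines in a cluster, not just across processes on one box.

```shell
# Sample input (hypothetical data, just to have something to count).
printf 'hadoop divides the work\nthe work divides hadoop\n' > input.txt

# Divide: split the input into chunks that can be processed independently.
split -l 1 input.txt chunk_

# Map: count the words in each chunk, running up to 4 chunks in parallel.
ls chunk_?? | xargs -P 4 -I{} sh -c 'tr " " "\n" < {} | sort | uniq -c > {}.out'

# Reduce: merge the partial counts into one final tally.
cat chunk_*.out | awk '{count[$2] += $1} END {for (w in count) print count[w], w}' \
  | sort -k2 > counts.txt
cat counts.txt
# → 2 divides / 2 hadoop / 2 the / 2 work
```

Each chunk is processed with no knowledge of the others, which is exactly what lets the work scale out by adding more workers instead of buying one bigger machine.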
After re-reading my initial post on What is Big Data? I decided a little more clarification was in order.
Humans are generating more data now than at any point in our history. One figure often thrown out is that 90% of all data has been created in the past few years. I think that may be misleading, because a lot of the data being generated isn’t really human-created data but machine-generated data. Human data might be a blog post like this one or a book from your local library.
This is a continuation of my series on setting up a Hadoop Cluster using Cloudera’s distribution.
When using the HBase application, or Impala, you may receive errors about the Thrift service being unavailable. From what I have found, this is because Cloudera doesn’t install the Thrift service as part of the automated installation. Continue reading