I visited Washington DC last week for a few days, and spent a couple of hours touring the National Cathedral. Its a huge awe inspiring building, and I snapped the image here from the lower level looking up.
Moving data between different systems often requires converting between different character encoding specifications. If you aren’t familiar with the term, it means how characters are stored programmatically. In Latin character based languages like English, there are fewer characters, and they require a smaller amount of code to represent them. When computers had a much smaller amount of memory, processing power and disk space the smaller foot print of the ASCII and Windows 1252 characters sets were widely used to conserve resources. However, in other non-Latin character based languages, there can be a significantly larger amount of characters that make them up. Consequently, more system resources are required to represent them.
Today, with larger amounts of data than ever before moving between different counties, moving that information between systems with different character schemes can result in scrambled data if the underlying encoding isn’t taken into account. UTF-8 and UTF-16 encoding schemas are the most prevalent and widely accepted specifications, and should be used in place of the older ASCII and Windows 1252 schemes. Moving data to this format with Pentaho Data Integrator (aka Kettle) can be handled with a transform component called Select Values, although the method to perform this process is somewhat hidden.
Generally when developing an ETL process, if you have to replace a value from a source with a corresponding value, you should use a lookup table. For example, if you were replacing a country abbreviation with the full name of the country, you could have a simple 2 column table with the abbreviation in one column and the full name in the other. By using a lookup table it becomes very easy to update values, enter new ones, or possibly delete obsolete ones. This also gives you the added benefit of being able to reuse your lookup table if you need to in other places. But what if you have a small group of items (say a dozen or less) that you need to replace? In that case you might want to look at using Pentaho’s Value Mapper component.
The Pentaho Community Edition 5.3 Business Intelligence suite was released a few weeks ago, and I downloaded the ETL application tonight to install on my Mac. As with the past few versions, the application generates an error when you try to start it on Mac OS X because of security features in the operating system. I’ve covered previously a couple of ways to overcome those security issues and recently reader Ian emailed me with a third method that I decided to try out.