Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig
Big Data is the hotness, there is no doubt about it. Every year it has gotten bigger and bigger, and it shows no sign of slowing. There is a lot out there about big data, but despite the hype, there isn’t a lot of good technical content for those who want to get started. The lack of technical how-to info is made worse by the fact that many Hadoop projects have moved their documentation around over time, so Google searches commonly point to obsolete docs. My intent here is to provide some solid guidance on how to actually get started with practical uses of Hadoop and to encourage others to do the same.
From an SA perspective, the most interesting Hadoop sub-projects have been those for log transport, namely Scribe, Chukwa, and Flume. Let’s examine each.
Log Transport Choices
Scribe was created at Facebook and gained a lot of popularity early on due to adoption at high-profile sites like Twitter, but development has apparently ceased and word is that Facebook has stopped using it themselves. So Scribe is off my list.
Chukwa is a confusing beast. It’s said to be distributed with Hadoop’s core, but that is just an old version in the same sub-directory of the FTP site; the actual current version is found under the incubator sub-tree. It is a very comprehensive solution, including a web interface for log analysis, but that functionality is based on HBase, which is fine if you want to use HBase but may be a bit more than you wish to bite off for simple Hive/Pig analysis. Most importantly, the major Hadoop distributions from Hortonworks, MapR, and Cloudera use Flume instead. So if you’re looking for a comprehensive toolset for log analysis, Chukwa is worth checking out, but if you simply need to efficiently get data into Hadoop for use by other Hadoop components, Flume is the clear choice.
That brings us to Flume, more specifically Flume-NG. The first thing to know about Flume is that it changed substantially between the pre-1.0 and post-1.0 releases, enough that people took to referring to pre-1.0 as “Flume OG” (“Old Generation” or “Original Gangsta”, depending on your mood) and to the new post-1.0 releases as “Flume NG”. Whenever you look at documentation or help on the web about Flume, be certain which one you are looking at! In particular, stay away from the Flume CWiki pages and refer only to flume.apache.org. I say that because there is so much old cruft in the CWiki pages that you can easily be misled and become frustrated, so just avoid them.
Now that we’ve thinned out the available options, what can we do with Flume?