Hunk is the Hadoop version of Splunk, and it can be used to create reports and dashboards to examine the state of the data on a Hadoop cluster. The tool offers search, reporting, alerts, and dashboards from a web-based user interface. Let’s look at the installation and uses of Hunk, as well as some simple reports and dashboards.
· Installing Hunk
By way of example, I install Hunk onto the CentOS 6 Linux host hc2nn and connect it to the Cloudera CDH5 Hadoop cluster on the same node. Before downloading the Splunk software, though, I must first create an account and register my details. I source Hunk from
Version 6 is about 100 MB. I install the software by using the Linux hadoop account on the CentOS host. Because I am logged in as hadoop, the downloaded file is saved to that account's Downloads directory, as follows:
[hadoop@hc2nn ~]$ pwd
[hadoop@hc2nn Downloads]$ ls -l hunk-6.1.3-228780-Linux-x86_64.tar.gz
-rw-r--r-- 1 hadoop hadoop 105332713 Oct 28 18:19 hunk-6.1.3-228780-Linux-x86_64.tar.gz
This is a gzip-compressed tar file, so it needs to be unpacked by using the Linux gunzip and tar commands. The gunzip command decompresses the .tar.gz file to produce a tar archive, and the tar command then extracts the contents of that archive to create the Hunk installation directory. In the tar options, x means extract, v means verbose, and f specifies the tar file to use:
[hadoop@hc2nn Downloads]$ gunzip hunk-6.1.3-228780-Linux-x86_64.tar.gz
[hadoop@hc2nn Downloads]$ tar xvf hunk-6.1.3-228780-Linux-x86_64.tar
[hadoop@hc2nn Downloads]$ ls -ld *hunk*
drwxr-xr-x 9 hadoop hadoop 4096 Nov 1 13:35 hunk
The ls -ld Linux command provides a long listing of the Hunk installation directory that has just been created. The l option requests the long format, while the d option lists the directory entry itself rather than its contents.
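As an aside, GNU tar can decompress and extract in a single step with the z option, which avoids the separate gunzip call. The sketch below demonstrates this on a small stand-in archive built in a scratch directory; the demo names are hypothetical and are not part of the Hunk package.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a tiny stand-in archive (hypothetical names, not the Hunk package).
mkdir -p demo/bin
echo "placeholder" > demo/bin/readme.txt
tar czf demo.tar.gz demo
rm -r demo

# x = extract, z = decompress with gzip on the fly, f = archive file to read.
tar xzf demo.tar.gz

# -l gives the long format; -d shows the directory entry, not its contents.
ls -ld demo
```

Applied to the real download, the same options would replace the two-step gunzip and tar sequence, leaving the original .tar.gz file in place.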
Having created the installation directory, I now move it to a good location, which will be under /usr/local. I need to use the root account to do this because the hadoop account will not have the required access:
[hadoop@hc2nn Downloads]$ su -
[root@hc2nn Downloads]# mv hunk /usr/local
[root@hc2nn Downloads]# cd /usr/local
[root@hc2nn local]# chown -R hadoop:hadoop hunk
[root@hc2nn local]# exit
[hadoop@hc2nn Downloads]$ cd /usr/local/hunk
The Linux su command switches the current user to the root account. The Linux mv command moves the Hunk directory from the hadoop account Downloads directory to the /usr/local/ directory as root. The cd command then switches to the /usr/local/ directory, and the chown command changes the ownership and group membership of the installation to hadoop. The -R switch means that ownership is changed recursively, so all underlying files and directories are affected. The exit command then returns the session to the hadoop login, and the final line changes the directory to the new installation under /usr/local/hunk.
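To confirm that a recursive ownership change took effect, one option is to search the tree for anything not owned by the expected user and group. The sketch below runs against a scratch directory and uses the current user's numeric IDs so that it works without root; on the real host the target would be /usr/local/hunk and the owner would be hadoop.

```shell
# Scratch tree standing in for the installation directory (hypothetical).
install_dir=$(mktemp -d)
mkdir -p "$install_dir/bin" "$install_dir/etc/system/local"
touch "$install_dir/bin/splunk" "$install_dir/etc/system/local/indexes.conf"

# Use the current user's numeric IDs so the demo needs no root access.
owner=$(id -u)
group=$(id -g)

# -R applies the change to the directory and everything beneath it.
chown -R "$owner:$group" "$install_dir"

# Count entries whose owner or group differs from the expected values.
mismatches=$(find "$install_dir" \( ! -user "$owner" -o ! -group "$group" \) | wc -l)
echo "Entries with unexpected ownership: $mismatches"
```

A count of zero suggests that the change reached every file and subdirectory.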
Now that Hunk is installed and in the correct location, I need to configure it so that it will be able to access the Hadoop cluster and the data that the cluster contains. This involves creating three files—indexes.conf, props.conf, and transforms.conf—under the following Hunk installation directory:
[hadoop@hc2nn local]$ cd /usr/local/hunk/etc/system/local
Of these three files, the indexes.conf file provides Hunk with the means to connect to the Hadoop cluster. For example, to create a provider entry, I use a sequence similar to the following:
[hadoop@hc2nn local]$ cat indexes.conf
vix.family = hadoop
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-s6.0-hy2.0.jar
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.6.0-openjdk.x86_64
vix.fs.default.name = hdfs://hc2nn:8020
vix.splunk.home.hdfs = /user/hadoop/hunk/workdir
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = hc2nn:8032
vix.yarn.resourcemanager.scheduler.address = hc2nn:8030
vix.mapred.job.map.memory.mb = 1024
vix.yarn.app.mapreduce.am.staging-dir = /user
vix.splunk.search.recordreader.csv.regex = \.txt$
This entry creates a provider entry called cdh5, which describes the means by which Hunk can connect to HDFS, the Resource Manager, and the Scheduler. The entry describes where Hadoop is installed (via HADOOP_HOME) and the source of Java (via JAVA_HOME). It specifies HDFS access via the local host name and name node port of 8020. Resource Manager access will be at port 8032, and Scheduler access is at port 8030. The framework is described as YARN, and the location on HDFS that Hunk can use as a working directory is described via the property vix.splunk.home.hdfs.
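Splunk configuration files group settings under a bracketed stanza header, which the listing above does not show. The sketch below writes a trimmed provider stanza to a scratch file and checks that the connection keys discussed above are present. The [provider:cdh5] header is my assumption, based on the provider name cdh5; the key names and values are copied from the listing.

```shell
# Scratch file standing in for etc/system/local/indexes.conf.
conf_file=$(mktemp)

# Assumption: the stanza header is [provider:cdh5]; keys come from the listing.
cat > "$conf_file" <<'EOF'
[provider:cdh5]
vix.family = hadoop
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.6.0-openjdk.x86_64
vix.fs.default.name = hdfs://hc2nn:8020
vix.splunk.home.hdfs = /user/hadoop/hunk/workdir
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = hc2nn:8032
vix.yarn.resourcemanager.scheduler.address = hc2nn:8030
EOF

# Report any connection-related key that is missing from the file.
missing=0
for key in vix.family vix.env.HADOOP_HOME vix.env.JAVA_HOME \
           vix.fs.default.name vix.splunk.home.hdfs \
           vix.yarn.resourcemanager.address; do
  if ! grep -q "^${key} *=" "$conf_file"; then
    echo "missing key: $key"
    missing=$((missing + 1))
  fi
done
echo "Missing keys: $missing"
```

A check like this only confirms that the keys exist; it does not verify that the host names, ports, or paths are correct for a given cluster.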
The second file, props.conf, describes the location on HDFS of a data source that is stored under /data/hunk/rdbms/. The cat command dumps the contents of the file, and the extractcsv value refers to an entry in the file transforms.conf that describes the contents of the data file:
[hadoop@hc2nn local]$ cat props.conf
[source::/data/hunk/rdbms/...]
REPORT-csvreport = extractcsv
The third file, transforms.conf, contains an entry called extractcsv, which is referenced in the props.conf file above. It has two properties: the DELIMS value describes how the fields on each data line are delimited (in this case, by commas), and the FIELDS property names those fields. In my case, FIELDS describes 14 fields of vehicle fuel-consumption data; adapt the field list to match your own data.
[hadoop@hc2nn local]$ cat transforms.conf
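Because the file contents are not reproduced here, the following is only a sketch of how a DELIMS and FIELDS pairing maps a comma-separated line onto named fields. The three field names and the sample line are hypothetical placeholders, not the 14 fields of the real data set, and the awk step merely imitates the split locally; it is not how Hunk itself parses the data.

```shell
# Scratch file standing in for transforms.conf (field names are hypothetical).
transforms_file=$(mktemp)
cat > "$transforms_file" <<'EOF'
[extractcsv]
DELIMS = ","
FIELDS = "year","manufacturer","model"
EOF

# Hypothetical sample record with one value per field.
sample_line='1995,ACME,ROADSTER'

# Imitate the split locally: break the line on commas and label each value.
labelled=$(echo "$sample_line" | awk -F',' '{
  split("year,manufacturer,model", names, ",")
  for (i = 1; i <= NF; i++) printf "%s=%s\n", names[i], $i
}')
echo "$labelled"

# One labelled line per field is expected.
field_count=$(echo "$labelled" | wc -l)
```

The point of the pairing is positional: the first name in FIELDS labels the first delimited value, the second name labels the second, and so on.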
Here’s a sampling of the CSV file contents via an HDFS file system cat command, which dumps the contents of the file /data/hunk/rdbms/rawdata.txt. The Linux head command limits the output to five lines:
[hadoop@hc2nn local]$ hdfs dfs -cat /data/hunk/rdbms/rawdata.txt | head -5
Now that some basic configuration files are set up, I can start Hunk.
Hunk is started from the bin directory within the installation as the Linux hadoop account user.
You start Hunk by using the splunk command:
[hadoop@hc2nn local]$ cd /usr/local/hunk/bin
[hadoop@hc2nn bin]$ ./splunk start --accept-license
When you first start Hunk, you must use the --accept-license option; after that, it may be omitted.
When starting, Hunk reads its configuration files, so you need to monitor the output for configuration error messages, such as:
Checking conf files for problems...
Invalid key in stanza [source::/data/hunk/rdbms/...] in /usr/local/hunk/etc/system/local/props.conf, line 3: DELIMS (value: ", ")
If any errors occur, you can fix the configuration files and restart Hunk, as follows:
[hadoop@hc2nn bin]$ ./splunk restart
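When the startup output is long, it can help to capture it and filter for the configuration warnings. The sketch below does this against a saved copy of the two message lines shown above; on a live system the file would instead be filled from the output of ./splunk start, which appears only as a comment here because it needs a Hunk installation.

```shell
# Scratch file holding a copy of the startup messages quoted in the text.
startup_log=$(mktemp)
cat > "$startup_log" <<'EOF'
Checking conf files for problems...
Invalid key in stanza [source::/data/hunk/rdbms/...] in /usr/local/hunk/etc/system/local/props.conf, line 3: DELIMS (value: ", ")
EOF

# On a real host the log could be captured instead with something like:
#   ./splunk start 2>&1 | tee "$startup_log"

# Count the configuration complaints; "|| true" keeps a zero count non-fatal.
invalid_count=$(grep -c 'Invalid key in stanza' "$startup_log" || true)

if [ "$invalid_count" -gt 0 ]; then
  echo "Found $invalid_count configuration problem(s):"
  grep 'Invalid key in stanza' "$startup_log"
else
  echo "No invalid keys reported."
fi
```

Each reported line names the stanza, file, and line number, which is usually enough to locate the offending key before restarting.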
If all is well, you are presented with a message containing the URL at which to access Hunk’s web-based user interface:
The Splunk web interface is at http://hc2nn:8000
You will need to log in with the account name “admin” and the initial password “changeme,” which you will immediately be prompted to change. Once logged in, you will see the Virtual Indexes page, which, as shown in the figure below, displays the provider cdh5 in the family hadoop that was created in the indexes.conf file. If you don’t see the Virtual Indexes page, select Settings and then Virtual Indexes from the top menu bar.
You can click the cdh5 entry to examine the provider’s details. The entire list of provider properties is too large to display here, but know that Hunk automatically adds extra entries like vix.splunk.search.recordreader, which defines how CSV files will be read. To fit most of the details into the figure below, I arranged the list in two columns.
Note that the Hadoop version in this figure is set to YARN to reflect the CDH5 YARN version. It has not been necessary to specify the Hadoop supplier name.
Now, click Cancel to leave the cdh5 properties view, and click on the Virtual Indexes tab. For our example, this tab shows that a single virtual index called cdh5_vindex has been created in Hunk, as shown in the following figure.
Virtual indexes are the means by which Hunk accesses the Hadoop cluster-based data. They enable Hunk to use MapReduce against the data and present the results within Hunk reports. By selecting the cdh5_vindex entry, you can examine the attributes of this virtual index (following figure). Currently, the entry, which was defined in the props.conf file, doesn’t have much detail. It just defines the directory on HDFS where the CSV data is located. Click the Cancel button to exit this property details screen.