Installing and Running Hunk/Splunk by Example









 

Hunk is the Hadoop version of Splunk (www.splunk.com), and it can be used to create reports and dashboards to examine the state of the data on a Hadoop cluster. The tool offers search, reporting, alerts, and dashboards from a web-based user interface. Let’s look at the installation and uses of Hunk, as well as some simple reports and dashboards.

 

·       Installing Hunk

 

By way of example, I install Hunk onto the CentOS 6 Linux host hc2nn and connect it to the Cloudera CDH5 Hadoop cluster on the same node. Before downloading the Splunk software, though, I must first create an account and register my details. I source Hunk from

www.splunk.com/goto/downloadhunk.

 

The version 6 download is about 100 MB. I install the software using the CentOS-based Linux hadoop account. Because I am logged into the hadoop account, the downloaded file is saved to the Downloads directory, as follows:

 

[hadoop@hc2nn Downloads]$ pwd

/home/hadoop/Downloads

 

[hadoop@hc2nn Downloads]$ ls -l hunk-6.1.3-228780-Linux-x86_64.tar.gz

-rw-r--r-- 1 hadoop hadoop 105332713 Oct 28 18:19 hunk-6.1.3-228780-Linux-x86_64.tar.gz

 

This is a gzip-compressed tar file, so it needs to be unpacked using the Linux gunzip and tar commands. The gunzip command decompresses the .tar.gz file, leaving a tar archive; the tar command then extracts the archive's contents to create the Hunk installation directory. In the tar options, x means extract, v means verbose, and f allows me to specify the tar file to use:

 

[hadoop@hc2nn Downloads]$ gunzip hunk-6.1.3-228780-Linux-x86_64.tar.gz

[hadoop@hc2nn Downloads]$ tar xvf hunk-6.1.3-228780-Linux-x86_64.tar
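
Incidentally, the two unpacking steps can be combined: tar's z option performs the gunzip decompression for you, so the following single command (equivalent to the two shown above) would also work:

[hadoop@hc2nn Downloads]$ tar xvzf hunk-6.1.3-228780-Linux-x86_64.tar.gz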

 

[hadoop@hc2nn Downloads]$ ls -ld *hunk*

drwxr-xr-x 9 hadoop hadoop 4096 Nov 1 13:35 hunk

 

The Linux ls -ld command gives a long listing of the Hunk installation directory that has just been created; the l option produces the long listing, while the d option lists the directory itself rather than its contents.

Having created the installation directory, I now move it to a more permanent location under /usr/local. I need to use the root account to do this because the hadoop account will not have the required access:

 

[hadoop@hc2nn Downloads]$ su -

[root@hc2nn ~]# mv /home/hadoop/Downloads/hunk /usr/local

[root@hc2nn ~]# cd /usr/local

[root@hc2nn local]# chown -R hadoop:hadoop hunk

[root@hc2nn local]# exit

 

[hadoop@hc2nn Downloads]$ cd /usr/local/hunk

 

The Linux su command switches the current user to the root account. The Linux mv command moves the Hunk directory from the hadoop account's Downloads directory to /usr/local as root. The cd command then switches to the /usr/local directory, and the chown command changes the ownership and group membership of the installation to hadoop; the -R switch makes the change recursive, so all underlying files and directories are affected. The exit command then returns to the hadoop login, and the final line changes directory to the new installation under /usr/local/hunk.

Now that Hunk is installed and in the correct location, I need to configure it so that it will be able to access the Hadoop cluster and the data that the cluster contains. This involves creating three files—indexes.conf, props.conf, and transforms.conf—under the following Hunk installation directory:

 

[hadoop@hc2nn hunk]$ cd /usr/local/hunk/etc/system/local
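
Once all three files are in place, a simple listing in this directory shows them (illustrative output):

[hadoop@hc2nn local]$ ls *.conf

indexes.conf  props.conf  transforms.conf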

 

Of these three files, the indexes.conf file provides Hunk with the means to connect to the Hadoop cluster. For example, to create a provider entry, I use a sequence similar to the following:

 

[hadoop@hc2nn local]$ cat indexes.conf

 

[provider:cdh5]

vix.family = hadoop

vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-s6.0-hy2.0.jar

vix.env.HADOOP_HOME = /usr/lib/hadoop

vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.6.0-openjdk.x86_64

vix.fs.default.name = hdfs://hc2nn:8020

vix.splunk.home.hdfs = /user/hadoop/hunk/workdir

vix.mapreduce.framework.name = yarn

vix.yarn.resourcemanager.address = hc2nn:8032

vix.yarn.resourcemanager.scheduler.address = hc2nn:8030

vix.mapred.job.map.memory.mb = 1024

vix.yarn.app.mapreduce.am.staging-dir = /user

vix.splunk.search.recordreader.csv.regex = \.txt$

 

 

This entry creates a provider entry called cdh5, which describes the means by which Hunk can connect to HDFS, the Resource Manager, and the Scheduler. The entry describes where Hadoop is installed (via HADOOP_HOME) and the source of Java (via JAVA_HOME). It specifies HDFS access via the local host name and name node port of 8020. Resource Manager access will be at port 8032, and Scheduler access is at port 8030. The framework is described as YARN, and the location on HDFS that Hunk can use as a working directory is described via the property vix.splunk.home.hdfs.
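
Note that the listing above contains only the provider stanza. Hunk also expects indexes.conf to define a virtual index that names an index and ties it to this provider and to a path on HDFS. A stanza for the virtual index cdh5_vindex, which appears later in the user interface, would look roughly like the following sketch; the exact input path pattern is my assumption, based on the data location used below:

[cdh5_vindex]

vix.provider = cdh5

vix.input.1.path = /data/hunk/rdbms/...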

 

The second file, props.conf, describes the location on HDFS of a data source stored under /data/hunk/rdbms/. The cat command dumps the contents of the file, and the extractcsv value refers to an entry in the file transforms.conf that describes the contents of the data file:

 

[hadoop@hc2nn local]$ cat props.conf

[source::/data/hunk/rdbms/...]

REPORT-csvreport = extractcsv

 

 

The third file, transforms.conf, contains an entry called extractcsv, which is referenced in the props.conf file above. It has two properties: the DELIMS value describes how the fields in each data line are delimited (in this case, by commas), and the FIELDS property describes, in my case, 14 fields of vehicle fuel-consumption data; you can, of course, adjust the field list to describe your own data.

 

[hadoop@hc2nn local]$ cat transforms.conf

 

[extractcsv]

DELIMS="\,"

FIELDS="year","manufacturer","model","class","engine size","cyclinders","transmission","Fuel

Type","fuel_city_l_100km","fuel_hwy_l_100km","fuel_city_mpg","fuel_hwy_mpg","fuel_l_yr","c02_g_km"

 

Here’s a sampling of the CSV file contents via an HDFS file system cat command, which dumps the contents of the file /data/hunk/rdbms/rawdata.txt. The Linux head command limits the output to five lines:

 

[hadoop@hc2nn local]$ hdfs dfs -cat /data/hunk/rdbms/rawdata.txt | head -5

 

1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202

1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,M5,X,9.6,7,29,40,1680,193

1995,ACURA,INTEGRA GS-R,SUBCOMPACT,1.8,4,M5,Z,9.4,7,30,40,1660,191

1995,ACURA,LEGEND,COMPACT,3.2,6,A4,Z,12.6,8.9,22,32,2180,251

1995,ACURA,LEGEND COUPE,COMPACT,3.2,6,A4,Z,13,9.3,22,30,2260,260
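
If you are recreating this setup from scratch, the data file would first have been copied into HDFS with standard commands along these lines (assuming rawdata.txt exists in the current local directory):

[hadoop@hc2nn local]$ hdfs dfs -mkdir -p /data/hunk/rdbms

[hadoop@hc2nn local]$ hdfs dfs -put rawdata.txt /data/hunk/rdbms/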

Now that some basic configuration files are set up, I can start Hunk.

 

·        Running Hunk

 

Hunk is started from the bin directory within the installation, as the Linux hadoop user, by using the splunk command:

 

[hadoop@hc2nn local]$ cd /usr/local/hunk/bin

[hadoop@hc2nn bin]$ ./splunk start --accept-license

 

When you first start Hunk, you must use the --accept-license option; after that, it may be omitted.

 

When starting, Hunk reads its configuration files, so you need to monitor the output for configuration error messages, such as:

 

Checking conf files for problems...

Invalid key in stanza [source::/data/hunk/rdbms/...] in /usr/local/hunk/etc/system/local/props.conf, line 3: DELIMS (value: ", ")

 

If any errors occur, you can fix the configuration files and restart Hunk, as follows:

 

[hadoop@hc2nn bin]$ ./splunk restart
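
The same splunk script provides the other lifecycle commands you would expect; for example, status reports whether Hunk is running, and stop shuts it down:

[hadoop@hc2nn bin]$ ./splunk status

[hadoop@hc2nn bin]$ ./splunk stop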

 

If all is well, you are presented with a message containing the URL at which to access Hunk’s web-based user interface:

 

The Splunk web interface is at http://hc2nn:8000

 

You will need to log in with the account name “admin” and the initial password of “changeme,” which you will immediately be prompted to change. Once logged in, you will see the Virtual Indexes page, which, as the figure below shows, displays the provider cdh5 in the family hadoop that was created in the indexes.conf file. If you don’t see the Virtual Indexes page, select Settings and then Virtual Indexes from the top menu bar.

 

 

 

You can click the cdh5 entry to examine the provider’s details. The full list of provider properties is too large to display here, but note that Hunk automatically adds extra entries, such as vix.splunk.search.recordreader, which defines how CSV files will be read. To show most of the details in the figure below, I arranged the list in two columns.

 

 

 

Note that the Hadoop version in this figure is set to YARN to reflect the CDH5 YARN version; it has not been necessary to specify the Hadoop supplier name.

Now, click Cancel to leave the cdh5 properties view, and click the Virtual Indexes tab. For our example, this tab shows that a single virtual index, called cdh5_vindex, has been created in Hunk, as shown in the following figure.

 

 

 

Virtual indexes are the means by which Hunk accesses the Hadoop cluster-based data. They enable Hunk to run MapReduce against the data and present the results within Hunk reports. By selecting the cdh5_vindex entry, you can examine the attributes of this virtual index (following figure). Currently, the entry doesn’t have much detail; it just defines the directory on HDFS where the CSV data is located, the same path referenced in props.conf. Click the Cancel button to exit this property details screen.
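
With the virtual index in place and the fields defined in transforms.conf, the data can now be searched from the Hunk search bar using the ordinary Splunk search language. As a hypothetical example based on the field names above, the following search would report average city fuel consumption by manufacturer:

index=cdh5_vindex | stats avg(fuel_city_mpg) by manufacturer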

 

 

 

 


