Storing Apache Solr 5.x Data in HDFS

Introduction

Apache Solr is a leading enterprise search engine based on Apache Lucene. By default, Solr stores the data it indexes in the local filesystem. Solr can also store its data in HDFS (Hadoop Distributed File System), which provides large-scale distributed storage with redundancy and failover capabilities. In this article we shall configure Apache Solr to store data in HDFS instead of the local filesystem. This tutorial has the following sections:

Setting the Environment
Creating an Apache Solr Collection
Configuring Apache Solr
Configuring Apache Hadoop
Starting HDFS
Starting Apache Solr
Logging in to Apache Solr Admin Console
Indexing Documents
Querying Documents
Listing HDFS Directories

Setting the Environment

The following software is required for this article:

– Apache Hadoop 2.x

– Apache Solr 5.x

– Java

Apache Solr 5.x is used in this tutorial; the configuration for Apache Solr 4.x and 6.x could be slightly different. If it does not already exist, create a directory /solr in which to install the software, and set its permissions to global (777).

mkdir /solr
chmod -R 777 /solr
cd /solr

Download the Apache Solr 5.x solr-5.3.1.tgz file and extract the tgz file to the /solr directory.

wget http://apache.mirror.vexxhost.com/lucene/solr/5.3.1/solr-5.3.1.tgz
tar -xvf solr-5.3.1.tgz

Download Java 7 from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and extract the gz file to the /solr directory.

tar zxvf jdk-7u55-linux-i586.gz

To use HDFS for storage, Hadoop 2.x is required. Download Hadoop 2.5.0 CDH 5.2 and extract the tar.gz file to the /solr directory.

wget http://archive-primary.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.2.0.tar.gz
tar -xvf hadoop-2.5.0-cdh5.2.0.tar.gz

Create symlinks for the Hadoop bin directory and the conf directory. Symlinks, also called symbolic links, are references or links to other files and are required due to the packaging structure of the Hadoop binaries.

ln -s  /solr/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1  /solr/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce1/bin
ln -s  /solr/hadoop-2.5.0-cdh5.2.0/etc/hadoop  /solr/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce1/conf

Set the environment variables in the bash shell for Apache Hadoop, Apache Solr, and Java.

vi ~/.bashrc
export HADOOP_PREFIX=/solr/hadoop-2.5.0-cdh5.2.0
export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75/jre
export SOLR_HOME=/solr/solr-5.3.1/server/solr
export SOLR_CONF=/solr/solr-5.3.1/server/solr/configsets/basic_configs/conf
export HADOOP_MAPRED_HOME=/solr/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce1
export HADOOP_HOME=/solr/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce1
export HDFS_HOME=/solr/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs
export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HADOOP_CONF:$HDFS_HOME/*:$HDFS_HOME/lib/*:$HDFS_HOME/webapps/
export PATH=$PATH:/solr/solr-5.3.1/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME/bin
export CLASSPATH=$HADOOP_CLASSPATH
export HADOOP_NAMENODE_USER=hadoop
export HADOOP_DATANODE_USER=hadoop
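
Before moving on, it can help to confirm that the directories these variables point to actually exist. The following is a minimal sketch and not part of the original setup; the function name check_env_dirs is hypothetical, and it only checks the directory-valued variables exported above.

```shell
# Hypothetical helper: report which of the exported variables are unset
# or point to a directory that does not exist.
check_env_dirs() {
  missing=0
  for name in HADOOP_PREFIX HADOOP_CONF JAVA_HOME SOLR_HOME SOLR_CONF HADOOP_MAPRED_HOME HDFS_HOME; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      missing=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after sourcing ~/.bashrc:
#   check_env_dirs && echo "environment looks consistent"
```

The function returns a non-zero status if any variable fails the check, so it can be used as a guard before the later steps.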

Creating an Apache Solr Collection

To store data in HDFS we must create a Solr collection rather than a standalone Solr core; a Solr core gets created implicitly when a collection is created. Create a collection called “hdfs” with the Solr instance configuration from basic_configs using the following command.

solr create_collection -c hdfs -d /solr/solr-5.3.1/server/solr/configsets/basic_configs

The “hdfs” collection gets created, as shown in Figure 1. A Solr core called hdfs_shard1_replica1 gets created implicitly.

Figure 1. Creating an Apache Solr Collection

Configuring Apache Solr

We need to configure the Solr schema for the fields in a Solr document. We shall be using the fields time_stamp, category, type, servername, code, and msg. Declare the fields in the schema.xml file in the /solr/solr-5.3.1/server/solr/configsets/basic_configs/conf directory with a <field/> element for each field. Make the fields indexed by setting indexed to true.

<field name="time_stamp" type="string" indexed="true" stored="true" multiValued="false" />
<field name="category" type="string" indexed="true" stored="true" multiValued="false" />
<field name="type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="servername" type="string" indexed="true" stored="true" multiValued="false" />
<field name="code" type="string" indexed="true" stored="true" multiValued="false" />
<field name="msg" type="string" indexed="true" stored="true" multiValued="false" />
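
Because all six fields share the same attributes, the declarations can also be generated rather than typed by hand. This is only a sketch; the function name print_string_fields is hypothetical, and the output still has to be pasted into schema.xml manually.

```shell
# Hypothetical helper: print one <field/> element per field name,
# using the same attributes as the declarations above.
print_string_fields() {
  for name in "$@"; do
    printf '<field name="%s" type="string" indexed="true" stored="true" multiValued="false" />\n' "$name"
  done
}

print_string_fields time_stamp category type servername code msg
```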

Solr documents also require two other fields, id and _version_. The id field must be provided when a new document is added, and the _version_ field is added automatically by the Solr server. Remove any duplicate field declarations. The only change needed in solrconfig.xml is to enable auto commit with openSearcher set to true.

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

To add the Solr dist jars to the runtime classpath of Solr, copy the jars in the dist directory to the lib directory in the Solr instance directory /solr/solr-5.3.1/server/solr/configsets/basic_configs.

mkdir /solr/solr-5.3.1/server/solr/configsets/basic_configs/lib
chmod -R 777 /solr/solr-5.3.1/server/solr/configsets/basic_configs/lib
cp /solr/solr-5.3.1/dist/*.jar   /solr/solr-5.3.1/server/solr/configsets/basic_configs/lib

Configuring Apache Hadoop

As we shall be using Apache Hadoop for storage, we need to configure Hadoop. Because only HDFS is used for Solr storage, and not MapReduce, we do not need to configure MapReduce. We do need to set the fs.defaultFS and hadoop.tmp.dir configuration properties in the core-site.xml file; fs.defaultFS is the NameNode URI and hadoop.tmp.dir is the Hadoop temporary directory. The core-site.xml may be edited in the vi editor with the following command.

vi   /solr/hadoop-2.5.0-cdh5.2.0/etc/hadoop/core-site.xml

The core-site.xml is listed:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.0.2.15:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///var/lib/hadoop-0.20/cache</value>
  </property>
</configuration>

Create the Hadoop directory configured in hadoop.tmp.dir and set its permissions to global (777).

sudo mkdir -p /var/lib/hadoop-0.20/cache
sudo chmod -R 777  /var/lib/hadoop-0.20/cache
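
If the NameNode address differs from the one used in this article, the file can also be generated from the shell instead of edited in vi. The following is a sketch under that assumption; write_core_site is a hypothetical function name, and it writes only the two properties discussed above.

```shell
# Hypothetical helper: write a minimal core-site.xml into a configuration
# directory for a given NameNode URI and temporary directory.
write_core_site() {
  conf_dir=$1
  namenode_uri=$2
  tmp_dir=$3
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>$namenode_uri</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
  </property>
</configuration>
EOF
}

# Usage with the values from this article:
#   write_core_site "$HADOOP_CONF" hdfs://10.0.2.15:8020 file:///var/lib/hadoop-0.20/cache
```

Note that the function overwrites any existing core-site.xml in the target directory.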

We also need to set the following HDFS configuration properties in the hdfs-site.xml file.

Property	Description	Value
dfs.permissions.superusergroup	The Hadoop superusergroup	hadoop
dfs.namenode.name.dir	NameNode storage directory	file:///data/1/dfs/nn
dfs.replication	Replication factor	1
dfs.permissions	Permissions checking. Setting to false disables the permissions checking on HDFS files. The chgrp, chown and chmod always check permissions.	false

The hdfs-site.xml may be edited in a vi editor with the following command.

vi   /solr/hadoop-2.5.0-cdh5.2.0/etc/hadoop/hdfs-site.xml

The hdfs-site.xml is listed:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

Create the NameNode storage directory and set its permissions to global (777).

sudo mkdir -p /data/1/dfs/nn
sudo chmod -R 777 /data/1/dfs/nn
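
To double-check what was written to the site files, a property value can be read back from the shell. This is a rough sketch, not a real XML parser: the function name get_site_property is hypothetical, and it assumes each <name> and its <value> sit on separate, consecutive lines as in the listings above.

```shell
# Hypothetical helper: print the value of a property from a Hadoop site file.
# Assumes the <value> line follows the matching <name> line.
get_site_property() {
  file=$1
  prop=$2
  awk -v prop="$prop" '
    # Remember that the exact property name was seen.
    index($0, "<name>" prop "</name>") { found = 1; next }
    # Print the text between <value> and </value> on the next value line.
    found && match($0, /<value>.*<\/value>/) {
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}

# Usage:
#   get_site_property "$HADOOP_CONF/hdfs-site.xml" dfs.replication
```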

Starting HDFS

First, format the NameNode storage.

hadoop namenode -format

After the preceding command exits, start the HDFS cluster, which comprises the NameNode and the DataNode. Each of the following commands runs in the foreground, so run them in separate terminals.

hadoop namenode
hadoop datanode
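
The NameNode takes a moment to start accepting connections. A small polling loop can be used to wait for it before continuing; this is a sketch only, the function name wait_for_port is hypothetical, and it relies on the /dev/tcp feature of bash, so it will not work in other shells.

```shell
# Hypothetical helper (bash only): poll a TCP port until it accepts
# connections or the number of attempts runs out.
wait_for_port() {
  host=$1
  port=$2
  attempts=${3:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # /dev/tcp is a bash feature; the subshell closes the socket on exit.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage with the NameNode address configured in fs.defaultFS:
#   wait_for_port 10.0.2.15 8020 && echo "NameNode is accepting connections"
```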

As we shall be storing Solr data in HDFS, we need to create a directory in HDFS for the Solr data. Create a directory called /solr in HDFS and set its permissions to global (777).

hadoop dfs -mkdir -p hdfs://10.0.2.15:8020/solr
hadoop dfs -chmod -R 777 hdfs://10.0.2.15:8020/solr

The output from the preceding commands is shown in Figure 2.

Figure 2. Creating a Directory in HDFS for Apache Solr Storage

Starting Apache Solr

When Solr is started using the default storage we use the following command.

solr start

When using HDFS storage we use a slightly different command with additional parameters. Solr must use the HdfsDirectoryFactory to store data in HDFS, which is set with the solr.directoryFactory property. Specify the HDFS directory in which to store data with the solr.hdfs.home property, and set the lock type to hdfs with the solr.lock.type property.

Run the following command to start Apache Solr server using HDFS for storage and indexing.

bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=hdfs://10.0.2.15:8020/solr

Solr server gets started. Find the Solr server status with the following command.

solr status

The status lists one Solr node running, as shown in Figure 3.

Figure 3. Starting Apache Solr and finding Status
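
The three HDFS-related system properties in the start command differ only in the HDFS home URI, so they can be assembled by a small function. This is a sketch for convenience; the function name solr_hdfs_opts is hypothetical and is not part of Solr.

```shell
# Hypothetical helper: print the system properties that point Solr at HDFS.
solr_hdfs_opts() {
  hdfs_home=$1
  printf '%s %s %s\n' \
    "-Dsolr.directoryFactory=HdfsDirectoryFactory" \
    "-Dsolr.lock.type=hdfs" \
    "-Dsolr.hdfs.home=$hdfs_home"
}

# Usage (word splitting of the unquoted substitution is intentional):
#   bin/solr start -c $(solr_hdfs_opts hdfs://10.0.2.15:8020/solr)
```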

Logging in to Apache Solr Admin Console

Next, log in to the Apache Solr Admin Console using the URL http://localhost:8983/solr/. Open the Core Selector to display the available cores. The hdfs_shard1_replica1 core gets listed, as shown in Figure 4.

Figure 4. Logging in to Apache Solr Admin Console

Indexing Documents

We shall index sample log data in XML format. Create the following XML document in the Solr add format.

<add>
  <doc>
    <field name="id">wlslog1</field>
    <field name="time_stamp">Apr-8-2014-7:06:16-PM-PDT</field>
    <field name="category">Notice</field>
    <field name="type">WebLogicServer</field>
    <field name="servername">AdminServer</field>
    <field name="code">BEA-000365</field>
    <field name="msg">Server state changed to STANDBY</field>
  </doc>
  <doc>
    <field name="id">wlslog2</field>
    <field name="time_stamp">Apr-8-2014-7:06:17-PM-PDT</field>
    <field name="category">Notice</field>
    <field name="type">WebLogicServer</field>
    <field name="servername">AdminServer</field>
    <field name="code">BEA-000365</field>
    <field name="msg">Server state changed to STARTING</field>
  </doc>
  <doc>
    <field name="id">wlslog3</field>
    <field name="time_stamp">Apr-8-2014-7:06:18-PM-PDT</field>
    <field name="category">Notice</field>
    <field name="type">WebLogicServer</field>
    <field name="servername">AdminServer</field>
    <field name="code">BEA-000360</field>
    <field name="msg">Server started in RUNNING mode</field>
  </doc>
</add>

Select the hdfs_shard1_replica1 core and select Documents as shown in Figure 5.

Figure 5. Selecting Core>Documents

The default Request-Handler is /update. Select XML as the Document Type. In the Document(s) field add the XML document. Click on Submit Document as shown in Figure 6.

Figure 6. Adding Data to Solr

The “success” status shown in Figure 7 indicates that the documents have been indexed (in HDFS).

Figure 7. Status of adding Documents
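
The same documents can also be posted to the /update request handler from the command line instead of the Admin Console. The following is a sketch that assumes curl is installed and that the XML document above was saved to a file; the function names and the file name wlslog.xml are hypothetical.

```shell
# Hypothetical helper: build the update URL for a collection,
# asking Solr to commit right after the update.
solr_update_url() {
  # $1 = Solr base URL, $2 = collection name
  printf '%s/%s/update?commit=true\n' "$1" "$2"
}

# Hypothetical helper: post a file in Solr <add> XML format to the collection.
post_xml() {
  # $1 = Solr base URL, $2 = collection name, $3 = XML file
  curl -s -H 'Content-Type: text/xml' --data-binary "@$3" "$(solr_update_url "$1" "$2")"
}

# Usage, assuming the XML document was saved as wlslog.xml:
#   post_xml http://localhost:8983/solr hdfs wlslog.xml
```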

Querying Documents

The indexed documents may be searched just like documents indexed in the local filesystem. Select Query as shown in Figure 8 and set the Request Handler to /select (the default setting). The default query of *:* lists all the indexed documents when it is run.

Figure 8. Query

Click on Execute Query as shown in Figure 9.

Figure 9. Execute Query

The three documents indexed get listed, as shown in Figure 10.

Figure 10. Query Result

The _version_ field has been added to each document, as shown in Figure 11.

Figure 11. The _version_ field is added to each Document
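
The query can also be run against the /select request handler from the command line. This is a sketch that assumes curl is installed; the function names are hypothetical, and the query string is passed through without URL encoding, so it is only suitable for simple queries such as *:* or code:BEA-000360.

```shell
# Hypothetical helper: build a select URL that returns indented JSON.
solr_select_url() {
  # $1 = Solr base URL, $2 = collection name, $3 = query (defaults to *:*)
  printf '%s/%s/select?q=%s&wt=json&indent=true\n' "$1" "$2" "${3:-*:*}"
}

# Hypothetical helper: run the query and print the response.
solr_query() {
  curl -s "$(solr_select_url "$@")"
}

# Usage:
#   solr_query http://localhost:8983/solr hdfs
#   solr_query http://localhost:8983/solr hdfs code:BEA-000360
```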

Listing HDFS Directories

Earlier, we created the /solr directory in HDFS and specified the directory when starting the Solr server. The Solr data gets indexed in the /solr directory. Run the following command to list the files and directories in the HDFS directory /solr.

hadoop dfs -ls hdfs://10.0.2.15:8020/solr

Two items, “hdfs” and “wlslog”, get listed, as shown in Figure 12. The hdfs directory is for Apache Solr storage on HDFS. The “wlslog” directory is another HDFS storage directory for Solr that is not used in this tutorial.

Figure 12. The /solr/hdfs Directory is for Apache Solr Storage

Run the following command to list the files and directories in the /solr/hdfs directory in HDFS.

hadoop dfs -ls hdfs://10.0.2.15:8020/solr/hdfs

The /solr/hdfs/core_node1 used for the Solr core in this article gets listed, as shown in Figure 13.

Figure 13. Listing the HDFS Directory for Apache Solr Core
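
To see everything Solr has written for a collection in one step, the listing can be made recursive. This is a sketch; the function name list_solr_index is hypothetical, and it uses hadoop fs, the generic form of the hadoop dfs command used above.

```shell
# Hypothetical helper: recursively list the HDFS files of a Solr collection.
list_solr_index() {
  # $1 = HDFS home URI set in solr.hdfs.home, $2 = collection name
  hadoop fs -ls -R "$1/$2"
}

# Usage:
#   list_solr_index hdfs://10.0.2.15:8020/solr hdfs
```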

Summary

In this article we used HDFS, which provides reliable and durable large-scale distributed storage, for storing Solr index data. We installed and configured Apache Hadoop 2.x. We started HDFS and created a directory in HDFS for Solr data. We started the Solr server using the HdfsDirectoryFactory and the hdfs lock type. Subsequently we indexed and queried data in Solr from the Solr Admin Console.