Toad World Blog

Indexing MongoDB Data in Apache Solr

Feb 3, 2017 11:09:34 AM by Deepak Vohra

Indexing data for fast and efficient retrieval is one of the data preparation tasks in data science. MongoDB is the most commonly used NoSQL database and the fourth most commonly used database (relational or NoSQL). MongoDB is based on the BSON (binary JSON) document model.

 

Problem

MongoDB provides built-in full-text search capabilities but does not provide advanced indexing and search features. 

 

Solution

Apache Solr is not designed to be primarily a data store, but is designed for indexing documents. Apache Solr is based on the high performance, full-featured text search engine Lucene. MongoLabs provides a Mongo Connector; using it, you can index MongoDB data in Apache Solr.

In this article we shall use the Mongo Connector to index MongoDB data in Apache Solr.

 

Setting the Environment

We have used Linux OS environment on Amazon Elastic Compute Cloud (EC2) for this article. The following software is required for this article.

-Apache Solr

-MongoDB

-Mongo Connector

-Java

Create a directory /solr to install the software and set its permissions to global (777).

mkdir /solr
chmod 777 /solr
cd /solr

Download and extract the Apache Solr tgz file.

wget http://apache.mirror.vexxhost.com/lucene/solr/5.3.1/solr-5.3.1.tgz
tar -xvf solr-5.3.1.tgz

Download and extract the MongoDB tgz file.

curl -O http://downloads.mongodb.org/linux/ mongodb-linux-i686-3.0.6.tgz
tar -zxvf mongodb-linux-i686-3.0.6.tgz

Set environment variables for MongoDB, Apache Solr and Java in the bash script.

vi ~/.bashrc
export MONGODB_HOME=/solr/mongodb-linux-i686-3.0.6
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75/jre
export SOLR_HOME=/solr/solr-5.3.1/server/solr
export SOLR_CONF=/solr/solr-5.3.1/server/solr/configsets/basic_configs/conf
export PATH=$PATH:$SOLR_HOME/bin:$JAVA_HOME/bin:$MONGODB_HOME/bin

The bash script ~/.bashrc is shown in vi.

 

 

Installing the Mongo Connector

To install the Mongo Connector run the following command.

pip install mongo-connector

The  mongo-connector-2.1 gets installed.

 

Starting Apache Solr Server

Start the Apache Solr server from the /solr/solr-5.3.1 directory. Run the bin/solr start command to start Apache Solr.

cd /solr/solr-5.3.1
bin/solr start

 

Creating a Solr Core

Login to the Solr Admin Console with the URL http://localhost:8983/solr/. We need to create a Solr Core to index MongoDB data. Select Core Admin in Solr Admin Dashboard.

In the dialog displayed specify a name (wlslog for example) for the core. Specify the instanceDir as /solr/solr-5.3.1/server/solr/configsets/basic_configs. Specify dataDir as "data". The instanceDir and the dataDir must exist before a core is created. Specify the config as solrconfig.xml and schema as schema.xml. Click on Add Core.

 

A Solr core called wlslog gets created.

 

Select the wlslog core in the Core Selector.

The Overview tab displays the Statistics and Instance associated with the wlslog core.

 

 

 

Configuring Apache Solr

Next, we need to configure Apache Solr. The fields in the MongoDB documents to be indexed are specified in the schema.xml configuration file. Open the schema.xml in a vi editor.

vi /solr/solr-5.3.1/server/solr/configsets/basic_configs/conf/schema.xml

Add fields time_stamp,category,type,servername,code, and msg. Mongo Connector also stores the metadata associated with the each MongoDB document it indexes in fields ns and _ts. Also add the ns and _ts fields to the schema.xml.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<field name="time_stamp" type="string" indexed="true"  stored="true"  multiValued="false" />
<field name="category" type="string" indexed="true"  stored="true"  multiValued="false" />
<field name="type" type="string" indexed="true"  stored="true"  multiValued="false" />
<field name="servername" type="string" indexed="true"  stored="true"  multiValued="false" />
<field name="code" type="string" indexed="true"  stored="true"  multiValued="false" />
<field name="msg" type="string" indexed="true"  stored="true"  multiValued="false" />
<field name="_ts" type="long" indexed="true" stored="true" />
<field name="ns" type="string" indexed="true" stored="true"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
 <uniqueKey>id</uniqueKey>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
</schema>

Fields not defined in schema.xml are not indexed. We also need to configure the org.apache.solr.handler.admin.LukeRequestHandler request handler in the solrconfig.xml. Requests to Solr server are routed through the request handler. Open the solrconfig.xml in the vi editor.

vi ./solr-5.3.1/server/solr/configsets/basic_configs/conf/solrconfig.xml

Specify the request handler for the Mongo Connector.

<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />

Also configure the auto commit to true so that Solr auto commits the data from MongoDB after the configured time.

<autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>

After modifying the schema.xml and solrconfig.xml the Solr server needs to be restarted.

bin/solr restart

 

Starting MongoDB Server

Mongo Connector requires a MongoDB replica set to be running for it to index MongoDB data in Solr. A replica set is a cluster of MongoDB servers that implements replication and automatic failover. A replica set could consist of only a single server, as we shall use. Start a single instance of MongoDB server using the following command in which the port is specified as 27017, the data directory for MongoDB is specified as /data/db and the replica set name is specified as rs0 with the –replSet option.

mongod --port 27017 --dbpath /data/db --replSet rs0

Run the preceding command to start MongoDB.

MongoDB replica set gets started.

 

 

  

Starting MongoDB Shell

Start the MongoDB shell with the following command.

mongo

MongoDB shell gets started. We need to initiate the replica set. Run the following command to initiate the replica set.

rs.initiate()

Replica set gets initiated.

If the replica set has already been initialized and the rs.initiate() command is run an error message gets output that replica set has been “already initialized”.

 

 

Creating a MongoDB Collection

In this section we shall create a MongoDB collection to store documents. Set the MongoDB database to “solr”, which is created implicitly when a new database is initialized by invoking some command on it.

use solr

We shall store MongoDB documents in a collection called “wlslog”. First, find if the collection already exists and does it have documents in it with the following command.

db.wlslog.find()

If some documents get listed the collection, wlslog exists and has documents in it. Drop the wlslog collection with the following command.

db.wlslog.drop()

Create the wlslog collection again with the following command.

db.createCollection("wlslog")

The wlslog collection gets created.

 

Adding Data to MongoDB Collection

Next, we shall add some documents to the wlslog collection. Create JSON for three documents doc1, doc2, and doc3.

doc1 = {"time_stamp":"Apr-8-2014-7:06:16-PM-PDT","category": "Notice","type":"WebLogicServer",
"servername": "AdminServer","code":"BEA-000365","msg": "Server state changed to STANDBY" }


doc2 ={"time_stamp":"Apr-8-2014-7:06:17-PM-PDT","category": "Notice","type":"WebLogicServer",
"servername": "AdminServer","code":"BEA-000365","msg": "Server state changed to STARTING" }


doc3 ={"time_stamp":"Apr-8-2014-7:06:18-PM-PDT","category": "Notice","type":"WebLogicServer",
"servername": "AdminServer","code":"BEA-000360","msg": "Server started in RUNNING mode" }

Three documents get created.

Add the three documents to MongoDB with the following command:

db.wlslog.insert([doc1,doc2,doc3])

The three documents get added. The nInserted value of 3 indicates that 3 documents have been added.

 

Query the wlslog collection with the find() method.

db.wlslog.find()

The three documents get listed.

 

Starting the Mongo Connector and Indexing MongoDB Data

Start the Mongo Connector to connect to MongoDB and Apache Solr and index the MongoDB data. The Mongo Connector is run with the mongo-connector command. Specify the following command parameters.

Parameter

Description

Value

--unique-key

The unique key in Solr server. The unique key in MongoDB documents is _id and is mapped to the id key in Solr.

id

–n

The MongoDB collection to index. The format is database.collection.

solr.wlslog

-m

The MongoDB host and port to connect to.

localhost:27017

-t

The Apache Solr URL including the core.

http://localhost:8983/solr/wlslog

-d

The document manager for Apache Solr.

solr_doc_manager

 

With both the MongoDB and Apache Solr running run the mongo-connector command to index MongoDB data in Apache Solr.

mongo-connector --unique-key=id –n solr.wlslog -m localhost:27017 -t http://localhost:8983/solr/wlslog -d solr_doc_manager

The Mongo Connector gets started.

A MongoDB replica set must be running and if not the following error message gets generated.

 

Querying Apache Solr

Having indexed MongoDB data in Solr navigate to the http://localhost:8983/solr/#/wlslog/query   URL in the Solr Admin Console. Alternative select Query for the wlslog core. The request Handler is /select and the query (q) is *:*.

 

Click on Execute Query.

The three documents indexed in Solr from MongoDB get listed.

 

Each document has the ns, _ts and _version_ added by the document manager.

 

Summary

In this article we used the Mongo Connector to index MongoDB data in Apache Solr. The MongoDB server must be running as a replica set. The Mongo Connector LukeRequestHandler must be configured in Solr server’s solrconfig.xml. In addition to the fields required to be indexed the ns and _ts fields must be configured in schema.xml.  

Tags: NoSQL SQL Server

Deepak Vohra

Written by Deepak Vohra

Deepak Vohra is an Oracle Certified Associate, Oracle Database 10g, and Sun Certified Java Programmer. Deepak has published on OTN and in Oracle Magazine.