Apache Solr – Indexing using data import handler

          In the last post, we have seen how to set up Solr. In this post , we will see how to set up data import handler to Apache Solrimport the data from the database. Here, I am using sample product catalog database for demonstration. The table structure is given below.

prod_cat

The data dump file is available https://github.com/2013techsmarts/prod_cat_dump

Step 1 :

Now, we will prepare document structure for Solr. The configuration of db-data-config.xml is given below.


<dataConfig>

<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/sm_pro_cat" user="root" password="root"/>

<document>

<entity name="product" query="SELECT

`PROD_ID`,

`PROD_NAME`,

`PROD_DESC`,

`QUANTITY`,

`BUY_PRICE`,

`LIST_PRICE`,

`SALE_PRICE`,

(SELECT `RETAILER_NAME` FROM `retailers` WHERE `RETAILER_ID`=P.`RETAILER_ID`) AS `RETAILER_NAME`

FROM `products` P">

<field column="PROD_ID" name="prodId" />

<field column="PROD_NAME" name="prodName" />

<field column="PROD_DESC" name="prodDesc" />

<field column="QUANTITY" name="quantity" />

<field column="BUY_PRICE" name="buyPrice" />

<field column="LIST_PRICE" name="listPrice" />

<field column="SALE_PRICE" name="salePrice" />

<field column="RETAILER_NAME" name="retailerName" />

</entity>

</document>

</dataConfig>

The db-data-config.xml file will have root element as <dataConfig>. Here, the dataSource element is having all the database related details. In this example, I am using MySql as database. So, I configured MySql driver name, connection URL, user name and password. Copy the MySql driver jar file under solr_home lib directory.

In the Solr point of view, the document is nothing but, a row in a database table. Here, we are going to collect the data from various tables. In this example, I am collecting the data from products and retailers tables. We need to wrap the document under document element. Every SQL query will become as an entity under document element. Here, entity element has name and query attributes. For name attribute, we can give any value. In this example I have given product as value.

The query attribute is having SQL query to collect the necessary columns. All the columns information is given under field elements. For each column, there is an associated field defined. Field has column and name attributes. For column attribute give the same value which is there in the SQL query. For name attribute, we can give any suitable name.

Step 2 :

Now, we will configure the schema.xml file. This file contains all the field types information with analysis portion.

Solr supports for various field types. A field type is similar to data type. The sample field types are given below,


<fieldType name="string" sortMissingLast="true" omitNorms="true"/>

<!-- boolean type: "true" or "false" -->

<fieldType name="boolean" sortMissingLast="true" omitNorms="true"/>

<!-- A text field that only splits on whitespace for exact matching of words -->

<fieldType name="text_ws" positionIncrementGap="100">

<analyzer>

<tokenizer class="solr.WhitespaceTokenizerFactory"/>

</analyzer>

</fieldType>

After configuring the field types, we need to specify the fields which we are going to use for indexing.

The configuration of fields are given below.


<field name="prodId" type="int" indexed="true" stored="true" required="true"/>

<field name="prodName" type="text_ws" indexed="true" stored="true"/>

<field name="prodDesc" type="text_ws" indexed="false" stored="true"/>

<field name="quantity" type="int" indexed="true" stored="true"/>

<field name="buyPrice" type="double" indexed="true" stored="true"/>

<field name="listPrice" type="double" indexed="true" stored="true"/>

<field name="salePrice" type="double" indexed="true" stored="true"/>

<field name="retailerName" type="string" indexed="true" stored="true"/>

Here, the name of the field should match with the name which we have given in db-data-config.xml. The explanation of attributes are given below.

type → the field type(One of the field type which we have defined earlier)

indexed → the fields which are indexed=true are search-able

stored → the fields which are stored=true are stored as part of index and those fields will be returned as part of search response

required → the fields which are having required=true will be mandatory fields (Same as primary key in database)

Step 3 :

Now, we will configure solrConfig.xml. Here, we are going to configure data import handler to import the data , request handler and the response handler.

The configuration of data import handler is given below.


<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">

<lst name="defaults">

<str name="config">db-data-config.xml</str>

</lst>

</requestHandler>

In the above configuration you can notice the file name is matched which we have given in Step 1.

The request handler configuration to take the search requests is given below.

<requestDispatcher handleSelect="true" >

<!--Make sure your system has some authentication before enabling remote streaming! -->

<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />

<!-- Set HTTP caching related parameters (for proxy caches and clients).

To get the behaviour of Solr 1.2 (ie: no caching related headers)

use the never304="true" option and do not specify a value for

<cacheControl>

-->

<httpCaching never304="true">

<!--httpCaching lastModifiedFrom="openTime"

etagSeed="Solr"-->

<!-- lastModFrom="openTime" is the default, the Last-Modified value

(and validation against If-Modified-Since requests) will all be

relative to when the current Searcher was opened.

You can change it to lastModFrom="dirLastMod" if you want the

value to exactly corrispond to when the physical index was last

modified.

etagSeed="..." is an option you can change to force the ETag

header (and validation against If-None-Match requests) to be

differnet even if the index has not changed (ie: when making

significant changes to your config file)

lastModifiedFrom and etagSeed are both ignored if you use the

never304="true" option.

-->

<!-- If you include a <cacheControl> directive, it will be used to

generate a Cache-Control header, as well as an Expires header

if the value contains "max-age="

By default, no Cache-Control header is generated.

You can use the <cacheControl> option even if you have set

never304="true"

-->

<!-- <cacheControl>max-age=30, public</cacheControl> -->

</httpCaching>

</requestDispatcher>

<!-- requestHandler plugins... incoming queries will be dispatched to the

correct handler based on the path or the 'qt' param.

Names starting with a '/' are accessed with the a path equal to the

registered name. Names without a leading '/' are accessed with:

http://host/app/select?qt=name

If no qt is defined, the requestHandler that declares default="true"

will be used.

-->

<requestHandler name="standard" default="true">

<!-- default values for query parameters -->

<lst name="defaults">

<str name="echoParams">explicit</str>

<!--

<int name="rows">10</int>

<str name="fl">*</str>

<str name="version">2.1</str>

-->

</lst>

</requestHandler>

The query response writer configuration is given below.


<queryResponseWriter name="xml" default="true"/>

<queryResponseWriter name="json"/>

<queryResponseWriter name="python"/>

<queryResponseWriter name="ruby"/>

<queryResponseWriter name="php"/>

<queryResponseWriter name="phps"/>

If we are not going to configure any response writers in solrConfig.xml, by default XML response writer is enabled. During the search request we can specify the response writer using “wt” query parameter. Based on this, the search response will be returned in the respective data format.

Now, all the necessary configuration is done. Its time to index the data by using data import handler. We can invoke the data import handler from admin console.

Solr Admin data import

or we can send http request to the data import handler by using  the below URL

<http>://<host>:<port>/solr/sampleCatalog/dataimport?command=full-import&clean=true

Once the index is completed, we can search for the products based on any indexed field. For example, we can search based on the product name

<http>://<host>:<port>/solr/sampleCatalog/select?q=prodName:Harley%20Davidson&wt=xml&indent=true

Here , I am searching for Harley Davidson key word and we got 2 results from the search response.


<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">500</int>
<lst name="params">
<str name="indent">true</str>
<str name="q">prodName:Harley Davidson</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<double name="buyPrice">48.81</double>
<double name="listPrice">95.7</double>
<str name="prodId">S10_1678</str>
<str name="prodDesc">
This replica features working kickstand, front suspension, gear-shift lever, footbrake lever, drive chain, wheels and steering. All parts are particularly delicate due to their precise scale and require special care and attention.
</str>
<str name="prodName">1969 Harley Davidson Ultimate Chopper</str>
<str name="retailerName">Min Lin Diecast</str>
<double name="salePrice">127.0</double>
<double name="salePriceDisp">127.0</double>
<int name="quantity">7933</int>
</doc>
<doc>
<double name="buyPrice">24.23</double>
<double name="listPrice">60.57</double>
<str name="prodId">S18_2625</str>
<str name="prodDesc">
Intricately detailed with chrome accents and trim, official die-struck logos and baked enamel finish.
</str>
<str name="prodName">1936 Harley Davidson El Knucklehead</str>
<str name="retailerName">Welly Diecast Productions</str>
<double name="salePrice">162.0</double>
<double name="salePriceDisp">162.0</double>
<int name="quantity">4357</int>
</doc>
</result>
</response>

Let us run SQL query to verify the results.

sql verify

So, the results from the SOLR and the Database are in sync.

Now, we will see, how we can get the Solr response in other formats by applying “wt” parameter. We will try to get json as response.

<http>://<host>:<port>/solr/sampleCatalog/select?q=prodName:Harley%20Davidson&wt=json&indent=true


{
 "responseHeader": {
 "status": 0,
 "QTime": 0,
 "params": {
 "indent": "true",
 "q": "prodName:Harley Davidson",
 "wt": "json"
 }
 },
 "response": {
 "numFound": 2,
 "start": 0,
 "docs": [
 {
 "buyPrice": 48.81,
 "listPrice": 95.7,
 "prodId": "S10_1678",
 "prodDesc": "This replica features working kickstand, front suspension, gear-shift lever, footbrake lever, drive chain, wheels and steering. All parts are particularly delicate due to their precise scale and require special care and attention.",
 "prodName": "1969 Harley Davidson Ultimate Chopper",
 "retailerName": "Min Lin Diecast",
 "salePrice": 127,
 "salePriceDisp": 127,
 "quantity": 7933
 },
 {
 "buyPrice": 24.23,
 "listPrice": 60.57,
 "prodId": "S18_2625",
 "prodDesc": "Intricately detailed with chrome accents and trim, official die-struck logos and baked enamel finish.",
 "prodName": "1936 Harley Davidson El Knucklehead",
 "retailerName": "Welly Diecast Productions",
 "salePrice": 162,
 "salePriceDisp": 162,
 "quantity": 4357
 }
 ]
 }
}

Now, we will sort the results based on the “buyPrice” field.For that we need sort=<field name> <sort option either ASC/DESC>.

<http>://<host>:<port>/solr/sampleCatalog/select?q=prodName:Harley%20Davidson&wt=json&indent=true&sort=buyPrice%20ASC


<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="sort">buyPrice ASC</str>
<str name="indent">true</str>
<str name="q">prodName:Harley Davidson</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<double name="buyPrice">24.23</double>
<double name="listPrice">60.57</double>
<str name="prodId">S18_2625</str>
<str name="prodDesc">
Intricately detailed with chrome accents and trim, official die-struck logos and baked enamel finish.
</str>
<str name="prodName">1936 Harley Davidson El Knucklehead</str>
<str name="retailerName">Welly Diecast Productions</str>
<double name="salePrice">162.0</double>
<double name="salePriceDisp">162.0</double>
<int name="quantity">4357</int>
</doc>
<doc>
<double name="buyPrice">48.81</double>
<double name="listPrice">95.7</double>
<str name="prodId">S10_1678</str>
<str name="prodDesc">
This replica features working kickstand, front suspension, gear-shift lever, footbrake lever, drive chain, wheels and steering. All parts are particularly delicate due to their precise scale and require special care and attention.
</str>
<str name="prodName">1969 Harley Davidson Ultimate Chopper</str>
<str name="retailerName">Min Lin Diecast</str>
<double name="salePrice">127.0</double>
<double name="salePriceDisp">127.0</double>
<int name="quantity">7933</int>
</doc>
</result>
</response>

From this post, we have learnt how to configure Solr and how to search etc..

The entire set up, you can find at https://github.com/2013techsmarts/Solr-Data-Impot-Handler-Conf

In the coming post, we will see some more Solr features and configurations.

Stay tuned …….

Advertisements

I am Siva Prasad Rao Janapati. Working as a software developer. Has hands on experience on ATG Commerce(DAS/DPS/DCS), Mozu commerce, Broadleaf Commerce, Java, JEE, Spring, Play, JPA, Hibernate, Velocity, JMS, Jboss, Weblogic,Tomcat, Jetty, Apache, Apache Solr, Spring Batch, JQuery, NodeJS, SOAP, REST, MySQL, Oracle, Mongo DB, Memcached, HazelCast, Git, SVN, CVS, Ant, Maven, Gradle, Amazon Web services, Rackspace, Quartz, JMeter, Junit, Open NLP, Facebook Graph,Twitter4J, YouTube Gdata, Bazzarvoice,Yotpo, 4-Tell, Alatest, Shopzilla, Linkshare. I have hands on experience on open sources and commercial technologies.

Tagged with: , , ,
Posted in Apache Solr, Data Import Handler
2 comments on “Apache Solr – Indexing using data import handler
  1. mahesh says:

    In the fetching of data from database i got the following error :
    collection1: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config for solrconfig.xml…

    how to solve that issue……

    Thanks..
    mahesh

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

DZone

DZone MVB

Java Code Geeks
Java Code Geeks
%d bloggers like this: