
Indexing files in Solr using Tika

After installing the latest Solr release I noticed that schema.xml (the XML file that tells Solr how to index and search your data) was already set up for something called Apache Tika, which can be used to index all sorts of documents and other files that hold metadata (audio files, for instance).

See this post to learn how to install Solr on Windows: Installing Apache Solr on Windows

The great thing about Tika is that you barely need to do anything to make it work: Tika does almost everything for you. You can also tailor it to your needs by adding a config XML file in the Solr directory and a reference to it in the Solr config XML file. I tried to get it to work but found that quite difficult, partly because there wasn’t much help to be found on Google. Because of all my problems I created this post on StackOverflow. Before I got a reply, though, I found the solution myself, buried deep inside some forum posts about SolrNet.

This is how you get it to work inside Solr with SolrNet as the client.

Install Solr using the link from my earlier post above or something similar. The main thing is that you install it from the prebuilt Solr release.
Create a new folder called “lib” inside your Solr install folder.
Copy the apache-solr-cell-3.4.0.jar file from the “dist” folder of the Solr zip file to the newly created “lib” folder.
Copy the contents of contrib\extraction\lib from the Solr zip to the same “lib” folder.
Now Tika is installed in Solr! Remember to go to http://localhost:8080/solr and confirm that it is installed correctly.
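
If you would rather script the copy steps above than do them by hand, something like the following Python sketch should do. It assumes the Solr zip has already been unpacked and that the folder layout matches the 3.4.0 release described here; the function name install_tika_libs and both path arguments are placeholders of my own, not anything Solr provides.

```python
import shutil
from pathlib import Path


def install_tika_libs(solr_unzipped, solr_home):
    """Copy Solr Cell and the Tika jars into <solr_home>/lib.

    Both arguments are hypothetical paths: the unpacked Solr zip and the
    folder where Solr was installed.
    """
    solr_unzipped = Path(solr_unzipped)
    lib_dir = Path(solr_home) / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)

    # The Solr Cell jar sits in "dist"; the glob avoids hard-coding the version.
    sources = list((solr_unzipped / "dist").glob("apache-solr-cell-*.jar"))

    # Tika and its dependencies sit in contrib/extraction/lib.
    extraction_lib = solr_unzipped / "contrib" / "extraction" / "lib"
    sources += [path for path in extraction_lib.iterdir() if path.is_file()]

    copied = []
    for source in sources:
        shutil.copy2(source, lib_dir / source.name)
        copied.append(source.name)
    return sorted(copied)


# Example call (paths are assumptions):
# install_tika_libs(r"C:\temp\solr", r"C:\solr")
```

The function returns the names of the copied files, so you can eyeball the list before restarting Tomcat.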

To use it in a .NET client application you can use the newest release of SolrNet (currently a beta release) and add the DLLs to your .NET project (all of them, seriously!). This is an example of how to use it in C#:

The response holds all the metadata that Apache Tika extracted from the file. ExtractParameters is given a FileStream object and an ID for the Solr index (here just “doc1”, but it can be anything as long as it is unique). Set the ExtractOnly property to true if you don’t want Tika to index the data but only to extract the metadata from the file you send. The file is streamed to the Solr API using HTTP POST. You can read more about that here:

In the above code the data sent to Solr is indexed in the last line, where it is committed to Solr. If you would like Solr to index and commit each file as it is sent to the service, set the AutoCommit property to true when initializing ExtractParameters:

Because the commit is done every time you send a new file to the Solr API, you can search during the indexing and, of course, you don’t need to call the solr.Commit() method after indexing.
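
To make the HTTP side a bit more concrete: what goes over the wire is a POST to Solr’s extracting handler, with the document ID and the flags passed as query parameters. The sketch below builds such a request in Python rather than C#. The handler path /update/extract and the parameter names literal.id, extractOnly and commit are the ones Solr Cell documents. The base URL, the function names, and the idea that SolrNet’s ExtractOnly and AutoCommit map onto exactly these parameters are my own assumptions.

```python
import urllib.parse
import urllib.request


def build_extract_url(base_url, doc_id, extract_only=False, commit=False):
    """Build the URL for a Solr Cell extract request (hypothetical helper)."""
    params = [("literal.id", doc_id)]          # unique ID for the document
    if extract_only:
        params.append(("extractOnly", "true"))  # return metadata, do not index
    if commit:
        params.append(("commit", "true"))       # commit as part of this request
    query = urllib.parse.urlencode(params)
    return base_url.rstrip("/") + "/update/extract?" + query


def post_file(url, path, content_type="application/octet-stream"):
    """POST the raw file bytes to Solr and return the response body as text."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example (assumes Solr runs at this address and test.pdf exists):
# url = build_extract_url("http://localhost:8080/solr", "doc1", commit=True)
# print(post_file(url, "test.pdf", "application/pdf"))
```

Leaving commit off corresponds to the first variant described above, where you commit separately once all files have been sent.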

You need a request handler inside your solrconfig.xml (inside {your-solr-install-path}/conf) so that Solr understands the request from the client. Below is an example of how solrconfig.xml looks when you haven’t changed anything after installing Solr. See this for further information about configuring Tika inside Solr:
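
A quick way to check that your own solrconfig.xml really contains such a handler is to parse it. The Python sketch below does that. Note that the embedded XML is only my recollection of roughly what the 3.x example config ships with, so treat it as an approximation and compare it with your own file; find_extract_handler is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Approximation of the handler in the Solr 3.x example config (an assumption;
# compare it against your own solrconfig.xml).
SAMPLE_SOLRCONFIG = """
<config>
  <requestHandler name="/update/extract" startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>
</config>
"""


def find_extract_handler(solrconfig_xml):
    """Return the defaults of the extracting request handler, or None."""
    root = ET.fromstring(solrconfig_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("class", "").endswith("ExtractingRequestHandler"):
            # Collect every entry of <lst name="defaults"> as name -> value.
            return {
                item.get("name"): (item.text or "").strip()
                for item in handler.findall("lst[@name='defaults']/*")
            }
    return None


# Example: print(find_extract_handler(open("solrconfig.xml").read()))
```

If the function returns None for your file, the client request has nothing to talk to and the extract call will fail.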

Your Solr schema.xml file (inside {your-solr-install-path}/conf) needs some fields in order to index the metadata from the files you send to Solr. You can define the fields you need and index or store the metadata as required for the files you want to index. These are the fields that Solr is installed with:

The fields above are a one-size-fits-all setup, so with them you can index both audio and document files.

Supported formats in Tika can be found here:


Installing Apache Solr on Windows

Apache Solr is a Java-based enterprise search platform built on top of the Apache Lucene search engine (the two projects are now merged). It makes all the great search engine features available through a RESTful API (HTTP/XML and JSON): indexing, full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. The best part is that it’s open source and free for all.


I have gained a lot of know-how about this great tool, for both home and business use, and in this blog post I would like to share how you can install Solr on a Windows system (Windows 6 / 7 / 2008 (R2) Server). I wrote this guide because I had a hard time finding one on this subject out there on the web. So here we go.


First of all, you need to install a web server that can run Java servlets. I use the Apache Tomcat web server. Download the latest Tomcat server (the MSI installer is perfect for this: Binary Distributions -> Core -> 32-bit/64-bit Windows Service Installer) and install it on your system (right now the latest version is 7.0). Once it is installed, check that it is running correctly by going to http://localhost:8080/.


After you have checked that it is running correctly, go to the directory where you installed Tomcat and open the server.xml file in the conf folder (conf\server.xml). In this file, add the following attribute to the first Connector XML tag (Server -> Service -> Connector): URIEncoding="UTF-8".
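
If you prefer not to edit server.xml by hand, the change can be sketched in a few lines of Python. This is only an illustration under a few assumptions: add_uri_encoding is a name I made up, it needs Python 3.8 or newer, and it works on the file contents as text, so you still have to read and write the file yourself (and keep a backup).

```python
import xml.etree.ElementTree as ET


def add_uri_encoding(server_xml, encoding="UTF-8"):
    """Return server.xml text with URIEncoding set on the first Connector."""
    # insert_comments keeps the comments inside the document (Python 3.8+).
    # The XML declaration and comments outside the root element are not
    # written back, so keep a backup of the original file.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(server_xml, parser=parser)

    # Server -> Service -> Connector; find() returns the first match.
    connector = root.find("Service/Connector")
    if connector is None:
        raise ValueError("no Connector element found under Server/Service")

    connector.set("URIEncoding", encoding)
    return ET.tostring(root, encoding="unicode")


# Example (the path is an assumption):
# path = r"C:\Tomcat\conf\server.xml"
# updated = add_uri_encoding(open(path, encoding="utf-8").read())
# open(path, "w", encoding="utf-8").write(updated)
```

Only the first Connector is touched, which matches the manual step above; any other connectors are left as they were.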


Download and unzip the latest version of Solr into a temporary folder on your system, for example “C:\temp\solr” (I have experienced some problems running version 3.5 on Tomcat, so use version 3.4 for now):


Create a folder on your file system where you would like Solr to be installed. Copy the contents of the “C:\temp\solr\example\solr” folder into the folder you just created.


Stop the Tomcat service. If you installed using the MSI installer you can do this by going to the Tomcat folder inside All Programs in the start menu and clicking “Configure Tomcat” (you might need to right-click it and choose “Run as administrator”). Keep the Tomcat configuration window open after you have stopped the service; we are going to use it later.


Copy the *solr*.war file from “C:\temp\solr\dist” to the webapps folder inside your Tomcat installation folder. For version 3.4 of Solr, for instance, the .war file is called apache-solr-3.4.0.war. Once the file is copied, rename it to “solr.war”.
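
The copy-and-rename step can be done in one go with a small script. The Python sketch below assumes the dist folder contains exactly one file matching *solr*.war, which is what I would expect from the 3.4 zip but have not verified for other versions; deploy_solr_war and the example paths are made up for illustration.

```python
import shutil
from pathlib import Path


def deploy_solr_war(dist_dir, webapps_dir):
    """Copy the single *solr*.war from dist_dir to webapps_dir as solr.war."""
    candidates = sorted(Path(dist_dir).glob("*solr*.war"))
    if len(candidates) != 1:
        # Refuse to guess when there is no war file or more than one.
        raise FileNotFoundError(
            f"expected exactly one *solr*.war in {dist_dir}, "
            f"found {len(candidates)}"
        )
    target = Path(webapps_dir) / "solr.war"
    shutil.copy2(candidates[0], target)  # copy and rename in a single step
    return target


# Example (paths are assumptions; stop the Tomcat service first):
# deploy_solr_war(r"C:\temp\solr\dist", r"C:\Tomcat\webapps")
```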


Now we need to configure Tomcat so that it recognizes the Solr install folder that you created earlier. This is done by adding a Java option: open the Tomcat configuration window mentioned earlier and go to the “Java” tab. Here you have a “Java Options” textbox with a lot of lines in it. At the bottom of this textbox add the line “-Dsolr.solr.home={solr-install-folder}”, where {solr-install-folder} is the path to your Solr install folder.


In the Tomcat configuration window, start the Tomcat service again. After starting the service, open a web browser and navigate to the local Solr administration site: http://localhost:8080/solr/admin. If the site loads, Solr has been installed on your system.
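
If you want to check this from a script instead of a browser, a minimal Python sketch could look like the one below. The function name solr_admin_is_up is my own, the default URL is simply the one from this post, and the opener argument exists only so the check can be tried out without a running server.

```python
import urllib.request


def solr_admin_is_up(url="http://localhost:8080/solr/admin", timeout=5.0,
                     opener=urllib.request.urlopen):
    """Return True if the Solr admin page answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses,
        # so a refused connection or an error status ends up here.
        return False


# Example: print(solr_admin_is_up())
```

A False result right after starting the service does not have to mean something is wrong; Tomcat may simply still be unpacking solr.war, so it can be worth trying again after a few seconds.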
