Jonas Midstrup's Blog

Programming, IT in general and everything in between.

Archive for the ‘C#’ tag

Indexing files in Solr using Tika


After installing the latest Solr release I noticed that schema.xml (the XML file that tells Solr how to index and search your data) was already set up for something called Apache Tika, which can be used to index all sorts of documents and other files that hold metadata (audio files, for instance).

See this post to learn how to install Solr on Windows: Installing Apache Solr on Windows

The great thing about Tika is that you don’t need to do anything to make it work: Tika does almost everything for you. You can also tailor it to your needs by adding a config XML file in the Solr directory and a reference to it in the Solr config XML file. I tried to get it to work but found it quite difficult, partly because Google turned up very little help. Because of all my problems I created this post on StackOverflow. Before I got a reply, though, I found the solution myself, buried deep inside some forum posts about SolrNet.

This is how to get it working inside Solr using SolrNet as the client.

  1. Install Solr using the link from my earlier post above or something similar. The main thing is that you install it from the prebuilt Solr release.
  2. Create a new folder called “lib” inside your Solr install folder.
  3. Copy the apache-solr-cell-3.4.0.jar file from the “dist” folder in the Solr zip file to the newly created “lib” folder.
  4. Copy the contents of contrib\extraction\lib from the Solr zip file to the same “lib” folder.

Now Tika is installed in Solr! Remember to go to http://localhost:8080/solr and confirm that it is installed correctly.

To use it in a .NET client application you can use the newest release of SolrNet (currently the 0.4.0.2002 beta version release) and add the DLLs to your .NET project (all of them – seriously!). This is an example of how to use it in C#:

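using System.IO;
using Microsoft.Practices.ServiceLocation;
using SolrNet;

// Document is a placeholder for your own mapped document class.
Startup.Init<Document>("YOUR-SOLR-SERVICE-PATH");
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Document>>();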
 
using (FileStream fileStream = File.OpenRead("FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED"))
{
   var response =
      solr.Extract(
         new ExtractParameters(fileStream, "doc1")
         {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false
         });
}
 
solr.Commit();

The response holds all the metadata that Apache Tika extracted from the file. ExtractParameters is given a FileStream object and an ID for the Solr index (here just “doc1”; it can be anything as long as it is unique). Set the ExtractOnly property to true if you don’t want Solr to index the data but only want Tika to extract the metadata from the file that is sent. The file is streamed to the Solr API using HTTP POST. You can read more about that here: http://wiki.apache.org/solr/ExtractingRequestHandler
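To make the HTTP side a bit more concrete, here is a rough sketch of what the URL for such a POST can look like. The parameter names literal.id, extractOnly and commit are the ones described on the ExtractingRequestHandler wiki page; the class name, the method name and the example base URL are made up for illustration, and I am not claiming this is exactly the request SolrNet builds internally.

```csharp
using System;

public static class ExtractUrlSketch
{
    // Builds the URL for a POST to Solr's /update/extract handler.
    // solrUrl is the base URL of your Solr instance (an assumption,
    // for example http://localhost:8080/solr).
    public static string Build(string solrUrl, string id, bool extractOnly, bool commit)
    {
        return solrUrl.TrimEnd('/') + "/update/extract"
            + "?literal.id=" + Uri.EscapeDataString(id)   // the unique ID, URL-escaped
            + "&extractOnly=" + (extractOnly ? "true" : "false")
            + "&commit=" + (commit ? "true" : "false");
    }
}
```

The file itself would then go in the body of the POST, which is the part SolrNet takes care of for you.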

In the above code the data sent to Solr only becomes searchable after the last line, where it is committed. If you would like Solr to commit each file as soon as it is sent to the service, set the AutoCommit property to true in the ExtractParameters initializer:

...
   var response =
      solr.Extract(
         new ExtractParameters(fileStream, "doc1")
         {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false,
            AutoCommit = true
         });
...

Because a commit is done every time you send a new file to the Solr API, you can search while indexing is still running and, of course, you don’t need to call solr.Commit() afterwards.

You need a request handler inside your solrconfig.xml (inside {your-solr-install-path}/conf) to make Solr understand the request from the client. Below is how the handler looks in solrconfig.xml when you haven’t changed anything after installing Solr. See this for further information about configuring Tika inside Solr: http://wiki.apache.org/solr/ExtractingRequestHandler

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
 
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

Your Solr schema.xml file (inside {your-solr-install-path}/conf) needs some fields in order to index the metadata from the files you send to Solr. You can define the fields you need and choose whether each one is indexed, stored or both, depending on the files you need to index. These are the fields that Solr is installed with:

   <field name="id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
 
   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />
 
   <!--
   The following store examples are used to demonstrate the various ways one might _CHOOSE_ to
    implement spatial.  It is highly unlikely that you would ever have ALL of these fields defined.
    -->
   <field name="store" type="location" indexed="true" stored="true"/>
 
   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
 
   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
 
   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
 
   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>
 
   <field name="payloads" type="payloads" indexed="true" stored="true"/>

The fields above cover a fits-all scenario, so with them you can index both audio and document files.

Supported formats in Tika can be found here:
http://tika.apache.org/1.0/formats.html

Written by jonasm

January 22nd, 2012 at 11:49 pm

Posted in .NET,Solr


CSV Parser


I’m currently working on a project where I needed a C# console application that could read through an Excel CSV (Comma Separated Values) file.

Basically the CSV file format is just a text file with one row per line, where the columns are separated by a comma (surprise!) or a semicolon. In addition, the data in each column can optionally be “framed” by quotation marks.
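Since the separator can be either a comma or a semicolon, one option is to guess it from the first line of the file. The following is only a small sketch of that idea (the class and method names are my own, and the "most frequent character wins" rule is an assumption that can be fooled by separators inside quoted columns):

```csharp
using System;

public static class DelimiterSketch
{
    // Guesses the column separator by counting commas and semicolons
    // in the first line. Ties, and lines containing neither character,
    // fall back to a comma.
    public static char Guess(string firstLine)
    {
        int commas = 0, semicolons = 0;
        foreach (char c in firstLine)
        {
            if (c == ',') commas++;
            else if (c == ';') semicolons++;
        }
        return semicolons > commas ? ';' : ',';
    }
}
```

If you already know which separator your files use, passing it in explicitly is of course the safer choice.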

Therefore I started out with the following code, just as I would read through a normal text file:

try
{
    using (StreamReader readFile = new StreamReader(path))
    {
        // Do something here…
    }
}
catch (Exception e)
{
    // Do some error handling here…
}

This is, as you can see, really straightforward. First of all I declare a StreamReader object in a using statement. Using the object “readFile” I am then able to navigate the file. The using statement is important because it does the cleanup for me, by calling StreamReader.Dispose() when the statement finishes. I always wrap this kind of code in a try…catch because when you work with files, errors occasionally just happen.

Now, to read the data from the CSV file I add the following lines of code inside the using statement:

List<string[]> parsedData = new List<string[]>();
string line;
string[] row;

while ((line = readFile.ReadLine()) != null)
{
    row = line.Split(',');

    parsedData.Add(row);

}

It just declares a new List that can hold arrays of strings, and the line and row variables are needed when traversing the file. I then use the readFile object to call the ReadLine() method of the StreamReader class in a while loop. When there are no more lines in the file, the line variable will be null. Inside the while loop I use the string.Split() method to split the line into an array of strings (my columns) and then add this array to my List object (parsedData).
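One thing string.Split() does not handle is columns framed by quotation marks: a comma inside the quotes is treated as a separator and the quotes stay in the data. If your files contain such columns, a character-by-character split is one way around it. The sketch below is my own illustration (the names are hypothetical) and assumes the common convention that a doubled quotation mark inside a quoted column means a literal quotation mark. It still works on one line at a time, so quoted columns that contain line breaks are not covered.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineSketch
{
    // Splits one line into columns, honouring quoted columns.
    public static string[] SplitLine(string line, char separator)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);          // separators are plain text here
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');        // doubled quote = literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;           // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;                // opening quote
            }
            else if (c == separator)
            {
                fields.Add(current.ToString()); // column finished
                current.Length = 0;
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());         // last column
        return fields.ToArray();
    }
}
```

Inside the while loop this would take the place of line.Split(','), for example row = CsvLineSketch.SplitLine(line, ',');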

The problem then was that I didn’t know exactly what encoding the file would be in. What to do then? I settled on a solution where I tell the StreamReader what encoding the file probably has and it will then open it in that encoding. This can be done by adding a parameter when calling the constructor on the StreamReader class like this:

using (StreamReader readFile = new StreamReader(path, encoding))
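If the file happens to start with a byte order mark (BOM), StreamReader can also work out the encoding by itself. There is a constructor overload with a third parameter, detectEncodingFromByteOrderMarks, and when it is true the encoding you pass in is only used as a fallback for files without a BOM. A small sketch of that, with class and method names of my own choosing:

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSketch
{
    // Reads the first line of a file. 'fallback' is only used when the
    // file has no byte order mark; otherwise the BOM decides the encoding.
    public static string ReadFirstLine(string path, Encoding fallback, out Encoding used)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            string line = reader.ReadLine();
            // CurrentEncoding is only reliable after the first read.
            used = reader.CurrentEncoding;
            return line;
        }
    }
}
```

Files saved without a BOM are still read with whatever fallback encoding you pass in, so this helps only in the cases where a BOM is present.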

Finally all this can be wrapped in a nice method. I also added a check to be sure that the file I want to parse is actually available. But there you go:

public static List<string[]> ParseCSV(string path, Encoding encoding, char splitter)
{
    if (!File.Exists(path))
        return null;

    List<string[]> parsedData = new List<string[]>();

    try
    {
        using (StreamReader readFile = new StreamReader(path, encoding))
        {
            string line;
            string[] row;

            while ((line = readFile.ReadLine()) != null)
            {
                row = line.Split(splitter);
                parsedData.Add(row);
            }
        }
    }
    catch (Exception e)
    {
        // Do some error handling here…
    }

    return parsedData;
}

Written by jonasm

June 23rd, 2011 at 3:47 pm

Posted in .NET
