Jonas Midstrup's Blog

Programming, IT in general and everything in between.

Archive for the ‘.NET’ Category

Indexing files in Solr using Tika

After installing the latest Solr release I noticed that the schema.xml file (the XML file that tells Solr how to index and search your data) was already set up for something called Apache Tika. Tika can be used to index all sorts of documents and other files that carry metadata (audio files, for instance).

See this post to learn how to install Solr on Windows: Installing Apache Solr on Windows

The great thing about Tika is that you need to do almost nothing to make it work, because Tika does nearly everything for you. You can also tailor it to your needs by adding a Tika config XML file to the Solr directory and referencing it from solrconfig.xml. I tried to get it to work but found that quite difficult, partly because there wasn’t much help to be found on Google. Because of all my problems I created this post on StackOverflow. Before I got a reply, though, I found the solution myself, buried deep inside some forum posts about SolrNet.

This is how you get it to work inside Solr using SolrNet as the client.

  1. Install Solr using the link from my earlier post above or something similar. The main thing is that you install it from the prebuilt Solr release.
  2. Create a new folder called “lib” inside your Solr install folder.
  3. Copy the apache-solr-cell-3.4.0.jar file from the “dist” folder of the Solr zip file to the newly created “lib” folder.
  4. Copy the content of contrib\extraction\lib from the Solr zip file to the same “lib” folder.

Now Tika is installed in Solr! Remember to go to http://localhost:8080/solr and confirm that it is installed correctly.

To use it in a .NET client application you can use the newest release of SolrNet (currently the 0.4.0.2002 beta version release) and add the DLLs to your .NET project (all of them – seriously!). This is an example of how to use it in C#:

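// "Document" is a placeholder for your own SolrNet-mapped class; the generic type arguments are required.
Startup.Init<Document>("YOUR-SOLR-SERVICE-PATH");
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Document>>();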
 
using (FileStream fileStream = File.OpenRead("FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED"))
{
   var response =
      solr.Extract(
         new ExtractParameters(fileStream, "doc1")
         {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false
         });
}
 
solr.Commit();

The response will hold all the metadata that Apache Tika extracted from the file. ExtractParameters is given a FileStream object and an ID for the Solr index (here just “doc1”; it can be anything as long as it is unique). Set the ExtractOnly property to true if you don’t want Solr to index the data but only want Tika to extract the metadata from the file that is sent. The file is streamed to the Solr API using HTTP POST. You can read more about that here: http://wiki.apache.org/solr/ExtractingRequestHandler

In the code above, the data sent to Solr becomes searchable in the last line, where it is committed. If you would like Solr to commit each file as soon as it is sent to the service, you can set the AutoCommit property to true in the ExtractParameters initializer:

...
   var response =
      solr.Extract(
         new ExtractParameters(fileStream, "doc1")
         {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false,
            AutoCommit = true
         });
...

Because a commit is done every time you send a new file to the Solr API, you can search while indexing is still running and, of course, you don’t need to call solr.Commit() afterwards.

You need a request handler inside your solrconfig.xml (inside {your-solr-install-path}/conf) to make Solr understand the request from the client. Below is an example of how the solrconfig.xml looks when you haven’t changed anything after install of Solr. See this for further information about configuring Tika inside Solr: http://wiki.apache.org/solr/ExtractingRequestHandler

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
 
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

Your Solr schema.xml file (inside {your-solr-install-path}/conf) needs some fields in order to index the metadata from the files you send to Solr. You can define the fields you need and choose whether each one is indexed, stored, or both, as required for the files you want to index. These are the fields that Solr is installed with:

   <field name="id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
 
   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />
 
   <!--
   The following store examples are used to demonstrate the various ways one might _CHOOSE_ to
    implement spatial.  It is highly unlikely that you would ever have ALL of these fields defined.
    -->
   <field name="store" type="location" indexed="true" stored="true"/>
 
   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
 
   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
 
   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
 
   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>
 
   <field name="payloads" type="payloads" indexed="true" stored="true"/>

The fields above are a one-size-fits-all setup, so with them you can index both audio and document files.

Supported formats in Tika can be found here:
http://tika.apache.org/1.0/formats.html

Written by jonasm

January 22nd, 2012 at 11:49 pm

Posted in .NET,Solr

CSV Parser

I’m currently working on a project where I needed a C# console application that could read through an Excel CSV (Comma Separated Values) file.

Basically, the CSV format is just a text file with one row per line, where the columns are separated by a comma (surprise!) or a semicolon. In addition, the data in each column can optionally be “framed” by quotation marks.

So I started out with the following code, just as I would read through a normal text file:

try
{
    using (StreamReader readFile = new StreamReader(path))
    {
        // Do something here…
    }
}
catch (Exception e)
{
    // Do some error handling here…
}

This is, as you can see, really straightforward. First I declare a StreamReader object in a using statement. Using the object “readFile” I can then navigate the file. The using statement is important because it does the cleanup for me by calling StreamReader.Dispose() when the statement finishes. I always wrap this kind of code in a try…catch because errors do occasionally happen when you work with files.

Now, to read the data from the CSV file I add the following lines of code inside the using statement:

List<string[]> parsedData = new List<string[]>();
string line;
string[] row;

while ((line = readFile.ReadLine()) != null)
{
    row = line.Split(',');

    parsedData.Add(row);

}

This declares a new List that holds arrays of strings, plus the line and row variables needed when traversing the file. I then call the ReadLine() method of the StreamReader class on the readFile object in a while loop. When there are no more lines in the file, ReadLine() returns null and the loop ends. Inside the while loop I use the string.Split() method to split the line into an array of strings (my columns), and I then add this array to my List object (parsedData).

The problem then was that I didn’t know exactly what encoding the file would be in. What to do then? I settled on a solution where I tell the StreamReader what encoding the file probably has and it will then open it in that encoding. This can be done by adding a parameter when calling the constructor on the StreamReader class like this:

using (StreamReader readFile = new StreamReader(path, encoding))
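
If the file might start with a byte order mark (BOM), one option is to let the StreamReader detect it and treat the encoding you pass in only as a fallback. The sketch below uses the three-argument StreamReader constructor for that. The names EncodingSketch and ReadAllWithFallback are made up for illustration and are not part of the original parser.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSketch
{
    // Hypothetical helper: reads the whole file. A BOM at the start of the
    // file decides the encoding; 'fallback' is used only when there is no BOM.
    public static string ReadAllWithFallback(string path, Encoding fallback)
    {
        // The third argument (true) turns on detection from byte order marks.
        using (StreamReader readFile = new StreamReader(path, fallback, true))
        {
            return readFile.ReadToEnd();
        }
    }
}
```

This only helps for files that actually carry a BOM. For files without one you still have to make a qualified guess, as described above.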

Finally all this can be wrapped in a nice method. I also added a check to be sure that the file I want to parse is actually available. But there you go:

public static List<string[]> ParseCSV(string path, Encoding encoding, char splitter)
{
    if (!File.Exists(path))
        return null;

    List<string[]> parsedData = new List<string[]>();

    try
    {
        using (StreamReader readFile = new StreamReader(path, encoding))
        {
            string line;
            string[] row;

            while ((line = readFile.ReadLine()) != null)
            {
                row = line.Split(splitter);
                parsedData.Add(row);
            }
        }
    }
    catch (Exception e)
    {
        // Do some error handling here…
    }

    return parsedData;
}
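
One thing the method above does not handle is the quotation marks mentioned earlier: string.Split() cuts on every separator, including separators inside a quoted column. A minimal sketch of a quote-aware splitter could look like the one below. The names CsvLineSketch and SplitLine are hypothetical, and the sketch assumes that a doubled quote ("") inside a quoted column stands for one literal quote character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineSketch
{
    // Hypothetical helper: splits one line on 'splitter' but ignores
    // separators that appear between double quotes. The framing quotes
    // themselves are dropped from the returned columns.
    public static string[] SplitLine(string line, char splitter)
    {
        List<string> fields = new List<string>();
        StringBuilder current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];

            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote: one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == splitter)
            {
                fields.Add(current.ToString());
                current.Length = 0;      // start the next column
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());  // the last column has no trailing separator
        return fields.ToArray();
    }
}
```

To try it, replace line.Split(splitter) in ParseCSV with CsvLineSketch.SplitLine(line, splitter). Quoted columns that span several lines are still not handled, because the file is read one line at a time.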

Written by jonasm

June 23rd, 2011 at 3:47 pm

Posted in .NET

WCF-services hosted in a Windows service

Some time ago I ran into one of those problems with no obvious solution. I was on a project where I needed a WCF service for a Silverlight solution. I started out by making a service hosted in IIS. That worked fine with a connection to a database, but the service also had to open some physical files in a specific folder on the Windows server where it was hosted. This could not be done, because the service didn’t have access permissions to folders outside the service’s own folder. So what to do then?

I searched for a solution to the problem and one that I found was to use Windows impersonation in the service. This “simulates” a user logged in on the server with the rights given to that user. For me this wasn’t an optimal solution for a number of reasons, first of all because it didn’t seem very secure. I quickly started to search for another way to cope with the problem.

The solution I came up with was this: because I had administrator rights to the server, I could host my WCF services in a Windows service and install it as such. This way you can run the WCF service outside IIS, and you can run multiple WCF services in the same Windows service as well. Another great thing is that if you install it right (as I will show below), it gets access to the file system and runs as a regular service under Windows. The example below shows what the constructor of a Windows service can look like:

        public Service()
        {
            InitializeComponent();
 
            this.ServiceName = "Name of your Service";
            this.EventLog.Log = "Application";
 
            this.AutoLog = true;
            this.CanHandlePowerEvent = true;
            this.CanHandleSessionChangeEvent = true;
            this.CanPauseAndContinue = true;
            this.CanShutdown = true;
            this.CanStop = true;
        }

What happens here is that I provide the name of the service (this is the name it will have in Windows) and tell it to write to the Application event log. I also tell it to log automatically to that event log when something happens with the service (AutoLog), and that it can be stopped, shut down, paused and continued, among other things.

The code example below shows the Main-method of the service. As with any other Windows application this has to be provided as the starting point of the application.

        static void Main(string[] args)
        {
            try
            {
                ServiceBase.Run(new Service());
            }
            catch (Exception ex)
            {
                // Some logging or error handling here...
            }
        }

Here I make the service run. Remember the try-catch block: if something goes wrong while the service is being initialized, the exception is caught instead of crashing the process. Also notice that the Service class inherits from the ServiceBase class (shown further down). This is what makes our class a service, and it is needed later when we install it and make it run.

To make the service do something when it starts, stops, continues, pauses or shuts down (for instance because the server it runs on shuts down), you can override the OnStart, OnStop, OnContinue, OnPause or OnShutdown methods respectively.

Up until now I haven’t shown how I combine the Windows service and the WCF services, but my next code examples show just that. First of all you need to make your WCF services start when the Windows service starts. This is done by overriding the OnStart method as mentioned above and, inside it, hosting each WCF service in a service host and opening it. I learned that a good way to keep control of every service host is to declare the hosts as fields in the class. An example of this is provided below.

    partial class Service : ServiceBase
    {
        public ServiceHost serviceHostFirstWcfService = null;
        public ServiceHost serviceHostSecondWcfService = null;
        ....
    }

In my OnStart method I do the following:

       protected override void OnStart(string[] args)
        {
            try
            {
                if (this.serviceHostFirstWcfService != null)
                    this.serviceHostFirstWcfService.Close();
 
                this.serviceHostFirstWcfService = new ServiceHost(typeof(FirstWcfService));
 
                this.serviceHostFirstWcfService.Open();
 
                if (this.serviceHostSecondWcfService != null)
                    this.serviceHostSecondWcfService.Close();
 
                this.serviceHostSecondWcfService = new ServiceHost(typeof(SecondWcfService));
 
                this.serviceHostSecondWcfService.Open();
            }
            catch (Exception ex)
            {
                // Some exception handling...
            }
        }

First of all for every WCF-service I find out whether it has been initialized before. If it has, I close it so I don’t get an exception thrown when I try to open a service that is already open. Afterwards I declare the service host given the type of WCF-service that it should open and open it. As with everything else I pack it all inside a try-catch block to prevent the Windows service from crashing.

Finally you need to provide the normal configuration settings for your WCF-services inside an App.config file in the Windows service class library. The App.config file can look like the one below:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <system.serviceModel>
        <behaviors>
            <serviceBehaviors>
                <behavior name="">
                    <serviceMetadata httpGetEnabled="true" />
                    <serviceDebug includeExceptionDetailInFaults="false" />
                </behavior>
            </serviceBehaviors>
        </behaviors>
        <services>
            <service name="MyNamespace.FirstWcfService">
                <endpoint address="" binding="basicHttpBinding" contract="MyNamespace.IFirstWcfService">
                    <identity>
                        <dns value="Something" />
                    </identity>
                </endpoint>
                <endpoint address="mex" binding="mexHttpBinding" contract="IMetadataExchange" />
                <host>
                    <baseAddresses>
                        <add baseAddress="http://MyFullBaseAddress/MyNamespace/FirstWcfService/" />
                    </baseAddresses>
                </host>
            </service>
            <service name="MyNamespace.SecondWcfService">
                <endpoint address="" binding="basicHttpBinding" contract="MyNamespace.ISecondWcfService">
                    <identity>
                        <dns value="Something" />
                    </identity>
                </endpoint>
                <endpoint address="mex" binding="mexHttpBinding" contract="IMetadataExchange" />
                <host>
                    <baseAddresses>
                        <add baseAddress="http://MyFullBaseAddress/MyNamespace/SecondWcfService/" />
                    </baseAddresses>
                </host>
            </service>
        </services>
    </system.serviceModel>
</configuration>

There you have it! The service works fine like this, and if you have provided the correct base address the service can be called from there. Remember that the baseAddress attribute holds the full address of the service, so when you call it you don’t need a .svc suffix. Tip: You can provide multiple endpoints if you need multiple bindings.

The service worked great until I started testing it with the Silverlight application and realized that the service, of course, needed a clientaccesspolicy.xml file to grant the application access to the WCF service. But where do you put this file when your service isn’t hosted in IIS and you can’t just drop it in the root directory?

The solution is to understand how the Silverlight application asks for the clientaccesspolicy.xml file. When a Silverlight application calls a service in another domain, it automatically assumes that the clientaccesspolicy.xml file is located in the root of the domain where the service resides (the base address from the configuration file without the service name). If the file isn’t there, you get the standard 404 error (“File not found” or something like that) when you try to call the service from the application. So the fix is to serve a clientaccesspolicy.xml file from the root address, but how?

The answer is to add a new WCF service that returns the XML file as a message in response to an HTTP GET request. The interface for the new service should look like the one below. The UriTemplate property of the WebGetAttribute specifies the URI, which in our example is the name of the clientaccesspolicy file.

    [ServiceContract(Namespace = "http://YourService")]
    public interface IClientAccessPolicyService
    {
        [OperationContract]
        [WebGet(UriTemplate = "clientaccesspolicy.xml")]
        Message ProvidePolicyFile();
    }

Then, in the implementation of the service interface, you read the clientaccesspolicy.xml file into a string, load it into a StringReader, wrap that in an XmlReader, create a new System.ServiceModel.Channels.Message from the XmlReader, and return it.

    public class ClientAccessPolicyService : IClientAccessPolicyService
    {
        public System.ServiceModel.Channels.Message ProvidePolicyFile()
        {
            try
            {
                // Read the whole policy file into memory; the using block releases the file handle.
                string fileContent;
                using (StreamReader fileStream = new StreamReader(@"C:\the\full\path\to\your\clientaccesspolicy.xml"))
                {
                    fileContent = fileStream.ReadToEnd();
                }
 
                StringReader sr = new StringReader(fileContent);
                XmlReader reader = XmlReader.Create(sr);
 
                System.ServiceModel.Channels.Message result = Message.CreateMessage(MessageVersion.None, "", reader);
                return result;
            }
            catch (Exception ex)
            {
                return null;
            }
        }
    }

There you have it! When you now call your service from a Silverlight application, you will see that it gets the clientaccesspolicy.xml from the service.

After my service is done I can install it using a Windows installer or a basic console application. I chose to install it just by running a console application and you can see how this is done in this example: Installing a Windows service

Written by jonasm

August 2nd, 2010 at 5:39 pm

Installing a Windows service

To install a service you need the ServiceProcessInstaller and ServiceInstaller classes. Together they install the Windows service using the information you provide.

The Account property on the ServiceProcessInstaller specifies which security context the service should run under when installed. I use LocalSystem here because I need my service to have access to my file system, but you have to choose what best fits your situation. The Username and Password properties specify which user the service should run as; they only apply when Account is set to ServiceAccount.User, which is why they are null here.

On the ServiceInstaller you can specify how and when your service is started with the StartType property, and you can set the service name, display name and description. You then need to assign the ServiceProcessInstaller instance as the parent of the ServiceInstaller instance. Next you specify the context in which the service should be installed: the full path to the service executable, passed as a command-line argument, and a path to a log file if needed. Then you install it by calling the Install method on the ServiceInstaller instance.

Finally, because we want to start the service, we take control of the installed service with the ServiceController class and call its Start method. That’s it! Your service is installed.

ServiceProcessInstaller processInstaller =
    new ServiceProcessInstaller();
processInstaller.Account = ServiceAccount.LocalSystem;
processInstaller.Username = null;
processInstaller.Password = null;
 
ServiceInstaller serviceInstaller =
    new ServiceInstaller();
serviceInstaller.StartType = ServiceStartMode.Automatic;
serviceInstaller.ServiceName = ServiceName;
serviceInstaller.DisplayName = ServiceDisplayName;
serviceInstaller.Description = ServiceDescription;
serviceInstaller.Parent = processInstaller; 
 
String path = String.Format("/assemblypath={0}", ServiceExecutablePath);
String[] cmdline = { path };
String logFilePath = @"C:\pathtoyourlogfile.txt";
serviceInstaller.Context = new System.Configuration.Install.InstallContext(logFilePath, cmdline);
 
System.Collections.Specialized.ListDictionary state =
    new System.Collections.Specialized.ListDictionary();
serviceInstaller.Install(state);
 
ServiceController serviceController =
    new ServiceController(serviceInstaller.ServiceName);
 
serviceController.Start();

If you need to uninstall your service, for instance to update it (that’s right, the service executable cannot be altered as long as the service is installed), you can use the code below to do just that. It should be quite straightforward. You take control of the service as in the code above and check whether it is running or paused. If it is, it has to be stopped before we can uninstall it, so we stop it and wait for it to reach the Stopped state. Then we provide the needed context information and uninstall by calling Uninstall on our ServiceInstaller object.

ServiceInstaller serviceInstaller =
    new ServiceInstaller();
serviceInstaller.ServiceName = ServiceName;
 
ServiceController serviceController =
    new ServiceController(serviceInstaller.ServiceName);
 
if ((serviceController.Status == ServiceControllerStatus.Running)
    || (serviceController.Status == ServiceControllerStatus.Paused))
{
    serviceController.Stop();
 
    serviceController.WaitForStatus(ServiceControllerStatus.Stopped, new TimeSpan(0, 0, 0, 15));
 
    serviceController.Close();
}
 
String path = String.Format("/assemblypath={0}", ServiceExecutablePath);
String[] cmdline = { path };
String logFilePath = @"C:\pathtoyourlogfile.txt";
serviceInstaller.Context = new System.Configuration.Install.InstallContext(logFilePath, cmdline);
 
serviceInstaller.Uninstall(null);

Written by jonasm

August 2nd, 2010 at 5:35 pm

Posted in Windows service

New features in .NET 4.0

I just found this on MSDN where you can see what’s new in the whole .NET 4.0 framework:
What’s New in the .NET Framework 4

It’s still the RC framework so be patient with it :) As it says above, it’s “subject to change”. But I think it’s as close to the final list as it can be.

One of the cool things, which I had forgotten I needed in C# but which C++ already had, is optional parameters in methods. This really saves time.

Another cool thing is Enum.TryParse. I have missed that so much, as it has been a pain to work with enums. Now I can safely parse enums as I like.

The new String.IsNullOrWhiteSpace method is also a very cool thing.
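
To make the three features a bit more concrete, here is a small sketch that uses all of them together. The Color enum and the Net4Sketch class with its two methods are invented for this example only.

```csharp
using System;

public enum Color { Red, Green, Blue }

public static class Net4Sketch
{
    // Optional parameter: callers may leave out 'greeting' and get "Hello".
    public static string Greet(string name, string greeting = "Hello")
    {
        // True for null, the empty string and strings of only white space.
        if (String.IsNullOrWhiteSpace(name))
            return greeting + ", stranger";

        return greeting + ", " + name;
    }

    // Enum.TryParse returns false instead of throwing when the text is not
    // a valid name. Numeric text such as "7" parses successfully even when
    // no member has that value, hence the extra Enum.IsDefined check.
    public static Color ParseColorOrDefault(string text, Color fallback)
    {
        Color parsed;
        if (Enum.TryParse(text, true, out parsed) && Enum.IsDefined(typeof(Color), parsed))
            return parsed;

        return fallback;
    }
}
```

With this in place, Net4Sketch.Greet("Jonas") and Net4Sketch.Greet("Jonas", "Hi") both compile, and parsing user input into an enum no longer needs a try…catch around Enum.Parse.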

Written by jonasm

March 17th, 2010 at 10:42 pm

Posted in .NET

Pivot: New ways of working with data

You should really check out something I was introduced to today. It’s something new from MS Live Labs called Pivot:
www.getpivot.com

There’s a cool introduction on the website, but you should especially check out this video about it:
Gary Flake: is Pivot a turning point for web exploration?

It’s a new way of presenting and gathering large sets of data. The video shows a WPF application, and I think the engine is running on his own computer, but I saw on the website that they have a Silverlight subset of it.

Definitely something I will try to work with some time. When I get some hands-on experience with it I will write a post about it.

Written by jonasm

March 16th, 2010 at 6:46 pm

Posted in Silverlight,WPF
