Webservices and Incremental Crawls using the If-Modified-Since

If you work with the Enterprise Search Service application or FAST on SharePoint 2010 (or SharePoint 2013, for that matter), you will eventually want to use incremental crawls. Incremental crawls are a great way to save indexing time. An incremental crawl only indexes the changes since the last crawl (incremental or full). It requires fewer resources and finishes sooner, because the set of changes is almost always smaller than the whole content source. If you have a custom website or intranet, you will most likely also have custom code running that you want to crawl.

In our case we had the MySites for all users running with some customization and we wanted to have more control over the indexing of that data, so we decided to develop a custom service that would serve our data, and use a web crawl to index that service. So imagine the following scenario:

You have a set of 50,000 user profiles that you want to index using a web crawl. That will most likely end up as a web service that returns a set of links (over 50,000 of them), each pointing to another web service that returns the details of one specific user. Performing a full crawl on all 50,000 items takes a little over 10 hours. So we would like to support incremental crawls to shorten the indexing time.

If you read up on this, you will find some posts explaining that SharePoint sends a header called "If-Modified-Since" that you can use to determine whether your content has changed. Some posts suggest that other crawlers, like Google's, could suffer from the same issue (and require the same fix). As you can see in the post "Running incremental crawl on a web site recrawls almost entire website", we are not the only ones wanting to do incremental crawls, yet we also see there that an incremental crawl takes the same amount of time as a full crawl. So it is safe to say that implementing If-Modified-Since requires some more knowledge. The main problem with handlers and If-Modified-Since is that you have to handle the headers yourself to get everything working.

The first important thing is to send a Last-Modified header with each response. If you do not, the crawler will not send an If-Modified-Since header on incremental crawls, and you have no way to determine the freshness of your data. If you do send Last-Modified, you can read the If-Modified-Since header on the next hit and decide whether to return a 304 (Not Modified). So the flow of your code would be something like:

[Image: flowchart of the request handling flow]

Putting all of that into code gives you something like the following:

/// <summary>
/// Enables processing of HTTP Web requests by a custom HttpHandler that implements the <see cref="T:System.Web.IHttpHandler"/> interface.
/// </summary>
/// <param name="context">An <see cref="T:System.Web.HttpContext"/> object that provides references to the intrinsic server objects (for example, Request, Response, Session, and Server) used to service HTTP requests.</param>
public void ProcessRequest(HttpContext context)
{
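    // Determine if the data needs to be returned
    string ifModifiedSince = context.Request.Headers["If-Modified-Since"];
    if (string.IsNullOrEmpty(ifModifiedSince))
    {
        // No conditional header (first visit or full crawl): return the data
        ReturnData(context.Response);
    }
    else
    {
        // Simplification: every conditional request is treated as unmodified.
        // In real code, compare the header date with the last change date of
        // your data before assuming nothing has changed.
        ReturnNotModifiedResponse(context.Response);
    }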
}

/// <summary>
/// Renders the data to the client
/// </summary>
/// <param name="response"></param>
private void ReturnData(HttpResponse response)
{
    try
    {
        response.ContentType = "text/html";

        response.Write("some data");

        // Set the Last-Modified header to the current crawl date (or retrieve this based on file changed date)
        response.Cache.SetLastModified(DateTime.Now);
    }
    catch (Exception ex)
    {
        // Log the error and return data not found response, you also might want to return a 403 if you 
        // hit an access denied error
        ReturnNotFoundResponse(response);
    }
}

/// <summary>
/// Send a 304 response to the client, indicating this data isn't modified since the last crawl
/// </summary>
/// <param name="response"></param>
private void ReturnNotModifiedResponse(HttpResponse response)
{
    response.Clear();
    response.StatusCode = 304;
}

/// <summary>
/// Send a 404 response to the client, indicating this data no longer exists
/// </summary>
/// <param name="response"></param>
private void ReturnNotFoundResponse(HttpResponse response)
{
    response.Clear();
    response.StatusCode = 404;
}
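
The handler above answers every request that carries an If-Modified-Since header with a 304. The validation step that the comment in ProcessRequest hints at could look roughly like the sketch below. The class name ConditionalRequest, the method IsNotModified and the lastChangedUtc parameter are assumptions made for illustration; where the last change date of a profile comes from depends on your own data store.

```csharp
using System;
using System.Globalization;

public static class ConditionalRequest
{
    // Returns true when the crawler's copy is still current and a 304 can be sent.
    // headerValue:    raw value of the If-Modified-Since request header (may be null).
    // lastChangedUtc: hypothetical timestamp of the last change to the data, in UTC.
    public static bool IsNotModified(string headerValue, DateTime lastChangedUtc)
    {
        DateTime since;

        // HTTP dates use the RFC 1123 format, e.g. "Tue, 15 Nov 1994 12:45:26 GMT".
        bool parsed = DateTime.TryParseExact(
            headerValue,
            "r",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out since);

        if (!parsed)
        {
            // Missing or malformed header: treat the data as modified and return it in full.
            return false;
        }

        // HTTP dates have one-second resolution, so drop the sub-second part before comparing.
        DateTime lastChanged = new DateTime(
            lastChangedUtc.Ticks - (lastChangedUtc.Ticks % TimeSpan.TicksPerSecond),
            DateTimeKind.Utc);

        return lastChanged <= since;
    }
}
```

In ProcessRequest you would then call ReturnNotModifiedResponse only when IsNotModified returns true, and return the data otherwise. For the comparison to be meaningful, the Last-Modified header set in ReturnData should carry that same last change date rather than the time of the crawl.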

That is the basis of a service that returns your data and makes incremental crawls possible. The service that returns the 50,000 links should always return all of them, unless you want an item deleted. Links that are ‘gone’ when the crawler passes by are handled as deletes.
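
As a rough illustration of the link service, the sketch below builds a plain HTML page with one anchor per profile for the crawler to follow. Everything in it is an assumption: the class name ProfileLinkPage, the Build method, and the URL pattern passed in by the caller are hypothetical, and it uses System.Net.WebUtility, which requires .NET 4.0 or later (on older frameworks HttpUtility.HtmlEncode from System.Web plays the same role).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class ProfileLinkPage
{
    // Builds an HTML page with one link per user id.
    // detailUrlFormat is a hypothetical pattern such as "/profile.ashx?user={0}".
    public static string Build(IEnumerable<string> userIds, string detailUrlFormat)
    {
        var html = new StringBuilder();
        html.AppendLine("<html><body>");

        foreach (string id in userIds)
        {
            // Escape the id for use in a URL, then encode both values for HTML output.
            string url = string.Format(detailUrlFormat, Uri.EscapeDataString(id));
            html.AppendFormat(
                "<a href=\"{0}\">{1}</a>",
                WebUtility.HtmlEncode(url),
                WebUtility.HtmlEncode(id));
            html.AppendLine();
        }

        html.AppendLine("</body></html>");
        return html.ToString();
    }
}
```

Leaving a user id out of the list passed to Build removes its link from the page, which is what lets the crawler treat that item as deleted.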