Sometimes a search query does not return the expected results, displays entries that should not be returned, or returns results in a counter-intuitive order. This article shows ways to troubleshoot those issues.
Luke to the rescue
Luke is the tool of choice to understand search results. It is a free Java utility that lets you analyse a Lucene index.
When trying to understand why some items are not returned, Luke lets you determine whether the issue is upstream (the information is not indexed) or downstream (a widget is not reading the index correctly).
When trying to understand why some unexpected items are returned, Luke gives more information about said items.
Last but not least, Luke can give precise details about how an item's ranking is computed.
You can download this tool from https://luke.googlecode.com/files/lukeall-3.5.0.jar. To run it, type "java -jar lukeall-3.5.0.jar" in a console, then open the Sitefinity Lucene index directory, located by default in ~/App_Data/Sitefinity/Search/[index name].
Luke overview
On the default Overview tab, Luke shows the indexed fields and the number of terms for each in the bottom-left table:
The bottom-right table shows the top terms for the whole index, or just for a particular field. For instance, if you select the ContentType field and click "Show top terms >>", you will see the number of documents indexed per type (e.g. the count next to telerik.sitefinity.pages.model.pagenode is the number of Pages indexed):
You can also perform some searches in the Search tab (in the rest of this article, search queries will be displayed inside square brackets):
The search works using a series of one or more <field name>:<term> predicates, e.g. [Title:hello]. Note that the query is case sensitive: the field name must use exactly the same case as the field name displayed in Luke, and the term must always be lowercase. Adding a + before a predicate indicates that it is mandatory, e.g. [+Title:hello +Title:world] searches for content whose title contains both "hello" and "world".
Notice how the ContentType field indicates the type of content you're dealing with. When a result shows up when it shouldn't, it is generally because Sitefinity is indexing a similar item of another type. The ContentType combined with the Id (and sometimes the Link field) helps pinpoint the exact item being indexed.
How are Sitefinity search queries translated to Lucene queries?
When trying to understand the search results, the first step is often to reproduce the issue inside Luke by running a Lucene search query similar to the one Sitefinity runs under the hood. Here are the general rules:
- In the case of a single-term search, [term] will typically generate a Lucene query like [(Title:term Content:term)], meaning it will search for "term" either in the Title or the Content field
- Sitefinity will however first verify that the term is indexed. If it finds that "term" is not indexed for the field "Title", it will strip this field from the query, e.g. [(Content:term)]
- If it finds other indexed terms starting with "term" it will add them to the query, e.g. [(Content:term Content:term1 Content:term2)]
- In the case of a multiple-term search, [term1 term2] will typically generate a Lucene query like [(+Title:term1 +Title:term2) (+Content:term1 +Content:term2)]
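The rules above can be sketched in a few lines of C#. This is an illustration only, not Sitefinity's actual implementation: the QuerySketch class, the BuildQuery method and the in-memory indexedTerms dictionary are hypothetical stand-ins for the real index lookup. The prefix expansion rule is left out for brevity, and the sketch assumes that the "is the term indexed" check applies only to single-term searches, since the rules above do not say how it interacts with multiple terms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: mimics the translation rules described above.
// It is NOT Sitefinity's implementation; every name here is hypothetical.
public static class QuerySketch
{
    // indexedTerms maps a field name to the set of terms indexed for that field.
    public static string BuildQuery(
        string userQuery,
        string[] fields,
        IDictionary<string, HashSet<string>> indexedTerms)
    {
        // Terms are lowercased because indexed terms are lowercase.
        var terms = userQuery
            .ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        if (terms.Length == 1)
        {
            // Single term: one optional clause per field, keeping only the
            // fields for which the term is actually indexed.
            var clauses = fields
                .Where(f => indexedTerms.ContainsKey(f) && indexedTerms[f].Contains(terms[0]))
                .Select(f => f + ":" + terms[0]);
            return "(" + string.Join(" ", clauses) + ")";
        }

        // Multiple terms: within each field group, every term is mandatory (+).
        var groups = fields.Select(
            f => "(" + string.Join(" ", terms.Select(t => "+" + f + ":" + t)) + ")");
        return string.Join(" ", groups);
    }

    public static void Main()
    {
        var index = new Dictionary<string, HashSet<string>>
        {
            { "Title", new HashSet<string> { "hello" } },
            { "Content", new HashSet<string> { "hello", "world" } }
        };
        var fields = new[] { "Title", "Content" };

        Console.WriteLine(BuildQuery("hello", fields, index));
        Console.WriteLine(BuildQuery("world", fields, index));
        Console.WriteLine(BuildQuery("hello world", fields, index));
    }
}
```

With this sample index, the three calls print (Title:hello Content:hello), (Content:world) and (+Title:hello +Title:world) (+Content:hello +Content:world). Each of these strings can be pasted into the Luke Search tab.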
Exact match search
By default, searching for [company] will search for any term starting with "company". This is not achieved by using wildcards (even though Lucene supports them), but by rewriting the query internally before sending it to Lucene.
When searching for [company], Sitefinity will look for terms in the Lucene index starting with "company" (e.g. "companyA", "companyB"). If it finds such terms, it will rewrite the query internally to search for company, companyA or companyB.
This behavior can be disabled by going to Administration / Settings / Advanced / Search, and checking "Enable exact match"
Note that Sitefinity does NOT support stemming, e.g. searching for [company] will NOT find occurrences of "companies". However, searching for [compan] would look for occurrences of "company", "companies", "companyA" and "companyB" if such terms are already indexed.
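The behavior described in this section can be illustrated with a small C# sketch. It is a simplified model, not Sitefinity's code: the PrefixExpansionSketch class, the Expand method and the exactMatch flag are hypothetical names, and the indexed terms are a hard-coded, lowercased sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only; all names are hypothetical.
public static class PrefixExpansionSketch
{
    // Returns the list of terms that a search for `term` would be rewritten to.
    public static List<string> Expand(
        string term, IEnumerable<string> indexedTerms, bool exactMatch)
    {
        if (exactMatch)
        {
            // "Enable exact match" checked: no expansion takes place.
            return new List<string> { term };
        }

        // Default behavior: every indexed term starting with the searched term.
        return indexedTerms
            .Where(t => t.StartsWith(term, StringComparison.Ordinal))
            .OrderBy(t => t, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        // Sample terms, lowercased as they would appear in the index.
        var indexed = new[] { "company", "companya", "companyb", "companies" };

        // No stemming: "companies" does not start with "company".
        Console.WriteLine(string.Join(", ", Expand("company", indexed, false)));

        // A shorter prefix matches all four terms.
        Console.WriteLine(string.Join(", ", Expand("compan", indexed, false)));

        // Exact match: only the searched term itself.
        Console.WriteLine(string.Join(", ", Expand("company", indexed, true)));
    }
}
```

The first call prints company, companya, companyb, the second prints companies, company, companya, companyb, and the third prints company only.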
Indexing custom fields
By default, Sitefinity looks at only two fields when performing a search: Title and Content. You can however index and search extra fields. In the example below, we add "Symptom" (a field added to a dynamic module) by editing the index and adding the field name under "Additional fields for indexing" in the Advanced section:
After a reindex, we now see a new field:
The last step is to update the Search Results widget (in the Advanced properties) to both search for the Symptom field and to highlight any keyword found in that field:
Customized search
A common request is a search that is more granular than a content type, e.g. searching for documents inside a particular library. This is achieved by:
- Writing a custom .NET class that will override the search results
- Registering that class in the toolbox (the easiest way to do so is probably using Thunder)
- Using that custom widget on the desired page as the Search Results widget, instead of its standard counterpart
Keep in mind that anything that the search widgets process must be in the Lucene index - whether filtering results or displaying extra fields. Before adding any clause that filters on a particular field, make sure that the index contains actual values for that field. Those values can also give you hints about what to filter.
For example, filtering documents stored in a particular library requires filtering for results whose "Link" field begins with, say, "~/docs/default-source/sub-library/". A look at the Link top terms in the Overview tab however indicates that this field is broken down into words. In other words, it is not possible to add a Lucene query filter that looks for results whose Link field starts with "~/docs/default-source/sub-library/". It is thus best to filter the results after the search has run. Below is an implementation example:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using Telerik.Sitefinity.Abstractions;
    using Telerik.Sitefinity.Search;
    using Telerik.Sitefinity.Services.Search;
    using Telerik.Sitefinity.Services.Search.Web.UI.Public;
    using Telerik.Sitefinity.Services.Search.Data;
    using Telerik.Sitefinity.Services.Search.Model;
    using System.Text.RegularExpressions;
    using System.ComponentModel;

    namespace SitefinityWebApp.Custom
    {
        public class MySearch : SearchResults
        {
            [Category("Custom Filter")]
            public string Link { get; set; }

            protected override ISearchResultsBuilder GetSearcher()
            {
                return new MySearcher(this);
            }

            public class MySearcher : ISearchResultsBuilder
            {
                public MySearcher(SearchResults control)
                {
                    this.control = control;
                }

                public IEnumerable<IDocument> Search(string query, string catalogue, string[] searchFields, string[] highlightedFields, int skip, int take, out int hitCount)
                {
                    var control = this.control;
                    var service = Telerik.Sitefinity.Services.ServiceBus.ResolveService<ISearchService>();
                    var queryBuilder = ObjectFactory.Resolve<IQueryBuilder>();
                    var searchQuery = queryBuilder.BuildQuery(query, control.SearchFields);
                    searchQuery.IndexName = catalogue;
                    searchQuery.Skip = skip;
                    searchQuery.Take = take;
                    searchQuery.OrderBy = null;
                    searchQuery.HighlightedFields = control.HighlightedFields;

                    // Contains the default filter - by current language
                    var currentFilter = searchQuery.Filter;
                    var myFilter = new SearchFilter();
                    myFilter.Operator = QueryOperator.And;
                    MySearch myControl = (MySearch)control;

                    // Persist the language filter, if exists
                    if (currentFilter != null) myFilter.AddFilter(currentFilter);
                    searchQuery.Filter = myFilter;

                    IResultSet result = service.Search(searchQuery);
                    var filtered_result = myControl.Link.IsNullOrEmpty() ?
                        result :
                        result.Where(r => r.GetValue("Link") != null &&
                            r.GetValue("Link").ToString().StartsWith(myControl.Link));

                    List<IDocument> documents = filtered_result.SetContentLinks().ToList<IDocument>();
                    hitCount = documents.Count();
                    return documents;
                }

                protected readonly SearchResults control;
            }
        }
    }
Note that this control defines a Link property which can be accessed when looking at the advanced properties of the widget:
This avoids the need to hard-code the library path in the control itself, making it reusable.
Ranking
Ranking is always a difficult topic, as there will always be some users who disagree with the ranking.
Nonetheless, Luke can help you understand the rationale behind a particular ranking. In the Search tab, select a result and click on the Explain button.
Lucene relies on three scores to determine ranking:
- Term frequency (TF): the number of term occurrences
- Inverse Document Frequency (IDF): this is only useful when searching for multiple terms, as it rates the relative importance of each term in the query. The more a term is used across the whole index, the lower its score. The idea is that, when searching for [company ACME], the term "company" carries less weight than "ACME" because it is used more often. As a result, an item containing ten occurrences of "ACME" and one occurrence of "company" will rank higher than an item containing one occurrence of "ACME" and ten occurrences of "company"
- Field Normalization: the longer the whole text, the lower the ranking. In other words, when searching for [ACME], a short news item which contains only "ACME" in its title will have a higher ranking than a news item which contains the same term in its title but with a lot of extra text.
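To make the interplay of these three factors more concrete, here is a rough C# sketch based on the formulas of Lucene's classic TF-IDF similarity: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)) and length norm = 1 / sqrt(number of terms in the field). Whether the Lucene version bundled with Sitefinity uses exactly these formulas is an assumption. The sketch also ignores boosts, the coordination factor and query normalization, and all names and index statistics in it are made up, so use the Explain button in Luke for real figures.

```csharp
using System;

// Illustrative sketch of classic Lucene-style TF-IDF scoring.
// Simplified on purpose; names and index statistics are hypothetical.
public static class ScoringSketch
{
    // Term frequency: grows with the number of occurrences, but sub-linearly.
    public static double Tf(int freq)
    {
        return Math.Sqrt(freq);
    }

    // Inverse document frequency: the rarer the term in the index, the higher.
    public static double Idf(int docFreq, int numDocs)
    {
        return 1.0 + Math.Log((double)numDocs / (docFreq + 1));
    }

    // Field normalization: the longer the field, the lower the factor.
    public static double LengthNorm(int numTerms)
    {
        return 1.0 / Math.Sqrt(numTerms);
    }

    // Contribution of one query term to the score of one document.
    public static double TermScore(int freq, int docFreq, int numDocs, int fieldLength)
    {
        double idf = Idf(docFreq, numDocs);
        return Tf(freq) * idf * idf * LengthNorm(fieldLength);
    }

    public static void Main()
    {
        const int numDocs = 1000;        // documents in the index (made up)
        const int companyDocFreq = 500;  // "company" is common
        const int acmeDocFreq = 10;      // "acme" is rare
        const int length = 100;          // terms in the searched field

        // Search for [company ACME] against two items of equal length.
        double itemA = TermScore(10, acmeDocFreq, numDocs, length)
                     + TermScore(1, companyDocFreq, numDocs, length);
        double itemB = TermScore(1, acmeDocFreq, numDocs, length)
                     + TermScore(10, companyDocFreq, numDocs, length);
        Console.WriteLine("A (10x acme, 1x company): {0:F3}", itemA);
        Console.WriteLine("B (1x acme, 10x company): {0:F3}", itemB);

        // Search for [ACME]: one occurrence in a short or a long title.
        double shortTitle = TermScore(1, acmeDocFreq, numDocs, 1);
        double longTitle = TermScore(1, acmeDocFreq, numDocs, 20);
        Console.WriteLine("Short title: {0:F3}", shortTitle);
        Console.WriteLine("Long title:  {0:F3}", longTitle);
    }
}
```

With these made-up statistics, item A scores higher than item B, and the short title scores higher than the long one, which matches the two examples given in the list above.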