Wednesday, June 1, 2016

Content Indexing vs Site Search

I've had this conversation so many times, I thought I'd capture it here once and for all.

There is a vast difference between content indexing and site search. The following discusses these differences. This is not exhaustive; there are finer nuances that I’ll skip over in order to keep focused on the key concepts.

Content Indexing

Content indexing is the act of storing selected fields of Sitecore content items into a separate index, so that content items can be retrieved rapidly by code. Examples of this are the search box Sitecore uses for item buckets, or a custom rendering that “facets” content e.g. outputs links to every item where “Georgia” is selected in a “Home state” field.

  • Indexes are created by copying raw item data into the index, typically when the item is saved or published.
  • Content indexing is a “data-oriented” operation e.g. a lookup in an index finds an match of content in a field. 
  • A content index has no concept of pages, and does not have any ability to rank on such things as link frequency. 
  • Content Indexing is absolutely required for Sitecore to function. 
  • Sitecore implements content indexing “out of the box”, using Lucene by default, with configurable support for Solr in scaled enterprise environments. 

Site Search

Site search is the act of indexing the content of entire viewable pages, so that whole pages can be found using “free text” search. An example of this is a site visitor entering a few words in a search box and getting back a page of ranked results, akin to a Google search.

  • Indexes are created by “crawling” the site e.g. code uses http requests to pull every page of the site, storing the content in its index, and examining the links on in the page to find more pages to crawl.
  • Site search is a “free text” operation, e.g. a lookup considers all of the visible content of a page.
  • A good site search tool ranks results based on things like semantics e.g. content in <h1> tags will rank higher than body text, or linking e.g. pages with more inbound links will rank higher.
  • A site search solution is only necessary if you want visitors to be able to “free text” search the site e.g. the site has a “search box”. 
  • Sitecore does not implement free-text page search “out of the box”.

Why the distinction is important

Any given page of a Sitecore site may have visible page content derived from many content items. Therefore, out-of-the-box content indexing is not an appropriate solution for site search.

Moreover, a good “free text” search experience requires that the results be well ranked. Consider when you do a Google search. Google isn’t simply returning a flat list of every page that contains your search terms, instead, it is using highly sophisticated ranking algorithms to present the results you are most likely to want first. If you’re familiar with SEO principles, you know that there are many factors that influence rank far beyond the simple content of the page.

Of course there is some overlap. A good site search tool can also include "hard data" in the form of metadata, so that search results can be "faceted". This allows the visitor to "filter" results based on date, geography, product line, or any other "field oriented" data that you include in the page metadata.

We've already deployed Solr. Why can't we use that for site search?

In theory, there is a way to leverage a Solr index to do free text search. This is not a simple matter of “configuration”, but rather, requires extensive coding. The general idea is you build a scheduled processor that programmatically loads every page of the site (via an http request) so it can get the entirety of the content on a given page. It puts that content into a “computed field” of a Solr index. Then, custom “search box” code can search that “computed field” for occurrences of that content. There are a drawbacks to this approach:

  • It is not implemented out of the box.
  • The ranking of search is either non-existent, or at least far short of the ranking quality of a true crawler.
[edited to correct my error about Coveo]

There are “off the shelf” tools that combine the concepts of content indexing and site search.

  • Coveo is an excellent commercial product that uses a proprietary indexing mechanism, with conventional "content indexing" and also crawling. It can index both entire pages and content items. It comes with value-added tools for rapid deployment of faceted search features, and also adds some ranking capabilities, including the ability to manually tweak search ranking. It comes in on-premises, cloud, and a hobbled “free” version. It is arguably the “least effort” solution to implement, since it is very "Sitecore aware" out of the box.
  • There are lots of free and commercial solutions. For example, Arke’s SDK includes a “computed search” module. uses configured field and template types to inject page content into a Solr index. 

There are other “off the shelf” solutions that provide excellent free text search experiences that do not rely on Solr. Most of these have evolved to cloud-hosted rather than on-premises solutions. Google site search and Amazon cloud search are leaders in this space, and Coveo had a cloud edition, but there are many services available. Using one of these services would still require coding, but it would be pure “integration” coding, not an attempt to build a full blown crawler.

In the absence of an “off the shelf” solution, you could build a home-grown Solr-based crawler. It’d require significant time and effort, only to yield a pretty poor user experience due to the lack of any sophisticated ranking.


No comments:

Post a Comment