Andy Uzick's Sitecore Blog: June 2016

Sunday, June 5, 2016

Zip code data in Sitecore

The source code and documentation for this solution are available on GitHub. Download

Good old postal Zip Codes. Not a very exciting subject, but it seems like every year or two, I run into a solution that requires access to a zip code database.

There are several commercial services that provide extensive zip code data, but there is at least one free database available (https://boutell.com/zipcodes/) that includes a decent amount of data, including coordinates, city and state, and time zone.

Latitude and longitude can be useful if for example you needed to put an "approximate" pin in a map.

We recently ran into a situation where we needed to know the visitor's time zone. Sitecore GeoIP data includes zip codes, but not time zones. All I need is to wire that up to a zip code database, and I'm all set. (And before you quibble over the accuracy, don't. I know it's not perfect; I think of this as a "good enough" solution).

So I set up a solution to provide Sitecore with an API for looking up zip codes. I started with a few goals:

I want to be able to use this in any Sitecore solution.,
I don't want to use SQL. It's always administrative and deployment hassle to use custom SQL tables.
I don't want to impose a schema on every application that uses this. Sure, that free zip database is fine for what I need now, but others may have more detailed data they'd like to use.
I want to use a swappable provider so other applications can change how the data is imported, where it is stored, and/or how it is queried.
For my default provider, I want it to "just work". I drop the file into a folder, and my Sitecore app has access to the data (well, I did end up also needing to add a mongo connection string to the connectionstrings.config file).

I decided to use MongoDB to store the data. Once I have a connection string, I can create collections and add data with any schema I want, without bugging the SQL admins. I also added to add a caching layer. I'm probably going to access this data from rules and such, and I want it to be zippy-quick.

The data flow looks like this:

The idea is, the operational data stored in MongoDB, and accessed through a caching layer at runtime. At application start (and via an administrative interface), the data file date is compared to the last time the data was imported, and if the file is newer, it is re-imported.

Installing the module

Install the update package in your Sitecore application. When Sitecore starts, it will populate a MongoDB database with data from a provided zip code document located in App_Data. This file is sourced from https://boutell.com/zipcodes/.

Whenever this file is updated, it will be reloaded the next time Sitecore starts.

Add a connection string to your ConnectionStrings.config file with the name you'd like to use for your Mongo database. For example

<add name="zipinfo" connectionString="mongodb://localhost:27017/zipinfo" />

If you want to use a separate mongo database for each Sitecore instance sharing a common Mongo server, change the connectionString e.g.
connectionString="mongodb://localhost:27017/myapp_zipinfo"

The update package will place a copy of the data file in your App_Data folder. You can relocate this to the Sitecore data folder if you desire.

Using the module

The module exposes a "manager" static class (ZipInfo.ZipInfoManager) with static methods like Get(int zipCode) to access the data. I won't get into all the methods here (see the documentation), but there are methods for both retrieve/update and cache management operations.

The update package will also install a utility at /sitecore/admin/zipinfo.aspx that'll allow you to query the database, manage the cache, and re-import the data.

The module exposes a provider class that can be swapped out with your own provider. If you have more detailed data in a a csv file, you can simply inherit from the default provider, create your own POCO class, and override the LoadRecord method that maps fields on the csv line to the POCO. If you need a different method for loading the data rather than reading it from a csv file, override the Load method. If you don;t want to use mongo, you can replace the entire provider by creating a class that implements IZipInfoProvider. More information about the provider is available in the docs.

The source code and documentation for this solution are available on GitHub. Download

Wednesday, June 1, 2016

Content Indexing vs Site Search

I've had this conversation so many times, I thought I'd capture it here once and for all.

There is a vast difference between content indexing and site search. The following discusses these differences. This is not exhaustive; there are finer nuances that I’ll skip over in order to keep focused on the key concepts.

Content Indexing

Content indexing is the act of storing selected fields of Sitecore content items into a separate index, so that content items can be retrieved rapidly by code. Examples of this are the search box Sitecore uses for item buckets, or a custom rendering that “facets” content e.g. outputs links to every item where “Georgia” is selected in a “Home state” field.

Indexes are created by copying raw item data into the index, typically when the item is saved or published.
Content indexing is a “data-oriented” operation e.g. a lookup in an index finds an match of content in a field.
A content index has no concept of pages, and does not have any ability to rank on such things as link frequency.
Content Indexing is absolutely required for Sitecore to function.
Sitecore implements content indexing “out of the box”, using Lucene by default, with configurable support for Solr in scaled enterprise environments.

Site Search

Site search is the act of indexing the content of entire viewable pages, so that whole pages can be found using “free text” search. An example of this is a site visitor entering a few words in a search box and getting back a page of ranked results, akin to a Google search.

Indexes are created by “crawling” the site e.g. code uses http requests to pull every page of the site, storing the content in its index, and examining the links on in the page to find more pages to crawl.
Site search is a “free text” operation, e.g. a lookup considers all of the visible content of a page.
A good site search tool ranks results based on things like semantics e.g. content in <h1> tags will rank higher than body text, or linking e.g. pages with more inbound links will rank higher.
A site search solution is only necessary if you want visitors to be able to “free text” search the site e.g. the site has a “search box”.
Sitecore does not implement free-text page search “out of the box”.

Why the distinction is important

Any given page of a Sitecore site may have visible page content derived from many content items. Therefore, out-of-the-box content indexing is not an appropriate solution for site search.

Moreover, a good “free text” search experience requires that the results be well ranked. Consider when you do a Google search. Google isn’t simply returning a flat list of every page that contains your search terms, instead, it is using highly sophisticated ranking algorithms to present the results you are most likely to want first. If you’re familiar with SEO principles, you know that there are many factors that influence rank far beyond the simple content of the page.

Of course there is some overlap. A good site search tool can also include "hard data" in the form of metadata, so that search results can be "faceted". This allows the visitor to "filter" results based on date, geography, product line, or any other "field oriented" data that you include in the page metadata.

We've already deployed Solr. Why can't we use that for site search?

In theory, there is a way to leverage a Solr index to do free text search. This is not a simple matter of “configuration”, but rather, requires extensive coding. The general idea is you build a scheduled processor that programmatically loads every page of the site (via an http request) so it can get the entirety of the content on a given page. It puts that content into a “computed field” of a Solr index. Then, custom “search box” code can search that “computed field” for occurrences of that content. There are a drawbacks to this approach:

It is not implemented out of the box.
The ranking of search is either non-existent, or at least far short of the ranking quality of a true crawler.

[edited to correct my error about Coveo]

There are “off the shelf” tools that combine the concepts of content indexing and site search.

Coveo is an excellent commercial product that uses a proprietary indexing mechanism, with conventional "content indexing" and also crawling. It can index both entire pages and content items. It comes with value-added tools for rapid deployment of faceted search features, and also adds some ranking capabilities, including the ability to manually tweak search ranking. It comes in on-premises, cloud, and a hobbled “free” version. It is arguably the “least effort” solution to implement, since it is very "Sitecore aware" out of the box.
There are lots of free and commercial solutions. For example, Arke’s SDK includes a “computed search” module. uses configured field and template types to inject page content into a Solr index.

There are other “off the shelf” solutions that provide excellent free text search experiences that do not rely on Solr. Most of these have evolved to cloud-hosted rather than on-premises solutions. Google site search and Amazon cloud search are leaders in this space, and Coveo had a cloud edition, but there are many services available. Using one of these services would still require coding, but it would be pure “integration” coding, not an attempt to build a full blown crawler.

In the absence of an “off the shelf” solution, you could build a home-grown Solr-based crawler. It’d require significant time and effort, only to yield a pretty poor user experience due to the lack of any sophisticated ranking.