Andy Uzick's Sitecore Blog: 2016

Sunday, June 5, 2016

Zip code data in Sitecore

The source code and documentation for this solution are available on GitHub. Download

Good old postal Zip Codes. Not a very exciting subject, but it seems like every year or two, I run into a solution that requires access to a zip code database.

There are several commercial services that provide extensive zip code data, but there is at least one free database available (https://boutell.com/zipcodes/) that includes a decent amount of data, including coordinates, city and state, and time zone.

Latitude and longitude can be useful if, for example, you need to put an "approximate" pin on a map.

We recently ran into a situation where we needed to know the visitor's time zone. Sitecore GeoIP data includes zip codes, but not time zones. All I need is to wire that up to a zip code database, and I'm all set. (And before you quibble over the accuracy, don't. I know it's not perfect; I think of this as a "good enough" solution).

So I set up a solution to provide Sitecore with an API for looking up zip codes. I started with a few goals:

I want to be able to use this in any Sitecore solution.
I don't want to use SQL. It's always an administrative and deployment hassle to use custom SQL tables.
I don't want to impose a schema on every application that uses this. Sure, that free zip database is fine for what I need now, but others may have more detailed data they'd like to use.
I want to use a swappable provider so other applications can change how the data is imported, where it is stored, and/or how it is queried.
For my default provider, I want it to "just work". I drop the file into a folder, and my Sitecore app has access to the data (well, I did end up also needing to add a Mongo connection string to the ConnectionStrings.config file).

I decided to use MongoDB to store the data. Once I have a connection string, I can create collections and add data with any schema I want, without bugging the SQL admins. I also decided to add a caching layer. I'm probably going to access this data from rules and such, and I want it to be zippy-quick.

The data flow looks like this:

The idea is that the operational data is stored in MongoDB and accessed through a caching layer at runtime. At application start (and on demand via an administrative interface), the data file's date is compared to the last time the data was imported, and if the file is newer, it is re-imported.
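The "import only when the file is newer" check and the caching layer can be sketched roughly as follows. This is a minimal Python illustration, not the module's actual .NET code: `ZipStore`, `import_if_newer`, and `CachedLookup` are hypothetical names, an in-memory dictionary stands in for the MongoDB collection, and the CSV is assumed to have a `zip` column.

```python
import csv
import os
from datetime import datetime, timezone


class ZipStore:
    """Hypothetical in-memory stand-in for the MongoDB collection."""

    def __init__(self):
        self.records = {}        # zip code (string) -> row dict
        self.last_import = None  # UTC time of the last import, or None


def needs_import(data_path, store):
    """True when the data file is newer than the last recorded import."""
    file_time = datetime.fromtimestamp(os.path.getmtime(data_path), tz=timezone.utc)
    return store.last_import is None or file_time > store.last_import


def import_if_newer(data_path, store):
    """Re-import the CSV only if it changed since the last import."""
    if not needs_import(data_path, store):
        return False
    with open(data_path, newline="") as handle:
        # Assumes a header row that includes a "zip" column.
        store.records = {row["zip"]: row for row in csv.DictReader(handle)}
    store.last_import = datetime.now(timezone.utc)
    return True


class CachedLookup:
    """Read-through cache: hit the store once per zip code, then serve from memory."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, zip_code):
        if zip_code not in self.cache:
            self.cache[zip_code] = self.store.records.get(zip_code)
        return self.cache[zip_code]

    def clear(self):
        """Call after a re-import so stale entries are dropped."""
        self.cache.clear()
```

Calling `import_if_newer` at startup and `CachedLookup.get` at runtime mirrors the flow described above; a real implementation would also persist the last import time alongside the data.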

Installing the module

Install the update package in your Sitecore application. When Sitecore starts, it will populate a MongoDB database with data from a provided zip code document located in App_Data. This file is sourced from https://boutell.com/zipcodes/.

Whenever this file is updated, it will be reloaded the next time Sitecore starts.

Add a connection string to your ConnectionStrings.config file with the name you'd like to use for your Mongo database. For example:

<add name="zipinfo" connectionString="mongodb://localhost:27017/zipinfo" />

If you want to use a separate Mongo database for each Sitecore instance sharing a common Mongo server, change the database name at the end of the connectionString, e.g.
connectionString="mongodb://localhost:27017/myapp_zipinfo"

The update package will place a copy of the data file in your App_Data folder. You can relocate this to the Sitecore data folder if you desire.

Using the module

The module exposes a "manager" static class (ZipInfo.ZipInfoManager) with static methods like Get(int zipCode) to access the data. I won't get into all the methods here (see the documentation), but there are methods for both retrieve/update and cache management operations.

The update package will also install a utility at /sitecore/admin/zipinfo.aspx that'll allow you to query the database, manage the cache, and re-import the data.

The module exposes a provider class that can be swapped out with your own provider. If you have more detailed data in a CSV file, you can simply inherit from the default provider, create your own POCO class, and override the LoadRecord method that maps fields on the CSV line to the POCO. If you need a different method for loading the data rather than reading it from a CSV file, override the Load method. If you don't want to use Mongo, you can replace the entire provider by creating a class that implements IZipInfoProvider. More information about the provider is available in the docs.
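To illustrate the shape of that extension point, here is a rough Python analogue of "inherit from the default provider and override the record mapping". The real module is .NET and its signatures may differ; every name below (`DefaultProvider`, `load_record`, `DetailedZipRecord`, the `county` column) is hypothetical and only mirrors the Load/LoadRecord idea described above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ZipRecord:
    """Basic record, loosely modeled on the free zip code file."""
    zip_code: str
    city: str
    state: str


class DefaultProvider:
    """Hypothetical stand-in for the default CSV provider."""

    def load(self, source):
        """Read every CSV row and map it to a record via load_record."""
        return [self.load_record(row) for row in csv.DictReader(source)]

    def load_record(self, row):
        """Map one CSV row (a dict keyed by column name) to a record."""
        return ZipRecord(row["zip"], row["city"], row["state"])


@dataclass
class DetailedZipRecord(ZipRecord):
    """Richer record for a more detailed (assumed) data file."""
    county: str = ""


class DetailedProvider(DefaultProvider):
    """Custom provider: reuses load, overrides only the row mapping."""

    def load_record(self, row):
        return DetailedZipRecord(row["zip"], row["city"], row["state"], row["county"])


if __name__ == "__main__":
    sample = io.StringIO("zip,city,state,county\n30301,Atlanta,GA,Fulton\n")
    print(DetailedProvider().load(sample))
```

Only the mapping step changes in the subclass; replacing `load` as well would correspond to pulling the data from somewhere other than a CSV file.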


Wednesday, June 1, 2016

Content Indexing vs Site Search

I've had this conversation so many times, I thought I'd capture it here once and for all.

There is a vast difference between content indexing and site search. The following discusses these differences. This is not exhaustive; there are finer nuances that I’ll skip over in order to keep focused on the key concepts.

Content Indexing

Content indexing is the act of storing selected fields of Sitecore content items into a separate index, so that content items can be retrieved rapidly by code. Examples of this are the search box Sitecore uses for item buckets, or a custom rendering that “facets” content e.g. outputs links to every item where “Georgia” is selected in a “Home state” field.

Indexes are created by copying raw item data into the index, typically when the item is saved or published.
Content indexing is a “data-oriented” operation e.g. a lookup in an index finds a match of content in a field.
A content index has no concept of pages, and does not have any ability to rank on such things as link frequency.
Content Indexing is absolutely required for Sitecore to function.
Sitecore implements content indexing “out of the box”, using Lucene by default, with configurable support for Solr in scaled enterprise environments.
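As a toy illustration of the "data-oriented" nature of a content index, the sketch below copies selected fields of items into a lookup structure and then finds items by exact field value. It is plain Python with made-up item data, not how Lucene or Solr work internally; `build_field_index` and `lookup` are hypothetical helpers.

```python
from collections import defaultdict


def build_field_index(items, fields):
    """Copy selected field values into an index keyed by (field, value)."""
    index = defaultdict(set)
    for item in items:
        for field in fields:
            if field in item:
                index[(field, item[field])].add(item["id"])
    return index


def lookup(index, field, value):
    """Return the ids of items whose field exactly matches the value."""
    return sorted(index.get((field, value), set()))


if __name__ == "__main__":
    items = [
        {"id": "item-1", "Home state": "Georgia"},
        {"id": "item-2", "Home state": "Ohio"},
        {"id": "item-3", "Home state": "Georgia"},
    ]
    index = build_field_index(items, ["Home state"])
    print(lookup(index, "Home state", "Georgia"))
```

Note that nothing here knows about pages, links, or ranking: the index only answers "which items have this value in this field".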

Site Search

Site search is the act of indexing the content of entire viewable pages, so that whole pages can be found using “free text” search. An example of this is a site visitor entering a few words in a search box and getting back a page of ranked results, akin to a Google search.

Indexes are created by “crawling” the site e.g. code uses HTTP requests to pull every page of the site, storing the content in its index, and examining the links on the page to find more pages to crawl.
Site search is a “free text” operation, e.g. a lookup considers all of the visible content of a page.
A good site search tool ranks results based on things like semantics e.g. content in <h1> tags will rank higher than body text, or linking e.g. pages with more inbound links will rank higher.
A site search solution is only necessary if you want visitors to be able to “free text” search the site e.g. the site has a “search box”.
Sitecore does not implement free-text page search “out of the box”.
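By contrast, a crawler works on rendered pages. The sketch below is a deliberately tiny Python illustration of the crawl, index, and rank loop described above. The `fetch` function is injected so the example can run against an in-memory "site" instead of real HTTP requests, and the scoring weights (extra credit for `<h1>` matches, one point per inbound link) are arbitrary assumptions, nowhere near what a real search product does.

```python
from collections import Counter, deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects page text, <h1> text, and link targets from one page."""

    def __init__(self):
        super().__init__()
        self.links, self.words, self.h1_words = [], [], []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        words = data.lower().split()
        self.words.extend(words)
        if self._in_h1:
            self.h1_words.extend(words)


def crawl(fetch, start):
    """Breadth-first crawl. fetch(url) returns HTML, or None if unavailable."""
    pages, inbound = {}, Counter()
    queue, seen = deque([start]), {start}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = parser
        for link in parser.links:
            inbound[link] += 1          # count inbound links for ranking
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, inbound


def search(pages, inbound, term):
    """Rank matching pages: term frequency, an <h1> bonus, and inbound links."""
    term = term.lower()
    scores = {}
    for url, page in pages.items():
        if term in page.words:
            scores[url] = (page.words.count(term)
                           + 5 * page.h1_words.count(term)
                           + inbound[url])
    return sorted(scores, key=lambda u: (-scores[u], u))


if __name__ == "__main__":
    site = {
        "/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
        "/a": '<h1>Peaches</h1><p>Georgia peaches</p><a href="/b">B</a>',
        "/b": '<p>We sell peaches</p><a href="/a">A</a>',
    }
    pages, inbound = crawl(site.get, "/")
    print(search(pages, inbound, "peaches"))
```

Even this toy version needs the whole rendered page and the link graph, neither of which a field-level content index has.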

Why the distinction is important

Any given page of a Sitecore site may have visible page content derived from many content items. Therefore, out-of-the-box content indexing is not an appropriate solution for site search.

Moreover, a good “free text” search experience requires that the results be well ranked. Consider what happens when you do a Google search. Google isn’t simply returning a flat list of every page that contains your search terms; instead, it uses highly sophisticated ranking algorithms to present the results you are most likely to want first. If you’re familiar with SEO principles, you know that many factors beyond the simple content of the page influence rank.

Of course there is some overlap. A good site search tool can also include "hard data" in the form of metadata, so that search results can be "faceted". This allows the visitor to "filter" results based on date, geography, product line, or any other "field oriented" data that you include in the page metadata.

We've already deployed Solr. Why can't we use that for site search?

In theory, there is a way to leverage a Solr index to do free text search. This is not a simple matter of “configuration”; rather, it requires extensive coding. The general idea is that you build a scheduled processor that programmatically loads every page of the site (via an HTTP request) so it can get the entirety of the content on a given page. It puts that content into a “computed field” of a Solr index. Then, custom “search box” code can search that “computed field” for occurrences of the search terms. There are drawbacks to this approach:

It is not implemented out of the box.
The ranking of search is either non-existent, or at least far short of the ranking quality of a true crawler.

[edited to correct my error about Coveo]

There are “off the shelf” tools that combine the concepts of content indexing and site search.

Coveo is an excellent commercial product that uses a proprietary indexing mechanism, with conventional "content indexing" and also crawling. It can index both entire pages and content items. It comes with value-added tools for rapid deployment of faceted search features, and also adds some ranking capabilities, including the ability to manually tweak search ranking. It comes in on-premises, cloud, and a hobbled “free” version. It is arguably the “least effort” solution to implement, since it is very "Sitecore aware" out of the box.
There are lots of free and commercial solutions. For example, Arke’s SDK includes a “computed search” module that uses configured field and template types to inject page content into a Solr index.

There are other “off the shelf” solutions that provide excellent free text search experiences that do not rely on Solr. Most of these have evolved to cloud-hosted rather than on-premises solutions. Google Site Search and Amazon CloudSearch are leaders in this space, and Coveo has a cloud edition, but there are many services available. Using one of these services would still require coding, but it would be pure “integration” coding, not an attempt to build a full-blown crawler.

In the absence of an “off the shelf” solution, you could build a home-grown Solr-based crawler. It’d require significant time and effort, only to yield a pretty poor user experience due to the lack of any sophisticated ranking.

Thursday, April 14, 2016

Using ARR to enable FXM

Ever used Sitecore's Federated Experience Manager (FXM)? Effectively, it lets you use Sitecore to content manage, track, personalize and test external sites which are not hosted in Sitecore.

The motivations I often hear for using FXM are...

We've bought a license and plan to migrate to Sitecore later, but we want to start personalizing and gathering analytics on our site now.
We're moving our main site to Sitecore, but we have some related sites we just don't have time and budget to move now.
We want to do a demo or POC using content from a non-Sitecore site, but don’t want to re-create the content in Sitecore.

With FXM you can do that. All you need is to place a small bit of script on the external sites. Sadly, that's often not possible. Sometimes the old site is literally on a server that nobody knows how to access. Sometimes you're just doing a POC and nobody wants to edit the old site for that.

IIS's Application Request Routing (ARR) to the rescue.

IIS has features called ARR and the URL Rewrite module that together amount to a reverse proxy: a “man in the middle” that can manipulate the HTML before it is returned to the browser.

We set up an IIS instance with ARR with a public-facing URL (in this example, “demo.mysite.com”), and configure ARR to do the following:

Take the path from the inbound request, and form a URL using the external site’s host name.
Fetch the HTML from the external site.
Inject the FXM beacon script into the HTML.
Change the URLs within the HTML for such things as images, scripts, CSS, iframes, etc., so that they will be requested from the ARR and not directly from the external site.
Strip out the “X-Frame-Options” header (if it exists), which can interfere with the FXM Experience Editor.

This results in a topology like this:

The URL structure in this example would give us a demo/POC website (“demo.mysite.com”) where we can show how a site can be tracked and manipulated with FXM. This could in theory be used for a live site by changing DNS to point www.mysite.com to the ARR, and change the hostname of the Sitecore server to something like Sitecore.mysite.com.

Setting up the Reverse Proxy

From an infrastructure perspective, setting up the proxy server is pretty simple. Install the ARR and URL Rewrite extensions, and create a new site in IIS. Set the binding up so it answers requests from the desired host (in the example above, “demo.mysite.com”). The site folder doesn’t need much; a default.htm page, and an empty web config.

The magic all happens in web.config. The URL Rewrite module is governed by rules. There are two sets: one just called “rules”, which are used to route requests to the external site, and another called “outbound rules”, which are used to manipulate the responses from that site before they are returned to the browser. Outbound rules also allow you to define “preconditions” that restrict when an outbound rule will apply.

The IIS management console provides an interface for building up all the XML in the config file for all of this. I find that when I’m working with it, I flip between IIS and Notepad++ until I get everything just right.

The referenced articles provide good guidance for how to use the URL Rewrite module and set up rules. This example web.config could be used to implement our example.

 <?xml version="1.0" encoding="utf-8"?>  
 <configuration>  
  <system.web>  
  </system.web>  
  <system.webServer>  
   <rewrite>  
    <rules>  
     <!--  
     This rule routes every request to the external site.  
     The use of "HTTP_ACCEPT_ENCODING" ensures that external servers   
     will send responses in the clear (not zipped or otherwise encoded)  
     -->  
     <rule name="Route to external site" stopProcessing="true">  
      <match url="(.*)" />  
      <action type="Rewrite" url="http://www.mysite.com/{R:1}" />  
      <serverVariables>  
       <set name="HTTP_ACCEPT_ENCODING" value="" />  
      </serverVariables>  
     </rule>  
    </rules>  
    <outboundRules>  
     <!--  
     This rule converts proxied pages' urls to relative urls (so they'll be requested through the ARR server and avoid cross-domain issues)  
     -->  
     <rule name="Rewrite External Absolute Paths" preCondition="Request is for html">  
      <match filterByTags="A, Area, Base, Form, Frame, Head, IFrame, Img, Input, Link, Script" pattern="^http(s)?://www.mysite.com/(.*)" />  
      <action type="Rewrite" value="/{R:2}" />  
     </rule>  
     <!--  
     This rule removes the X_Frame_Options header, which can prevent the Experience editor from working.  
     -->  
     <rule name="Strip x-frame-options" preCondition="Request is for html" patternSyntax="ECMAScript">  
      <match serverVariable="RESPONSE_X_Frame_Options" pattern="(.+)" />  
      <action type="Rewrite" value="" />  
     </rule>  
     <!--  
     This rule adds "(via proxy)" to the Server header, to aid troubleshooting.  
     -->  
     <rule name="Change Server Header">  
      <match serverVariable="RESPONSE_Server" pattern="(.+)" />  
      <action type="Rewrite" value="{R:0} (via proxy)" />  
     </rule>  
     <!--  
     This rule injects the FXM script into the HTML from the external site.  
     -->  
     <rule name="Add FXM script to tb" preCondition="Request is for html" patternSyntax="ExactMatch">  
      <match filterByTags="None" pattern="&lt;/head>" />  
      <action type="Rewrite" value="&lt;script src=&quot;//sitecore.mysite.com/bundle/beacon&quot;>&lt;/script>&lt;/head>" />  
     </rule>  
     <preConditions>  
      <!--  
      This precondition allows the outbound rules to only act on html responses.  
      -->  
      <preCondition name="Request is for html">  
       <add input="{RESPONSE_CONTENT_TYPE}" pattern="text/html" />  
      </preCondition>  
     </preConditions>  
    </outboundRules>  
   </rewrite>  
  </system.webServer>  
 </configuration>
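To make the effect of the outbound rules easier to reason about, here is a small Python sketch of the same transformations applied to a response. It is only an illustration of the intent (relative URLs, beacon injection, header changes), not a replacement for URL Rewrite. The host names come from the example above; the helper names and the simplified regular expression are assumptions, and the sketch drops the X-Frame-Options header outright.

```python
import re

EXTERNAL_HOST = "www.mysite.com"
BEACON_TAG = '<script src="//sitecore.mysite.com/bundle/beacon"></script>'

# Matches href/src/action attributes that point at the external host.
ABSOLUTE_URL = re.compile(
    r'((?:href|src|action)\s*=\s*["\'])https?://' + re.escape(EXTERNAL_HOST) + "/",
    re.IGNORECASE,
)


def rewrite_html(html):
    """Make external absolute URLs relative and inject the beacon before </head>."""
    html = ABSOLUTE_URL.sub(r"\1/", html)
    return html.replace("</head>", BEACON_TAG + "</head>", 1)


def rewrite_headers(headers):
    """Drop X-Frame-Options and tag the Server header for troubleshooting."""
    cleaned = {k: v for k, v in headers.items() if k.lower() != "x-frame-options"}
    if "Server" in cleaned:
        cleaned["Server"] += " (via proxy)"
    return cleaned


if __name__ == "__main__":
    page = ('<html><head><link href="http://www.mysite.com/site.css"></head>'
            '<body><img src="https://www.mysite.com/img/a.png"></body></html>')
    print(rewrite_html(page))
    print(rewrite_headers({"X-Frame-Options": "SAMEORIGIN", "Server": "IIS"}))
```

Running a sample page through functions like these is a quick way to sanity-check what the proxy should be returning before debugging the rules in IIS.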

Tuesday, February 2, 2016

Dear John...

No, I’m not writing to say I’ve left you for another CMS. But John West’s announcement today leaves me wanting to take a short detour down Memory Lane. If you came here looking for technical tidbits, I’ll be hangin’ a right back down Architecture Avenue shortly.

I had the great good fortune to work directly with John on my very first Sitecore project. It was one of the first projects to be done at scale in North America, and it was my first foray into a true enterprise-level .net CMS. When Lars Nielsen flew out to conduct our first training, John was there, both to learn and to advise. He remained tightly connected throughout the project, providing strategic advice and technical leadership (and answers to my incessant questions). John’s enthusiasm for Sitecore was infectious. His spirit of adventure set the tone for that project, and indeed for my entire Sitecore career.

John’s thought leadership has been at the bedrock of Sitecore’s growth. His quiet, unassuming tone underlies a deep passion for Sitecore. Owing to John’s example, today’s Sitecore ecosystem is infused with a sense of excitement, wonder, and a craving to learn, create and explore. His blog is a hallmark of his motivational style. John provides the signposts leading to the new and evolving capabilities of the product, while never asserting his knowledge is definitive, never assuming his observations are comprehensive, and never insisting his conclusions are absolute. Being the good teacher he is, he leaves application as an exercise for the student. And exercise we do! Many talented Sitecore professionals share valuable learnings from their Sitecore journeys. But those journeys began with John’s unspoken challenge to “Go West, young man!” (Yes, I went there.)

Over the years, as John has gone from teacher to mentor to friend, I’ve felt immense pride to be part of this dynamic community that John was so instrumental in creating. Though we have gone from speaking almost every day to interacting only sporadically, every time we see each other it seems we are picking up in mid-sentence. There has never been a “goodbye” with him, and there is not one now. Talk to you soon, friend!

(And goodbye forever, XSLT!)