Schema and guidelines for creating a staticSearch engine for your HTML5 site
Martin Holmes
Joey Takeda
2019-2022

This documentation provides instructions on how to use the Project Endings staticSearch Generator to provide a fully-functional search ‘engine’ to your website without any dependency on server-side code such as a database.

7 How do I use it?

First, you will have to make sure your site pages are correctly configured so that the Generator can parse them. Then, you will have to create a configuration file specifying what options you want to use. Then you run the generator, and the search functionality should be added to your site.

The Generator expects to parse well-formed XHTML5 web pages: that is, pages which are well-formed XML and use the XHTML namespace. If your site is just raggedy tag-soup, you cannot use this tool on it directly, but you can first tidy up your HTML using HTML Tidy.

7.1 Configuring your site: search filters

Next, you will need to decide whether you want search filters or not. If you want to allow your users to search (for example) only in poems, or only in articles, or only in blog posts, or any combination of these document types, you will need to add <meta> tags to the heads of your documents to specify what these filters are. staticSearch supports five filter types.

7.1.1 Description filters

The description (desc) filter is a word or phrase describing or associated with the document. Here is a simple example:
<meta name="Document type"
 class="staticSearch_desc" content="Poems"/>
This specifies that there is to be a descriptive search filter called ‘Document type’, and one of the types is ‘Poems’; the document containing this <meta> tag is one of the Poems. Another type might be:
<meta name="Document type"
 class="staticSearch_desc" content="Short stories"/>
If the Generator finds such meta tags when it is indexing, it will create a set of filter controls on the search page, enabling the user to constrain the search to a specific set of filter settings.

7.1.1.1 Sort order for description filters

Description filter labels may be plain text such as ‘Short stories’ or ‘Poems’, but they may also be more obscure labels relating to document categories in indexing systems or archival series identifiers. When the search page is generated, these labels are turned into a series of labelled checkboxes, sorted in alphabetical order. However, the strict alphabetical order of items may not be exactly what you want; you may want to sort ‘305 2’ before ‘305 10’ for example. To deal with cases like this, in addition to the content attribute, you can also supply a custom data-ssfiltersortkey attribute, providing a sort key for each label. Here are a couple of examples:
<meta name="Archival series"
 class="staticSearch_desc" data-ssfiltersortkey="305_02"
 content="305 2"/>

<meta name="Archival series"
 class="staticSearch_desc" data-ssfiltersortkey="305_10"
 content="305 10"/>
In this case, the first item will sort in the filter list before the second item based on the sort key; without it, they would sort in reverse order based on the content attribute. Note that the data-ssfiltersortkey attribute name is all-lower-case, to comply with the XHTML5 schema.

7.1.2 Date filters

Another slightly different kind of search control is a document date. If your collection of documents has items from different dates, you can add a <meta> tag like this:
<meta name="Date of publication"
 class="staticSearch_date" content="1895-01-05"/>
The date may take any of the following forms:
  • 1895 (year only)
  • 1895-01 (year and month)
  • 1895-01-05 (year, month and day)
For some documents, it may not be possible to specify a single date in this form, so you can specify a range instead, using a slash to separate the start and end dates of the range (following ISO 8601):
  • 1895/1897
  • 1903-01-02/1905-05-31
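
A range is supplied in the content attribute in the same way as a single date. As a minimal sketch, reusing the ‘Date of publication’ filter name from the example above (the range value itself is only illustrative):

```xml
<!-- A document whose publication date is only known to fall between 1895 and 1897 -->
<meta name="Date of publication"
 class="staticSearch_date" content="1895/1897"/>
```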

7.1.3 Number filters

You can also configure a range filter based on a numeric value (integer or decimal). For example, you might want to allow people to filter documents in the search results based on their word-count:
<meta name="Word-count"
 class="staticSearch_num" content="2193"/>
Users would then be able to set a minimum and/or maximum word-count when searching for documents.

7.1.4 Boolean filters

A fourth filter type is the boolean (true/false) filter. To use boolean filters, add meta tags like this to your documents:
<meta name="Peer-reviewed"
 class="staticSearch_bool" content="true"/>

7.1.5 Feature filters

The feature (feat) filter, just like a description filter, is a word or phrase describing or associated with the document. Here is a simple example:
<meta name="People mentioned"
 class="staticSearch_feat" content="Nelson Mandela"/>
This specifies that there is to be a feature search filter called ‘People mentioned’, and one of the possible people is ‘Nelson Mandela’. This differs from a description filter in that the number of possible people (or other feature) is expected to be large. If there are (for example) three hundred people mentioned in the document collection, then using a description filter would result in three hundred checkboxes in the search page; this is clearly impractical. Instead, for large feature sets like this, the feature filter provides a text box into which the user can type some part of the feature they're searching for, and get a drop-down list of matches. Selecting a match creates a checkbox in the search page, which then functions just like a description filter. Here's another example:
<meta name="Postcodes mentioned"
 class="staticSearch_feat" content="V8W 1P4"/>
Obviously, a feature filter must be based on data that a searcher will be able to predict. Searchers can also search for names or postcodes using the text search facility, of course, but names might appear in different forms in different texts, so a feature filter allows the user to find all instances of a particular person using the canonical form of their name.

Note that feature filters have one drawback: they slow down the initialization of the search page noticeably. In order to provide the typeahead/drop-down functionality, the JSON for each feature filter will need to be retrieved as the page loads, so if you have many feature filters, or particularly large ones, the user may have to wait a few seconds before the page becomes fully functional.

7.2 Configuring your site: document titles

When the indexing process runs over your document collection, by default it will use the document title that it finds in the <title> element in the document header; that title will then be shown as a link to the document when it comes up in search results. However, that may not be the ideal title for this purpose; for example, all of your documents may include the site title as the first part of their document title, but it would be pointless to include this in the search result links. Therefore you can override the document title value by providing another meta tag, like this:
<meta name="docTitle"
 class="staticSearch_docTitle" content="What I did in my holidays"/>

7.3 Configuring your site: document sort keys

When a user searches for text on your site, the documents retrieved will be presented in a sequence based on the ‘hit score’ or ‘relevance score’; documents with the highest scores will be presented first, and the list will be in descending order of relevance. However, if you have search filters on your site, it is possible that users will not enter any search text at all; they may simply select some filters and get a list of matching documents. In this case, there will be no relevance scores, so the documents will be presented in a random order. However, you may wish to control the order in which documents without hit scores, or sequences of documents with the same hit score, are presented. You can do this by adding a single meta tag to the document providing a ‘sort key’, which can be used to sort the list of hits. This is an example:
<meta name="docSortKey"
 class="staticSearch_docSortKey" content="d_1854-01-02"/>

7.4 Configuring your site: document thumbnails

When a document is returned as a result of a search hit, you may want to include with it a thumbnail image. This may be for aesthetic reasons, or because the focus of the document itself is actually an image (perhaps your site is a set of pages dealing with works of art, for instance). Whatever the reason, you can supply a link to a thumbnail image like this:
<meta name="docImage"
 class="staticSearch_docImage" content="images/thisPage.jpg"/>
The content attribute value should either be the path to an image relative to the document itself or the URL to an external image; so in the example above, there would be a folder called images which is a sibling of the HTML file containing the tag, and that folder would contain a file called thisPage.jpg.

7.5 Creating a configuration file

The configuration file is an XML document which tells the Generator where to find your site, and what search features you would like to include. The configuration file conforms to a schema which is documented here.

There are three main sections of the configuration file:

  • <params>
  • <rules>
  • <contexts>

Only the <params> element is necessary, but, as we discuss shortly, we strongly recommend taking advantage of <rules> (see 7.5.3 Specifying rules (optional)) and <contexts> (see 7.5.4 Specifying contexts (optional)) for the best results.

For examples of full configuration files, see the staticSearch GitHub repository as well as the list of projects in 11 Projects using staticSearch, which provides a link to each site’s configuration file.

7.5.1 The <config> element

The configuration file has a root <config> element in the staticSearch namespace (http://hcmc.uvic.ca/ns/staticSearch):
<config xmlns="http://hcmc.uvic.ca/ns/staticSearch">
 <params>
<!--Parameters-->
 </params>
 <rules>
<!--Rules-->
 </rules>
 <contexts>
<!--Contexts-->
 </contexts>
</config>
The <config> element may bear the optional version attribute, which provides the major version number (i.e. a single integer value) of staticSearch that corresponds with the configuration file.
  • config (The root element for the Search Generator configuration file.)
    version (specifies the major version of staticSearch to which this configuration file corresponds. If this attribute is not used, the configuration file is assumed to have a version value of 1.)
Though optional, the version attribute is recommended for all configuration files; future versions of staticSearch may introduce breaking changes to the structure and syntax of the configuration file, which can be more easily detected by staticSearch when a version is specified.
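
As a minimal sketch, the attribute sits on the root element alongside the namespace declaration. The value 1 is only an illustration; use the major version of the staticSearch release you actually build with.

```xml
<!-- version="1" is illustrative: a single integer giving the major version of staticSearch -->
<config xmlns="http://hcmc.uvic.ca/ns/staticSearch" version="1">
 <params>
<!--Parameters-->
 </params>
</config>
```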

7.5.2 Specifying parameters

7.5.2.1 Required parameters

The <params> element has two required elements for determining the resource collection that you wish to index:

  • searchFile (The search file (aka page) that will be the primary access point for the staticSearch. Note that this page must be at the root of the collection directory.)
  • recurse (Whether to recurse into subdirectories of the collection directory or not.)

The search page is a regular HTML page which forms part of your site. The only important characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 7.6 Creating a search page.

The <searchFile> element is a relative URI (resolved, like all URIs specified in the config file, against the configuration file location) that points directly to the search page that will be the primary access point for the search. Since the search file must be at the root of the directory that you wish to index (i.e. the directory that contains all of the XHTML you want the search to index), the searchFile parameter provides the necessary information for knowing what document collection to index and where to put the output JSON. In other words, in specifying the location of your search page, you are also specifying the location of your document collection. See Creating a search page for more information on how to configure this file.

Note that all output files will be in a directory that is a sibling to the search page. For instance, in a document collection that looks something like:

  • myProject
    • novel.html
    • poem.html
    • shortstory.html
    • search.html

The collection of JavaScript and JSON files will be in a directory like so:

  • myProject
    • novel.html
    • poem.html
    • shortstory.html
    • search.html
    • staticSearch

We also require the <recurse> element in the case where the document collection may be nested (as is common with static sites generated from Jekyll or WordPress). The <recurse> element is a boolean (true or false) that determines whether or not to recurse into the subdirectories of the collection and index those files.
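
Putting the two required parameters together, a minimal <params> block might look like the sketch below. It assumes that each parameter's value is given as the text content of its element and that the configuration file sits in the folder that contains myProject; the path is hypothetical.

```xml
<params>
 <!-- Relative URI of the search page, resolved against the config file location (hypothetical path) -->
 <searchFile>myProject/search.html</searchFile>
 <!-- true: also index XHTML files found in subdirectories of myProject -->
 <recurse>true</recurse>
</params>
```

With these settings, the document collection is taken to be the myProject folder, and the output folder is created beside search.html.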

7.5.2.2 Optional parameters

The following parameters are optional, but most projects will want to specify some of them:

  • versionFile (The relative path to a text file containing a single version identifier (such as 1.5, 123456, or 06ad419). This will be used to create unique filenames for JSON resources, so that the browser will not use cached versions of older index files.)

<versionFile> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string will be used as part of the filenames for all the JSON resources created for the search. This is useful because it allows the browser to cache such resources when users repeatedly visit the search page, but if the project is rebuilt with a new version, those cached files will not be used because the new version will have different filenames. The path specified is relative to the location of the configuration file (or absolute, if you wish).
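
A sketch of how this might be specified inside <params>, assuming the value is given as element content. The file name VERSION is hypothetical; the file itself would contain nothing but a version string such as 06ad419.

```xml
<!-- Path is relative to the configuration file; VERSION is a hypothetical file name -->
<versionFile>myProject/VERSION</versionFile>
```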

  • phrasalSearch (Whether or not to support phrasal searches. If this is true, then the maxKwicsToHarvest setting will be ignored, because all contexts are required to properly support phrasal search.)

Phrasal search functionality enables your users to search for specific phrases by surrounding them with quotation marks ("), as in many search engines. In order to support this kind of search, <createContexts> must also be set to true as we store contexts for all hits for each token in each document. Turning this on will make the index larger, because all contexts must be stored, but once the index is built, it has very little impact on the speed of searches, so we recommend turning this on. The default value is true. However, if your site is very large and your user base is unlikely to use phrasal searching, it may not be worth the additional build time and increased index size.

  • stemmerFolder (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript and XSLT implementations of stemmers can be found. If left blank, then the staticSearch default English stemmer (en) will be used.)

The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and an implementation of the French Snowball stemmer. These appear in the /stemmers/en/ and /stemmers/fr/ folders. The default value for this parameter is en; to use the French stemmer, use fr. We will be adding more stemmers as the project develops. However, if your document collection is not English or French, you have a couple of options, one hard and one easy.

  • Hard option: implement your own stemmers. You will need to write two implementations of the stemmer algorithm, one in XSLT (which must be named ssStemmer.xsl) and one in JavaScript (ssStemmer.js), and confirm that they both generate the same results. The XSLT stemmer is used in the generation of the index files at build time, and the JavaScript version is used to stem the user's input in the search page. You can look at the existing implementations in the /stemmers/en/ folder to see how the stemmers need to be constructed. Place your stemmers in a folder called /stemmers/[yourlang]/, and specify yourlang in the configuration file.
  • Easy option: Use the identity stemmer (which is equivalent to turning off stemming completely), and make sure wildcard searching is turned on. Then your users can search using wildcards instead of having their search terms automatically stemmed. To do this, specify the value identity in your configuration file.

Another alternative is the stripDiacritics stemmer. Like the identity stemmer, this is not really a stemmer at all; what it does is to strip out all combining diacritics from tokens. This is a useful approach if your document collection contains texts with accents and diacritics, but your users may be unfamiliar with the use of diacritics and will want to search just with plain unaccented characters. For example, if a text contains the word élève, but you would like searchers to be able to find the word simply by typing the ASCII string eleve, then this is a good option. Combined with wildcards, it can provide a very flexible and user-friendly search engine in the absence of a sophisticated stemmer, or for cases where there are mixed languages so a single stemmer will not do. To use this option, specify the value stripDiacritics in your configuration file.
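
Whichever option you choose, the setting is a single element inside <params>. The sketch below assumes the folder name is given as element content; the four values mentioned above are en, fr, identity and stripDiacritics.

```xml
<!-- Use the diacritic-stripping pseudo-stemmer instead of the default English stemmer (en) -->
<stemmerFolder>stripDiacritics</stemmerFolder>
```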

  • scoringAlgorithm (The scoring algorithm to use for ranking keyword results. Default is "raw" (i.e. weighted counts))

<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating the score of a term and thus the order in which the results from a search are sorted. There are currently two options:

  • raw: This is the default option (and so does not need to be set explicitly). The raw score is simply the sum of all instances of a term (optionally multiplied by a configured weight via the <rule>/weight configuration) in a document. This will usually provide good results for most document collections.
  • tf-idf: The tf-idf algorithm (term frequency-inverse document frequency) computes the mathematical relevance of a term within a document relative to the rest of the document collection. The staticSearch implementation of tf-idf basically follows the textbook definition of tf-idf:
    tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log($allDocumentsCount / $docsWithThisTermCount)
    This is fairly crude compared to other search engines, like Lucene, but it may provide useful results for document collections of varying lengths or in instances where the raw score may be insufficient or misleading. There are a number of resources on tf-idf scoring, including: Wikipedia and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
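
To switch from the default, the algorithm name would be supplied inside <params>. This is a sketch which assumes the value is given as element content:

```xml
<!-- Omit this element entirely to keep the default "raw" scoring -->
<scoringAlgorithm>tf-idf</scoringAlgorithm>
```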

  • createContexts (Whether to include keyword-in-context extracts in the index.)

<createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context extracts for each of the hits in a document. This increases the size of the index, but of course it makes for much more user-friendly search results; instead of seeing just a score for each document found, the user will see a series of short text strings with the search keyword(s) highlighted. Note that contexts are necessary for phrasal searching or wildcard searching.

  • minWordLength (The minimum length of a term to be indexed. Default is 3 characters.)

<minWordLength> specifies the minimum length in characters of a sequence of text that will be considered to be a word worth indexing. The default is 3, on the basis that in most European languages, words of one or two letters are typically not worth indexing, being articles, prepositions and so on. If you set this to a lower limit for reasons specific to your project, you should ensure that your stopword list excludes any very common words that would otherwise make the indexing process lengthy and increase the index size.

  • maxKwicsToHarvest (This controls the maximum number of keyword-in-context extracts that will be stored for each term in a document.)

<maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the data for each term in a document. For example, if a user searches for the word ‘elephant’, and it occurs 27 times in a document, but the <maxKwicsToHarvest> value is set to 5, then only the first five (sorted in document order) of these keyword-in-context strings will be stored in the index. (This does not affect the score of the document in the search results, of course.) If you set this to a low number, the size of the JSON files will be constrained, but of course the user will only be able to see the KWICs that have been harvested in their search results. If <phrasalSearch> is set to true, the <maxKwicsToHarvest> setting is ignored, because phrasal searches will only work properly if all contexts are stored.

  • maxKwicsToShow (This controls the maximum number of keyword-in-context extracts that will be shown in the search page for each hit document returned.)

A user may search for multiple common words, so hundreds of hits could be found in a single document. If the keyword-in-context strings for all these hits are shown on the results page, it would be too long and too difficult to navigate. This setting controls how many of those hits you want to show for each document in the result set.

  • totalKwicLength (If createContexts is set to true, then this parameter controls the length (in words) of the harvested keyword-in-context string.)

Obviously, the longer the keyword-in-context strings are, the larger the individual index files will be, but the more useful the KWICs will be for users looking at the search results. Note that phrasal searching relies on the KWICs and thus longer KWICs allow for longer phrasal searches.
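
The keyword-in-context settings interact, so it can help to see them side by side. The following sketch assumes values are given as element content; the numbers are purely illustrative, not recommendations, and the schema may constrain the order of elements. Note that <phrasalSearch> is set to false here only so that <maxKwicsToHarvest> is actually honoured.

```xml
<params>
 <!-- searchFile and recurse omitted for brevity -->
 <!-- Phrasal search off, so that the harvest limit below is not ignored -->
 <phrasalSearch>false</phrasalSearch>
 <!-- Store keyword-in-context extracts in the index -->
 <createContexts>true</createContexts>
 <!-- Store at most 5 extracts per term per document (illustrative value) -->
 <maxKwicsToHarvest>5</maxKwicsToHarvest>
 <!-- Show at most 3 extracts per hit document on the results page (illustrative value) -->
 <maxKwicsToShow>3</maxKwicsToShow>
 <!-- Each stored extract is 15 words long (illustrative value) -->
 <totalKwicLength>15</totalKwicLength>
</params>
```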

  • kwicTruncateString (The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context extract. Conventionally three periods, or an ellipsis character (which is the default value).)

The only reason you might need to specify a value for this parameter is if the language of your search page conventionally uses a different ellipsis character. Japanese, for example, uses the 3-dot-leader character.

  • linkToFragmentId (Whether to link keyword-in-context extracts to the nearest id in the document. Default is true.)

<linkToFragmentId> is a boolean parameter that specifies whether you want the search engine to link each keyword-in-context extract with the closest element that has an id. If the element has an ancestor with an id, then the indexer will associate that keyword-in-context extract with that id; if there are no suitable ancestor elements that have an id, then the extract is associated with the first preceding element with an id. We strongly recommend that you ensure your target documents have id attributes for any significant divisions so that this parameter can be used effectively. With lots of ids throughout your documents, and this parameter turned on, each keyword-in-context in the results page will be linked directly to the section of the document in which the hit appears, making the search results much more useful.

  • scrollToTextFragment (WARNING: Experimental technology. This turns on a feature currently only supported by a subset of browsers, enabling links from keyword-in-context results directly to the specific text string in the target document.)

Google has proposed a browser feature called Text Fragments, which would support a special kind of link that targets a specific string of text inside a page. When clicking on such a link, the browser would scroll to, and then highlight, the target text. This has been implemented in Chrome-based browsers (Chrome, Chromium and Edge) at the time of writing, but other browser producers are sceptical with regard to the specification and worried about possible security implications. The specification is subject to radical change. <scrollToTextFragment> is a boolean parameter that specifies whether you want to turn on this feature for browsers that support it. It depends on the availability of keyword-in-context strings, so <createContexts> must also be turned on to make it work. The feature is automatically suppressed for browsers which do not support it. We recommend only using this feature on sites which are in steady development, so that if necessary it can be turned off, or the staticSearch implementation can be updated to take account of changes. For sites intended to remain unchanged or archived for any length of time, this feature should be left turned off. It is off by default. A more reliable alternative is probably to use JavaScript to highlight hits on the target page.

  • resultsPerPage (The maximum number of document results to be displayed per page. All results are displayed by default; setting resultsPerPage to a positive integer creates a Show More/Show All widget at the bottom of the batch of results.)

For most sites, where the number of results is likely to be in the low thousands, it's perfectly practical to show all the results at once, because the staticSearch processor is so fast. However, if you have tens of thousands of documents, and it's possible that users will do (for example) filter-only searches that retrieve a large proportion of them, you can constrain the number of results which are shown initially using this setting. All the results are still generated and output to the page, but since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, the browser will render them much more quickly.

  • verbose (Turns on more detailed reporting during the indexing process.)

The staticSearch build process can be lengthy, and if something goes wrong it can be difficult to figure out where the problem was. If you're having build problems, turn on verbose messages in the configuration file to help with debugging, and you'll see a lot more output at the command line.

  • stopwordsFile (The relative path (from the config file) to a text file containing a list of stopwords (words to be ignored when indexing).)

A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the Web, but it is probably a good idea to take one of these and customize it for your project, since there will be words in the website which are so common that it makes no sense to index them, but they are not ordinary stopwords. For example, in a Website dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include it, and searching for it will be pointless. The project has a built-in set of common stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then search for the largest JSON index files that are generated, to see if they might be too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 7.9 Generated report).

  • dictionaryFile (The relative path (from the config file) to a dictionary file (one word per line) which will be used to check tokens when indexing.)

The indexing process checks each word as it builds the index, and keeps a record of all words which are not found in the configured dictionary. Though this does not have any direct effect in the indexing process, all words not found in the dictionary are listed in the staticSearch report (see 7.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language of the dictionary) or perhaps misspelled (in which case they may not be correctly stemmed and indexed, and should be corrected). There is a default dictionary in xsl/english_words.txt which you might copy and adapt if you're working in English; lots of dictionaries for other languages are available on the Web.
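
A project that maintains its own word lists might point to them as in the sketch below. Both file names are hypothetical, the paths are relative to the configuration file, and the values are assumed to be given as element content.

```xml
<params>
 <!-- searchFile and recurse omitted for brevity -->
 <!-- Hypothetical project-specific list of words to skip when indexing -->
 <stopwordsFile>myProject_stopwords.txt</stopwordsFile>
 <!-- Hypothetical one-word-per-line dictionary used to flag unknown tokens in the report -->
 <dictionaryFile>myProject_words.txt</dictionaryFile>
</params>
```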

  • indentJSON (Whether or not to indent code in the JSON index files. Indenting increases the file size, but it can be useful if you need to read the files for debugging purposes.)

Normally, the JSON index files will only be processed by JavaScript, so there is no need to indent them, but if you are debugging or investigating how the code works, you may find it useful to generate them in a more human-readable layout.

  • outputFolder (The name of the output folder into which the index data and JavaScript will be placed in the site search. This should conform with the XML Name specification.)

When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that their files are kept in different locations.
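
For the two-searches case, each search would have its own configuration file naming a different folder. A sketch for one of them, with a hypothetical folder name and the value assumed to be element content:

```xml
<!-- Hypothetical folder name; it should be a valid XML Name (no spaces) -->
<outputFolder>poemSearch</outputFolder>
```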

7.5.3 Specifying rules (optional)

  • rules (The set of rules that control weighting of search terms found in specific contexts.)
  • rule (A rule that specifies a document path as XPath in the match attribute, and provides weighting for search terms found in that context.)
    match [att.match] (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.)
    weight (The weighting to give to a search token found in the context specified by the match attribute. Set to 0 to completely suppress indexing for a specific context, or greater than 1 to give stronger weighting.)

The <rule> element is used to identify nodes in the XHTML document collection which should be treated in a special manner when indexed; either they might be ignored (if weight=0), or any words found in them might be given greater weight than words in normal contexts (if weight>1). Words appearing in headings or titles, for example, might be weighted more heavily, while navigation menus, banners, or footers might be ignored completely.

The <rules> element specifies a list of conditions (using the <rule> element) that tell the parser, using XPath statements in the match attribute, specific weights to assign to particular parts of each document. For instance, if you wanted all heading elements (<h1>, <h2>, etc) in documents to be given a greater weight and thus receive a higher score in the results, you can do so using a rule like so:
<rules>
 <rule weight="2"
  match="h1 | h2 | h3 | h4 | h5 | h6"/>

</rules>
Since we're using XPath 3.0 and XSLT 3.0, this can also be simplified to:
<rules>
 <rule weight="2"
  match="*[matches(local-name(),'^h\d+$')]"/>

</rules>
(It is worth noting, however, the above example is unnecessary: all heading elements are given a weight of 2 by default, which is the only preconfigured weight in staticSearch.)

The value of the match attribute is transformed into an XSLT template match attribute, and thus must follow the same rules (i.e. no complex rules like p/ancestor::div). See the W3C XSLT Specification for further details on allowable pattern rules.

Often, there will be elements that you want the tokenizer to ignore completely; for instance, if you have the same header in every document, then there's no reason to index its contents on every page. These elements can be ignored simply by using a <rule> and setting its weight to 0. For instance, if you want to remove the header and the footer from the search indexing process, you could write something like:
<rule weight="0match="footer | header"/>
Or if you want to remove XHTML anchor tags (<a>) whose text is identical to the URL specified in its href, you could do something like:
<rule weight="0" match="a[@href=./text()]"/>

Note that the indexer does not tokenize any content in the <head> of the document (but as noted in 7.1 Configuring your site: search filters, metadata can be configured into filters) and that all elements in the <body> of a document are considered tokenizable. However, common elements that you might want to exclude include:

  • <script>
  • <style>
  • <code>
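As a sketch only, the elements listed above could be excluded with a single rule that follows the same pattern as the header and footer example. Whether <code> belongs in the list is an assumption that depends on your site; leave it out if your users are likely to search for code samples.

```xml
<rules>
 <!-- Assumption: none of these elements contains text that users should be able to find. -->
 <rule weight="0"
  match="script | style | code"/>
</rules>
```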

7.5.4 Specifying contexts (optional)

  • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
  • context (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.)
    match [att.match] (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.)

When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense approach based on common element definitions, so that for example when it reaches the end of a paragraph, it will not continue into the next paragraph to get more context words. You may have special runs of text in your document collection which do not appear to be bounding contexts, but actually are; for example, you may have span elements with class=note that appear in the middle of sentences but are not actually part of them. Use <context> elements to identify these special contexts so that the indexer knows the right boundaries from which to retrieve its keyword-in-context strings.

When staticSearch creates the keyword-in-context strings (the "kwic" or "snippets") for each token, it does so by looking for the nearest block-level element that it can use as its context. Take, for instance, this unordered list:
<ul>
 <li>Keyword-in-context search results. This is also configurable, since including contexts
   increases the size of the index.</li>
 <li>Search filtering using any metadata you like, allowing users to limit their search to specific
   document types.</li>
</ul>
Each <li> element is, by default, a context element, meaning that the snippet generated for each token will not extend beyond the <li> element boundaries; in this case, if the <li> were not a context element, the term ‘search’ would produce a context that looks something like:
"...the size of the index.Search filtering using any metadata you like,..."
Using the <contexts> element, you can control what elements operate as contexts. For instance, say a page contained a marginal note, encoded as a <span> in your document beside its point of attachment:1
<p>About that program I shall have nothing to say here,<span class="sidenote">Some information on this subject can be found in "Second Thoughts"</span> [...]
</p>
Using CSS, the note might be displayed alongside the text of the document in the margin, or made into a clickable object using JavaScript. However, since the tokenizer is unaware of any such client-side processing, it understands the <span> as an inline element and assumes the <p> constitutes the context of the element. A search for ‘information’ might then return:
"...nothing to say here,Some information on this subject can be found..."
To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with a match pattern:
<contexts>
 <context match="span[contains-token(@class,'sidenote')]"/>
</contexts>
You can also configure it the other way: if a <div>, which is by default a context block, should not be understood as a context block, then you can tell the parser to not consider it as such using context set to false:
<contexts>
 <context match="div" context="false"/>
</contexts>

The default context elements are:

  • <body>
  • <div>
  • <blockquote>
  • <p>
  • <li>
  • <section>
  • <article>
  • <nav>
  • <h1>
  • <h2>
  • <h3>
  • <h4>
  • <h5>
  • <h6>
  • <td>

7.5.5 Specifying searchable contexts (‘search only in’)

Pages may contain different kinds of blocks, or ‘contexts’, that need to be differentiated. For example, consider a page for an online journal article, which includes the article’s title, an abstract, the body of the article, and footnotes. Users may want to search for terms only within abstracts or they may want to search only within the body of the article, ignoring editorial and paratextual material.

The <context> mechanism provides a way to specify particular components of a page that can be searched within using the label attribute.
  • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
  • context (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.)
    label [att.labelled] (A string identifier specifying the name for a given context.)

The text in the label serves as the text for a filter option on the generated search page, which allows users to perform a search within only a particular component of the page. For instance, for a page structured like the journal article mentioned above, we could specify the abstract, the notes, and the document’s body like so:
<contexts>
 <context match="article[@id='article_content']"
  label="Article text only"/>

 <context match="div[contains-token(@class,'footnote')]"
  label="Notes only"/>

 <context match="section[@id='abstract']"
  label="Abstracts only"/>

 <context match="span[contains-token(@class,'inline-note')]"
  label="Notes only"/>

</contexts>
The generated search page will then contain a set of checkboxes derived from the distinct label values. There is no requirement for the label values to be distinct, but any identical labels will be treated as identical contexts (i.e. in the example above, searching for a string within "Notes only" will return all results found within both the div elements with class="footnote" and the span elements with class="inline-note").

7.5.6 Specifying exclusions (optional)

  • excludes (The set of exclusions, expressed as exclude elements, that control the subset of documents or filters used for a particular search.)
  • exclude (An exclusion definition, which excludes either documents or filters as defined by an XPath in the match attribute.)
    type
    match [att.match] (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.)

<exclude> can be used to identify documents or parts of documents that are to be omitted from indexing, but, unlike content matched by a <rule> whose weight is set to zero, should be retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for multiple search engines/pages that each have their own special features; in this case, you may want one specific search index/page to ignore filter controls (HTML <meta> elements, as described in 4 Search facet features) which are provided to support other search pages.

A complex site may have two or more search pages targeting specific types of document or content, each of which may need its own particular search controls and indexes. This can easily be achieved by specifying a different <searchFile> and <outputFolder> in the configuration file for each search.
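To make this concrete, here is a rough sketch of the two settings as they might differ between two configuration files. Only the <searchFile> and <outputFolder> element names come from this documentation; the file names, paths and folder names are hypothetical, and every other setting a real configuration file needs is omitted.

```xml
<!-- Fragment of a hypothetical config_main.xml: the site-wide search. -->
<searchFile>mysite/search.html</searchFile>
<outputFolder>staticSearch</outputFolder>

<!-- Fragment of a hypothetical config_figures.xml: a search limited to figures. -->
<searchFile>mysite/search_figures.html</searchFile>
<outputFolder>staticSearchFigures</outputFolder>
```

Keeping the output folders distinct means that one build does not overwrite the index files produced by the other.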

For these searches to be different from each other, they will also probably have different contexts and rules. For example, imagine that you are creating a special search page that focuses only on the text describing images or figures in your documents. You might do it like this:
<rules>
 <rule match="text()[not(ancestor::div[@class='figure'] or ancestor::title)]"
  weight="0"/>

</rules>
This specifies that all text nodes which are not part of the document title or descendants of <div class="figure"> should be ignored (weight=0), so only your target nodes will be indexed.

However, it's also likely that you will want to exclude certain features or documents from a specialized search page, and this is done using the <excludes> section and its child <exclude> elements.

Here is an example:
<excludes>
<!-- We only index files which have illustrations in them. -->
 <exclude type="index"
  match="html[not(descendant::meta[@name='Has illustration(s)'][@content='true'])]"/>

<!-- We ignore the document type filter, because we are only indexing one type of document anyway. -->
 <exclude type="filter"
  match="meta[@name='Document type']"/>

<!-- We exclude the filter that specifies these documents because it's pointless. -->
 <exclude type="filter"
  match="meta[@name='Has illustration(s)']"/>

</excludes>
Here we use <exclude type="index"/> to specify that all documents which do not contain <meta name="Has illustration(s)" content="true"/> should be ignored. Then we use two <exclude type="filter"/> tags to specify first that the Document type filter should be ignored (i.e. it should not appear on the search page), and second, that the boolean filter Has illustration(s) should also be excluded.

Using exclusions, you can create multiple specialized search pages which have customized form controls within the same document collection. This is at the expense of additional disk space and build time, of course; each of these searches needs to be built separately.

7.6 Creating a search page

You'll obviously want the search page for your site to conform with the look and feel of the rest of your site. You can create a complete HTML document (which must of course also be well-formed XML, so it can be processed), containing all the site components you need, and then the search build process will insert all the necessary components into that file. The only requirement is that the page contains one <div> element with the correct id attribute:
<div id="staticSearch">
[...content will be supplied by the build process...]
</div>
This <div> will be empty initially. The build process will find this container and insert the search controls, scripts and results <div> into it. Then whenever you rebuild the search for your site, the contents will be replaced. There is no need to make sure it's empty every time.
The search process will also add a link to the staticSearch CSS file to the <head> of the document:
<link rel="stylesheet"
 href="staticSearch/ssSearch.css" id="ssCss"/>
You can customize this CSS by providing your own CSS that overrides it, using <style>, or <link>, placed after it in the <head> element, or by replacing the inserted CSS after the build process. Note that some features, like <resultsPerPage> or the ‘Searching’ loading dialog, rely on rules included in the base staticSearch CSS; if you do remove or disable the CSS, then some features may not work properly.
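A minimal sketch of the override approach might look like the following. The first <link> is the one inserted by the build; the second is your own, and its file name (css/search_overrides.css) is hypothetical.

```xml
<head>
 <!-- Inserted by the staticSearch build process. -->
 <link rel="stylesheet"
  href="staticSearch/ssSearch.css" id="ssCss"/>
 <!-- Your own stylesheet (hypothetical name); it comes later, so its rules win
      over staticSearch rules of equal specificity. -->
 <link rel="stylesheet"
  href="css/search_overrides.css"/>
</head>
```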

Note that once your file has been processed and all this content has been added, you can process it again at any time; there is no need to start every time with a clean, empty version of the search page.

You can take a look at the test/search.html page for an example of how to configure the search page (although note that since this page has already been processed, it has the CSS and the search controls embedded in it; it also has some additional JavaScript which we use for testing the search build results, which is not necessary for your site).

7.7 Running the search build process

Once you have configured your HTML and your configuration file, you're ready to create a search index and a search page for your site. This requires that you run ant in the root folder of the staticSearch project that you have downloaded or cloned.

Note: you will need Java and Apache Ant installed, as well as ant-contrib.

Before running the search on your own site, you can test that your system is able to do the build by doing the (very quick) build of the test materials. If you simply run the ant command, like this:

mholmes@linuxbox:~/Documents/staticSearch$ ant

you should see a build process proceed using the small test collection of documents, and at the end, a results page should open up giving you a report on what was done. If this fails, then you'll need to troubleshoot the problem based on any error messages you see. (Do you have Java, Ant and ant-contrib installed and working on your system?)

If the test succeeds, you can view the results by uploading the test folder and all its contents to a web server, or by running a local webserver on your machine in that folder, using the Python HTTP server or PHP's built-in web server.

If the tests all work, then you're ready to build a search for your own site. Now you need to run the same command, but this time, tell the build process where to find your custom configuration file:2

ant -DssConfigFile=/home/mholmes/mysite/config_staticSearch.xml

The same process should run, and if it's successful, you should have a modified search.html page as well as a lot of index files in JSON format in your site HTML folder. Now you can test your own search in the same ways suggested above.

7.8 Running staticSearch from Ant

staticSearch can be integrated into existing ant build processes quite easily. Once cloned (or downloaded) from GitHub, the staticSearch codebase can be called by a build process using the <ant> task with a nested <property> element that points to your staticSearch config file (either relative to staticSearch using ssConfig or an absolute path using ssConfigFile). Assuming that the build file, your config file, and your staticSearch directory are all at the root of the project, you could call the staticSearch build in ant like so:
<ant antfile="${basedir}/staticSearch"
 inheritall="false">

 <property name="ssConfig"
  value="staticSearch_config.xml"/>

</ant>
Note that any arguments passed to ant at the command line will be passed on to the staticSearch build. This can cause issues when the main build requires the use of the -lib parameter (since the project's version of Saxon may conflict, for instance, with the version used by staticSearch). If your build requires the use of the -lib parameter, then an alternative approach for calling staticSearch from your build is to use the exec task like so:
<exec executable="ant" dir="staticSearch">
 <arg value="-DssConfig=../config_staticSearch.xml"/>
</exec>

7.9 Generated report

After indexing your HTML files, the staticSearch build then generates an HTML report of helpful statistics and diagnostics about your document collection, which can be found in the directory specified by <outputFolder>. We recommend looking at this file regularly, especially if you're encountering unexpected behaviour by the staticSearch engine, as it contains information that can often help diagnose issues with configured filters or the HTML document collection that, if fixed, can improve staticSearch results.

By default, the report includes only basic information about the number of stem files created, the filters used, and any problems encountered. However, if you run the build process using the additional parameter ssVerboseReport:

ant -DssVerboseReport=true -DssConfigFile=...

then the report will also include a number of tables that outline some statistics about your project. However, please note that compiling these statistics is very memory-intensive and if your site is large, it may cause the build process to run out of memory.

As of version 1.4, the word frequency table is a separate document and is no longer included as part of the verbose report. Instead, after running a build, you can then build just the word frequency table with the special concordance target:

ant -DssConfigFile=path/to/your/config.xml concordance

While the chart itself is not necessary for the core functionality of staticSearch, it is particularly useful during the initial development of a project’s search engine; it can be used to create and fine-tune the project-specific stopword list (i.e. if a word appears multiple times in every document, then it probably ought to be a stopword) or, in some cases, to locate potential typos in the document collection. Alongside the HTML chart, the concordance target also provides a (large) JSON file that represents the statistical data used to generate the HTML view. It contains a JSON object that lists each stem and its variants; each variant in turn lists each document in which it appears and the number of times it appears within that document.

The concordance target comes with the same warning as above: compiling these statistics is memory-intensive and may cause the build process to run out of memory.

7.10 Advanced features

7.10.1 Custom attributes

As described in 7.5.2.2 Optional parameters, if you set the <linkToFragmentId> parameter to true, the search index will link each keyword-in-context extract with the closest element that has an id, so that each keyword-in-context string in a search result will point directly to a specific page fragment. You can also add your own attributes to any element in your document in order to have those attributes appear on the keyword-in-context (KWIC) result string (which is in the form of an HTML <li> element).

Imagine that some of the paragraphs in your documents are special in some way. You could add an attribute whose name begins with data-ss- to each of those paragraphs, like this:
<p data-ss-type="special">This paragraph is special for some reason or other...</p>
When the staticSearch indexer creates KWIC extracts, it automatically harvests any attribute whose name begins with data-ss- from the containing element or its ancestors, and adds them to the keyword-in-context record in the index. Then when that KWIC string is displayed as the result of a search, the attribute will be added to the HTML <li> element on the page:
<li data-ss-type="special">[KWIC with marked search hit, link, etc.]</li>
This means that you can add your own CSS or JavaScript to make that KWIC appear distinct from other KWICs which come from non-special paragraphs.
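As an illustration only, a <style> element along these lines, placed in the <head> of the search page, would set such results apart. The selector relies solely on the attribute shown above; the border and padding values are arbitrary choices, not staticSearch defaults.

```xml
<style>
 /* Hypothetical styling for KWICs harvested from data-ss-type="special" paragraphs. */
 li[data-ss-type="special"] {
   border-left: 0.25em solid #cc0000;
   padding-left: 0.5em;
 }
</style>
```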

You can add as many custom attributes as you like (although bear in mind that they increase the size of the index JSON files slightly and may add to the build time).

One specific custom attribute has built-in handling that you may find useful. If you add the attribute data-ss-img with a value that points to an image, that image will be displayed to the left of the KWIC string. For example, if you do this:
<p data-ss-img="images/elephant.png">This paragraph is all about elephants...</p>
then any KWIC results from that paragraph will show the elephant.png image to the left of the KWIC text. This can be especially useful if your site contains large documents which are broken into sections, and those sections can be helpfully represented by images; the search results will be easier for the user to understand by virtue of the associated images. Image URLs should be relative to the location of the search page, not the original source document, since the images will be displayed on the search page.

7.10.2 Highlighting search hits on target pages

When you use a conventional search engine which is based on a backend database and configured for a specific dynamic website, it is not unusual to find that when you follow a link to a search hit on a target page, the hit will be highlighted on that page, because the backend database is able to pre-process it. With staticSearch, that's not straightforward, because there is no ‘engine’ that is preprocessing your HTML pages. However, if this is important to you, there is a workaround that you can use. Since staticSearch does not modify your web pages, though, the implementation has to be done by you.

By default, when you turn on <linkToFragmentId> in the configuration, each keyword-in-context result that shows on the search page will have its own specific link to the fragment of the document which contains the hit. Those links are also provided with a search string, like this:
https://example.com/egPage.html?ssMark=elephant#animals
This link points to the section of the document which has id=animals, but it also says ‘the hit text is the word elephant.’ Some JavaScript that runs on the target page, egPage.html (which you control) will be able to parse the value of the query parameter ssMark in order to find the hit text, and highlight it in some way.

Obviously you can implement this any way you like (or just ignore it), but we also supply a small demonstration JavaScript library which implements this functionality, called ssHighlight.js. This JS file is included in the staticSearch output folder (see <outputFolder>) by default, and if you include it in the <head> of your own pages, it will probably do the highlighting without further intervention. If, however, you have lots of existing JavaScript that runs when the page loads, there may be some interference between this library and your own code, so you may have to make some adjustments to the code.
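Including the library might look something like this in the <head> of a target page. This is a sketch which assumes that your <outputFolder> is named staticSearch and that the page sits in the same folder as that output folder; adjust the path to suit your own site.

```xml
<head>
 <!-- Assumed path: depends on your <outputFolder> value and on where this page lives. -->
 <script src="staticSearch/ssHighlight.js"></script>
</head>
```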

Note that another alternative is to use the <scrollToTextFragment> feature; this does not require any special JavaScript, but at the time of writing it is supported only on some Chromium-based browsers and is non-standard and somewhat unreliable.

Notes
1
This example is taken from Thomas S. Kuhn, The Structure of Scientific Revolutions (50th anniversary edition), University of Chicago Press, 2012: p. 191.
2
Note that you can use either ssConfigFile or ssConfig to provide the build with the full path or relative path, respectively, of your configuration file.
Martin Holmes and Joey Takeda. Date: 2019-2022