Schema and guidelines for creating a staticSearch engine for your HTML5 site
Martin Holmes
Joey Takeda
2019–2026

This documentation provides instructions on how to use the Project Endings staticSearch Generator to provide a fully-functional search ‘engine’ to your website without any dependency on server-side code such as a database.

Appendix A Schema specification and tag documentation

Appendix A.1 Elements

Appendix A.1.1 <config>

<config> (The root element for the Search Generator configuration file.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
version (specifies the major version of staticSearch to which this configuration file corresponds. If this attribute is not used, the configuration file is assumed to have a version value of 1.)
Status Recommended
Datatype nonNegativeInteger
Default 1
<config version="1">
<params>
<!--Config options-->
</params>
</config>
Note

The version attribute only needs to specify the major version of staticSearch with which the configuration file is compatible. While minor versions may introduce new, optional configuration options, backwards-incompatible changes to the config will only occur across major versions (e.g. a configuration file created for staticSearch 1.1 will work with staticSearch 1.4, but is not guaranteed to work with staticSearch 2.0).

Contained by
May contain
ss: contexts excludes filters params rules
Content model
<content>
 <sequence minOccurs="1" maxOccurs="1">
  <elementRef key="params"/>
  <elementRef key="rules" minOccurs="0"/>
  <elementRef key="contexts" minOccurs="0"/>
  <elementRef key="excludes" minOccurs="0"/>
  <elementRef key="filters" minOccurs="0"/>
 </sequence>
</content>
    
Schema Declaration
element config
{
   attribute version { text }?,
   ( params, rules?, contexts?, excludes?, filters? )
}

Appendix A.1.2 <context>

<context> (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
context
Status Optional
Datatype ssdata.boolean
Contained by
ss: contexts
May contain Empty element
Note

When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense approach based on common element definitions, so that for example when it reaches the end of a paragraph, it will not continue into the next paragraph to get more context words. You may have special runs of text in your document collection which do not appear to be bounding contexts, but actually are; for example, you may have span elements with class=note that appear in the middle of sentences but are not actually part of them. Use <context> elements to identify these special contexts so that the indexer knows the right boundaries from which to retrieve its keyword-in-context strings.
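
As a minimal sketch of the case described above, the following fragment declares inline notes as their own bounding context. The unprefixed span in the match pattern assumes that the indexer resolves element names against the XHTML namespace; that is an assumption here, so adjust the pattern to suit your own documents.

```xml
<contexts xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <!-- Treat each inline note as its own bounding context, so that
       keyword-in-context strings do not run across its edges. -->
  <context match="span[@class='note']"/>
</contexts>
```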

Schematron
<sch:ns uri="http://hcmc.uvic.ca/ns/staticSearch" prefix="ss"/>
<sch:pattern>
 <sch:rule context="ss:context">
  <sch:assert test="not(@label and @context = 'false')"> ERROR: If a context has a label, it must be a context for the purposes of indexing. </sch:assert>
 </sch:rule>
</sch:pattern>
Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element context
{
   att.match.attributes,
   att.labelled.attributes,
   attribute context { text }?,
   empty
}

Appendix A.1.3 <contexts>

<contexts> (The set of context elements that identify contexts for keyword-in-context fragments.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: context
Content model
<content>
 <elementRef key="context" minOccurs="0"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element contexts { context* }

Appendix A.1.4 <createContexts>

<createContexts> (Whether to include keyword-in-context extracts in the index.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
Contained by
ss: params
May contain Empty element
Note

Note that contexts are necessary for phrasal searching or wildcard searching.

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element createContexts
{
   (
      ( attribute create { "false" }? )
    | (
         attribute create { "true" }?,
         attribute phrasalSearch { text }?,
         attribute wildcardSearch { text }?,
         attribute maxKwicsToHarvest { text }?,
         attribute maxKwicLength { text }?,
         attribute kwicTruncateString { text }?
      )
   ),
   empty
}
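
The declaration above allows two mutually exclusive forms, which might be written as in the following sketch. The attribute names come from the declaration; the values are illustrative assumptions, since the declaration types them only as text (booleans are assumed for phrasalSearch and wildcardSearch, integers for the two KWIC limits, and a short string for kwicTruncateString).

```xml
<!-- Form 1: contexts switched on, with the optional tuning attributes.
     All values shown here are assumed, not documented defaults. -->
<createContexts xmlns="http://hcmc.uvic.ca/ns/staticSearch"
  create="true"
  phrasalSearch="true"
  wildcardSearch="true"
  maxKwicsToHarvest="5"
  maxKwicLength="10"
  kwicTruncateString="..."/>

<!-- Form 2 (alternative): contexts switched off. No other attribute
     is permitted in this form:
     <createContexts create="false"/>
-->
```

Bear in mind the note above: with contexts switched off, phrasal and wildcard searching are not available.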

Appendix A.1.5 <dictionary>

<dictionary> (Specifies a dictionary against which tokens may be checked during indexing.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
file (The relative path (from the config file) to a dictionary file (one word per line).)
Derived from ss.atts.file
Status Required
Datatype anyURI
Default dicts/words_en.txt
Contained by
ss: params
May contain Empty element
Note

The indexing process checks each word as it builds the index, and keeps a record of all words which are not found in the configured dictionary. Though this does not have any direct effect on the indexing process, all words not found in the dictionary are listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language of the dictionary) or perhaps misspelled (in which case they may not be correctly stemmed and indexed, and should be corrected).

staticSearch provides a default dictionary in xsl/english_words.txt that can be copied and adapted if working in English; lots of dictionaries for other languages are available on the Web.

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element dictionary { attribute file { text }, empty }

Appendix A.1.6 <exclude>

<exclude> (An exclusion definition, which excludes either documents or filters as defined by an XPath in the match attribute.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
type
Status Required
Legal values are:
index
(Index exclusion) An exclusion that specifies an HTML fragment (which itself can be the root HTML element) to exclude from the document index.
filter
(Filter exclusion) An exclusion that matches an HTML meta tag to exclude from the filter controls on the search page.
Contained by
ss: excludes
May contain Empty element
Note

<exclude> can be used to identify documents or parts of documents that are to be omitted from indexing, but, unlike setting the weight attribute of a <rule> to zero, should be retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for sites with multiple search engines or pages that each have their own special features; in this case, you may want one specific search index or page to ignore filter controls (HTML <meta> elements, as described in 5 Search facet features) which are provided to support other search pages.
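
The two kinds of exclusion might be combined as in the sketch below. The id value and the filter name are hypothetical, and the unprefixed element names in the match patterns assume that the indexer resolves them against the XHTML namespace.

```xml
<excludes xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <!-- Index exclusion: leave one whole document out of the index
       (the id 'sitemap' is a made-up example). -->
  <exclude type="index" match="html[@id='sitemap']"/>
  <!-- Filter exclusion: hide one filter from this search page's
       controls (the filter name is a made-up example). -->
  <exclude type="filter" match="meta[@name='Internal status']"/>
</excludes>
```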

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element exclude
{
   att.match.attributes,
   attribute type { "index" | "filter" },
   empty
}

Appendix A.1.7 <excludes>

<excludes> (The set of exclusions, expressed as exclude elements, that control the subset of documents or filters used for a particular search.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: exclude
Content model
<content>
 <elementRef key="exclude" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element excludes { exclude+ }

Appendix A.1.8 <filter>

<filter> (Allows specification of a custom label for a filter on the search page) Filters are identified through plain-text labels defined as XHTML meta/@name attributes in the document headers. This element allows a user to specify a richer label for a particular filter by using HTML code. Multiple labels may be specified in different languages. The filterName attribute must be supplied, and must be identical to the XHTML meta/@name attribute for the filter as it appears in the HTML document heads.
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
filterName
Status Required
Datatype teidata.text
Contained by
ss: filters
May contain
ssHTML: span
Content model
<content>
 <elementRef key="span" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element filter { attribute filterName { text }, span+ }
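
A filter label in two languages might look like the following sketch. The filter name and the label text are hypothetical, and the h prefix is simply one way to place the span elements in the XHTML namespace that the schema requires for them.

```xml
<filters xmlns="http://hcmc.uvic.ca/ns/staticSearch"
  xmlns:h="http://www.w3.org/1999/xhtml">
  <!-- filterName must be identical to the meta/@name value used in
       the document heads ('Document type' is a made-up example). -->
  <filter filterName="Document type">
    <h:span lang="en">Document <h:em>type</h:em></h:span>
    <h:span lang="fr"><h:em>Type</h:em> de document</h:span>
  </filter>
</filters>
```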

Appendix A.1.9 <filters>

<filters> (Wrapper for filter elements in the configuration file) The optional filters/filter part of the configuration file allows the user to specify custom labels with HTML markup for specific filters, so they are not limited by the label supplied in the XHTML meta/@name attribute.
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: filter
Content model
<content>
 <elementRef key="filter" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element filters { filter+ }

Appendix A.1.10 <index>

<index> (Configures options relating to indexing.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
recurse (Determines whether or not to recurse into the subdirectories of the collection and index those files.)
Status Recommended
Datatype ssdata.boolean
Default false
Note

This is useful for static sites that create nested directory structures (such as those generated by Jekyll or WordPress).

Contained by
ss: params
May contain Empty element

Appendix A.1.11 <output>

<output> (Sets the folder into which the index data and JavaScript will be placed.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
dir (A pointer to a local directory.)
Derived from ss.atts.dir
Status Required
Datatype NCName
Default staticSearch
Note

This should conform with the XML Name specification.

Contained by
ss: params
May contain Empty element
Note

When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that their files are kept in different locations.

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element output { attribute dir { text }, empty }

Appendix A.1.12 <params>

<params> (Element containing most of the settings which enable the Generator to find the target website content and process it appropriately.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
Content model
<content>
 <elementRef key="searchPage"/>
 <elementRef key="index" minOccurs="0"/>
 <elementRef key="stopwords" minOccurs="0"/>
 <elementRef key="dictionary" minOccurs="0"/>
 <elementRef key="tokenizer" minOccurs="0"/>
 <elementRef key="scoringAlgorithm"
  minOccurs="0"/>
 <elementRef key="stemmer" minOccurs="0"/>
 <elementRef key="createContexts"
  minOccurs="0"/>
 <elementRef key="results" minOccurs="0"/>
 <elementRef key="version" minOccurs="0"/>
 <elementRef key="output"/>
</content>
    
Schema Declaration
element params {  }
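
As a rough illustration, a <params> block that uses every child listed in the content model above might look like this. The sketch assumes the children must appear in the order in which the content model lists them. The file names search.html and VERSION are hypothetical; the stopwords, dictionary, and output values are the defaults documented for those elements.

```xml
<params xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <!-- Required: the search page at the root of the collection. -->
  <searchPage file="search.html"/>
  <index recurse="true"/>
  <stopwords file="stopwords/stopwords_en.txt"/>
  <dictionary file="dicts/words_en.txt"/>
  <tokenizer minWordLength="3"/>
  <scoringAlgorithm name="tf-idf"/>
  <stemmer dir="stemmers/en/"/>
  <createContexts create="true"/>
  <results maxKwicsToShow="10"/>
  <version file="VERSION"/>
  <!-- Required: the folder that receives the index data and JavaScript. -->
  <output dir="staticSearch"/>
</params>
```

Only <searchPage> and <output> are required by the content model; every other child may be omitted.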

Appendix A.1.13 <results>

<results> (Controls the configuration of the results page.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
resultsPerPage (The maximum number of document results to be displayed per page. All results are displayed by default; setting resultsPerPage to a positive integer creates a Show More/Show All widget at the bottom of the batch of results.)
Status Optional
Datatype nonNegativeInteger
Default 0
Note

For most sites, where the number of results is likely to be in the low thousands, it's perfectly practical to show all the results at once, because the staticSearch processor is so fast. However, if you have tens of thousands of documents, and it's possible that users will do (for example) filter-only searches that retrieve a large proportion of them, you can constrain the number of results which are shown initially using this setting. All the results are still generated and output to the page, but since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, the browser will render them much more quickly.

maxKwicsToShow (Controls the maximum number of keyword-in-context extracts that will be shown in the search page for each hit document returned.)
Status Optional
Datatype nonNegativeInteger
Default 25
Note

maxKwicsToShow is useful for avoiding situations where a given query may produce hundreds of keyword-in-context extracts for a single document (especially when searching for common words, et cetera), making the results page difficult to navigate.

maxResults (The maximum number of results that can be returned for any search before returning an error; if the number of documents in a result set exceeds this number, then staticSearch will not render the results and will provide a message saying that the search returned too many results.)
Status Optional
Datatype nonNegativeInteger
Default 2000
Note

This configuration option is meant to prevent errors for sites where a given set of filters or search terms can return a set of documents large enough to cause a browser's rendering engine to fail. For smaller collections, it's unlikely that this limit would ever be reached, but setting a limit may be helpful for large document collections, projects that want to constrain the number of possible results, or projects with memory-intensive or complex rendering.

This is set to 2000 by default, but you may want to have a higher or lower limit, depending on the specific structure of your project.
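
For a large collection, the three attributes might be combined as in this sketch. The numbers are illustrative choices, not recommendations.

```xml
<!-- Show results in batches of 50, show at most 10 keyword-in-context
     extracts per document, and refuse to render result sets larger
     than 5000 documents. All three values are made-up examples. -->
<results xmlns="http://hcmc.uvic.ca/ns/staticSearch"
  resultsPerPage="50"
  maxKwicsToShow="10"
  maxResults="5000"/>
```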

Contained by
ss: params
May contain Empty element

Appendix A.1.14 <rule>

<rule> (A rule that specifies a document path as XPath in the match attribute, and provides weighting for search terms found in that context.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
weight (The weighting to give to a search token found in the context specified by the match attribute. Set to 0 to completely suppress indexing for a specific context, or greater than 1 to give stronger weighting.)
Status Required
Datatype nonNegativeInteger
Contained by
ss: rules
May contain Empty element
Note

The rule element is used to identify nodes in the XHTML document collection which should be treated in a special manner when indexed; either they might be ignored (if weight=0), or any words found in them might be given greater weight than words in normal contexts (if weight>1). Words appearing in headings or titles, for example, might be weighted more heavily, while navigation menus, banners, or footers might be ignored completely.
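
The weighting scheme described above could be expressed along these lines. The element names are ordinary HTML5 elements chosen for illustration, the weights are arbitrary, and the unprefixed names assume that the indexer resolves them against the XHTML namespace.

```xml
<rules xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <!-- Words in top-level headings count three times as much. -->
  <rule match="h1" weight="3"/>
  <!-- Words in second-level headings count twice as much. -->
  <rule match="h2" weight="2"/>
  <!-- Navigation menus and footers are not indexed at all. -->
  <rule match="nav | footer" weight="0"/>
</rules>
```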

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element rule { att.match.attributes, attribute weight { text }, empty }

Appendix A.1.15 <rules>

<rules> (The set of rules that control weighting of search terms found in specific contexts.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: rule
Content model
<content>
 <elementRef key="rule" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element rules { rule+ }

Appendix A.1.16 <scoringAlgorithm>

<scoringAlgorithm> (The scoring algorithm to use for ranking keyword results.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
name (Specifies the name of the scoring algorithm to use.)
Status Recommended
Legal values are:
raw
(Default: Calculate the score based on the weighted number of instances of a term in a text.) raw score[Default]
tf-idf
(Calculate the score using the tf-idf scoring algorithm.) tf-idf (term frequency-inverse document frequency)
BM25
(EXPERIMENTAL: Calculate the score using the BM25 scoring algorithm.) BM25 (best matching 25)
BM25L
(EXPERIMENTAL: Calculate the score using Lv and Zhai's modification of BM25.) BM25L (best matching 25 for long documents)
Contained by
ss: params
May contain Empty element
Note

<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating the score of a term and thus the order in which the results from a search are sorted.

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element scoringAlgorithm
{
   attribute name { "raw" | "tf-idf" | "BM25" | "BM25L" }?,
   empty
}

Appendix A.1.17 <searchPage>

<searchPage> (The search page that will be the primary access point for staticSearch. This page may or may not exist, but its location is used for determining the collection that will be indexed, so it must be at the root of the collection directory.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
file (The path (relative to the config file) to the search page.)
Derived from ss.atts.file
Status Required
Datatype anyURI
Contained by
ss: params
May contain Empty element
Note

The search page is a regular HTML page which forms part of your site. The only important characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page.
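
A bare-bones search page might therefore look like the sketch below, written in the XML serialization of HTML5. The title and heading text are hypothetical, as is the file name search.html that a corresponding <searchPage file="search.html"/> would point to; the only part the documentation requires is the <div> with id=staticSearch.

```xml
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>Search</title>
  </head>
  <body>
    <h1>Search this site</h1>
    <!-- The staticSearch build process rewrites the contents of this div. -->
    <div id="staticSearch"></div>
  </body>
</html>
```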

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element searchPage { attribute file { text }, empty }

Appendix A.1.18 <span>

<span> (An HTML span) A standard HTML span element, within which other HTML content can be included. We do not prescribe or validate the content model; see e.g. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/span for information.
Namespace http://www.w3.org/1999/xhtml
Module ssHTML — Schema specification and tag documentation
Attributes
lang (The language of this span) Multiple spans may be supplied in different languages if required. The value of this attribute should be a valid language identifier per BCP 47.
Status Recommended
Datatype teidata.language
Contained by
ss: filter
May contain ANY
Content model
<content>
 <alternate minOccurs="1"
  maxOccurs="unbounded">
  <anyElement require="http://www.w3.org/1999/xhtml"
   minOccurs="0" maxOccurs="unbounded"/>
  <textNode/>
 </alternate>
</content>
    
Schema Declaration
element span { attribute lang { text }?, ( anyElement_span_1* | text )+ }

Appendix A.1.19 <stemmer>

<stemmer> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript and XSLT implementations of stemmers can be found. If not specified, then the staticSearch default English stemmer (en) will be used.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
dir (The path (relative to the config file) of the directory to use for stemming.)
Derived from ss.atts.dir
Status Required
Datatype anyURI
Suggested values include:
stemmers/en/
English stemmer[Default]
stemmers/fr/
French stemmer
stemmers/identity
Identity stemmer
stemmers/stripDiacritics
Diacritic stripping stemmer
Contained by
ss: params
May contain Empty element
Note

The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and an implementation of the French Snowball stemmer. These appear in the /stemmers/en/ and /stemmers/fr/ folders. The default value for this parameter is en; to use the French stemmer, use fr. We will be adding more stemmers as the project develops. However, if your document collection is not English or French, you have a couple of options, one hard and one easy.

  • Hard option: implement your own stemmers. You will need to write two implementations of the stemmer algorithm, one in XSLT (which must be named ssStemmer.xsl) and one in JavaScript (ssStemmer.js), and confirm that they both generate the same results. The XSLT stemmer is used in the generation of the index files at build time, and the JavaScript version is used to stem the user's input in the search page. You can look at the existing implementations in the /stemmers/en/ folder to see how the stemmers need to be constructed. Place your stemmers in a folder called /stemmers/[yourlang]/, and specify yourlang in the configuration file.
  • Easy option: Use the identity stemmer (which is equivalent to turning off stemming completely), and make sure wildcard searching is turned on. Then your users can search using wildcards instead of having their search terms automatically stemmed. To do this, specify the value identity in your configuration file.

Another alternative is the stripDiacritics stemmer. Like the identity stemmer, this is not really a stemmer at all; what it does is to strip out all combining diacritics from tokens. This is a useful approach if your document collection contains texts with accents and diacritics, but your users may be unfamiliar with the use of diacritics and will want to search just with plain unaccented characters. For example, if a text contains the word élève, but you would like searchers to be able to find the word simply by typing the ASCII string eleve, then this is a good option. Combined with wildcards, it can provide a very flexible and user-friendly search engine in the absence of a sophisticated stemmer, or for cases where there are mixed languages so a single stemmer will not do. To use this option, specify the value stripDiacritics in your configuration file.
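
The combination of diacritic stripping and wildcard searching might be configured as in the following sketch. It uses the stemmers/stripDiacritics value given in the list of suggested values above, and assumes that wildcardSearch takes a true/false value. The search page file name is hypothetical.

```xml
<params xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <searchPage file="search.html"/>
  <!-- No real stemming: combining diacritics are stripped from tokens. -->
  <stemmer dir="stemmers/stripDiacritics"/>
  <!-- Wildcard searching depends on contexts being created. -->
  <createContexts create="true" wildcardSearch="true"/>
  <output dir="staticSearch"/>
</params>
```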

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element stemmer
{
   attribute dir
   {
      "stemmers/en/"
    | "stemmers/fr/"
    | "stemmers/identity"
    | "stemmers/stripDiacritics"
   },
   empty
}

Appendix A.1.20 <stopwords>

<stopwords> (Specifies a list of stopwords, that is, words to be ignored when indexing.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
file (The path (relative to the config file) to a text file containing a list of words to be ignored by the indexer (one word per line).)
Derived from ss.atts.file
Status Required
Datatype anyURI
Default stopwords/stopwords_en.txt
Contained by
ss: params
May contain Empty element
Note

A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the Web, but it is probably a good idea to take one of these and customize it for your project, since there will be words in the website which are so common that it makes no sense to index them, but they are not ordinary stopwords. For example, in a website dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include it, and searching for it will be pointless.

staticSearch provides a default set of common stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then look for the largest JSON index files that are generated, to see whether the terms they correspond to might be too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report).
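
For the Keats example above, a project-specific list might be wired in as sketched here. The file name keats_stopwords.txt is hypothetical; the file itself is plain text with one word per line.

```xml
<!-- keats_stopwords.txt (a made-up name) would contain a general
     English stopword list plus project-specific additions, one word
     per line, for example:
       the
       a
       you
       keats
-->
<stopwords xmlns="http://hcmc.uvic.ca/ns/staticSearch"
  file="stopwords/keats_stopwords.txt"/>
```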

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element stopwords { attribute file { text }, empty }

Appendix A.1.21 <tokenizer>

<tokenizer> (Configures options for the tokenizing process.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
minWordLength (Specifies the minimum length in characters of a sequence of text that will be considered to be a word worth indexing.)
Status Recommended
Datatype nonNegativeInteger
Default 2
Note

Values of 3 or above may be useful for European languages to exclude common prepositions, articles, et cetera. If you set this to a lower limit for reasons specific to your project, you should ensure that your stopword list excludes any very common words that would otherwise make the indexing process lengthy and increase the index size.

Contained by
ss: params
May contain Empty element

Appendix A.1.22 <version>

<version> (Specifies the unique version to append to the index, so that the browser will not use cached versions of older index files.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
file (The path (relative to the config file) to a text file containing a single version identifier (such as 1.5, 123456, or 06ad419).)
Derived from ss.atts.file
Status Required
Datatype anyURI
Contained by
ss: params
May contain Empty element
Note

<version> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string will be used as part of the filenames for all the JSON resources created for the search. This is useful because it allows the browser to cache such resources when users repeatedly visit the search page, but if the project is rebuilt with a new version, those cached files will not be used because the new version will have different filenames. The path specified is relative to the location of the configuration file (or absolute, if you wish).

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element version { attribute file { text }, empty }

Appendix A.2 Attribute classes

Appendix A.2.1 att.labelled

att.labelled (A class providing a label attribute that can be used to identify/describe contexts.)
Module ss — Schema specification and tag documentation
Members context
Attributes
label (A string identifier specifying the name for a given context.)
Status Optional
Datatype string
Note

When describing a <context>, the label attribute names a component of the page that can be searched within (see 8.5.6 Specifying searchable contexts (search only in)).
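
Labelled contexts might be declared as in this sketch. The match patterns and label text are hypothetical, and the unprefixed element names assume that the indexer resolves them against the XHTML namespace.

```xml
<contexts xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <!-- Each label names a part of the page that can be searched within. -->
  <context match="div[@id='abstract']" label="Abstracts"/>
  <context match="section[@class='notes']" label="Notes"/>
</contexts>
```

Note that the Schematron rule given for <context> rejects any labelled context that also sets context to false.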

Appendix A.2.2 att.match

att.match (A class providing attributes that enable specification of document locations.)
Module ss — Schema specification and tag documentation
Members context exclude rule
Attributes
match (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.)
Status Required
Datatype string

Appendix A.3 Datatypes

Appendix A.3.1 ssdata.boolean

ssdata.boolean (Custom boolean datatype, which restricts boolean values to true or false for ease of processing.)
Module ss — Schema specification and tag documentation
Used by
Element:
Content model
<content>
 <valList>
  <valItem ident="true"/>
  <valItem ident="false"/>
 </valList>
</content>
    
Declaration
ssdata.boolean = "true" | "false"
Martin Holmes and Joey Takeda. Date: 2019–2026