Schema and guidelines for creating a staticSearch engine for your HTML5 site
Martin Holmes
Joey Takeda
2019-2022

This documentation provides instructions on how to use the Project Endings staticSearch Generator to provide a fully-functional search ‘engine’ to your website without any dependency on server-side code such as a database.

Appendix A Schema specification and tag documentation

Appendix A.1 Elements

Appendix A.1.1 <config>

<config> (The root element for the Search Generator configuration file.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes
version (specifies the major version of staticSearch to which this configuration file corresponds. If this attribute is not used, the configuration file is assumed to have a version value of 1.)
Status Recommended
Datatype nonNegativeInteger
Default 1
<config version="1">
 <params>
<!--Config options-->
 </params>
</config>
Note

The version attribute only needs to specify the major version of staticSearch with which the configuration file is compatible. While minor versions may introduce new, optional configuration options, backwards-incompatible changes to the config will only occur across major versions (e.g. a configuration file created for staticSearch 1.1 will work with staticSearch 1.4, but is not guaranteed to work with staticSearch 2.0).

Contained by
May contain
ss: contexts excludes params rules
Content model
<content>
 <elementRef key="params"/>
 <elementRef key="rules" minOccurs="0"/>
 <elementRef key="contexts" minOccurs="0"/>
 <elementRef key="excludes" minOccurs="0"/>
</content>
    
Schema Declaration
element config
{
   attribute version { text }?,
   params,
   rules?,
   contexts?,
   excludes?
}
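
As an illustration of this content model, a minimal configuration file might look like the following sketch. The element names and their order come from the schema; the file name search.html, the XPath patterns, the id value and the weight are hypothetical placeholders rather than recommendations.

```xml
<config xmlns="http://hcmc.uvic.ca/ns/staticSearch" version="1">
  <params>
    <!-- Required: the search page (hypothetical file name) -->
    <searchFile>search.html</searchFile>
    <!-- Required: whether to descend into subdirectories -->
    <recurse>true</recurse>
  </params>
  <!-- Optional: weighting rules -->
  <rules>
    <rule match="h1" weight="2"/>
  </rules>
  <!-- Optional: extra keyword-in-context boundaries -->
  <contexts>
    <context match="span[@class='note']"/>
  </contexts>
  <!-- Optional: exclusions (hypothetical document id) -->
  <excludes>
    <exclude type="index" match="html[@id='sitemap']"/>
  </excludes>
</config>
```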

Appendix A.1.2 <context>

<context> (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes att.match (@match) att.labelled (@label)
context
Status Optional
Datatype boolean
Contained by
ss: contexts
May contain Empty element
Note

When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense approach based on common element definitions, so that for example when it reaches the end of a paragraph, it will not continue into the next paragraph to get more context words. You may have special runs of text in your document collection which do not appear to be bounding contexts, but actually are; for example, you may have span elements with class=note that appear in the middle of sentences but are not actually part of them. Use <context> elements to identify these special contexts so that the indexer knows the right boundaries from which to retrieve its keyword-in-context strings.
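
For the span case described above, a context definition might look like this sketch. The XPath is an illustrative assumption about how notes are marked up in your pages; adjust it to match your own markup.

```xml
<contexts>
  <!-- Treat note spans as their own keyword-in-context boundary
       (hypothetical markup: <span class="note">) -->
  <context match="span[@class='note']"/>
</contexts>
```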

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element context
{
   att.match.attributes,
   att.labelled.attributes,
   attribute context { text }?,
   empty
}

Appendix A.1.3 <contexts>

<contexts> (The set of context elements that identify contexts for keyword-in-context fragments.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: context
Content model
<content>
 <elementRef key="context" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element contexts { context+ }

Appendix A.1.4 <createContexts>

<createContexts> (Whether to include keyword-in-context extracts in the index.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

<createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context extracts for each of the hits in a document. This increases the size of the index, but of course it makes for much more user-friendly search results; instead of seeing just a score for each document found, the user will see a series of short text strings with the search keyword(s) highlighted.

Note that contexts are necessary for phrasal searching or wildcard searching.

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element createContexts { xsd:boolean }

Appendix A.1.5 <dictionaryFile>

<dictionaryFile> (The relative path (from the config file) to a dictionary file (one word per line) which will be used to check tokens when indexing.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD anyURI
Note

The indexing process checks each word as it builds the index, and keeps a record of all words which are not found in the configured dictionary. Though this does not have any direct effect on the indexing process, all words not found in the dictionary are listed in the staticSearch report (see 7.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language of the dictionary) or perhaps misspelled (in which case they may not be correctly stemmed and indexed, and should be corrected). There is a default dictionary in xsl/english_words.txt which you might copy and adapt if you're working in English; lots of dictionaries for other languages are available on the Web.
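
A setting that points to a project-specific dictionary might be sketched as follows. The path dictionaries/my_words.txt is a hypothetical location, given relative to the configuration file; the file itself would hold one word per line.

```xml
<!-- Hypothetical path, relative to the configuration file -->
<dictionaryFile>dictionaries/my_words.txt</dictionaryFile>
```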

Content model
<content>
 <dataRef name="anyURI"/>
</content>
    
Schema Declaration
element dictionaryFile { xsd:anyURI }

Appendix A.1.6 <exclude>

<exclude> (An exclusion definition, which excludes either documents or filters as defined by an XPath in the match attribute.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes att.match (@match)
type
Status Required
Legal values are:
index
(Index exclusion) An exclusion that specifies an HTML fragment (which itself can be the root HTML element) to exclude from the document index.
filter
(Filter exclusion) An exclusion that matches an HTML meta tag to exclude from the filter controls on the search page.
Contained by
ss: excludes
May contain Empty element
Note

<exclude> can be used to identify documents or parts of documents that are to be omitted from indexing, but, unlike content matched by a <rule> whose weight is set to zero, should be retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for multiple search engines/pages that each have their own special features; in this case, you may want one specific search index/page to ignore filter controls (HTML <meta> elements, as described in 4 Search facet features) which are provided to support other search pages.
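
The two exclusion types might be combined as in the following sketch. Both XPaths are hypothetical: the id value and the meta name are invented for illustration and would need to match your own pages.

```xml
<excludes>
  <!-- Index exclusion: leave this whole document out of the index
       (hypothetical id on the root element) -->
  <exclude type="index" match="html[@id='draft']"/>
  <!-- Filter exclusion: ignore a filter meta tag that exists only
       to support a different search page (hypothetical name) -->
  <exclude type="filter" match="meta[@name='Document type']"/>
</excludes>
```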

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element exclude
{
   att.match.attributes,
   attribute type { "index" | "filter" },
   empty
}

Appendix A.1.7 <excludes>

<excludes> (The set of exclusions, expressed as exclude elements, that control the subset of documents or filters used for a particular search.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: exclude
Content model
<content>
 <elementRef key="exclude" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element excludes { exclude+ }

Appendix A.1.8 <indentJSON>

<indentJSON> (Whether or not to indent code in the JSON index files. Indenting increases the file size, but it can be useful if you need to read the files for debugging purposes.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

Normally, the JSON index files will only be processed by JavaScript, so there is no need to indent them, but if you are debugging or investigating how the code works, you may find it useful to generate them in a more human-readable layout.

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element indentJSON { xsd:boolean }

Appendix A.1.9 <kwicTruncateString>

<kwicTruncateString> (The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context extract. Conventionally three periods, or an ellipsis character (which is the default value).)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain Character data only
Note

The only reason you might need to specify a value for this parameter is if the language of your search page conventionally uses a different ellipsis character. Japanese, for example, uses the 3-dot-leader character.

Content model
<content>
 <textNode/>
</content>
    
Schema Declaration
element kwicTruncateString { text }

Appendix A.1.10 <linkToFragmentId>

<linkToFragmentId> (Whether to link keyword-in-context extracts to the nearest id in the document. Default is true.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

<linkToFragmentId> is a boolean parameter that specifies whether you want the search engine to link each keyword-in-context extract with the closest element that has an id. If the element has an ancestor with an id, then the indexer will associate that keyword-in-context extract with that id; if there are no suitable ancestor elements that have an id, then the extract is associated with the first preceding element with an id.

We strongly recommend that you ensure your target documents have id attributes for any significant divisions so that this parameter can be used effectively. With lots of ids throughout your documents, and this parameter turned on, each keyword-in-context in the results page will be linked directly to the section of the document in which the hit appears, making the search results much more useful.

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element linkToFragmentId { xsd:boolean }

Appendix A.1.11 <maxKwicsToHarvest>

<maxKwicsToHarvest> (This controls the maximum number of keyword-in-context extracts that will be stored for each term in a document.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD nonNegativeInteger
Note

<maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the data for each term in a document. For example, if a user searches for the word ‘elephant’, and it occurs 27 times in a document, but the <maxKwicsToHarvest> value is set to 5, then only the first five (sorted in document order) of these keyword-in-context strings will be stored in the index. (This does not affect the score of the document in the search results, of course.) If you set this to a low number, the size of the JSON files will be constrained, but of course the user will only be able to see the KWICs that have been harvested in their search results.

If <phrasalSearch> is set to true, the <maxKwicsToHarvest> setting is ignored, because phrasal searches will only work properly if all contexts are stored.
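
To make this interaction concrete, the following sketch limits stored extracts to five per term per document, matching the ‘elephant’ example above. Setting <phrasalSearch> to false is an assumption made here so that the limit actually takes effect; the required <searchFile> and <recurse> settings are omitted for brevity.

```xml
<params>
  <!-- searchFile and recurse omitted for brevity -->
  <!-- Phrasal search is turned off so that the harvest limit is honoured -->
  <phrasalSearch>false</phrasalSearch>
  <createContexts>true</createContexts>
  <!-- Store at most five extracts per term per document -->
  <maxKwicsToHarvest>5</maxKwicsToHarvest>
</params>
```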

Content model
<content>
 <dataRef name="nonNegativeInteger"/>
</content>
    
Schema Declaration
element maxKwicsToHarvest { xsd:nonNegativeInteger }

Appendix A.1.12 <maxKwicsToShow>

<maxKwicsToShow> (This controls the maximum number of keyword-in-context extracts that will be shown in the search page for each hit document returned.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD nonNegativeInteger
Note

A user may search for multiple common words, so hundreds of hits could be found in a single document. If the keyword-in-context strings for all these hits were shown on the results page, the page would be too long and too difficult to navigate. This setting controls how many of those hits you want to show for each document in the result set.

Content model
<content>
 <dataRef name="nonNegativeInteger"/>
</content>
    
Schema Declaration
element maxKwicsToShow { xsd:nonNegativeInteger }

Appendix A.1.13 <minWordLength>

<minWordLength> (The minimum length of a term to be indexed. Default is 3 characters.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD nonNegativeInteger
Note

<minWordLength> specifies the minimum length in characters of a sequence of text that will be considered to be a word worth indexing. The default is 3, on the basis that in most European languages, words of one or two letters are typically not worth indexing, being articles, prepositions and so on. If you set this to a lower limit for reasons specific to your project, you should ensure that your stopword list excludes any very common words that would otherwise make the indexing process lengthy and increase the index size.

Content model
<content>
 <dataRef name="nonNegativeInteger"/>
</content>
    
Schema Declaration
element minWordLength { xsd:nonNegativeInteger }

Appendix A.1.14 <outputFolder>

<outputFolder> (The name of the output folder into which the index data and JavaScript will be placed in the site search. This should conform with the XML Name specification.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD NCName
Note

When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that their files are kept in different locations.
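
For the two-search case, each configuration file would name its own folder, as in this sketch. The folder names are hypothetical; any value that is a valid NCName should work.

```xml
<!-- Fragment from a first configuration file -->
<outputFolder>staticSearchMain</outputFolder>

<!-- Fragment from a second configuration file, for a separate search page -->
<outputFolder>staticSearchPoems</outputFolder>
```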

Content model
<content>
 <dataRef name="NCName"/>
</content>
    
Schema Declaration
element outputFolder { xsd:NCName }

Appendix A.1.15 <params>

<params> (Element containing most of the settings which enable the Generator to find the target website content and process it appropriately.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
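May contain
ss: createContexts dictionaryFile indentJSON kwicTruncateString linkToFragmentId maxKwicsToHarvest maxKwicsToShow minWordLength outputFolder phrasalSearch recurse resultsLimit resultsPerPage scoringAlgorithm scrollToTextFragment searchFile stemmerFolder stopwordsFile totalKwicLength verbose versionFile wildcardSearch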
Content model
<content>
 <elementRef key="searchFile"/>
 <elementRef key="versionFile"
  minOccurs="0"/>
 <elementRef key="stemmerFolder"
  minOccurs="0"/>
 <elementRef key="recurse"/>
 <elementRef key="linkToFragmentId"
  minOccurs="0"/>
 <elementRef key="minWordLength"
  minOccurs="0"/>
 <elementRef key="scrollToTextFragment"
  minOccurs="0"/>
 <elementRef key="scoringAlgorithm"
  minOccurs="0"/>
 <elementRef key="phrasalSearch"
  minOccurs="0"/>
 <elementRef key="wildcardSearch"
  minOccurs="0"/>
 <elementRef key="createContexts"
  minOccurs="0"/>
 <elementRef key="maxKwicsToHarvest"
  minOccurs="0"/>
 <elementRef key="maxKwicsToShow"
  minOccurs="0"/>
 <elementRef key="totalKwicLength"
  minOccurs="0"/>
 <elementRef key="kwicTruncateString"
  minOccurs="0"/>
 <elementRef key="verbose" minOccurs="0"/>
 <elementRef key="stopwordsFile"
  minOccurs="0"/>
 <elementRef key="dictionaryFile"
  minOccurs="0"/>
 <elementRef key="indentJSON" minOccurs="0"/>
 <elementRef key="outputFolder"
  minOccurs="0"/>
 <elementRef key="resultsPerPage"
  minOccurs="0"/>
 <elementRef key="resultsLimit"
  minOccurs="0"/>
</content>
    
Schema Declaration
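element params
{
   searchFile,
   versionFile?,
   stemmerFolder?,
   recurse,
   linkToFragmentId?,
   minWordLength?,
   scrollToTextFragment?,
   scoringAlgorithm?,
   phrasalSearch?,
   wildcardSearch?,
   createContexts?,
   maxKwicsToHarvest?,
   maxKwicsToShow?,
   totalKwicLength?,
   kwicTruncateString?,
   verbose?,
   stopwordsFile?,
   dictionaryFile?,
   indentJSON?,
   outputFolder?,
   resultsPerPage?,
   resultsLimit?
}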

Appendix A.1.16 <phrasalSearch>

<phrasalSearch> (Whether or not to support phrasal searches. If this is true, then the maxKwicsToHarvest setting will be ignored, because all contexts are required to properly support phrasal search.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

Phrasal search functionality enables your users to search for specific phrases by surrounding them with quotation marks ("), as in many search engines. In order to support this kind of search, <createContexts> must also be set to true as we store contexts for all hits for each token in each document. Turning this on will make the index larger, because all contexts must be stored, but once the index is built, it has very little impact on the speed of searches, so we recommend turning this on. The default value is true.

However, if your site is very large and your user base is unlikely to use phrasal searching, it may not be worth the additional build time and increased index size.

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element phrasalSearch { xsd:boolean }

Appendix A.1.17 <recurse>

<recurse> (Whether to recurse into subdirectories of the collection directory or not.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element recurse { xsd:boolean }

Appendix A.1.18 <resultsLimit>

<resultsLimit> (The maximum number of results that can be returned for any search before returning an error; if the number of documents in a result set exceeds this number, then staticSearch will not render the results and will provide a message saying that the search returned too many results. The default is 2000, but you may want to set a higher or lower limit, depending on the specific structure of your project.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD nonNegativeInteger
Note

This configuration option is meant to prevent errors for sites where a given set of filters or search terms can return a set of documents large enough to cause a browser's rendering engine to fail. For smaller collections, it's unlikely that this limit would ever be reached, but setting a limit may be helpful for large document collections, projects that want to constrain the number of possible results, or projects with memory-intensive or complex rendering.

Content model
<content>
 <dataRef name="nonNegativeInteger"/>
</content>
    
Schema Declaration
element resultsLimit { xsd:nonNegativeInteger }

Appendix A.1.19 <resultsPerPage>

<resultsPerPage> (The maximum number of document results to be displayed per page. All results are displayed by default; setting resultsPerPage to a positive integer creates a Show More/Show All widget at the bottom of the batch of results.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD nonNegativeInteger
Note

For most sites, where the number of results is likely to be in the low thousands, it's perfectly practical to show all the results at once, because the staticSearch processor is so fast. However, if you have tens of thousands of documents, and it's possible that users will do (for example) filter-only searches that retrieve a large proportion of them, you can constrain the number of results which are shown initially using this setting. All the results are still generated and output to the page, but since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, the browser will render them much more quickly.
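
For a large collection, the two result-related settings might be used together as in this sketch. The values 50 and 5000 are illustrative assumptions, not recommendations; the required <searchFile> and <recurse> settings are omitted for brevity.

```xml
<params>
  <!-- searchFile and recurse omitted for brevity -->
  <!-- Show 50 documents at first, with Show More/Show All controls -->
  <resultsPerPage>50</resultsPerPage>
  <!-- Refuse to render result sets larger than 5000 documents -->
  <resultsLimit>5000</resultsLimit>
</params>
```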

Content model
<content>
 <dataRef name="nonNegativeInteger"/>
</content>
    
Schema Declaration
element resultsPerPage { xsd:nonNegativeInteger }

Appendix A.1.20 <rule>

<rule> (A rule that specifies a document path as XPath in the match attribute, and provides weighting for search terms found in that context.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Attributes att.match (@match)
weight (The weighting to give to a search token found in the context specified by the match attribute. Set to 0 to completely suppress indexing for a specific context, or greater than 1 to give stronger weighting.)
Status Required
Datatype nonNegativeInteger
Contained by
ss: rules
May contain Empty element
Note

The rule element is used to identify nodes in the XHTML document collection which should be treated in a special manner when indexed; either they might be ignored (if weight=0), or any words found in them might be given greater weight than words in normal contexts (if weight>1). Words appearing in headings or titles, for example, might be weighted more heavily, while navigation menus, banners, or footers might be ignored completely.
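
The heading and navigation cases mentioned above might be expressed as in the following sketch. The element names in the match patterns and the weight of 2 are illustrative assumptions about a typical site.

```xml
<rules>
  <!-- Give words in headings and titles double weight -->
  <rule match="h1 | h2 | title" weight="2"/>
  <!-- Do not index navigation menus or footers at all -->
  <rule match="nav | footer" weight="0"/>
</rules>
```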

Content model
<content>
 <empty/>
</content>
    
Schema Declaration
element rule { att.match.attributes, attribute weight { text }, empty }

Appendix A.1.21 <rules>

<rules> (The set of rules that control weighting of search terms found in specific contexts.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: config
May contain
ss: rule
Content model
<content>
 <elementRef key="rule" minOccurs="1"
  maxOccurs="unbounded"/>
</content>
    
Schema Declaration
element rules { rule+ }

Appendix A.1.22 <scoringAlgorithm>

<scoringAlgorithm> (The scoring algorithm to use for ranking keyword results. Default is "raw" (i.e. weighted counts))
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain Empty element
Note

<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating the score of a term and thus the order in which the results from a search are sorted. There are currently two options:

  • raw: This is the default option (and so does not need to be set explicitly). The raw score is simply the sum of all instances of a term (optionally multiplied by a configured weight via the <rule>/weight configuration) in a document. This will usually provide good results for most document collections.
  • tf-idf: The tf-idf algorithm (term frequency-inverse document frequency) computes the mathematical relevance of a term within a document relative to the rest of the document collection. The staticSearch implementation of tf-idf basically follows the textbook definition of tf-idf:
      tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log( $allDocumentsCount / $docsWithThisTermCount )
    This is fairly crude compared to other search engines, like Lucene, but it may provide useful results for document collections of varying lengths or in instances where the raw score may be insufficient or misleading. There are a number of resources on tf-idf scoring, including: Wikipedia and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
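
As a worked illustration of the tf-idf formula, with hypothetical numbers: suppose a term occurs 3 times in a document of 100 terms, and 10 of the 1000 documents in the collection contain it. The base of the logarithm is not stated here, so base 10 is assumed purely to keep the arithmetic simple; a different base would scale every score by the same constant and leave the ranking unchanged.

```latex
\text{tf-idf}
  = \frac{3}{100} \times \log_{10}\!\left(\frac{1000}{10}\right)
  = 0.03 \times 2
  = 0.06
```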
Content model
<content>
 <valList type="closed">
  <valItem ident="raw">
   <desc>raw score</desc>
   <gloss>Default: Calculate the score based off of the weighted number of
       instances of a term in a text.</gloss>
  </valItem>
  <valItem ident="tf-idf">
   <gloss>Calculate the score based off of the tf-idf scoring algorithm.</gloss>
  </valItem>
 </valList>
</content>
    
Legal values are:
raw
(Default: Calculate the score based off of the weighted number of instances of a term in a text.) raw score
tf-idf
(Calculate the score based off of the tf-idf scoring algorithm.)
Schema Declaration
element scoringAlgorithm { "raw" | "tf-idf" }

Appendix A.1.23 <scrollToTextFragment>

<scrollToTextFragment> (WARNING: Experimental technology. This turns on a feature currently only supported by a subset of browsers, enabling links from keyword-in-context results directly to the specific text string in the target document.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

Google has proposed a browser feature called Text Fragments, which would support a special kind of link that targets a specific string of text inside a page. When clicking on such a link, the browser would scroll to, and then highlight, the target text. This has been implemented in Chrome-based browsers (Chrome, Chromium and Edge) at the time of writing, but other browser producers are sceptical with regard to the specification and worried about possible security implications. The specification is subject to radical change.

<scrollToTextFragment> is a boolean parameter that specifies whether you want to turn on this feature for browsers that support it. It depends on the availability of keyword-in-context strings, so <createContexts> must also be turned on to make it work. The feature is automatically suppressed for browsers which do not support it.

We recommend only using this feature on sites which are in steady development, so that if necessary it can be turned off, or the staticSearch implementation can be updated to take account of changes. For sites intended to remain unchanged or archived for any length of time, this feature should be left turned off. It is off by default. A more reliable alternative is probably to use JavaScript to highlight hits on the target page.

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element scrollToTextFragment { xsd:boolean }

Appendix A.1.24 <searchFile>

<searchFile> (The search file (aka page) that will be the primary access point for the staticSearch. Note that this page must be at the root of the collection directory.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD anyURI
Note

The search page is a regular HTML page which forms part of your site. The only important characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 7.6 Creating a search page.
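
A minimal search page might be sketched as follows. Only the <div> with id=staticSearch is required by the description above; the title, the heading and the rest of the page layout are hypothetical.

```xml
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>Search this site</title>
  </head>
  <body>
    <h1>Search</h1>
    <!-- The staticSearch build process rewrites the contents of this div -->
    <div id="staticSearch"></div>
  </body>
</html>
```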

Content model
<content>
 <dataRef name="anyURI"/>
</content>
    
Schema Declaration
element searchFile { xsd:anyURI }

Appendix A.1.25 <stemmerFolder>

<stemmerFolder> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript and XSLT implementations of stemmers can be found. If left blank, then the staticSearch default English stemmer (en) will be used.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD NCName
Note

The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and an implementation of the French Snowball stemmer. These appear in the /stemmers/en/ and /stemmers/fr/ folders. The default value for this parameter is en; to use the French stemmer, use fr. We will be adding more stemmers as the project develops. However, if your document collection is not English or French, you have a couple of options, one hard and one easy.

  • Hard option: implement your own stemmers. You will need to write two implementations of the stemmer algorithm, one in XSLT (which must be named ssStemmer.xsl) and one in JavaScript (ssStemmer.js), and confirm that they both generate the same results. The XSLT stemmer is used in the generation of the index files at build time, and the JavaScript version is used to stem the user's input in the search page. You can look at the existing implementations in the /stemmers/en/ folder to see how the stemmers need to be constructed. Place your stemmers in a folder called /stemmers/[yourlang]/, and specify yourlang in the configuration file.
  • Easy option: Use the identity stemmer (which is equivalent to turning off stemming completely), and make sure wildcard searching is turned on. Then your users can search using wildcards instead of having their search terms automatically stemmed. To do this, specify the value identity in your configuration file.

Another alternative is the stripDiacritics stemmer. Like the identity stemmer, this is not really a stemmer at all; what it does is to strip out all combining diacritics from tokens. This is a useful approach if your document collection contains texts with accents and diacritics, but your users may be unfamiliar with the use of diacritics and will want to search just with plain unaccented characters. For example, if a text contains the word élève, but you would like searchers to be able to find the word simply by typing the ASCII string eleve, then this is a good option. Combined with wildcards, it can provide a very flexible and user-friendly search engine in the absence of a sophisticated stemmer, or for cases where there are mixed languages so a single stemmer will not do. To use this option, specify the value stripDiacritics in your configuration file.
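
The combination of a non-stemming option with wildcards might be configured as in this sketch. The required <searchFile> setting is omitted for brevity; replace stripDiacritics with identity to turn off token processing entirely.

```xml
<params>
  <!-- searchFile omitted for brevity -->
  <!-- Strip combining diacritics instead of stemming -->
  <stemmerFolder>stripDiacritics</stemmerFolder>
  <recurse>true</recurse>
  <!-- Let users compensate for the lack of stemming with wildcards -->
  <wildcardSearch>true</wildcardSearch>
</params>
```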

Content model
<content>
 <dataRef name="NCName"/>
</content>
    
Schema Declaration
element stemmerFolder { xsd:NCName }

Appendix A.1.26 <stopwordsFile>

<stopwordsFile> (The relative path (from the config file) to a text file containing a list of stopwords (words to be ignored when indexing).)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD anyURI
Note

A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the Web, but it is probably a good idea to take one of these and customize it for your project, since there will be words in the website which are so common that it makes no sense to index them, but they are not ordinary stopwords. For example, in a website dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include it, and searching for it will be pointless. The project has a built-in set of common stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then look at the largest JSON index files that are generated, to see whether the terms they represent might be too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 7.9 Generated report).
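
For the Keats example, the setting might be sketched as follows. The path stopwords/keats_stopwords.txt is hypothetical; it is assumed to be a copy of xsl/english_stopwords.txt with the line keats added.

```xml
<!-- Hypothetical path, relative to the configuration file -->
<stopwordsFile>stopwords/keats_stopwords.txt</stopwordsFile>
```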

Content model
<content>
 <dataRef name="anyURI"/>
</content>
    
Schema Declaration
element stopwordsFile { xsd:anyURI }

Appendix A.1.27 <totalKwicLength>

<totalKwicLength> (If createContexts is set to true, then this parameter controls the length (in words) of the harvested keyword-in-context string.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD nonNegativeInteger
Note

Obviously, the longer the keyword-in-context strings are, the larger the individual index files will be, but the more useful the KWICs will be for users looking at the search results. Note that phrasal searching relies on the KWICs, and thus longer KWICs allow for longer phrasal searches.

Content model
<content>
 <dataRef name="nonNegativeInteger"/>
</content>
    
Schema Declaration
element totalKwicLength { xsd:nonNegativeInteger }

Appendix A.1.28 <verbose>

<verbose> (Turns on more detailed reporting during the indexing process.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

The staticSearch build process can be lengthy, and if something goes wrong it can be difficult to figure out where the problem was. If you're having build problems, turn on verbose messages in the configuration file to help with debugging, and you'll see a lot more output at the command line.

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element verbose { xsd:boolean }

Appendix A.1.29 <versionFile>

<versionFile> (The relative path to a text file containing a single version identifier (such as 1.5, 123456, or 06ad419). This will be used to create unique filenames for JSON resources, so that the browser will not use cached versions of older index files.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD anyURI
Note

<versionFile> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string will be used as part of the filenames for all the JSON resources created for the search. This is useful because it allows the browser to cache such resources when users repeatedly visit the search page, but if the project is rebuilt with a new version, those cached files will not be used because the new version will have different filenames. The path specified is relative to the location of the configuration file (or absolute, if you wish).
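
A version file setting might be sketched as follows. The file name VERSION is a hypothetical choice; the file is assumed to sit next to the configuration file and to contain nothing but a version string such as 1.5.

```xml
<!-- Hypothetical file containing a single version identifier, e.g. 1.5 -->
<versionFile>VERSION</versionFile>
```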

Content model
<content>
 <dataRef name="anyURI"/>
</content>
    
Schema Declaration
element versionFile { xsd:anyURI }

Appendix A.1.30 <wildcardSearch>

<wildcardSearch> (Whether or not to support wildcard searches. Note that wildcard searches are more effective when phrasal searching is also turned on, because the contexts available for phrasal searches are also used to provide wildcard results.)
Namespace http://hcmc.uvic.ca/ns/staticSearch
Module ss — Schema specification and tag documentation
Contained by
ss: params
May contain
XSD boolean
Note

Wildcard searching can coexist with stemmed searching, but it is especially useful when stemming is not available, either because there is no available stemmer for the language of the site, or because the site contains multiple languages. Unless your site is particularly large, we recommend turning on wildcard searching, and therefore also phrasal searching (<phrasalSearch>).
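
Following this recommendation, the three related settings might be turned on together as in this sketch. The required <searchFile> and <recurse> settings are omitted for brevity.

```xml
<params>
  <!-- searchFile and recurse omitted for brevity -->
  <phrasalSearch>true</phrasalSearch>
  <wildcardSearch>true</wildcardSearch>
  <!-- Contexts are needed for both phrasal and wildcard searching -->
  <createContexts>true</createContexts>
</params>
```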

Content model
<content>
 <dataRef name="boolean"/>
</content>
    
Schema Declaration
element wildcardSearch { xsd:boolean }

Appendix A.2 Attribute classes

Appendix A.2.1 att.labelled

att.labelled (A class providing a label attribute that can be used to identify/describe contexts.)
Module ss — Schema specification and tag documentation
Members context
Attributes
label (A string identifier specifying the name for a given context.)
Status Optional
Datatype string
Note

When describing a <context>, the label attribute names a component of the page that can be searched within (see 7.5.5 Specifying searchable contexts (search only in)).

Appendix A.2.2 att.match

att.match (A class providing attributes that enable specification of document locations.)
Module ss — Schema specification and tag documentation
Members context exclude rule
Attributes
match (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.)
Status Required
Datatype string
Martin Holmes and Joey Takeda. Date: 2019-2022