This documentation explains how to use the Project Endings staticSearch Generator to add a fully functional search ‘engine’ to your website without any dependency on server-side code such as a database.
| <config> (The root element for the Search Generator configuration file.) | |||||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||||
| Attributes |
|
||||||||||||
| Contained by |
—
|
||||||||||||
| May contain | |||||||||||||
| Content model |
<content>
 <sequence minOccurs="1" maxOccurs="1">
  <elementRef key="params"/>
  <elementRef key="rules" minOccurs="0"/>
  <elementRef key="contexts" minOccurs="0"/>
  <elementRef key="excludes" minOccurs="0"/>
  <elementRef key="filters" minOccurs="0"/>
 </sequence>
</content>
|
||||||||||||
| Schema Declaration |
element config
{
   attribute version { text }?,
   ( params, rules?, contexts?, excludes?, filters? )
}
|
||||||||||||
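As a rough illustration of the content model above, a minimal configuration file might look like the following sketch. The values search.html and staticSearch are placeholders chosen for this example, and the attribute names are taken from the schema declarations elsewhere in this documentation, so check them against the schema for the version you are using.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch: only the required params child is shown in full. -->
<config xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <params>
    <!-- Placeholder path to the search page at the root of the collection -->
    <searchPage file="search.html"/>
    <!-- Placeholder name for the folder that receives the index and JavaScript -->
    <output dir="staticSearch"/>
  </params>
  <!-- Optional children, if present, follow in this order:
       rules, contexts, excludes, filters -->
</config>
```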
| <context> (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.) | |||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||
| Module | ss — Schema specification and tag documentation | ||||||
| Attributes |
|
||||||
| Contained by |
ss: contexts
|
||||||
| May contain | Empty element | ||||||
| Note |
When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense approach based on common element definitions, so that for example when it reaches the end of a paragraph, it will not continue into the next paragraph to get more context words. You may have special runs of text in your document collection which do not appear to be bounding contexts, but actually are; for example, you may have span elements with class=note that appear in the middle of sentences but are not actually part of them. Use <context> elements to identify these special contexts so that the indexer knows the right boundaries from which to retrieve its keyword-in-context strings. |
||||||
| Schematron |
<sch:ns uri="http://hcmc.uvic.ca/ns/staticSearch" prefix="ss"/>
<sch:pattern>
 <sch:rule context="ss:context">
  <sch:assert test="not(@label and @context = 'false')">ERROR: If a context has a label, it must be a context for the purposes of indexing.</sch:assert>
 </sch:rule>
</sch:pattern>
|
||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||
| Schema Declaration |
element context
{
   att.match.attributes,
   att.labelled.attributes,
   attribute context { text }?,
   empty
}
|
||||||
| <contexts> (The set of context elements that identify contexts for keyword-in-context fragments.) | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Contained by |
ss: config
|
| May contain |
ss: context
|
| Content model |
<content>
 <elementRef key="context" minOccurs="0" maxOccurs="unbounded"/>
</content>
|
| Schema Declaration |
element contexts { context* }
|
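The span example from the <context> note could be expressed along the following lines. This is only a sketch: the XPath expressions, the label text, and the second entry are illustrative assumptions rather than values from a real project, and the reading of context="false" (the matched element does not bound a keyword-in-context string) is inferred from the Schematron rule above.

```xml
<!-- Assumes the staticSearch namespace is declared on the enclosing config element. -->
<contexts>
  <!-- Notes that interrupt a sentence become their own bounding context;
       a label is allowed because this entry is a context. -->
  <context match="span[@class='note']" label="Note"/>
  <!-- Hypothetical entry marked as not being a context. Adding a label
       here would violate the Schematron constraint on context. -->
  <context match="div[@class='inline']" context="false"/>
</contexts>
```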
| <createContexts> (Whether to include keyword-in-context extracts in the index.) | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Attributes |
|
| Contained by |
ss: params
|
| May contain | Empty element |
| Note |
Note that contexts are necessary for phrasal searching or wildcard searching. |
| Content model |
<content>
 <empty/>
</content>
|
| Schema Declaration |
element createContexts
{
   (
      ( attribute create { "false" }? )
    | (
         attribute create { "true" }?,
         attribute phrasalSearch { text }?,
         attribute wildcardSearch { text }?,
         attribute maxKwicsToHarvest { text }?,
         attribute maxKwicLength { text }?,
         attribute kwicTruncateString { text }?
      )
   ),
   empty
}
|
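The schema declaration above allows the additional attributes only alongside create="true" (or when create is omitted); create="false" stands alone. A sketch of an element using them follows. The attribute names come from the declaration, but every value shown is an assumption chosen for illustration, since the declaration types them only as text.

```xml
<!-- Contexts switched on, with illustrative (assumed) settings. -->
<createContexts create="true"
                phrasalSearch="true"
                wildcardSearch="true"
                maxKwicsToHarvest="5"
                maxKwicLength="10"
                kwicTruncateString="..."/>
```

Because contexts are needed for phrasal and wildcard searching, a configuration with create="false" would not be expected to support either kind of search.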
| <dictionary> (Specifies a dictionary against which tokens may be checked during indexing.) | |||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||
| Attributes |
|
||||||||||
| Contained by |
ss: params
|
||||||||||
| May contain | Empty element | ||||||||||
| Note |
The indexing process checks each word as it builds the index, and keeps a record of all words which are not found in the configured dictionary. Though this has no direct effect on the indexing process, all words not found in the dictionary are listed in the staticSearch report (see 8.9 Generated report). This can be very useful: each word listed is either foreign (not part of the language of the dictionary) or perhaps misspelled (in which case it may not be correctly stemmed and indexed, and should be corrected). staticSearch provides a default dictionary in xsl/english_words.txt that can be copied and adapted if you are working in English; many dictionaries for other languages are available on the Web. |
||||||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||||||
| Schema Declaration |
element dictionary { attribute file { text }, empty }
|
||||||||||
| <exclude> (An exclusion definition, which excludes either documents or filters as defined by an XPath in the match attribute.) | |||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||
| Module | ss — Schema specification and tag documentation | ||||||
| Attributes |
|
||||||
| Contained by |
ss: excludes
|
||||||
| May contain | Empty element | ||||||
| Note |
<exclude> can be used to identify documents or parts of documents that are to be omitted from indexing, but, unlike setting the weight of a <rule> to zero, should be retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for sites with multiple search engines/pages that each have their own special features; in this case, you may want one specific search index/page to ignore filter controls (HTML <meta> elements, as described in 5 Search facet features) which are provided to support other search pages. |
||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||
| Schema Declaration |
element exclude
{
   att.match.attributes,
   attribute type { "index" | "filter" },
   empty
}
|
||||||
| <excludes> (The set of exclusions, expressed as exclude elements, that control the subset of documents or filters used for a particular search.) | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Contained by |
ss: config
|
| May contain |
ss: exclude
|
| Content model |
<content>
 <elementRef key="exclude" minOccurs="1" maxOccurs="unbounded"/>
</content>
|
| Schema Declaration |
element excludes { exclude+ }
|
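The two values of the type attribute correspond to the two uses described in the <exclude> note. The following sketch shows one of each; the XPath expressions and the filter name are hypothetical and would need to match the markup of your own pages.

```xml
<!-- Assumes the staticSearch namespace is declared on the enclosing config element. -->
<excludes>
  <!-- Hypothetical: omit the text of one page from the index -->
  <exclude type="index" match="html[@id='index']"/>
  <!-- Hypothetical: ignore a filter control that exists only to
       support a different search page on the same site -->
  <exclude type="filter" match="meta[@name='Document type']"/>
</excludes>
```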
| <filter> (Allows specification of a custom label for a filter on the search page) Filters are identified through plain-text labels defined as XHTML meta/@name attributes in the document headers. This element allows a user to specify a richer label for a particular filter by using HTML code. Multiple labels may be specified in different languages. The filterName attribute must be supplied, and must be identical to the XHTML meta/@name attribute for the filter as it appears in the HTML document heads. | |||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||
| Module | ss — Schema specification and tag documentation | ||||||
| Attributes |
|
||||||
| Contained by |
ss: filters
|
||||||
| May contain |
ssHTML: span
|
||||||
| Content model |
<content>
 <elementRef key="span" minOccurs="1" maxOccurs="unbounded"/>
</content>
|
||||||
| Schema Declaration |
element filter { attribute filterName { text }, span+ }
|
||||||
| <filters> (Wrapper for filter elements in the configuration file) The optional filters/filter part of the configuration file allows the user to specify custom labels with HTML markup for specific filters, so they are not limited by the label supplied in the XHTML meta/@name attribute. | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Contained by |
ss: config
|
| May contain |
ss: filter
|
| Content model |
<content>
 <elementRef key="filter" minOccurs="1" maxOccurs="unbounded"/>
</content>
|
| Schema Declaration |
element filters { filter+ }
|
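A richer, multilingual label for a filter might be supplied as in the sketch below. The filter name "Document type" and the label text are invented for this example; the filterName value would have to be identical to the meta/@name value used in your document heads. The span elements are placed in the XHTML namespace, as stated in the specification of span.

```xml
<!-- Assumes the staticSearch namespace is declared on the enclosing config element. -->
<filters>
  <filter filterName="Document type">
    <!-- One span per language; arbitrary HTML markup is allowed inside -->
    <span xmlns="http://www.w3.org/1999/xhtml" lang="en">Document <em>type</em></span>
    <span xmlns="http://www.w3.org/1999/xhtml" lang="fr">Type de <em>document</em></span>
  </filter>
</filters>
```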
| <index> (Configures options relating to indexing.) | |||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||
| Attributes |
|
||||||||||
| Contained by |
ss: params
|
||||||||||
| May contain | Empty element | ||||||||||
| <output> (Sets the folder into which the index data and JavaScript will be placed.) | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Attributes | |
| Contained by |
ss: params
|
| May contain | Empty element |
| Note |
When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that their files are kept in different locations. |
| Content model |
<content>
 <empty/>
</content>
|
| Schema Declaration |
element output { attribute dir { text }, empty }
|
| <params> (Element containing most of the settings which enable the Generator to find the target website content and process it appropriately.) | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Contained by |
ss: config
|
| May contain | |
| Content model |
<content>
 <elementRef key="searchPage"/>
 <elementRef key="index" minOccurs="0"/>
 <elementRef key="stopwords" minOccurs="0"/>
 <elementRef key="dictionary" minOccurs="0"/>
 <elementRef key="tokenizer" minOccurs="0"/>
 <elementRef key="scoringAlgorithm" minOccurs="0"/>
 <elementRef key="stemmer" minOccurs="0"/>
 <elementRef key="createContexts" minOccurs="0"/>
 <elementRef key="results" minOccurs="0"/>
 <elementRef key="version" minOccurs="0"/>
 <elementRef key="output"/>
</content>
|
| Schema Declaration |
element params { }
|
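Since the schema declaration above is empty, the content model is the better guide to what <params> may contain. The sketch below lists the children in the order the content model gives them, on the assumption that this order is significant. All file and folder names are placeholders, and <index>, <tokenizer> and <results> are left out because their attributes are not listed in this section.

```xml
<!-- Assumes the staticSearch namespace is declared on the enclosing config element. -->
<params>
  <searchPage file="search.html"/>            <!-- required; placeholder path -->
  <stopwords file="my_stopwords.txt"/>        <!-- placeholder path -->
  <dictionary file="my_dictionary.txt"/>      <!-- placeholder path -->
  <scoringAlgorithm name="BM25"/>             <!-- one of raw, tf-idf, BM25, BM25L -->
  <stemmer dir="stemmers/en/"/>
  <createContexts create="true"/>
  <version file="VERSION"/>                   <!-- placeholder path -->
  <output dir="staticSearch"/>                <!-- required -->
</params>
```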
| <results> (Controls the configuration of the results page.) | |||||||||||||||||||||||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||||||||||||||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||||||||||||||||||||||
| Attributes |
|
||||||||||||||||||||||||||||||
| Contained by |
ss: params
|
||||||||||||||||||||||||||||||
| May contain | Empty element | ||||||||||||||||||||||||||||||
| <rule> (A rule that specifies a document path as XPath in the match attribute, and provides weighting for search terms found in that context.) | |||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||
| Module | ss — Schema specification and tag documentation | ||||||
| Attributes |
|
||||||
| Contained by |
ss: rules
|
||||||
| May contain | Empty element | ||||||
| Note |
The rule element is used to identify nodes in the XHTML document collection which should be treated in a special manner when indexed; either they might be ignored (if weight=0), or any words found in them might be given greater weight than words in normal contexts (if weight>1). Words appearing in headings or titles, for example, might be weighted more heavily, while navigation menus, banners, or footers might be ignored completely. |
||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||
| Schema Declaration |
element rule { att.match.attributes, attribute weight { text }, empty }
|
||||||
| <rules> (The set of rules that control weighting of search terms found in specific contexts.) | |
| Namespace | http://hcmc.uvic.ca/ns/staticSearch |
| Module | ss — Schema specification and tag documentation |
| Contained by |
ss: config
|
| May contain |
ss: rule
|
| Content model |
<content>
 <elementRef key="rule" minOccurs="1" maxOccurs="unbounded"/>
</content>
|
| Schema Declaration |
element rules { rule+ }
|
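The weighting cases mentioned in the <rule> note (headings and titles weighted up, navigation and footers ignored) might be written as in the following sketch. The XPath expressions and the weight of 2 are assumptions for illustration only.

```xml
<!-- Assumes the staticSearch namespace is declared on the enclosing config element. -->
<rules>
  <!-- Words in titles and top-level headings count for more (weight > 1) -->
  <rule match="title | h1" weight="2"/>
  <!-- Words in navigation menus and footers are ignored (weight = 0) -->
  <rule match="nav | footer" weight="0"/>
</rules>
```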
| <scoringAlgorithm> (The scoring algorithm to use for ranking keyword results.) | |||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||
| Module | ss — Schema specification and tag documentation | ||||||
| Attributes |
|
||||||
| Contained by |
ss: params
|
||||||
| May contain | Empty element | ||||||
| Note |
<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating the score of a term and thus the order in which the results from a search are sorted. |
||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||
| Schema Declaration |
element scoringAlgorithm
{
   attribute name { "raw" | "tf-idf" | "BM25" | "BM25L" }?,
   empty
}
|
||||||
| <searchPage> (The search page that will be the primary access point for staticSearch. This page may or may not exist, but its location is used for determining the collection that will be indexed, so it must be at the root of the collection directory.) | |||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||
| Module | ss — Schema specification and tag documentation | ||||||||
| Attributes |
|
||||||||
| Contained by |
ss: params
|
||||||||
| May contain | Empty element | ||||||||
| Note |
The search page is a regular HTML page which forms part of your site. The only important characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page. |
||||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||||
| Schema Declaration |
element searchPage { attribute file { text }, empty }
|
||||||||
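A search page meeting the one requirement stated in the note could be as small as the following sketch. Everything apart from the <div> with id="staticSearch" is placeholder content invented for this example.

```xml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>Search</title>
  </head>
  <body>
    <h1>Search this site</h1>
    <!-- The staticSearch build process rewrites the contents of this div -->
    <div id="staticSearch">
      <p>The search controls will appear here after the build.</p>
    </div>
  </body>
</html>
```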
| <span> (An HTML span) A standard HTML span element, within which other HTML content can be included. We do not prescribe or validate the content model; see e.g. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/span for information. | |||||||
| Namespace | http://www.w3.org/1999/xhtml | ||||||
| Module | ssHTML — Schema specification and tag documentation | ||||||
| Attributes |
|
||||||
| Contained by |
ss: filter
|
||||||
| May contain | ANY | ||||||
| Content model |
<content>
 <alternate minOccurs="1" maxOccurs="unbounded">
  <anyElement require="http://www.w3.org/1999/xhtml" minOccurs="0" maxOccurs="unbounded"/>
  <textNode/>
 </alternate>
</content>
|
||||||
| Schema Declaration |
element span { attribute lang { text }?, ( anyElement_span_1* | text )+ }
|
||||||
| <stemmer> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript and XSLT implementations of stemmers can be found. If not specified, then the staticSearch default English stemmer (en) will be used.) | |||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||
| Attributes |
|
||||||||||
| Contained by |
ss: params
|
||||||||||
| May contain | Empty element | ||||||||||
| Note |
The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and an implementation of the French Snowball stemmer. These appear in the /stemmers/en/ and /stemmers/fr/ folders. The default value for this parameter is en; to use the French stemmer, use fr. We will be adding more stemmers as the project develops. However, if your document collection is not English or French, you have a couple of options, one hard and one easy.
Another alternative is the stripDiacritics stemmer. Like the identity stemmer, this is not really a stemmer at all; what it does is to strip out all combining diacritics from tokens. This is a useful approach if your document collection contains texts with accents and diacritics, but your users may be unfamiliar with the use of diacritics and will want to search just with plain unaccented characters. For example, if a text contains the word élève, but you would like searchers to be able to find the word simply by typing the ASCII string eleve, then this is a good option. Combined with wildcards, it can provide a very flexible and user-friendly search engine in the absence of a sophisticated stemmer, or for cases where there are mixed languages so a single stemmer will not do. To use this option, specify the value stripDiacritics in your configuration file. |
||||||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||||||
| Schema Declaration |
element stemmer
{
   attribute dir
   {
      "stemmers/en/"
    | "stemmers/fr/"
    | "stemmers/identity"
    | "stemmers/stripDiacritics"
   },
   empty
}
|
||||||||||
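Selecting the stripDiacritics option might look like the sketch below. Note that the note above refers to the values by short name (en, fr, stripDiacritics), whereas the schema declaration lists path-style values; this example follows the schema declaration, which is an assumption about which form your version of staticSearch expects.

```xml
<!-- Value copied from the list in the schema declaration above -->
<stemmer dir="stemmers/stripDiacritics"/>
```

With this setting, a search for eleve would be expected to find élève, as described in the note.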
| <stopwords> (Specifies a list of stopwords--that is, words to be ignored when indexing.) | |||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||
| Attributes |
|
||||||||||
| Contained by |
ss: params
|
||||||||||
| May contain | Empty element | ||||||||||
| Note |
A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the Web, but it is probably a good idea to take one of these and customize it for your project, since there will be words in the website which are so common that it makes no sense to index them, but they are not ordinary stopwords. For example, in a website dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include it, and searching for it will be pointless. staticSearch provides a default set of common stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then look for the largest JSON index files that are generated, to see whether the words they represent might be too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report). |
||||||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||||||
| Schema Declaration |
element stopwords { attribute file { text }, empty }
|
||||||||||
| <tokenizer> (Configures options for the tokenizing process.) | |||||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||
| Attributes |
|
||||||||||
| Contained by |
ss: params
|
||||||||||
| May contain | Empty element | ||||||||||
| <version> (Specifies the unique version to append to the index, so that the browser will not use cached versions of older index files.) | |||||||||
| Namespace | http://hcmc.uvic.ca/ns/staticSearch | ||||||||
| Module | ss — Schema specification and tag documentation | ||||||||
| Attributes |
|
||||||||
| Contained by |
ss: params
|
||||||||
| May contain | Empty element | ||||||||
| Note |
<version> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string will be used as part of the filenames for all the JSON resources created for the search. This is useful because it allows the browser to cache such resources when users repeatedly visit the search page, but if the project is rebuilt with a new version, those cached files will not be used because the new version will have different filenames. The path specified is relative to the location of the configuration file (or absolute, if you wish). |
||||||||
| Content model |
<content>
 <empty/>
</content>
|
||||||||
| Schema Declaration |
element version { attribute file { text }, empty }
|
||||||||
| att.labelled (A class providing a label attribute that can be used to identify/describe contexts.) | |||||||||
| Module | ss — Schema specification and tag documentation | ||||||||
| Members | context | ||||||||
| Attributes |
|
||||||||
| att.match (A class providing attributes that enable specification of document locations.) | |||||||
| Module | ss — Schema specification and tag documentation | ||||||
| Members | context exclude rule | ||||||
| Attributes |
|
||||||
| ssdata.boolean (Custom boolean datatype, which restricts boolean values to true or false for ease of processing.) | |
| Module | ss — Schema specification and tag documentation |
| Used by |
Element:
|
| Content model |
<content>
 <valList>
  <valItem ident="true"/>
  <valItem ident="false"/>
 </valList>
</content>
|
| Declaration |
ssdata.boolean = "true" | "false" |