This documentation provides instructions on how to use the Project Endings staticSearch Generator to provide a fully-functional search ‘engine’ to your website without any dependency on server-side code such as a database.
First, you will have to make sure your site pages are correctly configured so that the Generator can parse them. Next, you will create a configuration file specifying the options you want to use. Finally, you run the Generator, and the search functionality will be added to your site.
The generator is expecting to parse well-formed XHTML5 web pages. That means web pages which are well-formed XML, using the XHTML namespace. If your site is just raggedy tag-soup, then you can't use this tool. You can tidy up your HTML using HTML Tidy.
Next, you will need to decide whether you want search filters or not. If you want to allow your users to search (for example) only in poems, or only in articles, or only in blog posts, or any combination of these document types, you will need to add <meta> tags to the heads of your documents to specify what these filters are. staticSearch supports four filter types.
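For example, a document's <head> might declare filters like this. This is a sketch: the staticSearch_* class values follow the project's naming convention but should be checked against the filter documentation, and the name and content values are purely illustrative:

```html
<head>
  <title>Ode to a Nightingale</title>
  <!-- Each meta element supplies one filter value for this document:
       the class identifies the filter type, the name identifies the
       filter itself, and content gives this document's value. -->
  <meta name="Document type" class="staticSearch_desc" content="Poems"/>
  <meta name="Date of publication" class="staticSearch_date" content="1819"/>
  <meta name="Peer-reviewed" class="staticSearch_bool" content="false"/>
</head>
```

Documents which share the same filter name will be grouped under the same control on the generated search page.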
Note that feature filters have one drawback: they slow down the initialization of the search page noticeably. In order to provide the typeahead/drop-down functionality, the JSON for each feature filter will need to be retrieved as the page loads, so if you have many feature filters, or particularly large ones, the user may have to wait a few seconds before the page becomes fully functional.
The configuration file is an XML document which tells the Generator where to find your site, and what search features you would like to include. The configuration file conforms to a schema which is documented here.
There are three main sections of the configuration file:
Only the <params> element is required, but, as we discuss shortly, we strongly recommend taking advantage of <rules> (see 8.5.4 Specifying rules (optional)) and <contexts> (8.5.5 Specifying contexts (optional)) for the best results.
For examples of full configuration files, see the staticSearch GitHub repository as well as the list of projects in 13 Projects using staticSearch, which provides a link to each site’s configuration file.
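As a minimal sketch of the overall shape (the namespace URI is an assumption to verify against the schema, and the paths and match patterns are illustrative):

```xml
<config version="1" xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <params>
    <!-- Resolved relative to this configuration file. -->
    <searchPage file="site/search.html"/>
  </params>
  <rules>
    <!-- Suppress indexing of navigation; boost headings. -->
    <rule match="nav" weight="0"/>
    <rule match="h1 | h2" weight="2"/>
  </rules>
  <contexts>
    <context match="span[@class='note']"/>
  </contexts>
</config>
```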
| version | (specifies the major version of staticSearch to which this configuration file corresponds. If this attribute is not used, the configuration file is assumed to have a version value of 1.) |
The <params> element has only one required element, which is used for determining the resource collection that you wish to index:
| file | (The path (relative to the config file) to the search page.) |
The search page is a regular HTML page which forms part of your site. The only important characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page.
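A minimal search page might look like the following sketch; the only requirement stated above is the <div> with id="staticSearch", whose contents the build process rewrites, and the page must be well-formed XHTML5 like the rest of the collection:

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Search</title>
  </head>
  <body>
    <h1>Search this site</h1>
    <div id="staticSearch">
      <!-- The staticSearch build replaces the contents of this div. -->
    </div>
  </body>
</html>
```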
The <searchPage> element's file attribute specifies a relative URI (resolved, like all URIs specified in the config file, against the configuration file location) that points directly to the search page that will be the primary access point for the search. Since the search file must be at the root of the directory that you wish to index (i.e. the directory that contains all of the XHTML you want the search to index), the searchPage parameter provides the necessary information for knowing what document collection to index and where to put the output JSON. In other words, in specifying the location of your search page, you are also specifying the location of your document collection. See Creating a search page for more information on how to configure this file.
Note that all output files will be in a directory that is a sibling to the search page. For instance, in a document collection that looks something like:
The collection of Javascript and JSON files will be in a directory like so:
The following parameters are optional, but most projects will want to specify some of them:
| recurse | (Determines whether or not to recurse into the subdirectories of the collection and index those files.) |
| file | (The path (relative to the config file) to a text file containing a list of words to be ignored by the indexer (one word per line).) |
A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the Web, but it is probably a good idea to take one of these and customize it for your project, since there will be words in the website which are so common that it makes no sense to index them, but they are not ordinary stopwords. For example, in a website dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include it, and searching for it will be pointless. staticSearch provides a default set of common stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then search for the largest JSON index files that are generated, to see if they might be too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report).
| file | (The relative path (from the config file) to a dictionary file (one word per line).) |
The indexing process checks each word as it builds the index, and keeps a record of all words which are not found in the configured dictionary. Though this does not have any direct effect on the indexing process, all words not found in the dictionary are listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language of the dictionary) or perhaps misspelled (in which case they may not be correctly stemmed and indexed, and should be corrected). staticSearch provides a default dictionary in xsl/english_words.txt that can be copied and adapted if working in English; lots of dictionaries for other languages are available on the Web.
| minWordLength | (Specifies the minimum length in characters of a sequence of text that will be considered to be a word worth indexing.) |
| name | (Specifies the name of the scoring algorithm to use.) |
<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating the score of a term and thus the order in which the results from a search are sorted.
| dir | (The path (relative to the config file) of the directory to use for stemming.) |
The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and an implementation of the French Snowball stemmer. These appear in the /stemmers/en/ and /stemmers/fr/ folders. The default value for this parameter is en; to use the French stemmer, use fr. We will be adding more stemmers as the project develops. However, if your document collection is not English or French, you have a couple of options, one hard and one easy.
Another alternative is the stripDiacritics stemmer. Like the identity stemmer, this is not really a stemmer at all; what it does is strip out all combining diacritics from tokens. This is a useful approach if your document collection contains texts with accents and diacritics, but your users may be unfamiliar with the use of diacritics and will want to search just with plain unaccented characters. For example, if a text contains the word élève, but you would like searchers to be able to find the word simply by typing the ASCII string eleve, then this is a good option. Combined with wildcards, it can provide a very flexible and user-friendly search engine in the absence of a sophisticated stemmer, or for cases where there are mixed languages so a single stemmer will not do. To use this option, specify the value stripDiacritics in your configuration file.
| create | Specifies whether the indexer stores keyword-in-context extracts for each hit in a document. |
| phrasalSearch | (Whether or not to support phrasal searches. If this is true, then the maxContexts setting will be ignored, because all contexts are required to properly support phrasal search.) |
| wildcardSearch | (Whether or not to support wildcard searches.) |
| maxKwicsToHarvest | (Controls the number of keyword-in-context extracts that will be harvested from the data for each term in a document.) |
| maxKwicLength | (Sets the maximum length (in words) of a keyword-in-context result.) |
| kwicTruncateString | (The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context extract. Conventionally three periods, or an ellipsis character (which is the default value).) |
Note that contexts are necessary for phrasal searching or wildcard searching.
| resultsPerPage | (The maximum number of document results to be displayed per page. All results are displayed by default; setting resultsPerPage to a positive integer creates a Show More/Show All widget at the bottom of the batch of results.) |
| maxKwicsToShow | (Controls the maximum number of keyword-in-context extracts that will be shown in the search page for each hit document returned.) |
| file | (The path (relative to the config file) to a text file containing a single version identifier (such as 1.5, 123456, or 06ad419).) |
<version> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string will be used as part of the filenames for all the JSON resources created for the search. This is useful because it allows the browser to cache such resources when users repeatedly visit the search page, but if the project is rebuilt with a new version, those cached files will not be used because the new version will have different filenames. The path specified is relative to the location of the configuration file (or absolute, if you wish).
| dir | (A pointer to a local directory.) |
When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that their files are kept in different locations.
| match [att.match] | (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.) |
| weight | (The weighting to give to a search token found in the context specified by the match attribute. Set to 0 to completely suppress indexing for a specific context, or greater than 1 to give stronger weighting.) |
The <rule> element is used to identify nodes in the XHTML document collection which should be treated in a special manner when indexed: either they might be ignored (if weight is 0), or any words found in them might be given greater weight than words in normal contexts (if weight is greater than 1). Words appearing in headings or titles, for example, might be weighted more heavily, while navigation menus, banners, or footers might be ignored completely.
The value of the match attribute is transformed into an XSLT template match attribute, and thus must follow the same rules (i.e. no complex patterns like p/ancestor::div). See the W3C XSLT Specification for further details on allowable patterns.
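A sketch of two such rules, illustrating the weight semantics described above (the match patterns are illustrative; the container element should be checked against the schema):

```xml
<rules>
  <!-- Never index navigation or footers. -->
  <rule match="nav | footer" weight="0"/>
  <!-- Weight words in headings more heavily than ordinary text. -->
  <rule match="h1 | h2 | h3" weight="2"/>
</rules>
```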
Note that the indexer does not tokenize any content in the <head> of the document (but as noted in 8.1 Configuring your site: search filters, metadata can be configured into filters) and that all elements in the <body> of a document are considered tokenizable. However, common elements that you might want to exclude include:
| match [att.match] | (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.) |
When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense approach based on common element definitions, so that for example when it reaches the end of a paragraph, it will not continue into the next paragraph to get more context words. You may have special runs of text in your document collection which do not appear to be bounding contexts, but actually are; for example, you may have span elements with class=note that appear in the middle of sentences but are not actually part of them. Use <context> elements to identify these special contexts so that the indexer knows the right boundaries from which to retrieve its keyword-in-context strings.
To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with a match pattern:
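For example (a sketch; the @class value is illustrative):

```xml
<context match="span[@class='note']"/>
```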
The default context elements are:
Pages may contain different kinds of blocks, or ‘contexts’, that need to be differentiated. For example, consider a page for an online journal article, which includes the article’s title, an abstract, the body of the article, and footnotes. Users may want to search for terms only within abstracts or they may want to search only within the body of the article, ignoring editorial and paratextual material.
| label [att.labelled] | (A string identifier specifying the name for a given context.) |
The text in the label serves as the text for a filter option on the generated search page, which allows users to perform a search within only a particular component of the page. For instance, for a page structured like the journal article mentioned above, we could specify the abstract, the notes, and the document’s body like so:
| type | |
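A sketch of such a configuration (the label values and @class patterns are illustrative assumptions, not part of a fixed vocabulary):

```xml
<contexts>
  <context label="Abstract" match="div[@class='abstract']"/>
  <context label="Article body" match="div[@class='body']"/>
  <context label="Notes" match="div[@class='footnotes']"/>
</contexts>
```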
| match [att.match] | (An XPath equivalent to the @match attribute of an xsl:template, which specifies a context in a document.) |
<exclude> can be used to identify documents or parts of documents that are to be omitted from a given search but, unlike content suppressed by setting a rule's weight to zero, are still retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for multiple search engines/pages that each have their own special features; in this case, you may want one specific search index/page to ignore filter controls (HTML <meta> elements, as described in 5 Search facet features) which are provided to support other search pages.
A complex site may have two or more search pages targeting specific types of document or content, each of which may need its own particular search controls and indexes. This can easily be achieved by specifying a different <searchPage> and <output> in the configuration file for each search.
However, it's also likely that you will want to exclude certain features or documents from a specialized search page, and this is done using the <excludes> section and its child <exclude> elements.
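A sketch of an <excludes> section (the type attribute values and match patterns shown here are assumptions to be checked against the schema documentation):

```xml
<excludes>
  <!-- Exclude a set of documents from this search's index. -->
  <exclude type="index" match="html[@id='blogPosts']"/>
  <!-- Ignore a filter control that belongs to another search page. -->
  <exclude type="filter" match="meta[@class='staticSearch_desc']"/>
</excludes>
```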
Using exclusions, you can create multiple specialized search pages which have customized form controls within the same document collection. This is at the expense of additional disk space and build time, of course; each of these searches needs to be built separately.
Note that once your file has been processed and all this content has been added, you can process it again at any time; there is no need to start every time with a clean, empty version of the search page.
You can take a look at the test/search.html page for an example of how to configure the search page (although note that since this page has already been processed, it has the CSS and the search controls embedded in it; it also has some additional JavaScript which we use for testing the search build results, which is not necessary for your site).
Once you have configured your HTML and your configuration file, you're ready to create a search index and a search page for your site. This requires that you run ant in the root folder of the staticSearch project that you have downloaded or cloned.
Note: you will need Java and Apache Ant installed, as well as ant-contrib.
Before running the search on your own site, you can test that your system is able to do the build by doing the (very quick) build of the test materials. If you simply run the ant command, like this:
mholmes@linuxbox:~/Documents/staticSearch$ ant
you should see a build process proceed using the small test collection of documents, and at the end, a results page should open up giving you a report on what was done. If this fails, then you'll need to troubleshoot the problem based on any error messages you see. (Do you have Java, Ant and ant-contrib installed and working on your system?).
If the test succeeds, you can view the results by uploading the test folder and all its contents to a web server, or by running a local webserver on your machine in that folder, using the Python HTTP server or PHP's built-in web server.
If the tests all work, then you're ready to build a search for your own site. Now you need to run the same command, but this time, tell the build process where to find your custom configuration file:
ant -DssConfigFile=/home/mholmes/mysite/config_staticSearch.xml
The same process should run, and if it's successful, you should have a modified search.html page as well as a lot of index files in JSON format in your site HTML folder. Now you can test your own search in the same ways suggested above.
If your project has its own ant build, you can call staticSearch from it using an ant target with a nested <property> value that points to your staticSearch config file (either relative to staticSearch using ssConfig, or an absolute path using ssConfigFile). Assuming that the build file, your config file, and your staticSearch directory are all at the root of the project, you could call the staticSearch build in ant like so:
Note that this approach may not work if your build depends on ant's -lib parameter (since the project's version of Saxon may conflict, for instance, with the version used by staticSearch). If your build requires the use of the -lib parameter, then an alternative approach for calling staticSearch from your build is to use the exec task like so:
After indexing your HTML files, the staticSearch build then generates an HTML report of helpful statistics and diagnostics about your document collection, which can be found in the directory specified by <output>. We recommend looking at this file regularly, especially if you're encountering unexpected behaviour by the staticSearch engine, as it contains information that can often help diagnose issues with configured filters or the HTML document collection that, if fixed, can improve staticSearch results.
By default, the report includes only basic information about the number of stem files created, the filters used, and any problems encountered. However, if you run the build process using the additional parameter ssVerboseReport:
ant -DssVerboseReport=true -DssConfigFile=...
then the report will also include a number of tables that outline some statistics about your project. However, please note that compiling these statistics is very memory-intensive and if your site is large, it may cause the build process to run out of memory.
As of version 1.4, the word frequency table is a separate document and is no longer included as part of the verbose report. Instead, after running a build, you can then build just the word frequency table with the special concordance target:
ant -DssConfigFile=path/to/your/config.xml concordance
While the chart itself is not necessary for the core functionality of staticSearch, it is particularly useful during the initial development of a project’s search engine; it can be used to create and fine-tune the project-specific stopword list (i.e. if a word appears multiple times in every document, then it probably ought to be a stopword) or, in some cases, to locate potential typos in the document collection. Alongside the HTML chart, the concordance target also produces a (large) JSON file representing the statistical data used to generate the HTML view. It contains a JSON object listing each stem and its variants; for each variant, it lists each document in which the variant appears and the number of times it appears within that document.
The concordance target comes with the same warning as above: compiling these statistics is memory-intensive and may cause the build process to run out of memory.
The search index will automatically link each keyword-in-context extract with the closest element that has an id, so that each keyword-in-context string in a search result will point directly to a specific page fragment. You can also add your own attributes to any element in your document in order to have those attributes appear on the keyword-in-context (KWIC) result string (which is in the form of an HTML <li> element).
You can add as many custom attributes as you like (although bear in mind that they increase the size of the index JSON files slightly and may add to the build time).
When you use a conventional search engine which is based on a backend database and configured for a specific dynamic website, it is not unusual to find that when you follow a link to a search hit on a target page, the hit will be highlighted on that page, because the backend database is able to pre-process it. With staticSearch, that's not straightforward, because there is no ‘engine’ that is preprocessing your HTML pages. However, if this is important to you, there is a workaround that you can use. Since staticSearch does not modify your web pages, though, the implementation has to be done by you.
By default each keyword-in-context result that shows on the search page will have its own specific link to the fragment of the document which contains the hit. We strongly recommend that you ensure your target documents have id attributes for any significant divisions so that result links are as precise as possible, making the search results much more useful.
Those links are also provided with a search string, like this: https://example.com/egPage.html?ssMark=elephant#animals. This link points to the section of the document which has id=animals, but it also says ‘the hit text is the word elephant.’ Some JavaScript that runs on the target page, egPage.html (which you control), can then parse the value of the query parameter ssMark in order to find the hit text and highlight it in some way.
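The client-side half of this mechanism can be sketched as follows. This is not the ssHighlight.js implementation; it only shows how the ssMark query parameter described above can be recovered on the target page using the standard URL API:

```javascript
// Sketch: recover the ssMark hit text from a page URL.
// This is an illustrative helper, not part of staticSearch itself.
function getSsMark(href) {
  // URL and URLSearchParams are standard in modern browsers and Node.
  return new URL(href).searchParams.get('ssMark'); // null if absent
}

// In the browser you would then locate the fragment named by
// location.hash, walk its text nodes, and wrap the first occurrence
// of the hit text in a <mark> element, e.g.:
//   const hit = getSsMark(document.location.href);
//   if (hit) { /* find `hit` in the target fragment and highlight it */ }
```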
Obviously you can implement this any way you like (or just ignore it), but we also supply a small demonstration JavaScript library which implements this functionality, called ssHighlight.js. This JS file is included into the staticSearch output folder (see <output>) by default, and if you include it into the header of your own pages, it will probably do the highlighting without further intervention. If, however, you have lots of existing JavaScript that runs when the page loads, there may be some interference between this library and your own code, so you may have to make some adjustments to the code.