Schema and guidelines for creating a staticSearch engine for your HTML5 site
Martin Holmes
Joey Takeda

This documentation provides instructions on how to use the Project Endings staticSearch Generator to add a fully functional search ‘engine’ to your website without any dependency on server-side code such as a database.

8 How does it work?

8.1 Building the index

The indexing process first reads your configuration file and generates an XSLT file with all your settings embedded in it. Next, it processes your document collection using those settings. Each document is tokenized, and a separate JSON file is created for each distinct token found; this file contains links to each of the documents which contain that token, as well as keyword-in-context strings showing the token as it occurs in those documents. There will most likely be thousands of these files, but most of them are quite small. Together they constitute the textual index.
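The idea of one small file per token can be sketched in a few lines. The real build is done with XSLT, so the JavaScript below is only an illustration; the function name `buildTokenIndex` and the field names (`token`, `instances`, `docId`, `contexts`) are assumptions made for this sketch, not the actual staticSearch file format, and stemming is left out.

```javascript
// Illustrative only: builds an in-memory map with one entry per distinct token.
// Field names are hypothetical and do not reflect the real staticSearch JSON.
function buildTokenIndex(docs, contextWidth = 20) {
  const index = new Map();
  for (const [docId, text] of Object.entries(docs)) {
    const wordPattern = /[\p{L}\p{N}]+/gu;
    let match;
    while ((match = wordPattern.exec(text)) !== null) {
      const token = match[0].toLowerCase(); // a real build would also stem here
      // Keyword-in-context: a slice of text surrounding the token.
      const start = Math.max(0, match.index - contextWidth);
      const end = Math.min(text.length, match.index + match[0].length + contextWidth);
      if (!index.has(token)) index.set(token, { token, instances: [] });
      const entry = index.get(token);
      let instance = entry.instances.find((i) => i.docId === docId);
      if (!instance) {
        instance = { docId, contexts: [] };
        entry.instances.push(instance);
      }
      instance.contexts.push(text.slice(start, end));
    }
  }
  return index;
}

const sampleDocs = {
  "poem1.html": "Waiting for the rain",
  "poem2.html": "The rain came",
};
const tokenIndex = buildTokenIndex(sampleDocs);
// Each map value would be serialized to its own small JSON file.
console.log(JSON.stringify(tokenIndex.get("rain"), null, 2));
```

Because every token gets its own entry, a search for one word only ever needs one of these entries, however large the collection is.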

In addition, separate JSON files are created for the list of document titles, and for your stopword list if you have specified one. A single text file is also created containing all the unique terms in the collection, used when doing wildcard searches.

If you have specified search facets in your document headers, the processor then creates a separate JSON file for each facet, listing the identifiers of the documents that match each of its values. For example, if some of your documents are specified as ‘Illustrated’ and some are not (true or false), a JSON file will be created for the ‘Illustrated’ facet, containing one list of the documents for which the facet is true and another list of those for which it is false.
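As a rough sketch of the ‘Illustrated’ example, the following groups document identifiers by the value of a boolean facet. The function name `buildBooleanFacet` and the property names `filterName`, `trueDocs` and `falseDocs` are hypothetical, chosen for readability; the real facet files may be structured differently.

```javascript
// Illustrative only: collects document ids into "true" and "false" lists
// for a single boolean facet. Property names are assumptions for this sketch.
function buildBooleanFacet(facetName, docFacets) {
  const result = { filterName: facetName, trueDocs: [], falseDocs: [] };
  for (const [docId, facets] of Object.entries(docFacets)) {
    if (facets[facetName] === true) result.trueDocs.push(docId);
    else if (facets[facetName] === false) result.falseDocs.push(docId);
    // Documents that do not declare the facet appear in neither list.
  }
  return result;
}

const illustrated = buildBooleanFacet("Illustrated", {
  "doc1.html": { Illustrated: true },
  "doc2.html": { Illustrated: false },
  "doc3.html": { Illustrated: true },
  "doc4.html": {},
});
console.log(JSON.stringify(illustrated));
```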

Finally, the template file you have created for the search page on your site will be processed to add the required search controls and JavaScript to make the search work.

8.2 The search page

In order to provide fast, responsive search results, the search page must download only the information it needs for each specific search. If it were to download the entire collection of thousands of token files, every search would be unacceptably slow. So when you search for the word waiting, the JavaScript stems that word, producing wait, and then downloads only the single file containing the indexing information for that stemmed token, which is a very quick operation. (If you are using a different stemmer, of course, then the token will be stemmed to a different output. If you are using the identity stemmer, then the token will be unchanged; with the stripDiacritics pseudo-stemmer, all combining diacritics will be stripped from the search terms, as they are in the corresponding index.)
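The lookup described above amounts to turning a search term into the name of exactly one file. The sketch below shows that mapping under several assumptions: `toyStem` is a crude suffix stripper standing in for a real stemmer, `stripCombiningMarks` only approximates what a diacritic-stripping pseudo-stemmer might do, and the `index/` path and `tokenFileFor` function are hypothetical.

```javascript
// Toy stand-in for a real stemmer: strips a few common English suffixes.
function toyStem(word) {
  return word.toLowerCase().replace(/(ing|ed|s)$/, "");
}

// Identity stemmer: the token is left unchanged.
function identityStem(word) {
  return word;
}

// Approximation of diacritic stripping: decompose, then drop combining marks.
function stripCombiningMarks(word) {
  return word.normalize("NFD").replace(/\p{M}/gu, "");
}

// Hypothetical helper: maps a search term to the one index file it needs.
function tokenFileFor(searchTerm, stem = toyStem, baseUrl = "index/") {
  return baseUrl + encodeURIComponent(stem(searchTerm)) + ".json";
}

console.log(tokenFileFor("waiting"));
console.log(tokenFileFor("waiting", identityStem));
console.log(tokenFileFor("café", stripCombiningMarks));
```

Whichever stemmer is used for searching has to match the one used when the index was built, otherwise the computed file name will not correspond to any file in the index.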

However, some information is required for all or many searches. For example, the list of document titles must be downloaded before any results can be displayed. Similarly, a user may use the search facets alone, not searching for a particular word or phrase but just wanting a list of all the documents classified as ‘Poems’; this requires the JSON file for that facet to be downloaded. There is therefore some advantage in having the JavaScript start downloading the essential files (titles, stopwords and so on) as soon as the page loads, and it also starts downloading the facet files in the background.

At the same time, though, we don't want to clog up the connection downloading these files when the user may do a simple text search which doesn't depend on them, so these files are retrieved using a ‘trickle’ approach, one at a time. If a search is then initiated, all the files required for that specific search can be downloaded as fast as possible, overriding the trickle sequence for files that are needed immediately.
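One way to picture the trickle-with-override behaviour is as a queue that hands out background files one at a time, but lets a search claim whatever it needs immediately. This is a minimal sketch of that scheduling logic only (no network requests are made); the class name `TrickleQueue`, its methods and the file names are all hypothetical, and the actual staticSearch code may organize this differently.

```javascript
// Illustrative scheduler: decides which file to fetch next, nothing more.
class TrickleQueue {
  constructor(files) {
    this.pending = [...files]; // background files, handed out one at a time
    this.retrieved = new Set(); // files already handed out for download
  }

  // Next background file to fetch, or undefined when the queue is empty.
  next() {
    const file = this.pending.shift();
    if (file !== undefined) this.retrieved.add(file);
    return file;
  }

  // Files a search needs right now: pull them out of the trickle sequence
  // and return those not yet retrieved, so they can be fetched at once.
  claim(needed) {
    const missing = needed.filter((f) => !this.retrieved.has(f));
    this.pending = this.pending.filter((f) => !missing.includes(f));
    for (const f of missing) this.retrieved.add(f);
    return missing;
  }
}

const queue = new TrickleQueue(["titles.json", "stopwords.json", "facetA.json", "facetB.json"]);
const first = queue.next(); // background trickle begins
const urgent = queue.claim(["titles.json", "facetB.json"]); // a search starts
const second = queue.next(); // trickle resumes with what is left
console.log(first, urgent, second);
```

In a browser, `next()` would be called again each time the previous background download completes, while the files returned by `claim()` would be requested in parallel.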

One exception to the trickle approach is the case of Feature filters (where a user can select search facets based on a typeahead control). Their JSON files must be downloaded to the browser before the typeahead control can be functional, so if you have feature facets in your search, you will see an additional delay before the search page is responsive.

Once the user has been on the search page for any length of time, all ancillary files will have been retrieved (assuming they weren't already cached by the browser), so the only files required for any search are those for the actual text search terms; the response should therefore be even faster for later searches than for early ones.

8.3 JavaScript compilation

The search page created for your website is entirely driven by JavaScript. The JavaScript source code can be found in a number of .js files inside the repository js folder. At build time, these files (with the exception of ssHighlight.js and ssInitialize.js) are first concatenated into a single large file called ssSearch-debug.js. This file is then optimized using the Google Closure Compiler, to create a smaller file called ssSearch.js which should be faster for the browser to download and parse. Both of these output files are provided in your project <outputFolder>; ssSearch.js is linked in your search page, but if you're having problems and would like to debug with more human-friendly JavaScript, you can switch that link to point to ssSearch-debug.js.

We are still experimenting with the options and affordances of the Closure compiler, in the interests of finding the best balance between file size and performance.

8.4 The JavaScript Initializer file

When your search page is created, in addition to the main ssSearch.js file, an additional tiny JavaScript file, ssInitialize.js, is also linked into the page. The only purpose of this script is to create a variable for the StaticSearch object and to initialize the object, which causes the search functionality to be set up on the page. We do this in a separate file so that if you have any specific options or setup functions that run on your own site that might conflict with this initialization process, you can easily modify or remove this single file and manage the initialization of the StaticSearch object with your own code. In early versions of staticSearch, this code was injected directly into a <script> tag within the search page, but since Content Security Policy settings increasingly object to inline JavaScript, we have abstracted it into a separate file.
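If you decide to replace the initializer with your own code, the replacement might look something like the sketch below. It is a guess at the general shape rather than a copy of the shipped file: the variable name `Sch`, the function `initStaticSearch` and the choice of the `load` event are assumptions, and the `StaticSearch` class here is only a stand-in so that the sketch runs on its own (on a real page that class comes from ssSearch.js).

```javascript
// Stand-in for the class provided by ssSearch.js; remove this on a real page.
class StaticSearch {
  constructor() {
    this.ready = true;
  }
}

// Variable that will hold the search object (the name is an assumption).
let Sch;

function initStaticSearch() {
  // Any site-specific setup could run here, before or after this line.
  Sch = new StaticSearch();
}

// In a browser, wait until the page has loaded; elsewhere, run immediately.
if (typeof window !== "undefined") {
  window.addEventListener("load", initStaticSearch);
} else {
  initStaticSearch();
}
```

Keeping this logic in an external file rather than an inline script means it can be served under a Content Security Policy that disallows inline JavaScript.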

Martin Holmes and Joey Takeda. Date: 2019-2023