Chapter 7. mnoGoSearch word index formats

Table of Contents
Word modes with an SQL database
Cache mode storage
mnoGoSearch performance issues
Oracle notes
IBM DB2 notes

Word modes with an SQL database

Various modes used to store words

mnoGoSearch can use a number of different formats (modes) to store word information in the database, suitable for different purposes. The available modes are: single, multi and blob. The default mode is blob. The mode can be selected using the DBMode part of the DBAddr command in indexer.conf and search.htm.

Examples:


DBAddr mysql://localhost/test/?DBMode=single
DBAddr mysql://localhost/test/?DBMode=multi
DBAddr mysql://localhost/test/?DBMode=blob
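
As a rough illustration of how the DBMode parameter travels inside a DBAddr value, the following Python sketch splits such an address with the standard urllib.parse module. The helper name db_mode is hypothetical, and the fallback to blob when the parameter is absent simply mirrors the default stated above. This is not how mnoGoSearch itself parses the command.

```python
from urllib.parse import urlsplit, parse_qs

def db_mode(dbaddr: str, default: str = "blob") -> str:
    """Return the DBMode parameter of a DBAddr value (hypothetical helper)."""
    query = urlsplit(dbaddr).query
    params = parse_qs(query)
    # parse_qs maps each parameter name to a list of values; take the first.
    return params.get("DBMode", [default])[0]

for addr in (
    "mysql://localhost/test/?DBMode=single",
    "mysql://localhost/test/?DBMode=multi",
    "mysql://localhost/test/",
):
    print(addr, "->", db_mode(addr))
```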

Storage mode - single

The single mode is suitable for a small site with the total number of documents up to 5000.

When the single mode is specified, all words are stored in a single table dict with three columns (url_id,word,coord), where url_id is the ID of the document, which refers to the rec_id field in the table url, and coord combines the section ID with the position of the word in that section. The word column has the SQL type varchar(32). Every occurrence of a word in a document produces a separate record in the table.

The advantage of the single mode is its support for live updates: a document updated by indexer becomes immediately visible for searches with its new content. In other words, crawling and indexing are done at the same time, for every document individually.

Another advantage of the single mode is its simple and straightforward data format. You can use mnoGoSearch as a full-text solution for your database-driven Web application. For example, you may find it useful to create a simple search page which queries the data collected by indexer this way:


SELECT
  url.url, count(*) AS rank
FROM
  dict, url
WHERE
  url.rec_id=dict.url_id
AND
  dict.word IN ('some','words')
GROUP BY
  url.url
ORDER BY
  rank DESC;
and display the results of this search query.
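
To try the query above without a running mnoGoSearch installation, the following Python sketch recreates a cut-down version of the two tables in an in-memory SQLite database. It assumes a minimal schema containing only the columns named in this section, uses made-up sample rows, and stores placeholder values in coord. The alias is renamed from rank to hits only to avoid clashing with the RANK keyword of some SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ASSUMPTION: minimal schema, only the columns mentioned in the text.
conn.executescript("""
CREATE TABLE url  (rec_id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), coord INTEGER);
""")
conn.executemany("INSERT INTO url VALUES (?, ?)", [
    (1, "http://example.com/a.html"),
    (2, "http://example.com/b.html"),
])
# One row per word occurrence; the coord values are placeholders.
conn.executemany("INSERT INTO dict VALUES (?, ?, ?)", [
    (1, "some", 1), (1, "words", 2), (1, "some", 3),
    (2, "words", 1),
])

def search(words):
    """Rank documents by the number of matching word hits."""
    placeholders = ",".join("?" for _ in words)
    sql = f"""
        SELECT url.url, count(*) AS hits
        FROM dict, url
        WHERE url.rec_id = dict.url_id
          AND dict.word IN ({placeholders})
        GROUP BY url.url
        ORDER BY hits DESC
    """
    return conn.execute(sql, list(words)).fetchall()

print(search(["some", "words"]))
```

Passing the words as bound parameters rather than pasting them into the SQL text keeps user input from being interpreted as SQL.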

Note: The above query implements very simple ranking based on the count of the word hits. You can also integrate mnoGoSearch with your own application using the UserCacheQuery command, which supports full-featured ranking taking into account all factors described in the Section called Commands affecting document score in Chapter 10.

Note: When you use mnoGoSearch to index data stored in your SQL tables (see the Section called Indexing SQL tables (htdb:/ virtual URL scheme) in Chapter 6 for details), you may find it useful to run queries joining the table dict with your own tables.

Storage mode - multi

The multi mode is suitable for a medium size Web space with up to about 50000 documents. It can be useful if your documents are updated very often.

If the multi mode is selected, word information is distributed into 256 separate tables dict00..dictFF using a hash function for distribution. The structure of these tables is close to the table dict used in the single mode: (url_id,secno,word,coords). The difference is that all positions of the same word (hits) in a section of a document are grouped into a single binary array coords, instead of producing multiple records. Word information for different sections is stored in separate records.
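
The idea of spreading words over 256 tables can be sketched as follows. The hash function that mnoGoSearch really uses is not described in this chapter, so CRC32 from the Python standard library stands in for it here, and the helper name dict_table is hypothetical.

```python
import zlib

def dict_table(word: str) -> str:
    """Map a word to one of the 256 tables dict00..dictFF.

    ASSUMPTION: CRC32 is used purely for illustration; it is not
    claimed to be the hash function used by mnoGoSearch.
    """
    bucket = zlib.crc32(word.encode("utf-8")) & 0xFF  # keep the low 8 bits
    return f"dict{bucket:02X}"

for word in ("search", "engine", "index"):
    print(word, "->", dict_table(word))
```

Because the table name depends only on the word, a full-word look-up needs to query just one of the 256 tables.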

Similar to the single mode, the multi mode supports live updates. That is, crawling and indexing are done at the same time. A new document (or an updated document) becomes available for search very soon after indexer has crawled it.

When working in the multi mode, indexer caches word information in memory for better crawling performance. The word cache is flushed to the database as soon as it grows to the size given in WordCacheSize, which is 8 MB by default. You can set WordCacheSize to a bigger value for better crawling performance.

Note: The disadvantage of a very large WordCacheSize value is that if indexer crashes or dies for any other reason, all cached information is lost.
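
The trade-off described above can be illustrated with a toy cache. The class WordCache, its size estimate and the flush callback are all assumptions made for this sketch and do not reflect the internal data structures of indexer.

```python
class WordCache:
    """Toy in-memory word cache that is flushed when it reaches a size limit."""

    def __init__(self, limit_bytes, flush):
        self.limit_bytes = limit_bytes  # plays the role of WordCacheSize
        self.flush_fn = flush           # called with the list of cached hits
        self.hits = []
        self.size = 0

    def add(self, url_id, secno, word, pos):
        self.hits.append((url_id, secno, word, pos))
        # Rough size estimate: word bytes plus three 4-byte integers.
        self.size += len(word.encode("utf-8")) + 12
        if self.size >= self.limit_bytes:
            self.flush()

    def flush(self):
        if self.hits:
            self.flush_fn(self.hits)
        self.hits, self.size = [], 0

written = []  # stands in for the database
cache = WordCache(limit_bytes=64, flush=written.extend)
text = "the quick brown fox jumps over the lazy dog"
for pos, word in enumerate(text.split(), 1):
    cache.add(1, 1, word, pos)

print(len(written), "hits written,", len(cache.hits), "still only in memory")
```

Any hits still held only in memory at the moment of a crash are the ones that would be lost, which is why a larger limit means fewer database writes but more data at risk.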

Grouping word hits into the same record and distributing them between multiple tables make the multi mode much faster than the single mode for both searching and indexing.

Storage mode - blob

The blob mode is the fastest mode currently available in mnoGoSearch for both indexing and searching. This mode can handle up to about 1,000,000 - 2,000,000 documents on a single machine.

DBMode=blob is known to work fine with DB2, Mimer, MS SQL, MySQL, PostgreSQL, Oracle, Sybase, Firebird/Interbase and SQLite3.

In the blob mode crawling and indexing are done separately. Crawling is done by starting indexer without any command line arguments. At crawling time indexer collects word information into the table bdicti with a structure optimized for crawling purposes, but not suitable for search purposes.

After crawling is done, an extra step is required to create the search index by launching indexer -Eblob. When creating the search index, indexer loads information from the table bdicti, groups all hits of the same word in different documents together and writes the grouped data into the table bdict with a structure optimized for search purposes. The table bdict consists of three columns (word, secno, intag), where intag is a binary array which includes information about all documents this word appears in (using 32-bit IDs of the documents), as well as positions of the word in every document (for phrase search). The table bdict has an index on the column word for fast look-up at search time. Words from different sections (e.g. title and body) are written in separate records.

Note: Separate records for different sections are needed to optimize searches with section limits, for example "find only in title".
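
The grouping step performed by indexer -Eblob can be pictured with the following sketch. The row layout of the crawl-time data, the section numbers and the nested dictionary result are assumptions chosen for readability. The real tables use binary arrays whose exact layout is not covered here.

```python
from collections import defaultdict

# Hypothetical crawl-time rows: (url_id, secno, word, position).
crawl_rows = [
    (1, 1, "search", 1),   # secno 1 = body in this sketch
    (1, 1, "engine", 2),
    (1, 2, "search", 1),   # secno 2 = title in this sketch
    (2, 1, "search", 5),
    (2, 1, "search", 9),
]

def build_index(rows):
    """Group hits by (word, secno), then by document ID."""
    index = defaultdict(lambda: defaultdict(list))
    for url_id, secno, word, pos in rows:
        index[(word, secno)][url_id].append(pos)
    # Sort everything so that the result is deterministic.
    return {
        key: {doc: sorted(positions) for doc, positions in sorted(docs.items())}
        for key, docs in sorted(index.items())
    }

index = build_index(crawl_rows)
for (word, secno), docs in index.items():
    print(word, secno, docs)
```

Keeping one entry per (word, secno) pair is what allows a search limited to the title to read only the title entry of a word.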

Also, additional arrays of data are written into the table bdict:

  • #rec_id - a list of 32-bit document IDs

  • #last_mod_time - an array of 32-bit Last-Modified values (in Unix timestamp format) - for fast limiting searches by date.

  • #pop_rank - an array of popularity rank values, each in the 32-bit float format.

  • #site_id - an array of 32-bit site IDs, for GroupBySite.

  • #limit#name - a list of document IDs covered by a user defined limit with name "name". A separate #limit#xxx record is created for every user defined Limit configured in indexer.conf.

  • #ts - the timestamp indicating when indexer -Eblob was last executed, stored as text in the Unix timestamp format. This value is used for invalidating old queries stored in the search result cache, as well as for searches with live updates, described in the Section called Live updates emulator with DBMode=blob.

  • #version - a string representing the version ID of indexer which created the search index. For example, indexer from mnoGoSearch 3.3.0 writes the string "30300". This record makes upgrades easier by letting a newer version of search.cgi recognize an older index format.
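
Arrays of fixed-width values such as #rec_id or #pop_rank can be pictured as plain packed binary data. The sketch below uses the Python struct module and assumes little-endian byte order and unsigned integers, neither of which is stated in this chapter. The helper names and sample values are hypothetical.

```python
import struct

# ASSUMPTION: little-endian, unsigned 32-bit integers and 32-bit floats.
def pack_u32(values):
    """Pack a list of integers into consecutive 32-bit values."""
    return struct.pack(f"<{len(values)}I", *values)

def unpack_u32(blob):
    """Inverse of pack_u32: 4 bytes per value."""
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))

def pack_f32(values):
    """Pack a list of floats into consecutive 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)

rec_ids = [1, 2, 3]                                 # like #rec_id
last_mod = [1199145600, 1201824000, 1204329600]     # like #last_mod_time
pop_rank = [0.5, 0.25, 1.0]                         # like #pop_rank

print(len(pack_u32(rec_ids)), "bytes for", len(rec_ids), "document IDs")
print(unpack_u32(pack_u32(last_mod)))
```

A fixed width of four bytes per value means the entry for the n-th document can be found by offset alone, without parsing the rest of the array.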

Note: Creating a fast search index is also possible for databases using DBMode=single and DBMode=multi. This is useful when search performance with the other modes has become poor and you need to switch to DBMode=blob quickly, without having to re-index your Web space. Later you can switch completely to DBMode=blob in both indexer.conf and search.htm, and run indexing from the very beginning.

The disadvantage of DBMode=blob is that it does not support live updates directly. New or updated documents crawled by indexer are not visible for search until indexer -Eblob is run again. Creating the search index takes about 6 minutes for a collection of 200000 HTML documents with a total size of 10 GB (on an Intel Core Duo 2.13GHz CPU), which can be unacceptably long for some applications (for example, on a news site, or when using mnoGoSearch as an external full-text engine for SQL tables with the help of HTDB).

Live updates emulator with DBMode=blob

Starting from version 3.3.1, mnoGoSearch emulates live updates by reading word information for new or updated documents directly from the crawler table bdicti. This makes it possible to add or update up to about 10,000 documents without having to run indexer -Eblob. To activate live updates, add the LiveUpdates=yes parameter to the DBAddr command in search.htm.

Example:


DBAddr mysql://root@localhost/test/?DBMode=blob&LiveUpdates=yes

Extended features with DBMode=blob

Starting with version 3.3.0, indexer -Eblob can be used in combination with URL and Tag limits, other limits described in the Section called Subsection control in Chapter 3, as well as in combination with a user defined limit described by a Limit command. The limits make it possible to generate a search index over a subset of the documents collected by indexer at crawling time.

Examples:


indexer -Eblob -u %/subdir/%
indexer -Eblob -t tag
indexer -Eblob --fl=limitname

Starting with version 3.2.36, an additional command is available: indexer -Erewriteurl. When indexer is launched with this parameter, it rewrites URL data for DBMode=blob. This can be useful to rewrite URL data quickly without having to rebuild the entire search index, for example if you added the Deflate=yes parameter to DBAddr, or after running indexer -n0 -R to update the Popularity Rank.

Maximum amount of words collected from a document

Starting from version 3.3.0, mnoGoSearch enumerates word positions for every section separately and can store information about up to 2 million words per section.

In the versions prior to 3.3.0 it was possible to store up to 64K words from a single document.

Substring search notes

The single, multi and blob modes support substring search. To perform a substring search, an SQL query containing a LIKE predicate is executed internally. Substring search is usually slower than searching for a full word, especially for very short substrings. You can use the SubstringMatchMinWordLength command to set the minimum word length allowed for substring search.
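
A substring look-up of this kind can be tried against a cut-down dict table in SQLite. The constant MIN_SUBSTRING_LENGTH is only an illustrative stand-in for SubstringMatchMinWordLength, and returning an empty result for shorter fragments is an assumption of this sketch. The exact query issued by mnoGoSearch and its behavior for short fragments are not shown in this chapter.

```python
import sqlite3

MIN_SUBSTRING_LENGTH = 3  # illustrative stand-in for SubstringMatchMinWordLength

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), coord INTEGER)")
conn.executemany("INSERT INTO dict VALUES (?, ?, ?)", [
    (1, "search", 1), (1, "research", 2), (2, "engine", 1),
])

def substring_search(fragment):
    """Return (url_id, word) pairs whose word contains the fragment."""
    if len(fragment) < MIN_SUBSTRING_LENGTH:
        return []  # ASSUMPTION: fragments that are too short match nothing
    # Escape LIKE wildcards so that they are matched literally.
    escaped = (fragment.replace("\\", "\\\\")
                       .replace("%", "\\%")
                       .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT DISTINCT url_id, word FROM dict "
        "WHERE word LIKE ? ESCAPE '\\' ORDER BY url_id, word",
        (f"%{escaped}%",),
    )
    return rows.fetchall()

print(substring_search("sear"))
print(substring_search("en"))
```

A pattern that starts with a wildcard generally cannot use an ordinary index on the word column, which is one reason why substring search is slower than a full-word look-up.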

Note: When performing substring search in the multi mode, search.cgi has to iterate search queries through all 256 tables dict00..dictFF, which makes substring search especially slow. Using substring search is not recommended with DBMode=multi.