mnoGoSearch can store word information in the database in several different formats (modes), each suited to a different purpose. The available modes are single, multi and blob. The default mode is blob. The mode is selected with the DBMode parameter of the DBAddr command in indexer.conf and search.htm.
Examples:
DBAddr mysql://localhost/test/?DBMode=single
DBAddr mysql://localhost/test/?DBMode=multi
DBAddr mysql://localhost/test/?DBMode=blob
The single mode is suitable for a small site with up to about 5000 documents in total.
When the single mode is specified, all words are stored in a single table dict with three columns (url_id, word, coord), where url_id is the ID of the document (it refers to the rec_id field in the table url), and coord is a combination of the section ID and the position of the word in the section. The word column is a variable-length character column of up to 32 characters.
Every appearance of the same word in a document produces a separate record in the table.
The advantage of the single mode is its support for live updates: a document updated by indexer immediately becomes visible to searches with its new content. In other words, crawling and indexing are done at the same time, for every document individually.
Another advantage of the single mode is its simple and straightforward data format. You can use mnoGoSearch as a full-text solution for your database-driven Web application. For example, you may find it useful to create a simple search page which queries the data collected by indexer this way:
SELECT url.url, count(*) AS rank
FROM dict, url
WHERE url.rec_id=dict.url_id AND dict.word IN ('some','words')
GROUP BY url.url
ORDER BY rank DESC;
and display the results of this search query.
Note: The above query implements very simple ranking based on the count of the word hits. You can also integrate mnoGoSearch with your own application using the UserCacheQuery command, which supports full-featured ranking taking into account all factors described in the Section called Commands affecting document score in Chapter 10.
Note: When you use mnoGoSearch to index data stored in your SQL tables (see the Section called Indexing SQL tables (htdb:/ virtual URL scheme) in Chapter 6 for details), you may find it useful to run queries joining the table dict with your own tables.
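The following is a minimal sketch of the single-mode layout and of the ranking query shown above, not mnoGoSearch's own code. It assumes an in-memory SQLite database, reduces the url table to two columns, and uses arbitrary integers for coord, since the exact packing of section ID and position is not described here. The sample URLs and the search helper are hypothetical.

```python
import sqlite3

# In-memory stand-ins for the tables url and dict (simplified; assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url  (rec_id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), coord INTEGER);
""")
conn.executemany("INSERT INTO url VALUES (?, ?)", [
    (1, "http://example.com/a"),
    (2, "http://example.com/b"),
])
# Single mode: every appearance of a word is a separate record.
# The coord values are placeholders, not the real encoding.
conn.executemany("INSERT INTO dict VALUES (?, ?, ?)", [
    (1, "some", 1), (1, "words", 2), (1, "some", 3),
    (2, "words", 1), (2, "other", 2),
])

def search(conn, words):
    """Rank documents by the number of hits for any of the given words."""
    placeholders = ",".join("?" for _ in words)
    sql = f"""SELECT url.url, count(*) AS rank
              FROM dict, url
              WHERE url.rec_id = dict.url_id
                AND dict.word IN ({placeholders})
              GROUP BY url.url
              ORDER BY rank DESC"""
    return conn.execute(sql, list(words)).fetchall()

results = search(conn, ["some", "words"])
for url, rank in results:
    print(rank, url)
```

In this sample data the first document has three matching records and the second has one, so the first document is listed first.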
The multi mode is suitable for a medium-sized Web space with up to about 50000 documents. It can be useful if your documents are updated very often.
If the multi mode is selected, word information is distributed into 256 separate tables dict00..dictFF, using a hash function for distribution. The structure of these tables is close to that of the table dict used in the single mode: (url_id, secno, word, coords). The difference is that all positions of the same word (hits) in a section of a document are grouped into a single binary array coords, instead of producing multiple records. Word information for different sections is stored in separate records.
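The grouping and distribution described above can be sketched as follows. This is an illustration only: the actual hash function and the binary layout of coords are not specified here, so the sketch assumes the low byte of a CRC32 checksum for table selection and little-endian 32-bit integers for the position array. The function names are hypothetical.

```python
import struct
import zlib
from collections import defaultdict

def table_for(word):
    """Pick one of the 256 tables dict00..dictFF for a word.
    Assumption: low byte of CRC32; the real hash function may differ."""
    return "dict%02X" % (zlib.crc32(word.encode("utf-8")) & 0xFF)

def group_hits(hits):
    """Group (url_id, secno, word, position) hits into one record per
    (url_id, secno, word), keyed by the target table name."""
    grouped = defaultdict(list)
    for url_id, secno, word, pos in hits:
        grouped[(url_id, secno, word)].append(pos)

    rows = defaultdict(list)
    for (url_id, secno, word), positions in grouped.items():
        # Assumed packing: sorted positions as little-endian 32-bit integers.
        coords = struct.pack("<%dI" % len(positions), *sorted(positions))
        rows[table_for(word)].append((url_id, secno, word, coords))
    return dict(rows)

hits = [
    (1, 1, "search", 3),
    (1, 1, "search", 7),   # same word, same section: joins the record above
    (1, 2, "search", 1),   # different section: separate record
]
rows = group_hits(hits)
```

Three hits produce two records here, because the two hits in section 1 share a single coords array while the hit in section 2 gets its own record.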
Similar to the single mode, the multi mode supports live updates. That is, crawling and indexing are done at the same time. A new document (or an updated document) becomes available for search very soon after indexer has crawled it.
When working in the multi mode, indexer caches word information in memory for better crawling performance. The word cache is flushed to the database as soon as it grows to the size given in WordCacheSize, which is 8 MB by default. You can set WordCacheSize to a bigger value for better crawling performance.
Note: The disadvantage of a very large WordCacheSize value is that if indexer crashes or dies for any other reason, all cached information is lost.
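The trade-off behind WordCacheSize can be illustrated with a small sketch of a size-bounded write cache. It is not indexer's implementation: the WordCache class, the per-entry size estimate and the flush callback are all assumptions made for the example.

```python
class WordCache:
    """Hypothetical in-memory word cache that flushes once it reaches
    a size limit, in the spirit of WordCacheSize."""

    def __init__(self, flush_fn, max_bytes=8 * 1024 * 1024):
        self.flush_fn = flush_fn      # called with the list of cached hits
        self.max_bytes = max_bytes
        self.entries = []
        self.size = 0

    def add(self, url_id, secno, word, pos):
        self.entries.append((url_id, secno, word, pos))
        # Rough size estimate (assumption): word bytes plus three integers.
        self.size += len(word.encode("utf-8")) + 12
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        """Write everything cached so far and start with an empty cache."""
        if self.entries:
            self.flush_fn(self.entries)
        self.entries = []
        self.size = 0

written = []                           # stands in for the database
cache = WordCache(written.append, max_bytes=40)
cache.add(1, 1, "abcd", 1)
cache.add(1, 1, "abcd", 2)             # still only in memory
cache.add(1, 1, "abcd", 3)             # limit reached: all three are flushed
```

Until the limit is reached the hits exist only in memory, which is why a larger limit means fewer database writes but more data lost if the process dies.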
Grouping word hits into the same record and distributing them between multiple tables make the multi mode much faster than the single mode, both for searching and for indexing.
The blob mode is the fastest mode currently available in mnoGoSearch, both for indexing and for searching. This mode can handle about 1,000,000 to 2,000,000 documents on a single machine.
DBMode=blob is known to work fine with DB2, Mimer, MS SQL, MySQL, PostgreSQL, Oracle, Sybase, Firebird/Interbase and SQLite3.
In the blob mode, crawling and indexing are done separately.
Crawling is done by starting indexer without any command line arguments. At crawling time, indexer collects word information into the table bdicti, whose structure is optimized for crawling but is not suitable for searching.
After crawling is done, an extra step is required to create the search index: launching indexer -Eblob. When creating the search index, indexer loads information from the table bdicti, groups all hits of the same word in different documents together, and writes the grouped data into the table bdict, whose structure is optimized for searching.
The table bdict consists of three columns (word, secno, intag), where intag is a binary array which includes information about all documents this word appears in (using 32-bit IDs of the documents), as well as the positions of the word in every document (for phrase search). The table bdict has an index on the column word for fast look-up at search time. Words from different sections (e.g. title and body) are written in separate records.
Note: Separate records for different sections are needed to optimize searches with section limits, for example "find only in title".
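The index-building step can be pictured as an inversion from per-document records to per-word records. The sketch below is a simplified model, not the code behind indexer -Eblob: the layout of bdicti and the binary format of intag are not described here, so crawl records are assumed to be (url_id, secno, word, positions) tuples and the grouped result is kept as plain Python lists. The function name is hypothetical.

```python
from collections import defaultdict

def build_search_index(crawl_rows):
    """Group hits of the same word across documents into one record per
    (word, secno), listing each document ID with its word positions."""
    index = defaultdict(dict)
    for url_id, secno, word, positions in crawl_rows:
        index[(word, secno)].setdefault(url_id, []).extend(positions)
    # Sort documents and positions so that every record is deterministic.
    return {
        key: sorted((url_id, sorted(pos)) for url_id, pos in docs.items())
        for key, docs in sorted(index.items())
    }

crawl_rows = [
    (2, 1, "search", [5]),       # document 2, section 1
    (1, 1, "search", [1, 4]),    # document 1, section 1
    (1, 2, "search", [9]),       # document 1, section 2: separate record
]
index = build_search_index(crawl_rows)
```

A lookup for one word in one section, such as index[("search", 1)], then yields every matching document in a single record, which is the property that makes searches with section limits cheap.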
Also, additional arrays of data are written into the table bdict:
#rec_id - a list of 32-bit document IDs
#last_mod_time - an array of 32-bit Last-Modified values (in Unix timestamp format) - for fast limiting searches by date.
#pop_rank - an array of popularity rank values, each in the 32-bit float format.
#site_id - an array of 32-bit site IDs, for GroupBySite.
#limit#name - a list of document IDs covered by a user defined limit with name "name". A separate #limit#xxx record is created for every user defined Limit configured in indexer.conf.
#ts - the timestamp indicating when indexer -Eblob was last executed, in textual representation, using the Unix timestamp format. This value is used for invalidating old queries stored in the search result cache, as well as for searches with live updates, described in the Section called Live updates emulator with DBMode=blob.
#version - a string representing the version ID of the indexer which created the search index. For example, indexer from mnoGoSearch 3.3.0 writes the string "30300". This record makes upgrades easier by letting a newer version of search.cgi recognize an older format.
Note: A fast search index can also be created for databases using DBMode=single and DBMode=multi. This is useful when search performance with those modes has become unacceptable and you need to switch to DBMode=blob quickly, without even having to re-index your Web space. Later you can switch completely to DBMode=blob in both indexer.conf and search.htm, and run indexing from the very beginning.
The disadvantage of DBMode=blob is that it does not support live updates directly. New or updated documents crawled by indexer are not visible to searches until indexer -Eblob is run again. Creating the search index takes about 6 minutes on a collection of 200000 HTML documents with a total size of 10 GB (on an Intel Core Duo 2.13GHz CPU), which can be unacceptably long for some applications (for example, on a news site, or when using mnoGoSearch as an external full-text engine for SQL tables with the help of HTDB).
Starting from version 3.3.1, mnoGoSearch emulates live updates by reading word information for new or updated documents directly from the crawler table bdicti. This makes it possible to add or update up to about 10,000 documents without having to run indexer -Eblob.
To activate live updates, add the LiveUpdates=yes parameter to the DBAddr command in search.htm.
Example:
DBAddr mysql://root@localhost/test/?DBMode=blob&LiveUpdates=yes
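A DBAddr value has the shape of a URL, with DBMode and LiveUpdates passed as query-string parameters. The sketch below takes the example value apart using Python's standard URL parser to show which part is which. It is only an illustration of the syntax and makes no claim about how mnoGoSearch itself parses the command; the dbaddr_params helper is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def dbaddr_params(dbaddr):
    """Split a DBAddr value into driver, user, host, database name
    and a dictionary of parameters."""
    parts = urlsplit(dbaddr)
    # parse_qs returns a list per key; keep the last value given.
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "database": parts.path.strip("/"),
        "params": params,
    }

info = dbaddr_params(
    "mysql://root@localhost/test/?DBMode=blob&LiveUpdates=yes"
)
print(info["params"])
```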
Starting with version 3.3.0, indexer -Eblob can be used in combination with URL and Tag limits, with the other limits described in the Section called Subsection control in Chapter 3, and with a user defined limit described by a Limit command. These limits make it possible to generate a search index over a subset of the documents collected by indexer at crawling time.
Examples:
indexer -Eblob -u %/subdir/%
indexer -Eblob -t tag
indexer -Eblob --fl=limitname
Starting with version 3.2.36, an additional command is available: indexer -Erewriteurl. When indexer is launched with this parameter, it rewrites URL data for DBMode=blob. This can be useful for rewriting URL data quickly without having to rebuild the entire search index, for example if you added the Deflate=yes parameter to DBAddr, or after running indexer -n0 -R to update the Popularity Rank.
Starting from version 3.3.0, mnoGoSearch enumerates word positions for every section separately, which makes it possible to store information about up to 2 million words per section.
In versions prior to 3.3.0, it was possible to store up to 64K words from a single document.
The single, multi and blob modes support substring search. To perform a substring search, an SQL query containing a LIKE predicate is executed internally. Substring search is usually slower than searching for a full word, especially for very short substrings. You can use the SubstringMatchMinWordLength command to limit the minimal word length allowed for substring search.
Note: When performing substring search in the multi mode, search.cgi has to iterate search queries through all 256 tables dict00..dictFF, which makes substring search especially slow. Using substring search is not recommended with DBMode=multi.
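The idea of a LIKE-based substring search with a minimal length can be sketched against a single-mode style dict table. This is an assumption-laden illustration rather than the query mnoGoSearch actually issues: it uses an in-memory SQLite database, and the substring_search helper with its min_length parameter only imitates the intent of SubstringMatchMinWordLength.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), coord INTEGER)")
conn.executemany("INSERT INTO dict VALUES (?, ?, ?)", [
    (1, "search", 1),
    (2, "research", 1),
    (3, "sea", 1),
])

def substring_search(conn, substring, min_length=3):
    """Return IDs of documents containing a word that includes substring.
    Substrings shorter than min_length are rejected (hypothetical rule)."""
    if len(substring) < min_length:
        return []
    # Escape LIKE wildcards so that the input is matched literally.
    escaped = (substring.replace("\\", "\\\\")
                        .replace("%", "\\%")
                        .replace("_", "\\_"))
    sql = ("SELECT DISTINCT url_id FROM dict "
           "WHERE word LIKE ? ESCAPE '\\' ORDER BY url_id")
    return [row[0] for row in conn.execute(sql, ("%" + escaped + "%",))]

matches = substring_search(conn, "earch")
```

A pattern with a leading wildcard generally cannot use an ordinary index on the word column, which is one reason why substring search tends to be slower than full-word search.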