Appendix A. mnoGoSearch change history

Changes in 3.3

Changes in 3.3.14 (April 02, 2013)

  • DOCX and RTF built-in parsers were added.

  • It's now possible to use the $(ConfDir), $(ShareDir), $(VarDir), $(TmpDir) template variables in search.htm, e.g.:

    
    Include $(ConfDir)/common.inc
    DBAddr sqlite3:///$(VarDir)/mnogosearch.sqlite3/
    
    Previously these variables were understood only in indexer.conf.

  • A minor fix in installation layout was made: the --docdir parameter to configure is now respected, and the HTML documentation is now installed to PREFIX/share/doc/mnogosearch/ by default. Previously --docdir was ignored, and the documentation was installed to PREFIX/doc/.

  • Files to build rpm and deb binary packages were added.

  • A few minor problems discovered by the code static analysis tools were fixed.

  • Fixed that unassigned euc-jp characters were converted to U+0000 instead of the question mark when converting to other character sets.

  • Fixed that context snippets did not work well if the CachedCopy section name was written in lower case in indexer.conf.

  • Fixed that static linking against MySQL-5.5 client library failed because of the missing -ldl linker flag.

  • Fixed a crash in search.cgi when compiled in extra debug mode with --enable-trace on a 64-bit machine.

  • Fixed that indexer failed with the error "Integer does not fit into column" on 64-bit machines when running with the OpenLink Virtuoso backend.

Changes in 3.3.13 (March 03, 2013)

  • Bug#4818 "Arbitrary Files Reading in mnoGoSearch" was fixed. This is a security bug. All users of the earlier 3.3.x releases are highly advised to upgrade.

  • Bug#4819 "Variables Overwriting in mnoGoSearch" was fixed. search.cgi was vulnerable to Cross-Site Scripting in cases when values of some empty pre-defined internal variables were replaced in the HTTP query string (e.g. search.cgi?q=test&stored=%3Cscript%3E). Now all variables coming from the query string are automatically HTML-escaped in the $(var) template format.

    The meaning of $&(var) has not changed, it still applies HTML escaping to all variables, both coming from the query string and those generated internally.
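
    A minimal Python sketch of this kind of escaping (an illustration only, not the search.cgi source; render_var is a hypothetical helper). The value from the example query string is percent-decoded and then HTML-escaped before it would be substituted into a page:

```python
from html import escape
from urllib.parse import parse_qs


def render_var(query_string: str, name: str) -> str:
    """Return the HTML-escaped value of a query-string variable ('' if absent).

    Hypothetical helper: it only mimics the idea of escaping user input
    before template substitution.
    """
    values = parse_qs(query_string, keep_blank_values=True).get(name, [""])
    # escape() replaces the ampersand, angle brackets and quotes with entities.
    return escape(values[0])


if __name__ == "__main__":
    print(render_var("q=test&stored=%3Cscript%3E", "stored"))
```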

  • Support for "Content-Type: message/rfc822" was added (*.eml and *.mht files), including multi-part messages and messages with attachments, with Content-Transfer-Encoding of types 7bit, 8bit, base64 and quoted-printable. When processing attachments, indexer can use external parsers. For example, if indexer is configured to use catdoc for the documents of the type application/msword, then indexer also executes catdoc for the attachments of this type.
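
    To illustrate what processing such a message involves, here is a Python sketch using the standard email package (not indexer's own parser; list_parts and the sample message are made up for the example). It walks a multi-part message and undoes the Content-Transfer-Encoding of each part, after which an attachment could be handed to a parser chosen by its content type:

```python
from email import message_from_bytes
from email.policy import default


def list_parts(raw: bytes):
    """Yield (content_type, decoded_bytes) for every non-container part."""
    msg = message_from_bytes(raw, policy=default)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding.
        yield part.get_content_type(), part.get_payload(decode=True)


# Made-up sample: a plain text body plus a base64-encoded attachment.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: multipart/mixed; boundary=XX\r\n"
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--XX\r\n"
    b"Content-Type: application/msword\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"ZG9jIGJ5dGVz\r\n"
    b"--XX--\r\n"
)

if __name__ == "__main__":
    for ctype, body in list_parts(RAW):
        print(ctype, body)
```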

  • Bug#4803 "buffer overflow detected with search.cgi" was fixed.

  • Fixed that search.cgi did not use HTML entities (&lt;, &gt; and &amp;) to escape special characters when displaying cached copy for a document of type text/plain.

  • Fixed that the "dm" search parameter did not work in some cases.

  • Fixed that the "su" search parameter (user defined order) was not taken into account by search query cache, thus wrong cache hits were returned in some cases.

  • Bugs in synonym processing were fixed: a word form generated from synonyms could be searched twice in the database; bad memory access when using "ComplexSynonyms yes".

  • A memory leak bug was fixed in the code producing word forms from an Ispell dictionary.

  • Improved compatibility with the latest versions of PostgreSQL. Escaping of SQL character literals for PostgreSQL >= 90000 was changed from the C-like style (using backslash) to the standard SQL style.
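
    The difference between the two escaping styles can be sketched in Python as follows (both function names are hypothetical; this is not the mnoGoSearch code and covers only quotes and backslashes):

```python
def escape_c_style(value: str) -> str:
    """Backslash-escape backslashes and single quotes (C-like style)."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def escape_sql_standard(value: str) -> str:
    """Double every single quote; backslashes stay literal (standard SQL)."""
    return value.replace("'", "''")


if __name__ == "__main__":
    word = "o'neil"
    print("'" + escape_c_style(word) + "'")
    print("'" + escape_sql_standard(word) + "'")
```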

  • Data type for the column "url.url" for Firebird was changed from varchar(127) to varchar(247), which is the longest indexable varchar in a Firebird database with page_size=1024 (bug#2125).

  • Fixed that mnoGoSearch did not work with Mecab (Japanese segmenter) dictionaries encoded in utf-8.

  • Fixed that indexer silently ignored the -u values (URL limits) longer than 64 characters (bug#4800, bug#4689).

  • Fixed a bug in the code handling excerpts. It did not work well in cases when the context following a highlighted word has no space characters and ExcerptPadding ends in the middle of the next highlighted word. The entire excerpt was erroneously highlighted in such cases.

  • Fixed that the PHP extension module did not compile with PHP-5.4 (bug#4808).

  • Fixed a crash in indexer that happened when processing a message/http response in combination with an external parser returning an empty response.

  • Bug#4806 "Command Proxy without argument" was fixed.

  • Bug#4359 "buffer overflow when doing using -Ewordstat option" was fixed.

  • A few compilation warnings on 64bit platforms were fixed.

  • Bug#4722 "Error messages display directly on web page" was fixed.

  • Bug#4718 "sqlite3 driver: (1) cannot start a transaction within a transaction" was fixed.

  • Fixed that in case of LiveUpdates=yes search.cgi erroneously printed the error "word index not found" when the query produced no results.

  • A few dead links in the Section called External parsers for the most common file types in Chapter 5 were fixed.

  • Bug#4814 PATCH to add libwpd parser configuration for indexing WordPerfect docs.

  • Bug#4817 PATCH to add libwps (*.wps) parser configuration for MS Works docs.

  • autoconf warnings were fixed. MySQL client library detection was improved for OS X Lion.

Changes in 3.3.12 (December 15, 2011)

  • An SQL injection that happened because of weak control of valid characters in host names in hypertext links was fixed. The injection was possible with the databases supporting multiple statements in a single SQL query: with MySQL (when ClientMultiStatement=yes option is enabled in DBAddr) as well as with PostgreSQL.
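
    As an illustration of stricter host name checking, the Python sketch below accepts only letters, digits and hyphens in each label. The exact character set that mnoGoSearch accepts is not stated here, so both the rule and the name is_valid_hostname are assumptions:

```python
import re

# Assumed rule: each label is 1..63 letters, digits or hyphens and does not
# start or end with a hyphen.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(host: str) -> bool:
    """Return True if every dot-separated label matches the assumed rule."""
    if not host or len(host) > 253:
        return False
    labels = host.rstrip(".").split(".")
    return all(LABEL.fullmatch(label) for label in labels)


if __name__ == "__main__":
    print(is_valid_hostname("www.example.com"))
    print(is_valid_hostname("x';DROP TABLE url;--.example.com"))
```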

  • A new search query syntax for range search was added. For example,

    [jan TO john]
    will find documents having words in the range between jan and john. The range search operators can be used in combination with the other operators (e.g. phrase search, restricting search word to a section, etc).

  • A new search.htm command UseRangeOperators was added to activate range search operators (which are disabled by default).

  • A new option decimal was added to the Section command. Words of the sections marked with this option are treated as decimal numbers. This, for example, allows numeric range search for the given section. The query:

    title:t-shirt price:[10.1 TO 200]
    will find documents having t-shirt in title and with price in the range 10.1 to 200. See the Section command description for how to mark a section as decimal.
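
    The following Python sketch shows why a decimal section matters for range search (in_range is a hypothetical helper, and inclusive bounds are an assumption; the sketch does not reproduce how search.cgi evaluates ranges). Ordinary words compare as strings, while decimal values compare numerically:

```python
import re
from decimal import Decimal

RANGE_RE = re.compile(r"\[(\S+) TO (\S+)\]")


def in_range(term: str, value: str, decimal: bool = False) -> bool:
    """Check value against a '[low TO high]' term (inclusive bounds assumed)."""
    match = RANGE_RE.fullmatch(term)
    if match is None:
        raise ValueError("not a range term: " + term)
    low, high = match.groups()
    if decimal:
        # Numeric comparison: 99.5 lies between 10.1 and 200.
        return Decimal(low) <= Decimal(value) <= Decimal(high)
    # String comparison: "99.5" sorts after "200".
    return low <= value <= high


if __name__ == "__main__":
    print(in_range("[jan TO john]", "jim"))
    print(in_range("[10.1 TO 200]", "99.5", decimal=True))
    print(in_range("[10.1 TO 200]", "99.5"))
```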

  • --help command line indexer option was added as a synonym for -h.

  • A description of how to use the pdftohtml converter was added into indexer.conf-dist and into the manual.

  • Fixed that indexer allowed malformed URLs containing non-ASCII characters in host names, which led to SQL errors on attempt to insert a malformed URL into the database, for example: PQexec: ERROR: invalid byte sequence for encoding "UTF8": 0xbf.

  • A bug in udm-config was fixed. Due to this bug, linking of the mnoGoSearch PHP module failed with the error cannot find -lmnogosearch.

  • Fixed that the Firebird (Interbase) API did not work with SQL_LONG data type correctly on x86_64 platforms.

  • UDM_MAXTIMESTRSIZE constant was changed from 35 to 64, as strftime() can return a result longer than 35 characters on some operating systems (e.g. AIX). Too small constant value led to a wrong or a zero value in the $(Last-Modified) template variable on AIX.

  • Fixed that Microsoft SQL Server driver did not work with database names consisting only of digit characters.

  • Compilation problems when building using --without-pthreads were fixed.

  • A compilation problem that happened on AIX5/AIX6 because of wrong thread compiler flags was fixed.

  • A compilation problem with Sybase client library on 64-bit Linux platforms was fixed.

  • Bug#4704 "Indexing various binaries as XML" was fixed.

  • Bug#8299 "Wrong score when UserScore gets 0 and UserScoreFactor is set" was fixed.

Changes in 3.3.11 (January 27, 2011)

  • Bug#4346 "QCache does not differentiate on Sections" was fixed. ${sl.*} was added into default QueryCacheID format.

  • The <!--variables--> search template section is now executed earlier, so tmplt, Locale and StdoutBufferSize can be set dynamically.

  • Bug#4256 "Install include headers fails due to duplicated entry of udm_http.h" was fixed.

  • GNU-style long options are now understood. For example, indexer --rewritelimits is a synonym for the old command indexer -Erewritelimits. The new long options are intended to replace the old -Exxx options. See indexer -? for the list of the new options.

  • A few performance improvements in handling information in the server and srvinfo tables were made.

    • The tables are now updated in transactions (or under locking, e.g. in MySQL).

    • The default values for match_type, case_sense, nomatch, follow, Method are not written to the server and srvinfo tables any more.

    • The tables are not updated if no changes in indexer.conf have been made since last indexer startup.

    These improvements make indexer start about 3-12 times faster (depending on database software) with an indexer.conf file having 4096 Server commands.

  • SQL scripts to create tables for MySQL now have the ENGINE=MyISAM option, to address the default storage engine change in MySQL version 5.5.

  • Fixed a Valgrind warning when a template variable didn't end with right parenthesis properly, e.g. $(name.

  • Non-standard RSS tags inside the <item> tag can now be parsed when defined using a Section command. Cluster XML search results can also transfer the non-default user section values to the front-end point.

  • Fixed that the default PHP frontend (php/index.php) did not highlight non-Latin searched words when displaying cached copies.

  • The --enable-fhs-layout option to configure is now available, to build and install with a layout that better suits the Filesystem Hierarchy Standard. When --enable-fhs-layout is given, mnoGoSearch installs:

    • indexer into /bin rather than /sbin (as indexer is not really limited to be run by the superuser only).

    • language map files into /share/langmap/*.lm rather than /etc/langmap/*.lm.

    • stopword files into /share/stopwords/*.sl rather than /etc/stopwords/*.sl.

    • synonym files into /share/synonym/*.syn rather than /etc/synonym/*.syn.

    • frequency files (for Asian word segmenters) into /share/freq/*.freq rather than /etc/*.freq.

    • SQL scripts into /share/create/dbname/*.sql rather than /share/dbname/*.sql.

    For backward compatibility, the traditional mnoGoSearch layout is created when --enable-fhs-layout is not given.

  • Correct path to MySQL client library is now detected by configure on 64-bit Linux platforms.

  • Oracle's 11g client include and library layout is now detected by configure on 64-bit Linux platforms.

  • An error message is now displayed when indexer can not find a create or drop SQL script. Earlier indexer exited silently.

  • Fixed that a few files from the msearch-test directory were not included into the distribution, so make check did not work outside mnoGoSearch CVS tree.

Changes in 3.3.10 (November 23, 2010)

  • Content-Length is now stored into the database for FTP protocol when CheckOnly access method is used. Previously it was set to 0.

  • mconv conversion utility improvements were made. Input buffer size was changed to 1Mb to avoid breaking apart multibyte characters when processing a file with long lines. -x command line option was added to display characters that can not be converted to the target character set using hexadecimal SGML entities (e.g. &#x123;).
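
    The effect of the -x option can be imitated in Python with a custom codec error handler (a sketch of the idea only, not mconv's implementation; the handler name hexentity is made up). Characters that the target character set cannot represent are replaced with hexadecimal SGML entities:

```python
import codecs


def hex_entity_errors(exc):
    """Replace each unencodable character with a hexadecimal SGML entity."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#x%X;" % ord(ch) for ch in bad)
    # Resume encoding right after the characters that were replaced.
    return replacement, exc.end


# "hexentity" is an arbitrary name chosen for this example.
codecs.register_error("hexentity", hex_entity_errors)

if __name__ == "__main__":
    # U+00E9 exists in latin-1, U+0123 does not.
    print("caf\u00e9 \u0123".encode("latin-1", errors="hexentity"))
```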

  • zh-hans.utf8.lm and zh-hant.utf8.lm language maps were added to detect Simplified and Traditional Chinese in UTF-8.

  • Displaying Cached copies now works with mnoGoSearch PHP extension module.

  • Date formats "1997-07-16T19:20:30+01:00" and "1997-07-16T19:20:30-01:00" are now understood in protocol headers and when parsing XML files.
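
    For reference, both formats carry an explicit UTC offset. A short Python sketch (standard library only, Python 3.7 or later; parse_date is a hypothetical name) shows what the two sample values mean:

```python
from datetime import datetime, timezone


def parse_date(text: str) -> datetime:
    """Parse a date such as 1997-07-16T19:20:30+01:00 into an aware datetime."""
    return datetime.fromisoformat(text)


if __name__ == "__main__":
    for text in ("1997-07-16T19:20:30+01:00", "1997-07-16T19:20:30-01:00"):
        parsed = parse_date(text)
        # Show the local value, its offset, and the same instant in UTC.
        print(parsed, parsed.utcoffset(), parsed.astimezone(timezone.utc))
```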

  • The Sitemap Protocol is now supported to fetch the list of URLs available for crawling on a website. A new command UseSitemap was added to specify whether to use Sitemap Protocol (yes by default).
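
    A sitemap is an XML document listing URLs in <loc> elements. The Python sketch below extracts them with the standard library (an illustration of the file format, not of indexer's code; sitemap_urls and the sample document are made up):

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap Protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str):
    """Return the <loc> values of all <url> entries in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


# Made-up sample document.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2010-11-23</lastmod>
  </url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE):
        print(url)
```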

  • Fixed that binding of integer parameters in SQL driver didn't work well on 64-bit platforms which might cause indexer -Eblob failures.

  • Fixed that ctlib (Sybase and Microsoft SQL Server client library) driver returned CS_ROW_FAIL error on float numbers with high precision.

  • QCache=yes and Suggest now work with Microsoft SQL Server.

  • Running multiple indexer crawling processes is now possible with Microsoft SQL Server. indexer uses the (TABLOCKX) table hint when fetching targets from the database to avoid crawling of the same documents by multiple indexer instances. Previously running multiple crawling processes was possible only with MySQL, PostgreSQL and Oracle.

  • indexer -Eblob now runs in non-locking mode (i.e. without search down time) when working with Microsoft SQL Server. Previously non-locking re-indexing was possible only with MySQL, PostgreSQL and Oracle.

  • Bug#4220 "one-character HTML titles are not indexed" was fixed.

  • Bug#501 "HoldBadHrefs don't work" was fixed.

  • indexer now respects the -D command line option when dumping data. For example, indexer -Edumpdata -D2 will dump data from the second DBAddr command in indexer.conf.

  • New grouping mode GroupBySite rank was added.

  • MonetDB and OpenLink Virtuoso databases are now supported.

  • Bug#3963 "SQL injection possible with tag and URL parameter" was fixed. Protection against SQL injection in other search parameters was improved.

  • New parameter MultiInsert=yes was added to the DBAddr command, to enable inserting of multiple records in a single INSERT SQL statement when running "indexer -Eblob" with a MySQL database.

  • Verbose output was improved to give more information about time spent on various indexing steps (when running indexer -Eblob) and search steps (when running search.cgi), to make performance bottlenecks easier to find.

  • Minor indexer -Eblob performance improvement was made: more records are now inserted per single prepared statement call.

  • An XSS problem was fixed: typing this URL in Internet Explorer address bar popped up an alert message window:

    
http://localhost/cgi-bin/search.cgi?q=who>"><script>alert(123)</script><"=Search!
    

  • It's now possible to specify an alternative name for the bdict table using the bdict parameter to DBAddr:

    
DBAddr mysql://root@localhost/test/?bdict=bdict_name
    
    This can be used to build multiple search indexes in the same database, for example, using different subsection filters.

  • Bug#3792 "URL limit for htdb causes inefficient SQL query" was fixed.

  • Bug#3806 "wrong usage of memset function" was fixed.

  • The UserScore command can now be written in indexer.conf, so its result is cached in the database during indexer -Eblob time to be used by search.cgi at search time. This improves performance in case of a complex SQL query given in UserScore.

  • Tika MSWord-to-text converter configuration instructions were added into the manual.

Changes in 3.3.9 (29 October 2009)

  • DBAddr now understands the ClientMultiStatements=yes parameter when connecting to MySQL, which makes it possible to use stored procedure calls in UserScore and UserSiteScore commands.

  • A bug was fixed: a missing <link>...</link> tag made indexer crash when parsing a broken RSS file.

  • Fixed that indexer -Eblob wrote unsorted URL data into table bdicti, which made search.cgi return "no documents found" and other kinds of unexpected results when running with GroupBySite=yes.

  • mnoGoSearch now uses prepared statements when working with PostgreSQL (API functions PQprepare(), PQexecPrepared()).

  • When compiled with MySQL >=4.1 client library and connecting to an older MySQL server (without native PS API), mnoGoSearch now automatically disables prepared statements.

  • Bug#3789 was fixed. russian.dict from Lebedev's Ispell package was incorrectly detected as mnoGoSearch hash file. More accurate mnoGoSearch ispell hash file detection was made.

  • Improvements in word distance calculation were made. If a distance between two searched words is longer than 64, then this distance is considered to be equal to 64.

  • The ${total} variable is now available in UserCacheQuery.

  • search.cgi now does not execute an empty UserCacheQuery query.

  • It's now possible to use dot (.) and dash (-) characters as separators in wf vector for easier readability. For example, wf=FFFF-9999-2221 is now the same as wf=FFFF99992221.
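
    In other words, the separators carry no meaning. A one-function Python sketch of that equivalence (normalize_wf is a hypothetical name, not part of mnoGoSearch):

```python
def normalize_wf(wf: str) -> str:
    """Drop the purely cosmetic '.' and '-' separators from a wf vector."""
    return wf.replace("-", "").replace(".", "")


if __name__ == "__main__":
    print(normalize_wf("FFFF-9999-2221"))
```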

  • indexer now understands the --exec command line parameter to execute a single SQL query, for example:

    
indexer -Esql --exec="select url from url"
    

  • indexer -Esqlmon now understands the -D parameter to connect to a certain database in multi-database environment. For example:

    
indexer -Esqlmon -D2
    
    connects to the database specified in the second DBAddr command in indexer.conf.

  • ServerTable HOWTO was added into the manual. See the Section called ServerTable in Chapter 3.

  • Fixed that search.cgi didn't work with ps values larger than 500 because of query buffer overflow.

  • The ResultsLimit search.htm command now understands 0 as unlimited number of results displayed to the user.

  • Fixed a bug that some documents with very low score could be excluded from search results.

  • New recursive and final synonym modes were added. See the Section called Synonyms in Chapter 10.

  • indexer -Eblob now uses a temporary table bdict_tmp and renames it to bdict after the search index is ready with PostgreSQL 8.2.4, Oracle, SQLite and IBM DB2. This reduces search service down time when recreating the search index to less than a second. Previously, a temporary table was used only with MySQL.

  • indexer -Ecrawl was added as a synonym for indexer -Eindex. The latter is now deprecated and will eventually be removed.

  • Fixed that mnoGoSearch did not work with Sybase via ODBC. The call for SQLSetConnectOption(hDbc, SQL_AUTOCOMMIT, SQL_AUTOCOMMIT_ON) expected only SQL_SUCCESS to be returned, while SQL_SUCCESS_WITH_INFO is also possible with Sybase.

  • A minor crawling performance improvement was made: indexer could execute empty SQL transactions in some cases. Now it does not.

  • A minor crawling performance improvement was made: indexer does not send DELETE FROM dict WHERE url_id=xxx, DELETE FROM urlinfo WHERE url_id=xxx SQL queries when crawling a document for the very first time.

  • A minor crawling performance improvement was made: indexer now does not send DELETE FROM links WHERE... SQL queries when CollectLinks is set to no.

  • The special purpose section User.Date now understands Unix Timestamp format. For example:

    
<meta name="Date" content="1104537600">
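
    For reference, a Unix timestamp counts seconds since 1970-01-01 00:00:00 UTC. A short Python sketch (timestamp_to_utc is a hypothetical name) converts the sample value into a calendar date:

```python
from datetime import datetime, timezone


def timestamp_to_utc(content: str) -> datetime:
    """Interpret a string of digits as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(content), tz=timezone.utc)


if __name__ == "__main__":
    # The sample value corresponds to midnight UTC on 1 January 2005.
    print(timestamp_to_utc("1104537600").isoformat())
```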
    

  • A description of the offs search parameter was added into the manual.

  • Fixed that the offs search parameter did not work in cluster mode.

  • A new search.htm command IDFFactor was added to diminish the weight of words that occur very frequently in the document collection (such as the) and increase the weight of words that occur rarely.
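
    The general idea can be shown with the textbook inverse document frequency, log(N / df). This Python sketch is only an illustration: the formula mnoGoSearch actually applies, and how IDFFactor scales it, are not described here.

```python
import math


def idf(total_docs: int, docs_with_word: int) -> float:
    """Textbook inverse document frequency: log(N / df).

    Assumption: shown for illustration, not the mnoGoSearch formula.
    """
    return math.log(total_docs / docs_with_word)


if __name__ == "__main__":
    # A word found in nearly every document gets a weight near zero,
    # a rare word gets a much larger one.
    print(idf(1000, 990))
    print(idf(1000, 5))
```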

  • WordDensityFactor now works more smoothly.

  • A new query language syntax was added to set importance for individual query words, for example: importance200:star importance200:wars importance10:movie.

  • phrase-to-word and phrase-phrase synonym types are now supported. A new search.htm command ComplexSynonyms was added.

  • If indexer -qq is given, then indexer starts up even faster than with indexer -q. Additionally, indexer -qq does not synchronize the Server and Realm commands found in indexer.conf with the content of the table "server". It can be useful for those having a complex indexer.conf file with many Server / Realm commands.

  • The SyslogFacility command now understands none as a possible value, which means suppress logging to syslog.

  • Improvements in word distance calculation were made. Now search additionally detects if all query words appear in a frame with length 2*number_of_query_words words. For example, documents with text fragments:

    
    .... w1 . w2 . w3  ....
    .... w1 . . w2 w3  ....
    .... w1 w2 . . w3  ....
    
    are now ranked better.
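
    The frame test can be sketched in Python as a sliding window (an illustration under the assumption that a frame means consecutive word positions; words_in_frame is a hypothetical name and this is not the ranking code itself):

```python
def words_in_frame(doc_words, query_words):
    """True if some window of 2 * len(query_words) consecutive document
    words contains every query word."""
    wanted = set(query_words)
    size = 2 * len(query_words)
    # A document shorter than the frame is checked as one single window.
    last_start = max(1, len(doc_words) - size + 1)
    for start in range(last_start):
        if wanted <= set(doc_words[start:start + size]):
            return True
    return False


if __name__ == "__main__":
    query = ["w1", "w2", "w3"]
    print(words_in_frame("a b w1 x w2 x w3 c".split(), query))
    print(words_in_frame("w1 x x x x x w2 x w3".split(), query))
```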

  • indexer -Eblob now does not put information about popularity rank into search index if all pop_rank values are empty (i.e. when indexer -R has never been run). This slightly improves performance.

  • Tools to dump and restore search databases were added. An article on how to add a cluster node with the help of the dump/restore tools was added into the manual.

  • indexer now can limit documents by seed: indexer --seed=10, where seed is a value in the range 0..255. Seed range is also understood: indexer --seed=10-20. You can use seed limit, for example, to distribute crawler work through the week: run indexer --seed=0-36 on Sundays, indexer --seed=37-75 on Mondays, ..., indexer --seed=220-255 on Saturdays.
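
    One way to derive such a weekly schedule is to split the seed space 0..255 into seven contiguous, near-equal ranges. The Python sketch below does that (seed_ranges is a hypothetical helper; the boundaries it produces are one possible split and need not match the ranges quoted above):

```python
def seed_ranges(parts: int, seeds: int = 256):
    """Split 0..seeds-1 into `parts` contiguous, near-equal (low, high) ranges."""
    base, extra = divmod(seeds, parts)
    ranges, low = [], 0
    for i in range(parts):
        # The first `extra` ranges take one additional seed value each.
        size = base + (1 if i < extra else 0)
        ranges.append((low, low + size - 1))
        low += size
    return ranges


DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

if __name__ == "__main__":
    for day, (low, high) in zip(DAYS, seed_ranges(len(DAYS))):
        print(f"{day}: indexer --seed={low}-{high}")
```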

  • SQL monitor tool (indexer -Esql) now reports line numbers when displaying errors.

  • A description of MS Word 2007 *.docx external parser was added into the manual.

Changes in 3.3.8 (13 February 2009)

  • The UserOrder command was added to indexer.conf to create helper data for fast ordering by a user defined section.

  • New implementation of search results cache was added. Use the QCache=yes parameter to DBAddr to activate the new search result cache. Along with fast retrieval of the results for cached queries, the new search result cache also supports the "search in found" feature. Also, the new search result cache does not need manual cleaning after the search index is recreated.

  • A new "-D number" command line option is now understood by indexer. This option makes indexer connect only to the given database (when running a multi-database environment). For example, when started with -D2, indexer will crawl only those targets stored in the database corresponding to the second DBAddr command of indexer.conf.

  • URLSelectSkipLock - a new indexer.conf command was added. By default, indexer sends a LOCK TABLE url SQL query when fetching a set of new crawler targets from a MySQL database, to prevent multiple indexer instances from crawling the same URLs. As fetching targets can take up to a few seconds in a huge database, it can make simultaneous search queries stall and wait for indexer to finish fetching targets. URLSelectSkipLock helps to avoid this kind of delay in search queries when you're running only a single instance of indexer at a time.

  • Prepared statements support for MySQL and PostgreSQL client-server protocols was added, which slightly improves crawling and indexing performance. Use the ps=yes parameter to DBAddr to activate using of prepared statements.

  • indexer -Eblob now doesn't put information about the URLs not matching the command line filters (e.g. on status or URL), for performance purposes.

  • A new command CrawlerThreads is now understood in indexer.conf to specify the number of crawler threads to start by default when -N command line option is not given to indexer.

  • Fixed that search.cgi and indexer crashed in some cases when using UserScore on a 64-bit platform.

  • Fixed that "indexer -a -f urllist.txt" incorrectly used "prefix match" instead of "exact match" when marking documents with the URLs listed in the given file as expired.

  • A new AddEncoding command is now understood in indexer.conf. AddEncoding associates file names and extensions with Mime content encoding types. For example:

    AddEncoding gzip *.html.gz

  • A new CrawlDelay indexer.conf command was added. CrawlDelay sets the number of seconds to wait between subsequent requests to the same server.
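
    The behaviour of such a delay can be sketched in Python as a small per-host rate limiter (an illustration only; CrawlDelayer is a hypothetical class, and treating the URL's host part as the server is an assumption):

```python
import time
from urllib.parse import urlsplit


class CrawlDelayer:
    """Keep at least `delay_seconds` between requests to the same host."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the previous request

    def wait(self, url):
        """Sleep if needed before requesting `url`; return the seconds slept."""
        host = urlsplit(url).netloc
        now = self.clock()
        previous = self.last_request.get(host)
        waited = 0.0
        if previous is not None and now - previous < self.delay:
            waited = self.delay - (now - previous)
            self.sleep(waited)
        self.last_request[host] = now + waited
        return waited


if __name__ == "__main__":
    delayer = CrawlDelayer(0.2)
    for url in ("http://a.example/1", "http://a.example/2",
                "http://b.example/1"):
        print(url, "waited", round(delayer.wait(url), 2))
```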

  • search.cgi now understands -d command line argument to pass a template file name to load.

  • $base64(varname) template format was added to print content of a variable using base64 encoding.

  • Better score results for documents containing full long search phrases.

  • A new command UseLocalCachedCopy was added into search.htm. Use this command to generate excerpts and "Cached Copy" documents from the original copy of a document when indexing local file system. This command helps to avoid storing of cached copies in the database and thus makes the database smaller.

  • A new command LoadURLBasicInfo was added into search.htm. Use this command to improve performance in case when you only need to output document IDs and score values in search results and don't need the other basic information about the documents such as URL, last modification date, document size (and other columns from the table "URL").

  • A new command UserSiteScore is now understood in search.htm. It works similarly to UserScore, but sets the desired score factor for the entire host name instead of a single document.

  • New phrase segmenter types cjk and cjk-phrase were added to make mnoGoSearch suit East-Asian languages better.

  • search.cgi automatically switches to cjk-phrase segmenter mode when m=phrase query string parameter is given and segmenter mode is cjk.

  • A new template variable $(SEGMENTER.QUERY_STRING) was added. This variable contains the query string after being processed by segmenter and can be used for testing and debugging the search template.

  • Fixed that mnoGoSearch PHP extension module could crash when using Ispell files for stemming.

  • It's now possible to use Spell and Affix commands in PHP extension module's function udm_set_agent_param_ex().

  • A new DBAddr parameter step=number was added. It makes indexer load not more than "number" documents at the same time when creating fast search index. Use this command if you have a huge database which doesn't fit into RAM, or you want to limit memory consumed by indexer -Eblob. This parameter can also be useful when using mnoGoSearch with a MySQL server with a small max_allowed_packet value.

  • Fixed that indexer and search.cgi crashed in some cases due to wrong use of OCILobGetLength() in the code.

  • The DBAddr parameter setnames=charsetname is now understood when connecting to MySQL and PostgreSQL via ODBC.

  • Bug#3773 "Phrase excerpts do not work in some cases" was fixed.

  • Bug#3791 "indexer exits even though content encoding is known" was fixed.

  • A new UserCacheQuery indexer.conf command was added to store search results in the database. This can help to use mnoGoSearch as an external full-text solution together with your database driven application.

  • Improvements in ExcerptPadding were made to generate nicer excerpts from East-Asian texts.

  • A new DBAddr parameter Compress=yes was added to activate compression in MySQL's client-server protocol by using mysql_options(&mydb->mysql, MYSQL_OPT_COMPRESS, 0) API's call. This option improves indexing and search performance when connecting to a MySQL server on a remote host.

  • Fixed that search.cgi tried to load a sorting section even when "su" search parameter was empty, which caused performance problems.

  • configure --with-docs now automatically searches for the DocBook's catalog in a number of common locations. Previously one had to fix the file doc/catalog manually to build the HTML manual.

  • Log level for "Unsupported Content-Type" and "Unsupported Content-Encoding" messages was changed from ERROR to WARNING to avoid excessive flood of syslog logs when running indexer with -l1 command line option.

  • A description of how to use rtfx (a nice RTF-to-XML converter) was added into the manual and indexer.conf-dist.

  • The HIGH_PRIORITY keyword was added into more SQL SELECT queries sent to MySQL during search time to give a higher priority to search.cgi over the simultaneously running indexer instances.

  • indexer now adds the FIRST N clause into the SQL SELECT query when fetching crawler targets from a Firebird database. This slightly improves crawling performance, as well as offloads the Firebird server.

  • The command PagesPerScreen was added into the manual.

Changes in 3.3.7 (11 April 2008)

  • New synonym file command "Mode: return" was added. The words written on the same line in a synonym file are expanded only to the leftmost words in this mode.

  • Synonym file command "Mode: roundtrip" was added as a synonym to "Mode: reverse" to avoid ambiguity. The old version (i.e. reverse) will be removed eventually.

  • search.cgi now can work as an inetd or xinetd service. See the Section called Running search.cgi from inetd / xinetd in Chapter 2 for details.

  • -s flag now understands status range, e.g. "indexer -s200-299" will crawl documents having status in the range 200..299.

  • C-API description was added into the manual. See Reference II, mnoGoSearch C API function reference for details.

  • A possibility to debug score values was added. See the Section called Analyzing score values in Chapter 10 for details.

  • ppthtml PPT-to-HTML parser configuration instructions were added into the manual.

  • Performance improvement: when processing wild-card patterns like *.txt or *.htm (e.g. file extensions in the AddType, Allow, Disallow commands etc), comparison code now automatically switches from "wild-card comparison" to "string ending comparison" for this type of patterns.
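
    The optimization can be sketched in Python as follows. Python's fnmatch stands in for the wild-card comparison, so its wildcard syntax (*, ? and [...]) is an assumption rather than mnoGoSearch's own, and make_matcher is a hypothetical name:

```python
from fnmatch import fnmatchcase

# Wildcard characters of the stand-in matcher (fnmatch).
WILDCARDS = set("*?[")


def make_matcher(pattern: str):
    """Return a predicate for `pattern`.

    A pattern of the form '*' + literal text is served by a plain
    string-ending comparison; anything else uses full wildcard matching.
    """
    suffix = pattern[1:]
    if pattern.startswith("*") and not (WILDCARDS & set(suffix)):
        return lambda name: name.endswith(suffix)
    return lambda name: fnmatchcase(name, pattern)


if __name__ == "__main__":
    is_txt = make_matcher("*.txt")
    print(is_txt("notes.txt"), is_txt("notes.txt.gz"))
```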

  • Performance improvements were made in creating search index ("indexer -Eblob"), which is now about 30% faster with Firebird, 80% faster with SQLite3, 60% faster with Mimer, 30% faster with Sybase ASE.

  • Minor performance improvements were made in various pieces of the sources.

  • Search now returns at most 1000 results by default, to avoid flood attacks.

  • Fixed that user-defined sections didn't respect

    <META NAME="Robots" CONTENT="NOINDEX">
    tags.

Changes in 3.3.6 (27 November 2007)

  • The default word storage mode was changed to DBMode=blob.

  • DBMode=blob now works with SQLite3.

  • Fixed that the "flags" commands in Ispell affix files were expected to start immediately after the "new line" character. Some affix files available on the Internet have leading spaces and tabs before these commands. Previously mnoGoSearch didn't read these files correctly.

  • Bug#2023 "--disable-mysql-fulltext-plugin doesn't work" was fixed.

  • Fixed that "GroupBySite=yes" didn't work with DBMode=multi correctly.

  • Search and indexing performance improvements were made.

  • search.cgi now uses less memory in DBMode=blob, especially for huge results.

Changes in 3.3.5 (17 October 2007)

  • Fixed an XSS (cross-site scripting) security problem in the default template search.htm-dist. Passing special values of the "t" query string variable to search.cgi resulted in bad code injection near the OPTION tags of the <SELECT NAME="t"> option list in extended search form.

    This problem happened only with <SELECT NAME="t"> which is inside a HTML comment in the default template. Other SELECT lists were not affected, if you didn't put them into a HTML comment.

    To prevent this problem, search.cgi was modified to understand variable references with "HTML-encoded" output format:

    
<OPTION VALUE="val" SELECTED="$&(var)">
    
    Previously only non-encoded variable references worked in OPTION tags:
    
<OPTION VALUE="val" SELECTED="$(var)">
    
    The default template search.htm-dist was modified to use HTML-encoded output format in variable references in all OPTION tags.

    After upgrading to this release, modify the existing templates by replacing all <OPTION VALUE="val" SELECTED="$(var)"> with <OPTION VALUE="val" SELECTED="$&(var)">.

  • Thread concurrency for resolving host names and processing robots.txt files was significantly improved, which makes "indexer -Nnum" work much faster when indexing multiple sites.

  • The SubstringMatchMinWordLength search.htm command was added. Thanks to Matthias Pigulla for contribution.

  • The Skip indexer.conf command was added.

  • The CaseFolding command was added to allow alternative lower case mapping for some languages (e.g. Turkish).

  • In search queries with boolean operator ~ (NOT), e.g. "usa & ~chicago", boolean operator & (AND) is not required anymore. This syntax now works as well: "usa ~chicago". search.cgi automatically assumes & before ~.
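
    The effect of the implicit & can be illustrated with a short Python sketch. It is not the search.cgi parser: the function name add_implicit_and is hypothetical, tokens are assumed to be separated by whitespace, and treating "|" as a second binary operator is an assumption made for the illustration:

```python
def add_implicit_and(query):
    """Insert "&" before a "~" term that is not already preceded by an operator."""
    operators = {"&", "|"}  # "|" is an assumption; the note above only names "&"
    out = []
    for token in query.split():
        # A NOT term that follows a plain word gets an explicit AND in front.
        if token.startswith("~") and out and out[-1] not in operators:
            out.append("&")
        out.append(token)
    return " ".join(out)
```

    With this sketch, "usa ~chicago" and "usa & ~chicago" both normalize to the same query string.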

  • The Udm_Set_Agent_Param_Ex() function in the PHP extension module now understands search.htm compatible commands:

                
       Udm_Set_Agent_Param_Ex($udm_agent, "Section body  1 1");
       Udm_Set_Agent_Param_Ex($udm_agent, "Section title 2 1");
    

  • The default value of VaryLang was changed from "en" to empty.

  • Cluster now honors the ReadTimeOut command in search.htm to skip the nodes which currently are not available, e.g. because of network problems. Previously, search waited 30 seconds before returning results if one of the nodes was unavailable.

  • Performance improvements in phrase search were made.

  • search.cgi no longer tries to find clones for a document if the value of its "url.crc32" is 0.

  • Column type of "qcache.doclist" was changed from BLOB to LONGBLOB in MySQL structure, to allow storing of longer cached results.

  • Fixed that indexer crashed in some cases when running with many threads.

  • Fixed that <!INCLUDE> didn't work when the CONTENT parameter started with a variable reference, e.g.:

                
      <!SET NAME="x" CONTENT="http://hostname/">
      <!INCLUDE CONTENT="$(x)">
    

  • Bug#1903 "$(tag) doesn't work in cluster" was fixed.

  • Bug#1959 "Confusing message "Unable to find working zlib library" on missing libdmalloc" was fixed. Configure parameter "--enable-dmalloc" was changed to "--with-dmalloc", to be able to specify non-standard dmalloc location.

  • Bug#2022 "search.cgi crashes when searching for a single word with 'Dehyphenate yes' and DBMode=blob" was fixed.

  • Fixed a bug in HTTP content negotiation: after receiving a "Vary: accept-language" response header, indexer downloaded the same URL several more times even when indexer.conf didn't specify any languages to vary (i.e. when the VaryLang command was not set or was empty).

Changes in 3.3.4 (27 July 2007)

  • mnoGoSearch now works better for huge documents. The maximum number of words collected from each document was changed from "64K words per section" to "2048K words per section". The data format in DBMode=single was changed, so users of DBMode=single have to reindex their documents from the beginning. The data format in DBMode=multi and DBMode=blob was not changed; reindexing in these modes is only necessary for huge documents (bigger than approximately 512K), to make indexer collect more words from them. The new limit makes it possible to fully index documents with text size up to about 16 MB.

  • The LoadTagInfo search.htm command was added, to make tag values available in search results using $(tag).

  • The LoadURLInfo search.htm command was added, to switch off loading extra section values from the urlinfo table for performance purposes.

  • The StripAccents yes/no command was added into indexer.conf and search.htm to make accent-insensitive searches possible with databases that do not support accent-insensitive collations. When StripAccents is set to yes, all accented letters are converted to their non-accented counterparts when writing or looking up the word index.

  • Content-Type "application/http" (an HTTP response with headers) is now understood.

  • Content-Type "application/http" now works with external parsers: if the result type of a parser is "application/http", then indexer treats the output as a full HTTP response and parses both headers and content.
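
    The conversion can be sketched in Python as follows. This is only an illustration of the idea using Unicode decomposition; the function name strip_accents is hypothetical, and the mapping mnoGoSearch applies internally may differ for individual letters:

```python
import unicodedata

def strip_accents(word):
    """Return the word with combining accent marks removed."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

    Applying the same function both when writing the word index and when looking words up is what makes the accented and non-accented spellings meet.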

  • PostgreSQL driver now understands the "setnames" DBAddr parameter to set client encoding. If a non-empty "setnames" parameter is given, PQsetClientEncoding() is executed immediately after establishing a connection to the server.

  • Fixed that highlighting didn't work in some cases when a search query contained two or more phrases.

Changes in 3.3.3 (8 May 2007)

  • Performance improvement: the "sorting results by score" step is now much faster on big results (0.01 second vs 1.00 second on results returning one million documents).

  • Performance improvement: searching for a single word is now about three times faster on big results.

  • Some indexes were added into the SQL schema to make searches with tag and category limits faster (Feature request #772).

  • Feature request #1364 "highlight collation matches" was implemented. Now when using an accent-insensitive collation (for example, latin1_general_ci with MySQL), search.cgi takes into account all word forms for excerpts and highlighting. For example, searches for French "cote" will also highlight "coté" and vice versa, if the non-exact word form generated hits.

  • MySQL driver now understands setnames parameter in DBAddr (feature request #1326).

  • MySQL driver now understands sqllogbin parameter in DBAddr (feature request #697).

  • DebugSQL parameter to DBAddr is now understood. When DebugSQL is set to yes, indexer and search.cgi print all SQL queries sent to the database. mnoGoSearch must be compiled using ./configure --with-debug ... to make this feature work.

  • MinCoordFactor and MaxCoordFactor impact is now calculated separately for each section.

  • "nwf" parameter is now understood in DBAddr string, to set its value per database.

  • "HoldBadHrefs 0" now means that unavailable documents (e.g. when the remote host is down) are never deleted from the database automatically. This improves indexing speed and is now the default behavior. Only positive HoldBadHrefs values activate automatic deletion.

  • Data type of urlinfo.sval was changed from TEXT to MEDIUMTEXT in MySQL table structure, to allow storing sections longer than 64K.

  • Bug#1733 "'indexer -Ewordstat' problem with PostgreSQL" was fixed.

  • Bug#1054 "indexer does not index html files without body tag" was fixed. A new special section with name "nobody" is now understood. If this section is configured, then indexer collects words outside the <body>...</body> tags. The default behavior is still not to index words outside these tags.

  • Bug#768 "User defined section is too short (1Kb limit)" was fixed.

  • Bug#1654 "SQLWordForms doesn't work with cluster" was fixed. Those using cluster should upgrade node.xml using the latest version of node.xml-dist.

  • Bug#1713 "Square brackets in DOCTYPE makes XML parser fail" was fixed.

  • Bug#1739 "indexer doesn't understand Content-Encoding for robots.txt" was fixed.

  • Bug#1740 "'UseRemoteContentType yes' doesn't work." was fixed.

  • Bug#1741 "'indexer.conf -Eblob -t tag' fails with 'Unknown table 's' in WHERE clause'" was fixed.

  • Fixed that indexer ignored the LogLevel command.

  • Fixed that popularity rank calculation didn't work with Interbase/Firebird. A missing column "url.shows" was added into SQL schema.

  • Fixed that phrase search didn't work in some cases (a bug since 3.3.0).

Changes in 3.3.2 (19 April 2007)

  • "ResultContentType none" is now understood to suppress printing of the "Content-Type" HTTP header by search.cgi. This is useful if you execute search.cgi from another Web application which sends HTTP headers itself.

  • The ue search.cgi query string parameter is now understood again to exclude documents with the given URL pattern from search results. This feature was broken in 3.2.x.

  • indexer now uses UDM_TMP_DIR and TMPDIR environment variables when creating temporary files (e.g. for external parsers) instead of the default /tmp.

  • Fixed that a standalone dash character was treated as a separate word with "Dehyphenate yes", so for queries like "a - b", search.cgi incorrectly searched for three words: "a", "-", "b", which never returned results in "find all words" mode.
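
    A lookup of this kind might be sketched in Python as below. The function name pick_tmp_dir is hypothetical, and the precedence (UDM_TMP_DIR first, then TMPDIR) is an assumption based only on the order in which the variables are listed above:

```python
import os

def pick_tmp_dir(env=None):
    """Choose a directory for temporary files from the environment."""
    env = os.environ if env is None else env
    # Assumed precedence: UDM_TMP_DIR, then TMPDIR, then the /tmp default.
    for name in ("UDM_TMP_DIR", "TMPDIR"):
        value = env.get(name)
        if value:
            return value
    return "/tmp"
```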

  • Fixed that "UseCookie yes" made indexer crash when fetching data from HTDB sources.

  • Fixed that excerpts generated from cached copy of TEXT files didn't work (bug since 3.3.0).

  • Bug#746 "Stopwords in a long boolean query" was fixed.

  • Bug#1016 "Indexer is selecting wrong Content-Type" was fixed.

  • Bug#1024 "Clear database limitations do not work: error ORA-01795" was fixed.

  • Bug#1044 "-Ewordstat: incorrect unicode sequence" was fixed.

  • Bug#1110 "'invalid UTF-8 byte sequence detected' when INSERT INTO dictXX" was fixed. This error happened when indexing into PostgreSQL with DBMode=multi. The "intag" column type was changed from TEXT to BYTEA in the tables "dict00".."dictFF".

  • Bug#1182 "Indexer crashes with -a -y 'content/type'" was fixed.

  • Bug#1427 "ORA-01785: maximum number of expressions in a list is 1000" was fixed.

  • Bug#1436 "Cannot run -Ewordstat, ORA-01400: cannot insert NULL" was fixed.

  • Bug#1615 "The identifier "PATH_MAX" is undefined" was fixed.

  • Bug#1641 "Documentation problem" was fixed.

  • Bug#1659 "GroupBySite doesn't work in cluster mode" was fixed.

  • Bug#1679 "search.cgi dumps core on OpenBSD 4.0 when I search for non existing word" was fixed.

  • Bug#1693 "User defined sections don't work for text/plain files" was fixed.

  • Bug#1716 "Can't limit indexer to documents matching language" was fixed.

  • Bug#1725 "Navigation doesn't work when using a single cluster node" was fixed.

  • Bug#1726 "DateFormat doesn't work in cluster" was fixed.

Changes in 3.3.1 (18 March 2007)

  • Relevancy improvement: Fixed that average word distance was considered to be very big in the case when words were found in different sections (e.g. one word in "body" and one word in "title"). Word pairs from different sections are not taken into account anymore for distance calculation.

  • Relevancy improvement: Average word distance is now calculated taking into account "wf" values for the sections - the final score is now more sensitive to word distances in the sections with higher "wf" values.

  • DBMode=blob&LiveUpdates=yes is now understood in the DBAddr parameter. If LiveUpdates=yes is specified, it's possible to crawl up to several thousand documents without fully recreating the search index by running "indexer -Eblob". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • The "text" and "html" keywords were added into the "Section" command syntax, to apply either text or HTML parser for data returned from a "simple" HTDBDoc query. This option is useful if the source SQL table stores data in HTML format. The default value is "text". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • A column named "last_mod_time" is now treated as the modification time of documents returned from "simple" HTDBDoc queries.

  • A new syntax to display N rightmost characters from a template variable was added. For example, $(URL:-10). Thanks to Eggert Ehmke for the idea and the original patch.
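
    For readers who think in terms of string slicing, $(URL:-10) corresponds to taking the last ten characters of the value. A small Python equivalent is shown below; the function name rightmost is hypothetical, and the handling of a non-positive count is an assumption, since the note above does not describe it:

```python
def rightmost(value, count):
    """Return the last `count` characters of `value`.

    A value shorter than `count` is returned whole. Returning an empty
    string for a non-positive count is an assumption of this sketch.
    """
    return value[-count:] if count > 0 else ""
```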

  • Performance improvements in score calculation with non-empty "nwf" parameter were made.

  • Fixed that "simple" HTDBDoc queries didn't work with Interbase/Firebird, because the driver returned empty column names.

  • Fixed a bug which made search.cgi crash when generating a link to "cached copy" with a template having multiple DBAddr commands.

  • Fixed a bug in character set conversion, which made indexer crash in rare cases.

  • Fixed that "indexer -Cw" didn't empty the "bdict" table.

  • Fixed a bug in cluster code which made search.cgi crash on processing of a front-end template with "Suggest yes" when search didn't return any results.

Changes in 3.3.0 (06 March 2007)

  • Cluster support was added. A typical cluster consists of several database machines and a single front-end machine. The front-end machine receives HTTP requests from a user's browser, forwards search queries to the database machines using the HTTP protocol, receives back a limited number of top search results from every database machine (in a simple XML format based on the OpenSearch specifications), then parses and merges the results and displays them ordered by score, applying an HTML template. This approach distributes CPU-intensive and disk-intensive operations between the database machines in parallel, leaving only the simple merge and HTML template processing to the front-end machine. As of version 3.3.0, mnoGoSearch allows joining up to 256 database machines into a single cluster.

  • node.xml-dist, an XML template for a cluster database machine, is now installed into the /etc directory.

  • "DBAddr http://hostname/search.cgi/node.xml" search.htm command was added, to specify a URL of a cluster database machine interface with XML format.
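
    The merge step performed by the front-end machine can be pictured with the following Python sketch. It is not mnoGoSearch code: the function name merge_node_results and the (score, url) tuples are hypothetical, and each node is assumed to return its hits already sorted by descending score:

```python
import heapq
from itertools import islice

def merge_node_results(node_results, limit):
    """Merge per-node hit lists and keep the `limit` best hits overall.

    Each element of `node_results` is a list of (score, url) tuples that
    is already sorted from the highest score to the lowest.
    """
    merged = heapq.merge(*node_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

    Because every node sends only its own top results, the front end merges a few short sorted lists instead of scoring the whole collection itself.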

  • "DBAddr file:///path/to/node.xml" search.htm command was added, to specify a static XML search response. This is mostly for test purposes.

  • Two cluster types were implemented: a merge cluster, which joins results from several independent databases, each created by its own indexer.conf, and a distributed cluster, which is created by a single indexer.conf, with indexer automatically distributing the search index between the database machines.

  • The default distribution type was changed from "remainder" to "quotient". Thus, for an indexer.conf with three DBAddr commands, distribution is done as follows:

    • URLs with seed 0..85 go to the first DBAddr

    • URLs with seed 86..170 go to the second DBAddr

    • URLs with seed 171..255 go to the third DBAddr

    This distribution style simplifies manual redistribution of an existing clustered database when adding a new DBAddr (i.e. a new database machine). Future releases will provide an automatic tool for redistribution when adding and deleting machines in an existing cluster, as well as more configuration commands to control distribution.
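
    One way to picture the "quotient" distribution is the Python sketch below. The formula seed * node_count // 256 is an assumption chosen because it splits the seed values 0..255 into contiguous ranges like those listed above; it is not taken from the mnoGoSearch source, and the function names are hypothetical:

```python
def quotient_node(seed, node_count):
    """Map a seed in 0..255 to a node index so that each node gets one contiguous range."""
    return seed * node_count // 256

def seed_ranges(node_count):
    """Return the (first_seed, last_seed) pair for every node, in node order."""
    ranges = {}
    for seed in range(256):
        node = quotient_node(seed, node_count)
        first, _ = ranges.get(node, (seed, seed))
        ranges[node] = (first, seed)
    return [ranges[node] for node in range(node_count)]
```

    Under this assumption, seed_ranges(3) yields the ranges 0..85, 86..170 and 171..255. A modulo-based rule would instead interleave neighbouring seeds across all machines, which is harder to redistribute by hand.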

  • The maximum number of words collected from a document was changed from 64K words per document to 64K words per section: positions are now enumerated per section, starting from the beginning of each section separately.

  • "SaveSectionSize yes/no" indexer.conf and search.htm command was added. When SaveSectionSize is set to yes, indexer stores additional information about section sizes, making it possible to generate better score values, as well as to do "exact section match" searches. Default value is "yes".

  • Relevancy improvement: "WordDensityFactor num" search.htm command was added. Num is a number in the range 0..255 to specify impact of word frequency on the result score. This feature works with "SaveSectionSize yes". The default value is 25.

  • Exact section match syntax was added:

                
    title="Apache web server"
    
    This feature works with "SaveSectionSize yes".

  • "WordFormFactor num" search.htm command was added to give more weight to the word forms originally written in the search query and less weight to generated word forms using ispell dictionaries and synonyms. Use with a number 0..255. Default value is 255. 255 means to give the same weight to the original and generated forms. 0 means maximum effect, i.e. weight for a generated word form is much smaller than weight for the original word form.

  • Performance of the excerpt generation code was improved. Excerpt generation from CachedCopy is now about 6-12% faster.

  • Using URL and Tag limits is now possible with "indexer -Eblob", e.g.:

                
    ./indexer -Eblob -u "%subdir%"
    ./indexer -Eblob -t tag
    
    This is to generate a search index over a subset of all documents collected during crawling.

  • Using "Limit" command is also possible with "indexer -Eblob", e.g.:

    indexer.conf command:

                
    Limit subdir "SELECT rec_id FROM url WHERE url LIKE '%/subdir/%'"
    

    command line:

                
    ./indexer -Eblob --fl=subdir
    

  • "ResultContentType type" search.htm command was added to specify Content-Type header generated by search.cgi. The default value is "text/html".

  • "Dehyphenate yes/no" search.htm command was added. When "Dehyphenate yes" is specified, searching for "peace-making" also will return documents having "peacemaking". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • Clone template variables were changed: clones are now returned in the same row with the document itself, using CloneN prefix, e.g.: $(Clone0.URL). The "<!--clone-->" search.htm section and the $(CL) variable are not supported anymore.

  • DetectClones is now "no" by default, for performance purposes.

  • "CollectLinks yes/no" indexer.conf command was added. The default value is "no", which improves indexing performance by not populating the "links" table. As a side effect, PopRank calculation is not possible in the default configuration. If PopRank is important for your installation, specify "CollectLinks yes" in indexer.conf.

  • Default sort order was changed from "RP" (score, then popularity) to "R" (score). This change improves search performance for the installations where PopRank is not important.

  • Indexer now honors <a rel="nofollow"> tags. Thanks to Jeff Veit for the contribution.

  • A simplified format of HTDBDoc command was added:

                
    HTDBDoc "SELECT title, body FROM docs WHERE id=$2"
    
    SQL column names are associated with "Section" names. Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • It's now possible to specify wf as a parameter for the DBAddr search.htm command, which is useful when merging two or more databases, to give a higher score to results coming from a desired database:

                
    DBAddr mysql://root@localhost/db1/?wf=FFFF
    DBAddr mysql://root@localhost/db2/?wf=1111
    DBAddr mysql://root@localhost/db3/?wf=1111
    

  • MaxResults parameter was added for DBAddr, which is useful to add a limited number of sponsored links in the top of search results:

    
    DBAddr mysql://root@localhost/avd/?wf=FFFF&MaxResults=1
    DBAddr mysql://root@localhost/db1/?wf=1111
    DBAddr mysql://root@localhost/db2/?wf=1111
    

  • $(DBOrder) template variable was added to display the original order of a document in its database result, before multiple DBAddr search results were merged into the final result. It is equal to $(Order) when using only a single DBAddr command in search.htm.

  • FOR template operator was added. Loop limits can be both constants:

    
    <!FOR NAME="a" FROM="10" TO="20">a=$(a)<!ENDFOR>
    
    and variables that were previously set, for example by the SET operator:
    
    <!SET NAME="from" CONTENT="80">
    <!SET NAME="to" CONTENT="90">
    <!FOR NAME="a" FROM="$(from)" TO="$(to)">a=$(a)<!ENDFOR>
    

  • "[no title]" is not added automatically anymore: an empty string is printed instead. One can use IF template operator to reproduce 3.2.x behaviour:

    
    <!IF NAME="title" CONTENT="">[no title]<!ELSE>$&(title)<!ENDIF>
    

  • Various indexing and search performance improvements were made.

  • Fixed that indexer didn't work with MySQL-5.1.15-GPL.

  • "indexer -?" now prints its help page to STDOUT instead of STDERR.

  • A "#version" record is now put into the table "bdict" when running "indexer -Eblob". The mnoGoSearch version ID is stored as its value. For example, mnoGoSearch 3.3.0 will store the string "30300".

  • A preliminary implementation of DBMode=rawblob in search.htm was added. This mode is designed for direct search from the table "bdicti" without having to run "indexer -Eblob" and is intended for use with small search databases as a replacement for DBMode=single. In future releases it will also be reused for real-time index updates, to avoid running "indexer -Eblob" when only a small number of documents were changed.
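
    The "30300" value for 3.3.0 suggests a layout of one major component followed by two digits each for the minor and patch components. The Python sketch below encodes that reading; the layout is an assumption inferred from this single example, and the function name version_id is hypothetical:

```python
def version_id(version):
    """Build a version ID string from "major.minor.patch".

    Assumed layout: major, then minor and patch zero-padded to two digits.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    return "%d%02d%02d" % (major, minor, patch)
```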