mnoGoSearch supports almost all known 8-bit character sets as well as the most widely used multi-byte character sets, including Korean EUC-KR, Chinese Big5 and GB2312, Japanese Shift-JIS, EUC-JP and ISO-2022-JP, as well as UTF-8. Some multi-byte character sets are not supported by default because their conversion tables are large and would increase the size of the executables. See the configure parameters to enable support for extra character sets.
mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.
Table 9-1. Supported character sets
Languages | Character sets |
Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland |
Eastern Europe: Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian |
Baltic: Latvian, Lithuanian, Estonian | CP1257, ISO-8859-4, ISO-8859-13 |
Cyrillic: Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic |
Arabic | CP864, CP1256, ISO 8859-6, MacArabic |
Greek | CP869, CP1253, ISO 8859-7, MacGreek |
Hebrew | CP1255, ISO 8859-8, MacHebrew |
Turkish | CP857, CP1254, ISO 8859-9, MacTurkish |
Japanese | Shift-JIS, EUC-JP, ISO-2022-JP |
Simplified Chinese | GB2312 |
Traditional Chinese | Big5 |
Korean | EUC-KR |
Thai | CP874, TIS 620, MacThai |
Vietnamese | CP1258 |
Indian | MacGujarati, TSCII |
Georgian | geostd8 |
Unicode: over 650 languages | UTF-8 |
mnoGoSearch can index documents in different languages into the same database. The disk space required to store search data depends on the character set that mnoGoSearch uses to store data. This character set is specified using the LocalCharset command.
indexer converts all documents to the character set specified in the LocalCharset command in indexer.conf. Internally, the conversion is implemented using Unicode.
mnoGoSearch performs character conversion in a lossless manner. Usually, conversion between different character sets can lose some data. For example, converting a text file from Greek cp1253 to Russian cp1251 loses all Greek characters. To avoid data loss, mnoGoSearch stores every character that cannot be converted directly to LocalCharset in &#nnn; notation, where nnn is the decimal Unicode code point of the character.
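The same idea can be sketched in Python. This is an illustration, not mnoGoSearch's own code: Python's built-in xmlcharrefreplace error handler writes every character the target charset cannot represent as a decimal &#nnn; reference. The function names to_local_charset and from_local_charset are hypothetical.

```python
import html

def to_local_charset(text: str, local_charset: str) -> bytes:
    """Encode text; characters missing from local_charset become decimal &#nnn; references."""
    return text.encode(local_charset, errors="xmlcharrefreplace")

def from_local_charset(data: bytes, local_charset: str) -> str:
    """Decode stored bytes and expand the &#nnn; references back into characters."""
    return html.unescape(data.decode(local_charset))

# Greek alpha (U+03B1, decimal 945) does not exist in cp1251, so it is stored
# as a reference; the Cyrillic letter is stored as a single cp1251 byte.
stored = to_local_charset("αд", "cp1251")
restored = from_local_charset(stored, "cp1251")
```

Note that html.unescape also expands named references such as &amp; that were already present in the text, so a real round trip would need to treat those separately.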
To avoid the excessive use of disk space that a large number of &#nnn; sequences can cause (each requires from 5 to 7 bytes), it is important to choose a good value for LocalCharset. If your document collection consists of documents in many scripts, such as Greek, Russian and German, UTF-8 is usually the best choice for LocalCharset.
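The effect on storage can be estimated with a short Python sketch. It assumes that unrepresentable characters are written as decimal &#nnn; references, as described above, and it measures only the encoded text, not the size of a real mnoGoSearch database. The helper name stored_size and the sample string are hypothetical.

```python
def stored_size(text: str, local_charset: str) -> int:
    """Bytes needed when characters missing from local_charset are written as &#nnn;."""
    return len(text.encode(local_charset, errors="xmlcharrefreplace"))

# Three Greek letters, three Russian letters and three Latin letters.
sample = "αβγ абв abc"

sizes = {charset: stored_size(sample, charset) for charset in ("cp1251", "cp1253", "utf-8")}
for charset, size in sizes.items():
    print(charset, size)
```

For this sample, each single-byte charset pays 6 or 7 bytes for every letter of the other script, while UTF-8 needs 2 bytes for each Greek or Cyrillic letter.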
You can specify the BrowserCharset command to choose the character set used to display search results. If BrowserCharset and LocalCharset have different values, mnoGoSearch applies character set conversion. As at indexing time, characters that cannot be converted to BrowserCharset are displayed using &#nnn; notation.
Every character set is recognized by a number of aliases. Different web servers can return the same charset using different notations. For example, ISO-8859-2, ISO8859-2 and latin2 are names of the same character set. mnoGoSearch understands the following character set name aliases:
Table 9-2. Character set aliases
ISO-2022-JP: | ISO-2022-JP |
ISO-8859-1: | CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1 |
ISO-8859-10: | CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6 |
ISO-8859-11: | ISO-8859-11, TIS-620, TIS620, TACTIS |
ISO-8859-13: | ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7 |
ISO-8859-14: | ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8 |
ISO-8859-15: | ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998 |
ISO-8859-16: | ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000 |
ISO-8859-2: | CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2 |
ISO-8859-3: | CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3 |
ISO-8859-4: | CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4 |
ISO-8859-5: | CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988 |
ISO-8859-6: | ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987 |
ISO-8859-7: | CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987 |
ISO-8859-8: | CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988 |
ISO-8859-9: | CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5 |
armscii-8: | ARMSCII-8, ARMSCII8 |
big5: | BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5 |
cp1250: | CP1250, MS-EE, WINDOWS-1250 |
cp1251: | CP1251, MS-CYRL, WINDOWS-1251 |
cp1252: | CP1252, MS-ANSI, WINDOWS-1252 |
cp1253: | CP1253, MS-GREEK, WINDOWS-1253 |
cp1254: | CP1254, MS-TURK, WINDOWS-1254 |
cp1255: | CP1255, MS-HEBR, WINDOWS-1255 |
cp1256: | CP1256, MS-ARAB, WINDOWS-1256 |
cp1257: | CP1257, WINBALTRIM, WINDOWS-1257 |
cp1258: | CP1258, WINDOWS-1258 |
cp437: | 437, CP437, IBM437 |
cp850: | 850, CP850, CSPC850MULTILINGUAL, IBM850 |
cp852: | 852, CP852, IBM852 |
cp855: | 855, CP855, IBM855 |
cp857: | 857, CP857, IBM857 |
cp860: | 860, CP860, IBM860 |
cp861: | 861, CP861, IBM861 |
cp862: | 862, CP862, IBM862 |
cp863: | 863, CP863, IBM863 |
cp864: | 864, CP864, IBM864 |
cp865: | 865, CP865, IBM865 |
cp866: | 866, CP866, CSIBM866, IBM866 |
cp869: | 869, CP869, IBM869 |
cp874: | CP874, WINDOWS-874 |
EUC-JP: | CSEUCJP, EUC-JP, EUCJP, UJIS, X-EUC-JP |
EUC-KR: | CSEUCKR, EUC-KR, EUCKR |
GB2312: | CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58 |
koi8-r: | CSKOI8R, KOI8-R, KOI8R |
koi8-u: | KOI8-U, KOI8U |
shift-JIS: | CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS |
cp367: | ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII |
UTF8: | UTF-8, UTF8 |
viscii: | CSVISCII, VISCII, VISCII1.1-1 |
MacCyrillic: | MACCYRILLIC, X-MAC-CYRILLIC |
MacRoman: | MACROMAN, MACINTOSH, CSMACINTOSH, MAC |
MacCentralEurope: | MACCENTRALEUROPE, MACCE |
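Alias resolution of this kind can be sketched as a lookup table. The Python example below uses a few rows copied from Table 9-2. The function name resolve_charset and the case-insensitive matching are assumptions made for illustration, not a description of how mnoGoSearch implements the lookup.

```python
from typing import Optional

# A small excerpt of Table 9-2: canonical name -> aliases.
ALIASES = {
    "ISO-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R", "KOI8R"],
    "UTF8": ["UTF-8", "UTF8"],
}

# Invert the table once so that every alias points at its canonical name.
LOOKUP = {alias.upper(): canonical
          for canonical, names in ALIASES.items()
          for alias in names}

def resolve_charset(name: str) -> Optional[str]:
    """Return the canonical charset name for an alias, or None if it is unknown."""
    # Assumption: aliases are compared case-insensitively.
    return LOOKUP.get(name.strip().upper())
```

With this table, resolve_charset("latin2") and resolve_charset("ISO_8859-2") both return "ISO-8859-2", and an unknown name returns None.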
indexer detects the document character set in this order:
1. Content-type: text/html; charset=xxx - HTTP response headers.
2. <META NAME="Content-Type" CONTENT="text/html; charset=xxx"> (for HTML documents) or
   <?xml version="1.0" encoding="xxx"?> (for XML documents)
   Note: Processing of the meta tags can be switched off by adding GuesserUseMeta no to indexer.conf.
3. The default value set by the RemoteCharset command for the corresponding Server or Realm command.
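The detection order above can be sketched in Python as follows. This is a simplified illustration, not indexer's actual parser: the regular expressions, the function name detect_charset and its parameters are hypothetical. The use_meta flag stands in for GuesserUseMeta and is applied only to the META tag, because the text does not say whether that command also affects XML declarations.

```python
import re
from typing import Optional

CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(r"<meta[^>]+content-type[^>]*>", re.IGNORECASE)
XML_RE = re.compile(r"<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.IGNORECASE)

def detect_charset(content_type: Optional[str], body: str,
                   remote_charset: str, use_meta: bool = True) -> str:
    """Pick a charset: HTTP header first, then META or XML declaration, then the default."""
    # 1. charset parameter of the HTTP Content-Type response header.
    if content_type:
        match = CHARSET_RE.search(content_type)
        if match:
            return match.group(1)
    # 2a. META Content-Type tag in an HTML document (can be switched off).
    if use_meta:
        meta = META_RE.search(body)
        if meta:
            match = CHARSET_RE.search(meta.group(0))
            if match:
                return match.group(1)
    # 2b. encoding attribute of an XML declaration.
    match = XML_RE.search(body)
    if match:
        return match.group(1)
    # 3. Fall back to the configured default (the RemoteCharset value).
    return remote_charset
```

A header that carries a charset parameter always wins in this sketch; the document body is inspected only when the header does not name one.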
Starting with version 3.2.0, mnoGoSearch has an automatic character set and language guesser. It currently recognizes more than 100 character set and language combinations. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called language map files, one for every language-charset pair. They are installed under the /usr/local/mnogosearch/etc/langmap/ directory by default. Look in this directory for the list of currently provided character set-language pairs.
Note: The character set and language guesser works well for texts longer than 500 characters. Shorter texts may be guessed less reliably.
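The general N-gram technique can be illustrated with a small Python sketch: each sample text is reduced to a ranked list of its most frequent character n-grams, and an unknown text is assigned to the profile whose ranking it matches most closely. The function names, the profile size of 300 and the toy samples are assumptions made for this illustration; the language map format and distance measure used by mnoGoSearch may differ.

```python
from collections import Counter
from typing import Dict, List

def ngram_profile(text: str, max_n: int = 3, top: int = 300) -> List[str]:
    """Return the most frequent character n-grams (length 1..max_n), most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc: List[str], lang: List[str]) -> int:
    """Sum of rank differences; n-grams absent from the language profile get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang)}
    return sum(abs(i - rank[g]) if g in rank else len(lang) for i, g in enumerate(doc))

def guess(text: str, profiles: Dict[str, List[str]]) -> str:
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))

# Toy training samples; real models need far more text to be reliable.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs through the field",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft durch das feld",
}
PROFILES = {name: ngram_profile(text) for name, text in SAMPLES.items()}
```

Because the rankings of a short text are unstable, a sketch like this also shows why short inputs are harder to classify than long ones.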
To build your own language map use the mguesser utility. In addition, you'll need a set of text files with the sample texts (the models) for the desired language and character set. To create a new language map, run the following command:
mguesser -p -c charset -l language < FILENAME > language.charset.lm
You can also use mguesser to guess language and character set for a document using the existing language maps. Try the following command:
mguesser [-n maxhits] < FILENAME
You may want to create map files for different character sets for the same language. To convert a model file between character sets supported by mnoGoSearch, use the mconv utility, which is part of the mnoGoSearch distribution.
mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile
By default, both mguesser and mconv utilities are installed into the /usr/local/mnogosearch/sbin/ directory.
Starting with version 3.2.14, mnoGoSearch can update the existing language and character set maps automatically during indexing, if the remote server supplies pages with correctly specified language and character set. To enable this function, specify the command
LangMapUpdate yes
in your indexer.conf.
Use the RemoteCharset indexer.conf command to choose the default character set of the sites you index.
You can also set the default language for the sites you index with the DefaultLang indexer.conf command.
Note: You can restrict search results to a specific language by using the g query string variable. See the Section called Search parameters in Chapter 10 for details.