mnoGoSearch HTML parser

mnoGoSearch 3.3.14 reference manual: Full-featured search engine software
Chapter 4. Supported file formats and mime types

Tag parser

The tag parser understands these tag attribute notations:

<tagname attribute=value ... >
<tagname attribute="value" ... >
<tagname attribute='value' ... >
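
As an illustration only (this is not mnoGoSearch code, which is written in C), the short Python sketch below uses the standard-library html.parser module to show that all three notations carry the same attribute value. The class name AttrCollector and the helper parse_attrs are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records (tag, attributes) for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with any quotes already removed
        self.seen.append((tag, dict(attrs)))


def parse_attrs(markup):
    """Parse a markup fragment and return the recorded start tags."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


samples = [
    '<tagname attribute=value>',    # unquoted
    '<tagname attribute="value">',  # double-quoted
    "<tagname attribute='value'>",  # single-quoted
]
results = [parse_attrs(sample) for sample in samples]
```

Each entry of results holds the same tag name and the same attribute dictionary, whichever quoting style was used.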

HTML entities

The indexer understands the following HTML character entities:

&lt; &gt; &amp; &nbsp; &quot;
All HTML-4 character entities: &auml; &uuml; and others.
Characters in their Unicode notation: &#234;
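
To make the three kinds of entity concrete, here is a small sketch that decodes them with Python's standard html.unescape function. It only illustrates what decoding means; it is not how the indexer is implemented, and html.unescape follows the HTML 5 entity list, which is a superset of the HTML 4 one.

```python
import html

# The basic markup-significant entities
basic = html.unescape("&lt; &gt; &amp; &quot;")

# The non-breaking space decodes to U+00A0, not to an ordinary space
nbsp = html.unescape("&nbsp;")

# Named HTML 4 character entities
named = html.unescape("&auml; &uuml;")

# Numeric (Unicode code point) notation: 234 is U+00EA
numeric = html.unescape("&#234;")
```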

META tags

The mnoGoSearch HTML parser currently understands the following special META tags. HTTP-EQUIV is accepted in place of NAME in every entry.

<META NAME="Content-Type" Content="text/html; charset=xxxx"> - This tag is used to detect the document character set when it is not specified in the Content-Type HTTP response header.
<META NAME="REFRESH" Content="5; URL=http://www.somewhere.com"> - The URL value is inserted into the database.
<META NAME="Robots" Content="xxx"> with the values ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.
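
The following Python sketch shows one way the three META tags above could be read from a page. It is a simplified illustration built on the standard html.parser module, not the parser mnoGoSearch actually uses. The names MetaSketch, first_param and read_meta are hypothetical, and the sketch assumes that the charset or URL is the first parameter after the semicolon and that Robots values are separated by commas.

```python
from html.parser import HTMLParser


def first_param(content, wanted):
    """Return the value of a 'wanted=...' parameter after the first ';'."""
    _, _, rest = content.partition(";")
    name, sep, value = rest.strip().partition("=")
    if sep and name.strip().lower() == wanted:
        return value.strip()
    return None


class MetaSketch(HTMLParser):
    """Collects the charset, refresh URL and robots values from META tags."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.refresh_url = None
        self.robots = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # NAME and HTTP-EQUIV are treated as interchangeable
        key = (a.get("name") or a.get("http-equiv") or "").lower()
        content = a.get("content") or ""
        if key == "content-type":
            self.charset = first_param(content, "charset")
        elif key == "refresh":
            self.refresh_url = first_param(content, "url")
        elif key == "robots":
            # Assumption: several values may be given, separated by commas
            self.robots.update(
                v.strip().upper() for v in content.split(",") if v.strip()
            )


def read_meta(markup):
    """Parse markup and return the populated MetaSketch instance."""
    sketch = MetaSketch()
    sketch.feed(markup)
    sketch.close()
    return sketch
```

Calling read_meta on a page and inspecting the charset, refresh_url and robots attributes gives the three pieces of information described in the list above.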

Links

The mnoGoSearch HTML parser collects links from these tags:

<A HREF="xxx">
<IMG SRC="xxx">
<LINK HREF="xxx">
<FRAME SRC="xxx">
<IFRAME SRC="xxx">
<AREA HREF="xxx">
<BASE HREF="xxx">
Note: mnoGoSearch ignores malformed BASE HREF values. When it encounters a badly formed base URL, it uses the current document URL to resolve relative links.

Links that carry the rel="nofollow" attribute are ignored by mnoGoSearch.

Example:

<a href="http://site/" rel="nofollow">
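
The link rules in this section can be sketched in a few lines of Python. This is an approximation for illustration, not mnoGoSearch's own logic. The names LinkSketch and collect_links are hypothetical. The sketch assumes that a "well-formed" base URL is an absolute URL with both a scheme and a host, and it applies the rel="nofollow" check to every link-bearing tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Tag name -> attribute that holds the link, as listed in this section
LINK_ATTRS = {
    "a": "href", "img": "src", "link": "href",
    "frame": "src", "iframe": "src", "area": "href",
}


class LinkSketch(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url  # falls back to the current URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "base":
            href = a.get("href") or ""
            try:
                parts = urlsplit(href)
            except ValueError:
                return  # unparsable base: keep the current URL
            # Assumption: only absolute URLs count as well-formed bases
            if parts.scheme and parts.netloc:
                self.base = href
            return
        attr = LINK_ATTRS.get(tag)
        if attr is None or not a.get(attr):
            return
        # Skip links marked rel="nofollow"
        if "nofollow" in (a.get("rel") or "").lower().split():
            return
        self.links.append(urljoin(self.base, a[attr]))


def collect_links(page_url, markup):
    """Return the absolute URLs of all followed links in markup."""
    sketch = LinkSketch(page_url)
    sketch.feed(markup)
    sketch.close()
    return sketch.links
```

With a page URL of http://example.com/dir/page.html and a malformed BASE HREF, a relative link such as a.html resolves against the page URL, while a rel="nofollow" link is left out of the result.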

Comments

Character data inside the <!-- ... --> tag is recognized as an HTML comment.
You can use the special-purpose comment tags <!--UdmComment--> ... <!--/UdmComment--> to hide the character data and the mark-up in between from the indexer. You may find it useful to put these tags around things such as site navigation menus, advertisement blocks, etc.
The <NOINDEX> ... </NOINDEX> tags are also understood as synonyms for <!--UdmComment--> and <!--/UdmComment-->.
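
As a rough illustration of the effect of these markers, the Python sketch below removes hidden regions and ordinary comments from a page before any text would be indexed. It is a regular-expression approximation, not the indexer's real implementation. It assumes the hiding markers are written as <!--UdmComment--> and <!--/UdmComment-->, and matching them case-insensitively is an assumption of the sketch. The name strip_hidden is hypothetical.

```python
import re

# Regions hidden from indexing: UdmComment pairs or NOINDEX pairs
HIDDEN = re.compile(
    r"<!--UdmComment-->.*?<!--/UdmComment-->"
    r"|<noindex\b[^>]*>.*?</noindex\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Ordinary HTML comments
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_hidden(markup):
    """Drop hidden regions first, then any remaining plain comments."""
    # Order matters: removing plain comments first would delete the
    # UdmComment markers and leave the hidden content in place.
    markup = HIDDEN.sub("", markup)
    return COMMENT.sub("", markup)
```

Everything between a matching pair of markers disappears from the returned text, mark-up included, while text outside the markers is left untouched.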
