mnoGoSearch 3.3.14 reference manual: Full-featured search engine software
Chapter 4. Supported file formats and mime types
The tag parser understands these tag attribute notations:
<tagname attribute=value ... >
<tagname attribute="value" ... >
<tagname attribute='value' ... >
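The three notations can be tried out with a small sketch. It uses Python's standard html.parser module, not the mnoGoSearch tag parser itself, and the names AttrCollector and parse_attrs are invented for this illustration. The point is only that an unquoted, a double-quoted and a single-quoted value all yield the same attribute.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed
        self.seen.append((tag, dict(attrs)))


def parse_attrs(markup):
    """Hypothetical helper: return [(tag, {attribute: value}), ...]."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


for sample in ('<a href=x.html>', '<a href="x.html">', "<a href='x.html'>"):
    print(sample, parse_attrs(sample))
```

Each of the three samples is reported as the tag a with the attribute href set to x.html.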
indexer understands the following special HTML characters:
&lt; &gt; &amp; &quot;
All HTML-4 character entities: &auml; &uuml; and others.
Characters in their Unicode notation: &#234;
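To see what decoding these notations means in practice, the sketch below uses html.unescape from the Python standard library. It is an illustration of the three kinds of reference (the special characters &lt; &gt; &amp; &quot;, named entities such as &auml;, and numeric references such as &#234;), not a description of how indexer implements the decoding. The helper name decode_entities is invented here.

```python
import html


def decode_entities(text):
    """Hypothetical helper: replace HTML character references with characters."""
    return html.unescape(text)


print(decode_entities("&lt; &gt; &amp; &quot;"))  # special characters
print(decode_entities("&auml; &uuml;"))           # named HTML-4 entities
print(decode_entities("&#234;"))                  # numeric (Unicode) notation
```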
The mnoGoSearch HTML parser currently understands the following special META tags. HTTP-EQUIV may be used instead of NAME in any of them.
<META NAME="Content-Type" Content="text/html; charset=xxxx"> - This tag is used to detect the document character set when it is not specified in the Content-Type HTTP response header.
<META NAME="REFRESH" Content="5; URL=http://www.somewhere.com"> - The URL value is inserted into the database.
<META NAME="Robots" Content="xxx"> - The supported values are ALL, NONE, INDEX, NOINDEX, FOLLOW and NOFOLLOW.
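The following sketch shows one way a crawler could read these META tags. It is written with Python's html.parser and is not the mnoGoSearch code. The names MetaCollector, read_meta and robots_flags are hypothetical, and the interpretation of the Robots values (NONE as NOINDEX plus NOFOLLOW, ALL as INDEX plus FOLLOW) is the conventional one, assumed here rather than taken from this manual.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects META tags keyed by NAME or HTTP-EQUIV (lower-cased)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # NAME and HTTP-EQUIV are treated as interchangeable
        key = a.get("name") or a.get("http-equiv")
        if key and a.get("content") is not None:
            self.meta[key.lower()] = a["content"]


def read_meta(markup):
    parser = MetaCollector()
    parser.feed(markup)
    parser.close()
    return parser.meta


def robots_flags(content):
    """Return (index, follow) under the conventional meaning of the values."""
    index = follow = True
    for word in content.upper().replace(",", " ").split():
        if word == "NONE":
            index = follow = False
        elif word == "ALL":
            index = follow = True
        elif word == "NOINDEX":
            index = False
        elif word == "INDEX":
            index = True
        elif word == "NOFOLLOW":
            follow = False
        elif word == "FOLLOW":
            follow = True
    return index, follow


page = '<META NAME="Robots" Content="NOINDEX, FOLLOW">'
print(robots_flags(read_meta(page)["robots"]))
```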
The mnoGoSearch HTML parser collects links from these tags:
<A HREF="xxx">
<IMG SRC="xxx">
<LINK HREF="xxx">
<FRAME SRC="xxx">
<IFRAME SRC="xxx">
<AREA HREF="xxx">
<BASE HREF="xxx">
Note: mnoGoSearch ignores BASE HREF values that are not well-formed. When it meets a badly formed base URL, it uses the current URL to compose relative links.
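The fallback described in this note can be sketched as follows. This is an illustration only: the manual does not say what mnoGoSearch treats as well-formed, so the sketch assumes that a usable base URL is an absolute URL with both a scheme and a host. The function name resolve_link is invented for the example.

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(current_url, href, base_href=None):
    """Compose an absolute URL for href.

    Assumption: a BASE HREF is usable only if it has a scheme and a host;
    otherwise the URL of the current document is used as the base.
    """
    base = current_url
    if base_href:
        try:
            parts = urlsplit(base_href)
        except ValueError:
            parts = None  # unparsable value, fall back to the current URL
        if parts is not None and parts.scheme and parts.netloc:
            base = base_href
    return urljoin(base, href)


print(resolve_link("http://site/a/b.html", "c.html", "http://other/dir/"))
print(resolve_link("http://site/a/b.html", "c.html", "not a url"))
```

The first call resolves against the BASE HREF value, the second against the current URL because the base value is rejected.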
Links that have the rel="nofollow" attribute are ignored by mnoGoSearch.
Example:
<a href="http://site/" rel="nofollow">
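As an illustration of link collection with the nofollow rule, the sketch below walks a page with Python's html.parser, takes the link attribute of each tag listed above and skips tags marked rel="nofollow". It is not the mnoGoSearch parser. The names LINK_ATTRS, LinkCollector and collect_links are hypothetical, and applying the rel check to every tag rather than only to A is an assumption of this sketch.

```python
from html.parser import HTMLParser

# Tag name -> attribute holding the link, mirroring the list above
LINK_ATTRS = {
    "a": "href", "img": "src", "link": "href", "frame": "src",
    "iframe": "src", "area": "href", "base": "href",
}


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        attr = LINK_ATTRS.get(tag)
        if attr is None or not a.get(attr):
            return
        # rel may hold several space-separated tokens
        if "nofollow" in (a.get("rel") or "").lower().split():
            return
        self.links.append(a[attr])


def collect_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


page = '<a href="http://site/" rel="nofollow">x</a><a href="/ok">y</a><IMG SRC="p.png">'
print(collect_links(page))
```

Here the first link is dropped because of rel="nofollow", while /ok and p.png are kept.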
Character data inside the <!-- .... --> tag is recognized as an HTML comment.
You can use the special-purpose comment tags <!--UdmComment--> ... <!--/UdmComment--> to hide the character data and the mark-up between them from indexer. You may find it useful to put these tags around things such as site navigation menus, advertisement blocks, etc.
The <NOINDEX> ... </NOINDEX> tags are also understood as synonyms for <!--UdmComment--> and <!--/UdmComment-->.
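The effect of these tags can be approximated with a short sketch that removes the hidden regions from a page before the remaining text is processed. It is a regular-expression illustration, not the way mnoGoSearch implements the feature. The function name strip_hidden is invented, and case-insensitive matching of the tag names is an assumption.

```python
import re

# Matches <!--UdmComment--> ... <!--/UdmComment--> and <NOINDEX> ... </NOINDEX>
# (case-insensitive matching is an assumption of this sketch)
_HIDDEN = re.compile(
    r"<!--UdmComment-->.*?<!--/UdmComment-->"
    r"|<noindex>.*?</noindex>",
    re.IGNORECASE | re.DOTALL,
)


def strip_hidden(markup):
    """Hypothetical helper: drop regions that should be hidden from indexing."""
    return _HIDDEN.sub("", markup)


page = "<p>Article</p><!--UdmComment--><ul><li>Menu</li></ul><!--/UdmComment--><p>More</p>"
print(strip_hidden(page))
```

The navigation list between the two comment tags is removed and the surrounding paragraphs are kept.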