Section {name} {number} {maxlen} [datatype] [when] [format] [cloneflag] [separator] [{source} {pattern} {replacement}]
When used in search.htm, the Section command requires only the first three parameters and activates recognition of section name references in search queries, for example:
title:word1 body:word2
See the Section called Restricting search words to a section in Chapter 10 for details. The Section command has no other purpose in search.htm. The rest of this article applies mostly to indexer.conf.
name is the section name and number is the section ID, between 0 and 255.
Use 0 if you don't want the section to be indexed.
Note: It is recommended to use a different section ID for each document part. This makes it possible to assign different weights to different document parts and to restrict a search to a section at search time.
The maxlen argument specifies the maximum length of the section that is stored in the database.
If maxlen is set to 0, the section is not stored in the database and is therefore not available at search time through the $(name) syntax in search.htm.
The datatype parameter is optional.
If the parameter is omitted, the words of this section are treated as ordinary words, i.e. they are stored and compared lexicographically.
If datatype is set to decimal, the words of this section are treated as decimal numbers with up to 9 integral digits and up to 9 fractional digits. The words of this section are stored as 18-digit words in the format IIIIIIIIIFFFFFFFFF, where IIIIIIIII is the integral part left-padded with zeros and FFFFFFFFF is the fractional part right-padded with zeros.
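The padded format can be illustrated with a short Python sketch. This is not indexer's own code; the function name encode_decimal is hypothetical, and the sketch assumes non-negative numbers written with a "." decimal point, since the text above does not say how signs are handled.

```python
def encode_decimal(word: str) -> str:
    """Encode a non-negative decimal string as an 18-digit word.

    Layout: IIIIIIIIIFFFFFFFFF (9 integral digits, 9 fractional digits).
    Assumption: no sign and "." as the decimal point.
    """
    integral, _, fractional = word.partition(".")
    integral = integral or "0"  # treat ".5" as "0.5"
    if not integral.isdigit() or (fractional and not fractional.isdigit()):
        raise ValueError(f"not a plain decimal number: {word!r}")
    if len(integral) > 9 or len(fractional) > 9:
        raise ValueError(f"too many digits: {word!r}")
    # Left-pad the integral part and right-pad the fractional part with zeros.
    return integral.rjust(9, "0") + fractional.ljust(9, "0")


print(encode_decimal("12.5"))  # 000000012500000000
print(encode_decimal("9.75"))  # 000000009750000000
```

Because every encoded word has the same width, plain string comparison of the encoded forms agrees with numeric order: "9.75" sorts after "12.5" as raw text, but its encoded form sorts before.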
when is an optional parameter defining when the section is created. The following values are possible:
afterheaders - creates the section after the HTTP headers are processed, which allows you to replace the headers returned by an HTTP server with your own values. For example, if the HTTP server is misconfigured and returns Content-Type: text/plain for documents which are in fact XML or HTML, or Content-Type: application/octet-stream for Word or Excel documents, you can overwrite the Content-Type header and thus have indexer invoke the proper external or internal parser.
afterguesser - creates the section after the character set guesser has run. A special variable ${HTTP.LocalCharsetContent}, which represents the raw document content converted to LocalCharset, is additionally available for use in the source argument. afterguesser is suitable for user-defined sections that cut pieces of text from between desired tags with the help of the source, pattern and replacement parameters.
afterparser - creates the section after extracting pieces of text from the document (i.e. after removing tags in the case of HTML or XML), and before breaking them into individual words. This is the default value for the when parameter.
format is a flag telling indexer which parser to use for the section. Two values are understood:
text - use the text parser
html - use the HTML parser
The format parameter is designed for use in combination with the simple type of HTDBDoc queries (i.e. queries consisting of a list of data columns, without full HTTP headers). The default value is text.
If your SQL table contains data in HTML format, you can specify the html option to force removal of HTML tags.
See the Section called Indexing SQL tables (htdb:/ virtual URL scheme) in Chapter 6 for details about simple HTDBDoc queries.
The cloneflag parameter is a flag describing whether the section should affect clone detection.
It can be DetectClone (or cdon), or NoDetectClone (or cdoff). By default, all url.* section values (i.e. the various URL parts) are not taken into account for clone detection, while all other sections take part in clone detection.
separator is a string that separates consecutive chunks of the same section.
User-defined sections
The source, pattern and replacement parameters can be used to extract user-defined sections.
source can include variable references using the ${VARNAME} syntax. Multiple variable references are allowed.
pattern is a regular expression specifying which parts of source should go into the section.
replacement defines how the extracted parts of source are combined into the result. replacement can contain references of the form $n, where n is a number in the range 0-9.
Every reference is replaced with the text captured by the n-th parenthesized sub-pattern. $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing sub-pattern.
# Use a combination of URL and raw body content to extract
# the host part of URL and title into the section "udef"
Section HTTP.Content 0 0
Section udef 1 256 cdoff "" "${URL}:${HTTP.Content}" "^http://([^/]*)/.*<title>(.*)</title>" "$1 $2"
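The way source, pattern and replacement interact in the "udef" example can be approximated with a short Python sketch. It is an illustration only, not indexer's implementation: the helper names are hypothetical, Python's re module stands in for whatever regular expression engine indexer uses, and the URL and document content are made-up sample values.

```python
import re


def expand_replacement(match, replacement):
    """Expand $0..$9 references using the groups captured by `match`."""
    def substitute(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ""  # assumption: a reference to a missing group is empty
        return match.group(index) or ""
    return re.sub(r"\$([0-9])", substitute, replacement)


def build_section(source, pattern, replacement):
    """Return the section value, or None when the pattern does not match."""
    # Assumption: "." may also match newlines inside the document body.
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return None
    return expand_replacement(match, replacement)


# Sample values standing in for ${URL} and ${HTTP.Content}.
url = "http://www.example.com/docs/index.html"
content = "<html><head><title>Hello world</title></head><body>text</body></html>"

source = f"{url}:{content}"
udef = build_section(source, r"^http://([^/]*)/.*<title>(.*)</title>", "$1 $2")
print(udef)  # www.example.com Hello world
```

Here $1 is the host captured by the first pair of parentheses and $2 is the title captured by the second pair, joined by the literal space in the replacement string.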
Conditional sections
The source, pattern and replacement arguments can also be used to create sections only under certain conditions:
# Create "body" only for the given host name
Section HTTP.Content 0 0
Section body 1 256 cdoff "" "${URL}:${HTTP.Content}" "^http://www.mysite.com/.*<body>(.*)</body>" "$1"
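The condition works because the URL is part of the source string: when the host does not match, the whole pattern fails and no section is produced. A minimal Python sketch of that behaviour follows. The function name body_section and the sample documents are hypothetical, and Python's re module is only an approximation of indexer's regular expression engine.

```python
import re

# The pattern from the example above, copied verbatim.
PATTERN = r"^http://www.mysite.com/.*<body>(.*)</body>"


def body_section(url, content):
    """Return the text between the body tags, or None for other hosts."""
    source = f"{url}:{content}"  # mirrors "${URL}:${HTTP.Content}"
    match = re.search(PATTERN, source, re.DOTALL)
    return match.group(1) if match else None


page = "<html><body>hi</body></html>"
print(body_section("http://www.mysite.com/a.html", page))  # hi
print(body_section("http://www.other.com/a.html", page))   # None
```

Note that the dots in the host name are unescaped, so each of them matches any single character. Writing www\.mysite\.com would make the host test stricter.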
Special purpose sections
There is a special User.Date section. It makes it possible to use a user-defined meta tag (or any other document part) as an alternative Last-Modified value. A number of widespread formats are understood:
Sun, 06 Nov 1994 08:49:37 GMT
Sun, 6 Nov 1994 08:49:37 GMT
Sunday, 06-Nov-94 08:49:37 GMT
Sun Nov 6 08:49:37 1994
1994-11-06
06.11.1994
1104537600 -- Unix timestamp
When User.Date is defined, the Last-Modified HTTP header is ignored, and the document modification time is taken from User.Date instead. This can be useful when indexing dynamic documents.
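To check in advance whether a date string on your pages falls into one of the listed formats, something like the following Python sketch can help. It is not indexer's date parser: the function name parse_user_date is hypothetical, the format strings are my mapping of the listed examples onto strptime codes, and all values are assumed to be in UTC.

```python
from datetime import datetime, timezone

# strptime equivalents of the listed formats (an assumed mapping).
FORMATS = [
    "%a, %d %b %Y %H:%M:%S GMT",  # Sun, 06 Nov 1994 08:49:37 GMT
    "%A, %d-%b-%y %H:%M:%S GMT",  # Sunday, 06-Nov-94 08:49:37 GMT
    "%a %b %d %H:%M:%S %Y",       # Sun Nov 6 08:49:37 1994
    "%Y-%m-%d",                   # 1994-11-06
    "%d.%m.%Y",                   # 06.11.1994
]


def parse_user_date(value):
    """Return a UTC datetime for a recognized date string, else None."""
    value = value.strip()
    if value.isdigit():
        # A bare run of digits is treated as a Unix timestamp.
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


print(parse_user_date("Sunday, 06-Nov-94 08:49:37 GMT"))  # 1994-11-06 08:49:37+00:00
print(parse_user_date("1104537600"))                      # 2005-01-01 00:00:00+00:00
```

The sketch relies on the default C locale for day and month names. A string that matches none of the formats yields None.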
nobody is another section with a special meaning.
When parsing HTML documents, indexer ignores the words outside the <body> and </body> tags by default.
To activate indexing of these words, you can define a special section nobody, which should have the same ID and length as the section body.
Making indexer see the words outside the body tags can be useful to index a remote site with broken HTML mark-up (when you can't modify the pages), or to index local HTML pages containing SSI (server side include) directives directly from disk using the file:/// scheme, even if the <body> and </body> tags are not in the HTML pages themselves but in shared files included using SSI directives, like <!--#include virtual="../include/top.html"-->.
For example:
Section body 1 256
Section nobody 1 256
Section body 1 256
Section title 2 128
Section meta.keywords 3 128
Section meta.description 4 128
Section header.server 5 64
Section url.file 6 0
Section url.path 7 0
Section url.host 8 0
Section url.proto 9 0
Section crosswords 10 0
Section Charset 11 32
Section Content-Type 12 64
Section Content-Language 13 16
Section attribute.alt 14 128
Section attribute.label 15 128
Section attribute.summary 16 128
Section attribute.title 17 128
Section References 18 0
Section Message-ID 19 0
Section Parent-ID 20 0
Section MP3.Song 21 128
Section MP3.Album 22 128
Section MP3.Artist 23 128
Section MP3.Year 24 128
Section CachedCopy 25 64000
Section attribute.face 27 0
Section attribute.title 28 0 "."

# A user-defined section
Section h1 29 128 "<h1>(.*)</h1>" $1

# User-defined date extracted from the "Date" meta-tag
Section User.Date 0 10 '<META NAME="Date" +CONTENT="([^"]*)">' "$1"

# Replacing Content-Type with application/msword
Section Content-Type 0 64 afterheaders cdoff "" "${URL}" "http://site/*.doc" "application/msword"

# Using "afterguesser" in conjunction with ${HTTP.LocalCharsetContent}
Section HTTP.LocalCharsetContent 0 0
Section h1lcs 30 128 afterguesser cdoff "" "${HTTP.LocalCharsetContent}" "<h1>(.*)</h1>" $1

# Using a simple HTDBDoc query for a SQL table with text and HTML columns
Section column1 1 256 text
Section column2 2 256 html