mnoGoSearch has a built-in parser for MP3 files. It can extract the Album, Artist, Song and Year tags from an MP3 file. You can create a full-featured MP3 search engine using mnoGoSearch.
To activate indexing of MP3 tags, use the CheckMP3 or CheckMP3Only command in indexer.conf, and activate processing of the MP3 sections (they are disabled by default). This is an example of an indexer.conf file with MP3-related commands:
Section MP3.Song 21 128
Section MP3.Album 22 128
Section MP3.Artist 23 128
Section MP3.Year 24 128
CheckMP3 *.mp3
HrefOnly *

With the above configuration, indexer will check all *.mp3 files for MP3 tags, and will collect new links from other file types without indexing them.
When you use the CheckMP3 command, indexer downloads only 128 bytes from the files with the given extension(s) to detect and parse MP3 tags.
Note: indexer downloads MP3 files efficiently from FTP servers, as well as from HTTP servers supporting the HTTP/1.1 protocol, using the Range request header to request partial content. Old HTTP servers that do not support the Range header may not work well with mnoGoSearch.
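For illustration, the partial download can look like the following HTTP exchange (a sketch; the host and file name are hypothetical, and the exact byte range depends on where indexer looks for the tags):

GET /music/song.mp3 HTTP/1.1
Host: music.example.com
Range: bytes=0-127

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-127/4718592
Content-Length: 128
Content-Type: audio/mpeg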
If you want to restrict searches by Artist, Album, Song or Year, you can use the standard mnoGoSearch ways to restrict searches, described in the Section called Changing weights of the different document parts at search time in Chapter 10 and the Section called Restricting search words to a section in Chapter 10. For example, to restrict a search by song and artist name, use the standard mnoGoSearch way to specify sections: Song:help Artist:Beatles.
With the default sections given in indexer.conf-dist, you may find it useful to add this HTML form element into search.htm to restrict the search area:
Search in:
<SELECT NAME="wf">
<OPTION VALUE="111100000000000000000000" SELECTED="$(wf)">All MP3 sections</OPTION>
<OPTION VALUE="000100000000000000000000" SELECTED="$(wf)">MP3 Song name</OPTION>
<OPTION VALUE="001000000000000000000000" SELECTED="$(wf)">MP3 Album</OPTION>
<OPTION VALUE="010000000000000000000000" SELECTED="$(wf)">MP3 Artist</OPTION>
<OPTION VALUE="100000000000000000000000" SELECTED="$(wf)">MP3 Year</OPTION>
</SELECT>
mnoGoSearch can index SQL tables with long text columns with the help of the so-called htdb:/ virtual URL scheme.
Using the htdb:/ virtual scheme, you can build a full-text index of your SQL tables, as well as index your database-driven Web servers.
Note: You must have a PRIMARY KEY or a UNIQUE INDEX on the table you want to index with HTDB.
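For example, a table suitable for HTDB indexing could be defined as follows (a minimal sketch matching the messages table used in the examples below):

CREATE TABLE messages
(
  id  INT NOT NULL PRIMARY KEY,
  msg TEXT
);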
HTDB is implemented using the following indexer.conf commands: HTDBAddr, HTDBList, HTDBLimit, HTDBDoc.
The purpose of the HTDBAddr command is to specify a database connection string. It uses the same syntax as DBAddr. If no HTDBAddr command is specified, the data is fetched using the connection specified in DBAddr.
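For example (the credentials and database name are hypothetical):

HTDBAddr mysql://user:password@localhost/mydatabase/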
The HTDBList command is used to specify an SQL query which generates a list of documents using either absolute or relative URL notation, for example:
HTDBList "SELECT CONCAT('htdb:/',id) FROM messages"or
HTDBList "SELECT id FROM messages"
Note: HTDBList can fetch non-HTDB URLs as well, which gives you another way to use HTDB: you can store a list of "real" URLs (e.g. HTTP-style URLs) in the database and fetch them with the help of HTDB:
HTDBList "SELECT url FROM mytable" Server urllist htdb:/ Realm page *
The SQL query given in HTDBList is used for all documents whose URL ends with the '/' sign. This query is an analog of a file system directory listing.
The HTDBLimit command is used to specify the maximum number of records fetched by a single SELECT query given in the HTDBList command. HTDBLimit helps to reduce memory consumption when indexing large SQL tables. For example:
HTDBLimit 512
The HTDBDoc command specifies an SQL query to get a single document from the database using its PRIMARY KEY value. The HTDBDoc query is executed for every HTDB document whose URL does not end with '/'.
An SQL query given in the HTDBDoc command must return a single-row result. If the HTDBDoc query returns an empty set or multiple records, the HTDB retrieval system generates an HTTP 404 Not Found response. This can happen at re-indexing time if the record was deleted from the table since the last re-indexing. You can use HoldBadHrefs 0 to remove such deleted records from the mnoGoSearch tables as well.
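For example, to make indexer delete such documents from its tables at the next re-indexing:

HoldBadHrefs 0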
mnoGoSearch understands three types of HTDBDoc SQL queries.
A single-column result with a fully formatted HTTP response, including the standard HTTP status line. Take a look into the Section called HTTP response codes mnoGoSearch understands in Chapter 3 to learn how indexer handles various HTTP status codes. An HTDBDoc SQL query can also optionally include HTTP headers understood by indexer, such as Content-Type, Last-Modified, Content-Encoding and other headers. So you can build a very flexible indexing system by returning different HTTP status codes and headers.
Example:
HTDBDoc "SELECT CONCAT(\ 'HTTP/1.0 200 OK\\r\\n',\ 'Content-type: text/plain\\r\\n',\ '\\r\\n',\ msg) \ FROM messages WHERE id='$1'"
A multiple-column result, with the status line starting with the "HTTP/" substring at the beginning of the first column. All columns are concatenated using Carriage-Return + New-Line (\r\n) delimiters to generate an HTTP-like response. The first column returning an empty string is treated as the delimiter between the headers and the content part of the HTTP response, and is replaced with "\r\n\r\n". This type of query is a simpler variant of the previous type: it helps to avoid using concatenation operators and functions, as well as the "\r\n" header delimiters.
Example:
HTDBDoc "SELECT 'HTTP/1.0 200 OK','Content-type: text/plain','',msg \ FROM messages WHERE id='$1'"
A single- or multiple-column result without the "HTTP/" status line. This is the simplest HTDBDoc response type. The SQL column names returned by the query are associated with the Section names configured in indexer.conf.
Example:
Section body 1 256
Section title 2 256
HTDBDoc "SELECT title, body FROM messages WHERE id='$1'"
In this example, the values of the columns title and body are associated with the sections title and body respectively.
The columns with the names status and last_mod_time have a special meaning: the HTTP status code and the document modification time, respectively. The status column should contain an integer code according to the HTTP notation, and the modification time should be in Unix timestamp format, that is, the number of seconds since January 1, 1970.
Example:
HTDBDoc "SELECT title, body, \ CASE WHEN messages.deleted THEN 404 ELSE 200 END as status,\ timestamp as last_mod_time FROM messages WHERE id='$1'"
The above example demonstrates how to use the special columns. The SQL query will return the status "404 Not found" for all documents marked as deleted, which will make indexer remove these documents from the search database when re-indexing the data. Also, this query makes indexer use the column timestamp as the document modification time.
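If the modification time is stored in a DATETIME column rather than as a Unix timestamp, it can be converted inside the query itself. A minimal sketch for MySQL, assuming a hypothetical DATETIME column named modified:

HTDBDoc "SELECT title, body, UNIX_TIMESTAMP(modified) as last_mod_time \
FROM messages WHERE id='$1'"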
If a column contains data in HTML format, you can specify the html keyword in the corresponding Section command, which will make indexer apply the HTML parser to this column and therefore remove all HTML tags and comments:
Example:
Section title 1 256
Section wiki_text 2 16000 html
HTDBDoc "SELECT title, wiki_text FROM messages WHERE id='$1'"
The path parts of a URL can be passed as parameters to the HTDBList and HTDBDoc SQL queries. The parts are referenced as $1, $2, ... $N, where the number N stands for the N-th path part, that is, the part of the URL after the N-th slash sign:

htdb:/part1/part2/part3/part4/part5
      $1    $2    $3    $4    $5
For example, suppose you have the following indexer.conf command:
HTDBList "SELECT id FROM catalog WHERE category='$1'"
When mnoGoSearch prepares to fetch a document with the URL htdb:/cars/, $1 will be replaced with "cars":
SELECT id FROM catalog WHERE category='cars'
You can use long URLs to pass multiple parameters into both HTDBList and HTDBDoc queries. For example:
HTDBList "SELECT column4 FROM table WHERE column1='$1' AND column2='$2' and column3='$3'" HTDBDoc "SELECT title, body FROM table WHERE column1='$1' AND column2='$2' and column3='$3' column4='$4'" Server htdb:/path1/path2/path3/Using multiple parameters helps to refer to a certain record using parts of a compound PRIMARY KEY or UNIQUE INDEX.
It is possible to index multiple HTDB sources using multiple HTDBList, HTDBDoc and Server commands in the same indexer.conf, for example:
Section body 1 256
Section title 2 256

HTDBList "SELECT id FROM t1"
HTDBDoc "SELECT title, body FROM t1 WHERE id=$2"
Server htdb:/t1/

HTDBList "SELECT id FROM t2"
HTDBDoc "SELECT title, body FROM t2 WHERE id=$2"
Server htdb:/t2/

HTDBList "SELECT id FROM t3"
HTDBDoc "SELECT title, body FROM t3 WHERE id=$2"
Server htdb:/t3/
With the help of the htdb:/ scheme you can quickly create a full-text index and use it further in your SQL application. Imagine you have a large SQL table which stores Web board messages in plain text format, and you want to add search functionality to your Web board. Say, the messages are stored in the table messages with two columns, id and msg, where id is an integer PRIMARY KEY and msg is a long text column containing the messages. Using a usual SQL LIKE search may take a very long time to return a result:
SELECT id, message FROM messages WHERE message LIKE '%someword%'
With the help of the htdb:/ scheme provided by mnoGoSearch, you can create a full-text index on the table messages. To do so, edit your indexer.conf as follows:
DBAddr mysql://foo:bar@localhost/mnogosearch/?dbmode=single
Section msg 1 256
HTDBAddr mysql://foofoo:barbar@localhost/database/
HTDBList "SELECT id FROM messages"
HTDBDoc "SELECT msg FROM messages WHERE id='$1'"
Server htdb:/
When started, indexer will insert the URL htdb:/ into the database and execute the SQL query given in HTDBList, which will produce the values 1, 2, 3, ..., N in the result. The values will be interpreted as links relative to htdb:/, so a list of new URLs in the form htdb:/1, htdb:/2, ..., htdb:/N will be added into the database. Then the HTDBDoc SQL query will be executed for every added URL. HTDBDoc will return the column msg as the document content, which will be associated with the section msg and parsed. Word information will be stored in the table dict (assuming the single storage mode).
After indexing is done, you can use the mnoGoSearch tables to perform a search:
SELECT url.url FROM url,dict WHERE dict.url_id=url.rec_id AND dict.word='someword';
The table dict has an index on the column word, so the above query will be executed much faster than the queries using the LIKE operator on the table messages.
You can also search for multiple words:
SELECT url.url, count(*) as c FROM url,dict WHERE dict.url_id=url.rec_id AND dict.word IN ('some','word') GROUP BY url.url ORDER BY c DESC;
Both queries will return htdb:/XXX values from the url.url field. Then your application can cut the "htdb:/" prefix from the returned values to get the PRIMARY KEY values from the table messages.
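The prefix can also be cut directly in SQL. A sketch for MySQL ('htdb:/' is 6 characters long, so the key starts at position 7):

SELECT SUBSTRING(url.url FROM 7) AS id FROM url, dict
WHERE dict.url_id=url.rec_id AND dict.word='someword';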
You can also use HTDB to index your database-driven Web server. It allows you to index your documents without having to invoke the Web server at indexing time, which should require less CPU resources than direct HTTP indexing and therefore should offload the Web server machine.
The main idea of indexing a database-driven Web server is to map HTTP requests to HTDB requests at indexing time. indexer will then fetch the source data directly from the SQL database, while search.cgi will return real URLs in the usual HTTP notation. This can be achieved using the aliasing mechanisms provided by mnoGoSearch.
Take a look at the sample file doc/samples/htdb.conf included in the mnoGoSearch source distribution. It is the indexer.conf file used to index the Web board at the mnoGoSearch site.
The HTDBList command generates URLs in the form:
http://www.mnogosearch.org/board/message.php?id=XXX
where XXX is a PRIMARY KEY value from the table messages.
For every PRIMARY KEY value, a fully formatted HTTP response is generated, containing a text/html document with headers and this content:
<HTML>
<HEAD>
<TITLE>Subject goes here</TITLE>
<META NAME="Description" Content="Author name goes here">
</HEAD>
<BODY>
Message text goes here
</BODY>
</HTML>
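A sketch of an HTDBDoc query that could generate such a response, assuming hypothetical columns subject, author and msg in the table messages (the escaping of the inner double quotes may need adjusting for your configuration):

HTDBDoc "SELECT CONCAT(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-Type: text/html\\r\\n',\
'\\r\\n',\
'<HTML><HEAD><TITLE>', subject, '</TITLE>',\
'<META NAME=\"Description\" Content=\"', author, '\">',\
'</HEAD><BODY>', msg, '</BODY></HTML>') \
FROM messages WHERE id='$1'"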
At the end of doc/samples/htdb.conf you can find these commands:
Server htdb:/
Realm http://www.mnogosearch.org/board/message.php?id=*
Alias http://www.mnogosearch.org/board/message.php?id= htdb:/
The first command tells indexer to execute the HTDBList query, which generates a list of messages in the form:
http://www.mnogosearch.org/board/message.php?id=XXX
The second command tells indexer to allow messages matching the given pattern, using a string match with the '*' wildcard at the end.
The third command replaces the substring http://www.mnogosearch.org/board/message.php?id= in the URL with htdb:/ before a message is downloaded, which forces indexer to use the SQL table as the data source for the document instead of sending an HTTP request to the Web server.
After indexing is done, search.cgi will display search results using the usual HTTP notation, for example: http://www.mnogosearch.org/board/message.php?id=1000
mnoGoSearch offers the special virtual URL schemes exec:/ and cgi:/. These schemes allow using the output of an external program as a source for indexing. mnoGoSearch can work with any executable program that writes its results to STDOUT. The result must conform to the HTTP standard and include full HTTP response headers (the HTTP status line and at least the Content-Type response header) followed by the document content.
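For example, a minimal valid program output could look like this (a sketch):

HTTP/1.0 200 OK
Content-Type: text/html

<HTML><BODY>Hello, world!</BODY></HTML>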
For example, when indexing both cgi:/usr/local/bin/myprog and exec:/usr/local/bin/myprog, indexer will execute the /usr/local/bin/myprog program.
When executing a program given in a cgi:/ URL, indexer emulates the environment this program would run in under an HTTP server. It creates the REQUEST_METHOD=GET environment variable and the QUERY_STRING variable according to the CGI standard. For example, if cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e is being indexed, indexer creates QUERY_STRING with the value a=b&d=e. The cgi:/ virtual URL scheme allows indexing your site without having to invoke a Web server, even if you want to index CGI scripts. For example, suppose you have a Web site with static documents under /usr/local/apache/htdocs/ and CGI scripts under /usr/local/apache/cgi-bin/. You can use the following configuration:
Server http://localhost/
Alias http://localhost/cgi-bin/ cgi:/usr/local/apache/cgi-bin/
Alias http://localhost/ file:///usr/local/apache/htdocs/
In the case of an exec:/ URL, indexer does not create the QUERY_STRING variable; instead, it passes all parameters on the command line. For example, when indexing exec:/usr/local/bin/myprog?a=b&d=e, this command will be executed:
/usr/local/bin/myprog "a=b&d=e"
The exec:/ virtual scheme can be used as an external retrieval system: it allows using protocols which are not supported natively by mnoGoSearch. For example, you can use the curl program, available from http://curl.haxx.se/, to index HTTPS sites when mnoGoSearch is compiled without built-in HTTPS support.
Put this short script into /usr/local/mnogosearch/etc/ under the name curl.sh:
#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null
This script takes a URL given as a command line parameter and executes curl to download it. The -i argument tells curl to output the result together with the HTTP response headers.
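You can test the script manually before indexing; it should print the HTTP response headers followed by the document content (the URL is just an example):

/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/"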
Add these commands into indexer.conf:
Server https://some.https.site/
Alias https:// exec:/usr/local/mnogosearch/etc/curl.sh?https://
When indexing https://some.https.site/path/to/page.html, indexer will translate this URL to
exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html
then execute the curl.sh script:
/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"
and load its output for indexing.
Note: indexer loads up to MaxDocSize bytes of output when executing an exec:/ or cgi:/ program.
mnoGoSearch supports some mirroring functionality. To enable mirroring, specify the path where indexer will create mirrors of your sites using the MirrorRoot command. For example:
MirrorRoot /path/to/mirror
You can also configure indexer to store HTTP headers on disk. This can be helpful if you want to use the local mirror for quick re-indexing of the remote site. Use the MirrorHeadersRoot command to activate storing the HTTP headers. For example:
MirrorHeadersRoot /path/to/headers
Note: MirrorRoot and MirrorHeadersRoot can point to the same directory.
Note: indexer does not download more than MaxDocSize bytes from every document. If a document is larger, it will be downloaded only partially. Make sure MaxDocSize is large enough if you want to use the mirror created by indexer as a real site mirror.
mnoGoSearch can use a previously created mirror as a crawler cache. This can be useful when you experiment with mnoGoSearch to find the best configuration: you modify your indexer.conf, then clear the database and index the same sites again. To reduce Internet traffic, you can activate loading documents from the mirror using the MirrorPeriod command. For example:
MirrorPeriod 2h
MirrorPeriod specifies the period of time during which indexer considers the local mirrored copy of a document valid. If indexer finds that the local mirrored copy is fresh enough, it will not download the same document again and will use the local copy instead. If the local copy is older than MirrorPeriod allows, indexer will download the document from its original location again and update the locally mirrored copy.
If MirrorHeadersRoot is not specified and therefore the original HTTP headers are not available, indexer will detect the Content-Type of a document using the AddType commands.
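For example (a sketch; adjust the file name patterns to your site):

AddType text/html *.html *.htm
AddType text/plain *.txt
AddType image/gif *.gif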
The MirrorPeriod parameter should be in the form xxxA[yyyB[zzzC]], where xxx, yyy, zzz are numbers (they can be negative!). Spaces are allowed between xxx and A, between yyy and B, and so on. A, B, C can be one of the following:
s - second
M - minute
h - hour
d - day
m - month
y - year
Note: The letters are similar to the format descriptors understood by the strptime() and strftime() C functions.
Examples:
15s       - 15 seconds
4h30M     - 4 hours and 30 minutes
1y6m-15d  - 1 year and six months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second
If you specify only a number without any letters, the time is assumed to be given in seconds.
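For example, the following two commands are equivalent, because 2 hours are 7200 seconds:

MirrorPeriod 2h
MirrorPeriod 7200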
Note: If you start mirroring with an already existing database, indexer will not create the mirror immediately because of the traffic optimization method described in the Section called Crawling time optimization in Chapter 3. You can run indexer -am once to turn off the optimization, or clear the database using indexer -C and then run indexer without any arguments.
It is possible to dump and restore a mnoGoSearch SQL database using the standard tools supplied with the database software, such as mysqldump or pg_dump. This approach works fine in the case of a single SQL database.
However, if you use multiple SQL databases to store mnoGoSearch data, or use the mnoGoSearch cluster solution and want to re-distribute data between more SQL databases (say, when adding a new machine to the cluster), or want to reduce the number of separate SQL databases (say, when removing a machine from the cluster), the standard method of dumping and restoring SQL data will not work because of conflicts in auto-generated values (auto_increment values, SEQUENCE values, IDENTITY values and so on).
Starting from version 3.3.9, mnoGoSearch includes dump and restore tools which allow working around this problem.
Note: As of version 3.3.9, the mnoGoSearch dump and restore tools work only with MySQL. Support for other databases will be added in future releases.
To dump the data, use:

indexer -Edumpdata > dumpfile.sql

or pipe the data to gzip to reduce the dump size:

indexer -Edumpdata | gzip > dumpfile.sql.gz
The dump file created by indexer -Edumpdata is a usual SQL dump file which does not include auto-generated values. A piece of a dump file for a MySQL database looks like this:

--seed=39
INSERT INTO url (...all columns except rec_id...) VALUES (...);
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'body','Modules Directives FAQ...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'CachedCopy','eNrtWc1v2zgWv+ev...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Charset','utf-8');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Language','en');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Type','text/html');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'title','Apache HTTP Server Ver...');
INSERT INTO bdicti VALUES(last_insert_id(),1,0x6B6F00011EC296170000726577726974696E6700017E4D,0...');

The dump file consists of chunks of INSERT statements, one chunk per document. The structure of the dump file forces MySQL to assign a new auto-increment value for the column url.rec_id and to use this value when inserting data into the child tables urlinfo and bdicti at restore time.
Additionally, every chunk starts with a comment of the form --seed=xxx, which is used to distribute data properly between multiple databases at restore time.
By default, indexer -Edumpdata dumps data from all databases specified in the indexer.conf file. You can use the -D command line argument to dump data from a certain database only. For example:

indexer -Edumpdata -D2

will dump data from the database described by the second DBAddr command in indexer.conf.
To restore a search database from a dump file, use:
indexer -Esql -v2 < dumpfile.sql

or, in the case of a .gz file:

zcat dumpfile.sql.gz | indexer -Esql -v2

indexer will load the data back into the SQL database. If you have two or more DBAddr commands in the current indexer.conf file, indexer will also properly distribute the data between the corresponding SQL databases.