mnoGoSearch 3.3.14 reference manual: Full-featured search engine software
Chapter 10. Searching documents
By default, mnoGoSearch sorts results by score. The score is the relevancy value combined with the other factors listed in the section called Commands affecting document score.
Note: You can also request a non-default document ordering with the s search parameter. See the section called Search parameters to learn how to order documents by Date, Popularity Rank, URL, and other document parameters.
The following manual sections describe the commands that affect document ordering and/or score values: DateFactor, DocSizeWeight, MinCoordFactor, NumDistinctWordFactor, NumSections, NumWordFactor, UserScore, WordDistanceWeight, WordFormFactor, WordDensityFactor.
Relevancy for every found document is calculated as the cosine of the angle between two weight vectors: the vector for the search query and the vector for the found document. The number of coordinates in each vector equals the number of words in the search query (NumWords) multiplied by the number of active sections defined by the NumSections command: NumWords * NumSections. Every coordinate corresponds to one word in one section, and its value is the product of three factors:
section_weight, according to the wf value for this section (see the Section called Changing weights of the different document parts at search time).
word_weight, depending on whether this word is the original word from the search query typed by the user, or the word is a generated form such as a synonym or a stemming form.
Note: You can change the weight of the generated forms using the WordFormFactor command.
word_frequency - the frequency of the word in the section, with the WordDensityFactor value taken into account.
For example, suppose the search query is test document and the following document has been indexed:
<HTML>
  <HEAD>
    <TITLE>
      Test
    </TITLE>
  </HEAD>
  <BODY>
    This is a test document to test the score value 
  </BODY>
</HTML>
  Also, for simplicity, imagine that
  NumSections is set
  to 2 (that is, only the body and
  title sections are active),
  wf is set to
  its default value (weight factors for all sections are
  equal to 1), and
  WordDensityFactor
  is set to 255 (the strongest density effect).
  mnoGoSearch will use these two vectors to calculate relevancy:
Vq = (1, 1, 1, 1) for the search query and
Vd = (1, 0, 0.2, 0.1) for the above document, calculated as follows:
        The word test appears once in the section
        title and its word_frequency
        is 1,
        wf[title]=1,
        word_weight=1.
        Therefore, Vd[1]=1 * 1 * 1 =
        1.
        
        The word document does not appear
        in the section title at all,
        therefore, Vd[2]=0.
        
        The word test appears two times in the section
        body, with 10
        words total. word_frequency
        is 2/10.
        wf[body] is 1.
        word_weight is 1.
        Therefore, Vd[3] = 2/10 * 1 * 1 = 0.2.
        
        The word document appears once in the section
        body, which is 10
        words long. word_frequency
        is 1/10.
        wf[body] is 1.
        word_weight is 1.
        Therefore, Vd[4] = 1/10 * 1 * 1 = 0.1.
        
The cosine value for the above two vectors is 0.634335.
  Now imagine that we set wf to "1111181"
  and therefore made the weight factor for the section
  title higher. Now relevancy will be calculated
  using these two vectors:
Vq = (8, 8, 1, 1) for the search query and
Vd = (8, 0, 0.2, 0.1) for the above document, which will result in the relevancy value 0.704660.
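The arithmetic in the two examples above can be reproduced with a short Python sketch. It only illustrates the calculation described in this section and is not the mnoGoSearch implementation: the function names cosine and vectors and the dictionary layout are hypothetical, word_weight is assumed to be 1 for every query word, and word_frequency is assumed to be the number of occurrences divided by the section length, as in the example.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vectors(query_words, sections, wf, word_weight=1.0):
    """Build (Vq, Vd) with one coordinate per (section, query word) pair."""
    vq, vd = [], []
    for name, text in sections:
        words = text.lower().split()
        for w in query_words:
            # word_frequency: occurrences divided by the section length
            freq = words.count(w) / len(words) if words else 0.0
            vq.append(wf[name] * word_weight)
            vd.append(freq * wf[name] * word_weight)
    return vq, vd

sections = [
    ("title", "Test"),
    ("body", "This is a test document to test the score value"),
]
query = ["test", "document"]

# Default weights: every section has weight factor 1.
vq, vd = vectors(query, sections, {"title": 1, "body": 1})
print(vq, vd)
print(f"{cosine(vq, vd):.6f}")

# Title weight raised to 8, as with wf set to "1111181" in the text.
vq8, vd8 = vectors(query, sections, {"title": 8, "body": 1})
print(f"{cosine(vq8, vd8):.6f}")
```

The two printed cosine values should agree with the relevancy values quoted in the text for the default and the title-weighted configurations.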
The relevancy value calculated as explained above is then combined with various other parameters to get the final score value, for example the average distance between the words in the document, the distance of the words from the beginning of the section, and the other parameters listed in the section called Commands affecting document score.
Note: In the default configuration mnoGoSearch produces quite small score values, because it expects the words to be found in up to 256 sections and therefore uses vectors with coordinates for all 256 sections. See the description of the NumSections search.htm command to learn how to specify the real number of sections and thus increase the score values. Changing NumSections does not affect the document order; it only changes the absolute score values for all documents.
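As a rough illustration of this note, the sketch below pads the example vectors with extra sections in which the document has no hits. It assumes (an inference from the note, not something this manual spells out) that every active section contributes a query coordinate of 1 and a document coordinate of 0 when the word is absent. Under that assumption the padding scales every document's relevancy by the same factor, so the absolute values shrink while the ordering stays the same. The relevancy helper and the second sample document are hypothetical.

```python
import math

def relevancy(doc_coords, num_words, num_sections):
    """Cosine between an all-ones query vector and a zero-padded document vector."""
    size = num_words * num_sections
    vd = list(doc_coords) + [0.0] * (size - len(doc_coords))
    vq = [1.0] * size
    dot = sum(q * d for q, d in zip(vq, vd))
    norm_q = math.sqrt(sum(q * q for q in vq))
    norm_d = math.sqrt(sum(d * d for d in vd))
    return dot / (norm_q * norm_d)

docs = {
    "doc_a": [1.0, 0.0, 0.2, 0.1],  # the document from the example in this section
    "doc_b": [0.0, 0.0, 0.3, 0.3],  # hypothetical second document
}

for num_sections in (2, 256):
    scores = {name: relevancy(v, 2, num_sections) for name, v in docs.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(num_sections, ranking, {k: round(s, 6) for k, s in scores.items()})
```

With NumSections set to 2 the values are noticeably larger than with 256, but both runs rank the two documents in the same order.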
Starting with version 3.3.7, mnoGoSearch lets you debug the score values calculated for the documents found, which helps you find the combination of score factors that works best for you. To debug score values, go through these steps:
<!--restop-->
....
[DebugScore: $(DebugScore)]
<!--/restop-->

<!--res-->
....
[ID=$(ID)]
<!--/res-->
        Note: The URL will now look approximately like this:
http://hostname/cgi-bin/search.cgi?q=test+query&DebugURLID=100
DebugScore: url_id=100 RDsum=98 distance=84 (84/1) minmax=0.99091089
            density=0.00196271 numword=0.90135133 wordform=0.00000000
        It will give you an idea why the score value for the
        selected document is too high or too low and help
        you fine-tune various parameters
        such as WordDistanceWeight
        or WordDensityFactor.
      Note: Score debug information is currently displayed only for queries with multiple search words. Queries with a single search word don't return debug information.
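A request URL like the one shown above can also be assembled programmatically. The sketch below uses Python's standard urllib.parse.urlencode; the helper name debug_url is hypothetical, and the host name, query, and document ID are the placeholder values from the example URL. The query has two words because single-word queries return no debug information.

```python
from urllib.parse import urlencode

def debug_url(base, query, url_id):
    """Append the search query and the DebugURLID parameter to a search.cgi URL."""
    return base + "?" + urlencode({"q": query, "DebugURLID": url_id})

print(debug_url("http://hostname/cgi-bin/search.cgi", "test query", 100))
```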
Popularity rank is calculated in two steps when you run indexer -n0 -R.
  In the first step, the value of the Weight parameter
  for every server is divided by the number of outgoing links from
  this server, which gives the weight of one link from this server.
  In the second step, the sum of the weights of all incoming links is
  calculated for every document, and the result is stored as the document
  popularity value.
  
Self-links (when a document refers to itself) are ignored and do not affect the document popularity. You can also set PopRankSkipSameSite to yes to ignore all internal site links, so that only inter-site links affect the popularity values.
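The two steps can be sketched in Python as follows. This is a simplified illustration rather than the indexer's actual code: the popularity function, its arguments, and the sample URLs are hypothetical, and the sketch assumes that ignored links (self-links, and same-site links when skip_same_site is set) are also left out of the outgoing link count.

```python
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    """Hypothetical helper: treat the host name as the server."""
    return urlparse(url).netloc

def popularity(links, server_of, server_weight=None, skip_same_site=False):
    """links: iterable of (source_url, target_url). Returns {target_url: popularity}."""
    server_weight = server_weight or {}
    # Drop self-links; optionally drop all links that stay within one site.
    kept = [
        (src, dst) for src, dst in links
        if src != dst and not (skip_same_site and server_of(src) == server_of(dst))
    ]
    # Step 1: weight of one link = server Weight / outgoing links from that server.
    outgoing = defaultdict(int)
    for src, _ in kept:
        outgoing[server_of(src)] += 1
    link_weight = {s: server_weight.get(s, 1.0) / n for s, n in outgoing.items()}
    # Step 2: popularity of a document = sum of the weights of its incoming links.
    pop = defaultdict(float)
    for src, dst in kept:
        pop[dst] += link_weight[server_of(src)]
    return dict(pop)

links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://a.example/1", "http://a.example/1"),  # self-link, ignored
    ("http://b.example/x", "http://a.example/1"),
]
print(popularity(links, host))
```

In this sample, a.example has two counted outgoing links, so each is worth 0.5 with the default Weight of 1, while the single link from b.example is worth 1.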
  By default, the value of the Weight parameter is equal
  to 1 for all servers. You can change this value using
  the ServerWeight
  command in indexer.conf.
  
  
If you set PopRankFeedBack to yes, indexer will re-calculate site weights before calculating popularity values. A site weight is calculated as the sum of the popularity values of all documents of this site, as calculated during the previous indexer -n0 -R run. If the sum is greater than 1, the site weight is set to this sum; otherwise, the site weight is set to 1.
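The feedback rule amounts to taking the larger of 1 and the per-site sum. A minimal Python sketch, assuming the popularity values from the previous run are available as a dictionary keyed by URL (the site_weights function, the site helper, and the sample data are hypothetical):

```python
from urllib.parse import urlparse

def site(url):
    """Hypothetical helper: treat the host name as the site."""
    return urlparse(url).netloc

def site_weights(prev_popularity, site_of):
    """prev_popularity: {url: popularity from the previous indexer -n0 -R run}."""
    sums = {}
    for url, pop in prev_popularity.items():
        name = site_of(url)
        sums[name] = sums.get(name, 0.0) + pop
    # The site weight is the sum if it exceeds 1, otherwise 1.
    return {name: total if total > 1 else 1.0 for name, total in sums.items()}

previous = {
    "http://a.example/1": 0.25,
    "http://a.example/2": 0.5,
    "http://b.example/x": 2.0,
    "http://b.example/y": 1.5,
}
print(site_weights(previous, site))
```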
If you set PopRankUseTracking to yes, indexer will also use the search statistics collected using the search query tracking module (see the Section called Tracking search queries for details).
If you set PopRankUseShowCnt
  to yes in the search template file
  search.htm, the url.shows
  value (that is, the value of the column shows in the
  table url) will be incremented every
  time a document is displayed in search results, but only
  when the score value of the document is greater than
  PopRankShowCntRatio
  (25.0% by default). That is, this option
  enables collecting information about the high-scoring
  documents seen by users.
  
  You can also set PopRankUseShowCnt
  to yes in indexer.conf.
  In this case indexer will add the collected
  value of url.shows, multiplied by
  PopRankShowCntWeight
  (0.01 by default), to the popularity value.
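The two halves of PopRankUseShowCnt can be sketched as a pair of small Python functions. This is an illustration only: the function names are hypothetical, the score is assumed to be expressed as a percentage on the same scale as PopRankShowCntRatio, and the show count contribution is assumed to be added to the popularity value.

```python
def record_show(shows, score, ratio=25.0):
    """Search side: count a display only if the score exceeds PopRankShowCntRatio."""
    return shows + 1 if score > ratio else shows

def adjusted_popularity(popularity, shows, weight=0.01):
    """Indexer side: add shows * PopRankShowCntWeight to the popularity value."""
    return popularity + shows * weight

shows = 0
for score in (80.0, 40.0, 10.0):  # only the first two exceed the 25.0 threshold
    shows = record_show(shows, score)
print(shows, adjusted_popularity(1.0, shows))
```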
  
The Crosswords feature makes the words written between the <a href="xxx"> and </a> HTML tags belong to the document referenced by the link. To enable Crosswords, use the CrossWords command in indexer.conf and search.htm.
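The idea behind Crosswords can be illustrated with Python's standard html.parser module. The sketch below is not how indexer implements the feature; it only shows how link text can be attributed to the target document. The class name AnchorTextCollector and the sample URL are hypothetical.

```python
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    """Collect the words inside <a href="...">...</a>, keyed by the link target."""

    def __init__(self):
        super().__init__()
        self.words_for = {}  # href -> list of words found in links to it
        self._href = None    # target of the link we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        # Text seen while inside a link is credited to the referenced document.
        if self._href:
            self.words_for.setdefault(self._href, []).extend(data.split())

collector = AnchorTextCollector()
collector.feed('See the <a href="http://example.com/manual.html">reference manual</a> for details.')
print(collector.words_for)
```

Here the words reference and manual are recorded for http://example.com/manual.html, even though they appear in the linking page.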