Mustru: Evaluation

TREC

Mustru Version 0.1 was tested on the TREC-8 Question & Answer dataset published in 1999. The collection consists of about 524K articles from various sources, including the Financial Times, the LA Times, and FBIS. The article text was segmented into passages, with the maximum and minimum passage sizes limited to 250 and 50 bytes respectively. About 3.5 million passages were created, an average of 6.7 passages per article. The size of the text content alone, excluding all tags, was about 1.5 Gbytes. A small set of development questions was provided, and Q&A systems were tested on a set of 198 questions. An answer was judged correct if it matched a regular expression generated for the particular question AND if the answer sentence originated in a document considered relevant for the question. A correct answer was awarded points based on its position in the hit list returned by the search engine; points were scored only for the top-ranking answer from the hit list.
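In effect, this scoring is a per-question reciprocal-rank check over the hit list. The sketch below illustrates that logic in Python; the function and argument names are illustrative assumptions, not Mustru's actual code.

    import re

    def score_question(hits, answer_pattern, relevant_docs, max_rank=5):
        """Return the reciprocal-rank score for a single question.

        hits           -- ranked list of (doc_id, answer_text) pairs from the engine
        answer_pattern -- regular expression a correct answer must match
        relevant_docs  -- set of document ids judged relevant for the question
        max_rank       -- only the top max_rank hits are examined
        """
        pattern = re.compile(answer_pattern, re.IGNORECASE)
        for rank, (doc_id, answer_text) in enumerate(hits[:max_rank], start=1):
            # Correct only if the regex matches AND the answer came from a
            # document judged relevant for this question.
            if pattern.search(answer_text) and doc_id in relevant_docs:
                return 1.0 / rank   # score depends on position in the hit list
        return 0.0                  # no correct answer among the top hits

The Mean Reciprocal Rank reported below is then just the sum of these per-question scores divided by the number of test questions (198 in this evaluation).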
Results
The final precision count (also known as Mean Reciprocal Rank) for Mustru was 0.58 (144.81 / 198). A question was converted to a search engine query with five components, sketched below. Each of the five components contributed differently to the overall precision of the answer.
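To make the five-component query concrete, here is one way such a query might be assembled, assuming a boosted-term query syntax of the kind common in full-text search engines. The helper name, boost weights, and example values are illustrative assumptions, not taken from Mustru.

    def build_query(question_tokens, entities, hypernyms, transformations):
        """Assemble a query string from the five components discussed above.

        question_tokens -- words of the question (source of unigrams and bigrams)
        entities, hypernyms, transformations -- extra terms derived from the
                           question; the boost weights below are illustrative.
        """
        unigrams = question_tokens
        bigrams = [" ".join(pair) for pair in zip(question_tokens, question_tokens[1:])]

        parts = []
        parts += unigrams                                     # unigrams
        parts += [f'"{b}"^2.0' for b in bigrams]              # bigram phrases, boosted
        parts += [f"{e}^0.5" for e in entities]               # expected answer entities
        parts += [f"{h}^0.5" for h in hypernyms]              # hypernyms of the answer type
        parts += [f'"{t}"^0.5' for t in transformations]      # reworded question forms
        return " OR ".join(parts)

    # Hypothetical example values, for illustration only.
    print(build_query(["who", "invented", "the", "telephone"],
                      entities=["person"], hypernyms=["inventor"],
                      transformations=["telephone was invented by"]))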
Not surprisingly, bigrams appear to be the largest contributor to the overall precision, followed by unigrams. When used in queries, entities (hypernyms) and transformations appear to provide a marginal improvement in precision.

Version 0.2

In version 0.2, the entity extractor was replaced by a simple table lookup to speed up indexing and reduce memory requirements. Instead of indexing sentences and documents twice as in version 0.1, a document is indexed just once. The document most likely to answer a question is retrieved first, followed by a search for the top two passages that may answer the question (see the sketch below). In version 0.1, a search query was generated for the best passage and the associated document was not fetched. The results in version 0.2 have lower precision, but are reasonable. As before, the top 5 hits were used to judge whether the search engine found an answer. Even though the document that contains the answer was fetched, passage retrieval extracted the sentence containing the answer in 122 out of 151 (80%) questions.
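A minimal sketch of this two-stage retrieval is shown below, using a crude term-overlap score as a stand-in for the actual ranking function; the data structures and function names are assumptions for illustration only.

    from collections import Counter

    def overlap_score(query_terms, text):
        """Crude relevance score: count occurrences of query terms in the text."""
        words = Counter(text.lower().split())
        return sum(words[t] for t in query_terms)

    def answer_question(query, documents, passages, num_passages=2):
        """Two-stage retrieval: best document first, then its top passages.

        documents -- dict of doc_id -> full document text
        passages  -- list of (doc_id, passage_text) pairs
        """
        query_terms = [t.lower() for t in query.split()]

        # Stage 1: retrieve the document most likely to answer the question.
        best_doc_id = max(documents, key=lambda d: overlap_score(query_terms, documents[d]))

        # Stage 2: within that document, rank passages and keep the top two.
        candidates = [(doc_id, text) for doc_id, text in passages if doc_id == best_doc_id]
        candidates.sort(key=lambda p: overlap_score(query_terms, p[1]), reverse=True)
        return candidates[:num_passages]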