Guidelines for Participation and Submission
(Note: These guidelines have been adapted from previous CLEF submission guidelines. Please read carefully before submitting results for a task.)
In these Guidelines, we provide information on the CLEF 2013 CHiC queries, data manipulation, query construction and results submission for the multilingual CHiC tasks (ad-hoc and semantic enrichment).
CHiC 2013 consists of 2 distinct tasks (ad-hoc and semantic enrichment) with different requirements for query and data processing. The focus of the 2013 multilingual CHiC tasks is on multilingual information retrieval, in particular searching a multilingual document collection.
Topics are provided in 13 source languages and documents are provided in 13 target languages. The languages are:
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hungarian
- Italian
- Norwegian
- Polish
- Slovenian
- Spanish
- Swedish
Note: Whereas the topics in different source languages are equivalent in literal meaning (translations of each other), the document collections in the source languages are NOT equivalent.
Topic format & requirements
The topic set for the 2013 edition consists of 50 topics, some sampled from Europeana query logs and some intellectually developed. The topics from Europeana query logs comprise queries for people, places, work titles (e.g. Mona Lisa), events or subjects. The intellectually derived topics mostly comprise topical queries. We down-sampled the proportion of named-entity topics in this year's topic set (from 60% in 2012 to about 20% this year) to put more emphasis on the more difficult topical queries. For comparison purposes, we will include 10 topics from CHiC 2012. Topics have the following format:
<topic lang="en">
<identifier>CHIC-021</identifier>
<title>chardonne</title>
<description>Jacques Chardonne, Writer (FR) OR place in Switzerland</description>
</topic>
Where:
<topic lang="en"> marks the beginning and the language of a topic.
<identifier> is the query identifier
<title> is the actual query
<description> is a description of the content that will be used for relevance assessment. Not every topic contains text in the description field.
THE DESCRIPTION FIELD MAY NOT BE USED FOR RETRIEVAL EXPERIMENTS.
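As an illustration of the topic format above, the following Python sketch reads topics and keeps only the fields that may be used for retrieval. The enclosing <topics> root element and the function name read_topics are assumptions made for this example; the guidelines do not show how the topic file is wrapped.

```python
import xml.etree.ElementTree as ET

# Sample topic copied from the format description. The <topics> wrapper
# element is an assumption: the guidelines do not show the file's root.
SAMPLE = """<topics>
<topic lang="en">
<identifier>CHIC-021</identifier>
<title>chardonne</title>
<description>Jacques Chardonne, Writer (FR) OR place in Switzerland</description>
</topic>
</topics>"""


def read_topics(xml_text):
    """Return one dict per topic with the fields allowed for retrieval."""
    root = ET.fromstring(xml_text)
    topics = []
    for topic in root.iter("topic"):
        topics.append({
            "id": topic.findtext("identifier", default="").strip(),
            "lang": topic.get("lang"),
            "query": topic.findtext("title", default="").strip(),
            # <description> is deliberately not read: it is reserved for
            # relevance assessment and may not be used for retrieval.
        })
    return topics


topics = read_topics(SAMPLE)
print(topics)
```

Leaving the description out at parsing time is one simple way to make sure it cannot leak into a retrieval experiment by accident.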
Ad-hoc task submission requirements
Goal: To retrieve relevant documents for a given query (a result list of 1000 documents is expected). 50 topics will be provided (CHIC-051 – CHIC-100).
Collection processing: All document fields can be used for retrieval. The collections may not be altered in response to the CHiC 2013 topics, that is, new content may not be added specifically adapted to the topics. Other alterations (e.g. document translation or expansion) that are non-specific to the queries are permitted. Other external resources are also permitted, but must be noted in the run description later.
Conditions for participation: All participating groups must submit at least two monolingual runs corresponding to at least two of the 13 sub-collection languages and two multilingual runs (using as many collection languages as possible, optimally all 13, minimally 2).
A maximum number of 13 monolingual runs (at most 2 runs per language) and a total maximum number of 20 runs (in any combination) are permitted.
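The participation conditions and run limits above can be checked before submission. The Python sketch below is only one possible reading of those rules; the function name check_run_plan and the way a run plan is represented (one language code per monolingual run plus a count of multilingual runs) are assumptions made for this example.

```python
from collections import Counter


def check_run_plan(monolingual_langs, n_multilingual):
    """Return a list of rule violations (empty if the plan looks acceptable).

    monolingual_langs holds one language code per planned monolingual run.
    """
    problems = []
    per_lang = Counter(monolingual_langs)
    if len(per_lang) < 2:
        problems.append("monolingual runs must cover at least two languages")
    if n_multilingual < 2:
        problems.append("at least two multilingual runs are required")
    if len(monolingual_langs) > 13:
        problems.append("at most 13 monolingual runs are permitted")
    if any(count > 2 for count in per_lang.values()):
        problems.append("at most 2 monolingual runs per language are permitted")
    if len(monolingual_langs) + n_multilingual > 20:
        problems.append("at most 20 runs in total are permitted")
    return problems


print(check_run_plan(["DE", "DE", "EN"], 2))  # acceptable plan
print(check_run_plan(["DE", "DE", "DE"], 1))  # violates several rules
```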
Result format: Results have to be submitted in ASCII format, with one line per document retrieved. The lines have to be formatted as follows:
CHIC-051 Q0 http://www.europeana.eu/resolve/record/11111/1A2B3C1111111111 0 0.017416 runindex1
Field numbers, from left to right: 1 2 3 4 5 6
The fields must be separated by ONE space and have the following meanings:
1) Query identifier. INPUT MUST BE SORTED NUMERICALLY BY QUERY NUMBER.
2) Query iteration (will be ignored. Please choose "Q0" for all experiments).
3) Document number (content of the "ims:identifier" attribute in the <ims:metadata> element).
4) Rank 0-n (0 is best matching document. If you retrieve 1000 documents per query, rank will be 0..999, with 0 best and 999 worst). Note that rank starts at 0 (zero) and not 1 (one). MUST BE SORTED IN INCREASING ORDER PER QUERY.
5) RSV value (system-specific, floating point value that expresses how relevant your system deems a document to be; higher relevance corresponds to higher values). If a document D1 is considered more relevant than a document D2, this must be reflected in the fact that RSV1 > RSV2. If RSV1 = RSV2, the documents may be randomly reordered during calculation of the evaluation measures. Please use a decimal point ".", not a comma. Do not use any form of separators for thousands. The only legal characters for the RSV values are 0-9 and the decimal point. MUST BE SORTED IN DECREASING ORDER PER QUERY.
6) Run identifier (please choose a unique ID for each experiment you submit). Only use a-z, A-Z and 0-9. No special characters, accents, etc.
The fields are separated by a single space. The result file contains nothing but lines formatted in the way described above.
You are expected to retrieve max. 1000 documents per query. An experiment that retrieves a maximum of 1000 documents each for 50 queries therefore produces a file that contains a maximum of 50000 lines.
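The result format described above can be produced with a few lines of code. The following Python sketch is a minimal illustration, not an official tool: the function name format_run_lines, the choice of six decimal places for the RSV and the second document identifier are assumptions made for this example.

```python
def format_run_lines(query_id, scored_docs, run_id, max_docs=1000):
    """Return the result lines for one query, best document first.

    scored_docs is an iterable of (document_id, rsv) pairs in any order.
    """
    if not (run_id.isascii() and run_id.isalnum()):
        raise ValueError("run identifier may only contain a-z, A-Z and 0-9")
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:max_docs]
    lines = []
    for rank, (doc_id, rsv) in enumerate(ranked):  # ranks start at 0
        if rsv < 0:
            raise ValueError("RSV may only contain digits and a decimal point")
        # Fixed-point output avoids exponents, commas and thousands separators.
        lines.append(f"{query_id} Q0 {doc_id} {rank} {rsv:.6f} {run_id}")
    return lines


# The second identifier below is the one from the example line; the first
# is a made-up placeholder used only to show the reordering by RSV.
docs = [
    ("http://www.europeana.eu/resolve/record/11111/HYPOTHETICAL0000002", 0.009),
    ("http://www.europeana.eu/resolve/record/11111/1A2B3C1111111111", 0.017416),
]
for line in format_run_lines("CHIC-051", docs, "runindex1"):
    print(line)
```

To keep the whole file sorted by query number, call the function for the topics in numerical order and write the returned lines one query after another. Note that rounding to a fixed number of decimals can turn two distinct RSVs into equal values; if that matters, increase the precision.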
You should know that the effectiveness measures used in CLEF evaluate the performance of systems at various points of recall. Participants must thus return at most 1000 documents per query in their results. Please note that, by its nature, the average precision measure does not penalize systems that return extra irrelevant documents at the bottom of their result lists. Therefore, you will usually want to use the maximum number of allowable documents in your official submissions. If you knowingly retrieved fewer than 1000 documents for a topic, please take note of that and check your numbers against those reported by the system during the submission.
Submission: Please submit your runs to the DIRECT system, which will be opened soon (a username and password will be sent to you). Result files should be uploaded as zip files and validated through the DIRECT system before the final submission. Runs can be deleted or added as necessary.
Please indicate the topic and sub-collection languages for each run.
Semantic enrichment submission requirements
Goal: To retrieve 10 related concepts (terms or phrases) for a given topic to semantically enrich the topic and / or guess the user's information need or original query intent. 25 topics will be provided (CHIC-051 – CHIC-075).
Collection processing: Theoretically, Europeana collections don't need to be used for this task as query enrichment can also be done with external resources. If the Europeana collections are used for enrichment, all document fields can be used.
Conditions for participation: All participating groups must submit at least one monolingual or multilingual enrichment file (containing 10 monolingual or multilingual terms or phrases per topic for enrichment). Monolingual enrichment files contain enrichment concepts in only one of the sub-collection languages; multilingual enrichment files contain enrichment concepts in two or more sub-collection languages. The related concepts should be weighted (assigning weights between 0 and 1) in order to represent the importance of a concept in query expansion. A weight assigned for query expansion means that the enrichment term or phrase will be used with this weight in query expansion, e.g. a weight of 0 means the concept will not be used in query expansion, a weight of 0.5 means the enrichment concept will have half the weight of the original query and a weight of 1 means the enrichment concept will have the same weight as the original query term.
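To make the meaning of the weights concrete, the Python sketch below combines an original query with weighted enrichment concepts. It is an illustration under stated assumptions only: how the expansion is actually carried out during evaluation is not specified in these guidelines, and the function name expand_query as well as the example concepts and weights are made up.

```python
def expand_query(query_terms, enrichments):
    """Map each term or phrase to the weight it would carry after expansion."""
    weighted = {term: 1.0 for term in query_terms}  # original terms keep full weight
    for concept, weight in enrichments:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight out of range for {concept!r}: {weight}")
        if weight == 0.0:
            continue  # a weight of 0 means the concept is not used
        weighted.setdefault(concept, weight)  # never down-weight an original term
    return weighted


# Hypothetical enrichment concepts for the query "chardonne".
expanded = expand_query(
    ["chardonne"],
    [("writer", 0.5), ("switzerland", 1.0), ("vase", 0.0)],
)
print(expanded)
```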
A maximum number of 2 monolingual files in one language (e.g. DE) and a total maximum number of 10 enrichment files (in any language combination) are permitted.
Result format: Files should be submitted in the tab-separated format shown below (where column 1 is the topic ID, column 2 the weight and column 3 the expansion term):
Topic ID <tab> weight <tab> enrichment concept
Example:
CHIC-001 0.4 netherlands
CHIC-001 0.3 vase
CHIC-001 0.1 porcelain
...
CHIC-001 0.4 pottery
CHIC-002 0.7 ship
...
Each topic should have at most 10 lines, each containing one enrichment concept (term or phrase).
You are expected to present at most 10 enrichment concepts per query. An experiment that retrieves a maximum of 10 enrichments for each of the 25 queries therefore produces a file that contains a maximum of 250 lines.
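The tab-separated enrichment format can be written and checked with a short helper. The Python sketch below is a minimal example; the function name format_enrichment_lines is an assumption, and the concepts and weights are taken from the example lines above.

```python
def format_enrichment_lines(topic_id, concepts, max_concepts=10):
    """Return tab-separated lines: topic ID, weight, enrichment concept."""
    if len(concepts) > max_concepts:
        raise ValueError(f"{topic_id}: at most {max_concepts} concepts per topic")
    lines = []
    for concept, weight in concepts:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{topic_id}: weight {weight} is not between 0 and 1")
        if "\t" in concept or "\n" in concept:
            raise ValueError(f"{topic_id}: concept must not contain tabs or newlines")
        lines.append(f"{topic_id}\t{weight}\t{concept}")
    return lines


lines = format_enrichment_lines(
    "CHIC-001",
    [("netherlands", 0.4), ("vase", 0.3), ("porcelain", 0.1)],
)
print("\n".join(lines))
```

Phrases with internal spaces are unaffected, because only the tab character separates the columns.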
Submission: Please submit your runs to the DIRECT system, which will be opened soon (a username and password will be sent to you). Result files should be uploaded as zip files and validated through the DIRECT system before the final submission. Runs can be deleted or added as necessary.
Please indicate for each file whether the enrichment concepts are for a monolingual or a multilingual run (and which languages are involved).