- Collections used for the Multilingual Task 2013
- Collections used for the Polish Task 2013
- Download the CHiC Collections for 2013
Overview
The collection used for all 3 tasks in this lab is Europeana (www.europeana.eu), a large digital library, museum and archive, which provides access to over 20 million cultural heritage objects. The documents in the Europeana collection are metadata records consisting of brief descriptions of the object (title, keywords, description, date, provider) and occur in multiple languages. An XML-formatted version of all Europeana metadata objects is available to the participants.
The CHiC Europeana collection is divided into 13 language-based sub-collections (each containing more than 100,000 documents in one language) and 1 "other" collection, which contains the remaining multilingual documents (in languages with fewer than 100,000 documents each).
Detailed documentation of the CHiC Europeana collection (incl. element descriptions and collections) can be found here: http://www.promise-noe.eu/documents/10156/37d9a852-fcfd-49a7-94ac-91db3fc139e9
Collections used for the Multilingual Task 2013
The multilingual ad-hoc and semantic enrichment tasks will use the combination of the 13 language-based sub-collections as their multilingual test collection.
The "other" collection (containing documents in languages that have fewer than 100,000 documents represented in the collection) will NOT be used!
For the ad-hoc and semantic enrichment multilingual CHiC tasks, experiments will be judged on the whole multilingual collection, which contains 20.3 million documents in 13 languages. The following table provides an overview of the language-based sub-collections:
| Sub-Collection | Number of Documents in Collection |
|---|---|
| De | 3,865,680 |
| Fr | 3,635,388 |
| Sv | 2,360,050 |
| It | 2,120,059 |
| Es | 1,953,124 |
| No | 1,557,820 |
| Nl | 1,251,027 |
| En | 1,107,176 |
| Pl | 1,093,705 |
| Fi | 800,302 |
| Sl | 246,952 |
| El | 197,371 |
| Hu | 121,771 |
| Total | 20,310,425 |
Participants are, however, allowed to use only individual (monolingual) collections in their experiments, even though there will not be separate monolingual or bilingual tasks.
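For participants planning experiments on a subset of languages, the following minimal Python sketch shows how the size of a chosen subset relates to the full multilingual collection. The counts are copied from the table above; the names `SUB_COLLECTIONS`, `collection_size` and `share_of_total` are illustrative only and are not part of any CHiC tooling.

```python
# Document counts per language sub-collection, copied from the table above.
SUB_COLLECTIONS = {
    "De": 3_865_680,
    "Fr": 3_635_388,
    "Sv": 2_360_050,
    "It": 2_120_059,
    "Es": 1_953_124,
    "No": 1_557_820,
    "Nl": 1_251_027,
    "En": 1_107_176,
    "Pl": 1_093_705,
    "Fi": 800_302,
    "Sl": 246_952,
    "El": 197_371,
    "Hu": 121_771,
}


def collection_size(languages):
    """Return the number of documents in the chosen sub-collections."""
    return sum(SUB_COLLECTIONS[lang] for lang in languages)


def share_of_total(languages):
    """Return the fraction of the multilingual collection that is covered."""
    return collection_size(languages) / collection_size(SUB_COLLECTIONS)


if __name__ == "__main__":
    # Iterating over the dict yields all 13 language codes.
    print(collection_size(SUB_COLLECTIONS))
    print(collection_size(["Pl"]))
    print(f"{share_of_total(['De', 'Fr']):.1%}")
```

Summing all 13 entries reproduces the total given in the table, which is a quick sanity check after downloading and counting the records locally.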
Collections used for the Polish Task 2013
The Polish part of the Europeana corpus contains 1,093,705 documents (or CH object descriptors). Each document is identified by the ims:identifier tag, which must be used to refer uniquely to the documents returned in the resulting ranked list. According to the Lucene search engine (rounded, stopwords included), a typical object descriptor contains around 35 words on average.
After examining the tags available in the Europeana collection, we found the following ones to be of particular interest:
<dc:contributor>
<dc:creator>
<dc:date>
<dc:language>
<dc:subject>
<dc:title>
<dc:type>
<dcterms:alternative>
<dcterms:created>
<europeana:language>
<europeana:type>
<europeana:uri>
<europeana:year>
<ims:chic/ims:metadata/ims:fields/enrichment:concept_broader_label>
<ims:chic/ims:metadata/ims:fields/enrichment:concept_label>
<ims:chic/ims:metadata/ims:fields/enrichment:period_label>
<ims:chic/ims:metadata/ims:fields/enrichment:place_broader_label>
The following is an example of a CH object description:
<ims:fields>
<dc:contributor>Kopera, Feliks (1871-1952)</dc:contributor>
<dc:creator>Gottlieb, Maurycy (1856-1879)</dc:creator>
<dc:date>[1923]</dc:date>
<dc:language>pol</dc:language>
<dc:subject>18-19 w. - ikonografia</dc:subject>
<dc:subject>19-20 w. - ikonografia</dc:subject>
<dc:subject>Judaica - ikonografia</dc:subject>
<dc:subject>Malarstwo zydowskie - ikonografia</dc:subject>
<dc:title>Maurycy Gottlieb 1856-1879 : 26 reprodukcji wedlug obrazów mistrza</dc:title>
<dc:type>grafika</dc:type>
<europeana:language>pl</europeana:language>
<europeana:type>IMAGE</europeana:type>
<europeana:uri>http://www.europeana.eu/resolve/record/92033/0970289D530CDAA11119BD4176B27D727C02A070</europeana:uri>
<europeana:year>1923</europeana:year>
</ims:fields>
An object may be described by only a subset of the possible tags: not all tags are always present, and in many cases some tags are empty. Moreover, some tags appear several times in the description of a single object, each time with different content (e.g., the dc:subject tag). The dc:language and europeana:language tags both indicate the language used to describe an object, but they are not necessarily equivalent to one another. For some objects, the title field is written in another language (e.g., German, Yiddish, English, or undefined) while the dc:subject tags are written in Polish; in such cases, to the best of our knowledge, an equivalent Polish title is probably provided in <dcterms:alternative>. Certain tags have short contents, such as the europeana:type tag, whose content can be either IMAGE or TEXT and corresponds to the type (but not the medium) of the CH object at hand.
After inspecting the Polish dataset, one can assume that the descriptions are, on the whole, correctly spelled: occasional spelling errors occur, but we estimate this phenomenon to be marginal. In the title field, however, we may encounter formulations reflecting the Polish language as used in the 18th century.
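The properties described above (missing tags, empty tags, repeated tags) matter when reading the records programmatically. The following Python sketch, using the standard library's `xml.etree.ElementTree`, collects every field of one record into a list of values. It rests on several assumptions: the `dc` and `dcterms` namespace URIs are the standard Dublin Core ones, whereas the `europeana` and `ims` URIs are placeholders that should be replaced with those declared in the collection files; the sample is an abridged copy of the example record with an empty `dc:type` added for illustration; and `parse_fields` is a hypothetical helper, not part of any CHiC tooling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: prefix-to-URI mapping. Only dc and dcterms are the standard
# Dublin Core URIs; the other two are placeholders for this sketch.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "europeana": "urn:example:europeana",  # placeholder, check the real files
    "ims": "urn:example:ims",              # placeholder, check the real files
}

# Abridged copy of the example record; namespace declarations are added so
# that the fragment can be parsed on its own, and dc:type is emptied on
# purpose to illustrate an empty tag.
XMLNS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
SAMPLE = f"""<ims:fields {XMLNS}>
  <dc:creator>Gottlieb, Maurycy (1856-1879)</dc:creator>
  <dc:language>pol</dc:language>
  <dc:subject>Judaica - ikonografia</dc:subject>
  <dc:subject>Malarstwo zydowskie - ikonografia</dc:subject>
  <dc:type></dc:type>
  <europeana:language>pl</europeana:language>
  <europeana:type>IMAGE</europeana:type>
</ims:fields>"""


def parse_fields(xml_text):
    """Map 'prefix:name' to the list of non-empty values found in a record."""
    uri_to_prefix = {uri: prefix for prefix, uri in NS.items()}
    values = defaultdict(list)
    for element in ET.fromstring(xml_text):
        # ElementTree reports namespaced tags as '{namespace-uri}localname'.
        uri, _, local = element.tag[1:].partition("}")
        text = (element.text or "").strip()
        if text:  # tags that are present but empty contribute nothing
            values[f"{uri_to_prefix.get(uri, uri)}:{local}"].append(text)
    return dict(values)


if __name__ == "__main__":
    record = parse_fields(SAMPLE)
    print(record["dc:subject"])                   # repeated tag: two values
    print(record.get("dc:type", []))              # empty tag: no values
    print(record.get("dcterms:alternative", []))  # absent tag: no values
```

Storing a list per field, and looking fields up with `.get(..., [])`, treats absent and empty tags the same way and keeps all values of repeated tags such as dc:subject.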
Download the CHiC Collections for 2013
The collections are available for download (password-protected). The password for access to the collections can be obtained by filling out the CHiC 2013 End-User Agreement (agreement to use the collections for research purposes only) at this URL: http://www.promise-noe.eu/documents/10156/3d796587-213f-4c97-9b81-6c95ec057de1
and sending the scanned forms by email to vivien.petras@ibi.hu-berlin. A password to access the collections will be sent via email.