29 March: DCLRS -- Tomaz Erjavec, "MULTEXT-East Version 4: multilingual



Seminar Announcement:

Dublin Computational Linguistics Research Seminar
DCLRS 2008/2009
DCU DIT TCD UCD


Venue: Jonathan Swift Lecture Theatre (Arts Building 2041a)
Trinity College Dublin
Time: 16:00, Friday, April 3, 2009



Title:

MULTEXT-East Version 4: multilingual morphosyntactic specifications
for lots of languages


Speaker:

Dr. Tomaz Erjavec
Dept. of Knowledge Technologies
Jozef Stefan Institute
Ljubljana, Slovenia


Abstract:

The talk presents work in progress on the fourth version of the
multilingual language resources originating in the MULTEXT and
MULTEXT-East projects in the 1990s. The resources focus on
language-technology-oriented morphosyntactic descriptions of
languages, i.e. on providing features and tagsets useful for
word-level tagging of corpora, commonly known as
part-of-speech tagging. Unlike English, where "part-of-speech"
tagsets contain around 50 tags, most other (inflectional, agglutinating)
languages have much richer word-level morphosyntactic structures; the
tagset for Slovene, for example, has almost 2,000 different tags. The
MULTEXT-East resources comprise morphosyntactic specifications,
defining the features and their tagsets, lexica, and annotated
corpora. Version 3 (2004) is the latest released version; its
resources are freely available for research from
http://nl.ijs.si/me/ and have been downloaded by over 200 registered
users, mostly from universities and research institutions. The talk
introduces the XML structure of the specifications in Version 4, which
is to contain data for over 13 languages. We discuss the characteristics
of the languages covered, the use of the Text Encoding Initiative
Guidelines as the encoding scheme, and the use of XSLT to transform the
specifications into other formats. An application of this framework is
then given, namely the JOS language resources for Slovene,
http://nl.ijs.si/jos/, which provide a manually validated
morphosyntactically annotated reference corpus for the
language. Finally, the methodology of adding new languages to the
specifications is presented.



Winter Schedule:

January 16 John Tait (Sunderland)
January 23 Gerhard Jaeger (Bielefeld)
January 30 Pat Healey (Queen Mary)
February 6 Sebastian Moeller [cancelled]
February 13 Alfredo Maldonado Guerra (Microsoft Dublin)
February 20 Dietmar Janetzko (NCI)
February 27 Steve Pulman (Oxford) [cancelled]
March 6 Tim Fernando (TCD)
March 13 Andreas Vlachos (Cambridge)


Spring Schedule:

April 3 Tomaz Erjavec (Jozef Stefan Institute)
April 10 Public Holiday
April 17 Josef van Genabith (DCU)
April 24 Frank Keller (Edinburgh)
May 1 held
May 8 Brian Murphy (Trento)
May 15 Kees van Deemter (Aberdeen)
May 22 Elisabeth Andre (Augsburg)
