16 January: DCLRS -- Sylviane Granger, Friday, January 18, 4pm, Walton Theatre



----------------------------------------------------------
| Dublin Computational Linguistics Research Seminar |
| DCLRS 2001/2002 |
| DCU TCD UCD |
///////////////////////////////////////////////////////////


NOTE: All other talks for the remainder of the year are located
in Davis House, Trinity College, University of Dublin.
This talk, however, is in the Walton Lecture Theatre, which is
near the Arts Block entrance to the Lecky Library.

venue: Walton Lecture Theatre
Trinity College
time: 4:00-6:00, Friday, January 18
speaker: Sylviane Granger
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium


title: Compiling and Mining Error Tagged Learner Corpora


abstract:


After a brief introduction, where I define corpus
linguistics and its theoretical underpinnings, I will
give an overview of the field of learner corpus
research and describe two learner corpora compiled at
Louvain: the International Corpus of Learner English
(ICLE) and the French Interlanguage Database (FRIDA).
I will then focus on the issue of error annotation,
which is relevant for any corpus, but is particularly
important in the case of learner corpora, which
contain non-native data and therefore typically have a
much higher error rate (Granger 1998; Dagneaux et al.
1998).

I will describe a system of error annotation which has
been developed within the framework of the FreeText
project, whose aim is to develop a hypermedia
computer-assisted language learning (CALL) system for
learners of French that relies on natural language
processing and incorporates error typologies based on
the analysis of extended corpora from learners of
different linguistic backgrounds.

Through a three-tiered system of XML tags, it is
possible to annotate every error in the corpus in
terms of (1) error domain (spelling, morphology,
grammar, lexis, etc.); (2) error category (number,
tense, voice, redundant, etc.); (3) part-of-speech
(noun, adjective, verb, etc.). The error tags and
correct forms are inserted in the corpus with the help
of a menu-driven error editor (the Louvain Error
Editor Toolbox). To ensure maximum
inter-annotator consistency, the error annotation
system is described and illustrated in an Error
Tagging manual. Once annotated, the corpora can be
submitted to text retrieval software tools, such as
WordSmith Tools (OUP), to derive both quantitative
(error statistics) and qualitative (concordances)
results, both of which are then used to inform the
error diagnosis system incorporated in the CALL
program. Another possible use of error-tagged corpora
is the improvement of automatic error detection
systems, such as Hardt's (2001) comma checking system,
which uses the transformation-based learning system of
the Brill tagger.
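
As an illustration of how a three-tiered annotation of this kind might be
mined for error statistics, here is a minimal sketch in Python. It is not
the FreeText tag set or the Louvain tooling: the element name `err`, the
attribute names `domain`, `category`, `pos` and `corr`, the category value
`diacritic`, and the sample sentence are all assumptions made up for this
example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated learner sentence. The <err> element and its
# attributes are illustrative only; they are NOT the actual FreeText tags.
SAMPLE = (
    '<text>Hier, le '
    '<err domain="spelling" category="diacritic" pos="noun" corr="garçon">garcon</err> '
    '<err domain="grammar" category="tense" pos="verb" corr="jouait">joue</err> '
    'avec ses '
    '<err domain="grammar" category="number" pos="noun" corr="amis">ami</err>.'
    '</text>'
)

TIERS = ("domain", "category", "pos")


def error_statistics(xml_text):
    """Count annotated errors separately for each of the three tiers."""
    root = ET.fromstring(xml_text)
    counts = {tier: Counter() for tier in TIERS}
    for err in root.iter("err"):
        for tier in TIERS:
            counts[tier][err.get(tier)] += 1
    return counts


def corrections(xml_text):
    """Return (erroneous form, corrected form) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(err.text, err.get("corr")) for err in root.iter("err")]


stats = error_statistics(SAMPLE)
for tier in TIERS:
    print(tier, dict(stats[tier]))
```

Keeping one counter per tier means the same annotated corpus can be
summarised by domain, by category or by part of speech without
re-annotating anything, which is the kind of quantitative result the
abstract describes.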

References

Dagneaux E., Denness E. & Granger S. 1998. Computer-aided
Error Analysis. System: An International Journal of
Educational Technology and Applied Linguistics 26(2):
163-174.

Granger S. 1998. Learner English on Computer. Addison
Wesley Longman: London & New York.

Hardt D. 2001. Comma checking in Danish. In Rayson P.,
Wilson A., McEnery T., Hardie A. & Khoja S. (eds.)
Proceedings of the Corpus Linguistics 2001
Conference. University Centre for Computer Corpus
Research on Language. Lancaster University: Lancaster:
266-271.

FreeText stands for French in Context: An advanced
hypermedia CALL system featuring NLP tools for a smart
treatment of authentic documents and free production
exercises. The FreeText project is funded under the
User-friendly Information Society (IST) programme of
the 5th framework programme of the European Commission
(Key Action: Multimedia Content and Tools; Action
Line: Education and Training; Contract number:
IST-1999-13093).


The Dublin Computational Linguistics Research Seminar series is run
jointly by DCU (Dublin City University), TCD (Trinity College Dublin)
and UCD (University College Dublin).

The 2001/2002 seminar series is hosted by Trinity College with the
support of the Department of Computer Science, the Centre for Language
and Communication Studies, the Department of Germanic Studies, the
School of Irish, the Department of French and the Centre for Computing
and Language Studies.

For a partial listing of recent seminar contents, see:
http://www.cs.tcd.ie/research_groups/clg/DCLRS.html
