22 November: DCLRS: G Lynch, Nov 30, 4pm, LB08, TCD

Dublin Computational Linguistics Research Seminar: Index of November 2012 | Dublin Computational Linguistics Research Seminar - Index of year: 2012 | Full index


A DCLRS talk of this academic year will be given on Friday of next week,
Nov 30th, at the usual time, 4pm, in LB08 (Lloyd Building Basement, TCD).

Speaker: Gerard Lynch

Title: Detecting the source language of a literary translation

Abstract
In recent times there has been increased interest in problems in
translation stylistics from researchers in computational linguistics.
Baroni and Bernardini (2006) spearheaded this new collaboration between
translation studies and the computational sciences with a study that
applied machine learning techniques from the text classification
literature to learn textual features which distinguish translated from
non-translated Italian journalistic text. Their work was also novel in
its experiment comparing human identification of translated text with
the performance of computational methods on the same task. A related
task was examined by van Halteren (2008), who used similar methods to
detect the source language of translated text from the Europarl corpus
in several European languages.
Our work examines the same question in relation to literary
translations: can one detect the source language of a literary
translation, a genre for which automatic classification could be
considered more complex due to the varying nature of literary style? A
corpus of 19th century literary works was assembled for experimental
purposes, including translations from German, French and Russian. In
reference to Bernardini et al., English original texts were also
included in the classification task. We present results of our
classification experiments, including analysis of the textual features
found to be discriminatory in our task (word and POS n-grams, and
document statistics such as type-token ratio).
Classification results were found to be comparable to the state of the
art (ca. 80%) based on 10-fold cross-validation experiments and testing
on a held-out set. Testing on unseen data resulted in lower accuracy;
however, results were still well above the baseline.
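
For readers unfamiliar with the feature types named in the abstract, the
following is a minimal sketch of how a type-token ratio and word n-gram
counts might be computed. It is an illustration only, not the speaker's
implementation: the function names, the naive whitespace tokenizer and
the sample sentence are all assumptions, and POS n-grams (which need a
part-of-speech tagger) are left out.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace, and strip
    # punctuation from the edges of each word.
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def type_token_ratio(tokens):
    # Distinct word forms (types) divided by total words (tokens).
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def word_ngrams(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text, used only to exercise the functions.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)
ttr = type_token_ratio(tokens)
bigrams = word_ngrams(tokens, 2)
```

Counts of this kind, computed per document, could then be passed as
features to a text classifier and evaluated with cross validation.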

