30 January: DCLRS -- Hans van Halteren, Friday, February 2, 16:00 (TCD Arts
Dublin Computational Linguistics Research Seminar: Index of January 2018 | Dublin Computational Linguistics Research Seminar - Index of year: 2018 | Full index
Friday of this week (February 2, 2018), at 16:00, in room 3074 of the
Arts Building (TCD), Prof. Hans van Halteren (Radboud University, NL)
will speak.
Title:
Authorship Recognition: Features in Focus
Abstract:
In Authorship Recognition (as, incidentally, in other text classification
tasks), the accepted approach is to count specific "features" in the
texts, and try to determine authorship on the basis of similarities or
dissimilarities in the counts. This can be done by a human expert,
focusing on intuitively interesting features or using fixed lists; the
number of features used will tend to be small. Another approach is to
use (very) (very very) many features and apply statistical techniques
in an attempt to calculate a probability of a specific author having
produced the disputed texts.
When this approach was first used, the list of chosen features was
still small, e.g. the fifty most frequent function words. These days there
are millions of potential features from which to choose, or to just
use all. When using all, one hopes that random factors are filtered
out by the sheer number of features used. When choosing, those
features are picked that appear to be most distinctive for the author
in question.
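
To make the counting approach concrete, here is a minimal sketch, not
taken from the talk. It assumes whitespace tokenization, uses the most
frequent words of the known texts as a rough stand-in for a function
word list, and compares relative frequencies with a Manhattan distance.
The function names and the choice of distance are illustrative
assumptions only.

```python
from collections import Counter


def top_words(texts, k):
    """Return the k most frequent tokens over all given texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(k)]


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


def manhattan(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_author(known, disputed, k=50):
    """Pick the known author whose profile is nearest to the disputed text.

    `known` maps author name -> sample text (hypothetical input format).
    """
    vocabulary = top_words(list(known.values()), k)
    target = relative_frequencies(disputed, vocabulary)
    distances = {
        author: manhattan(relative_frequencies(text, vocabulary), target)
        for author, text in known.items()
    }
    return min(distances, key=distances.get), distances
```

A realistic system would use far more features and a proper statistical
model; the sketch only shows the "count, then compare" idea.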
In this talk, I want to discuss specific properties of features that
may influence their usefulness, such as the sample size needed for a
reliable measurement or the degree to which they are restricted by
text genre. These properties could be used to identify features that
are better left unselected, or that should be excluded when otherwise
using all features. I will exemplify the discussed properties on
the basis of measurements made on texts from the British National
Corpus, of features representing both vocabulary and syntactic
structure.
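
As a rough illustration of the sample size question (an assumed setup,
not the measurements from the talk), one can split a token sequence
into equal-sized chunks and look at how much a feature's relative
frequency varies between chunks. A smaller spread at a given chunk size
suggests a more reliable measurement at that sample size. The helper
names below are hypothetical.

```python
import statistics


def chunk_frequencies(tokens, feature, chunk_size):
    """Relative frequency of `feature` in each full chunk of `chunk_size` tokens."""
    freqs = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        freqs.append(chunk.count(feature) / chunk_size)
    return freqs


def spread_by_sample_size(tokens, feature, chunk_sizes):
    """Population standard deviation of the feature frequency per chunk size.

    Chunk sizes that yield fewer than two full chunks are skipped.
    """
    spread = {}
    for size in chunk_sizes:
        freqs = chunk_frequencies(tokens, feature, size)
        if len(freqs) >= 2:
            spread[size] = statistics.pstdev(freqs)
    return spread
```

For example, `spread_by_sample_size(tokens, "the", [100, 1000, 10000])`
would return one spread value per chunk size for the word "the".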
ABOUT THE SPEAKER
Hans van Halteren currently works at the Department of Language and
Speech, Radboud University in the Netherlands. His research
contributes to advances in Data Mining, Computing in Social Science,
Arts and Humanities and Computational Linguistics. His main areas of
research are in forensic linguistics, language variation, and
processing of non-standard text (for example, Twitter texts).
------
The Dublin Computational Linguistics Research Seminar series, hosted
this year by the Trinity Centre for Computing and Language Studies, is
a cooperation among Trinity College Dublin, Dublin City University,
University College Dublin and the Dublin Institute of Technology, a
long-standing collaboration which overlaps with the SFI CNGL/ADAPT
centres.