Dr. Jennifer Foster (DCU)

#hardtoparse: The Challenges of Parsing the Language of Social Media

The emergence of social media represents a significant challenge for natural language processing researchers. How suitable are existing NLP tools, often trained on newswire and expecting grammatically well-formed input, for processing the linguistically diverse mix of genres and domains that constitutes the modern web? How robust are these tools to the non-standard forms found in unedited, casually written language? To what extent can domain adaptation techniques be used to improve performance? How important are data pre-processing and normalisation?

In this talk, I will focus on the problem of syntactic parsing and describe the work carried out to date by researchers at the National Centre for Language Technology at Dublin City University on parsing the language of social media. This work includes an evaluation of four widely used statistical parsers on a new dataset of tweets and discussion forum posts, as well as experiments which aim to improve parsing performance using the following methods:

1. Modelling the target domain, i.e. transforming the parser training data (in our case, the Penn Treebank) so that it more closely resembles the data to be parsed, and then training a new parsing model
2. Self-training and up-training using large quantities of automatically labelled data
3. A combination of data normalisation, parser accuracy prediction to select suitable training data, genre classification, and self-training using products of random latent variable grammars

The third approach proved very effective in the recent shared task on parsing English web data (https://sites.google.com/site/sancl2012/home/shared-task/results).
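To give a flavour of the first method, here is a minimal illustrative sketch (not the actual DCU pipeline; the specific transformations and function names below are assumptions) of how edited treebank tokens might be rewritten to look more like casually written social media text before retraining a parser:

```python
# Hypothetical "tweetification" transforms for treebank-style token lists.
# These are illustrative guesses at the kind of surface changes involved,
# not the transformations actually used in the work described in the talk.

def lowercase_tokens(tokens):
    """Casual writers often skip standard capitalisation."""
    return [t.lower() for t in tokens]

def drop_final_punct(tokens):
    """Social media text frequently omits sentence-final punctuation."""
    if tokens and tokens[-1] in {".", "!", "?"}:
        return tokens[:-1]
    return tokens

def tweetify(tokens):
    """Apply the transforms in sequence to one tokenised sentence."""
    return drop_final_punct(lowercase_tokens(tokens))

print(tweetify(["The", "parser", "failed", "."]))
# ['the', 'parser', 'failed']
```

A transformed copy of the treebank (with the original trees kept aligned to the rewritten tokens) would then serve as training data for a new parsing model better matched to the target domain.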