31 January: fyi -- internships, CA
Internships in Natural Language Processing
USC/Information Sciences Institute
Summer 2001
We are looking for interested and qualified students (graduate and
undergraduate) to spend the summer working with ongoing research
projects at USC/ISI on natural language processing, machine learning,
statistical modeling, automatic translation, human/computer dialog,
discourse analysis, and other areas.
Currently we have positions open in the following areas:
1. Statistical Machine Translation
Translating human languages (e.g., Chinese to English) is a longstanding
challenge for computer science. We are developing and applying
statistical algorithms to this problem, extracting large amounts of
relevant translation knowledge automatically from bilingual text (e.g.,
Hong Kong government documents). We face many interesting challenges in
this quest to improve significantly on the quality of commercially
available translators, and to build translation systems for "smaller"
languages (e.g., Tamil and Tetun) that have not yet received significant
commercial interest.
2. Statistical Summarization
When humans produce summaries of documents, they do not simply extract
sentences and concatenate them. Rather, they create new sentences that
are grammatical, that cohere with one another, and that capture the most
salient pieces of information in the original document. Given that large
collections of text/abstract pairs are available online, it is now
possible to envision algorithms that are trained to mimic this process.
We have already developed statistical algorithms capable of compressing
sentences; these algorithms produce short sentences that are grammatical
and that retain the most important pieces of information in the original
sentences. Current plans call for scaling up the statistical
compression techniques that we developed for sentences, so that they
apply to entire texts. This work will use discourse and
summarization corpora in order to build statistical models that produce
coherent abstracts.
3. Alignment and Exploitation of Biological and Natural Language
Sequences
The analysis of sequences in molecular biology (e.g., DNA and proteins)
is of great scientific and practical interest. The same is true of
natural language sequences (e.g., newswire and bilingual text). Both
fields have just witnessed an explosion in available online data. We
have sequenced the human genome: but now what? We have tens of millions
of words that have been translated manually from English to French: but
now what? Fortunately, many sequence-analysis algorithms developed in
one field can be usefully applied to the other. For example, if we
imagine the New York Times as a live organism, then we can view its
linear text stream as its "DNA," and this may run to billions of
characters. Le Monde (the French daily) is a related organism with a
similar function, but different "DNA." Biological algorithms now exist
for aligning the DNA of two organisms -- by applying these kinds of
algorithms to natural language text, we can automatically align stories,
sentences, phrases, and words. From this aligned data, we know how to
automatically construct translation systems. Likewise, algorithms
developed in statistical machine translation may be profitably applied
to biological sequences. USC has expertise in both of these fields; we
also have a newly installed cluster computer for executing large-scale
sequence computations.
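
As a rough illustration of the idea above, here is a minimal sketch of
Needleman-Wunsch global alignment, a standard algorithm from biological
sequence analysis, applied to word sequences instead of DNA bases. It is
not the method used at USC/ISI; the function name align_tokens and the
scoring values (match, mismatch, gap) are assumptions chosen for the
example. It compares words by exact identity, so it only aligns two
texts in the same language; aligning bilingual text would need a
translation-aware similarity score in place of the equality test.

```python
def align_tokens(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align token lists a and b (Needleman-Wunsch).

    Returns (score, pairs), where pairs is a list of (token_a, token_b)
    tuples and None marks a gap. Scoring values are illustrative only.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two tokens
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    s, p = align_tokens("a b c".split(), "a c".split())
    print(s, p)
```

The same dynamic program runs unchanged on strings of nucleotides,
because it only asks whether two symbols are equal; that symmetry is
the point of the analogy between text and "DNA."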
4. Human/Computer Dialog for Automated Agents in Simulations
Human/computer speech dialog is a research area of increasing
importance. We are working on dialog in the context of virtual-reality
simulations, where automated agents interact with people and with each
other. Natural language is critical for making these simulations seem
real. We are also working on "chatterbot" technology to provide robust,
realistic conversation capabilities for automated agents. This work is
being carried out in collaboration with USC's new Institute for Creative
Technologies, which is bringing together Hollywood scriptwriters, game
designers, artificial intelligence scientists, and state-of-the-art
virtual reality graphics/sound to build compelling simulated worlds.
The internships will run for a three-month period, preferably
during the summer of 2001. The starting date is negotiable.
If you are interested, please contact either Kevin Knight
(knight@isi.edu) or Daniel Marcu (marcu@isi.edu). Please include a
resume and let us know which project(s) interest you. We plan to
make decisions by February 28, 2001.
For more information, visit the webpage at
http://www.isi.edu/natural-language/projects/rewrite/jobs2001.html.