NLP Software List
This aims to be an inventory of software for NLP that is available in some form or another at Trinity. It is a document that anyone can edit, and the idea is that if you buy, install or write some resource that might conceivably be of use to somebody else, then you write an entry for it here. Ideally the software would be installed somewhere people can access it, such as:
- /scss/disciplines/intelligent_systems/clg/clg_web/Software (ugrads cannot access this)
- /users/Public/CSLL/4thYrProjects/Software (ugrads can access this)
- /unsupported (ugrads can access this)
Another possibility is that the software is on a CD, which is used for installing a local copy on a user's machine. A further possibility is that the software cannot be freely distributed, and is available only after consultation with a particular person. Either way, the idea of this page is to have an inventory of such things. Good things to put into an entry are what it does, where it is, and any comments from first-hand experience with it. The file to edit is
/scss/disciplines/intelligent_systems/clg/clg_web/Software/index.php
C++ class for tokenising, tagging, parsing
(added by Martin Emms) I've put together a C++ class for tokenising, tagging and parsing, documented here. The headers and object files are in:
/users/Public/CSLL/4thYrProjects/Software/Thought
If you look at the documentation, there are some examples which should explain how to use the class and how to compile code which uses it. There is also a GUI to some of the things that you can do with the class, which is
/users/Public/CSLL/4thYrProjects/Software/ThoughtInterface/TagParseInterface
This will run on Sun workstations (and so also remotely if you have a PC running Xceed). See the GUI documentation.
A chartparser implementation in Prolog
The parser accessed by the above-mentioned C++ class is
/users/Public/CSLL/4thYrProjects/Software/Parser
and it is described in the Parser documentation. (added by Martin Emms)
British National Corpus
The BNC is 100,000,000 words of part-of-speech tagged text, from a representative range of sources. There is the data itself and also programs to interact with it using an index.
Data
The data itself is installed on the machine known as allen, which you can connect to from other Suns with ssh. Go into the directory
/corpus
Access Software
- On public access Windows PCs a program called SARA is installed which accesses the corpus -- this is written by the BNC distributors.
- There is a web-page to access the corpus. This was written by a final year student.
Corpora from the Linguistic Data Consortium
The following corpora have been purchased from the Linguistic Data Consortium (LDC). For space and license reasons they are not simply installed on publicly accessible machines, but if you contact Saturnino Luz or Martin Emms, we may be able to work something out.
- Penn Treebank
1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. Also included is a fully parsed version of the Brown Corpus, which has been completely retagged using the Penn Treebank (PTB) tag set, together with tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS.
- ACL/DCI corpus
Text from the Wall Street Journal, scientific abstracts, and a variety of grammatically tagged and parsed materials from the Treebank. The Collins English Dictionary is present in two forms.
- UN Parallel Text corpus
Documents provided to the LDC by the United Nations, for use in research on machine translation technology. The languages are English, French and Spanish, with a parallel directory structure for each language.
- Hansard Corpus Parallel Text in English and French
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament.
- Comlex Database
This is a moderately broad coverage English lexicon (with about 38,000 lemmas) developed at New York University under LDC sponsorship. It contains detailed information about the syntactic characteristics of each lexical item and is particularly detailed in its treatment of subcategorization (complement structures).
Data
In the current dictionary, nouns have 9 possible features and 9 possible complements; adjectives have 7 features and 14 complements; verbs have 5 features and 92 complements. The entries for 750 frequent verbs contain 100 tags each, where a tag includes a pointer to an instance of that verb in a corpus and the subcategorization appropriate for that instance.
- Celex Lexical Database
Lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.0). For each language, it contains detailed information on: the orthography; the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress); the morphology (derivational and compositional structure, inflectional paradigms); the syntax (word class, word class-specific subcategorizations, argument structures); and word frequency (summed word and lemma counts, based on recent and representative text corpora).
- European Language News Collection
This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked using SGML. The text is taken from the following sources:
- Tipster Information-Retrieval Text Research Collection
The TIPSTER project is sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. The detection data is comprised of a new test collection built at NIST to be used both for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of 3 CDROMs of SGML encoded documents distributed by LDC plus queries and answers (relevant documents) distributed by NIST.
- European Corpus Initiative Multilingual Corpus (ECI/MCI)
46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. Twelve of the component corpora are multilingual parallel corpora.
- HCRC Map Task Corpus
18 hours of spontaneous speech that was recorded from 128 two-person conversations, involving 64 different speakers (32 female, 32 male, all adults, each taking part in four conversations). The conversations were carried out in an experimental setting, in which each participant has a schematic map in front of them, not visible to the other. Each map consists of an outline and roughly a dozen labelled features (e.g. "a white cottage", "an oak forest", "Green Bay", etc). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route. In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations.
- TRAINS spoken language corpus
a corpus of task-oriented spoken dialogs. These dialogs were collected in a project to develop a conversationally proficient planning assistant, which helps a user construct a plan to achieve some task involving the manufacturing and shipment of goods in a railroad freight system. The collection procedure was designed to make the setting as close to human-computer interaction as possible, but was not a "wizard" scenario, where one person pretends to be a computer. Thus these dialogs provide a snapshot into an ideal human-computer interface that would be able to engage in fluent conversations. Altogether, this corpus includes 98 dialogs, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5,900 speaker turns and 55,000 transcribed words.
- Road Rally Conversational Speech Corpora
The Road Rally corpus was designed for the development and testing of word-spotting systems and was collected in a conversational domain using a road rally planning task as the topic.
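The UN Parallel Text entry above mentions a parallel directory structure for each language. As a rough sketch of how that layout might be used, the Python below pairs up files that sit at the same relative path under each language directory. The directory names passed in (for example "english", "french", "spanish") and the function name pair_parallel_files are assumptions made for illustration; the actual names on the CDs may differ, so check the corpus documentation before relying on this.

```python
from pathlib import Path


def pair_parallel_files(root, languages):
    """Pair files found at the same relative path under each language directory.

    `root` and the entries of `languages` are hypothetical names used for
    illustration; they are not taken from the corpus itself.
    Returns {relative_path: {language: full_path}} for files that exist
    under every one of the given language directories.
    """
    root = Path(root)
    if not languages:
        return {}

    # Map each language to {relative path: full path} for the files below it.
    per_language = {}
    for lang in languages:
        lang_dir = root / lang
        per_language[lang] = {
            p.relative_to(lang_dir): p
            for p in lang_dir.rglob("*")
            if p.is_file()
        }

    # Keep only the relative paths that are present for every language.
    shared = set.intersection(*(set(files) for files in per_language.values()))

    return {
        rel: {lang: per_language[lang][rel] for lang in languages}
        for rel in sorted(shared)
    }
```

Calling pair_parallel_files on the top-level corpus directory with the three language directory names would give, for each shared relative path, the matching file in each language. Files that exist for only one or two of the languages are left out.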
Martin Emms Last modified: Thu Dec 19 09:16:13 GMT 2002