Syntactic Analysis and Error Correction in the SCARRIE Project
- Patrizia Paggio
This paper reports on work carried out at CST in Copenhagen to develop
the Danish version of the SCARRIE prototype, addressing in particular the
issue of how a form of shallow parsing is combined with error detection and
correction. The syntactic grammar for Danish has been developed with the
aim of dealing with the most frequent context-dependent errors found in
a parallel corpus of unedited and proofread texts specifically collected for
the project. By focussing on certain grammatical constructions and error
types, the linguistic `intelligence' provided by syntactic parsing could be
exploited without impairing robustness and efficiency. The resulting system
is a pre-industrial prototype which compared to existing spelling...
The Symbolic Coding Of Segmental Duration And Tonal Alignment: An Extension To The Intsint System.
- Daniel Hirst
This paper presents work based on an analysis-by-synthesis
approach which aims to develop a reversible coding system for
prosody, capable of deriving a `linguistic-like' surface
phonological representation directly from acoustic data that is
sufficient to reproduce a synthetic version of the original
utterance without significant loss of linguistic information.
With such a coding system, capable of representing any
significant prosodic distinctions, the task of predicting such
representations would be greatly simplified, becoming one of
mapping between sets of symbolic representations. This
approach has already been applied to the stylisation and
symbolic coding of fundamental frequency curves by means of
the INTSINT transcription system. An automatic version has
also been proposed. This paper presents...
Using Neural Nets for Portuguese Part-of-Speech Tagging
- Nuno C. Marques, Gabriel Pereira Lopes
We will describe the use of neural networks to
perform part-of-speech (POS) disambiguation
of textual corpora. Available part-of-speech
taggers need huge amounts of hand-tagged
text, but for Portuguese no such corpus is
available. In this paper we propose a
neural network that, apparently, is capable
of overcoming the huge training corpus problem.
Distinct network topologies are applied
to the problem of learning the parameters of
a part-of-speech tagger from a very small Portuguese
training corpus. The experiments carried
out are discussed. The results obtained
point to a correction rate above 96% using
a hand-tagged training corpus with approximately
The application potential of textual corpora increases
when the corpora are annotated....
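The abstract does not detail the network or its input encoding; purely as a hypothetical illustration of the general idea (a multiclass perceptron stands in for the backpropagation network, and all words, tags, and counts are invented), the sketch below disambiguates an ambiguous word from the tag-probability vectors of its context window:

```python
# Hypothetical sketch, not the authors' network: POS disambiguation as
# classification over tag-probability vectors of a 3-word context window.
# A multiclass perceptron stands in for the neural network of the paper.
TAGS = ["DET", "NOUN", "VERB"]

def tag_probs(word, lexicon):
    """Tag-probability vector for a word; uniform if the word is unknown."""
    counts = lexicon.get(word, {t: 1 for t in TAGS})
    total = sum(counts.values())
    return [counts.get(t, 0) / total for t in TAGS]

def encode(window, lexicon):
    """Concatenate the tag-probability vectors of the words in the window."""
    return [p for w in window for p in tag_probs(w, lexicon)]

def train(data, lexicon, epochs=50):
    dim = 3 * len(TAGS)
    W = [[0.0] * dim for _ in TAGS]          # one weight row per output tag
    for _ in range(epochs):
        for window, tag in data:
            x = encode(window, lexicon)
            scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
            pred = max(range(len(TAGS)), key=scores.__getitem__)
            gold = TAGS.index(tag)
            if pred != gold:                  # perceptron update on errors
                for j, xj in enumerate(x):
                    W[gold][j] += xj
                    W[pred][j] -= xj
    return W

def predict(W, window, lexicon):
    x = encode(window, lexicon)
    scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    return TAGS[max(range(len(TAGS)), key=scores.__getitem__)]

# Invented toy data: "saw" is a NOUN after a determiner, a VERB otherwise.
LEXICON = {"the": {"DET": 10}, "she": {"NOUN": 10}, "him": {"NOUN": 10},
           "broke": {"VERB": 10}, "saw": {"NOUN": 5, "VERB": 5}}
DATA = [(["the", "saw", "broke"], "NOUN"), (["she", "saw", "him"], "VERB")]
```

Using lexical tag probabilities rather than one-hot word vectors is one way a very small training corpus can suffice: unseen words still map to a usable input representation.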
Perception of Linguistic Rhythm By Newborn Infants
- Franck Ramus, Bertoncini, Mehler
We thank Xavier Jeannin and Michel Dutat for technical assistance, and the
Délégation Générale pour l'Armement for financial support.
A Calculus of Module Systems
- Davide Ancona, Elena Zucca
We present CMS, a simple and powerful calculus of modules supporting mutual recursion
and higher order features, which can be instantiated over an arbitrary core calculus
satisfying standard assumptions.
The calculus makes it possible to express a large variety of existing mechanisms for combining
software components, including parameterized modules like ML functors, extension with
overriding of object-oriented programming, mixin modules and extra-linguistic mechanisms
like those provided by a linker. Hence CMS can be used as a paradigmatic calculus
for modular languages, in the same spirit as the lambda calculus is used for functional programming.
As usual, we first present an untyped version of the calculus and then a type system; we
Multilingual Generation of Grammatical Categories
- Gabriele Scheler
We present an interlingual semantic representation for the synthesis
of morphological aspect of English and Russian by standard backpropagation.
Grammatical meanings are represented symbolically and
translated into a binary representation. Generalization is assessed by
test sentences and by a translation of the training sentences of the
other language. The results are relevant to machine translation in a
hybrid systems approach and to the study of linguistic category formation.
In this paper, we propose a representation for grammatical meanings which
are relevant to the morphological forms progressive/simple in English and
imperfective/perfective in Russian. We show that these grammatical meanings
are sufficient to predict the correct aspectual form in the generation...
A Corpus-based Analysis of Speech Repairs in Japanese
- Yuu Haruki, Masato Ishizaki, Yasuharu Den
Although this paper is preliminary, it basically showed
that Levelt's monitoring theory holds for speech repairs naturally occurring in Japanese dialogues
except for the timing of trouble detection and the nonretracing strategy in substitution
of function words. These exceptions might indicate a difference of cognitive linguistic units
between Japanese and other languages like Dutch. This issue is the next target of our study.
The Creation, Distribution And Use Of Linguistic Data: The Case Of The Linguistic Data Consortium
- Mark Liberman, Christopher Cieri
The Linguistic Data Consortium (LDC) is an open consortium of
universities, companies and government research laboratories. It
creates and distributes speech and text databases, lexicons and
other resources. The University of Pennsylvania is the LDC's
host institution. The LDC was founded in 1992 with a grant
from the Defense Advanced Research Projects Agency
(DARPA). Currently, all LDC publication and distribution
activities are self-supporting, while new data creation is partly
supported by grant IRI 9528587 from the Information, Robotics
and Intelligent Systems division of the National Science
Foundation (NSF). The LDC's core mission remains the support
of pre-competitive research and development in speech and
language technology, but support of other language-related
research is also...
Representing Aspects of Language
- Robert F. Port, Timothy Van Gelder
We provide a conceptual framework for understanding
similarities and differences among various
schemes of compositional representation, emphasizing
problems that arise in modelling aspects
of human language. We propose six abstract dimensions
that suggest a space of possible compositional
schemes. Temporality turns out to play
a key role in defining several of these dimensions.
From studying how schemes fall into this space,
it is apparent that there is no single crucial difference
between AI and connectionist approaches
to representation. Large regions of the space of
compositional schemes remain unexplored, such as
the entire class of active, dynamic models that do
composition in time. These models offer the possibility
of parsing real-time input into useful segments,
A Decision-Based Approach to Rhetorical Parsing
- Daniel Marcu
We present a shift-reduce rhetorical parsing algorithm
that learns to construct rhetorical structures
of texts from a corpus of discourse-parse action sequences.
The algorithm exploits robust lexical, syntactic,
and semantic knowledge sources.
The application of decision-based learning techniques
over rich sets of linguistic features has
significantly improved the coverage and performance
of syntactic (and to various degrees semantic)
parsers (Simmons and Yu, 1992; Magerman,
1995; Hermjakob and Mooney, 1997). In this paper,
we apply a similar paradigm to developing a
rhetorical parser that derives the discourse structure
of unrestricted texts.
Crucial to our approach is the reliance on a corpus
of 90 texts which were manually annotated with
discourse trees and the adoption of a...
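The abstract is truncated here; purely as an illustration of the shift-reduce scheme such a parser operates over (not Marcu's actual algorithm, features, or learner), the sketch below replays a given action sequence to build a binary discourse tree over elementary discourse units:

```python
# Hypothetical sketch of the shift-reduce skeleton only: actions move
# elementary discourse units (EDUs) from the input list onto a stack (SHIFT)
# or combine the top two subtrees under a rhetorical relation (REDUCE).
# In the paper the next action is chosen by a learned classifier; here the
# action sequence is simply given.
def rhetorical_parse(edus, actions):
    stack, buffer = [], list(edus)
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        else:                                 # act == ("REDUCE", relation)
            right, left = stack.pop(), stack.pop()
            stack.append((act[1], left, right))
    assert len(stack) == 1 and not buffer, "action sequence must be complete"
    return stack[0]

# Three EDUs: e2 elaborates on e1, and that span contrasts with e3.
tree = rhetorical_parse(
    ["e1", "e2", "e3"],
    ["SHIFT", "SHIFT", ("REDUCE", "elaboration"),
     "SHIFT", ("REDUCE", "contrast")])
# tree == ("contrast", ("elaboration", "e1", "e2"), "e3")
```

Framing discourse parsing this way reduces tree construction to a sequence of local classification decisions, which is what makes decision-based learning over lexical, syntactic, and semantic features applicable.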
Lisp -- Almost a Whole Truth!
- Christian Queinnec
Lisp is well known for its metacircular definitions. They differ by their intent (what they want to
prove), their means (what linguistic features are allowed for the definition) and by their scope (what
linguistic features are described). This paper provides a new metacircular definition for a complete Lisp
system including traditionally neglected features such as cons, read, print and error. The programming
style adopted for this interpreter is inspired both by denotational semantics and its continuation passing
style (to explain continuation handling) and by the object-oriented paradigm as highlighted by type-driven
generic functions. The resulting interpreter lessens the number of primitives it uses to only...
Data Structures For Page Readers
- Henry S. Baird, David J. Ittner
Software-engineering aspects of an experimental printed-page reader are
described, with emphasis on data-structure choices. This reader performs a wide
variety of tasks, including geometric layout analysis, symbol recognition, linguistic
contextual analysis, and user-selectable output encoding (e.g. Unicode). We
have implemented a single common data structure to support all these tasks. It
embraces iconic, geometric, probabilistic, and symbolic data, and can represent
the full physical document hierarchy as well as many partial stages of analysis.
We illustrate the evolution of this data structure, in the course of reading a page,
from purely iconic to purely symbolic form. The data structure can be snapshotted in
machine- and OS-independent peripheral files. Careful...
A Practical, Declarative Theory of Dialog
- Susan W. McRoy, Syed S. Ali
The general goal of our work is to investigate
computational models of dialog that can support
effective interaction between people and
computer systems. We are particularly interested
in the use of dialog for training and education.
To support effective communication, dialog
systems must facilitate users' understanding
by incrementally presenting only the most
relevant information, by evaluating users' understanding,
and by adapting the interaction to
address communication problems as they arise.
Our theory provides a specification and representation
of the linguistic, intentional, and social
information that influences how people understand
and respond in an ongoing dialog and
an architecture for combining this information.
We represent knowledge uniformly in a single,
declarative, logical language where the interpretation
and performance of...
A Computational Model of Minimalist Syntactic Theory
This paper has presented a theory based on principles of
Minimalist syntax, with several virtues. The model can
handle a wide range of human preference and complexity
judgements including Late Closure, Garden Path, and
Minimal Attachment effects, and does so in a way that
can accommodate the lexically-based factors discussed in
Stevenson and Merlo (1997, forthcoming). As Stevenson
and Merlo convincingly point out, these sharp lexical
contrasts are not predicted by a theory based on pure
frequency effects nor does their corpus analysis support
the view that acceptability is tracking frequency. In addition,
the theory can handle cross-linguistic variation.
An Environment For The Labelling And Testing Of Melodic Aspects Of Speech
- Christel Brindopke, Arno Pahde, Franz Kummert, Gerhard Sagerer
In this paper, we present an environment for labelling
and testing of melodic aspects of spoken language.
The environment has three modes of application:
First, the environment provides labelling facilities
for a model-based melodic description for German.
Second, it supports a language independent
pre-theoretical description of speech melody allowing
the development of new melodic categories. Third,
our test bed can be used to generate speech samples
with controlled melodic parameters for further use in
perception experiments. The melodic description facilities
(model-based, pre-theoretical) are supported
by visual and audible feedback allowing a step-by-step
refinement of the melodic description in question.
Melodic aspects of speech are related to several
linguistic and extralinguistic phenomena. Linguistic
aspects are for...
Resolution of Governor Selection Ambiguity For Korean Noun Phrase Using Automatically Constructed Lexical Information
- Hoojung Chung, Young-sook Hwang, Yong-jae Kwak
A natural language parsing system using dependency grammar analyzes a sentence by identifying the governor of each linguistic constituent in the sentence. One of the difficult problems in parsing Korean is selecting the correct governor for a noun phrase. To solve this problem, we propose an automatic method that generalizes cooccurrence data into conceptual-level lexical information using a thesaurus and a raw corpus. We also present a method to resolve governor selection ambiguity for Korean noun phrases using this lexical information. Experimental results show that the parser using conceptual-level lexical information as well as cooccurrence information resolves governor selection...
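The abstract leaves the back-off scheme implicit; as a hypothetical sketch of the general idea (invented names and counts, English stand-ins for the Korean lexicon, not the authors' actual model), a governor can be chosen by word-level cooccurrence counts, falling back to counts over the noun's thesaurus class when the word pair is unseen:

```python
# Hypothetical sketch: score each candidate governor of a noun phrase by
# word-level cooccurrence counts, backing off to conceptual-level counts
# (keyed on the noun's thesaurus class) when the exact pair is unseen.
def select_governor(noun, candidates, word_cooc, concept_cooc, thesaurus):
    def score(gov):
        if (noun, gov) in word_cooc:          # word-level evidence first
            return word_cooc[(noun, gov)]
        cls = thesaurus.get(noun)             # conceptual-level back-off
        return concept_cooc.get((cls, gov), 0)
    return max(candidates, key=score)

# Invented toy data:
WORD_COOC = {("apple", "eat"): 7}
CONCEPT_COOC = {("READABLE", "read"): 4}
THESAURUS = {"book": "READABLE", "apple": "EDIBLE"}

gov1 = select_governor("apple", ["eat", "read"], WORD_COOC, CONCEPT_COOC, THESAURUS)
# "eat"  -- the seen word pair wins
gov2 = select_governor("book", ["eat", "read"], WORD_COOC, CONCEPT_COOC, THESAURUS)
# "read" -- the unseen pair is resolved via the thesaurus class
```

Generalizing unseen noun-governor pairs to thesaurus classes is what lets cooccurrence data harvested from a raw corpus cover noun phrases that never appear with a given governor.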
Dutch Stress Acquisition: OT and Connectionist Approaches
- Marc Joanisse
phonological units cannot be directly observed and need to be inferred by a
learner; likewise, the language learner does not have information about which
words are regular or irregular at their disposal, since they are never exposed to overt
evidence to this effect. Both these facts greatly complicate the task of inferring a
linguistic rule; it is proposed that the CP model might lend a better understanding
of how children are able to learn phonological patterns. Finally, as we explain
below, Dutch children seem to acquire stress in a way that suggests they are
acquiring these patterns in a specific stage-like way. Accounting for these facts
Simulation as an Environment for the Knowledge Acquisition of Procedural Expertise
- Adrian Pearce, Claude Sammut, Simon Goss
Knowledge engineering is the discipline of encoding the knowledge of an expert into an operational form
such as an expert system. Some forms of expertise, "show me" rather than "tell me", are not readily available to
linguistic access by the expert or to symbolic codification by a knowledge engineer. Behavioural cloning, in which
examples of expert behaviour are generalised into a performance model by machine learning techniques, is one strategy.
This work has been focused on building a controller for dynamic systems. A variant which is more formal in the
knowledge structures and task analyses decomposes tasks into hierarchies of procedures and learns generalisations...
Identifying Discourse Markers in Spoken Dialog
- Peter A. Heeman, Donna Byron, James F. Allen
In this paper, we present a method for identifying discourse
marker usage in spontaneous speech based on machine
learning. Discourse markers are denoted by special
POS tags, and thus the process of POS tagging can be used
to identify discourse markers. By incorporating POS tagging
into language modeling, discourse markers can be identified
during speech recognition, where the timeliness of the information
can be used to help predict the following words.
We contrast this approach with an alternative machine learning
approach proposed by Litman (1996). This paper also
argues that discourse markers can be used to help the hearer
predict the role that the upcoming utterance plays in the dialog.
A Computational Architecture for Conversation
- Eric Horvitz
We describe representation, inference strategies, and control procedures employed
in an automated conversation system named the Bayesian Receptionist. The
prototype is focused on the domain of dialog about goals typically handled by receptionists
at the front desks of buildings on the Microsoft corporate campus. The system
employs a set of Bayesian user models to interpret the goals of speakers given evidence
gleaned from a natural language parse of their utterances. Beyond linguistic features, the
domain models take into consideration contextual evidence, including visual findings. We
discuss key principles of conversational actions under uncertainty and the overall architecture
of the system, highlighting the use of a hierarchy of...