Improved Language Modeling By Unsupervised Acquisition Of Structure
- Klaus Ries, Finn Dag Buø, Ye-Yi Wang, Alex Waibel
A new method for the unsupervised acquisition of structural text
models typically reduces corpus perplexity by more than 30%
compared to advanced n-gram models. The method is based on new
algorithms for classifying words and phrases by context and on new
sequence-finding procedures. These procedures are designed to be
fast and accurate on both small and large corpora, and are iterated
to build a structural model of the corpus.
The structural model can be applied to rescore the hypotheses of a
speech recognizer and improves the word accuracy.
Further applications, such as preprocessing for neural
networks and (hidden) Markov models in language processing,
which exploit the...
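The perplexity figures reported above can be made concrete with a minimal sketch. The toy corpus, the bigram order, and the add-alpha smoothing below are illustrative assumptions, not the structural models of the paper:

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model (illustrative only)."""
    vocab = set(train_tokens) | set(test_tokens)
    V = len(vocab)
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        # P(word | prev) with add-alpha smoothing over the vocabulary.
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * V)
        log_prob += math.log2(p)
    n = len(test_tokens) - 1
    return 2 ** (-log_prob / n)

train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()
print(bigram_perplexity(train, test))
```

A lower perplexity means the model assigns higher probability to the test text; a structural model that "reduces perplexity by 30%" would produce a correspondingly smaller number on the same test set.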
Compiling Nested Data-Parallel Programs for Shared-Memory Multiprocessors
- Siddhartha Chatterjee
While data parallelism is well-suited from algorithmic, architectural, and linguistic considerations
to serve as a basis for portable parallel programming, its characteristic fine-grained parallelism makes
the efficient implementation of data-parallel languages on MIMD machines a challenging task. The
design, implementation, and evaluation of an optimizing compiler are presented for an applicative nested
data-parallel language called VCODE targeted at the Encore Multimax, a shared-memory multiprocessor.
The source language supports nested aggregate data types; aggregate operations including elementwise
forms, scans, reductions, and permutations; and conditionals and recursion for control flow. A small set
of graph-theoretic compile-time optimizations reduces the overheads on MIMD machines in several ways:
by increasing the grain size...
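Compilers for nested data-parallel languages of this kind commonly flatten nested aggregates into a flat value vector plus a segment descriptor, and implement aggregate operations as segmented primitives. The sketch below shows a segmented exclusive plus-scan in Python under that assumption; it is not the VCODE implementation:

```python
def segmented_scan(values, segment_lengths):
    """Exclusive plus-scan that restarts at each segment boundary.
    Illustrative model of the segmented operations used to flatten
    nested data parallelism; not the VCODE implementation itself."""
    out, i = [], 0
    for length in segment_lengths:
        acc = 0
        for v in values[i:i + length]:
            out.append(acc)   # running sum *before* this element
            acc += v
        i += length
    return out

# Nested aggregate [[1, 2, 3], [4, 5]] flattened to values + segment descriptor.
print(segmented_scan([1, 2, 3, 4, 5], [3, 2]))  # → [0, 1, 3, 0, 4]
```

In a real implementation the per-segment loops are the fine-grained parallel work; increasing the grain size means fusing several such primitives before distributing them across processors.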
LAIR (Linguistic Analysis, Interpretation and Reasoning): Reference Manual
- Allan Ramsay
This document can also be viewed on the Web at
Generalization And Discrimination In Tree-Structured Unit Selection
- Michael W. Macon, Andrew E. Cronk, Johan Wouters
Concatenative "selection-based" synthesis from large databases has emerged as a viable framework for TTS waveform generation. Unit selection algorithms attempt to predict the appropriateness of a particular database speech segment using only linguistic features output by text analysis and prosody prediction components of a synthesizer. All of these algorithms have in common a training or "learning" phase in which parameters are trained to select appropriate waveform segments for a given feature vector input. One approach to this step is to partition available data into clusters that can be indexed by linguistic features available at runtime. This method relies critically on...
Language and Pronunciation Modeling in the CMU 1996 Hub 4 Evaluation
- Kristie Seymore, Stanley Chen, Maxine Eskenazi, Ronald Rosenfeld
We describe several language and pronunciation modeling techniques
that were applied to the 1996 Hub 4 Broadcast News transcription
task. These include topic adaptation, the use of remote
corpora, vocabulary size optimization, n-gram cutoff optimization,
modeling of spontaneous speech, handling of unknown linguistic
boundaries, higher order n-grams, weight optimization in rescoring,
and lexical modeling of phrases and acronyms.
The language modeling component of the CMU 1996 Hub 4 system
was developed through a series of experiments in topic adaptation, the
use of remote corpora, vocabulary size optimization, n-gram cutoff
optimization, modeling of spontaneous speech, handling of unknown
linguistic boundaries, higher order n-grams, weight optimization in
rescoring, and lexical modeling of phrases and acronyms. These
A Language/Action perspective on Cooperative Information Agents
- E. Verharen, F. Dignum, H. Weigand
Research in Information Systems has switched its focus from data to
communication. The communication between different autonomous ICSs (Information
and Communication Systems) requires a certain amount of intelligence in each system.
The system should be able to know which queries it can/may handle and also be able to
negotiate about the information that it will give. In short, these systems evolve into what
is called Cooperative Information Agents. We show that basing the information
contents of these agents on linguistic concepts and furthermore modelling the
communications between the agents using speech act theory provides for a natural and
sound setting for these CIA's as well as for Business...
IRIDIA, Université Libre de Bruxelles, 50 av. F. Roosevelt, CP 194/6, B-1050 Bruxelles, Belgium
- Alessandro Saffiotti
This contribution aims at matching Dempster-Shafer (DS) theory to the needs of
knowledge representation (KR). We first survey the most common approaches to
representing knowledge in DS theory, putting a stronger emphasis on recent work on
graph-based approaches. We then pinpoint some limitations of these approaches.
To overcome these limitations, we suggest marrying DS theory with KR by
proposing a formal framework where DS theory is used for representing strength of
belief about our knowledge, and the linguistic structures of an arbitrary KR language
are used for representing the knowledge itself. We exemplify this framework by
integrating (an extension of) the KRYPTON KR system with DS theory,...
Compiling HPSG constraint grammars into logic programs
- Thilo Götz
This paper defines a translation from HPSG constraint grammars
into constraint logic programs that preserves the parsing problem.
We will show that there can be no such complete translation, yet we will
argue that for theoretical as well as practical reasons, it is interesting to
see how closely one can approximate HPSG grammars with logic programs.
We will thus examine the properties of the translation in detail
and come up with a restriction on HPSG grammars that ensures the
completeness of the translation. An optimised version of the translation
procedure has been implemented in Prolog.
Head-driven Phrase Structure Grammar (HPSG) has become a de facto
A Hypothetical Reasoning Algorithm for Linguistic Analysis
- Esther König
The Lambek calculus, an intuitionistic fragment of Linear Logic, has recently been rediscovered
by linguists. Due to its built-in hypothetical reasoning mechanism, it allows for
describing a certain range of those phenomena in natural language syntax which involve
incomplete subphrases or moved constituents.
Previously, it seemed unclear how to extend traditional parsing techniques in order to
incorporate reasoning about incomplete phrases, without causing the undesired effect of
derivational equivalences. It turned out that the Lambek calculus offers a framework to
formulate equivalent but more implementation-oriented calculi where this problem does not
occur. In this paper, such a theorem prover for the Lambek calculus, i.e. a parser for Lambek
Terminological Importation for Adapting Reusable Knowledge . . .
- José L. Sierra, Martín Molina
This paper describes the adaptation approach of reusable knowledge representation components used in the KSM environment for
the formulation and operationalisation of structured knowledge models. Reusable knowledge representation components in KSM are
called primitives of representation. A primitive of representation provides: (1) a knowledge representation formalism (2) a set of
tasks that use this knowledge together with several problem-solving methods to carry out these tasks (3) a knowledge acquisition
module that provides different services to acquire and validate this knowledge (4) an abstract terminology about the linguistic
categories included in the representation language associated with the primitive. Primitives of representation are usually domain
independent. A primitive...
Lexical Functional Grammars and Lexical Grammars
- Esther König
This paper gives a comparison of the descriptive tools of two types of grammar formalisms, Lexical
Functional Grammar (LFG), and (extended) categorial grammar, by defining a transformation of an LFG
into a "strictly lexicalized grammar" (i.e. a categorial grammar). By this transformation, the grammar and
the parser become more concise, i.e. computationally and conceptually more tractable, without losing any
essential expressive power.
For the implementation of large-scale grammars and for the definition of sophisticated syntax-semantics interfaces,
it is essential that the basic data-structures of the grammar formalism are as simple as possible
to reduce the cognitive effort of grammar engineering, and to keep the parsing and generation...
Catherine Macleod, Adam Meyers, and Ralph Grishman
- Catherine Macleod, Adam Meyers, Ralph Grishman
A large corpus (about 100 MB of text)
was selected and examples of 750 frequently
occurring verbs were tagged with
their complement class as defined by a
large computational syntactic dictionary,
COMLEX Syntax. This tagging task
led to the refinement of already existing
classes and to the addition of classes that
had previously not been defined. This
has resulted in the enrichment and improvement
of the original COMLEX Syntax
dictionary. Tagging also provides
statistical data which will allow users to
select more common complements of a
particular verb and ignore rare usage.
We discuss below some of the problems
encountered in tagging and their resolution.
COMLEX Syntax is a moderately-broad-coverage
English lexicon (with about 38,000 root forms)...
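The statistical data mentioned above can be queried in the obvious way once tagged instances are counted per verb. The complement-class tags and counts below are hypothetical illustrations, not drawn from COMLEX Syntax:

```python
from collections import Counter, defaultdict

# Hypothetical (verb, complement-class) pairs from a tagged corpus;
# the real COMLEX classes and frequencies are not reproduced here.
tagged = [("give", "NP-NP"), ("give", "NP-PP"), ("give", "NP-NP"),
          ("know", "THAT-S"), ("know", "NP")]

by_verb = defaultdict(Counter)
for verb, comp_class in tagged:
    by_verb[verb][comp_class] += 1

# A user selecting the most common complement class of a verb,
# ignoring rare usage, would query:
print(by_verb["give"].most_common(1))  # → [('NP-NP', 2)]
```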
Hybrid Belief System For Doubtful Agents
- Alessandro Saffiotti
This paper aims at bridging together the fields of Uncertain Reasoning and Knowledge
Representation. The bridge we propose consists in the definition of a Hybrid Belief System, a general modular
system capable of performing uncertain reasoning on structured knowledge. This system comprises two distinct
modules, UR-mod and KR-mod: the UR-mod provides the uncertainty calculus used to represent uncertainty
about our knowledge; this knowledge itself is in turn represented by the linguistic structures made available
by the KR-mod. An architecture is drawn for this system grounded on a formal framework, and examples
are given using Dempster-Shafer theory or probabilities as UR-mod, and first order logic or KRYPTON...
The 1996 Broadcast News Speech And Language-Model Corpus
- David Graff
The Linguistic Data Consortium handled the recording, digitization,
and transcription of 130 hours of radio and television news broadcasts.
Of this material, 50 hours' worth was designated and published
as a baseline training set, and 20 hours were prepared and distributed
by NIST for use as development and evaluation test data, for the
1996 ARPA CSR Benchmark Tests. The remaining 60 hours were
held in reserve for use as additional training and evaluation test data
in the 1997 benchmarks. The LDC also acquired, conditioned and
published a five-year archive of commercially produced broadcast
transcripts for use in constructing language models for the broadcast
news domain. These tasks posed a broad...
Parsing As Dynamic Interpretation Of Feature Structures
- Harry Bunt, Ko van der Sloot
In this chapter we develop a new approach to parsing through the recursive,
model-theoretic interpretation of expressions representing linguistic
knowledge. We argue that, in particular, the dynamic variant of model-theoretic
semantics can be useful, which approaches interpretation in a
On the theoretical side, we develop this approach by integrating feature
structures in an existing formal logical language, defined and implemented
for computational semantic purposes. The resulting language, called GEL,
allows us to represent semantic as well as syntactic knowledge. In particular,
we can represent grammatical rules as well as the phrase structures
that they generate. Whereas a phrase-structure rule is traditionally viewed
procedurally, as a recipe for phrase building,...
- Jerry Fodor, Ernie Lepore
A certain metaphysical thesis about meaning that we'll call Informational Role Semantics (IRS)
is accepted practically universally in linguistics, philosophy and the cognitive sciences: the meaning (or
content, or 'sense') of a linguistic expression
is constituted, at least in part, by at least some of its
inferential relations. This idea is hard to state precisely, both because notions like metaphysical
constitution are moot and, more importantly, because different versions of IRS take different views on
whether there are constituents of meaning other than inferential role, and on which of the inferences an
expression occurs in are meaning constitutive. Some of these issues will presently concern us; but...
Tagging of Speech Acts and Dialogue Games in Spanish Call Home
- Lori Levin, Klaus Ries, Alon Lavie
The Clarity project is devoted to automatic detection
and classification of discourse structures in
casual, non-task-oriented conversation using shallow,
corpus-based methods of analysis. For the
Clarity project, we have tagged speech acts and
dialogue games in the Call Home Spanish corpus.
We have done preliminary cross-level experiments
on the relationship of word and speech act n-grams
to dialogue games. Our results show that the label
of a game cannot be predicted from n-grams
of words it contains. We get better than baseline
results for predicting the label of a game from
the sequence of speech acts it contains, but only
when the speech acts are hand tagged, and not
when they are automatically detected....
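Predicting the label of a dialogue game from the speech acts it contains can be sketched as a simple Naive Bayes classifier over act sequences. The tag names, game labels, and data below are invented for illustration and are not the Clarity tag set or method:

```python
import math
from collections import Counter, defaultdict

def train_game_classifier(games):
    """games: list of (label, [speech_act, ...]) pairs.
    Unigram Naive Bayes over speech-act sequences (illustrative only)."""
    label_counts = Counter()
    act_counts = defaultdict(Counter)
    for label, acts in games:
        label_counts[label] += 1
        act_counts[label].update(acts)
    return label_counts, act_counts

def predict(model, acts):
    """Return the game label with the highest smoothed log-probability."""
    label_counts, act_counts = model
    vocab = {a for c in act_counts.values() for a in c}
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)  # prior
        n = sum(act_counts[label].values())
        for a in acts:  # add-one smoothed likelihood per speech act
            lp += math.log((act_counts[label][a] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [("question_game", ["question", "answer", "backchannel"]),
        ("statement_game", ["statement", "agreement", "statement"])]
model = train_game_classifier(data)
print(predict(model, ["question", "answer"]))  # → question_game
```

Running such a classifier on hand-tagged versus automatically detected speech acts is exactly where the gap reported above would appear: noisy input tags degrade the likelihood terms.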
Locality Abstractions for Parallel and Distributed Computing
- Suresh Jagannathan
Computer Science Research, NEC Research Institute, 4 Independence Way,
Princeton, NJ 08540
Abstract. Temporal and spatial locality are significant concerns in the
design and implementation of any realistic parallel or distributed computing
system. Temporal locality is concerned with relations among objects
that share similar lifetimes and birth dates; spatial locality is concerned
with relations among objects that share information. Exploiting
temporal locality can lead to improved memory behavior; exploiting spatial
locality can lead to improved communication behavior. Linguistic,
compiler, and runtime support for locality issues is especially important
for unstructured symbolic computations in which lifetimes and sharing
properties of objects are not readily apparent.
Pomset Logic as an Alternative Categorial Grammar
- Alain Lecomte, Christian Retoré
Lambek calculus may be viewed as a fragment of linear logic, namely intuitionistic
non-commutative multiplicative linear logic. As it is too restrictive to describe
many common linguistic phenomena, instead of extending it we extend MLL with a
non-commutative connective, thus dealing with partially ordered multisets of formulae.
Relying on proof-net techniques, our study associates words with parts of proofs, modules,
and describes parsing as proving by plugging modules together. Apart from avoiding spurious
ambiguities, our method succeeds in obtaining a logical description of relatively free
word order, head-wrapping, clitics, and extraposition (these last two constructions are
unfortunately not included, for lack of space).
A Model for Multimodal Reference Resolution
- Luis A. Pineda, E. Gabriela Garza
In this paper a discussion on multimodal
referent resolution is presented. The
discussion is centered on the analysis of how
the referent of an expression in one modality
can be found whenever the contextual
information required for carrying on such an
inference is expressed in one or more
different modalities. In particular, a model
for identifying the referent of a graphical
expression when the relevant contextual
information is expressed through natural
language is presented. The model is also
applied to the reciprocal problem of
identifying the referent of a linguistic
expression whenever a graphical context is
given. In Section 1 of this paper the notion
of modality in terms of which the theory is
developed is presented....