TextNet -- A Text-based Intelligent System
- Sanda Harabagiu; Dan I. Moldovan
This paper describes a text-based knowledge representation system capable of abductive inference. The system, called TextNet, uses the semantic information rendered by WordNet to process texts and to connect texts on the fly. Our system contains (1) a linguistic backbone enhancing the semantics of WordNet 1.5, (2) corpora comprising indexed Treebank texts and HTML documents provided by web crawlers, and (3) an inference module capable of abductions over the linguistic knowledge base. The inference system produces the representation of compound concepts by recognizing their typical functions. Basic Idea Knowledge representation systems are of central importance to the field of AI since...
EUSLEM: A lemmatiser/tagger for Basque
- Itziar Aduriz; Izaskun Aldezabal; Iñaki Alegria; Xabier Artola; Nerea Ezeiza; Ruben Urizar
This paper presents relevant issues that have been considered in the design and development of a general-purpose lemmatiser/tagger for Basque (EUSLEM). The lemmatiser/tagger is conceived as a basic tool for other linguistic applications. It uses the lexical database and the morphological analyser previously developed and implemented. We describe the components used in the development of the lemmatiser/tagger and, finally, point out possible further applications of this tool. 1. Introduction An automatic lemmatiser/tagger is a basic tool for applications such as automatic indexing, document databases, syntactic and semantic analysis, analysis of text corpora, etc. Its job is...
AMALIA -- A Unified Platform for Parsing and Generation
- Shuly Wintner; Evgeniy Gabrilovich; Nissim Francez
Contemporary linguistic theories (in particular, HPSG) are declarative in nature: they specify constraints on permissible structures, not how such structures are to be computed. Grammars designed under such theories are, therefore, suitable for both parsing and generation. However, practical implementations of such theories don't usually support bidirectional processing of grammars. We present a grammar development system that includes a compiler of grammars (for parsing and generation) to abstract machine instructions, and an interpreter for the abstract machine language. The generation compiler inverts input grammars (designed for parsing) to a form more suitable for generation. The compiled grammars are then executed...
Chinese Sentence Generation in a Knowledge-Based Machine Translation System
- Tangqiu Li; Eric H. Nyberg; Jaime G. Carbonell
This paper presents a technique for generating Chinese sentences from the Interlingua expressions used in the KANT knowledge-based machine translation system. Chinese sentences are generated directly from the semantic representation using a unification-based generation formalism which takes advantage of certain linguistic features of Chinese. Direct generation from the semantic form eliminates the need for an intermediate syntactic structure, thus simplifying the generation procedure. The generation algorithm is top-down, data-driven and recursive. The descriptive nature of the pseudo-unification grammar formalism used in KANT allows the grammar developer to write very straightforward semantic grammar rules. We also discuss some of the crucial...
Test Suites for Natural Language Processing
- Lorna Balkan; Doug Arnold; Siety Meijer
This paper introduces the topic of evaluation of Natural Language Processing systems, and discusses the role of test suites in the linguistic evaluation of a system. The work on test suites being carried out within the framework of the TSNLP project is described in detail, and the relevance of the project to the evaluation of machine translation systems is considered. INTRODUCTION Evaluation is a topic that is currently attracting a great deal of interest in the Natural Language Processing community.
A DRT-based approach for formula parsing in textbook proofs
- Claus Zinn
Knowledge is essential for understanding discourse. Generally, this has to be common-sense knowledge, and therefore discourse understanding is hard. For the understanding of textbook proofs, however, only a limited quantity of knowledge is necessary. In addition, we gain something very essential: inference. A prerequisite for parsing textbook proofs is being able to parse the formulae that occur in these proofs. Parsing formulae alone, in the empty context, is trivial. But within the context of textbook proofs the task soon gets complex. Several kinds of references from the text to parts or sets of terms and formulae have to...
Intelligent Text Analysis For Dynamically Maintaining And Updating Domain Knowledge Bases
- Klemens Schnattinger; Udo Hahn
We propose a knowledge-intensive text analysis approach which deals with the continuous assimilation of new concepts into domain knowledge bases. Text understanding and knowledge acquisition proceed in tandem on the basis of terminological reasoning. Concept learning is considered an evidence-based choice problem the solution of which balances the "quality" of various clues from the linguistic structure of the texts and conceptual structures in the knowledge bases.
Understanding Mathematical Discourse
- Claus Zinn
Discourse understanding is hard. This seems to be especially true for mathematical discourse, that is, proofs. Restricting discourse to mathematical discourse, however, allows us to study the subject matter in its purest form. This domain of discourse is rich and well-defined, highly structured, offers a well-defined set of discourse relations, and forces/allows us to apply mathematical reasoning. We give a brief discussion of selected linguistic phenomena of mathematical discourse, and an analysis from the mathematician's point of view. Requirements for a theory of discourse representation are given, followed by a discussion of proof plans that provide necessary context and structure....
Layout and Language: lists and tables in technical documents
- Shona Douglas; Matthew Hurst
In this paper, we describe some of the interactions between layout and language we have been dealing with in recent applied NLP projects. We present two complementary views of lists and tables, intended to bridge the gap between considering them as a type of running text (which linguistics knows how to deal with) and as a multi-dimensional relation represented in two dimensions, which may have many reading-paths (which linguistics doesn't know how to deal with). Stated or inferred linguistic and world knowledge in the text surrounding tables and lists provides a context for the interpretation of a set of tuples...
Integrating Tense, Aspect and Genericity
- Diana Santos (Grupo Linguagem Natural, INESC, Lisboa)
This paper proposes an integrated theory of tense, aspect and genericity, building on the work of Moens (1987) for tense and aspect in English. I start by motivating the treatment of tense and genericity at the same level, by showing that (1) it is the same linguistic system that is at work in both cases; (2) there are some facts of language that only receive an adequate explanation if the two phenomena are dealt with together, namely, the "stative paradox" and the somewhat strange behavior of activities. To handle these two facts, I propose the existence of two kinds of states....
How to Solve the Conflict of Structure-Preserving Translation and Fluent Text Production
- Karin Harbusch
Compared with a `conventional' natural-language generation system, in Machine Translation (MT) the decisions in a what-to-say component, i.e. the selection of the content of an utterance and of an adequate speech act, are made by the speaker. Although the speaker realizes the how-to-say task in the source language, i.e. does the linguistic shaping, a how-to-say component in the target language is still required in an MT system. In particular, decisions in this component should be guided by the syntactic realization in the source language in order to preserve the structure. Here, we describe a flexible how-to-say component for MT. On the one hand,...
Massively-Parallel Knowledge Processing For Complex Pattern Understanding
- V. Fischer
This paper concentrates on the automatic understanding of spontaneous speech. A new dialog system will be presented. The any-time behaviour of the system will be strongly supported by the exploitation of parallelism on different levels and by the use of linguistic and prosodic constraints. The paper is subdivided into four parts: the introduction is followed by a review of past research that provides the background for our work. The structure of the dialog system is then presented, followed by a description of the parallel control algorithm. The paper is concluded by some final remarks. 2. BACKGROUND RESEARCH
Different Issues In The Design Of A Lemmatizer/Tagger For Basque
- I. Aduriz; I. Alegria; J.M. Arriola; X. Artola; A. Díaz de Ilarraza; N. Ezeiza; K. Gojenola; M. Maritxalar
This paper presents relevant issues that have been considered in the design of a general purpose lemmatizer/tagger for Basque (EUSLEM). The lemmatizer/tagger is conceived as a basic tool necessary for other linguistic applications. It uses the lexical data base and the morphological analyzer previously developed and implemented. Due to the characteristics of the language, the tagset here proposed is structured in four levels so that each level is a refinement of the previous one in the sense that it adds more detailed information. We will focus on the problems found in designing this tagset and on the strategies for morphological...
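The abstract's layered tagset, in which each level refines the previous one by adding detail, can be sketched as a prefix projection. This is a minimal illustration only: the tag names and the hyphen-separated encoding below are assumptions, not the actual EUSLEM tagset.

```python
# A minimal sketch of a four-level tagset where each level refines the
# previous one. Tags are encoded as hyphen-separated fields (an
# illustrative assumption, not the EUSLEM encoding).
def truncate_tag(tag, level):
    """Project a fine-grained tag onto a coarser level by keeping its first `level` fields."""
    parts = tag.split("-")
    return "-".join(parts[:level])

fine = "N-common-sg-abs"  # hypothetical level-4 tag for a noun
print([truncate_tag(fine, lv) for lv in (1, 2, 3, 4)])
# ['N', 'N-common', 'N-common-sg', 'N-common-sg-abs']
```

A tagger can then be trained or evaluated at whichever level of granularity an application needs, since every fine tag determines its coarser counterparts.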
Knowledge Mining From Textual Sources
- Udo Hahn; Klemens Schnattinger (Freiburg im Breisgau)
We propose a new, knowledge-based methodology for deep knowledge discovery from natural language documents. Data mining, data interpretation, and data cleaning are all incorporated in a terminological reasoning process. We exploit qualitative knowledge about linguistic phenomena in natural language texts and structural configurations in text and domain knowledge bases to generate concept hypotheses, rank them according to plausibility, and select the most credible ones for knowledge assimilation. Appeared in: CIKM97 - Proceedings of the 6th Intl. Conference on Information and Knowledge Management. Las Vegas, Nevada, USA, November 10-14, 1997. Ed. by F.Golshani & K.Makki. New York/NY: ACM, 1997, pp.83-90. Knowledge...
Testing Gricean Constraints on a WordNet-based Coherence Evaluation System
- Sanda Harabagiu; Dan I. Moldovan; Takashi Yukawa
This paper presents a computational method for analyzing Gricean constraints for the purpose of evaluating text coherence. Our system consists of a knowledge base constructed on top of WordNet, an inference engine that establishes discourse semantic paths between the input concepts, and a mechanism for relating them to the context. Grice's maxims provide conditions to test coherence, while the semantic paths provide the space on which these conditions are tested. The computational method is based on a marker-propagation technique that is independent of the size of the knowledge base. The paper describes the method and provides results obtained with the...
Improving Part of Speech Disambiguation Rules by Adding Linguistic Knowledge
- Nikolaj Lindberg; Martin Eineborg
This paper reports ongoing work on producing a state-of-the-art part-of-speech tagger for unedited Swedish text. Rules eliminating faulty tags have been induced using Progol. In previously reported experiments, almost no linguistically motivated background knowledge was used [5, 8]. Still, the result was rather promising (recall 97.7%, with a remaining average ambiguity of 1.13 tags/word). Compared to the previous study, a much richer, more linguistically motivated body of background knowledge has been supplied, consisting of examples of noun phrases, verb chains, auxiliary verbs, and sets of part-of-speech categories. The aim has been to create the...
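The elimination-rule idea above, where induced rules strike faulty tags from a word's candidate set, can be sketched as follows. The tag names and the example rule are illustrative assumptions; they are not the Progol-induced rules themselves.

```python
# A minimal sketch of contextual tag elimination: each rule removes a
# candidate tag when a condition on the previous word's tags holds,
# but a word's last remaining tag is never removed.
def eliminate(tags_per_word, rules):
    """Apply elimination rules to per-word candidate tag sets."""
    result = [set(t) for t in tags_per_word]
    for i, cands in enumerate(result):
        prev = result[i - 1] if i > 0 else set()
        for bad_tag, condition in rules:
            if bad_tag in cands and len(cands) > 1 and condition(prev):
                cands.discard(bad_tag)
    return result

# Illustrative rule: discard a verb reading directly after a determiner.
rules = [("VB", lambda prev: prev == {"DT"})]

ambiguous = [{"DT"}, {"NN", "VB"}]
print(eliminate(ambiguous, rules))  # [{'DT'}, {'NN'}]
```

Each applied rule lowers the average ambiguity (tags per word) without committing to a single tag unless only one candidate survives.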
KAIST Tree Bank Project for Korean: Present and Future Development
- Key-sun Choi; Young S. Han; Young G. Han; Oh W. Kwon
In this paper, we introduce the ongoing project for building a large annotated corpus of Korean written texts undertaken by KAIST since 1992. At present, the corpus consists of over 5 million word units of Korean covering 13 subject fields. The POS tagset used to annotate the corpus contains 74 tags. The tagset is designed to provide sufficient recoverability even for high-level linguistic processing. Current efforts focus mostly on bracketing the corpus. The corpus is expected to be distributed through the Korean Linguistic Data Consortium before the end of 1994. Summary of The Resource Name KAIST Tree...
Automatic Detection Of Semantic Boundaries
- M. Cettolo; A. Corazza
In spoken language systems, the segmentation of utterances into coherent linguistic/semantic units is very useful, as it eases processing after the speech recognition phase. In this paper, a methodology for semantic boundary prediction is presented and tested on a corpus of person-to-person dialogues. The approach is based on binary decision trees and uses textual context, including broad classes of silent pauses, filled pauses and human noises. The best results give more than 90% precision, almost 80% recall and about 3% false alarms. 1. INTRODUCTION This work focuses on the automatic segmentation of dialogue turns into homogeneous Semantic Units (SUs)....
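The precision, recall and false-alarm figures quoted above can be computed as follows for a boundary-prediction task. The boundary positions in the example are illustrative assumptions, not data from the paper's corpus.

```python
# A minimal sketch of boundary-prediction scoring: boundaries are
# positions between words; every inter-word position is a candidate.
def boundary_scores(reference, hypothesis, n_candidates):
    """Return (precision, recall, false-alarm rate) for predicted boundaries."""
    ref, hyp = set(reference), set(hypothesis)
    tp = len(ref & hyp)   # correctly predicted boundaries
    fp = len(hyp - ref)   # spurious boundaries
    fn = len(ref - hyp)   # missed boundaries
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # false alarms relative to all candidate positions that are not boundaries
    false_alarm = fp / (n_candidates - len(ref))
    return precision, recall, false_alarm

p, r, fa = boundary_scores({3, 7, 12}, {3, 7, 15}, n_candidates=20)
print(p, r, fa)  # precision 2/3, recall 2/3, false alarms 1/17
```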
A Parallel System for Text Inference Using Marker Propagations
- Sanda M. Harabagiu; Dan I. Moldovan
This paper presents a possible solution for the text inference problem---extracting information unstated in a text, but implied. Text inference is central to natural language applications such as information extraction and dissemination, text understanding, summarization, and translation. Our solution takes advantage of a semantic English dictionary available in electronic form that provides the basis for the development of a large linguistic knowledge base. The inference algorithm consists of a set of highly parallel search methods that, when applied to the knowledge base, find contexts in which sentences are interpreted. These contexts reveal information relevant to the text. Implementation, results, and...
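The parallel search idea above can be sketched as marker propagation over a semantic network: markers spread breadth-first from two concepts, and nodes reached by both lie on a connecting semantic path. The toy graph below is an illustrative assumption, not WordNet data or the authors' knowledge base.

```python
# A minimal sketch of marker propagation: breadth-first marker
# spreading from each concept, then intersection of the marked nodes.
from collections import deque

def propagate(graph, source, max_hops):
    """Spread a marker from `source`; return node -> hop distance."""
    reached = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if reached[node] == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in reached:
                reached[nxt] = reached[node] + 1
                frontier.append(nxt)
    return reached

def collisions(graph, a, b, max_hops=3):
    """Nodes where markers from both concepts meet, nearest paths first."""
    ra, rb = propagate(graph, a, max_hops), propagate(graph, b, max_hops)
    return sorted(set(ra) & set(rb), key=lambda n: ra[n] + rb[n])

# Toy hypernym-like relations (illustrative only).
graph = {
    "dog": ["canine"], "canine": ["animal"],
    "cat": ["feline"], "feline": ["animal"],
}
print(collisions(graph, "dog", "cat"))  # ['animal']
```

Because every marker spreads independently, the propagations can run in parallel, which is the property the paper's search methods exploit.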
A Fuzzy Beam-Search Rule Induction Algorithm
- Cristina Fertig; Alex A. Freitas; Lucia V. R. Arruda; Celso Kaestner
This paper proposes a fuzzy beam-search rule induction algorithm for the classification task. The use of fuzzy logic and fuzzy sets not only provides us with a powerful, flexible approach to cope with uncertainty, but also allows us to express the discovered rules in a representation that is more intuitive and comprehensible to the user, by using linguistic terms (such as low, medium, high) rather than continuous numeric values in rule conditions. The proposed algorithm is evaluated on two public-domain data sets. 1 Introduction This paper addresses the classification task. In this task the goal is to discover a...
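The linguistic-term idea above can be sketched with triangular fuzzy sets: a numeric attribute value gets a membership degree in "low", "medium" and "high", and a rule fires to the degree of its weakest condition. The set boundaries and the example rule are illustrative assumptions, not the paper's algorithm.

```python
# A minimal sketch of linguistic terms as triangular fuzzy sets over an
# attribute scaled to [0, 100]; rule strength is the min of condition degrees.
def triangular(x, a, b, c):
    """Membership degree of x in the triangular fuzzy set (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

terms = {
    "low": (0.0, 0.0, 50.0),
    "medium": (0.0, 50.0, 100.0),
    "high": (50.0, 100.0, 100.0),
}

def degree(term, x):
    a, b, c = terms[term]
    if term == "low" and x <= a:      # left shoulder of "low"
        return 1.0
    if term == "high" and x >= c:     # right shoulder of "high"
        return 1.0
    return triangular(x, a, b, c)

# Rule "IF attr1 IS low AND attr2 IS medium THEN class C" fires with strength:
firing = min(degree("low", 30), degree("medium", 60))
print(round(firing, 2))  # 0.4
```

Expressing conditions this way is what lets discovered rules read as "attr1 is low" instead of "attr1 < 37.2", which is the comprehensibility gain the abstract describes.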