Term Sense Disambiguation Using a Domain-Specific Thesaurus
- Diana Maynard; Sophia Ananiadou
Term extraction is important for many information systems applications. Although terms should be monoreferential, in reality they exhibit a high degree of ambiguity. Whilst conventional solutions mainly involve statistical approaches, this paper proposes a hybrid technique. The approach is based on the identification of relevant contextual information from sublanguage texts, enabling the incorporation of both syntactic and semantic knowledge in addition to statistical information. Our approach uses a linguistic filter to identify word classes which are expected to convey salient information about the term, and a semantics-based matching technique to find terms occurring in similar environments. Emphasis is placed...
Towards Better Evaluations
- Rolf Schwitter; Diego Mollá; Rachel Fournier
We argue that reading comprehension tests are not particularly suited for the evaluation of NLP systems. Reading comprehension tests are specifically designed to evaluate human reading skills, and these require vast amounts of world knowledge and common-sense reasoning capabilities. Experience has shown that this kind of full-fledged question answering (QA) over texts from a wide range of domains is so difficult for machines as to be far beyond the present state of the art of NLP. To advance the field we propose a much more modest evaluation setup, viz. Answer Extraction (AE) over texts from highly restricted domains. AE aims...
Acquiring Lexical Generalizations from Corpora: A Case Study for Diathesis Alternations
- Maria Lapata
This paper examines the extent to which verb diathesis alternations are empirically attested in corpus data. We automatically acquire alternating verbs from large balanced corpora by using partial-parsing methods and taxonomic information, and discuss how corpus data can be used to quantify linguistic generalizations. We estimate the productivity of an alternation and the typicality of its members using type and token frequencies.
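The type- and token-frequency estimates mentioned in the abstract can be illustrated with a toy sketch. Everything below is hypothetical for illustration: the verbs, the frame labels, and the particular ratios are not Lapata's actual data or formulation.

```python
from collections import Counter

# Toy (verb, frame) observations, as a partial parser might emit them.
observations = [
    ("cut", "V-NP"), ("cut", "V-NP-with"), ("slice", "V-NP"),
    ("slice", "V-NP-with"), ("whittle", "V-NP"),
    ("carve", "V-NP"), ("carve", "V-NP-with"), ("carve", "V-NP"),
]

tokens = Counter(observations)          # token frequencies per (verb, frame)
verbs = {v for v, _ in observations}    # verb types

# A verb "alternates" here if it is attested in both frames.
alternating = {v for v in verbs
               if tokens[(v, "V-NP")] and tokens[(v, "V-NP-with")]}

# Type-based productivity: share of verb types that alternate.
productivity = len(alternating) / len(verbs)

# Token-based typicality of a member: how much of a verb's probability
# mass falls in the marked (with-PP) frame.
def typicality(verb):
    total = sum(n for (v, _), n in tokens.items() if v == verb)
    return tokens[(verb, "V-NP-with")] / total
```

With these toy counts, 3 of the 4 verb types alternate, so the type-based productivity estimate is 0.75.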
SiteQ/J: A Question Answering System for Japanese
- Seungwoo Lee; Gary Geunbae Lee
This paper describes our Question Answering system, which participated in QAC Task 1 of NTCIR-3, and reports the results with some observations. Through analyzing previous TREC QA data, we defined passages and developed a passage selection method suitable for Question Answering. Using Lexico-Semantic Patterns (LSPs), we identify the answer type of a question and detect answer candidates without any deep linguistic analysis of the texts. Answer candidates are ranked by passage scores and by the distances between answer candidates and matched terms. As a result of better engineering, our system showed excellent performance when evaluated by mean reciprocal rank (MRR) in NTCIR-3.
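Mean reciprocal rank, the evaluation measure cited above, is simple to compute; a minimal sketch (the rank list in the example is invented for illustration):

```python
def mean_reciprocal_rank(ranks):
    """ranks: for each question, the 1-based rank of the first correct
    answer in the system's output, or None if no answer was correct."""
    reciprocal = [0.0 if r is None else 1.0 / r for r in ranks]
    return sum(reciprocal) / len(reciprocal)

# Four questions answered at ranks 1, 3, (never), 2:
# MRR = (1 + 1/3 + 0 + 1/2) / 4
mean_reciprocal_rank([1, 3, None, 2])
```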
Answer Extraction Using A Dependency Grammar In ExtrAns
- Diego Mollá; Gerold Schneider; Rolf Schwitter; Michael Hess
We report on the implementation of an answer extraction system, ExtrAns, that uses the output of a dependency-based parser and grammar. In order to increase speed, the parser and grammar used sacrifice functionalism (in the framework of dependency theory) in favour of projectivity. We have found that the resulting dependency structures, although cumbersome to handle, can be used by ExtrAns to find the syntactic and semantic dependencies needed in several of the linguistic processing stages. In particular, we focus on the minimal logical form generation.
Tools for Extracting and Structuring Knowledge from Texts
- Antoine Ogonowski; Eva Dauphin; Marie Luce Herviou; M. Bernard; G. Clémencin; S. Lacep; M.G. Monteil; G. Morize
We demonstrate an approach and an accompanying UNIX toolbox for performing various kinds of knowledge extraction and structuring. The goal is to "practically" enhance productivity while constructing resources for NLP systems on the basis of large corpora of technical texts. Users are lexicon/grammar builders, terminologists and knowledge engineers. We stay open to already explored methods in this or neighbouring activities but put a greater stress on the use of linguistic knowledge. The originality of the work presented here lies in the scope of applications addressed and in the degree of use of linguistic knowledge.
Designing Spelling Correctors for Inflected Languages Using Lexical Transducers
- I. Aldezabal; I. Alegria; O. Ansa; J. M. Arriola; N. Ezeiza; I. Aduriz; A. Da Costa; Uzei Hizkia
This paper describes the components used in the design of the commercial Xuxen II spelling checker/corrector for Basque. It is a new version of the Xuxen spelling corrector (Aduriz et al., 1997), which uses lexical transducers to improve the process. A very important new feature is the use of user dictionaries whose entries can be recognised in both their original and inflected forms. In languages with a high level of inflection such as Basque, spelling checking cannot be resolved without adequate treatment of words from a morphological standpoint. In addition to this, the morphological treatment has other important features: coverage, reusability of...
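Why inflection makes dictionary lookup insufficient can be hinted at with a toy stem-plus-suffix checker. The mini-lexicon below is purely illustrative; a real system such as Xuxen II compiles a full lexicon and its morphotactics into lexical transducers rather than enumerating splits like this.

```python
# Toy Basque-flavoured lexicon (illustrative, far from complete).
stems = {"etxe", "mendi"}                 # noun stems
suffixes = {"", "a", "ak", "tik", "ra"}   # a few case/number endings

def accepts(word):
    # A word is well-formed if it splits into a known stem plus a
    # known suffix; a flat word list would need every combination.
    return any(word.startswith(stem) and word[len(stem):] in suffixes
               for stem in stems)

accepts("etxetik")  # True: etxe + -tik
accepts("etxet")    # False: no licensed suffix "-t"
```

The same decomposition is what lets a single user-dictionary entry cover all of a stem's inflected forms.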
Exact Trade-Off Between Approximation Accuracy and Interpretability: Solving . . .
- Domonkos Tikk; Péter Baranyi
Although various results in the literature claim that fuzzy rule-based systems (FRBSs) possess the universal approximation property, the number of rules needed to reach arbitrary accuracy is unbounded. Therefore, the inherent property of FRBSs in the original sense of Zadeh, namely that they can be characterized by a semantics relying on linguistic terms, is lost. If we restrict the number of rules, universal approximation is no longer valid, as was shown for, among others, Sugeno and TSK type models [10,19]. Due to this theoretical bound there is currently a great demand among researchers for trade-off techniques...
An interlingua aiming at communication on the Web: How language-independent can it be?
- Ronaldo Teixeira Martins; Lucia Helena Machado Rino; Maria das Graças Volpe Nunes; Gisele Montilha; Osvaldo Novais
In this paper, we describe the Universal Networking Language, an interlingua to be plugged into a Web environment aiming at allowing for many-to-many information exchange, 'many' here referring to many natural languages. The interlingua is embedded in a Knowledge-Based MT system whose language-dependent modules comprise an encoder, a decoder, and linguistic resources that have been developed by native speakers of each language involved in the project. Both the interlingua formalism and its foundational issues are discussed.
Dialogue Management in a Home Machine Environment: Linguistic Components over an Agent Architecture
- José F. Quesada; Federico García; Esther Sena; José Ángel Bernal; Gabriel Amores; Grupo de Investigación Julietta
This paper presents the main characteristics of an Agent-based Architecture for the design and implementation of a Spoken Dialogue System. From a theoretical point of view, the system is based on the Information State Update approach; in particular, the system aims at the management of Natural Command Language Dialogue Moves in a Home Machine Environment. Specifically, the paper is focused on the Natural Language Understanding and Dialogue Management Agents, and discusses their integration over a global agent architecture (which includes Action and Knowledge Managers, Speech Input/Output components and HomeSetup controllers).
Learning Distributed Linguistic Classes
- Stephan Raaijmakers
Error-correcting output codes (ECOC) have emerged in machine learning as a successful implementation of the idea of distributed classes. Monadic class symbols are replaced by bit strings, which are learned by an ensemble of binary-valued classifiers (dichotomizers). In this study, the idea of ECOC is applied to memory-based language learning with local (nearest neighbor) classifiers. Regression analysis of the experimental results reveals that, in order for ECOC to be successful for language learning, the use of the Modified Value Difference Metric (MVDM) is an important factor, which is explained in terms of population density of the class hyperspace.
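The core ECOC idea, replacing monadic class symbols with bit strings and decoding a prediction to the nearest codeword, can be sketched as follows. The three classes and the 5-bit code matrix are invented for illustration; real ECOC work chooses codes with large pairwise Hamming distance.

```python
# Hypothetical 3-class problem with 5-bit codewords (toy code matrix).
codebook = {
    "NOUN": (1, 1, 0, 0, 1),
    "VERB": (0, 1, 1, 0, 0),
    "ADJ":  (1, 0, 1, 1, 0),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    # Each bit would come from one binary dichotomizer; pick the class
    # whose codeword is nearest in Hamming distance.
    return min(codebook, key=lambda c: hamming(codebook[c], bits))

# One dichotomizer errs (bit 4 flipped from NOUN's codeword), yet the
# ensemble still decodes to NOUN: the code corrects the error.
decode((1, 1, 0, 1, 1))
```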
Lexical and Morphological Skills in English-Speaking Children with Williams Syndrome
- Harald Clahsen; Melanie Ring; Christine Temple
In a series of previous studies, we have investigated a group of four English-speaking children with Williams Syndrome (WS) with respect to a range of linguistic phenomena and skills, among them the comprehension of passives and anaphoric pronouns (Clahsen & Almazan 1998), past-tense inflection (Clahsen & Almazan 1998), noun plurals and compounding (Clahsen & Almazan 2001), comparative adjective formation (Clahsen & Temple 2003), receptive vocabulary skills and naming (Temple et al. 2002), and reading (Temple 2003). The findings from these studies were interpreted within modular theories of linguistic representation and processing. Within such theories, language performance should reflect normal linguistic...
Shallow Parsing with PoS Taggers and Linguistic Knowledge - A Comparative Study of Three Algorithms and Four Learning Tasks
- Beáta Megyesi
In this study, three data-driven algorithms are applied to shallow parsing of Swedish texts, using state-of-the-art data-driven PoS taggers as the basis for parsing. The phrase structure is represented by nine types of phrases arranged hierarchically, with labels for every constituent type to which a token belongs. Special attention is directed to the algorithms' sensitivity to the different types of linguistic information included in the training data, as well as to the size of the various training data sets. Four types of linguistic features are used; the algorithms are...
Knowledge Extraction for Identification of Chinese Organization Names
- Keh-jiann Chen; Chao-jan Chen
In this paper, a knowledge extraction process is proposed for identifying Chinese organization names. The process utilizes structural properties, statistical properties, and partial linguistic knowledge of organization names to extract new organizations from domain texts. The knowledge extraction processes were tested on a large amount of text retrieved from the WWW. With high threshold values, new organization names can be identified with very high precision. Therefore the knowledge extraction processes can be carried out automatically to self-improve performance in the future.
Towards a Proper Linguistic and Computational Treatment of Scrambling: An Analysis of Japanese
- Sandiway Fong
This paper describes how recent linguistic results in explaining Japanese short and long distance scrambling can be directly incorporated into an existing principles-and-parameters-based parser with only trivial modifications. The fact that this is realizable on a parser originally designed for a fixed-word-order language, together with the fact that Japanese scrambling is complex, attests to the high degree of cross-linguistic generalization present in the theory.
Perception of Syllable Prominence by Listeners with and without Competence in the Tested Language
- Anders Eriksson; Esther Grabe; Hartmut Traunmüller
In an experiment reported previously, subjects rated perceived syllable prominence in a Swedish utterance produced by ten speakers at various levels of vocal effort. The analysis showed that about half of the variance could be accounted for by acoustic factors. Slightly more than half could be accounted for by linguistic factors. Here, we report two additional experiments. In the first, we attempted to eliminate the linguistic factors by repeating the Swedish listening experiment with English listeners who had no knowledge of Swedish. In the second, we investigated the prominence pattern Swedish subjects expect by presenting the utterance only in written...
CarSim: An Automatic 3D Text-to-Scene Conversion System Applied to Road Accident Reports
- Ola Åkerberg; Hans Svensson; Bastian Schulz; Pierre Nugues
CarSim is an automatic text-to-scene conversion system. It analyzes written descriptions of car accidents and synthesizes 3D scenes of them. The conversion process consists of two stages. An information extraction module creates a tabular description of the accident and a visual simulator generates and animates the scene. We implemented a first version of CarSim that considered a corpus of texts in French. We redesigned its linguistic modules and its interface and we applied it to texts in English from the National Transportation Safety Board in the United States.
Cohesion and Collocation: Using Context Vectors in Text Segmentation
- Stefan Kaufmann
Collocational word similarity is considered a source of text cohesion that is hard to measure and quantify. The work presented here explores the use of information from a training corpus in measuring word similarity and evaluates the method in the text segmentation task. An implementation, the VecTile system, produces similarity curves over texts using pre-compiled vector representations of the contextual behavior of words. The performance of this system is shown to improve over that of the purely string-based TextTiling algorithm (Hearst, 1997).
1 Background
The notion of text cohesion rests on the intuition that a text is "held together" by a...
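The kind of similarity curve described above can be sketched with plain bag-of-words vectors and cosine similarity. This is a simplification: the system in the abstract uses pre-compiled context vectors rather than raw word counts, and the function names below are my own.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (Counters).
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def similarity_curve(sentences, width=2):
    """Similarity between adjacent windows of `width` sentences;
    local minima on the curve are candidate segment boundaries."""
    bags = [Counter(s.lower().split()) for s in sentences]
    curve = []
    for i in range(width, len(bags) - width + 1):
        left = sum(bags[i - width:i], Counter())
        right = sum(bags[i:i + width], Counter())
        curve.append(cosine(left, right))
    return curve
```

A sharp vocabulary shift between windows drives the cosine toward zero, which is exactly where a segmentation algorithm would place a boundary.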
Perception of Speaker Age, Sex and Vowel Quality Investigated Using Stimuli Produced with an Articulatory Model
- Hartmut Traunmüller; Anders Eriksson; Lucie Menard
This paper deals with the perception of linguistic and paralinguistic qualities conveyed by synthetic vowels produced with an articulatory model, in which transfer functions of the French vowels /i y e o e ce/ characteristic of five growth stages were each combined with five different F0 values. Listeners had to judge the speaker's age and sex in addition to vowel quality. Four subgroups of listeners were distinguished, according to sex and frequency of contact with children. The results were subjected to regression analysis based on critical band rate (z) and logarithmic values of F0, F1 to F5 and calculated values of F2'....
Audio Partitioning and Transcription for Broadcast Data Indexation
- J.L. Gauvain; L. Lamel; G. Adda
This work addresses automatic transcription of television and radio broadcasts in multiple languages. Transcription of such types of data is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Radio and television broadcasts consist of a continuous data stream made up of segments of different linguistic and acoustic natures, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. Word recognition...