  1. Harvesting Dutch trees: Syntactic properties of spoken Dutch

    Ton Van Der Wouden; Ineke Schuurman; Machteld Schouppe; Heleen Hoekstra
    In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized to answer certain linguistic questions concerning the Dutch language.

  2. Enhancing First-Pass Attachment Prediction

    Fabrizio Costa; Paolo Frasconi; Vincenzo Lombardo; Patrick Sturt; Giovanni Soda
    This paper explores the convergence between cognitive modeling and engineering solutions to the parsing problem in NLP. Natural language presents many sources of ambiguity, and several theories of human parsing claim that ambiguity is resolved by using past (linguistic) experience. In this paper we analyze and refine a connectionist paradigm (Recursive Neural Networks) capable of processing acyclic graphs to perform supervised learning on syntactic trees extracted from a large corpus of parsed sentences. Following a widely accepted hypothesis in psycholinguistics, we assume an incremental parsing process (one word at a time) that keeps a connected partial parse tree at all...

  3. A Corpus-Independent Feature Set for Style-Based Text Categorization

    Moshe Koppel; Navot Akiva; Ido Dagan
    We suggest a corpus-independent feature set appropriate for style-based text categorization problems. To achieve this, we introduce a new measure on linguistic features, called stability, which captures the extent to which a language element, such as a word or syntactic construct, is replaceable by semantically equivalent elements.

  4. Induction of Classifications from Linguistic Data

    Rainer Osswald; Wiebke Petersen
    We present a flexible approach for extracting hierarchical classifications from linguistic data. To this end, the framework of observational logic is introduced, which extends the logic that underlies standard Formal Concept Analysis by allowing disjunctive rules and exclusions. We give a rigorous mathematical characterization of how the chosen rule type affects the structure of the induced hierarchy. The framework is applied to the induction of hierarchical classifications from linguistic databases. The pros and cons of several types of hierarchies are discussed in detail with respect to criteria such as compactness of representation, suitability for inference tasks, and intelligibility for the...

  5. Using Spatial Language in a Human-Robot Dialog

    Marjorie Skubic; Dennis Perzanowski; Alan Schultz; William Adams
    ... to describe their environment, e.g., "There is a desk in front of me and a doorway behind it", and to issue directives, e.g., "Go around the desk and through the doorway." In our research, we have been investigating the use of spatial relationships to establish a natural communication mechanism between people and robots, in particular, for novice users. In this paper, the work on robot spatial relationships is combined with a multimodal robot interface developed at the Naval Research Lab. We show how linguistic spatial descriptions and other spatial information can be extracted from an evidence grid map and...

  6. Towards an Analysis of Multi-Party Discourse

    Robert Malouf
    Halliday (1984) argues that the study of language as a unified object can only be justified if linguists address language simultaneously as both a process and a system of relations. Birmingham-style Discourse Analysis (DA) is one model of discourse that has been developed to satisfy Halliday's challenge. However, DA has only been applied to two-party discourse, and would seem to fall short of accounting for the full range of linguistic communication. One attractive proposal for dealing with multiparty discourse in a very different theoretical framework is Clark and Carlson's (1992) Informative Hypothesis. They argue that speech acts are primarily directed...

  7. HIZKING21: Integrating language engineering resources and tools into systems with linguistic capabilities

    A. Diaz De Ilarraza; A. Gurrutxaga; I. Hernaez; N. Lopez De Gereñu; K. Sarasola; Elhuyar Fundazioa
    We present the main lines of the HIZKING21 project. The main goal of the project is to foster research in language engineering in order to meet the demands of today's globalized environment. Our field is the development of language technologies for Basque, together with the integration of resources and tools for the language industries (some that already exist and others to be developed in this project) into a range of devices (PCs, PDAs, household appliances, in-car equipment, etc.). Our objective is to contribute to interaction with all kinds of easy-to-use devices, using language as the...

  8. Embodied Construction Grammar in Simulation-Based Language Understanding

    Benjamin K. Bergen; Nancy Chang
    We present Embodied Construction Grammar, a formalism for linguistic analysis designed specifically for integration into a simulation-based model of language understanding. As in other construction grammars, linguistic constructions serve to map between phonological forms and conceptual representations.

  9. An account of negated sentences in the DRT framework

    Pascal Amsili; Anne Le Draoulec
    ... this paper is to show how this set of phenomena can be accounted for, either by means of Kamp & Reyle's proposal, or through other relevant principles, inspired in particular from Asher's treatment of abstract objects (Asher 1993). Our proposal is grounded on French linguistic data, involving in particular temporal clauses. Moreover, we will study here only sentential negation, which is achieved in French via the locution ne... pas. It is worth noting that the behaviour of this French "canonical" construction is different from that of its English equivalent, aux+not. Some of the examples we give in this paper...

  10. Probabilistic Context-Free Grammars for Phonology

    Karin Müller
    We present a phonological probabilistic context-free grammar, which describes the word and syllable structure of German words. The grammar is trained on a large corpus by a simple supervised method, and evaluated on a syllabification task achieving 96.88% word accuracy on word tokens, and 90.33% on word types. We added rules for English phonemes to the grammar, and trained the enriched grammar on an English corpus. Both grammars are evaluated qualitatively showing that probabilistic context-free grammars can contribute linguistic knowledge to phonology. Our formal approach is multilingual, while the training data is language-dependent.

  11. Fuzzy Analogy: A New Approach for Software Cost Estimation

    Ali Idri; Alain Abran; Taghi M. Khoshgoftaar
    Estimation models in software engineering are used to predict some important attributes of future entities such as software development effort, software reliability and programmer productivity. Among these models, those estimating software effort have motivated considerable research in recent years. Estimation by analogy is one of the most attractive techniques in the software effort estimation field. However, the procedure used in estimation by analogy is not yet able to correctly handle categorical data such as 'very low', 'complex' and 'average'. In this paper, we propose a new approach based on reasoning by analogy, fuzzy logic and linguistic quantifiers to estimate effort when...

  12. Efficient Language Model Lookahead Through Polymorphic Linguistic Context Assignment

    Hagen Soltau; Florian Metze; Christian Fügen; Alex Waibel
    In this study, we examine how fast decoding of conversational speech with large vocabularies profits from efficient use of linguistic information, i.e. language models and grammars. Based on a re-entrant single pronunciation prefix tree, we use the concept of linguistic context polymorphism to achieve an early incorporation of language model information. This approach allows us to use all available language model information in a one-pass decoder, using the same engine to decode with statistical n-gram language models as well as context free grammars or re-scoring of lattices in an efficient way. We compare this approach to...

  13. Tokenization and Proper Noun Recognition for Information Retrieval

    Fco Mario Barcala; Jesus Vilares; Miguel A. Alonso; Jorge Grana; Manuel Vilares
    In this paper we consider a set of natural language processing techniques that can be used to analyze large amounts of texts, focusing on the advanced tokenizer which accounts for a number of complex linguistic phenomena, as well as for pre-tagging tasks such as proper noun recognition. We also show the results of several experiments performed in order to study the impact of the strategy chosen for the recognition of proper nouns.

  14. Evaluating Word Alignment Systems

    Magnus Merkel; Lars Ahrenberg
    ... In this paper we use the notion of word alignment systems as a general term for systems that align linguistic units below the sentence level across two languages. These linguistic units could be expressed as single words, phrases, terms or collocations. The majority of the word alignment systems described in the literature fall into two main categories: (1) Full-text alignment systems, and (2) Bilingual lexicon extraction systems. Below these main categories, it is possible to make further divisions into, for example, bilingual concordancing and bilingual information retrieval for the first category, and technical terminology systems and systems that compile...

  15. Knowledge-Based Extraction of Named Entities

    Jamie Callan; Teruko Mitamura
    The usual approach to named-entity detection is to learn extraction rules that rely on linguistic, syntactic, or document format patterns that are consistent across a set of documents. However, when there is no consistency among documents, it may be more effective to learn document-specific extraction rules.

  16. Optimization Under Fuzzy Linguistic Rule Constraints

    Christer Carlsson; Robert Fullér; Silvio Giove
    Suppose we are given a mathematical programming problem in which the functional relationship between the decision variables and the objective function is not completely known. Our knowledge base consists of a block of fuzzy if-then rules, where the antecedent part of the rules contains some linguistic values of the decision variables, and the consequence part is a linear combination of the crisp values of the decision variables. We suggest the use of the Takagi and Sugeno fuzzy reasoning method to determine the crisp functional relationship between the objective function and the decision variables, and solve the resulting (usually nonlinear) programming problem to...
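
    The reasoning step described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the triangular membership functions, the two linguistic values "small" and "big" on [0, 10], the four rules and their coefficients, the use of min as the conjunction, and the coarse grid search standing in for a proper nonlinear solver. Each rule's firing level weights its linear consequent, and the crisp objective is the weighted average over all rules.

```python
from itertools import product


def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set (left, peak, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic values for a decision variable on [0, 10].
SMALL = (-10.0, 0.0, 10.0)
BIG = (0.0, 10.0, 20.0)

# Hypothetical rule base: "if x1 is A and x2 is B then y = c1*x1 + c2*x2".
RULES = [
    ((SMALL, SMALL), (1.0, 1.0)),
    ((SMALL, BIG), (1.0, -1.0)),
    ((BIG, SMALL), (2.0, 0.5)),
    ((BIG, BIG), (0.0, 3.0)),
]


def ts_objective(x):
    """Crisp objective value from Takagi-Sugeno style reasoning."""
    weighted_sum = 0.0
    total = 0.0
    for sets, coeffs in RULES:
        # Firing level: min over the antecedent memberships (an assumption).
        alpha = min(triangular(xi, *s) for xi, s in zip(x, sets))
        # Rule consequent: linear combination of the crisp inputs.
        z = sum(c * xi for c, xi in zip(coeffs, x))
        weighted_sum += alpha * z
        total += alpha
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    return weighted_sum / total


def grid_maximize(step=1.0, upper=10.0):
    """Coarse grid search over [0, upper]^2; a stand-in for a real solver."""
    n = int(upper / step)
    points = [i * step for i in range(n + 1)]
    return max(product(points, repeat=2), key=ts_objective)
```

    Calling ts_objective((5.0, 5.0)) fires all four rules at level 0.5 and averages their consequents; grid_maximize() then returns the grid point with the largest objective value. The resulting objective is generally nonlinear in the decision variables even though every consequent is linear, which is why the abstract speaks of a nonlinear programming problem.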

  17. Integrating Linguistic and Domain Knowledge for Spoken Dialogue Systems in Multiple Domains

    Myroslava O. Dzikovska; James F. Allen; Mary D. Swift
    One challenge for developing spoken dialogue systems in multiple domains is facilitating system component communication using a shared domain ontology. Since each domain comes with its own set of concepts and actions relevant to the application, adapting a system to a new domain requires customizing components to use the ontological representations required for that domain. Our research in multiple domain development has highlighted differences in the ontological needs of a general-purpose language interface and a task-specific reasoning application. Although different domain applications have their own ontologies, many aspects of spoken dialogue interaction are common across domains. In this paper, we...

  18. Logopenic syndrome in posterior cortical atrophy

    Eloi Magnin; Geraldine Sylvestre; Flora Lenoir; Elfried Dariel; Louise Bonnet; Gilles Chopard; Gregory Tio; Julie Hidalgo; Sabrina Ferreira; Catherine Mertz; Mikael Binetruy; Ludivine Chamard; Sophie Haffen; El Lucien Rumbach
    Few language disorders have been reported in posterior cortical atrophy (PCA). Furthermore, no study has focused on screening for them and described these language deficits. The goal of this work was to describe linguistic examination of PCA patients and the impact of language disorders on neuropsychological performances compared to patients with other neurodegenerative syndromes and control groups. Linguistic examination of 9 PCA patients was carried out. The neuropsychological performance of the PCA group (16 patients) in the RAPID battery tests was compared with performances of patients with a logopenic variant of primary progressive aphasia (LPPA), patients with Alzheimer’s disease...

  19. Structural Analysis of Chinese Dialect Speakers and Their Automatic Classification

    Xuebin Ma; Nobuaki Minematsu; Akira Nemoto; Max Takazawa; Yu Qiao; Keikichi Hirose
    In China, there are many different kinds of dialects and sub-dialects. Because there are many grammatical, lexical, phonological, and phonetic differences among them in varying degrees, people from different dialect regions always have difficulties in oral communication. Since 1956, standard Mandarin has been popularized all over the country as the official language, and almost every dialect speaker began to learn Mandarin as a second language. But affected by their native dialects, many of them speak Mandarin with regional accents. In modern speech processing technologies, speech is represented by a spectrum which contains not only the dialectal linguistic information but also extra-linguistic...

  20. Learning words and speech units through natural interactions

    Jonas Hörnstein; José Santos-Victor
    This work provides an ecological approach to learning words and speech units through natural interactions, without the need for preprogrammed linguistic knowledge in the form of phonemes. Interactions such as imitation games and multimodal word learning create an initial set of words and speech units. These sets are then used to train statistical models in an unsupervised way. Index Terms: multimodal learning, ecological approach, motor learning, interactions

