Natural Language Engineering http://journals.cambridge.org/NLE

A statistical model for grammar mapping

A. BASIRAT, H. FAILI and J. NIVRE

Natural Language Engineering / FirstView Article / March 2015, pp 1 - 41

DOI: 10.1017/S1351324915000017, Published online: 20 February 2015

Link to this article: http://journals.cambridge.org/abstract_S1351324915000017

How to cite this article:

A. BASIRAT, H. FAILI and J. NIVRE. A statistical model for grammar mapping. Natural Language Engineering, Available on CJO 2015 doi:10.1017/S1351324915000017

Natural Language Engineering: page 1 of 41. © Cambridge University Press 2015 doi:10.1017/S1351324915000017

A statistical model for grammar mapping

A. BASIRAT1,2, H. FAILI1,3 and J. NIVRE2

1School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
email: ali.basirat@lingfil.uu.se

2Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
email: joakim.nivre@lingfil.uu.se

3School of Computer Science, Institute for Research in Fundamental Sciences (IPM), P. O. Box 19395-5746, Tehran, Iran
email: h.faili@ut.ac.ir

(Received 1 July 2012; revised 20 January 2015; accepted 22 January 2015)

Abstract

The two main classes of grammars are (a) hand-crafted grammars, which are developed by language experts, and (b) data-driven grammars, which are extracted from annotated corpora. This paper introduces a statistical method for mapping the elementary structures of a data-driven grammar onto the elementary structures of a hand-crafted grammar in order to combine their advantages. The idea is employed in the context of Lexicalized Tree-Adjoining Grammars (LTAG) and tested on two LTAGs of English: the hand-crafted LTAG developed in the XTAG project, and the data-driven LTAG, which is automatically extracted from the Penn Treebank and used by the MICA parser. We propose a statistical model for mapping any elementary tree sequence of the MICA grammar onto a proper elementary tree sequence of the XTAG grammar. The model has been tested on three subsets of the WSJ corpus that have average lengths of 10, 16, and 18 words, respectively. The experimental results show that full-parse trees with average F1-scores of 72.49, 64.80, and 62.30 points could be built from 94.97%, 96.01%, and 90.25% of the XTAG elementary tree sequences assigned to the subsets, respectively. Moreover, by reducing the amount of syntactic lexical ambiguity of sentences, the proposed model significantly improves the efficiency of parsing in the XTAG system.

1 Introduction

Computational grammars are among the mathematical tools used for modeling natural languages. Information about the morphology, syntax, and semantics of a modeled language can be incorporated into a grammar to represent the structure of the language. Such grammars can be expressed in various formalisms, such as context-free grammars (CFGs) (Chomsky 1959), head-driven phrase structure grammars (HPSGs) (Pollard and Sag 1994), combinatory categorial grammars (CCGs) (Steedman 2000), lexical-functional grammars (LFGs) (Kaplan and Bresnan 1982), and LTAGs (Joshi 1985).

Generally, there are two sources of knowledge for the development of a grammar: (1) linguistic theories and (2) the implicit knowledge embedded in treebanks (Xia 2001). A grammar resulting from the former source of knowledge gives a theoretical linguistic perspective on the language. Aside from the syntactic descriptions, the elementary structures of these grammars are enriched with complex descriptions, such as semantic representations. The linguistically motivated properties of these grammars make them suitable for various NLP tasks that require complex descriptions of grammar elements (e.g., semantic interpretation and machine translation) (Bangalore, Haffner and Emami 2005). On the other hand, a grammar resulting from the latter source of knowledge gives a data-oriented perspective on the language. The main focus of this kind of grammar is on the statistical distribution of co-occurring syntactic phenomena induced from training corpora. These grammars are suitable for use in the modern statistical approaches proposed for NLP applications such as statistical parsing and statistical machine translation.

This paper deals with the problem of linking these two perspectives (i.e., the theory-oriented and data-oriented perspectives) by building a bridge between their related grammars. The bridge can provide a way to augment the theoretical perspective of a language with the data-oriented perspective of that language (and vice versa) in order to combine their advantages.

The idea is implemented in the context of LTAG. The main reason for this choice is the special and practical properties of LTAGs for representing the structural descriptions of syntactic phenomena (Kroch and Joshi 1985; Frank 2004). The power of this formalism is the direct result of the way it factors out recursions and dependencies from the domain of locality. Moreover, its extended domain of locality allows the formalism to impose syntactic and semantic constraints on the relevant arguments within the same elementary structure.

The widespread use of this formalism in natural language processing applications such as summarization (Eddy, Bental and Cawsey 2001), discourse parsing (Forbes et al. 2003), syntactic parsing (Schabes and Joshi 1988; Bangalore and Joshi 1999; Bangalore et al. 2009), information retrieval (Chandrasekar and Bangalore 1997), and machine translation (Shieber and Schabes 1990; DeNeefe and Knight 2009; Ma and McKeown 2012) is another reason to concentrate on this formalism.

Formally, an LTAG is a tree-generating system that forms a language by a set of derived trees (Joshi and Schabes 1997). The elementary units of rewriting in this formalism are the basic syntactic structures, which are represented in the form of trees, called elementary trees.1 Each elementary tree is associated with a lexical item, called the anchor, and defines a syntactic environment in which the anchor can appear. Accordingly, the elementary trees directly address the concept of the supertag, which is defined as a complex description of the syntactic environment of the lexemes (Bangalore et al. 2005; Bangalore et al. 2009).
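As an informal illustration of these definitions, the following Python sketch represents an elementary tree as a small data structure with a single anchor node and treats the tree's name as the supertag of the anchoring word. It is not taken from the XTAG or MICA implementations: the class names, the node kinds, the tree shapes, and the tree names (alpha_nx0Vnx1, beta_vxARB, alpha_NXN) are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    label: str                      # syntactic category, e.g. "S", "NP", "V"
    kind: str = "internal"          # "internal", "anchor", "subst", or "foot"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None      # lexical item, set only on the anchor


@dataclass
class ElementaryTree:
    name: str                       # hypothetical tree name, used as the supertag
    root: Node

    def nodes(self) -> Iterator[Node]:
        """Yield the nodes of the tree in pre-order (left to right)."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(reversed(node.children))

    def anchor(self) -> Node:
        """Return the unique anchor node of this sketch's trees."""
        anchors = [n for n in self.nodes() if n.kind == "anchor"]
        if len(anchors) != 1:
            raise ValueError(f"{self.name}: expected one anchor, found {len(anchors)}")
        return anchors[0]

    def is_auxiliary(self) -> bool:
        """A tree with a foot node is auxiliary; otherwise it is initial."""
        return any(n.kind == "foot" for n in self.nodes())


def transitive_verb_tree(word: str) -> ElementaryTree:
    """Initial tree: S -> NP(subst) VP, VP -> V(anchor) NP(subst)."""
    verb = Node("V", kind="anchor", word=word)
    vp = Node("VP", children=[verb, Node("NP", kind="subst")])
    root = Node("S", children=[Node("NP", kind="subst"), vp])
    return ElementaryTree("alpha_nx0Vnx1", root)


def adverb_tree(word: str) -> ElementaryTree:
    """Auxiliary tree: VP -> Adv(anchor) VP(foot)."""
    adverb = Node("Adv", kind="anchor", word=word)
    root = Node("VP", children=[adverb, Node("VP", kind="foot")])
    return ElementaryTree("beta_vxARB", root)


def noun_tree(word: str) -> ElementaryTree:
    """Initial tree: NP -> N(anchor)."""
    return ElementaryTree("alpha_NXN", Node("NP", children=[Node("N", kind="anchor", word=word)]))


def supertag_sequence(trees: List[ElementaryTree]) -> List[Tuple[str, str]]:
    """Pair each anchor word with the name of its elementary tree."""
    return [(t.anchor().word, t.name) for t in trees]


sentence = [noun_tree("Mary"), adverb_tree("really"),
            transitive_verb_tree("likes"), noun_tree("John")]
tags = supertag_sequence(sentence)
```

In this sketch, a sentence is described by one elementary tree per word, and the resulting list of (word, tree name) pairs is the kind of elementary tree sequence that the abstract refers to when it speaks of mapping MICA sequences onto XTAG sequences.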