Dictionary Parsing Project


CL Research is parsing definitions of the 1913 Webster's 2nd International Dictionary (W2) for the Dictionary Parsing Project. The primary purpose of the parsing is to identify semantic relations between entries (see details below). The dictionary includes 120,000 headwords (including phrasal entries) and 280,000 definitions. We are using the Proximity Technology, Inc. parser embedded in the DIMAP-3 dictionary creation and maintenance programs available at CL Research. Source code for the parser, suitable for Unix or Windows programs, is available to interested persons or groups. Parsing the full dictionary takes approximately 10 hours. Preparing DIMAP dictionaries for the definitions required about 30 hours. These dictionaries are available at the Dictionary Parsing Project site. (Semantic networks can be prepared for any other dictionary just as easily.)

Our objective is to create a semantic network, including an ontology, from parsing the definitions. We are also adding syntactic information to the network so that it can be used in further parsing. There is a certain bootstrapping quality to this effort. This page describes the steps that have been taken and the next tasks.

Selection of material

At the present time, we are excluding only the sample usages and quotations from the definitional material contained in W2. The raw data are first broken down into components and tagged with an SGML notation (described in detail at the Dictionary Parsing Project web page). The definitions in W2 typically consist of a phrase whose first word is capitalized and which is terminated by a period. There may be several verbalizations or slight definitional nuances in this phrase, separated by semicolons. We treat each semicolon-delimited phrase as a definition for parsing. Many of the definitions contain encyclopedic information in the form of complete sentences following the definition; this material is not parsed at the present time.
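The splitting step described above can be sketched as follows. This is a minimal illustration, not the DIMAP code: the function name `split_definition` is hypothetical, the sample text in the usage note is invented rather than taken from W2, and the sketch assumes the defining phrase ends at the first period (so it would mishandle abbreviations that contain periods).

```python
from typing import List


def split_definition(raw: str) -> List[str]:
    """Split one raw definition into semicolon-delimited phrases.

    Assumption: the defining phrase runs up to the first period, and any
    complete sentences after it are encyclopedic material that is skipped.
    """
    defining_phrase = raw.split(".", 1)[0]
    # Each semicolon-delimited piece is treated as a separate definition.
    return [part.strip() for part in defining_phrase.split(";") if part.strip()]
```

For an invented input such as "A domesticated carnivore; a canine animal. Dogs have long been kept as pets.", the function returns the two phrases before the period and drops the trailing sentence.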

At the present time, the parser does not handle incomplete sentences very well; definitions are always constituent phrases and hence incomplete sentences. To counter this, we have sought ways to create a neutral context that turns definitions of different parts of speech into sentences. For noun definitions, we add "This is" to the beginning of the definition and terminate the string with a period. For prepositions, we add "Something is" to the beginning of the definition and "something" to the end, terminating it with a period. For verbs, we remove any leading "to" and add "They are" to the beginning of the definition and, in the case of transitive verbs, "something" at the end. For adjectives, we add the phrase "This is something" at the beginning of the definition. We have found that this systematically improves results.
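The wrapping rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the DIMAP implementation: the function name `wrap_definition`, the part-of-speech strings, and the `transitive` flag are hypothetical, and the sketch leaves the capitalization of the definition unchanged because the text does not say how it is handled.

```python
def wrap_definition(pos: str, definition: str, transitive: bool = False) -> str:
    """Embed a definition in a neutral context so it reads as a sentence."""
    text = definition.strip().rstrip(".")
    if pos == "noun":
        return f"This is {text}."
    if pos == "preposition":
        return f"Something is {text} something."
    if pos == "verb":
        words = text.split()
        # Remove a leading "to" (any capitalization) before adding the context.
        if words and words[0].lower() == "to":
            words = words[1:]
        tail = " something" if transitive else ""
        return f"They are {' '.join(words)}{tail}."
    if pos == "adjective":
        return f"This is something {text}."
    # Assumption: other parts of speech are passed through with only a period.
    return f"{text}."
```

For example, `wrap_definition("preposition", "on top of")` gives "Something is on top of something." The wrapped strings are not always natural English; the aim, as described above, is only to give the parser a complete sentence.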

Nature of parse results

The parse results are in the form of a parse tree. This tree breaks the definition into its constituents, with grammar nonterminals as interior nodes and lexical items as leaf nodes. Nonterminal nodes may include annotations. Leaf nodes will always include information about the root form of the lexical item, including its inflection characteristics. The parse tree is surrounded by information identifying the head word, its sense number, its field or subject label (if applicable), and the definition that has been parsed.
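One way to picture this output is as a small set of record types. The sketch below is an assumption about shape only: the class names, field names, and the node labels "S", "NP", and "VP" are hypothetical and do not reproduce the Proximity parser's actual output format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Leaf:
    word: str        # lexical item as it appears in the definition
    root: str        # root form of the lexical item
    inflection: str  # inflection characteristics (illustrative values)


@dataclass
class Node:
    label: str                                         # grammar nonterminal
    children: List[Union["Node", Leaf]] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class ParseRecord:
    headword: str
    sense: int
    definition: str                      # the definition that was parsed
    tree: Node
    subject_label: Optional[str] = None  # field or subject label, if any


def leaves(node: Union[Node, Leaf]) -> List[Leaf]:
    """Collect leaf nodes in left-to-right order."""
    if isinstance(node, Leaf):
        return [node]
    found: List[Leaf] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# Invented example record, for illustration only.
record = ParseRecord(
    headword="dog", sense=1, definition="This is an animal.",
    tree=Node("S", [
        Node("NP", [Leaf("This", "this", "singular")]),
        Node("VP", [
            Leaf("is", "be", "present"),
            Node("NP", [Leaf("an", "an", "none"), Leaf("animal", "animal", "singular")]),
        ]),
    ]),
)
```

Walking the tree with `leaves(record.tree)` yields the lexical items in order, each carrying its root form.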

Some parse tree results are available at the main Dictionary Parsing Project site and can be used as is. However, it would be best to use the DIMAP dictionaries to generate your own results, so as to take advantage of the latest functionality in DIMAP.

In addition to the parse results, DIMAP also generates several other files: (1) a file of semantic relations, showing the headword and all semantic relations generated from each sense; (2) a file containing the WordNet analysis results; (3) definitions for which a parse tree was not generated (currently fewer than 100 for the entire W2); (4) parses which were not completely clean (about 30 percent); and (5) definitions containing words unknown to the parsing dictionary (about 10 percent, and almost always resulting in parses that were not clean). The last three files are used primarily as diagnostics, to identify areas for improving the overall system.

Semantic Relations

We identify several semantic relations (semrels) during the parsing, principally the hypernymic relation, which establishes the basic ontological hierarchy for nouns and verbs. We also identify synonymic, meronymic, pertainymic, and several other relations. With DIMAP, it is now possible for the user to define and characterize additional relations by adding defining patterns to the lexical entries for specific words in the (separate) parsing dictionary. In addition, we compare our results and assess whether these are consistent with WordNet (for those relations that are common). See next steps below for a description of the process for further development of semantic relations and links to inventories of semantic relations that are being investigated.
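To make the idea of a defining pattern concrete, here is a deliberately simplified sketch. It is not how DIMAP works: DIMAP attaches patterns to lexical entries and applies them to parse results, whereas this sketch matches regular expressions against the raw definition text. The two patterns, the relation names, and the function `find_semrels` are all hypothetical, and the sample definitions in the usage note are invented.

```python
import re
from typing import List, Tuple

# Hypothetical patterns: each pairs a relation name with a surface pattern
# whose single capture group is the related word.
PATTERNS = [
    ("meronym", re.compile(r"\bpart of (?:an? |the )?(\w+)")),
    ("pertainym", re.compile(r"\bpertaining to (?:an? |the )?(\w+)")),
]


def find_semrels(headword: str, definition: str) -> List[Tuple[str, str, str]]:
    """Return (headword, relation, target) triples found in one definition."""
    text = definition.lower()
    relations = []
    for name, pattern in PATTERNS:
        for match in pattern.finditer(text):
            relations.append((headword, name, match.group(1)))
    return relations
```

With the invented definition "A part of the corolla" for the headword "petal", the sketch yields one triple relating "petal" to "corolla". Identifying the hypernym requires finding the head of the defining phrase, which depends on the parse tree and is not attempted here.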

The rules for recognizing the individual relations are as follows:

Progress report

As of December 31, 1998, we have parsed entries for the complete dictionary, identifying 142,000 hypernyms for noun and verb senses. Parsing is fairly rapid, proceeding at about 400 definitions per minute on a Pentium II, 64 MB, 233 MHz machine. Since we are currently in a stage of developing additional semantic relations, we have not run the parser against the entire dictionary to determine how many additional relations are identified.

Contemplated next steps

We are rapidly expanding the set of other semantic relations that may be recognized, as well as extending the functionality to handle additional mechanisms for recognizing relations. We will also continue working to improve the parser. Any suggestions for improvement or offers of assistance are welcome.

Investigation of semantic relation types

The now classic implementation of a lexical knowledge base (LKB) containing semantic relation links is WordNet, with 10 noun relations, 6 verb relations, 5 adjective relations, and 2 adverb relations, in addition to the base relation of synonymy. The set primarily includes antonymy, hypernymy, meronymy, holonymy, entailment and cause (for verbs), and basic cross-references. EuroWordNet has followed WordNet to a large extent but has some variations and additions (primarily dealing with roles/involvement such as agent, patient, instrument, location, direction, and manner). (See their documents D005 and D006 for more details.) Microsoft's MindNet currently uses 24 relations that generally follow those used by WordNet and EuroWordNet, with this set being considered "optimal" in terms of automatic identification from machine-readable dictionaries (MRDs) and not being unwieldy. (For details on MindNet, see Lucy Vanderwende's and Steve Richardson's theses, available at Microsoft, using "Vanderwende" and "Richardson" for your searches; there are several other papers describing MindNet in more general terms as well.) These systems have been developed primarily within the linguistic community and have not necessarily been optimized for use in applications or specific domains. (Early work on formalizing semantic relations based on MRDs can be found in Ann Copestake's thesis; search for "aacthesis".)

In the medical community, the National Library of Medicine, in its development of the Unified Medical Language System (UMLS), has developed a semantic relation hierarchy containing 56 relations. Each relation identifies the types of entities or events in the UMLS semantic network that are arguments of the relation. The relations are hierarchized under physical, spatial, function, temporal, and conceptual categories. They clearly subsume the relations in WordNet, EuroWordNet, and MindNet. Further expansion of semantic relations based on domain specificity is envisioned in the FrameNet project.

Patrick Cassidy has elaborated on the UMLS relations to correspond to his investigation of Webster's 2nd International Dictionary. Robert Parks, in conjunction with his Wordsmyth English Dictionary Thesaurus, has developed an inventory of relations.

Most semantic relations are based on prepositions. Under the Dictionary Parsing Project, Bruce Jakeway has developed a hierarchy of prepositions (search for "preposition chart"). We have developed a DIMAP dictionary of prepositions (containing their definitions and the beginnings of a hierarchical arrangement).

The task at hand is, therefore, to reconcile these different notions of semantic relations and to develop procedures for their identification by (semi)-automatic means.

This document is maintained by Kenneth Litkowski (ken@clres.com).
Material Copyright © 2001 CL Research