
CL Research Knowledge Management System (KMS)
The Knowledge Management System (KMS) is designed to provide a single
interface for a range of text processing functions:
- creating repositories of texts (frequently based on integrated web
searches),
- creating a single XML representation of texts for several types of analysis
(incorporating full parsing, discourse analysis, discourse entity analysis, and
XML tagging of text elements),
- characterizing document contexts through automatic keyword generation and
headline creation,
- answering natural language questions,
- creating general and topic-based summaries (where a topic can be described
by a single word or a full paragraph),
- semantic category analysis of major text elements (nouns and verbs), and
- creation of single or multiple document ontologies.
(See the KMS slide show for a detailed description, including screen shots.)
CL Research is now seeking beta-testers for KMS. KMS is best viewed as a
tool for a regularized, knowledge-intensive process, such as intelligence
gathering, scientific research, litigation support, or any other continuing
need for information of a specific type. In working with clients, CL Research has
found that each client has a different information need (that is, follows a
unique user model). No general models of user behavior in making use of the
technologies incorporated in KMS (i.e., question answering, summarization,
information extraction, document exploration, and ontology use) have been
developed in the research community. CL Research has developed a beta-testing
paradigm designed to examine and characterize different user models.
The Beta-Testing Program
Acceptable beta-testers will have a reasonably well-developed and
characterized information need. CL Research will provide, at no cost, all
components of KMS and its supporting programs for a one-year period, upon the
beta-tester signing a non-disclosure agreement and a
beta-testing agreement. KMS
contains an integrated component for requesting assistance, reporting bugs, and
suggesting features. CL Research will not provide any direct assistance, other
than attempting to incorporate user comments in revisions of KMS (unless the
beta-tester wishes to enter into a separate contractual agreement). The
beta-tester may keep any output generated by KMS, without any restriction. (All
KMS output is in an XML format with simple structures, and may include answers
to questions, keyword lists, single or multiple document summaries, and single
or multiple document ontologies.) CL Research makes no promise that KMS will be
released as a formal product. For further information, contact
CL Research.
Core Technology
KMS incorporates the latest language engineering technologies covering the
full spectrum of text processing from the word level to summaries of multiple
texts. Text from a variety of common formats (such as HTML, DOC, PDF, and WPD)
is converted into XML documents and is then processed into a unified framework
(XML tagged) that enables full exploitation of the meaning of the text. Using a
single interface to access the XML-tagged representation, the user can create
general summaries of one or more documents, create topical summaries focused on
events or points of view, obtain answers to fact-based questions (with the
sentences in which they're found), create essay summaries answering more
general questions, extract information for databases, examine a document's
semantic network structure, and probe the details of documents from many
perspectives. CL Research's software consists of three principal components:
text processing, text summarization, and text analysis. The overall
architecture of KMS is shown below.
Text Processing
CL Research's core text processing technology creates an XML representation
of the text and includes the following features:
- Full Text Parsing: Separates text into sentences, parses each
sentence, and analyzes the semantic content of each sentence (including
relationships to previous sentences in the text).
- XML Representation: Creates an XML representation of the
text as a nested and tagged structure of sentences, clauses, noun phrases,
verbs, and prepositions (important carriers of semantic relationships), each of
which is tagged with important syntactic and semantic characteristics.
- Multiple File Formats: Processes arbitrary XML and SGML
files (with the user specifying the DTD text elements to be processed). Auxiliary
programs are available to convert several web page styles (HTML, CFM, SFM),
Word documents (DOC), Adobe Acrobat files (PDF), and WordPerfect documents (WPD)
into a common XML representation.
- Rapid Processing: Processes text at 400 sentences per minute, a
state-of-the-art speed.
- Dictionary and Thesaurus Incorporation: Makes use of a
standard publicly available dictionary/thesaurus (WordNet) and, optionally,
licensable dictionaries and thesauruses from major publishers. CL Research's
core dictionary technology (DIMAP) facilitates the
incorporation of specialized dictionaries into KMS (law dictionaries,
technology dictionaries, medical dictionaries, or any dictionary/thesaurus
specific to the user's needs). (Will be extended to incorporate word sense
disambiguation technologies developed by CL Research.)
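As a rough illustration of the outermost layer of such a representation, the following Python sketch splits text into sentences and wraps each one in a numbered XML element. It is not the KMS parser: the element and attribute names (`document`, `sentence`, `id`, `num`) and the regular-expression splitter are assumptions made for this example, and the clause, noun phrase, verb, and preposition tagging described above is omitted.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def to_xml(doc_id, text):
    """Wrap each sentence in a <sentence> element with a running number.

    The tag and attribute names are hypothetical, not the KMS schema.
    """
    root = ET.Element("document", id=doc_id)
    for number, sentence in enumerate(split_sentences(text), start=1):
        node = ET.SubElement(root, "sentence", num=str(number))
        node.text = sentence
    return root


root = to_xml("d1", "KMS parses text. It tags each sentence!")
print(ET.tostring(root, encoding="unicode"))
```

Numbering sentences at this stage is what lets later steps report a document and sentence number alongside any answer or summary sentence.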
Text Summarization
Text summarization is performed with an XML analyzer that enables
examination of one or many documents from many angles (virtually
instantaneously for moderately-sized collections, such as 50 newspaper
articles), including the following:
- General Text Summarization: Creates an extractive summary
(selecting sentences) of all or selected documents, of a user-selected length,
based on a unique frequency analysis of noun phrases that includes substitution
of full names for referring phrases (such as pronouns or a person's last
name).
- Topical Text Summarization: Uses the same methods as general
summarization, with the addition that the user can write a topical statement to
focus the summary to a particular topic, event, or slant (the "topical
statement" can be a simple list of key words).
- Sentence Question Answering: Uses the same methods as
topical summarization, except with "topics" phrased as questions to
select the most likely sentences answering the question (user can also create a
summary of the sentences).
- Batch Screening against Established Topics or Questions:
Allows user to specify a set of topics or questions to screen each document or
all documents, to create a summary for each topic or question (optionally
outputting all the answers to an XML file).
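The general idea behind frequency-based extractive summarization, with an optional topical statement, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the KMS method: it counts single words rather than noun phrases, performs no substitution of full names for pronouns, and the stop list, the `summarize` function, and the boost weight of 2.0 are all invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the KMS stop list).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "on", "it"}


def content_words(text):
    """Lowercase the text and keep only words outside the stop list."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(sentences, max_sentences=2, topic=None):
    """Pick the highest-scoring sentences and return them in document order."""
    freq = Counter(w for s in sentences for w in content_words(s))
    topic_words = set(content_words(topic)) if topic else set()

    def score(sentence):
        words = content_words(sentence)
        if not words:
            return 0.0
        base = sum(freq[w] for w in words) / len(words)      # average word frequency
        boost = sum(1 for w in set(words) if w in topic_words)  # topical overlap
        return base + 2.0 * boost

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


sentences = [
    "The merger was announced on Monday.",
    "Analysts expect the merger to close in June.",
    "The weather was mild.",
    "Shares of the company rose after the merger news.",
]
general = summarize(sentences)
focused = summarize(sentences, topic="shares")
print(general)
print(focused)
```

Passing a question as the `topic` argument gives a crude version of sentence question answering, and looping over a list of topics gives a crude version of batch screening.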
Text Analysis
Text analysis allows the user to probe more deeply into the document
collection (exploiting the rich underlying XML structure) with a variety of
tools.
- Factual Question Answering: Using the XML query language
(XPath expressions), provides the exact answer to a question, identifying the
document and sentence number and providing the full sentence that answers the
question.
- Information Extraction: (Planned) Using the underlying
technology described in batch screening summarization, allows the development
of a set of syntactic and semantic patterns to extract information nuggets from
the document set, for use in filling templates (e.g., a doctor's diagnosis and
treatment, or an organization's mergers and acquisitions).
- Frequency Counts: Generates a frequency list of words in noun
phrases, uniquely substituting the full reference for co-referring expressions
(such as pronouns and shortened coreferences to organizations or people), and
ignoring commonly used words (a stop list).
- Finding Occurrences: Finds occurrences of user-selected words
from frequency counts and returns their full noun phrases, the document and
sentence number, and the full sentence.
- Finding Relations: For the selected words, examines the
relations they hold in their occurrences, such as subject, object, and
prepositional object, along with the words that govern them (such as the verbs
or prepositions).
- Synonyms and Hierarchical Relations: Establishes a hierarchy
of the nouns and verbs in the document(s), using the publicly available
WordNet, allowing users to identify sentences expressing similar concepts with
different words.
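Several of these tools amount to queries over the tagged XML. The Python sketch below shows frequency counts, finding occurrences with their relations, and a factual question answered with an XPath expression, using the limited XPath subset supported by the standard library's `xml.etree.ElementTree`. The sample markup (`collection`, `document`, `sentence`, `np`, and the `role` and `governor` attributes) is hypothetical and hand-written for this example; the actual KMS tag set is not described here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-annotated sample in an assumed schema (not the KMS schema).
SAMPLE = """
<collection>
  <document id="d1">
    <sentence num="1" text="Acme acquired Widgets Inc. in 2004.">
      <np role="subject" governor="acquired">Acme</np>
      <np role="object" governor="acquired">Widgets Inc.</np>
      <np role="prep-object" governor="in">2004</np>
    </sentence>
    <sentence num="2" text="Acme reported record profits.">
      <np role="subject" governor="reported">Acme</np>
      <np role="object" governor="reported">record profits</np>
    </sentence>
  </document>
</collection>
""".strip()

root = ET.fromstring(SAMPLE)

# Frequency counts: how often each noun phrase occurs in the collection.
counts = Counter(np.text for np in root.iter("np"))


def occurrences(root, word):
    """Return (document id, sentence number, relation, governing word) per hit."""
    hits = []
    for doc in root.findall("document"):
        for sent in doc.findall("sentence"):
            for np in sent.findall("np"):
                if np.text == word:
                    hits.append((doc.get("id"), sent.get("num"),
                                 np.get("role"), np.get("governor")))
    return hits


# Factual question "Who acquired ...?": the subject governed by "acquired".
answer = root.find(".//np[@role='subject'][@governor='acquired']").text

print(counts.most_common(1))
print(occurrences(root, "Acme"))
print(answer)
```

Because each hit carries its document identifier and sentence number, the full sentence that supports an answer can always be retrieved from the same structure.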
Maintained by Ken Litkowski.
August 20, 2005
Copyright © 2005 CL Research