Re: Natural language analysis and Database

Stephen Johnson (johnson@cucis.cis.columbia.edu)
Thu, 30 Mar 1995 09:54:44 -0500

>I am working with a graduate student whose main training has been in
>mathematics and computer science -- he has also studied linguistics recently.
>He is working on a model of natural language that involves using a database to
>encode sentences -- ultimately to enable one to query the database and obtain
>correct answers. My involvement involves testing his programs with two
>non-Indo-European languages.
> A colleague has suggested that this entire line of inquiry "has been
>done before with entirely negative results". I would appreciate comment from
>those working more immediately in this area.
>
>E. Todd, Professor, Trent University, Peterborough, Ontario, Canada K9J 7B8

Naomi Sager at New York University has been doing this (successfully)
for years in the medical domain. She has published two books and many
papers on the subject. My colleague, Carol Friedman, and I currently
have such a system running in production, which structures the
sentences of radiology reports and stores them in a database to
support queries by physicians and others. It is one of the few
production NLP systems that I know of. The system is based on general
principles and can be used for other purposes. We have published
several papers on this system.

The key to success is working in a restricted semantic domain
(radiology reports, computer manuals, weather reports, cookbooks,
etc.). The language of such a domain is called a "sublanguage" -
there are two collections of papers on this topic: one by Kittredge
and Lehrberger and the other by Grishman and Kittredge. (I assume
most people on this list have access to on-line search facilities, so
I won't give the complete references.)

Your colleague may be referring to attempts to work with unrestricted
language. I would have to agree that a robust system for processing a
corpus of general language (novels, newspaper articles, textbooks,
etc.) is beyond the current state of the art. The database model
would, of course, have to be very generic. However, some useful
retrievals may be possible by structuring sentences into a
predicate-argument representation. Be warned, though, that the parsing
problems
of unrestricted text are notoriously difficult (e.g. coordinate
conjunction)!
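
To make the predicate-argument idea concrete, here is a minimal sketch
in Python using the standard sqlite3 module. It is not the design of
our system or of Sager's; the table layout, the role names ("agent",
"theme") and the sample rows are all assumptions made up for
illustration, and the hard part (parsing) is skipped entirely by
entering the structures by hand.

```python
import sqlite3

# Hypothetical, hand-built predicate-argument facts:
# (sentence_id, predicate, role, argument).
# A real system would produce these with a parser; none is attempted here.
FACTS = [
    (1, "show", "agent", "radiograph"),
    (1, "show", "theme", "infiltrate"),
    (2, "show", "agent", "radiograph"),
    (2, "show", "theme", "effusion"),
    (3, "rise", "theme", "temperature"),
]


def build_store(facts):
    """Load the facts into an in-memory table (assumed one-table schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact ("
        "sentence_id INTEGER, predicate TEXT, role TEXT, argument TEXT)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", facts)
    return conn


def sentences_with(conn, predicate, role, argument):
    """Return ids of sentences where `argument` fills `role` of `predicate`."""
    rows = conn.execute(
        "SELECT DISTINCT sentence_id FROM fact "
        "WHERE predicate = ? AND role = ? AND argument = ? "
        "ORDER BY sentence_id",
        (predicate, role, argument),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = build_store(FACTS)
    # Which sentences report that something shows an effusion?
    print(sentences_with(store, "show", "theme", "effusion"))
```

A single generic table like this is about as far as one can go without
committing to a domain; in a sublanguage the roles and argument classes
can be made much more specific, which is where the retrievals become
reliable.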

-Stephen B. Johnson, Ph.D.
-Assistant Professor
-Department of Medical Informatics
-Columbia University