Naomi Sager at New York University has been doing this (successfully)
for years in the medical domain. She has published two books and many
papers on the subject. My colleague, Carol Friedman, and I currently
have such a system running in production, which structures the
sentences of radiology reports and stores them in a database to
support queries by physicians and others. It is one of the few
production NLP systems that I know of. The system is based on general
principles and can be used for other purposes. We have published
several papers on this system.
The key to success is working in a restricted semantic domain
(radiology reports, computer manuals, weather reports, cookbooks,
etc.). The language of such a domain is called a "sublanguage" -
there are two collections of papers on this topic: one by Kittredge
and Lehrberger and the other by Grishman and Kittredge. (I assume
most people on this list have access to on-line search facilities, so
I won't give the complete references.)
Your colleague may be referring to attempts to work with unrestricted
language. I would have to agree that a robust system for processing a
corpus of general language (novels, newspaper articles, textbooks,
etc.) is beyond the current state of the art. The database model
would, of course, have to be very generic. However, some useful
retrievals may be possible by structuring sentences into a
predicate-argument representation. Be warned, though, that the parsing
problems of unrestricted text are notoriously difficult (e.g. coordinate
conjunction)!
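
To make the idea of a generic database model concrete, here is a minimal
sketch in Python of storing predicate-argument structures and querying
them. Everything in it is an assumption made for illustration: the table
name pred_arg, the role labels "agent" and "theme", the helper
sentences_with, and the hand-built rows are hypothetical and are not taken
from any of the systems mentioned above. It also skips the hard part,
which is the parsing itself.

```python
import sqlite3

# Hypothetical generic schema: one row per (sentence, predicate, role, filler).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pred_arg ("
    " sentence_id INTEGER, predicate TEXT, role TEXT, filler TEXT)"
)

# Hand-built structures for two toy sentences; a real system would
# obtain these from a parser:
#   1: "The dog chased the cat."
#   2: "The cat chased the mouse."
rows = [
    (1, "chase", "agent", "dog"),
    (1, "chase", "theme", "cat"),
    (2, "chase", "agent", "cat"),
    (2, "chase", "theme", "mouse"),
]
conn.executemany("INSERT INTO pred_arg VALUES (?, ?, ?, ?)", rows)


def sentences_with(predicate, role, filler):
    """Return ids of sentences where `filler` fills `role` of `predicate`."""
    cur = conn.execute(
        "SELECT DISTINCT sentence_id FROM pred_arg"
        " WHERE predicate = ? AND role = ? AND filler = ?"
        " ORDER BY sentence_id",
        (predicate, role, filler),
    )
    return [row[0] for row in cur]
```

With this layout, sentences_with("chase", "agent", "cat") finds the
sentences in which the cat does the chasing, which a plain keyword search
for "cat" and "chase" could not distinguish from the sentences in which
the cat is chased.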
-Stephen B. Johnson, Ph.D.
-Assistant Professor
-Department of Medical Informatics
-Columbia University