Corpora: Summarization of HTML documents

Noemi Preissner (noemi@dfki.de)
Wed, 6 Aug 1997 15:10:24 +0200 (MET DST)

Hi,

I would like to automatically summarize HTML documents returned
by a search engine for a given query. I am interested in
two different kinds of summary: a tailored summary which takes
into account the keywords of the query and might, e.g., consist
of all the sentences containing those keywords, and a neutral
summary which should be independent of the query. The second
case is obviously harder than the first, though I have some
ideas, such as listing all the headings (which should be quite
easy to detect in an HTML document) or determining keywords
from word frequencies: if a word occurs very often in the
document although it is not that frequent in the language in
general, it could be considered a keyword.
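
To make these ideas concrete, here is a rough Python sketch of all
three: the tailored summary (sentences containing query keywords),
the heading list, and keywords picked by comparing document
frequency with general-language frequency. It is only an
illustration under stated assumptions, not a finished method. All
function and class names are hypothetical, the sentence splitter is
deliberately naive, and background_freq stands for a table of
relative word frequencies from some general corpus of the relevant
language, which would have to be supplied separately for English,
French and German.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class HeadingAndTextParser(HTMLParser):
    """Collects heading texts (h1..h6) and the remaining visible text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style", "title"}  # not treated as body text

    def __init__(self):
        super().__init__()
        self.headings = []
        self.text_parts = []
        self._in_heading = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.headings.append("")
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self.headings[-1] += data  # heading text is kept apart from body
        else:
            self.text_parts.append(data)


def extract(html):
    """Return (list of headings, body text) for one HTML document."""
    parser = HeadingAndTextParser()
    parser.feed(html)
    parser.close()
    headings = [" ".join(h.split()) for h in parser.headings if h.strip()]
    text = " ".join(" ".join(parser.text_parts).split())
    return headings, text


def tokenize(text):
    # \w+ is Unicode-aware in Python 3, so accented letters are kept.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tailored_summary(text, query_terms):
    """Sentences that contain at least one of the query keywords."""
    wanted = {t.lower() for t in query_terms}
    return [s for s in split_sentences(text) if wanted & set(tokenize(s))]


def keywords(text, background_freq, top_n=5, min_count=2):
    """Words that are frequent here but rare in the language in general.

    background_freq maps a word to its relative frequency in a general
    corpus (an assumed, externally supplied resource). Words missing
    from it get a small default value, which is itself an assumption.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    default = 1e-6
    scores = {
        w: (c / total) / background_freq.get(w, default)
        for w, c in counts.items()
        if c >= min_count
    }
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

For one retrieved page, extract(html) would give the heading list
and the body text; tailored_summary(text, query_terms) then gives
the query-dependent summary, while the headings together with
keywords(text, background_freq) could serve as a first
approximation of the neutral one. The ratio used for scoring is
just one simple choice; other weightings are possible.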

I would like to summarize English, French and German texts,
and I would be very grateful for further suggestions. I am
also interested in literature on the subject, so thanks in
advance for any pointers!

Noemi (noemi@coli.uni-sb.de)