Corpora: "The Start of a Stop List at BA" by Barbara J. Flood

From: Einat Amitay (einat@ics.mq.edu.au)
Date: Thu May 18 2000 - 04:57:05 MET DST


    Hi all,

    I know I'm probably doing something I shouldn't - posting a full text of
    an article that doesn't belong to me - but it is so short - and so
    relevant that I had to.

    Many times corpora people ask about stop lists. This short text tells a
    good story, and may add some references to the list some of us maintain.

    +:o)
    einat

    ------------------
    This article has been accepted for publication in the Journal of the
    American Society for Information Science. Copyright © 1999 John Wiley &
    Sons, Inc.

    The Start of a Stop List at BA
    Barbara J. Flood
    ARC/Philadelphia Developmental Disabilities Corp.
    PDDC/ARC@Libertynet.org

    The start was 1961. Biological Abstracts' (BA's) traditional subject
    index was falling further and further behind. BA decided to try the
    recently introduced keyword-in-context indexing. This title index became
    Biological Abstracts' Subject in Context or BASIC, first published in
    October-November of 1961 after a year of planning (Biological Abstracts,
    1961).

    I was given early runs of computer print-outs to look through
    editorially. There were inches of line printer pages with index entry
    words such as "of," "the," "in," and "and." It might be of trivial interest
    that "of" was the most prevalent, perhaps because these were titles in
    biology. I chose to delete these words and began to compile a list of
    words that were going to be automatically stopped from printing. This
    became a Stop List. I don't know whether it was the first to be called a
    "Stop List." By 1959, Luhn (1959, 1960) had suggested types of
    non-significant words to be omitted. "Stop words" was used by Parkins
    (1963), but "Stop List" was Stevens' (1965) term of choice, and
    "stoplist" was used by Fischer (1966).

    Members of the editorial department met often to discuss candidate stop
    list words generated from the print-outs to make sure that homographs
    (such as "a" in "Vitamin A" or "are" as a measure of area) were not
    overlooked. Thus, the Stop List was generated from frequency data with
    concurrence by committee.
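
    The procedure described above (rank words by frequency, have editors
    review the candidates, then suppress the approved words when building
    the index) might be sketched roughly as follows. This is only an
    illustration in modern Python, not BA's actual program: the function
    names, the sample titles, and the cutoff of three candidates are all
    assumptions made up for the example.

```python
from collections import Counter

def candidate_stop_words(titles, top_n):
    """Rank title words by frequency; the top ones are only candidates."""
    counts = Counter(word.lower() for title in titles for word in title.split())
    return [word for word, _ in counts.most_common(top_n)]

def kwic_entries(titles, stop_list):
    """Return sorted (keyword, title) entries, skipping stop-listed words."""
    entries = []
    for title in titles:
        for word in title.split():
            if word.lower() not in stop_list:
                entries.append((word.lower(), title))
    return sorted(entries)

# Hypothetical titles, invented for this sketch.
titles = [
    "Effects of vitamin A in the rat",
    "Growth of Rana esculenta in the laboratory",
]

candidates = candidate_stop_words(titles, top_n=3)

# The "committee" step: editors approve candidates by hand. Here all three
# are accepted; "a" was never proposed, so "vitamin A" keeps its entry.
stop_list = set(candidates)

entries = kwic_entries(titles, stop_list)
for keyword, title in entries:
    print(f"{keyword:12} {title}")
```

    Keeping the review step separate from the frequency count is what
    lets a homograph such as "a" survive even when it is frequent.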

    Multiple word terms such as Rana esculenta had to be evaluated as to
    whether the second word provided a significant index entry. Should the
    second word be added to the Stop List? Was it frequent enough? The cost
    of adding a word to the Stop List with resultant added computer time had
    to be compared to the cost per copy of printing the extra line.

    The Stop List grew rapidly. Parkins soon reported (Parkins, 1963) that
    14 words prevented eighty percent of the entries for BASIC and that at
    the time there were already 1,000 words. This is comparable to the
    experience at Chemical Abstracts Service with Chemical Titles; Freeman &
    Dyson (1963) report an initial list of 750 words, expanded to 950, and
    then culled to 328 words. But each additional comparison added to the
    cost of the computer run. Was eighty percent enough? The decision was
    made on the basis of frequency. A word that did not show up often was
    not worth the extra sort comparison. My favorite examples are
    typographical, such as "hte" and "fo." Because nobody is going to look up
    these words in an index, it isn't important that there be an extra line
    or two.
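
    The frequency argument can be made concrete with a small calculation:
    given word counts, what share of would-be index entries do the k most
    frequent words account for? The sketch below is an assumption-laden
    illustration only; the counts are invented and are not BA's data, and
    the function name is hypothetical.

```python
from collections import Counter

def suppressed_share(counts, k):
    """Fraction of all index entries removed by stopping the k most frequent words."""
    total = sum(counts.values())
    stopped = sum(n for _, n in counts.most_common(k))
    return stopped / total

# Invented counts for illustration; they sum to 100 entries.
counts = Counter({"of": 50, "the": 30, "in": 10, "rana": 6,
                  "esculenta": 2, "hte": 1, "fo": 1})

share_top2 = suppressed_share(counts, 2)   # "of" and "the"
share_top3 = suppressed_share(counts, 3)   # adds "in"
print(share_top2, share_top3)
```

    With counts shaped like these, each further word added to the list
    removes fewer entries than the one before, which is the sense in
    which a rare misprint like "hte" is not worth an extra comparison.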

    Later, considerable editorial "augmentation" modified titles, but the
    Stop List was the start. The Stop List could be run automatically. It
    removed clutter for the user and reduced cost for the producer. The
    title index could also be produced in a timely manner. The Stop List
    provided an improvement over the raw computer output of titles while
    retaining the advantages of a computer-produced keyword-in-context
    index.

    Acknowledgment
    I thank the editor and anonymous reviewers for helpful suggestions.

    References

    Biological Abstracts (1961). Introduction to the BASIC Index Volume 36
    part 4, October-November.

    Fischer, Marguerite (1966). The KWIC index concept: a retrospective
    view, American Documentation, 17:57-70.

    Freeman, R. R. & Dyson, G. Malcolm (1963). Development and production of
    Chemical Titles, a current awareness index publication prepared with the
    aid of a computer, Journal of Chemical Documentation, 3:16-20.

    Luhn, H. P. (1959). Keyword in Context Index for Technical Literature
    (KWIC Index), Yorktown Heights, N.Y., IBM, Report RC 127. Also in:
    American Documentation, 11:288-295, 1960.

    Parkins, Phyllis V. (1963). Approaches to vocabulary management in
    permuted-title indexing of Biological Abstracts, in Automation and
    Scientific Communication Part 1, Proceedings of The 26th Annual Meeting
    of the American Documentation Institute, (pp 27-28), Washington, D.C.:
    ADI.

    Stevens, Mary Elizabeth (1965). Automatic Indexing: A State-of-the-Art
    Report. National Bureau of Standards Monograph 91, pp 41, 64-66.
    -------------------

    --
    Einat Amitay
    einat@ics.mq.edu.au
    http://www.ics.mq.edu.au/~einat
    


