9 Developments

Though the tagged LOB Corpus is now presented as a completed product,43 it should only be seen as a stepping-stone in a more long-term development. Johansson and Hofland (forthcoming) will present analyses on the following levels: tag frequencies, word frequencies, tag combinations, and word combinations. The results should lead to improvements of the tagging programs.

Over the last few years the Lancaster group has continued its work on automatic word tagging. A revised tag set is proposed in Booth (1985). More important, the probabilistic principles used in word tagging have been extended to higher-level grammatical analysis; see Garside and Leech (1985) and Garside, Leech and Sampson (forthcoming). One of the aims is to produce a syntactically parsed version of the LOB Corpus.

Notes

  1. Note that inflected forms of numerals were tagged as nouns in the Brown Corpus. One in the Brown Corpus is tagged as a numeral, pronoun or noun, corresponding to one major tag (CD1) in the LOB Corpus.
  2. The Lancaster team was responsible for the pre-editing (4.1) and the writing of the tagging programs (4.2-4). The WORDLIST and the SUFFIXLIST (4.2) were prepared in Norway. Post-editing (4.5) was originally shared by the English and Norwegian research teams, but a great deal of additional checking was made in Norway. The Norwegian contribution also includes the final preparation of the two versions of the text (Section 2) and the production of the concordance (Section 8).
  3. Reported in Francis (1980); for results and analysis of the automatic tagging, see Francis and Kucera (1982).
  4. Each of the three programs was written by a different member of the Lancaster team: A by Roger Garside, B by Eric Atwell, and C by Ian Marshall.
  5. An experiment carried out by Knut Hofland at Bergen in 1982 gave encouraging support to the view that manual pre-editing could be dispensed with. The LOB tagging programs were applied to a machine readable copy of John Osborne's Look back in Anger, a text not included in the LOB Corpus. Automatic pre-processing followed by automatic tagging resulted in a success-rate in the region of 90%. This was without modifications to the programs themselves, which are designed to accept the specially pre-edited text of the LOB Corpus. See further Booth (forthcoming).
  6. The Brown Wordlist contained c 3000 words, and the Brown suffixlist contained c 450 word-endings. See Johansson and Jahr (1982) on the LOB suffixlist.
  7. The marker @ indicates that a tag has (notionally) an intrinsic likelihood of 10% or less; the marker % indicates that a tag has (notionally) an intrinsic likelihood of 1% or less. The tags are also output in order of likelihood, more likely tags being placed to the left of less likely ones. To this extent, the Tag Assignment program makes use of probabilities.
  8. We are grateful to Johan Elsness and Kay Wikberg (both of the University of Oslo) for helpful discussions of some problem areas, to Arne S. Svindland (University of Bergen) for suggestions concerning the treatment of so, and to Francoise Keulen (University of Nijmegen) for some assistance in post-editing.
  9. In giving examples we include as little coding as possible, usually only the tags under discussion. Abbreviation codes (\0) are normally omitted at the beginning of words. Contractions are not split up. Initial capitals are used at the beginning of sentences. The line reference gives a single line only, although an example may extend to neighbouring lines.
  10. Note that contractions are split up; see 5.3 and the end of 7.24. Also separated are combinations of numeral plus a unit of measurement; see 7.19.
  11. The same tagging was used for the inflected and uninflected forms, as the addition of $ to mark the genitive would have made the tag too long (the maximum length of a tag was five characters).
  12. Another, more unusual, example of conflicting contextual clues is found in: ... he is certainly right in the drawing attention to the apparent inconsistency of ... (B23:120). Drawing is preceded by the definite article (a noun-like feature) and is followed by a direct object (a verb-like feature). The following is a 'hybrid' of a different kind: ... were the fruits of putting into practice of this "modern" experimental scientific attitude (J37:42). Here there is a postmodifying of-phrase (pointing towards NN) but the -ing form is not preceded by a determiner as is usually the case with -ing forms postmodified by of-phrases.
  13. Uninflected forms of collective nouns with distinctive plural forms (committee, government, etc) were tagged NN irrespective of the form of co-occurring verbs and co-referent pronouns. Thus NN is used in: the committee are ..., the government have ..., etc.
  14. A distinction is made between God (NP) and god (NN).
  15. Note also NP in the case of: the States (=the United States).
  16. The City (of London) was treated as NPL.
  17. A distinction is made between Aunt (NPT) and aunt (NN), Ma'am (NPT) and ma'am (NN), etc.
  18. NPT is kept in examples like: a B.A. degree (E31:201), his M.A. thesis (J37:10).
  19. JNP also applies to the abbreviations Ltd and Inc.
  20. Note the treatment of East Germany and West Germany (NP NP) vs east Berlin and west Berlin (NR NP). The East End and the West End of London were tagged: NP NP.
  21. JJB was assigned to hyphenated attributive forms like 14-year-old and post-mortem, although these also occur in clearly nominal positions. But there is a difference in meaning between the attributive forms and the corresponding nouns. Cf notes 38 and 41. - Difficulties sometimes arise with compound nouns, as in: science-fiction stories (G36:138). Should the hyphenated form be JJB or NN? The tag NN was assigned, as the hyphenated form occurs in the LOB Corpus in non-attributive positions (though science fiction is the more common form). JJB was only assigned where occurrence in non-attributive position is excluded or very unlikely.
  22. But JJ for black and JJB for pin-stripe were kept in the following example, although the head noun is not recoverable from the context: His voice was like his black and pin-stripe, a grey superimposition of respectability over the original colour of his own natural vowels... (R03:98). It is the whole coordinate expression which functions as a noun rather than the individual words.
  23. But contrary was tagged NN in: on the contrary, to the contrary. This word can be used more freely in nominal positions (and can even be pluralised).
  24. Note that it is possible to insert a degree word which does not accompany nouns: the very best, the very poorest, etc.
  25. Other exceptions are RI/RP words which can be part of ditto-tagged sequences, e.g. but, which can be RI (=only) but receives an RB tag in all but (RB RB"), and in, which can both be RP and appear in idioms: in general RB RB", in short RB RB", etc.
  26. The 'normalcy principle' (Section 6) was used in selecting ABN for half in expressions of the type half past two (ABN IN CD).
  27. Little can, of course, also be JJ. JJ was assigned before count nouns (a little boy, little boys, etc) and AP with uncountables (little money, little interest, etc) and in nominal positions (I know very little, the little I know, etc). The distinction is generally clear, though the automatic tagging programs could not handle it, as nouns were not subclassified as 'count' and 'mass'.
  28. A particular problem with less is the idiomatic expression exemplified in: me, a man, a writer of bloodthirsty tales, John Laker Considine, no less! (N26:180). Less was tagged AP here, following the 'normalcy principle' (Section 6).
  29. Other -ing forms which might have been treated as IN are adjoining and depending (on, upon). These were tagged VBG. - Another verb form treated as IN is given in examples like: ... given fine weather, another crop could still be gathered. (G19:178).
  30. Note further less in examples like: a total distribution of 10 per cent less tax (A38:27).
  31. Cf also next to (treated as RB IN) and next (most often tagged AP; cf 7.12), which is once found as a preposition: ... discovered that they could not sit next a man at dinner and be agreeable (K214).
  32. Supposing was always tagged VBG, although it might be treated as a conjunction in an example like: ... the sensitivity of some nerves would be bound to be affected at the finger extremities even supposing there has been no bruising of tissue (1,23:93). Suppose was always treated as VB. Granted that was tagged VBN CS.
  33. Note that these sequences were not idiom-tagged. As regards as soon as, see the end of this section. The moment... was simply tagged ATI NN.
  34. Though was always treated as CS in clause-initial position (provided that it was not marked off by a comma), even in examples like: "Did Mrs Cummings object to you bathing?" - I don't know about the bathing, but she didn't want her house messed up. Though one morning she did catch me, and I was the usual ingrate and so on and so on. ..." (P16:151).
  35. Prepositions occasionally precede other verb forms as well, e.g. with quotations, non-standard English, and parenthetical verb forms. Examples: in 'Let's Make Love' (F44:49); Bill, he's better man at catch 'em than Injun, Judge (N03:175); can be cut from, say, sheet celluloid (J76:158).
  36. This tagging, which follows from our 'consistency principle' (Section 6), is questionable. In the related combination take for granted there is some inconsistency of tagging (CS, IN).
  37. Less than is used in a similar way in: is less than desirable. Idiom-tagging of more than (and less than) should perhaps also have been used in examples like: for more than ten years, less than a week afterwards. Cf also: for upwards of five years (1301:73), it was close on eleven o'clock (N11:39).
  38. Since 14-year-old is not limited to attributive position, the tag JJB would seem inappropriate. But such hyphenated forms are characteristically attributive and are different in meaning from nouns like 14-year-old ('14-year-old person').
  39. Damn(ed) and darn(ed) should really be JJB rather than JJ and QL rather than RB.
  40. There is some inconsistency in the treatment of hyphenated forms consisting of a numeral and a unit of measurement. In attributive position it was felt that JJB was appropriate, since corresponding sequences of a numeral and a full noun are JJB (cf 7.8). Nevertheless, hyphenated forms with a numeral and an abbreviated unit of measurement occur in some texts both in attributive position and in clearly nominal positions. The conditions for JJB are then not fulfilled.
  41. Since post-mortem is not limited to attributive position, the tag JJB would seem inappropriate. But this hyphenated form is characteristically attributive and is different in meaning from the noun post-mortem ('a post-mortem examination'). Cf note 38.
  42. It makes no sense to apply the ordinary tags to foreign phrases. This would lead to un-English sequences (NN JJ) in: persona grata, tic douloureux, etc. And what tags should be chosen in cases like mirabile dictu and sine qua non?
  43. Some errors have been discovered after the concordance and the two versions of the text were produced. See the errata list in the brief descriptions accompanying the text and the concordance.