Corpora: CLAWS sentence enumeration

Kristine (kristine.monstad@eng.uib.no)
Thu, 16 Jul 1998 13:48:46 +0200

Hello everybody,

we are working on the COLT (the Bergen Corpus of London Teenage Language)
project at the University of Bergen. Currently we are in the process of
editing the tagged version of COLT, which has been tagged at the University
of Lancaster using the CLAWS6 tagset. The editing involves inserting both
new utterances and overlap tags. We therefore need to know the principles
behind the numbering of sentences and of overlap tags/brackets, as
illustrated by:

<u id=D11 who=W9>
<ptr t=P0003> <unclear> <ptr t=P0004>
</u>
<u id=D12 who=W1>
<s c="0000037 002" n=00012>
<ptr t=P0003> But&CCB; at <ptr t=P0004> least&RR; when&CS; we&PPIS2;

In particular, we wonder about the 'c="0000037 002"' component. For
instance, are we correct in assuming that 002 refers to the first sentence
in a turn? If so, how are the following sentences within the same turn
numbered? And what does the '0000037' part refer to?

We also wonder about the numbers in the overlap tags (<ptr t= >). As we
understand it, the example above shows correct numbering. We ask because
we have seen instances where the numbering is different.
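
To make such comparisons easier, the fields in question can be listed
mechanically. The following is only a minimal sketch in Python: the regular
expressions and the name list_numbering are our own assumptions, based solely
on the excerpt above, and the code makes no claim about what the numbers mean.
It simply collects the two parts of c, the value of n, and every <ptr t= >
value, each together with the id of the enclosing <u>.

```python
import re

# Sample copied from the excerpt above (it is cut off in the same place).
SAMPLE = '''<u id=D11 who=W9>
<ptr t=P0003> <unclear> <ptr t=P0004>
</u>
<u id=D12 who=W1>
<s c="0000037 002" n=00012>
<ptr t=P0003> But&CCB; at <ptr t=P0004> least&RR; when&CS; we&PPIS2;
'''

# Assumed tag shapes, inferred only from the sample.
U_TAG = re.compile(r'<u id=(\w+) who=(\w+)>')
S_TAG = re.compile(r'<s c="(\d+) (\d+)" n=(\d+)>')
PTR_TAG = re.compile(r'<ptr t=(P\d+)>')


def list_numbering(text):
    """Collect sentence and pointer numbers, each with its utterance id."""
    sentences, pointers = [], []
    current_u = None
    for line in text.splitlines():
        u_match = U_TAG.search(line)
        if u_match:
            current_u = u_match.group(1)
        for m in S_TAG.finditer(line):
            # (utterance id, first part of c, second part of c, n)
            sentences.append((current_u, m.group(1), m.group(2), m.group(3)))
        for m in PTR_TAG.finditer(line):
            # (utterance id, pointer value)
            pointers.append((current_u, m.group(1)))
    return sentences, pointers


if __name__ == "__main__":
    found_sentences, found_pointers = list_numbering(SAMPLE)
    for row in found_sentences:
        print("s  ", row)
    for row in found_pointers:
        print("ptr", row)
```

Run over a whole file, the two lists make it easy to spot places where the
numbering departs from the pattern in the example.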

We have tried to get this information from both Lancaster and the internet,
but without success. We hope some of you can help, and we are grateful for
any response.

Hanne and Kristine

kristine.monstad@eng.uib.no