3. VERSIONS OF SEC MATERIAL

There are 5 versions of the material in the Corpus:

1. Spoken recording

2. Unpunctuated transcription

3. Orthographic (punctuated) transcription

4. Prosodic transcription

5. Grammatically tagged version

These five forms of the corpus are related in the following way:

The unpunctuated transcriptions and the prosodic transcriptions were produced using the spoken recordings; the punctuated transcriptions were produced from the unpunctuated transcriptions; and the tagged versions were produced using the punctuated transcriptions. Versions 1, 2, 3, and 4 are produced manually; version 5 is produced semiautomatically. The following sections give details of the different versions.
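The derivation relationships just described can be written down as a small dependency table. The following is a minimal sketch only; the name SOURCES, the function ancestry, and the version labels are illustrative and are not identifiers used in the corpus.

```python
# Each version maps to the version(s) it was produced from, as stated above.
# The labels are illustrative names, not identifiers used in the corpus.
SOURCES = {
    "spoken recording": [],
    "unpunctuated transcription": ["spoken recording"],
    "prosodic transcription": ["spoken recording"],
    "orthographic transcription": ["unpunctuated transcription"],
    "tagged version": ["orthographic transcription"],
}


def ancestry(version):
    """Return every version that `version` ultimately derives from."""
    chain = []
    stack = list(SOURCES[version])
    while stack:
        src = stack.pop()
        if src not in chain:
            chain.append(src)
            stack.extend(SOURCES[src])
    return chain


print(ancestry("tagged version"))
```

Walking the table in this way makes it easy to see which earlier versions must exist before a given version can be produced.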

3.1 Spoken Recording

The spoken recordings are of three types: (1) those made on high-quality tape facilities at IBM and on cassette at UL; (2) those made on high-quality tape facilities using Media Services at UL; and (3) those purchased from the relevant source (on cassette), and later copied onto high-quality tape at IBM.

Type 1 covers all material obtained from the BBC. The relevant programmes are:

In Perspective
From Our Own Correspondent
News (R4, R3, R2)
The Reith Lectures
Daily Service
Money Box
Story Time
Listening & Reading
Morning Story
Time for Verse
Week's Good Cause
Motoring News
Weather Forecast
Programme News

In some cases it was necessary to edit out portions of the programmes where there was unintelligible speech, background noise, or any feature that was felt to be unacceptable in the corpus. The resulting "clean" copies of the recordings are now stored on video tape. The video tapes will be used as the "master" copies, and cassette copies made from these as necessary.

Type 2 recordings are those made using volunteer speakers from the University of Lancaster. The Media Services Unit at the University provided recording facilities and experienced staff to supervise recording sessions. Recordings were made using 10" reel-to-reel tape, 7 1/2 ips. The relevant samples are:

Colin Lyas: Nelson Mandela speech
Colin Lyas: Tom Stephenson speech
Rita Kempson and Heather Green: dialogue

These recordings have now been copied (after appropriate editing) onto video tape as for type 1 recordings.

Type 3 covers the following samples:

Open University tape: Modern Art
Open University tape: Science & Belief in 18th Century France
Open University tape: Development of Fractions

Review of the Year

OUP: Streamline English series

Decca tapes: Betjeman reads Betjeman

The Open University samples were supplied on cassettes in the same format as is used by students or by the BBC for broadcasting, i.e. there is an introductory "tone" before each text, accompanied by the name of the course, course details, etc. This information was edited out before the master copy was made.

The Review of the Year tape is not strictly a type 3 recording as it was supplied on a 10" reel-to-reel tape, and is therefore of higher quality than the cassettes obtained. It has been edited and copied to video tape.

The Oxford University Press Streamline English Series is supplied on cassettes with introductory material before each unit of the course. This material was edited out before copying onto video tape.

The Decca tapes of John Betjeman reading his own poetry are cassettes which were purchased from Virgin Records, Lancaster. The four samples used were copied onto video tape.

The video masters are now completed and include an identifying introduction before each sample, giving category letter and part number. There are pauses between the samples, enabling easy cuing to any text on the tape. A full cassette copy of the corpus is held at Lancaster (one category per cassette), and video and reel-to-reel masters are held at IBM UKSC.

3.2 Unpunctuated transcriptions

The unpunctuated transcriptions were made using the spoken recordings. The text was typed directly onto computer, and it was at this point that unacceptable text was noted and replaced by a comment in the transcription, for example "[speech extract omitted]". Speaker details were also included in comments, for example "[change of speaker: speaker name]". No word-initial capitals are used apart from those in proper names and abbreviations; thus no indication of the start of a sentence is given in this format of text. Stops are not used in abbreviations: for example, "PLO" is used rather than "P.L.O.", and similarly "Mr" rather than "Mr.", as this seems to be the more common convention in general use. Numbers are treated in the same way as in a standard text: for example, they are written as digits in addresses ("10 Green Street"), and also in telephone numbers, quantities of money, and decimal numbers. They are written as words if they would normally be found so. Each text is preceded by four lines of comments giving details about category, part number, absolute number (the position of the text in the corpus as a whole), speaker(s), and recording details, e.g.:

[001 SPOKEN ENGLISH CORPUS TEXT A01]
[In Perspective]
[Speaker: Rosemary Hartill]
[Broadcast notes: Radio 4, 7.45a.m., 24th November, 1984]
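The four comment lines can be read mechanically. The following is a minimal sketch, assuming that the three-digit number on the first line is the absolute number and that the code after "TEXT" is the category letter followed by a two-digit part number; this layout is inferred from the single example above, and the names HEADER_RE, parse_header and the dictionary keys are illustrative.

```python
import re

# Assumed layout of the first header line, inferred from the example above:
# [<absolute number> SPOKEN ENGLISH CORPUS TEXT <category letter><part number>]
HEADER_RE = re.compile(r"^\[(\d{3}) SPOKEN ENGLISH CORPUS TEXT ([A-Z])(\d{2})\]$")


def value_after_colon(line):
    """Strip the square brackets and return the text after the first colon."""
    _, _, value = line.strip().strip("[]").partition(":")
    return value.strip()


def parse_header(lines):
    """Parse the four bracketed comment lines that precede a text."""
    first, title, speaker, notes = [ln.strip() for ln in lines[:4]]
    m = HEADER_RE.match(first)
    if m is None:
        raise ValueError("unrecognised header line: %r" % first)
    return {
        "absolute_number": int(m.group(1)),
        "category": m.group(2),
        "part_number": int(m.group(3)),
        "title": title.strip("[]"),
        "speakers": value_after_colon(speaker),
        "recording_details": value_after_colon(notes),
    }


example = [
    "[001 SPOKEN ENGLISH CORPUS TEXT A01]",
    "[In Perspective]",
    "[Speaker: Rosemary Hartill]",
    "[Broadcast notes: Radio 4, 7.45a.m., 24th November, 1984]",
]
print(parse_header(example))
```

Headers that depart from the assumed layout raise an error rather than being guessed at, so any variation in the real files would be noticed immediately.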

The unpunctuated transcription was used in the production of the punctuated transcriptions and the prosodic transcriptions. Both these versions had to be produced without either version influencing the other, i.e. the punctuated version must be free from any prosodic information, and the prosodic version must be free from any punctuation clues. The only way to ensure this was to have the unpunctuated transcription as the starting point for both of these versions.

3.3 Orthographic transcriptions

As mentioned in (3.2), these were produced from the unpunctuated transcriptions. The volunteer punctuator was given a text (no volunteer had access to the spoken recording) and was asked to insert punctuation at appropriate points. As an aid, a handbook on punctuation conventions was provided (based on Appendix III of the Comprehensive Grammar of the English Language); if the punctuator was doubtful about a particular mark, he was to be guided by the handbook. Volunteers at UL and IBM participated in this exercise. Help was given where the punctuator could not insert appropriate punctuation, for example where a sentence was ambiguous and could be interpreted in two ways, or where a section of unpunctuated text seemed to be nonsense whichever way it was punctuated. There were very few of these cases.

For the following samples a transcript or original text was already available:

Streamline English material (G03, G04, J02, J03, J04, J05)
The Reith Lectures - III (C01)
John Betjeman's poems (H01, H02, H03)
Sir Henry Newbolt's poems (H04, H05)
The Kempson and Green dialogue (J06)
Colin Lyas speeches (M05, M06)

For the Streamline English material the transcriptions were taken from the students' book which accompanied the tapes.

For the Reith Lectures - III the transcription was taken from The Listener magazine (21st November, 1985), but was amended to take account of the speaker's use of enclitics "It's", "there's" in the broadcast as opposed to the longer forms "It is", "there is" in the written text.

The transcription of John Betjeman's poems was taken from John Betjeman's Collected Poems, John Murray Publishers, 1958. The prose sample was treated as an unscripted text (see above).

The transcription of Henry Newbolt's poems was taken from A Perpetual Memory and Other Poems, John Murray Publishers.

The Kempson and Green dialogue was partly scripted and transcribed orthographically by the speakers, and this was used to produce the full transcription.

The Nelson Mandela and Tom Stephenson speeches by Colin Lyas were supplied along with full orthographic transcriptions.

3.4 Grammatically tagged version

The corpus was grammatically tagged using the LOB suite of tagging programs (otherwise known as CLAWS1) with the orthographic text as input. It has also been tagged using the most recent version of the programs (CLAWS2).

A full description of CLAWS1 can be found in Garside et al (1987). Briefly, the system works in five stages, all but the last being automatic.

1. The first stage, pre-editing, involves converting the text into a vertical format, one word per line (punctuation is treated as a "word"), in preparation for the next stage of processing. Punctuation is separated from the word, if necessary.

2. The second stage, word-tagging, involves deciding which tags are appropriate for each word. This may be done by finding the word in the word-list, or its ending in the suffix-list, or by stripping the plural "s" off the word and searching the lists, or, in the case of hyphenated words, by searching for each part of the word separately. If all strategies fail, the word will be assigned a default list of tags.

3. The third stage, idiom-tagging, deals with combinations of words such as "in order to", assigning them a "ditto" tag, i.e. an overall tag, rather than leaving the structure with each word tagged individually.

4. The fourth stage, disambiguation, chooses the correct tag from the list of tags against the word, basing its decision on the surrounding context.

5. The last stage involves some manual post-editing. The post-editor has to check that the tag chosen against each word is the correct one, and if it is not, to indicate the correct tag, so that the re-formatter will pick it up.
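Stages 1 and 2 above can be illustrated with a short sketch. This is not the CLAWS1 code: the word-list, suffix-list, default tag list and tag names below are toy placeholders, and the handling of hyphenated words (taking the tags found for the final part) is an assumption made only to keep the example small.

```python
import re

# Toy word-list and suffix-list; the entries and tag names are illustrative
# placeholders, not the real CLAWS1 lexicon or tagset.
WORD_LIST = {"the": ["ATI"], "cat": ["NN"], "walk": ["NN", "VB"], ".": ["."]}
SUFFIX_LIST = {"ing": ["VBG", "NN"], "ly": ["RB"]}
DEFAULT_TAGS = ["NN", "VB", "JJ"]


def pre_edit(text):
    """Stage 1: one token per item, punctuation split off as its own 'word'."""
    return re.findall(r"[A-Za-z0-9'-]+|[^\sA-Za-z0-9]", text)


def lookup(word):
    """Search the word-list, then the suffix-list; return None if both fail."""
    if word in WORD_LIST:
        return WORD_LIST[word]
    for suffix, tags in SUFFIX_LIST.items():
        if word.endswith(suffix):
            return tags
    return None


def candidate_tags(word):
    """Stage 2: try each look-up strategy in turn, then fall back to a default."""
    w = word.lower()
    tags = lookup(w)
    if tags is None and w.endswith("s"):
        # Strip the plural "s" and search the lists again. This sketch returns
        # the stem's tags unchanged; it does not derive a plural tag.
        tags = lookup(w[:-1])
    if tags is None and "-" in w:
        # Assumption: use the tags found for the final part of the word.
        tags = lookup(w.split("-")[-1])
    return tags if tags is not None else DEFAULT_TAGS


for token in pre_edit("The cats walk slowly."):
    print(token, candidate_tags(token))
```

The output of this stage is a list of candidate tags per word; choosing among them is the job of the later disambiguation stage, which is not sketched here.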

The tagged texts are stored in two forms: a vertical format with one word per line along with line reference number and tag; and a horizontal format with the word linked to its tag by the underline character. The horizontal format requires far less storage space, but is more difficult to process if used as input to other programs. An additional form of the tagged texts has the text in horizontal format with the appropriate tag underneath the word on the next line. This form is a lot easier to read, and a program has been written to produce it using the vertical format as input.
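The relationship between the three forms can be sketched as follows. The exact column layout of the vertical files is not given here, so the sketch assumes three whitespace-separated fields per line (reference number, word, tag); the sample records, tag names and function names are illustrative, and this is not the program referred to above.

```python
def read_vertical(vertical_lines):
    """Split each assumed 'reference word tag' record into a (word, tag) pair."""
    pairs = []
    for line in vertical_lines:
        _ref, word, tag = line.split()
        pairs.append((word, tag))
    return pairs


def to_horizontal(vertical_lines):
    """Horizontal format: each word linked to its tag by an underline."""
    return " ".join(word + "_" + tag for word, tag in read_vertical(vertical_lines))


def to_two_line(vertical_lines):
    """Readable format: words on one line, each tag underneath its word."""
    words, tags = [], []
    for word, tag in read_vertical(vertical_lines):
        width = max(len(word), len(tag))
        words.append(word.ljust(width))
        tags.append(tag.ljust(width))
    return " ".join(words).rstrip() + "\n" + " ".join(tags).rstrip()


# Illustrative records only; not taken from the corpus.
sample = ["0001 the ATI", "0002 cat NN", "0003 sat VBD", "0004 . ."]
print(to_horizontal(sample))
print(to_two_line(sample))
```

Padding each word and tag to a common width is what keeps the tags aligned under their words in the two-line form.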

Details of the coding symbols used in the LOB corpus to aid tagging and to preserve the format of the original texts can be found in Johansson et al (1986). As the material in the Spoken English Corpus was transcribed from tape, there are no special codes in the tagged version to indicate type changes or special characters.

3.5 Prosodic version

The prosodic transcriptions were produced using the unpunctuated versions of the text and the audio tapes. Transcribers were Dr Gerry Knowles at the University of Lancaster and Dr Briony Williams at IBM UK Scientific Centre, Winchester. The transcribers were allocated equal sections of the corpus each, and at least one section from each category was selected and transcribed by both for comparison, and checking of the transcription system.

Overlap sections are:

Category  No of Words    Category  No of Words
A04       168            F04       235
B01       179            G01       291
B02       150            G02       211
B03       201            G05       222
B04       141            H03       157
C01       116            H04       148
D01       114            J01       200
D02       165            J02       279
D03       152            J04        74
E01       128            J06       597
E02       134            M01        93
F01       219            M06       306

The 24 overlap passages (a total of 4680 words) constitute 9 per cent of the corpus.
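The totals quoted above can be checked directly from the word counts of the overlap sections. In the sketch below the name OVERLAP_WORDS is illustrative, and the implied corpus size is only a back-calculation from the rounded figure of 9 per cent, not a count given in this section.

```python
# Word counts of the overlap sections, keyed by category and part number.
OVERLAP_WORDS = {
    "A04": 168, "B01": 179, "B02": 150, "B03": 201, "B04": 141, "C01": 116,
    "D01": 114, "D02": 165, "D03": 152, "E01": 128, "E02": 134, "F01": 219,
    "F04": 235, "G01": 291, "G02": 211, "G05": 222, "H03": 157, "H04": 148,
    "J01": 200, "J02": 279, "J04": 74, "J06": 597, "M01": 93, "M06": 306,
}

passages = len(OVERLAP_WORDS)
total_words = sum(OVERLAP_WORDS.values())
print(passages, total_words)  # 24 4680

# If 4680 words are roughly 9 per cent of the corpus, the corpus as a whole
# is of the order of 52,000 words (approximate, since 9 per cent is rounded).
implied_corpus_size = round(total_words / 0.09)
print(implied_corpus_size)
```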

3.5.1 Prosodic characters

A set of 14 special characters is used to represent prosodic features in the texts:

Some notes on the interpretation of these characters are necessary:

(1) Stressed and accented syllables:

An accented syllable has an independent pitch movement associated with it, known as the tone. Tones are marked with iconic symbols representing the pitch movement.

Syllables which are felt to be stressed but not accented (i.e. they are prominent but have no independent pitch movement) are marked with a circle.

Unstressed syllables are left unmarked.

The pitch of all unaccented syllables is predictable from the tone marks on neighbouring accented syllables.

(2) Pitch direction:

The terms fall, rise, and level describe pitch movements which begin on the vowel of the accented syllable (or in the case of a falling diphthong, on the first element of the diphthong). Any pitch movements before this point are ignored in this terminology.

Most of the pitch movement of a fall is completed on or soon after the accented syllable, leaving a slight drop over the tail of the tone group. A rise might start almost level with a marked increase in slope towards the end of the tone group.

The level tone is not strictly level except possibly in special styles in which it is intoned. Any rising or falling is insufficient for the tone to be classed as a rise or a fall.

(3) Simple and complex tones:

Simple tones move in only one direction, up or down. Complex tones change pitch direction: fall-rise and rise-fall.

The fall of the fall-rise is completed quickly, and the completion of the rise is delayed to the end of the tone group. The rise of the rise-fall is completed quickly on the accented syllable, and the fall is completed as soon as possible thereafter.

A phonetic variant of the fall-rise is the "shallow fall": instead of falling to low and rising again, the pitch movement is cut off before it reaches low.

(4) High and low:

A distinction is made for all tones between a high and a low variety. A high tone begins higher than the preceding pitch level, and a low tone begins lower than the preceding pitch level.

(5) Up arrow and down arrow:

These are used to indicate significant changes of pitch which are not sufficiently marked by the tone symbols. The up arrow indicates a rise in pitch and the down arrow a drop in pitch.

When used in conjunction with tone marks they indicate a rise or drop in pitch which is significantly greater than that indicated by the high or low position of the tone mark alone.

Used alone, on an unstressed syllable, the arrow marks a pitch pattern which is not predictable from neighbouring tone marks. At the beginning of a tone group, the arrow indicates that the pitch contour begins significantly above or below its expected level towards the bottom of the pitch range.