Re: Corpora: Papers and corpus software

James L. Fidelholtz (jfidel@siu.buap.mx)
Mon, 23 Feb 1998 12:43:14 -0600 (CST)

Dear Sameh:
It's not much for you, but I did an article on phonological
frequency effects in English in which I speculated (but on pretty solid
grounds, I think) that, at least for English, the division between
relatively common and relatively uncommon words occurs at a text frequency
of about 5 per million words of running text. Now, it is well known that
words that are relatively frequent in written texts tend, on the whole, to
be even more frequent in spoken texts; conversely, relatively infrequent
words occur even less frequently in spoken texts. I suggested that the
spoken and written frequencies should be about equal (that is, the
frequency graphs cross over) at that frequency of 5/M.
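
To make the 5/M figure concrete, here is a minimal sketch of how a raw count
could be normalized to occurrences per million running words and compared
against that threshold. The function names and the choice to treat a word at
exactly 5/M as "common" are assumptions made for illustration only; they are
not taken from the 1975 article.

```python
# Assumption: 5 per million is the approximate crossover frequency suggested above.
CROSSOVER_PER_MILLION = 5.0


def per_million(count, corpus_size):
    """Convert a raw token count into occurrences per million running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def frequency_band(count, corpus_size, threshold=CROSSOVER_PER_MILLION):
    """Label a word 'common' or 'uncommon' relative to the threshold.

    Treating a word exactly at the threshold as 'common' is an arbitrary
    choice for this sketch; the text only says "about 5 per million".
    """
    if per_million(count, corpus_size) >= threshold:
        return "common"
    return "uncommon"
```

For instance, a word seen 50 times in a 10-million-word corpus sits exactly at
5/M, while one seen 49 times falls just below it.
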
That, of course, is not the end of the story. Apart from the
genre skewing of frequencies I just mentioned, there are also types of
words which are found ONLY in speech (particularly: interjections
[markers, hesitation markers, etc.] and a few 'emotive' words like
_berserk_ [to my knowledge, found in NO published frequency count]).
While these factors are of course important, one would expect their
importance to diminish (except, perhaps, for interjections) insofar as you
are processing HUGE corpora, since a lowered probability multiplied by
a huge input still yields occurrences.
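
The arithmetic behind that last remark can be sketched as follows. The
expected count is simply the per-million rate scaled by corpus size; the
second function additionally assumes that occurrences follow a Poisson
distribution, which is a simplifying assumption of this sketch and not a
claim made above. All names are hypothetical.

```python
import math


def expected_occurrences(rate_per_million, corpus_size):
    """Expected number of tokens of a word with the given per-million rate."""
    return rate_per_million * corpus_size / 1_000_000


def chance_of_at_least_one(rate_per_million, corpus_size):
    """Probability of seeing the word at least once.

    Assumption: occurrences are Poisson-distributed with the expected count
    as the mean. Real word occurrences tend to cluster, so treat this as a
    rough guide only.
    """
    return 1.0 - math.exp(-expected_occurrences(rate_per_million, corpus_size))
```

A word occurring at 0.1 per million is expected about 0.1 times in a
one-million-word corpus, but about 10 times in a 100-million-word corpus, so
even words disfavored in a genre will turn up once the corpus is large enough.
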
I hope some of these reflections are useful for you. The
reference, such as it is:
Fidelholtz, James L. 1975. Word frequency and vowel reduction in
English. _Chicago linguistic society. Regional meeting. Papers_
11.200-213. [be sure to check the footnotes carefully]
Jim

On Mon, 23 Feb 1998, Sameh-al-ansary wrote:

>Date: Mon, 23 Feb 1998 20:57:58 +0200
>From: Sameh-al-ansary <sameh-al-ansary@usa.net>
>To: CORPORA <CORPORA@HD.UIB.NO>
>Subject: Corpora: Papers and corpus software
>Resent-Date: Mon, 23 Feb 1998 20:29:23 +0100
>Resent-From: corpora-request@lists.uib.no
>
>Dear everyone :
>
> I am writing my Ph.D. thesis in corpus linguistics. I am in need of papers regarding the corpus-based differences between spoken and written language. If anyone has published a paper concerning any comparative difference between spoken and written language, their structural and typological differences, or any other difference, please let me know.
> Can anyone tell me where I can find software for tagging and processing a corpus?
>email : sameh-al-ansary@usa.net

James L. Fidelholtz e-mail: jfidel@cen.buap.mx
Área de Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO
