RE: Corpora: Corpus of scientific texts

GCW (williams@ensinfo.univ-nantes.fr)
Tue, 27 Oct 1998 07:37:44 +0100 (MET)

I may not have been clear in my use of 'self-publicity'. As I mentioned,
this is in line with discourse analysis theories put forward by John
Swales. It is not a negative statement. What Swales implies by 'Create A
Research Space' is the need to justify your presence in a given field. When
we publish, this is often what we are doing: affirming our presence and
demanding a right to speak. The net is a wonderful place to publish, but
there is no direct peer review; this is what must be borne in mind.

It is true that much is being put on the net, and this is good. But much
work of importance is not. Consequently, in building any sublanguage theory
one must recognise the limitations of our sources. Mine is copyright; for
others it is the 'limited' resources on the net. Should you read the small
print in many academic journals, you will find it very clear who owns the
copyright, and who may then make it available.

Concerning PhDs on-line: fine. You can build a corpus very quickly in that
way, but a young PhD is not yet a full member of the research community,
and the thesis is a very different genre from the research article, not
only in length but in audience.

I must also stress that I was talking of the biological sciences, where
full on-line papers are a rarity. I am also wary of the statistically
'proven'; I cannot help thinking of Mark Twain.

Best Wishes

Geoffrey

williams@ensinfo.univ-nantes.fr

On Mon, 26 Oct 1998, Christopher A. Brewster wrote:

>
>
> On 26 October 1998 08:48, GCW [SMTP:williams@ensinfo.univ-nantes.fr] wrote:
> > On Fri, 23 Oct 1998, Adam Kilgarriff wrote:
> >
> > >
> > > Aren't 'technical scientific corpora' the easiest of all to produce?
> > > Increasingly, all the material is available online in a manner which
> > > invites you to download it, for free, direct, without a publisher
> > > intervening to create copyright problems.
> >
> > In this case, who controls the input? If you take what happens to be
> > available on the net, then you have little control over the selection
> > process. Then, are we talking about 'technical' science in the sense of
> > technical how-to-do-it manuals, or learned research papers? The latter are
> > rarely available on-line for copyright reasons. Some scientists do put
> > texts on their websites, but this is for self-publicity purposes,
> > 'creating a research space' in the terminology of Swales. You cannot cover
> > a sublanguage in this way.
> >
>
> I think that the extent to which 'learned research papers' are unavailable on-line varies
> considerably. In collecting material for my PhD, I have formed the impression that more than 50%
> of the current generation of young professors in the fields of computational linguistics, NLP
> and IR have put their doctorates and the majority of their publications on the web. This is not,
> I believe, for self-publicity but to encourage interaction within the academic community. I would
> expect other disciplines to be doing the same thing.
>
> A simple way to recover a sub-language is to use the citation indexes, find the top 100 items
> cited in the last ten years, and aim to recover 50% of these over the net. This gives you a
> corpus which is statistically shown to be the most influential (i.e. like getting top ratings
> on TV) and thus linguistically both representative of and influential on the linguistic community
> of the sub-language.
>
> What is the flaw in my method?
>
> Christopher Brewster
> University of Patras & University of Birmingham
>
>