Re: Corpora: e-mail corpus

Vasileios Hatzivassiloglou (vh@cs.columbia.edu)
Wed, 22 Apr 1998 14:47:50 -0400 (EDT)

From: "William C. Spruiell" <3lfyuji@cmich.edu>
Date: Sat, 18 Apr 1998 12:47:05 -0400
Sender: owner-corpora@lists.uib.no
Precedence: bulk
Content-Type: text/plain; charset="iso-8859-1"
Content-Length: 1020

I recently finished a pilot project that involved analyzing netnews
postings; I found a shareware program, Gravity, that allowed the user to
store all new messages from marked groups on a local hard drive, where they
could be subjected to standard string searches. Unfortunately, the program
does not allow the message database to be dumped as a plain ASCII file, so
using a full concordancer with it is impossible. Messages containing
searched-for strings can, of course, be cut-and-pasted into ASCII files (and
if you have lots of time or a phalanx of assistants, I suppose you could do
that with *all* messages). I looked at netnews because I was interested in
argumentation, but it has the added advantage of sidestepping the privacy
issue, since netnews postings are fully public.

The material itself raises a number of interesting analytical issues. For
example, what is the implication for, say, type/token ratios of a medium in
which users commonly copy the entirety of a preceding message into a current
one?
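
To make this concrete: quoted material inflates the token count while
contributing few new types, so the ratio drops. A minimal sketch of one way
to measure the effect, assuming quoted lines carry the conventional '>'
prefix and that simple whitespace tokenization is adequate:

    def type_token_ratio(text, strip_quotes=False):
        """Compute types/tokens, optionally ignoring '>'-quoted lines."""
        lines = text.splitlines()
        if strip_quotes:
            lines = [ln for ln in lines if not ln.lstrip().startswith(">")]
        tokens = " ".join(lines).lower().split()
        return len(set(tokens)) / len(tokens) if tokens else 0.0

Comparing the two figures over a sample of postings would show how much of
the deflation is due to quoting alone.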

As I understand it, your local news server already does this, i.e.,
incoming news posts are stored locally (on your news/Internet service
provider's disks). The format used is plain text (with headers at the top,
similar to mail format; I forget the exact RFC number, but there is one you
could check for the exact format specifications). So, if you have access to
the disks your provider uses (which would usually be the case for a
university account), you can just read the data from there. Typically the
articles go somewhere like /usr/spool/news, and your news client reads them
from there.
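
For example, since the headers run from the top of each article to the
first blank line, stripping them is straightforward. Here is a minimal
sketch, assuming the traditional spool layout in which each article is a
numbered file under a directory named after its group (dots replaced by
slashes); the spool path and group name below are placeholders for your
own setup:

    import os

    SPOOL = "/usr/spool/news"       # placeholder: your site's spool directory
    GROUP = "comp/ai/nat-lang"      # placeholder: group name, dots -> slashes

    def article_body(path):
        """Return the body of an article, skipping the header block.

        In the mail-style format, headers run from the top of the file
        to the first blank line; everything after that is the body.
        """
        with open(path, "r", errors="replace") as f:
            lines = f.readlines()
        try:
            first_blank = lines.index("\n")   # blank line ends the headers
        except ValueError:
            return ""                         # headers only; no body
        return "".join(lines[first_blank + 1:])

    group_dir = os.path.join(SPOOL, GROUP)
    articles = sorted((n for n in os.listdir(group_dir) if n.isdigit()),
                      key=int)
    for name in articles:
        print(article_body(os.path.join(group_dir, name)))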

On a Unix system, you can automate the process of fetching new articles by
having a program, run as a cron job, check the spool directory every day or
so and copy the articles you want. Then you can run your analysis tools on
the collected text, stripping headers, etc. You need to store the articles
in a separate area, since they will be automatically expired (deleted) from
the system spool directory after a while.
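
A minimal sketch of such a collector, under the same layout assumptions as
above; the paths and group list are placeholders for your own setup:

    #!/usr/bin/env python
    import os
    import shutil

    SPOOL = "/usr/spool/news"              # system spool; articles expire here
    ARCHIVE = "/home/you/news-archive"     # placeholder: your own storage area
    GROUPS = ["comp/ai/nat-lang"]          # placeholder: groups to track

    for group in GROUPS:
        src_dir = os.path.join(SPOOL, group)
        dst_dir = os.path.join(ARCHIVE, group)
        os.makedirs(dst_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            if not name.isdigit():         # skip anything but numbered articles
                continue
            dst = os.path.join(dst_dir, name)
            if not os.path.exists(dst):    # copy only what we do not have yet
                shutil.copy(os.path.join(src_dir, name), dst)

A crontab entry along the lines of

    0 4 * * * python /home/you/collect-news.py

would then run it every night.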

The question of "sidestepping the privacy issue" is far from resolved, I
believe. Posting an article does not place the material in it in the public
domain, although this is a common misconception. Under both U.S. and
European law, the author of the message retains full copyright rights over
it, even when this is not mentioned in the message, unless there is an
explicit statement to the effect that the message is now in the public
domain. Thus, we almost certainly need permission to do research with news
articles, although it is questionable whether the authors whose copyright we
infringe would actually bother seeking relief. Our safest bet is probably
to ask a commercial news posting source (e.g., the clari.* hierarchy) for
permission to use their material, which they may grant free of charge for
research purposes.

Best,
Vasileios