RE: Corpora: Unpacking BNC with WinZip

Christopher Tribble (ctribble@lanka.ccom.lk)
Fri, 1 Jan 1999 22:31:16 +0530

Dear Douglas McCarthy,

Thanks for this. I've tried changing the TAR file settings in WinZip -
toggling TAR file smart CR/LF conversion - and this makes no difference.
The problem I keep coming back to is that the contents of A.TGZ, B.TGZ and
C.TGZ on the BNC distribution CD-ROMs all appear as large single files to
WinZip, while all the other TGZ distribution files unpack as TAR
archives containing _many_ files - which is what I want.

Is it simply that the poor sods without Unix boxes are out in the cold? I
continue to hope for illumination! Surely someone else has had the same
problem?

Chris
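
A possible workaround that does not depend on WinZip, offered as a minimal
sketch rather than a tested fix for the BNC CDs: if, as Lou's reply quoted
below suggests, the large single file is itself a TAR archive, then Python's
standard tarfile module should be able to read either the original .TGZ or
the already-gunzipped file and write out the member texts with their names
intact. The function names and the example paths here are illustrative
assumptions, not part of any BNC tooling.

```python
import shutil
import tarfile
from pathlib import Path


def list_members(archive_path):
    """Return the names of the regular files stored in a tar archive.

    Mode "r:*" lets tarfile detect compression itself, so the same call
    accepts a gzipped archive (e.g. A.TGZ) and a plain, uncompressed tar
    file (e.g. the large single file "A").
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [m.name for m in archive.getmembers() if m.isfile()]


def unpack(archive_path, dest_dir):
    """Extract every regular file under dest_dir, keeping its stored path."""
    dest = Path(dest_dir).resolve()
    written = []
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Strip any leading "/" so absolute names stay under dest_dir.
            target = (dest / member.name.lstrip("/")).resolve()
            # Skip members whose path would escape dest_dir.
            if dest not in target.parents:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(target)
    return written
```

Usage would be along the lines of unpack("A.TGZ", "bnc_texts"), where both
names are placeholders. If tarfile raises ReadError on the file, the
assumption that it is a TAR archive does not hold and this sketch does not
apply.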

On Friday, January 01, 1999 4:27 PM, Douglas McCarthy
[SMTP:saltertn@ecp.fr] wrote:
> If it is Winzip at the root of the problem, you can check by changing the
> Tar CR/LF conversion setting in the Options/Configuration menu.
>
> This used to cause problems with gunzipped tar archives. I say "used to"
> because this setting is on in my version 6, and I'm fairly sure that I've
> correctly restored archived file systems.
>
> I'll try to check further at my end, but in the meantime, try the advice
> above.
>
> Best regards,
>
> Douglas McCarthy
>
> Christopher Tribble wrote:
>
> > Dear All - I've put this request to Lou, and appended his reply - I'd be
> > really grateful if anyone has any ideas. (BTW - I'm using Winzip 32 (6.3)
> > so I don't think Lou's thoughts that it's a Winzip problem hold)
> >
> > Really grateful for any comments / suggestions
> >
> > ----------------------------------------------
> > PROBLEM
> >
> > I've been tinkering with the raw text files on the BNC CDROM and find some
> > anomalies which confuse me.
> >
> > These are:
> >
> > 1. The compressed files for A.tgz, B.tgz, and C.tgz unpack from the CDROM
> > to create a single text file - respectively A, B & C. All the other .TGZ
> > files unpack to create a .TAR file. Each of these can in turn be unpacked
> > to create a large number of individual corpus files. A useful arrangement
> > if you want to work with subsets of the BNC - which is my intention.
> >
> > 2. The large A, B, & C files are text files. They appear to contain the
> > original data files but in a concatenated form. There also appears to be a
> > certain amount of "noise" in these files:
> >
> > a) "loose" tags are included which also seem to be associated with the loss
> > of some tags - as in the example below where the tag for "should" has gone
> > missing:
> >
> > <w NP0>EVELYN <w NP0>McEWEN <w NP0>Divisional <w NN1>Director<c PUN>,
> > <w NN2>Services
> > </p>
> > </div1>
> > </text>
> > </bncDoc>
> > should <w VBI>be
> > <w VVN>ensured <w PRP>for <w AJC>older <w NN2>workers<c PUN>;
> > These appear close to the end of texts at what seem to be arbitrary intervals
> > throughout the file - without any corresponding <text> starting tag.
> >
> > b) there is a block of unprintable characters at the beginning of each text
> > - eg:
> >
> > </B/B0/B02
> >
> > 440 15530 15000 1621613 5725401633 5212
> >
> > Any idea what the problem / solution might be? I'm able to split the files
> > back into constituent texts using WordSmith, but then lose the file names -
> > it's all a bit confusing.
> > ----------------------------------------------
> > LOU'S REPLY
> > On Thursday, December 31, 1998 11:49 PM, Lou Burnard
> > [SMTP:lou.burnard@computing-services.oxford.ac.uk] wrote:
> > > Hi Chris
> > >
> > > sorry not to have replied to your query earlier: it got swamped by
> > > xmas xcesses.
> > >
> > > the short answer is: upgrade your version of winzip. or unpack the cds
> > > with something that knows how to deal with a GNU tar file properly.
> > >
> > > I don't know why A B and C are different from the others (probably
> > > because they were done first) but, clearly what you are getting is a
> > > TAR archive instead of the proper file structure. Later versions of
> > > Winzip (mine is 6) recognize this file format correctly, and will
> > > unpack it into the correct file system.
> > >
> > > stand by for exciting announcements about the bnc sampler (wot? that
> > > old thing?)
> > >
> > > best wishes to you for 99
> > >
> > > Lou
> > ----------------------------------------------
> > As I say above - the Winzip version doesn't seem to be the problem ...
> >
> > Enlightenment greatly welcomed!
> >
> > bestest
> >
> > Chris Tribble
> >
> > --
> > Sri Lanka 21 Wijerama Mawatha, Colombo 7
> > TEL +94 75 332 309
> > UK 122, Queen Alexandra Mansions, Judd Street
> > London WC1 H 9DQ
> > TEL +44 171 833 4271
> > UK Mailing c/o FCO (Colombo)
> > The British Council: Sri Lanka
> > King Charles Street, London SW1A 2AH
> > E-mail ctribble@serendib.ccom.lk
> > Home Page http://ourworld.compuserve.com/homepages/Christopher_Tribble
>
>