Re: [Corpora-List] Comparing files

From: Ken Beesley (Ken.Beesley@xrce.xerox.com)
Date: Sun Nov 16 2003 - 14:50:13 MET

Next message: Lluís Padró: "Re: [Corpora-List] Comparing files"

Previous message: Vlado Keselj: "Re: [Corpora-List] Comparing files"
Maybe in reply to: Tine Lassen: "[Corpora-List] Comparing files"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

In Unix, there's a built-in 'comm' utility for
comparing two lists of (sorted) words:

NAME
comm - select or reject lines common to two files

SYNOPSIS
comm [-123] file1 file2

DESCRIPTION
     The comm utility will read file1 and file2, which should be
     ordered in the current collating sequence, and produce three
     text columns as output: lines only in file1; lines only in
     file2; and lines in both files.

     If the input files were ordered according to the collating
     sequence of the current locale, the lines written will be in
     the collating sequence of the original lines. If not, the
     results are unspecified.

OPTIONS
The following options are supported:

-1 Suppress the output column of lines unique to file1.

-2 Suppress the output column of lines unique to file2.

-3 Suppress the output column of lines duplicated in
file1 and file2.

So if your original files are 'one' and 'two', you can just do:

sort one | uniq > one.uniq
sort two | uniq > two.uniq
comm one.uniq two.uniq > output

using the optional flags, as desired, to suppress one or more of
the columns of output.
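
As a minimal sketch of the flag combinations (not part of the man page excerpt above), the following uses the small sample lists from the message quoted below. The temporary directory, the variable names only_one, only_two and both, and the use of 'sort -u' in place of 'sort | uniq' are assumptions made for illustration. Forcing LC_ALL=C is also an assumption; it simply makes sure sort and comm agree on the collating sequence.

```shell
# Work in a scratch directory so no existing files are overwritten
# (mktemp -d is widely available, though not required by POSIX).
cd "$(mktemp -d)" || exit 1

# Sample word lists, copied from the example quoted below.
printf 'cat\ndog\ncat\nmouse\n' > one
printf 'cat\nrabbit\nelephant\nrabbit\n' > two

# Sort and drop duplicates; sort and comm must use the same collation.
LC_ALL=C sort -u one > one.uniq
LC_ALL=C sort -u two > two.uniq

# -23 suppresses columns 2 and 3: words only in 'one'.
only_one=$(LC_ALL=C comm -23 one.uniq two.uniq)
# -13 suppresses columns 1 and 3: words only in 'two'.
only_two=$(LC_ALL=C comm -13 one.uniq two.uniq)
# -12 suppresses columns 1 and 2: words in both lists.
both=$(LC_ALL=C comm -12 one.uniq two.uniq)

printf 'only in one:\n%s\n' "$only_one"
printf 'only in two:\n%s\n' "$only_two"
printf 'in both:\n%s\n' "$both"
```

With no flags at all, comm prints all three columns side by side, separated by tabs, which is harder to post-process than one column at a time.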

Ken

> Subject: Re: [Corpora-List] Comparing files
> To: miles@inf.ed.ac.uk (Miles Osborne)
> Date: Sat, 15 Nov 2003 17:16:14 -0500 (EST)
> Cc: otto@lassen.mail.dk (Otto Lassen), tine.lassen@tdcadsl.dk (Tine Lassen), CORPORA@HD.UIB.NO
> From: radev@umich.edu
>
> Here is a UNIX script:
>
> % sort one | uniq > one.uniq
> % sort two | uniq > two.uniq
> % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
>
> Here is an example
>
> one:
> ==========
> cat
> dog
> cat
> mouse
>
> two:
> ==========
> cat
> rabbit
> elephant
> rabbit
>
> output:
> ==========
> 3 cat
> 2 mouse
> 2 dog
> 1 rabbit
> 1 elephant
>
>
> Words with a count of 3 appear in both "one" and "two".
> Words with a count of 2 appear in "one" only.
> Words with a count of 1 appear in "two" only.
>
> --
> Drago
>
>
> Miles Osborne wrote:
> >
> > that's far too slow -use a hash table instead.
> >
> > now, this wouldn't be homework, would it?
> >
> > Miles
> >
> > Quoting Otto Lassen <otto@lassen.mail.dk>:
> >
> > > Hi
> > > That could be done in any language:
> > > 1. sort the two lists
> > > 2. compare them word for word
> > > 3. output words which are not in both lists
> > > Regards
> > > Otto Lassen
> > >
> > > At 21:54 15-11-2003 +0100, you wrote:
> > > >Hi,
> > > >
> > > >I'm doing a project that involves comparing two very large word lists
> > > >(~40.000 and 70.000 words). What I need to find out, is which words are on
> > > >one list and not on the other (and/or vice versa).
> > > >Can anyone give me a hint as to how to do this? (I was thinking; maybe a
> > > >perl script?)
> > > >
> > > >Any help will be greatly appreciated.
> > > >Best,
> > > >Tine Lassen
> > >
> > >
> >
> >
>
>
> --
> Dragomir R. Radev radev@umich.edu
> Assistant Professor of Information, Electrical Engineering and
> Computer Science, and Linguistics, the University of Michigan, Ann Arbor
> Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev
>

**********************************************************************
Kenneth R. Beesley ken.beesley@xrce.xerox.com
Xerox Research Centre Europe Tel from France: 04 76 61 50 64
6, chemin de Maupertuis Tel from Abroad: +33 4 76 61 50 64
38240 MEYLAN Fax from France: 04 76 61 50 99
France Fax from Abroad: +33 4 76 61 50 99

XRCE page: http://www.xrce.xerox.com
Personal page: http://www.xrce.xerox.com/people/beesley/beesley.html
**********************************************************************


This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 14:48:24 MET