Re: [Corpora-List] Comparing files

From: Vlado Keselj (vlado@cs.dal.ca)
Date: Sun Nov 16 2003 - 14:30:29 MET

    On Sat, 15 Nov 2003 radev@umich.edu wrote:

    > Here is a UNIX script:
    >
    > % sort one | uniq > one.uniq
    > % sort two | uniq > two.uniq
    > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output

    A similar question was asked about 2.5 years ago on the corpora-list.
    (Is this a candidate for FAQ?) This was my answer:

    Date: Fri Apr 20 2001 - 17:04:03 MET DST
    Subject: Re: Corpora: FW: help - comparing word lists

    On Unix, Linux and similar: You can sort both lists and use comm, e.g.:
    sort -u < list1 > list1.sorted
    sort -u < list2 > list2.sorted
    comm -23 list1.sorted list2.sorted

    This prints the words that are on list1 but not on list2.
    Both sort and comm are efficient: sort runs in O(n log n) time and
    comm makes a single pass over its two sorted inputs.
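
    As a rough sketch of the same idea covering all three directions (the
    sample words are borrowed from the example quoted further down in this
    message; the scratch directory, the variable names and the LC_ALL
    setting are assumptions added for illustration):

```shell
# Sketch only: work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Force one collation order so that sort and comm agree on what "sorted" means.
export LC_ALL=C

# Sample word lists (assumed data, one word per line).
printf '%s\n' cat dog cat mouse > list1
printf '%s\n' cat rabbit elephant rabbit > list2

# comm requires sorted input; -u also drops duplicate words.
sort -u < list1 > list1.sorted
sort -u < list2 > list2.sorted

# comm prints three columns: lines only in file 1, lines only in file 2,
# and lines in both. The -1, -2 and -3 flags suppress the matching column.
only_in_1=$(comm -23 list1.sorted list2.sorted)   # on list1 but not on list2
only_in_2=$(comm -13 list1.sorted list2.sorted)   # on list2 but not on list1
in_both=$(comm -12 list1.sorted list2.sorted)     # on both lists

printf 'only on list1:\n%s\n' "$only_in_1"
printf 'only on list2:\n%s\n' "$only_in_2"
printf 'on both lists:\n%s\n' "$in_both"
```

    Running comm with no flags prints all three columns at once, separated
    by tabs, which may be enough if the result is only read by eye.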

    Vlado

    On Fri, 20 Apr 2001, Wiesheu, Martin wrote:

    > hello out there,
    >
    > could anyone help me on the following question?:
    >
    > is there any tool or method to efficiently compare two very long word lists
    > to see what words are on one list and not on the other?
    >
    > thanks,
    >
    > martin
    >
    >
    > Martin Wiesheu
    > ZGS Research
    > COMMERZBANK Securities
    >
    > Tel. + 49 - 69 - 136 43730
    > Fax. + 49 - 69 - 136 27445

    > Here is an example
    >
    > one:
    > ==========
    > cat
    > dog
    > cat
    > mouse
    >
    > two:
    > ==========
    > cat
    > rabbit
    > elephant
    > rabbit
    >
    > output:
    > ==========
    > 3 cat
    > 2 mouse
    > 2 dog
    > 1 rabbit
    > 1 elephant
    >
    >
    > Words with a count of 3 appear in both "one" and "two".
    > Words with a count of 2 appear in "one" only.
    > Words with a count of 1 appear in "two" only.
    >
    > --
    > Drago
    >
    >
    > Miles Osborne wrote:
    > >
    > > that's far too slow - use a hash table instead.
    > >
    > > now, this wouldn't be homework, would it?
    > >
    > > Miles
    > >
    > > Quoting Otto Lassen <otto@lassen.mail.dk>:
    > >
    > > > Hi
    > > > That could be done in any language:
    > > > 1. sort the two lists
    > > > 2. compare them word for word
    > > > 3. output words which are not in both lists
    > > > Regards
    > > > Otto Lassen
    > > >
    > > > At 21:54 15-11-2003 +0100, you wrote:
    > > > >Hi,
    > > > >
    > > > >I'm doing a project that involves comparing two very large word lists
    > > > >(~40.000 and 70.000 words). What I need to find out, is which words are on
    > > > >one list and not on the other (and/or vice versa).
    > > > >Can anyone give me a hint as to how to do this? (I was thinking; maybe a
    > > > >perl script?)
    > > > >
    > > > >Any help will be greatly appreciated.
    > > > >Best,
    > > > >Tine Lassen
    > > >
    > > >
    > >
    > >
    >
    >
    > --
    > Dragomir R. Radev radev@umich.edu
    > Assistant Professor of Information, Electrical Engineering and
    > Computer Science, and Linguistics, the University of Michigan, Ann Arbor
    > Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev
    >
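
    For comparison, the hash-table approach suggested in the quoted thread
    could be sketched with awk's associative arrays, which avoids sorting
    altogether. This is only one possible reading of that suggestion; the
    file names "one" and "two" follow the quoted example, and the scratch
    directory and variable names are assumptions:

```shell
# Sketch only: work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample word lists (assumed data, one word per line).
printf '%s\n' cat dog cat mouse > one
printf '%s\n' cat rabbit elephant rabbit > two

# While reading the first file named on the command line (NR == FNR),
# store every word as a key of the hash table "seen". For the second
# file, print each word that is not in "seen", once per distinct word.
# Caveat: this idiom assumes the first file is not empty.
only_in_one=$(awk 'NR == FNR { seen[$0] = 1; next }
                   !($0 in seen) && !printed[$0]++' two one)

# Swap the file arguments to get the opposite direction.
only_in_two=$(awk 'NR == FNR { seen[$0] = 1; next }
                   !($0 in seen) && !printed[$0]++' one two)

printf 'only in one:\n%s\n' "$only_in_one"
printf 'only in two:\n%s\n' "$only_in_two"
```

    Unlike the comm version, the output here keeps the order in which the
    words first appear in the input rather than sorted order.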



    This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 14:33:54 MET