Re: [Corpora-List] Comparing files

From: Preslav Ivanov Nakov (nakov@eecs.berkeley.edu)
Date: Sun Nov 16 2003 - 00:47:19 MET

  • Next message: Bob Krovetz: "Re: [Corpora-List] Comparing files"

    If you have enough memory to fit the smaller one there (and for any
    reasonable wordlist size I can imagine, you should have),
    a hash table is much better than sorting - it is linear in the size of both
    files.

    Preslav

    ----- Original Message -----
    From: "Tony Abou-Assaleh" <taa@cs.dal.ca>
    To: "Miles Osborne" <miles@inf.ed.ac.uk>; "Otto Lassen"
    <otto@lassen.mail.dk>; "Tine Lassen" <tine.lassen@tdcadsl.dk>;
    <CORPORA@HD.UIB.NO>; <radev@umich.edu>
    Sent: Saturday, November 15, 2003 3:25 PM
    Subject: Re: [Corpora-List] Comparing files

    > If the files are really big, sorting them first and then merging using a
    > merge sort is probably the fastest. Sorting is O(N lg N) and merging is
    > O(N). Alternatively, you could concatenate, sort, and then if needed
    > traverse.
    >
    > However, 40K and 70K words isn't really that much, especially if you don't
    > need to repeat every few seconds. So in your case, Drago's method is the
    > simplest and would be fast enough.
    >
    > TAA
    >
    > --------------------------------------------------
    > Tony Abou-Assaleh
    > Ph.D. Candidate, Faculty of Computer Science
    > Dalhousie University, Halifax, NS, Canada, B3H 1W5
    > Fax: 902-492-1517
    > Email: taa@acm.org
    > WWW: http://www.cs.dal.ca/~taa/
    > ---------------------[THE END]--------------------
    >
    >
    > On Sat, 15 Nov 2003 radev@umich.edu wrote:
    >
    > > Here is a UNIX script:
    > >
    > > % sort one | uniq > one.uniq
    > > % sort two | uniq > two.uniq
    > > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
    > >
    > > Here is an example
    > >
    > > one:
    > > ==========
    > > cat
    > > dog
    > > cat
    > > mouse
    > >
    > > two:
    > > ==========
    > > cat
    > > rabbit
    > > elephant
    > > rabbit
    > >
    > > output:
    > > ==========
    > > 3 cat
    > > 2 mouse
    > > 2 dog
    > > 1 rabbit
    > > 1 elephant
    > >
    > >
    > > Words with a count of 3 appear in both "one" and "two".
    > > Words with a count of 2 appear in "one" only.
    > > Words with a count of 1 appear in "two" only.
    > >
    > > --
    > > Drago
    > >
    > >
    > > Miles Osborne wrote:
    > > >
    > > > that's far too slow -use a hash table instead.
    > > >
    > > > now, this wouldn't be homework, would it?
    > > >
    > > > Miles
    > > >
    > > > Quoting Otto Lassen <otto@lassen.mail.dk>:
    > > >
    > > > > Hi
    > > > > That could be done in any language:
    > > > > 1. sort then two lists
    > > > > 2. compare them word for word
    > > > > 3. output words which are not in both lists
    > > > > Regards
    > > > > Otto Lassen
    > > > >
    > > > > At 21:54 15-11-2003 +0100, you wrote:
    > > > > >Hi,
    > > > > >
    > > > > >I'm doing a project that involves comparing two very large word
    lists
    > > > >
    > > > > >(~40.000 and 70.000 words). What I need to find out, is which words
    are
    > > > > on
    > > > > >one list and not on the other (and/or vice versa).
    > > > > >Can anyone give me a hint as to how to do this? (I was thinking;
    maybe
    > > > > a
    > > > > >perl script?)
    > > > > >
    > > > > >Any help will be greatly appreciated.
    > > > > >Best,
    > > > > >Tine Lassen
    > > > >
    > > > >
    > > >
    > > >
    > >
    > >
    > > --
    > > Dragomir R. Radev
    radev@umich.edu
    > > Assistant Professor of Information, Electrical Engineering and
    > > Computer Science, and Linguistics, the University of Michigan, Ann Arbor
    > > Phone: 734-615-5225 Fax: 734-764-2475
    http://www.si.umich.edu/~radev
    > >
    >
    >



    This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 00:46:22 MET