Re: [Corpora-List] Comparing files

From: radev@umich.edu
Date: Sat Nov 15 2003 - 23:16:14 MET


    Here is a UNIX script:

    % sort one | uniq > one.uniq
    % sort two | uniq > two.uniq
    % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output

    Here is an example:

    one:
    ==========
    cat
    dog
    cat
    mouse

    two:
    ==========
    cat
    rabbit
    elephant
    rabbit

    output:
    ==========
       3 cat
       2 mouse
       2 dog
       1 rabbit
       1 elephant

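    Because one.uniq is concatenated twice and two.uniq once, each word
    in "one" contributes 2 to its count and each word in "two" contributes 1.

    Words with a count of 3 appear in both "one" and "two".
    Words with a count of 2 appear in "one" only.
    Words with a count of 1 appear in "two" only.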
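
    If separate output files are more convenient than counts, the same
    three groups can be produced with the standard comm utility. This is
    only a sketch of an alternative, not part of the script above: the
    scratch directory and the output names only_one, only_two and both
    are assumptions made for illustration, and the input is the example
    data from this message.

```shell
# Recreate the example lists from this message in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' cat dog cat mouse > one
printf '%s\n' cat rabbit elephant rabbit > two

# Use one collation order for both sort and comm so they agree on ordering.
export LC_ALL=C
sort -u one > one.uniq
sort -u two > two.uniq

# comm expects sorted input; -1, -2 and -3 suppress its three columns.
comm -23 one.uniq two.uniq > only_one   # lines found in "one" only
comm -13 one.uniq two.uniq > only_two   # lines found in "two" only
comm -12 one.uniq two.uniq > both       # lines found in both files
```

    With the example data, only_one holds dog and mouse, only_two holds
    elephant and rabbit, and both holds cat.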

    --
    Drago
    

    Miles Osborne wrote:
    >
    > that's far too slow - use a hash table instead.
    >
    > now, this wouldn't be homework, would it?
    >
    > Miles
    >
    > Quoting Otto Lassen <otto@lassen.mail.dk>:
    >
    > > Hi
    > > That could be done in any language:
    > > 1. sort the two lists
    > > 2. compare them word for word
    > > 3. output words which are not in both lists
    > > Regards
    > > Otto Lassen
    > >
    > > At 21:54 15-11-2003 +0100, you wrote:
    > > >Hi,
    > > >
    > > >I'm doing a project that involves comparing two very large word lists
    > > >(~40.000 and 70.000 words). What I need to find out, is which words are on
    > > >one list and not on the other (and/or vice versa).
    > > >Can anyone give me a hint as to how to do this? (I was thinking; maybe a
    > > >perl script?)
    > > >
    > > >Any help will be greatly appreciated.
    > > >Best,
    > > >Tine Lassen
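
    For what it is worth, the hash-table idea suggested in the quoted
    text can also be tried from the shell, since awk arrays are
    associative. The following is a rough sketch under assumptions, not
    anyone's actual code from this thread: the variable name prog, the
    scratch directory and the output file names are made up for
    illustration, and the input is the example data from this message.

```shell
# Recreate the example lists from this message in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' cat dog cat mouse > one
printf '%s\n' cat rabbit elephant rabbit > two

# While reading the first file (NR == FNR), store each line as a key.
# While reading the second file, print every line that is not a key,
# and print it only the first time it is seen.
# Note: the NR == FNR test assumes the first file is not empty.
prog='NR == FNR { seen[$0] = 1; next } !($0 in seen) && !printed[$0]++'

awk "$prog" one two > only_two   # words in "two" that are not in "one"
awk "$prog" two one > only_one   # words in "one" that are not in "two"
```

    No sorting is needed here, so the words come out in the order in
    which they first appear in the input.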

    --
    Dragomir R. Radev radev@umich.edu
    Assistant Professor of Information, Electrical Engineering and
    Computer Science, and Linguistics, the University of Michigan, Ann Arbor
    Phone: 734-615-5225 Fax: 734-764-2475
    http://www.si.umich.edu/~radev


