[Corpora-List] Summary: Comparing files

From: Tine Lassen (tine.lassen@tdcadsl.dk)
Date: Tue Nov 18 2003 - 23:23:16 MET


    A couple of days ago I posted a question:

    "I'm doing a project that involves comparing two very large word lists (~40.000 and 70.000 words). What I need to find out, is which words are on one list and not on the other (and/or vice versa).
     Can anyone give me a hint as to how to do this? (I was thinking; maybe a perl script?)"

    I got an overwhelming number of answers, and I apologize for not answering everybody personally.
    Thanks to the following people, who all suggested solutions to my problem:
     
     Lluís Padró
     David Graff
     JL delucca
     Darren Pearce
     Vlado Keselj
     Ken Beesley
     Ronald P. Reck
     Bob Krovetz
     Erin McKean
     Preslav Ivanov Nakov
     Tony Abou-Assaleh
     Miles Osborne
     Paul Bijnens
     Christer.Johansson
     Donna Byron
     Alexiei Dingli
     Otto Lassen
     Marco Baroni

    Several people suggested I use the Unix utility sort in combination with diff or comm, which of course will give me what I'm looking for.
    However, what I didn't explain in my original query was that my lists are lexicon files, and thus are already sorted and consist of unique tokens. Given the difference in size between the two lists (30K words), I was hoping for a tool that did a little bit more than that.

    What I chose to go with was a Perl script written by David Graff:
    >
    > I wrote a perl script myself some time ago, to do just this sort of
    > thing, as well as a few other things. You can download it from the
    > "Perl Monks" web site:
    >
    > http://www.perlmonks.org/index.pl?node_id=160735
    >
    > It actually does quite a few other things besides the simple task you
    > posed: given two lists, you can output the intersection, the union, or
    > the set of elements unique to one list or the other.
    >
    > Some additional options let you do this on multi-column files as well as
    > simple lists, choosing any column as the "join" field for relating the
    > two input lists, and specifying a specialized field delimiter if
    > necessary.
    >
    > The Perl Monks site is an excellent resource -- you can search their
    > "code catacombs", FAQ's, and watch on-going discussions as people post
    > questions/findings and get answers/reactions.
    >
    > Best regards,
    >
    > Dave Graff

    Another interesting suggestion came from JL delucca:

    From: "JL delucca" <delucca@email.com>
    To: "Tine Lassen" <tine.lassen@tdcadsl.dk>
    > Tine,
    >
    > Well, well, well. I have a page on the Internet, http://www.dictionarium.com (main page), and a page on Computational Linguistics Programming (all mine): http://www.dictionarium.com/dictionarium.htm
    >
    > At present there is no tool available on the pages cited above that does what you want; however, I have tools running on Windows that do what you want.
    >
    > Send me your lists and I can try to do what you want.
    >
    > Best wishes,
    >
    >
    > J. L. De Lucca
     

    Below are the rest of the answers I got:

    From: "Marco Baroni" <baroni@sslmit.unibo.it>
    > A Perl script like the following should work (I haven't tested it,
    > though):
    >
    > ************************************************************
    >
    > #!/usr/bin/perl -w
    >
    > # Record every line of the first list in a hash.
    > open FIRSTLIST, shift or die "can't open first list: $!";
    > while (<FIRSTLIST>) {
    >     $infirst{$_}++;
    > }
    > close FIRSTLIST;
    >
    > # Print each line of the second list that never appeared in the first.
    > open SECONDLIST, shift or die "can't open second list: $!";
    > while (<SECONDLIST>) {
    >     if (!($infirst{$_})) {
    >         print;
    >     }
    > }
    > close SECONDLIST;
    >
    > ************************************************************
    >
    > Regards,
    >
    > Marco

     
    From: "Alexiei Dingli" <alexiei@dcs.shef.ac.uk>
    > Just add the elements to a hash data structure. It has O(1) access time, so
    > when you insert into the hash, you know immediately whether the word already
    > exists or not.
    >
    > Lex
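
    In Perl, the insert itself can answer that membership question, since post-increment returns the previous count. A minimal sketch of the idea (my own, untested; the file names are placeholders):

        #!/usr/bin/perl -w
        # Count every word of the first list, then print the words of the
        # second list that the hash has never seen. $seen{$word}++ returns
        # the count *before* the insert, so a false value means "new word".
        my %seen;
        open FIRST, "list1" or die "can't open list1: $!";
        $seen{$_}++ while <FIRST>;
        close FIRST;
        open SECOND, "list2" or die "can't open list2: $!";
        while (my $word = <SECOND>) {
            print $word unless $seen{$word}++;   # O(1) lookup per word
        }
        close SECOND;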
     
    From: "Donna Byron" <dkbyron@yahoo.com>
    > Are you on unix? If so you can use the command-line
    > sort on each of your files with the -u flag to produce
    > only one token of each unique item in the output, then
    > diff the sorted files. That will give you a
    > type-level distinction, instead of token-level, but it
    > sounds like that's what you wanted.
    >
    > Donna

    From: <Christer.Johansson@lili.uib.no>
    > In unix/linux you do something like
    >
    > sort file1 > file1.sorted
    > sort file2 > file2.sorted
    > diff file1.sorted file2.sorted > result
    > less result
    >
    > (I'm not actually 100% confident the best is to use diff,
    > but look at these man pages:
    > man sort
    > man diff
    > apropos compare
    > )
    >
    > Good luck
    >
    > /Christer
     

     
    From: "Paul Bijnens" <paul.bijnens@xplanation.com>
    > In unix/linux: read the man page for the command "comm":
    >
    > SYNOPSIS
    > comm [OPTION]... LEFT_FILE RIGHT_FILE
    >
    > DESCRIPTION
    > Compare sorted files LEFT_FILE and RIGHT_FILE line by line.
    >
    > -1 suppress lines unique to left file
    >
    > -2 suppress lines unique to right file
    >
    > -3 suppress lines that appear in both files
    >
    > Paul @ Home
     
    From: Otto Lassen
    Hi
    That could be done in any language:
    1. sort the two lists
    2. compare them word for word
    3. output words which are not in both lists
    Regards
    Otto Lassen
    From: "Miles Osborne" <miles@inf.ed.ac.uk>
    To: "Otto Lassen" <otto@lassen.mail.dk>
    > that's far too slow - use a hash table instead.
    >
    > now, this wouldn't be homework, would it?
    >
    > Miles
    >
    From: <radev@umich.edu>
    To: "Miles Osborne" <miles@inf.ed.ac.uk>
    Cc: "Otto Lassen" <otto@lassen.mail.dk>; "Tine Lassen" <tine.lassen@tdcadsl.dk>;
    > Here is a UNIX script:
    >
    > % sort one | uniq > one.uniq
    > % sort two | uniq > two.uniq
    > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
    >
    > Here is an example
    >
    > one:
    > ==========
    > cat
    > dog
    > cat
    > mouse
    >
    > two:
    > ==========
    > cat
    > rabbit
    > elephant
    > rabbit
    >
    > output:
    > ==========
    > 3 cat
    > 2 mouse
    > 2 dog
    > 1 rabbit
    > 1 elephant
    >
    >
    > Words with a count of 3 appear in both "one" and "two".
    > Words with a count of 2 appear in "one" only.
    > Words with a count of 1 appear in "two" only.
    >
    > --
    > Drago

    From: "Tony Abou-Assaleh" <taa@cs.dal.ca>
    To: "Miles Osborne" <miles@inf.ed.ac.uk>; "Otto Lassen" <otto@lassen.mail.dk>; "Tine Lassen" <tine.lassen@tdcadsl.dk>; <CORPORA@HD.UIB.NO>; <radev@umich.edu>
    > If the files are really big, sorting them first and then merging using a
    > merge sort is probably the fastest. Sorting is O(N lg N) and merging is
    > O(N). Alternatively, you could concatenate, sort, and then if needed
    > traverse.
    >
    > However, 40K and 70K words isn't really that much, especially if you don't
    > need to repeat every few seconds. So in your case, Drago's method is the
    > simplest and would be fast enough.
    >
    > TAA
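
    For illustration, a minimal sketch of that linear merge in Perl (my own, untested; it assumes both files are already sorted in plain byte order, e.g. with LC_ALL=C sort, and contain no duplicates; the file names are placeholders):

        #!/usr/bin/perl -w
        # Walk the two sorted lists in lockstep: whichever side is
        # "behind" holds a word the other list lacks.
        open A, "list1.sorted" or die $!;
        open B, "list2.sorted" or die $!;
        my $w1 = <A>;
        my $w2 = <B>;
        while (defined $w1 && defined $w2) {
            if    ($w1 lt $w2) { print "only in list1: $w1"; $w1 = <A>; }
            elsif ($w1 gt $w2) { print "only in list2: $w2"; $w2 = <B>; }
            else               { $w1 = <A>; $w2 = <B>; }   # in both: skip
        }
        # drain whichever list still has words left
        while (defined $w1) { print "only in list1: $w1"; $w1 = <A>; }
        while (defined $w2) { print "only in list2: $w2"; $w2 = <B>; }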

    From: "Preslav Ivanov Nakov" <nakov@eecs.berkeley.edu>
    > If you have enough memory to fit the smaller one there (and for any
    > reasonable wordlist size I can imagine, you should have),
    > a hash table is much better than sorting - it is linear in the size of both
    > files.
    >
    > Preslav
    >
    From: "Erin McKean" <editor@verbatimmag.com>
    To: "Tine Lassen" <tine.lassen@tdcadsl.dk>
    > I believe the Perl Cookbook has a good way to do this -- read the
    > words into an array and compare the arrays.
    >
    > Lots of text editors, including EditPadPro, have compare functions,
    > and if you have access to a unix system the command 'diff' will work,
    > too.
    >
    > Good luck!
    >
    > Erin McKean
    >
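
    The Perl Cookbook recipe she presumably has in mind still goes through a hash rather than comparing the arrays element by element; a minimal sketch (my own, untested; the sample data reuses Drago's example above):

        # Mark every word of the second list in a hash via a hash slice,
        # then keep the words of the first list that the hash doesn't know.
        my @first  = qw(cat dog mouse);       # placeholder data
        my @second = qw(cat rabbit elephant);
        my %in_second;
        @in_second{@second} = ();
        my @only_in_first = grep { !exists $in_second{$_} } @first;
        print "$_\n" for @only_in_first;      # prints: dog, mouse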

    From: "Bob Krovetz" <bkrovetz@askjeeves.com>
    > For files that are this small, the time difference between O(n) and
    > O(n log n) isn't a big deal on typical PCs. I would also recommend looking
    > at the "comm" command in Unix/Linux. With the comm command you would only
    > have to do two sorts (one for each file).
    >
    > > If you have enough memory to fit the smaller one there (and for any
    > > reasonable wordlist size I can imagine, you should have), a hash table is
    > > much better than sorting - it is linear in the size of both files.
    > >
    > > Preslav
    >
    From: "Ronald P. Reck" <rreck@iama.rrecktek.com>
    > The UNIX 'diff' command can do this easily.
    >
    > ie: diff file1 file2
    >
    > If I can help let me know.
    >
    > This is from the diff man page:
    >
    > NAME
    > diff - find differences between two files
    >
    > SYNOPSIS
    > diff [options] from-file to-file
    >
    ----- Original Message -----
    From: "Vlado Keselj" <vlado@cs.dal.ca>
    To: <radev@umich.edu>
    Cc: "Miles Osborne" <miles@inf.ed.ac.uk>; "Otto Lassen" <otto@lassen.mail.dk>; "Tine Lassen" <tine.lassen@tdcadsl.dk>; <CORPORA@HD.UIB.NO>
    > On Sat, 15 Nov 2003 radev@umich.edu wrote:
    >
    > > Here is a UNIX script:
    > >
    > > % sort one | uniq > one.uniq
    > > % sort two | uniq > two.uniq
    > > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
    >
    > A similar question was asked about 2.5 years ago on the corpora-list.
    > (Is this a candidate for FAQ?) This was my answer:
    >
    > Date: Fri Apr 20 2001 - 17:04:03 MET DST
    > Subject: Re: Corpora: FW: help - comparing word lists
    >
    > On Unix, Linux and similar: You can sort both lists and use comm, e.g.:
    > sort -u < list1 > list1.sorted
    > sort -u < list2 > list2.sorted
    > comm -23 list1.sorted list2.sorted
    >
    > It will output the words that are on list1 but not on list2.
    > Both commands are pretty efficient.

    From: "Ken Beesley" <Ken.Beesley@xrce.xerox.com>
    > Tine,
    >
    > The standard Unix utility 'comm' will do this for you.
    >
    >
    > NAME
    > comm - select or reject lines common to two files
    >
    > SYNOPSIS
    > comm [-123] file1 file2
    >
    > DESCRIPTION
    > The comm utility will read file1 and file2, which should be
    > ordered in the current collating sequence, and produce three
    > text columns as output: lines only in file1; lines only in
    > file2; and lines in both files.
    >
    > If the input files were ordered according to the collating
    > sequence of the current locale, the lines written will be in
    > the collating sequence of the original lines. If not, the
    > results are unspecified.
    >
    > OPTIONS
    > The following options are supported:
    >
    > -1 Suppress the output column of lines unique to file1.
    >
    > -2 Suppress the output column of lines unique to file2.
    >
    > -3 Suppress the output column of lines duplicated in
    > file1 and file2.
    >
    >
    > i.e. file1 and file2 are text files that have one word on
    > each line. Each of the files must be in the current collating
    > sequence (the standard unix utility is 'sort').
    >
    > It will create, by default, an output of three columns:
    >
    > Words-only-in-file1 Words-only-in-file2 Words-in-both-files
    >
    >
    > Say your original wordlist files are list1 and list2. Here's
    > the drill, using 'sort' to make sure that they are sorted right
    > for comm.
    >
    > sort list1 > list1.sorted
    > sort list2 > list2.sorted
    > comm list1.sorted list2.sorted > commoutput
    >
    > Then examine commoutput with your favorite text editor.
    >
    > You can use the optional flags as shown to suppress one or more of
    > the columns of output.
    >
    > Good luck.
    >
    > Ken

    From: "Darren Pearce" <Darren.Pearce@sussex.ac.uk>
    To: "Tine Lassen" <tine.lassen@tdcadsl.dk>
    Cc: "Miles Osborne" <miles@inf.ed.ac.uk>; "Otto Lassen" <otto@lassen.mail.dk>; <radev@umich.edu>; <CORPORA@HD.UIB.NO>
    > A more efficient version that doesn't require sorting the concatenated
    > list uses the Unix command 'comm':
    >
    > % sort list1 > list1.sorted
    > % sort list2 > list2.sorted
    >
    > To see those items that occur in list1 but not in list2, use:
    >
    > % comm -2 -3 list1.sorted list2.sorted
    >
    > To see those items that occur in list2 but not in list1 use:
    >
    > % comm -1 -3 list1.sorted list2.sorted
    >
    > If needed, to see those items that appear in both lists, use:
    >
    > % comm -1 -2 list1.sorted list2.sorted
    >
    > The sorting commands could use 'uniq', but assuming the word lists
    > contain no repeats in the first place, this isn't necessary. The lists
    > may already be sorted too, which would mean the first two commands
    > wouldn't be necessary at all. (You can check whether they are sorted with
    > 'sort -c')
    >
    > :Darren.

    ----- Original Message -----
    From: "Lluís Padró" <padro@lsi.upc.es>
    > sort list1 > list1.sorted
    > sort list2 > list2.sorted
    > join -v1 list1.sorted list2.sorted
    >
    > (if you use -v2 instead, you'll get words in list2 and not in list1)
    >
    > best
    > --
    > ------------------------------------------------------------------------
    > * Lluís Padró i Cirera * UNIVERSITAT POLITÈCNICA DE CATALUNYA


