Re: [Corpora-List] Comparing files

From: Vlado Keselj (vlado@cs.dal.ca)
Date: Sun Nov 16 2003 - 14:30:29 MET

Next message: Ken Beesley: "Re: [Corpora-List] Comparing files"

Previous message: Bob Krovetz: "Re: [Corpora-List] Comparing files"
In reply to: radev@umich.edu: "Re: [Corpora-List] Comparing files"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, 15 Nov 2003 radev@umich.edu wrote:

A similar question was asked about 2.5 years ago on the corpora-list.
(Is this a candidate for FAQ?) This was my answer:

Date: Fri Apr 20 2001 - 17:04:03 MET DST
Subject: Re: Corpora: FW: help - comparing word lists

On Unix, Linux and similar systems, you can sort both lists and use comm, e.g.:

    sort -u < list1 > list1.sorted
    sort -u < list2 > list2.sorted
    comm -23 list1.sorted list2.sorted

It will output the words that are on list1 but not on list2.
Both commands are pretty efficient: sort copes well with large files, and
comm needs only a single pass over its two sorted inputs.
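
As a rough sketch of the same approach, the other comm column options give the two remaining comparisons (words only on list2, and words on both lists). The sample words and the output file names are made up for illustration, and setting LC_ALL=C is an assumed precaution so that sort and comm agree on the ordering; none of this is from the original message.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

# Assumption: byte-order collation, so sort and comm use the same ordering.
export LC_ALL=C

# Hypothetical sample lists; substitute your own list1 and list2.
printf '%s\n' cat dog cat mouse > list1
printf '%s\n' cat rabbit elephant rabbit > list2

# Sort and remove duplicates, as in the commands above.
sort -u < list1 > list1.sorted
sort -u < list2 > list2.sorted

# comm prints three columns: only in file 1, only in file 2, in both.
# Each digit after the dash suppresses the column with that number.
comm -23 list1.sorted list2.sorted > only_in_list1   # keep column 1
comm -13 list1.sorted list2.sorted > only_in_list2   # keep column 2
comm -12 list1.sorted list2.sorted > in_both         # keep column 3
```

With these sample lists, only_in_list1 holds dog and mouse, only_in_list2 holds elephant and rabbit, and in_both holds cat.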

Vlado

On Fri, 20 Apr 2001, Wiesheu, Martin wrote:

> hello out there,
>
> could anyone help me on the following question?:
>
> is there any tool or method to efficiently compare two very long word lists
> to see what words are on one list and not on the other?
>
> thanks,
>
> martin
>
>
> Martin Wiesheu
> ZGS Research
> COMMERZBANK Securities
>
> Tel. + 49 - 69 - 136 43730
> Fax. + 49 - 69 - 136 27445

> Here is an example
>
> one:
> ==========
> cat
> dog
> cat
> mouse
>
> two:
> ==========
> cat
> rabbit
> elephant
> rabbit
>
> output:
> ==========
> 3 cat
> 2 mouse
> 2 dog
> 1 rabbit
> 1 elephant
>
>
> Words with a count of 3 appear in both "one" and "two".
> Words with a count of 2 appear in "one" only.
> Words with a count of 1 appear in "two" only.
>
> --
> Drago
>
>
> Miles Osborne wrote:
> >
> > that's far too slow -use a hash table instead.
> >
> > now, this wouldn't be homework, would it?
> >
> > Miles
> >
> > Quoting Otto Lassen <otto@lassen.mail.dk>:
> >
> > > Hi
> > > That could be done in any language:
> > > 1. sort the two lists
> > > 2. compare them word for word
> > > 3. output words which are not in both lists
> > > Regards
> > > Otto Lassen
> > >
> > > At 21:54 15-11-2003 +0100, you wrote:
> > > >Hi,
> > > >
> > > >I'm doing a project that involves comparing two very large word lists
> > > >(~40.000 and 70.000 words). What I need to find out, is which words are on
> > > >one list and not on the other (and/or vice versa).
> > > >Can anyone give me a hint as to how to do this? (I was thinking; maybe a
> > > >perl script?)
> > > >
> > > >Any help will be greatly appreciated.
> > > >Best,
> > > >Tine Lassen
> > >
> > >
> >
> >
>
>
> --
> Dragomir R. Radev radev@umich.edu
> Assistant Professor of Information, Electrical Engineering and
> Computer Science, and Linguistics, the University of Michigan, Ann Arbor
> Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev
>
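
Two small sketches related to the quoted thread above. The first is one possible way to get count-coded output of the shape shown in the "one"/"two" example; the command that produced that output is not quoted here, so this pipeline is an assumption, not the original. The second illustrates the hash table suggestion using awk's associative arrays. The output file names (counts, only_in_one) are hypothetical.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

# Assumption: byte-order collation, for a predictable sort order.
export LC_ALL=C

# Sample data copied from the "one" and "two" lists in the example above.
printf '%s\n' cat dog cat mouse > one
printf '%s\n' cat rabbit elephant rabbit > two

# Sketch 1: count-coded output.
# Each distinct word of "one" is emitted twice and each distinct word of
# "two" once, so uniq -c yields 3 (both), 2 ("one" only) or 1 ("two" only).
{ sort -u one; sort -u one; sort -u two; } | sort | uniq -c | sort -k1,1nr > counts

# Sketch 2: hash table lookup, no sorting needed.
# While reading "two" (the first file), each word becomes a key of "seen".
# While reading "one", a word is printed once if it is not a key of "seen".
# Caveat: the NR == FNR test assumes "two" is not empty.
awk 'NR == FNR { seen[$0]; next }
     !($0 in seen) && !printed[$0]++' two one > only_in_one
```

The counts file has the same count/word pairs as the example (with leading padding from uniq -c, and ties possibly in a different order). The awk version keeps the words in their original input order, and swapping the two file arguments gives the words that are only in "two".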


This archive was generated by hypermail 2b29 : Sun Nov 16 2003 - 14:33:54 MET