Re: Corpora: efficient string matching: summary of replies

Martin Kay (kay@parc.xerox.com)
Tue, 14 Apr 1998 09:41:36 PDT

Three minor remark on your summary of meesages on efficient retrieval of
strings in corpora.

1. You say
PAT trees are binary trees, so they take more space than arrays.
On the other hand, their construction is of order O(n), whereas
arrays take order O(n log(n)).
As I think I said in my note, this is not quite right and, in particular
cases can be very wrong. The n log(n) estimate is based on the idea that
comparison of strings in sorting as a unit-time operation. But, if
there are any long repeated strings---and there will be in a
sufficiently long text---the comparison can be of order n itself.

2. There is no point in balancing a Patricia tree. It would take a text of
quite extrodinary perversity to give rise to a tree that was importantly
out of balance.

3. I save the space overhead while building in linear time by building a
tree and then converting to an array!

--Martin