Re: [Corpora-List] Perl efficiency (Re: N-gram string extraction)

From: Sean Slattery (Sean.Slattery@cs.cmu.edu)
Date: Thu Aug 29 2002 - 11:50:53 MET DST


    A few observations to add to Stefan's excellent email.

    First up, I think writing in Perl is the right solution for
    something you need done now. For text processing tasks, nothing that
    I've seen (Java, C, C++) even comes close in terms of development
    time. If you're willing to spend an order of magnitude more time
    writing code, then you will of course get a solution that will run
    faster and smaller, but you're trading off your time (valuable) for
    CPU cycles (cheap). The point at which that tradeoff becomes
    worthwhile will vary a lot from person to person.
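    To make the development-time point concrete, here is a rough sketch
    of the sort of thing Perl does in a dozen lines: counting word
    bigrams from standard input into a string->number hash (the choice
    of bigrams and the output format are purely illustrative):

        use strict;

        my $n = 2;                  # bigrams, purely for illustration
        my (%count, @window);

        while (my $line = <STDIN>) {
            for my $word (split ' ', $line) {
                push @window, $word;
                shift @window while @window > $n;
                $count{"@window"}++ if @window == $n;
            }
        }

        # print the n-grams, most frequent first
        for my $ngram (sort { $count{$b} <=> $count{$a} } keys %count) {
            print "$count{$ngram}\t$ngram\n";
        }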

    Stefan's point about Perl's memory usage is well taken. In this case
    though, when you know that the data structure is a simple
    string->number hash and that you'll be dealing with a great many
    keys, you can simply tie the hash to a disk file, moving the memory
    usage from RAM to disk. The code will never swap, and you could
    potentially handle many more n-grams than an in-core C/C++/Java
    implementation could. There may be nice libraries in those languages
    for doing the equivalent of a tie, but are they as easy to use?
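    As a minimal sketch of what that tie looks like (assuming the
    DB_File module, the Berkeley DB bindings shipped with most Perl
    installations, and a file name picked purely for illustration):

        use strict;
        use DB_File;

        # The hash now lives in ngrams.db on disk, so RAM usage stays
        # flat no matter how many keys accumulate.
        my %count;
        tie %count, 'DB_File', 'ngrams.db', O_CREAT|O_RDWR, 0644, $DB_HASH
            or die "Cannot tie ngrams.db: $!";

        # Increment counts exactly as you would with an in-memory hash.
        my @ngrams = ('of the', 'in the', 'of the');   # illustrative only
        $count{$_}++ for @ngrams;

        untie %count;   # flush and close the file

    Any of the DBM modules will do for the tie; DB_File just tends to
    cope better than the older DBM backends once the file gets large.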

    Anyhow - "Programming Perl" has lots more to say about various forms
    of efficiency in Perl. If you're loath to spend time crafting code in
    languages less suited to processing text, then have a browse through
    the Efficiency section in Chapter 25 - you may not need to give up as
    much speed as you think you do.

    S.


