Re: Regex on a huge text

Discussion in 'Python' started by Medardo Rodriguez, Aug 22, 2008.

  1. On Fri, Aug 22, 2008 at 11:24 AM, Dan <> wrote:
    > I'm looking on how to apply a regex on a pretty huge input text (a file
    > that's a couple of gigabytes). I found finditer which would return results
    > iteratively which is good but it looks like I still need to send a string
    > which would be bigger than my RAM. Is there a way to apply a regex directly
    > on a file?
    >
    > Any help would be appreciated.



    You can call the POSIX *grep* utility.
    But if every match of the regex is confined to a single line of that
    file:
    #<code>
    res = []
    with open(filename) as f:  # file() no longer exists in Python 3
        for line in f:
            res.extend(getmatches(regex, line))
    # Of course, "getmatches" only describes the concept.
    #</code>
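
    One possible concrete form of the idea above, as a sketch only: the
    helper name iter_line_matches and its parameters are made up for
    illustration, and it assumes the file is text in a known encoding. It
    yields matches lazily instead of collecting them in a list, so memory
    use stays bounded by the longest line.

```python
import re


def iter_line_matches(path, pattern, encoding="utf-8"):
    """Yield (line_number, matched_text) for every match, one line at a time.

    Hypothetical helper: only valid when no match can span a line break.
    """
    regex = re.compile(pattern)
    # errors="replace" keeps undecodable bytes from aborting a long scan.
    with open(path, encoding=encoding, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            for m in regex.finditer(line):
                yield lineno, m.group(0)
```

    Iterating over the file object reads one line at a time, so only the
    current line is held in memory.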

    Regards
    Medardo Rodriguez, Aug 22, 2008
    #1

  2. Medardo Rodriguez

    John Machin Guest

    On Aug 23, 6:19 am, "Medardo Rodriguez" <> wrote:
    > On Fri, Aug 22, 2008 at 11:24 AM, Dan <> wrote:
    > > I'm looking on how to apply a regex on a pretty huge input text (a file
    > > that's a couple of gigabytes). I found finditer which would return results
    > > iteratively which is good but it looks like I still need to send a string
    > > which would be bigger than my RAM. Is there a way to apply a regex directly
    > > on a file?

    >
    > > Any help would be appreciated.

    >
    > You can call *grep* posix utility.
    > But if the regex's matches are possible only inner the context of a
    > line of that file:
    > #<code>

    (snip)
    > #</code>


    Docs:
    """
    mmap — Memory-mapped file support

    Memory-mapped file objects behave like both strings and like file
    objects. Unlike normal string objects, however, these are mutable. You
    can use mmap objects in most places where strings are expected; for
    example, you can use the re module to search through a memory-mapped
    file.
    """
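
    A minimal sketch of what the quoted documentation describes, assuming
    Python 3, where an mmap object behaves like a bytes-like object, so the
    pattern has to be a bytes pattern. The function name iter_mmap_matches
    is hypothetical. Note that mapping an empty file raises ValueError.

```python
import mmap
import re


def iter_mmap_matches(path, pattern):
    """Yield (byte_offset, matched_bytes) for each match in a mapped file.

    `pattern` must be a bytes pattern such as rb"id=\\d+".
    """
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in regex.finditer(mm):
                yield m.start(), m.group(0)
```

    The operating system pages the file in on demand, so the data does not
    have to fit in RAM, although it does have to fit in the address space of
    the process.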
    John Machin, Aug 22, 2008
    #2

  3. On Fri, 22 Aug 2008 18:56:51 -0300, John Machin <> wrote:
    > On Aug 23, 6:19 am, "Medardo Rodriguez" <> wrote:
    >> On Fri, Aug 22, 2008 at 11:24 AM, Dan <> wrote:
    >> > I'm looking on how to apply a regex on a pretty huge input text (a file
    >> > that's a couple of gigabytes). I found finditer which would return results
    >> > iteratively which is good but it looks like I still need to send a string
    >> > which would be bigger than my RAM. Is there a way to apply a regex directly
    >> > on a file?

    >
    > Docs:
    > """
    > mmap — Memory-mapped file support
    >
    > Memory-mapped file objects behave like both strings and like file
    > objects. Unlike normal string objects, however, these are mutable. You
    > can use mmap objects in most places where strings are expected; for
    > example, you can use the re module to search through a memory-mapped
    > file.
    > """


    That is still limited by the virtual address space available to a user process: 2 GB or 3 GB depending on the OS (assuming a 32-bit OS).
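
    Where the address space is too small for a single mapping, one common
    workaround is to scan the file in chunks and carry an overlapping tail
    from one chunk to the next. The sketch below is an assumption-laden
    illustration, not something proposed in this thread: the function name
    and the chunk_size and max_match_len parameters are hypothetical, and it
    is only correct if no match is longer than max_match_len bytes. Anchors
    and lookbehind may also behave differently at a cut point, because each
    buffer start looks like the start of the input to the regex engine.

```python
import re


def iter_chunked_matches(path, pattern, chunk_size=1 << 20, max_match_len=4096):
    """Yield (file_offset, matched_bytes), reading the file in chunks.

    Assumes no match is longer than max_match_len bytes.
    """
    regex = re.compile(pattern)  # bytes pattern, e.g. rb"id=\d+"
    with open(path, "rb") as f:
        buf = b""
        base = 0  # file offset of buf[0]
        while True:
            chunk = f.read(chunk_size)
            eof = not chunk
            buf += chunk
            # A match starting at or after `limit` could still be extended by
            # data not read yet, so it is deferred to the next round.
            limit = len(buf) if eof else max(0, len(buf) - max_match_len)
            pos = 0  # end of the last match reported from this buffer
            for m in regex.finditer(buf):
                if m.start() >= limit:
                    break
                yield base + m.start(), m.group(0)
                pos = m.end()
            if eof:
                break
            # Drop everything already settled; keep the unsettled tail.
            keep_from = max(pos, limit)
            base += keep_from
            buf = buf[keep_from:]
```

    Memory use is roughly chunk_size plus max_match_len, regardless of the
    size of the file.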

    --
    Gabriel Genellina
    Gabriel Genellina, Aug 24, 2008
    #3
  4. Medardo Rodriguez

    Paddy Guest

    On Aug 22, 9:19 pm, "Medardo Rodriguez" <> wrote:
    > On Fri, Aug 22, 2008 at 11:24 AM, Dan <> wrote:
    > > I'm looking on how to apply a regex on a pretty huge input text (a file
    > > that's a couple of gigabytes). I found finditer which would return results
    > > iteratively which is good but it looks like I still need to send a string
    > > which would be bigger than my RAM. Is there a way to apply a regex directly
    > > on a file?

    >
    > > Any help would be appreciated.

    >
    > You can call *grep* posix utility.
    > But if the regex's matches are possible only inner the context of a
    > line of that file:
    > #<code>
    > res = []
    > with file(filename) as f:
    >     for line in f:
    >         res.extend(getmatches(regex, line))
    > #  Of course "getmatches" describes the concept.
    > #</code>
    >
    > Regards


    Try to pre-filter your file line by line to cut it down, then
    apply a further filter to the result.

    For example, if you were looking for consecutive SPAM records with the
    same Name field, you might first extract only the SPAM records
    from the gigabytes of input, leaving something more manageable in
    which to search for consecutive Name fields.
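
    The two-stage idea could be sketched as chained generators. The record
    layout here is invented purely for illustration: each line is assumed to
    start with its record type and to contain a field written as Name=value.
    The function names are hypothetical as well.

```python
import re
from itertools import groupby

SPAM_LINE = re.compile(r"^SPAM\b")        # hypothetical record marker
NAME_FIELD = re.compile(r"\bName=(\S+)")  # hypothetical field syntax


def spam_names(lines):
    """Stage 1: keep only SPAM records and reduce each one to its Name."""
    for line in lines:
        if SPAM_LINE.match(line):
            m = NAME_FIELD.search(line)
            if m:
                yield m.group(1)


def repeated_names(names):
    """Stage 2: yield (name, run_length) for names repeated back to back."""
    for name, run in groupby(names):
        count = sum(1 for _ in run)
        if count > 1:
            yield name, count
```

    Passing an open file object, as in repeated_names(spam_names(f)), streams
    the data through both stages without ever holding more than one line and
    one counter in memory.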

    - Paddy.
    Paddy, Aug 24, 2008
    #4
