Text file splitter, date/time field

Discussion in 'Perl Misc' started by originals@gmail.com, Jan 31, 2006.

  1. Guest

    Sorry to be such a leech!

    I need to split an archive of a discussion forum saved as one huge txt
    file into individual txt files--one per message.

    Posts are stamped with a date and time, messages can be of any length.
    Posters are sometimes addressed by their time (as it was an anon forum)
    but the full time/date stamp is always unique to the start of a
    message.

    New to perl but have installed activeperl and can run a .pl script from
    the command line.

    If anyone could provide a script for this job, I'd really appreciate
    it.

    05.11.01 10:01 AM

    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    05.11.01 10:41 AM

    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    05.12.01 10:50 PM

    10:01, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    you get the idea.

    Thanks.

    ps I won't just use it and dump it, I will learn from it!!!! cheers.
     
    , Jan 31, 2006
    #1

  2. wrote:
    >
    > I need to split an archive of a discussion forum saved as one huge txt
    > file into individual txt files--one per message.
    >
    > Posts are stamped with a date and time, messages can be of any length.
    > Posters are sometimes addressed by their time (as it was an anon forum)
    > but the full time/date stamp is always unique to the start of a
    > message.
    >
    > New to perl but have installed activeperl and can run a .pl script from
    > the command line.
    >
    > If anyone could provide a script for this job, I'd really appreciate
    > it.


    #!/usr/bin/perl
    use warnings;
    use strict;

    while ( <> ) {

        if ( /^\d{2}\.\d{2}\.\d{2} \d{2}:\d{2} [AP]M$/ ) {
            chomp;
            tr/ /_/;
            open OUT, '>', $_ or die "Cannot open $_: $!";
            next;
        }

        print OUT if fileno OUT;
    }

    __END__



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Jan 31, 2006
    #2

  3. Guest

    John, thanks for this. It's only outputting empty files (0k, no
    extension and no content when opened in notepad). I've tried it with
    the sample I posted (saved as plain text) just to make sure and same
    result. Maybe you can tweak. In the meantime I'll see if I can get
    anywhere using the "if ( /^\d{2}\.\d{2}\.\d{2} \d{2}:\d{2} [AP]M$/ )"
    in a similar script I've found that splits after a keyword.

    many thanks
     
    , Feb 1, 2006
    #3
  4. Guest

    wrote:
    > John, thanks for this. It's only outputting empty files


    That's odd - John's script should not have produced any type of file
    for you - not because there's anything wrong with his script, but
    because you're on a Windows machine, and you want to create files named
    as per the timestamp, which include double-points (aka "colon", ie ":")
    which is an illegal character on Windows filesystems. It should have
    failed on the attempt to create the file for writing.

    John's script works perfectly for me on UNIX, and perfectly on Windows
    if I create a slightly modified version of the filename, such as:

    (my $file = $_) =~ s/\:/_/g;
    open OUT, '>',$file or die "Cannot open $_: $!";

    --
    http://DavidFilmer.com
     
    , Feb 1, 2006
    #4
  5. wrote:
    > wrote:
    >>John, thanks for this. It's only outputting empty files

    >
    > That's odd - John's script should not have produced any type of file
    > for you - not because there's anything wrong with his script, but
    > because you're on a Windows machine, and you want to create files named
    > as per the timestamp, which include double-points (aka "colon", ie ":")
    > which is an illegal character on Windows filesystems. It should have
    > failed on the attempt to create the file for writing.
    >
    > John's script works perfectly for me on UNIX, and perfectly on Windows
    > if I create a slightly modified version of the filename, such as:
    >
    > (my $file = $_) =~ s/\:/_/g;
    > open OUT, '>',$file or die "Cannot open $_: $!";


    Thanks, I don't have Windows to test on. Actually if you just changed the line:

    tr/ /_/;

    to:

    tr/ :/_/;

    it would have done the same.
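A minimal sketch of the file-name mapping discussed above. The helper name `stamp_to_filename` and the `.txt` extension are assumptions for illustration, not part of John's script. It shows why `tr/ :/_/` covers both characters: when the replacement list is shorter than the search list, its last character is reused. A likely explanation for the empty files (not confirmed in the thread) is that NTFS treats the part of a name after a colon as an alternate data stream, so the text lands in a stream rather than in the visible file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a timestamp line into a Windows-safe file name.
sub stamp_to_filename {
    my ($stamp) = @_;
    chomp $stamp;
    # The replacement list "_" is shorter than the search list " :", so its
    # last character is reused: both space and colon become an underscore.
    ( my $name = $stamp ) =~ tr/ :/_/;
    return "$name.txt";    # the .txt extension is an assumption
}

print stamp_to_filename("05.11.01 10:01 AM\n"), "\n";
```

Copying into `$name` before translating leaves the original line untouched, which matters if the timestamp is also needed in the output.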


    BTW:

    > open OUT, '>',$file or die "Cannot open $_: $!";

                    ^^^^^                     ^^

    If you are going to change the variable in the open() you should change it in
    the die() as well.
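Putting both points together, here is one way the splitter could look with a lexical filehandle and a single `$path` variable used in both `open` and `die`. This is a sketch, not John's script: the name `split_by_timestamp`, the `.txt` extension, the output directory argument, and the tolerance for trailing whitespace after the timestamp are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from $in and write one file per message
# into $dir. Returns the list of file names created.
sub split_by_timestamp {
    my ( $in, $dir ) = @_;
    my ( $out, @created );
    while ( my $line = <$in> ) {
        # \s*$ tolerates trailing whitespace; the thread's pattern ends in $
        if ( $line =~ /^(\d{2}\.\d{2}\.\d{2} \d{2}:\d{2} [AP]M)\s*$/ ) {
            ( my $name = $1 ) =~ tr/ :/_/;
            my $path = "$dir/$name.txt";
            if ($out) { close $out or die "Cannot close: $!"; }
            # The same variable appears in open() and in the error message.
            open $out, '>', $path or die "Cannot open $path: $!";
            push @created, $path;
            next;
        }
        print {$out} $line if $out;    # lines before the first stamp are dropped
    }
    if ($out) { close $out or die "Cannot close: $!"; }
    return @created;
}

# Usage sketch: perl split.pl archive.txt
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    split_by_timestamp( $in, '.' );
}
```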



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Feb 2, 2006
    #5
  6. Throw Guest

    wrote:

    > I need to split an archive of a discussion forum saved as one huge txt
    > file into individual txt files--one per message.
    >
    > Posts are stamped with a date and time, messages can be of any length.
    > Posters are sometimes addressed by their time (as it was an anon forum)
    > but the full time/date stamp is always unique to the start of a
    > message.


    G'day everyone

    The solution given to this question is exactly what I'm looking for,
    except I need to split a concatenated PHP file. Basically, I have one
    large text file into which I have copied PHP file after PHP file, and
    now I want to split them up again. The PHP file always begins with

    <?php

    and always ends with

    ?>

    so it should be fairly easy to adjust the above script, shouldn't it?
    However, I have tried and failed. Also, what would the command line be
    for it? Can anyone help me with the adaptation?

    Thanks a lot
    Samuel (aka throw aka leuce aka voetleuce)
     
    Throw, Feb 17, 2006
    #6
  7. "Throw" <> wrote in news:1140187205.361247.173780
    @g43g2000cwa.googlegroups.com:

    > except I need to split a concatenated PHP file. Basically, I have one
    > large text file into which I have copied PHP file after PHP file, and
    > now I want to split them up again. The PHP file always begins with
    >
    > <?php
    >
    > and always ends with
    >
    > ?>
    >
    > so it should be fairly easy to adjust the above script, shouldn't it?
    > However, I have tried and failed.


    What have you tried and what has failed?

    Please read the posting guidelines for this group. They provide you with
    invaluable information you can use to help yourself as well as helping us
    help you.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Feb 17, 2006
    #7
  8. Throw <> wrote:

    > so it should be fairly easy to adjust the above script, shouldn't it?
    > However, I have tried and failed.



    What have you tried?

    If you show us your broken code we could help you fix it.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 18, 2006
    #8
  9. Throw wrote:
    > wrote:
    >
    > > I need to split an archive of a discussion forum saved as one huge txt
    > > file into individual txt files--one per message.
    > >
    > > Posts are stamped with a date and time, messages can be of any length.
    > > Posters are sometimes addressed by their time (as it was an anon forum)
    > > but the full time/date stamp is always unique to the start of a
    > > message.

    >
    > G'day everyone
    >
    > The solution given to this question is exactly what I'm looking for,
    > except I need to split a concatenated PHP file. Basically, I have one
    > large text file into which I have copied PHP file after PHP file, and
    > now I want to split them up again. The PHP file always begins with
    >
    > <?php
    >
    > and always ends with
    >
    > ?>
    >
    > so it should be fairly easy to adjust the above script, shouldn't it?
    > However, I have tried and failed. Also, what would the command line be
    > for it? Can anyone help me with the adaptation?


    check out this FAQ:
    http://groups.google.com/group/comp...65c4938924b/0486c9f8a384c887#0486c9f8a384c887
     
    it_says_BALLS_on_your_forehead, Feb 18, 2006
    #9
  10. it_says_BALLS_on_your_forehead wrote:
    > Throw wrote:
    > > wrote:
    > >
    > > > I need to split an archive of a discussion forum saved as one huge txt
    > > > file into individual txt files--one per message.
    > > >
    > > > Posts are stamped with a date and time, messages can be of any length.
    > > > Posters are sometimes addressed by their time (as it was an anon forum)
    > > > but the full time/date stamp is always unique to the start of a
    > > > message.

    > >
    > > G'day everyone
    > >
    > > The solution given to this question is exactly what I'm looking for,
    > > except I need to split a concatenated PHP file. Basically, I have one
    > > large text file into which I have copied PHP file after PHP file, and
    > > now I want to split them up again. The PHP file always begins with
    > >
    > > <?php
    > >
    > > and always ends with
    > >
    > > ?>
    > >
    > > so it should be fairly easy to adjust the above script, shouldn't it?
    > > However, I have tried and failed. Also, what would the command line be
    > > for it? Can anyone help me with the adaptation?

    >
    > check out this FAQ:
    > http://groups.google.com/group/comp...65c4938924b/0486c9f8a384c887#0486c9f8a384c887


    remember when crafting your solution, if you want to use John's
    example, you must have some sort of unique identifier for each file you
    want to write. since there's no unique timestamp, i would suggest an
    iterator in the while loop. If you couple John's example with the
    information in the FAQ linked to above, the answer should be obvious.
     
    it_says_BALLS_on_your_forehead, Feb 18, 2006
    #10
  11. Throw Guest

    A. Sinan Unur wrote:

    > "Throw" <> wrote in news:1140187205.361247.173780
    > @g43g2000cwa.googlegroups.com:


    > > except I need to split a concatenated PHP file. Basically, I have one
    > > large text file into which I have copied PHP file after PHP file, and
    > > now I want to split them up again. The PHP file always begins with
    > > <?php


    > What have you tried and what has failed?


    I have tried the following if-lines and other variations thereof:

    if ( \<?php ) {
    if ( /\<\?php ) {
    if ( /\<\?\p\h\p ) {
    if ( \<?php [AP]M$/ ) {
    if ( /\<\?php [AP]M$/ ) {
    if ( /\<\?\p\h\p [AP]M$/ ) {
    if ( \<?php ) {
    if ( /^\<\?php ) {
    if ( /^\<\?\p\h\p ) {
    if ( \<?php [AP]M$/ ) {
    if ( /^\<\?php [AP]M$/ ) {
    if ( /^\<\?\p\h\p [AP]M$/ ) {

    Does that answer your question? The problem, I think it should be
    clear, is that I do not understand Perl regex syntax, and am therefore
    forced to resort to brute-force methods.

    > Please read the posting guidelines for this group. They provide you with
    > invaluable information you can use to help yourself as well as helping us
    > help you.


    None of said posting guidelines helps me to help myself nor does it
    help you any more to help me than my initial post already does... don't
    you agree?

    Samuel
     
    Throw, Feb 21, 2006
    #11
  12. "Throw" <> wrote in
    news::

    > A. Sinan Unur wrote:
    >
    >> "Throw" <> wrote in news:1140187205.361247.173780
    >> @g43g2000cwa.googlegroups.com:

    >
    >> > except I need to split a concatenated PHP file. Basically, I have
    >> > one large text file into which I have copied PHP file after PHP
    >> > file, and now I want to split them up again. The PHP file always
    >> > begins with <?php

    >
    >> What have you tried and what has failed?

    >
    > I have tried the following if-lines and other variations thereof:
    >
    > if ( \<?php ) {
    > if ( /\<\?php ) {
    > if ( /\<\?\p\h\p ) {
    > if ( \<?php [AP]M$/ ) {
    > if ( /\<\?php [AP]M$/ ) {
    > if ( /\<\?\p\h\p [AP]M$/ ) {
    > if ( \<?php ) {
    > if ( /^\<\?php ) {
    > if ( /^\<\?\p\h\p ) {
    > if ( \<?php [AP]M$/ ) {
    > if ( /^\<\?php [AP]M$/ ) {
    > if ( /^\<\?\p\h\p [AP]M$/ ) {
    >
    > Does that answer your question?


    It tells me that you are not approaching the problem methodically.

    > The problem, I think it should be clear, is that I do not understand
    > Perl regex syntax,


    perldoc perlretut

    perldoc perlre

    > and am therefore forced to resort to brute-force methods.


    http://perl.plover.com/Questions.html

    >> Please read the posting guidelines for this group. They provide you
    >> with invaluable information you can use to help yourself as well as
    >> helping us help you.

    >
    > None of said posting guidelines helps me to help myself nor does it
    > help you any more to help me than my initial post already does...
    > don't you agree?


    No, I don't.

    If you had at least attempted to post a short but complete program, read
    the documentation along the way, that would have gone a long way towards
    helping you help yourself, and help us help you.

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Feb 21, 2006
    #12
  13. "Throw" <> writes:

    > The solution given to this question is exactly what I'm looking for,
    > except I need to split a concatenated PHP file. Basically, I have one
    > large text file into which I have copied PHP file after PHP file, and
    > now I want to split them up again. The PHP file always begins with
    >
    > <?php
    >
    > and always ends with
    >
    > ?>


    There are two main ways to do this. Either read the entire file into
    one variable, and then do a regex within a while loop on the entire
    thing, treating it as one line and looking for your sample text (what
    I think of as the brute force approach, since it is the quickest to
    code but requires reading the entire file into memory at once, which
    could be bad for a very large file):

    my $line = join '', <STDIN>;
    my $count = 1;
    while ( $line =~ /(<\?php.+?\?>\s*)/gs ) {
        my $chunk = $1;
        open my $out, ">", "$count.php" or die $!;
        print $out $chunk;
        close $out or die $!;
        $count++;
    }

    The other option would be to read through the original file line by
    line, starting a new file when you hit a <?php line, and closing it
    when you hit ?>, writing all the lines between to said file. It's
    similar to the above, so you can probably work it out for yourself.
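The line-by-line option described above could be sketched as follows. The helper name `split_php_sections` and the output directory argument are assumptions, and the sketch assumes at most one section starts on any given line. Only one line is held in memory at a time, which is the advantage over slurping the whole file.

```perl
use strict;
use warnings;

# Hypothetical helper: copy each <?php ... ?> section from $in into its own
# numbered file under $dir. Returns how many files were written.
sub split_php_sections {
    my ( $in, $dir ) = @_;
    my $count = 0;
    my $out;    # defined only while we are inside a section
    while ( my $line = <$in> ) {
        if ( !$out && $line =~ /<\?php/ ) {
            $count++;
            my $path = "$dir/$count.php";
            open $out, '>', $path or die "Cannot open $path: $!";
        }
        next unless $out;    # skip text between sections
        print {$out} $line;
        if ( $line =~ /\?>/ ) {
            close $out or die "Cannot close file $count: $!";
            undef $out;
        }
    }
    return $count;
}
```

To run it from the command line, one could add a call such as `split_php_sections( \*STDIN, '.' );` at the end of the script and invoke it as `perl split_php.pl < combined.txt` (both file names are placeholders).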


    --
    Aaron --
    http://360.yahoo.com/aaron_baugher
     
    Aaron Baugher, Feb 21, 2006
    #13
  14. Throw <> wrote:
    >
    > A. Sinan Unur wrote:
    >
    >> "Throw" <> wrote in news:1140187205.361247.173780
    >> @g43g2000cwa.googlegroups.com:

    >
    >> > except I need to split a concatenated PHP file.



    I suspect that what you need done is a great deal different
    from the subject of this thread.

    The OP has markers only at the beginning of records, you have
    them at the beginning and at the end.

    The OP's markers are variable length, yours are fixed strings.

    The OP's output filenames are derived from what is matched, you
    haven't indicated any way of naming the files.


    >> > Basically, I have one
    >> > large text file into which I have copied PHP file after PHP file, and
    >> > now I want to split them up again. The PHP file always begins with
    >> > <?php



    >> What have you tried and what has failed?

    >
    > I have tried the following if-lines and other variations thereof:
    >
    > if ( \<?php ) {



    That is not the syntax for the match operator:

    perldoc -f m

    then:

    perldoc perlop

    The match operator starts with either an "m" or a "/" character,
    not a "\" character.


    > if ( /\<\?php ) {



    The match operator ends with a "/" character.

    If you add that character, then it should match just fine,
    though it has one extra backslash that is not needed.
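For comparison, a small sketch of several valid spellings of the match. The sample lines are made up for illustration. The `?` has to be escaped because it is a regex quantifier, while `<` needs no backslash.

```perl
use strict;
use warnings;

# Made-up sample input, one element per line of a file.
my @lines = ( "<?php\n", "echo 'hi';\n", "  <?php // indented\n", "?>\n" );

# grep in scalar context returns the number of matching lines.
my $plain    = grep { /<\?php/ } @lines;     # bare slashes
my $explicit = grep { m/<\?php/ } @lines;    # explicit m
my $braces   = grep { m{<\?php} } @lines;    # m with other delimiters
my $anchored = grep { /^<\?php/ } @lines;    # only at the start of a line

print "$plain $explicit $braces $anchored\n";    # prints "2 2 2 1"
```

The anchored form misses the indented line, which is worth knowing if the concatenated file has leading whitespace before any `<?php`.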


    > if ( /\<\?\p\h\p ) {
    > if ( \<?php [AP]M$/ ) {
    > if ( /\<\?php [AP]M$/ ) {
    > if ( /\<\?\p\h\p [AP]M$/ ) {
    > if ( \<?php ) {
    > if ( /^\<\?php ) {
    > if ( /^\<\?\p\h\p ) {
    > if ( \<?php [AP]M$/ ) {
    > if ( /^\<\?php [AP]M$/ ) {
    > if ( /^\<\?\p\h\p [AP]M$/ ) {
    >
    > Does that answer your question?



    It answers the underlying unspoken question quite well.

    You appear to want to write code in a language that you do not know.

    The implication is that you want us to write your code for you.

    (most especially since you have asked us to write code for you before.)


    > The problem, I think it should be
    > clear, is that I do not understand Perl regex syntax,



    Then you go learn about it before you write it.

    Trying random things will take much more time than learning
    the language that you wish to speak.


    > and am therefore
    > forced to resort to brute-force methods.



    That is absurd.

    If you do not know a language, you go learn the language.

    You can learn about the syntax for the m// operator and for
    Perl's regular expression in the documentation that came with perl.

    If you don't understand some part of those docs, then post a question
    about it here and we will help you understand it.

    We are not likely to read those docs to you though.


    >> Please read the posting guidelines for this group. They provide you with
    >> invaluable information you can use to help yourself as well as helping us
    >> help you.

    >
    > None of said posting guidelines helps me to help myself



    They most certainly do!

    - Check the Perl Frequently Asked Questions (FAQ)

    Since you have a question about pattern matching, you would
    eventually try:

    perldoc -q pattern

    And would have found:

    How can I pull out lines between two patterns that are themselves on
    different lines?

    Which tells you how to do exactly what you need done!


    - Check the other standard Perl docs (*.pod)

    Which describe the syntax for the operator that you want to use.


    - Use an effective followup style

    Wherein you quote what you are commenting on, such as the
    code that you want modified.

    This helps you because it allows more people to examine the problem.

    Many or most readers will just move on to answering the next person's
    question rather than spend time locating the code.


    > nor does it
    > help you any more to help me than my initial post already does...



    - Provide enough information

    (which asks for a short and complete program that we can run
    that illustrates the problem you need solved.)

    If you posted code missing the match operator's closing slash,
    then we could have told you that you were missing the closing slash,
    and one of your problems would have been eliminated straightaway
    rather than here way down-thread.

    Are the "<?php" and "?>" always on separate lines?

    If you had posted data to go with your code, we would have been able
    to see that there was a much better way of solving your problem
    than what appeared in the thread thus far.


    Anyway, here is a short and complete program that *you* can run.

    ----------------------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $cnt = 1;
    while ( <DATA> ) {
        if ( /<\?php/ ) {
            open OUT, '>', "$cnt.php" or die "could not open '$cnt.php' $!";
            $cnt++;
        }
        print OUT if /<\?php/ .. /\?>/;
    }

    __DATA__
    extra stuff
    <?php
    1st PHP section
    ?>
    in-between stuff
    <?php
    2nd PHP section
    ?>
    trailing stuff
    ----------------------------------------
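The `..` in the program above is Perl's range operator used in scalar context, where it acts as a flip-flop: false until the left pattern matches, then true through the line on which the right pattern matches. The sketch below isolates that behaviour by collecting sections in memory instead of writing files. The helper name `php_sections` is an assumption, and the sample lines mirror the `__DATA__` section.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of each <?php ... ?> section found in
# the list of lines passed in.
sub php_sections {
    my @sections;
    for my $line (@_) {
        # While true, the flip-flop yields a sequence number starting at 1,
        # so a value of 1 marks the first line of a new section.
        my $seq = ( $line =~ /<\?php/ ) .. ( $line =~ /\?>/ );
        next unless $seq;
        push @sections, '' if $seq == 1;
        $sections[-1] .= $line;
    }
    return @sections;
}

my @found = php_sections(
    "extra stuff\n",      "<?php\n", "1st PHP section\n", "?>\n",
    "in-between stuff\n", "<?php\n", "2nd PHP section\n", "?>\n",
    "trailing stuff\n",
);
print scalar(@found), " sections\n";    # prints "2 sections"
```

Each `..` keeps its own state between evaluations, so input that ends in the middle of a section would leave the flip-flop "on" for the next call.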


    > don't
    > you agree?



    No.

    You have already used up all of your coupons.

    So long!


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 22, 2006
    #14
  15. Throw Guest

    Tad McClellan wrote:

    > I suspect that what you need done is a great deal different
    > from the subject of this thread.


    IMO it is not, but I'm sorry that you disagree. The OP wanted a script
    for splitting long files, and so did I.

    > The OP has markers only at the beginning of records, you have
    > them at the beginning and at the end.


    True, but that is not relevant, because I don't need to split my file
    at the top and bottom of each PHP file... I only need to split it at
    the top *or* bottom of each PHP file (because the one's bottom is also
    the next one's top, if you see what I mean).

    > The OP's markers are variable length, yours are fixed strings.


    I would have thought that a procedure for fixed strings would be easier
    and simpler to implement than one for variable length. I would have
    thought that some of the characters in the search string are regex code
    for "variable things", which one could simply remove to be left with
    that which refers to a fixed string. At least, this is what the regex
    find functions of other languages that I have dealt with do.

    > The OP's output filenames are derived from what is matched, you
    > haven't indicated any way of naming the files.


    That is true, but I think I would have realised it and probably have
    included the equivalent of a for-next loop (and on how to do that, I
    would probably have searched various Perl forums for existing answers
    to similar questions asked by equally clueless people). Alternatively,
    I may have written a script in a different language (say, AutoIt) which
    creates unique names for each PHP file... although to do that, I would
    have to know how to call the name in the Perl script's find function.

    > It answers the underlying unspoken question quite well.
    > You appear to want to write code in a language that you do not know.


    Yes.

    > The implication is that you want us to write your code for you.


    No. I did not ask for a completely new script. I asked for help with
    the regex only. The script was already in existence, and it required
    very, very little adaptation... so very little in fact that I might
    have been able to figure it out myself if I had the missing
    information.

    > (most especially since you have asked us to write code for you before.)


    I do ask for scripts to be written, yes. If you enjoy writing simple
    scripts to solve problems that haven't been solved before, you're
    welcome to respond. If you do not, then feel free not to respond. I'm
    not asking because I believe I have the right to expect to be helped.
    I'm asking simply on the off-chance that someone might want to help (or
    point me into some direction).

    > That is absurd.
    > If you do not know a language, you go learn the language.


    I do not agree. Sorry, but my purpose is not to learn a single
    language, but to discover a solution to my problem using whatever means
    is available. If it's a Perl script, then good. If it's Java, Python,
    Ruby, AutoIt, VB macro, StarBasic, Tcl, Yabasic, etc... then also good.
    I have limited knowledge of some of these languages, and if I see
    something which I *think* I understand partially, I'll fiddle with it.
    But I won't read the whole manual, and I won't try to learn everything
    there is about the language.

    What you're saying, has some merit, though. Not knowing the entire
    language can be extremely limiting in that you won't be able to solve
    problems when they arise, except "blindly". In the above case, I had
    believed that my only obstacle to success was the regex line of code.

    Before your post, many of the responses I've had to my query had been
    utterly unuseful (but I have no right to complain or blame). Your
    answer about iterations was very useful because it shows me an
    additional error in my thinking and it made me learn more about Perl
    (though not enough to write programs, heh-heh).

    > You can learn about the syntax for the m// operator and for
    > Perl's regular expression in the documentation that came with perl.


    Thanks... now at least I know what to look for.

    > Since you have a question about pattern matching, you would
    > eventually try:
    >
    > perldoc -q pattern


    Thanks. So it's called "pattern matching"...

    > Anyway, here is a short and complete program that *you* can run.


    Thanks.

    > You have already used up all of your coupons.


    Thanks. I'll ask for free scripts again, though. Does that offend
    you? The OP's request didn't seem to, and unlike myself he didn't even
    bother to try anything before asking on the forum (but maybe he's a
    regular here).

    I don't post free script requests and then just sit back and wait for
    the free stuff to roll on in. I post, yes, and then I continue in my
    search elsewhere for other possible solutions to my problem. And when
    I have found a solution, I tell those in my group about it so that they
    too can use it when they encounter that problem in future. I'm sorry
    if this offends you.

    Samuel Murray (aka voetleuce, leuce, throw)
     
    Throw, Mar 3, 2006
    #15
