Text file splitter, date/time field

originals

Sorry to be such a leech!

I need to split an archive of a discussion forum saved as one huge txt
file into individual txt files--one per message.

Posts are stamped with a date and time; messages can be of any length.
Posters are sometimes addressed by their time (as it was an anon forum)
but the full time/date stamp is always unique to the start of a
message.

New to perl but have installed activeperl and can run a .pl script from
the command line.

If anyone could provide a script for this job, I'd really appreciate
it.

05.11.01 10:01 AM

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

05.11.01 10:41 AM

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

05.12.01 10:50 PM

10:01, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

you get the idea.

Thanks.

ps I won't just use it and dump it, I will learn from it!!!! cheers.
 
John W. Krahn

I need to split an archive of a discussion forum saved as one huge txt
file into individual txt files--one per message.

Posts are stamped with a date and time, messages can be of any length.
Posters are sometimes address by their time (as it was an anon forum)
but the full time/date stamp is always unique to the start of a
message.

New to perl but have installed activeperl and can run a .pl script from
the command line.

If anyone could provide a script for this job, I'd really appreciate
it.

#!/usr/bin/perl
use warnings;
use strict;

while ( <> ) {

    if ( /^\d{2}\.\d{2}\.\d{2} \d{2}:\d{2} [AP]M$/ ) {
        chomp;
        tr/ /_/;
        open OUT, '>', $_ or die "Cannot open $_: $!";
        next;
    }

    print OUT if fileno OUT;
}

__END__



John
 
originals

John, thanks for this. It's only outputting empty files (0k, no
extension and no content when opened in Notepad). I've tried it with
the sample I posted (saved as plain text) just to make sure, and got the
same result. Maybe you can tweak it. In the meantime I'll see if I can get
anywhere using the "if ( /^\d{2}\.\d{2}\.\d{2} \d{2}:\d{2} [AP]M$/ )"
line in a similar script I've found that splits after a keyword.

many thanks
 
usenet

John, thanks for this. It's only outputting empty files

That's odd - John's script should not have produced any type of file
for you - not because there's anything wrong with his script, but
because you're on a Windows machine, and you want to create files named
as per the timestamp, which include double-points (aka "colon", ie ":")
which is an illegal character on Windows filesystems. It should have
failed on the attempt to create the file for writing.

John's script works perfectly for me on UNIX, and perfectly on Windows
if I create a slightly modified version of the filename, such as:

(my $file = $_) =~ s/\:/_/g;
open OUT, '>',$file or die "Cannot open $_: $!";
 
John W. Krahn

That's odd - John's script should not have produced any type of file
for you - not because there's anything wrong with his script, but
because you're on a Windows machine, and you want to create files named
as per the timestamp, which include double-points (aka "colon", ie ":")
which is an illegal character on Windows filesystems. It should have
failed on the attempt to create the file for writing.

John's script works perfectly for me on UNIX, and perfectly on Windows
if I create a slightly modified version of the filename, such as:

(my $file = $_) =~ s/\:/_/g;
open OUT, '>',$file or die "Cannot open $_: $!";

Thanks, I don't have Windows to test on. Actually if you just changed the line:

tr/ /_/;

to:

tr/ :/_/;

it would have done the same.


BTW:
open OUT, '>',$file or die "Cannot open $_: $!";
              ^^^^^                     ^^

If you are going to change the variable in the open() you should change it in
the die() as well.



John
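
Pulling the two corrections above together, here is a minimal sketch rather than a drop-in replacement. The names stamp_to_filename and split_messages are made up for illustration, and the ".txt" extension is an assumption based on the original request for txt files. As far as I know, NTFS treats a colon in a file name as the separator for an alternate data stream, which may be why the files looked empty instead of the open() failing outright.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a stamp such as "05.11.01 10:01 AM" into a
# name that is safe on both Windows and UNIX.
sub stamp_to_filename {
    my ($stamp) = @_;
    ( my $name = $stamp ) =~ tr/ :/_/;    # spaces and colons become underscores
    return "$name.txt";                   # the .txt extension is an assumption
}

# Hypothetical helper: read messages from the handle $in and write one
# file per stamp into the directory $dir.
sub split_messages {
    my ( $in, $dir ) = @_;
    my $out;                              # stays undef until the first stamp
    while ( my $line = <$in> ) {
        if ( $line =~ /^\d{2}\.\d{2}\.\d{2} \d{2}:\d{2} [AP]M$/ ) {
            chomp $line;
            my $file = "$dir/" . stamp_to_filename($line);
            # the same variable is used in open() and in die()
            open $out, '>', $file or die "Cannot open $file: $!";
            next;
        }
        print {$out} $line if $out;       # lines before the first stamp are skipped
    }
    close $out if $out;
    return;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    split_messages( $in, '.' );
}
```

Saved under a name of your choosing (say split.pl), it would be run as "perl split.pl archive.txt" and write the message files into the current directory.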
 
Throw

I need to split an archive of a discussion forum saved as one huge txt
file into individual txt files--one per message.

Posts are stamped with a date and time, messages can be of any length.
Posters are sometimes address by their time (as it was an anon forum)
but the full time/date stamp is always unique to the start of a
message.

G'day everyone

The solution given to this question is exactly what I'm looking for,
except I need to split a concatenated PHP file. Basically, I have one
large text file into which I have copied PHP file after PHP file, and
now I want to split them up again. The PHP file always begins with

<?php

and always ends with

?>

so it should be fairly easy to adjust the above script, shouldn't it?
However, I have tried and failed. Also, what would the command line be
for it? Can anyone help me with the adaptation?

Thanks a lot
Samuel (aka throw aka leuce aka voetleuce)
 
A. Sinan Unur

except I need to split a concatenated PHP file. Basically, I have one
large text file into which I have copied PHP file after PHP file, and
now I want to split them up again. The PHP file always begins with

<?php

and always ends with

?>

so it should be fairly easy to adjust the above script, shouldn't it?
However, I have tried and failed.

What have you tried and what has failed?

Please read the posting guidelines for this group. They provide you with
invaluable information you can use to help yourself as well as helping us
help you.

Sinan
 
Tad McClellan

Throw said:
so it should be fairly easy to adjust the above script, shouldn't it?
However, I have tried and failed.


What have you tried?

If you show us your broken code we could help you fix it.
 
it_says_BALLS_on_your_forehead

Throw said:
G'day everyone

The solution given to this question is exactly what I'm looking for,
except I need to split a concatenated PHP file. Basically, I have one
large text file into which I have copied PHP file after PHP file, and
now I want to split them up again. The PHP file always begins with

<?php

and always ends with

?>

so it should be fairly easy to adjust the above script, shouldn't it?
However, I have tried and failed. Also, what would the command line be
for it? Can anyone help me with the adaptation?

check out this FAQ:
http://groups.google.com/group/comp...65c4938924b/0486c9f8a384c887#0486c9f8a384c887
 
it_says_BALLS_on_your_forehead

it_says_BALLS_on_your_forehead said:

remember when crafting your solution, if you want to use John's
example, you must have some sort of unique identifier for each file you
want to write. since there's no unique timestamp, i would suggest an
iterator in the while loop. If you couple John's example with the
information in the FAQ linked to above, the answer should be obvious.
 
Throw

A. Sinan Unur said:
What have you tried and what has failed?

I have tried the following if-lines and other variations thereof:

if ( \<?php ) {
if ( /\<\?php ) {
if ( /\<\?\p\h\p ) {
if ( \<?php [AP]M$/ ) {
if ( /\<\?php [AP]M$/ ) {
if ( /\<\?\p\h\p [AP]M$/ ) {
if ( \<?php ) {
if ( /^\<\?php ) {
if ( /^\<\?\p\h\p ) {
if ( \<?php [AP]M$/ ) {
if ( /^\<\?php [AP]M$/ ) {
if ( /^\<\?\p\h\p [AP]M$/ ) {

Does that answer your question? The problem, I think it should be
clear, is that I do not understand Perl regex syntax, and am therefore
forced to resort to brute-force methods.

Please read the posting guidelines for this group. They provide you with
invaluable information you can use to help yourself as well as helping us
help you.

None of said posting guidelines helps me to help myself nor does it
help you any more to help me than my initial post already does... don't
you agree?

Samuel
 
A. Sinan Unur

A. Sinan Unur said:
What have you tried and what has failed?

I have tried the following if-lines and other variations thereof:

if ( \<?php ) {
if ( /\<\?php ) {
if ( /\<\?\p\h\p ) {
if ( \<?php [AP]M$/ ) {
if ( /\<\?php [AP]M$/ ) {
if ( /\<\?\p\h\p [AP]M$/ ) {
if ( \<?php ) {
if ( /^\<\?php ) {
if ( /^\<\?\p\h\p ) {
if ( \<?php [AP]M$/ ) {
if ( /^\<\?php [AP]M$/ ) {
if ( /^\<\?\p\h\p [AP]M$/ ) {

Does that answer your question?

It tells me that you are not approaching the problem methodically.

The problem, I think it should be clear, is that I do not understand
Perl regex syntax,

perldoc perlretut

perldoc perlre

and is therefore forced to resort to brute-force methods.

http://perl.plover.com/Questions.html


None of said posting guidelines helps me to help myself nor does it
help you any more to help me than my initial post already does...
don't you agree?

No, I don't.

If you had at least attempted to post a short but complete program, read
the documentation along the way, that would have gone a long way towards
helping you help yourself, and help us help you.

Sinan
 
Aaron Baugher

Throw said:
The solution given to this question is exactly what I'm looking for,
except I need to split a concatenated PHP file. Basically, I have one
large text file into which I have copied PHP file after PHP file, and
now I want to split them up again. The PHP file always begins with

<?php

and always ends with

?>

There are two main ways to do this. Either read the entire file into
one variable, and then do a regex within a while loop on the entire
thing, treating it as one line and looking for your sample text (what
I think of as the brute force approach, since it is the quickest to
code but requires reading the entire file into memory at once, which
could be bad for a very large file):

my $line = join '', <STDIN>;
my $count = 1;
while( $line =~ /(<\?php.+?\?>\s*)/gs ){
    my $chunk = $1;
    open my $out, ">", "$count.php" or die $!;
    print $out $chunk;
    close $out or die $!;
    $count++;
}

The other option would be to read through the original file line by
line, starting a new file when you hit a <?php line, and closing it
when you hit ?>, writing all the lines between to said file. It's
similar to the above, so you can probably work it out for yourself.
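
For readers who want to see the line-by-line option spelled out, here is a rough sketch. It assumes the "<?php" and "?>" markers never appear inside the PHP code itself (for example in a string), and the sub name split_php is made up for illustration. It also uses a simple counter for the file names, since nothing in the data supplies a unique name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy each <?php ... ?> section read from $in into
# its own numbered file under $dir. Returns the number of files written.
sub split_php {
    my ( $in, $dir ) = @_;
    my $count = 0;
    my $out;                                  # defined only while inside a section
    while ( my $line = <$in> ) {
        if ( !$out && $line =~ /<\?php/ ) {
            $count++;                         # the counter supplies unique names
            open $out, '>', "$dir/$count.php"
                or die "Cannot open $dir/$count.php: $!";
        }
        next unless $out;                     # skip text between sections
        print {$out} $line;
        if ( $line =~ /\?>/ ) {               # closing marker ends the section
            close $out or die "Cannot close $dir/$count.php: $!";
            undef $out;
        }
    }
    return $count;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $written = split_php( $in, '.' );
    print "Wrote $written file(s)\n";
}
```

As for the command line: saved as, say, split_php.pl, it would be run as "perl split_php.pl combined.txt", and the numbered files land in the current directory.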
 
Tad McClellan

I suspect that what you need done is a great deal different
from the subject of this thread.

The OP has markers only at the beginning of records, you have
them at the beginning and at the end.

The OP's markers are variable length, yours are fixed strings.

The OP's output filenames are derived from what is matched, you
haven't indicated any way of naming the files.

I have tried the following if-lines and other variations thereof:

if ( \<?php ) {


That is not the syntax for the match operator:

perldoc -f m

then:

perldoc perlop

The match operator starts with either an "m" or a "/" character,
not a "\" character.

if ( /\<\?php ) {


The match operator ends with a "/" character.

If you add that character, then it should match just fine,
though it has one extra backslash that is not needed.
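
To make the delimiter rules concrete, here is a small sketch of well-formed versions of that test. The sub names are made up for illustration. Only the "?" needs a backslash, because it is a regex quantifier; "<" is an ordinary character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each hypothetical sub tests one line for the "<?php" opening marker.
sub has_php_open     { my ($line) = @_; return $line =~ /<\?php/  ? 1 : 0 }  # slash delimiters
sub has_php_open_alt { my ($line) = @_; return $line =~ m{<\?php} ? 1 : 0 }  # "m" permits other delimiters
sub starts_with_php  { my ($line) = @_; return $line =~ /^<\?php/ ? 1 : 0 }  # "^" anchors at line start

for my $line ( "<?php\n", "  <?php echo 1; ?>\n", "plain text\n" ) {
    printf "%d %d %d  %s",
        has_php_open($line), has_php_open_alt($line), starts_with_php($line), $line;
}
```

The first two forms are equivalent; the anchored form additionally rejects lines where the marker is preceded by anything, including whitespace.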

if ( /\<\?\p\h\p ) {
if ( \<?php [AP]M$/ ) {
if ( /\<\?php [AP]M$/ ) {
if ( /\<\?\p\h\p [AP]M$/ ) {
if ( \<?php ) {
if ( /^\<\?php ) {
if ( /^\<\?\p\h\p ) {
if ( \<?php [AP]M$/ ) {
if ( /^\<\?php [AP]M$/ ) {
if ( /^\<\?\p\h\p [AP]M$/ ) {

Does that answer your question?


It answers the underlying unspoken question quite well.

You appear to want to write code in a language that you do not know.

The implication is that you want us to write your code for you.

(most especially since you have asked us to write code for you before.)

The problem, I think it should be
clear, is that I do not understand Perl regex syntax,


Then you go learn about it before you write it.

Trying random things will take much more time than learning
the language that you wish to speak.

and is therefore
forced to resort to brute-force methods.


That is absurd.

If you do not know a language, you go learn the language.

You can learn about the syntax for the m// operator and for
Perl's regular expression in the documentation that came with perl.

If you don't understand some part of those docs, then post a question
about it here and we will help you understand it.

We are not likely to read those docs to you though.

None of said posting guidelines helps me to help myself


They most certainly do!

- Check the Perl Frequently Asked Questions (FAQ)

Since you have a question about pattern matching, you would
eventually try:

perldoc -q pattern

And would have found:

How can I pull out lines between two patterns that are themselves on
different lines?

Which tells you how to do exactly what you need done!
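
The technique behind that FAQ entry is the range operator ("..") used in scalar context, documented in perlop. A tiny sketch on made-up in-memory data, just to show the behaviour:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data; each element stands for one line of input.
my @lines = ( "extra\n", "<?php\n", "body\n", "?>\n", "between\n" );
my @kept;
for (@lines) {
    # In scalar context ".." turns true when the left pattern matches and
    # stays true up to and including the line where the right pattern matches.
    push @kept, $_ if /<\?php/ .. /\?>/;
}
print @kept;    # prints the three lines from "<?php" through "?>"
```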


- Check the other standard Perl docs (*.pod)

Which describe the syntax for the operator that you want to use.


- Use an effective followup style

Wherein you quote what you are commenting on, such as the
code that you want modified.

This helps you because it allows more people to examine the problem.

Many or most readers will just move on to answering the next person's
question rather than spend time locating the code.

nor does it
help you any more to help me than my initial post already does...


- Provide enough information

(which asks for a short and complete program that we can run
that illustrates the problem you need solved.)

If you had posted code missing the match operator's closing slash,
then we could have told you that you were missing the closing slash,
and one of your problems would have been eliminated straightaway
rather than here way down-thread.

Are the "<?php" and "?>" always on separate lines?

If you had posted data to go with your code, we would have been able
to see that there was a much better way of solving your problem
than what appeared in the thread thus far.


Anyway, here is a short and complete program that *you* can run.

----------------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $cnt=1;
while ( <DATA> ) {
    if ( /<\?php/ ) {
        open OUT, '>', "$cnt.php" or die "could not open '$cnt.php' $!";
        $cnt++;
    }
    print OUT if /<\?php/ .. /\?>/;
}

__DATA__
extra stuff
<?php
1st PHP section
?>
in-between stuff
<?php
2nd PHP section
?>
trailing stuff
----------------------------------------

don't
you agree?


No.

You have already used up all of your coupons.

So long!
 
Throw

Tad said:
I suspect that what you need done is a great deal different
from the subject of this thread.

IMO it is not, but I'm sorry that you disagree. The OP wanted a script
for splitting long files, and so did I.

The OP has markers only at the beginning of records, you have
them at the beginning and at the end.

True, but that is not relevant, because I don't need to split my file
at the top and bottom of each PHP file... I only need to split it at
the top *or* bottom of each PHP file (because the one's bottom is also
the next one's top, if you see what I mean).

The OP's markers are variable length, yours are fixed strings.

I would have thought that a procedure for fixed strings would be easier
and simpler to implement than one for variable-length markers. I would
have thought that some of the characters in the search string are regex
code for "variable things", which one could simply remove to be left with
that which refers to a fixed string. At least, this is what the regex
find functions of other languages that I have dealt with do.

The OP's output filenames are derived from what is matched, you
haven't indicated any way of naming the files.

That is true, but I think I would have realised it and probably have
included the equivalent of a for-next loop (and on how to do that, I
would probably have searched various Perl forums for existing answers
to similar questions asked by equally clueless people). Alternatively,
I may have written a script in a different language (say, AutoIt) which
creates unique names for each PHP file... although to do that, I would
have to know how to call the name in the Perl script's find function.

It answers the underlying unspoken question quite well.
You appear to want to write code in a language that you do not know.

Yes.

The implication is that you want us to write your code for you.

No. I did not ask for a completely new script. I asked for help with
the regex only. The script was already in existence, and it required
very, very little adaptation... so very little in fact that I might
have been able to figure it out myself if I had the missing
information.

(most especially since you have asked us to write code for you before.)

I do ask for scripts to be written, yes. If you enjoy writing simple
scripts to solve problems that haven't been solved before, you're
welcome to respond. If you do not, then feel free not to respond. I'm
not asking because I believe I have the right to expect to be helped.
I'm asking simply on the off-chance that someone might want to help (or
point me into some direction).

That is absurd.
If you do not know a language, you go learn the language.

I do not agree. Sorry, but my purpose is not to learn a single
language, but to discover a solution to my problem using whatever means
is available. If it's a Perl script, then good. If it's Java, Python,
Ruby, AutoIt, VB macro, StarBasic, Tcl, Yabasic, etc... then also good.
I have limited knowledge of some of these languages, and if I see
something which I *think* I understand partially, I'll fiddle with it.
But I won't read the whole manual, and I won't try to learn everything
there is about the language.

What you're saying, has some merit, though. Not knowing the entire
language can be extremely limiting in that you won't be able to solve
problems when they arise, except "blindly". In the above case, I had
believed that my only obstacle to success was the regex line of code.

Before your post, many of the responses I've had to my query had been
utterly unhelpful (but I have no right to complain or blame). Your
answer about iterations was very useful because it shows me an
additional error in my thinking and it made me learn more about Perl
(though not enough to write programs, heh-heh).

You can learn about the syntax for the m// operator and for
Perl's regular expression in the documentation that came with perl.

Thanks... now at least I know what to look for.

Since you have a question about pattern matching, you would
eventually try:

perldoc -q pattern

Thanks. So it's called "pattern matching"...

Anyway, here is a short and complete program that *you* can run.

Thanks.

You have already used up all of your coupons.

Thanks. I'll ask for free scripts again, though. Does that offend
you? The OP's request didn't seem to, and unlike myself he didn't even
bother to try anything before asking on the forum (but maybe he's a
regular here).

I don't post free script requests and then just sit back and wait for
the free stuff to roll on in. I post, yes, and then I continue in my
search elsewhere for other possible solutions to my problem. And when
I have found a solution, I tell those in my group about it so that they
too can use it when they encounter that problem in future. I'm sorry
if this offends you.

Samuel Murray (aka voetleuce, leuce, throw)
 
