
Script to split a text document by page count? (Perl, AppleScript)


Stoid
02-27-2010, 05:26 PM
I've been hunting for days and I can't believe no one has written a script to do this.

I'm on a Mac and I don't know from scripting to begin with, but I know that I can use AppleScript and Perl without much fuss.

All I'm trying to do is take a .txt document that is actually a mail-merged document (essentially the same three pages repeated hundreds of times), split it into separate files based on those three pages, and save them as files named after a particular word on the second line of the document.

I know it's easy to say it seems easy when I don't know how to do anything to begin with, but still... it does seem easy. A friend who is very clever about these things says Perl scripts working with text are everywhere... but I'm not finding them.

So... any tips or help would be much appreciated.

QuercusMax
02-27-2010, 08:45 PM
Probably you aren't finding such a script because anybody who's written such a thing has probably done it as a one-off that worked for their particular case and none other. It's seriously probably a five or ten minute job for somebody with any kind of scripting language experience, but I can't imagine it would be worth spending any time publishing such a thing, since it would have to be modified anyway for anyone else.

code_grey
02-27-2010, 09:26 PM
On Craigslist computer gigs you can probably get it done for $5, all the more so if this is not secret stuff and you can just send the file to a programmer (that way he could do it in a non-Mac development environment, since relatively few people out there are familiar with AppleScript).

BigT
02-27-2010, 09:34 PM
QuercusMax wrote:
> Probably you aren't finding such a script because anybody who's written such a thing has probably done it as a one-off that worked for their particular case and none other. It's seriously probably a five or ten minute job for somebody with any kind of scripting language experience, but I can't imagine it would be worth spending any time publishing such a thing, since it would have to be modified anyway for anyone else.

Particularly the part about taking the file name from the second line.

I'm pretty sure you'd need to know a unique phrase at the end and/or beginning of each file, as well as a unique phrase before and/or after the name. It would be a lot easier than trying to parse lines or pages.

Do you still have the original template and database? It might be easier to start from there than the outputted file.

Superfluous Parentheses
02-27-2010, 09:34 PM
if it is actually the same 3 pages, why not just save those 3 pages from a text editor?

I'm probably missing something, but if the problem is what it appears to be, writing even a simple script in perl would be more work.

Erasmus Darwin
02-28-2010, 12:04 AM
If you can give me a slightly more concrete description of the file format (how to identify the start and end of each file and how to grab the filename), I could throw together a Perl script for you.

BigT
02-28-2010, 12:35 AM
Superfluous Parentheses wrote:
> if it is actually the same 3 pages, why not just save those 3 pages from a text editor?
>
> I'm probably missing something, but if the problem is what it appears to be, writing even a simple script in perl would be more work.

A mail-merged document is almost exactly the same for each set, but the names (and/or possibly other information) are changed in each one. It's the output of a form letter.

The non-computerized equivalent would be a form where everything is already written, but they leave blanks for the names. Just saving the first three pages would be like just filling in one person's name, and then sending a copy of that to everyone.

Walton Firm
02-28-2010, 03:06 AM
I'm sure there are more elegant ways to do it but this is the first thing I came up with after reading the OP:


#!/usr/bin/perl -w

$text = "";
$line = 0;
$word = "";

while (<>) {
    if (/separator/) {
        # End of a sub-document: write out what has been collected so far.
        open(OUTFILE, ">", "$word.txt") or die "Cannot open $word.txt: $!\n";
        print OUTFILE $text;
        close(OUTFILE);
        $text = "";
        $line = 0;
        $word = "";
    }
    else {
        $text .= $_;
        $line++;
        # If this is the second line after the separator, take the first
        # word after "Hello, " to use for the filename.
        if ($line == 2) {
            if (/Hello, (\S+)/) {
                $word = $1;
            }
            else {
                die "Unable to determine the name for the file!\nLine is: $_\n";
            }
        }
    }
}


Since the OP is not a 100% unambiguous spec, I made some assumptions:

The input is a single text file.
It contains a number of sub-documents, each ending with a line containing the word 'separator'. (Note: the last sub-document must also end with a separator!)
The separator line itself will not be written out.
The second line of each subdocument contains the pattern "Hello, xxx" whereby xxx.txt is the name under which that subdocument must be written.
The fact that the subdocuments are mostly the same, and that each consists of multiple pages, is not relevant for the program.


This is the test document I used:


bla bla bla bla
bla Hello, w1 bla bla
bla bla bla bla

separator
bla bla bla bla
bla Hello, w2 bla bla
bla bla bla bla

separator
bla bla bla bla
bla Hello, w3 bla bla
bla bla bla bla

separator


To test the program, I would run perl split.pl testdoc.txt. Make sure that the pattern after "Hello, " does not contain any invalid filename characters, and that it will not overwrite an existing file in the directory!
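
The filename caution can be handled mechanically. A rough shell sketch, assuming that any character outside letters, digits, dot, underscore and hyphen should simply become an underscore; safe_name is a hypothetical helper and is not part of the script above:

```shell
# Hypothetical helper: turn a captured name into a conservative filename.
safe_name() {
    # Replace every character outside A-Z, a-z, 0-9, dot, underscore and
    # hyphen with an underscore.
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

# Invented example of a name that would be unsafe to use as-is.
name=$(safe_name 'w1/../x y')
target="$name.txt"

# Refuse to clobber a file that is already there.
if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
else
    echo "would write $target"
fi
```

The same two checks (restrict the character set, test for an existing file before opening) could equally be done inside the Perl script before the open call.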

psychonaut
02-28-2010, 06:55 AM
Wow, that's a pretty big Perl program for what could be accomplished by a single Unix command: csplit.

I hear Mac OS X is a BSD variant, so possibly it includes csplit by default. If not, it can be obtained as part of the GNU coreutils package.

AHunter3
02-28-2010, 10:04 AM
I could do it in FileMaker easily enough. (Hey, I use this hammer like Thor and everything in creation is my freakin' nail!)

If I knew my way around grep, I'd AppleScript BBEdit or its little brother TextWrangler.

beowulff
02-28-2010, 10:05 AM
psychonaut wrote:
> Wow, that's a pretty big Perl program for what could be accomplished by a single Unix command: csplit.
>
> I hear Mac OS X is a BSD variant, so possibly it includes csplit by default. If not, it can be obtained as part of the GNU coreutils package.

csplit is located in /usr/bin/csplit on OS X 10.5

Stoid
02-28-2010, 09:20 PM
Thanks for the tips, thoughts, and attempts, all!

I'm interested in the csplit thing, and that script that Walton wrote looks possible... but I'm SO much of a clueless noob that I don't even know how to implement it.

In the meanwhile, I did come up with a long way around that still works.

The "Mail Merge Form Letter" is a page of HTML from my website... it's the gallery page, which is identical for thousands of pages, with the following differences: names of thumbnails, names of pictures they link to, page that "previous" links to, page that "next" links to. As follows:

<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HTML 3.2 + extensions for HoTMetaL PRO 3.0(U) 19961211//EN" "hmpro3.dtd">
<HTML>
<HEAD>
<TITLE>RetroRaunch / Cheesecake / Babes by the Decade / 1940-1959 /
8621-8640</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#FF0000" VLINK="#800000"
ALINK="#000000">
<CENTER><IMG SRC="../../retroraunch.jpg" WIDTH="445" HEIGHT="213"> <BR>
<BR><FONT SIZE="+2">Cheesecake / Babes by the Decade / 1940-1959 /
8621-8640</FONT> <BR>Click a thumbnail for a full size image <BR> <BR>
<TABLE BORDER="0" CELLSPACING="10">
<TR>
<TD VALIGN="MIDDLE" ALIGN="CENTER" WIDTH="120"><A
HREF="photos/40ccba-8621.jpg"><IMG SRC="thumbs/t40ccba-8621.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER" WIDTH="120"><A
HREF="photos/40ccba-8622.jpg"><IMG SRC="thumbs/t40ccba-8622.jpg" WIDTH="81"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER" WIDTH="120"><A
HREF="photos/40ccba-8623.jpg"><IMG SRC="thumbs/t40ccba-8623.jpg" WIDTH="80"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER" WIDTH="120"><A
HREF="photos/40ccba-8624.jpg"><IMG SRC="thumbs/t40ccba-8624.jpg" WIDTH="82"
HEIGHT="120" BORDER="0"></A></TD>
</TR>
<TR>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8625.jpg"><IMG SRC="thumbs/t40ccba-8625.jpg" WIDTH="82"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8626.jpg"><IMG SRC="thumbs/t40ccba-8626.jpg" WIDTH="99"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8627.jpg"><IMG SRC="thumbs/t40ccba-8627.jpg" WIDTH="96"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8628.jpg"><IMG SRC="thumbs/t40ccba-8628.jpg" WIDTH="78"
HEIGHT="120" BORDER="0"></A></TD>
</TR>
<TR>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8629.jpg"><IMG SRC="thumbs/t40ccba-8629.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8630.jpg"><IMG SRC="thumbs/t40ccba-8630.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8631.jpg"><IMG SRC="thumbs/t40ccba-8631.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8632.jpg"><IMG SRC="thumbs/t40ccba-8632.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
</TR>
<TR>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8633.jpg"><IMG SRC="thumbs/t40ccba-8633.jpg" WIDTH="82"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8634.jpg"><IMG SRC="thumbs/t40ccba-8634.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8635.jpg"><IMG SRC="thumbs/t40ccba-8635.jpg" WIDTH="83"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8636.jpg"><IMG SRC="thumbs/t40ccba-8636.jpg" WIDTH="81"
HEIGHT="120" BORDER="0"></A></TD>
</TR>
<TR>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8637.jpg"><IMG SRC="thumbs/t40ccba-8637.jpg" WIDTH="80"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8638.jpg"><IMG SRC="thumbs/t40ccba-8638.jpg" WIDTH="79"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8639.jpg"><IMG SRC="thumbs/t40ccba-8639.jpg" WIDTH="80"
HEIGHT="120" BORDER="0"></A></TD>
<TD VALIGN="MIDDLE" ALIGN="CENTER"><A
HREF="photos/40ccba-8640.jpg"><IMG SRC="thumbs/t40ccba-8640.jpg" WIDTH="80"
HEIGHT="120" BORDER="0"></A></TD>
</TR>
</TABLE> <BR><A HREF="40ccba8601-8620.htm"><IMG SRC="../../previous.jpg"
WIDTH="171" HEIGHT="35" BORDER="0"></A><A HREF="../index.html"><IMG
SRC="../../back.jpg" BORDER="0" WIDTH="171" HEIGHT="35"></A><A
HREF="40ccba8641-8660.htm"><IMG SRC="../../next.jpg" WIDTH="171" HEIGHT="35"
BORDER="0"></A> <BR>
<P><FONT>Questions? Comments? Let us know what you think! <BR><A
HREF="../../mail.htm"><IMG SRC="../../icon-mail.jpg" ALT="mail us" BORDER="0"
HSPACE="15" WIDTH="100" HEIGHT="160"></A> <BR> <BR></FONT></P>
<P><FONT SIZE="-2">Copyright &copy;1998 GDI</FONT> <BR> <BR></P></CENTER>
</BODY>
</HTML>

The name of this page is 40ccba8621-8640.htm

If you know how to read HTML and look at the bold red words, you can see what I'm doing: there is a consistent naming/sequencing scheme to the photos and thumbs, the pages are named after the photos and thumbs within them, and it's all very linear and logical.


It's also a completely tedious pain in the ass to update if you are not a detail-oriented person. Everyone who has been responsible for updating aside from me could throw this together in 5-10 minutes with the new images, enter a couple of references to the update elsewhere on the site, and be done. Me? I struggle with making sure everything is correct for two hours.

On the other hand, I spent about 20 hours trying to figure out how to automate most of it, and I did.

I created a database where all I have to do is enter the basic information in about 4 fields and duplicate it endlessly; it generates all the data: the image names, the page names and references, everything. In a flash.

Then I took the above text, slapped it in a Pages doc (Word would work, too) and entered the correct fields.

Merge.

Big fat document... how to split?

Well, the merge creates a 3-page section for each separate doc. So I printed to a PDF, opened it in Acrobat, and used Acrobat's "split document" function to split it into individual PDFs for each page. Then I ran "multiple export" to "accessible text" to convert it all to text, and ran the results through A Better Finder Rename with a 2-step command: change each name to 40ccba plus a starting number stepping by ten with a suffix of -.htm, then insert the second set of numbers.

And I end up with:


40ccba9941-9950.htm
40ccba9951-9960.htm
40ccba9961-9970.htm

etc...and yes, the names correspond correctly to the information within.

It's the long way around, but it worked.

Now I have to do it for 20ccba and 40febo, etc. But it's WAY easier in the long run for ME and my ADD brain.

Next up: Folder actions so that, when I scan a photo into a folder, it gets resized, thumbnailed, watermarked, tagged with EXIF data, and uploaded to the server!

Because of course now that I have (or will have) all the gallery HTML pre-written, all I need to go live is add the actual photos.

And what's really kinda silly about all this? Is that what I'm really going to focus on as soon as this is all under control is converting my whole website to either Joomla + Gallery or Drupal + Coppermine or vice-versa.

But I can't learn those things and make the change without having the old-fashioned existing system under control while I do.

So that's what I'm doing.

Erasmus Darwin
03-01-2010, 11:17 AM
Okay, I've got a nice two-step solution for you.

Step 1:
csplit -z inputfile '/<!DOCTYPE/' '{*}'

Where 'inputfile' is the giant mailmerged file. What this does is split the file up into a number of files named xx## where ## is an incrementing number. It splits on <!DOCTYPE, which makes a nice separator between pages. It doesn't do any checking to make sure <!DOCTYPE isn't occurring in the middle of a page, but that should be a fairly safe assumption to make. So at this point, the only thing left is to rename the files based on their contents.

Step 2:
./rename_files.pl xx*

Where rename_files.pl is the following Perl script:

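#!/usr/bin/perl

use strict;
use warnings;

for my $file (@ARGV) {
    my ($first_img, $last_img) = ('', '');
    open(my $fh, '<', $file)
        or die "Error reading $file: $!";
    while (<$fh>) {
        # Remember the first and last photo names linked from this page.
        if (/HREF="photos\/(.*?)\.jpg"/) {
            $first_img = $1 if ! $first_img;
            $last_img = $1;
        }
    }
    close($fh);
    # No photo links found: leave the file unrenamed.
    next if ! $first_img || ! $last_img;
    # Strip the trailing digits to get the shared prefix, e.g. "40ccba-".
    (my $prefix = $first_img) =~ s/\d+$//;
    $first_img =~ s/^\Q$prefix\E//;
    $last_img =~ s/^\Q$prefix\E//;
    $prefix =~ s/-$//;
    rename($file, "${prefix}${first_img}-${last_img}.htm")
        or warn "Could not rename $file: $!";
}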


Standard caveats apply: It was a quick-and-dirty script. It doesn't do much in the way of error checking. If it gets confused, it'll either leave the file unrenamed (xx##) or rename it to something oddly long (40ccba1234-50ccba-5678.htm). The former would happen if it can't find any picture names at all, and the latter would happen if you have 2 different prefixes on your pictures.

Stoid
03-01-2010, 03:24 PM
First of all, thank you very much for your time and assistance, I really appreciate it enormously.

So... I tried out csplit. When it wouldn't work no matter what I did, I researched it. And tried and tried and tried, and kept getting an assortment of rejections, most of them having to do with DOCTYPE not being a directory... I tried messing with the doctype part of it, nothing was working. (what is the -z?)

I stumbled onto the simple "split" command, and messed with that, just to see if I could get SOMETHING working, and eventually I did. The only thing I have managed to successfully achieve is splitting the file into 4k pieces, which doesn't help, of course, because they are not the right pieces. But I tried that after multiple attempts at getting it to split on a LINE (it appears that each page is 135 lines long) absolutely would not work... it just kept duplicating the whole file with a new name.

The total number of separate "docs" in the long file is 100. There are 13,504 lines. I have repeatedly tried:

split -l 135 mydoc.txt newdoc

and all it does is copy the file with the new name.

I tried saving it as .htm first, I tried actually "assigning" the line numbers using TextWrangler. Nope.

Why doesn't it like my split by line command?

A record of my attempts:

mycomputernamestuffhere$ split -l 136 40ccbamnerged.txt mergetest
split: 40ccbamnerged.txt: No such file or directory
mycomputernamestuffhere$ split -l 136 40ccba.txt test
mycomputernamestuffhere$ split -l 136 40ccba.txt moretest
mycomputernamestuffhere$ split -l 136 40ccba.htm pleasework
mycomputernamestuffhere$ split -1 5 40ccba.txt five
split: 5: No such file or directory
mycomputernamestuffhere$ split -l 50 40ccba.txt phive
mycomputernamestuffhere$ split -b 4k 40ccba.txt size
mycomputernamestuffhere$ split -l 136 numbered.txt wrangled
mycomputernamestuffhere$ split -l 135 numbered.txt wrangled.txt
mycomputernamestuffhere$ csplit -z 40ccba.txt 40ccba
csplit: illegal option -- z
usage: csplit [-ks] [-f prefix] [-n number] file args ...
mycomputernamestuffhere$ csplit -f ccba 40ccba.txt
278713
mycomputernamestuffhere$ csplit -f tryagain 40ccba.txt
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt '/,!DOCTYPE/''{*}'
csplit: {*}: bad offset
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt '/<!DOCTYPE/' '{*}'
csplit: *}: bad repetition count
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt '/<!DOCTYPE/' '{100}'
0
0
BUNCHA ZEROS DELETED
0
0
0
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt '/<!DOCTYPE/'
0
278713
mycomputernamestuffhere$ csplit -csplit -f perfect 40ccba.txt '/HoTMetaL/'
csplit: illegal option -- c
usage: csplit [-ks] [-f prefix] [-n number] file args ...
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt '/HoTMetaL/'
0
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt '/HoTMetaL/'
0
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt /HoTMetaL/
0
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt /HoTMetaL/
0
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt /RetroRaunch/
0
278713
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt /RetroRaunch/{100}
csplit: {100}: bad offset
mycomputernamestuffhere$ csplit -f perfect 40ccba.txt /RetroRaunch/ {100}
0
BUNCHA ZEROS DELETED
0
278713
mycomputernamestuffhere$ csplit -f bline breakline.txt /BREAKLINE/
0
272213
mycomputernamestuffhere$ csplit -f bline breakline.txt line_136 {100}
csplit: line_136: unrecognised pattern
mycomputernamestuffhere$ csplit -f bline breakline.txt line_no 136 {100}
csplit: line_no: unrecognised pattern
mycomputernamestuffhere$ line_no
-bash: line_no: command not found
mycomputernamestuffhere$ csplit -f bline breakline.txt 136 {100}
csplit: 136: out of range
mycomputernamestuffhere$ csplit -f bline breakline.txt 1_136
csplit: 1_136: bad line number
mycomputernamestuffhere$ csplit -f bline breakline.txt 1 136
0
csplit: 136: out of range
mycomputernamestuffhere$ csplit -f bline breakline.txt 136
csplit: 136: out of range
mycomputernamestuffhere$ csplit -f bline breakline.txt 1
0
272213
mycomputernamestuffhere$ csplit -f bline breakline.txt 5
8188
264025
mycomputernamestuffhere$ csplit -f bline breakline.txt 10
18423
253790
mycomputernamestuffhere$ csplit -f bline breakline.txt 99
200606
71607
mycomputernamestuffhere$ csplit -f bline breakline.txt 100
202653
69560
mycomputernamestuffhere$ csplit -f bline breakline.txt
\353237
mycomputernamestuffhere$ csplit -f bline new.txt 99
200606
152631
mycomputernamestuffhere$ csplit -f bline new.txt 10
18423
334814
mycomputernamestuffhere$ csplit -f bline new.txt 10
18423
334814
mycomputernamestuffhere$ csplit -f bline new.txt 2
2047
351190
mycomputernamestuffhere$ csplit -f bline new.txt 3
4094
349143
mycomputernamestuffhere$ csplit -f bline new.txt /BREAKLINE/
0
353237
mycomputernamestuffhere$ csplit -f bline new.txt /BREAKLINE/ {100}
0
0
0



When it did split the doc, it would do so in a way I could not figure out. It either made a hundred docs and only one had any content, or, when I was trying to figure out how to make "line_no" work, and I entered just raw numbers, it would split the file in two, the size of the two parts changing based on the number I used, but how I don't know.


VERY frustrating

( At one point I replaced the DOCTYPE line with one word, no code: BREAKLINE, in hopes that maybe the coding in the HTML was the problem. Nope.)

beowulff
03-01-2010, 04:02 PM
Have you ever used the command line? If not, there are a lot of "gotchas."
First of all, when you call csplit, type the following (without the quotes): "csplit -z "
Note the space at the end.
Then, drop the file you want to split onto the terminal window.
Type a space.
Then, type the following with the single quotes: '/<!DOCTYPE/' '{*}'
Hit return and report back.

Stoid
03-01-2010, 05:30 PM
beowulff wrote:
> Have you ever used the command line? If not, there are a lot of "gotchas."
> First of all, when you call csplit, type the following (without the quotes): "csplit -z "
> Note the space at the end.
> Then, drop the file you want to split onto the terminal window.
> Type a space.
> Then, type the following with the single quotes: '/<!DOCTYPE/' '{*}'
> Hit return and report back.


Thank you, and here is my report:



mycomputer$ csplit -z /Users/Stoidyhome/Documents/Update\ records/Merge\ Work/ptest/40ccba.txt '/<!DOCTYPE/' '{*}'
csplit: illegal option -- z
usage: csplit [-ks] [-f prefix] [-n number] file args ...
mycomputer$ csplit -z
csplit: illegal option -- z
usage: csplit [-ks] [-f prefix] [-n number] file args ...
mycomputer$ csplit
usage: csplit [-ks] [-f prefix] [-n number] file args ...
mycomputer$ cd /Users/mycomputer/Desktop/ptest/
mycomputer$ csplit -z /Users/mycomputer/Desktop/ptest/40ccba.txt '/<!DOCTYPE/' '{*}'
csplit: illegal option -- z
usage: csplit [-ks] [-f prefix] [-n number] file args ...
mycomputer$

Erasmus Darwin
03-01-2010, 05:39 PM
Stoid wrote:
> So... I tried out csplit. When it wouldn't work no matter what I did, I researched it. And tried and tried and tried, and kept getting an assortment of rejections, most of them having to do with DOCTYPE not being a directory... I tried messing with the doctype part of it, nothing was working. (what is the -z?)

-z tells it not to write out a file if the file's empty. Otherwise, it creates an empty file to hold the nothing before the first <!DOCTYPE. Obviously, it's not strictly necessary, but it does make things a bit tidier.

Anyway, it looks like your version of csplit's a little different from the one I tested with (which was on a Linux box). Rather than fighting with it (since it looks like you've tried all the possible ways to get it to work and then some), it might be easier just to skip it and use another Perl script to do the splitting. I'll throw something together right now.

beowulff
03-01-2010, 05:41 PM
Try it without the -z
I'm not sure what that is supposed to do, and the OS X version doesn't support it.
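
The usage line in the transcript above lists only -k, -s, -f and -n, so the tidying that -z does has to happen afterwards on this system. A minimal sketch of that cleanup, assuming the pieces keep the xx prefix and sit together in one directory; the function name remove_empty_pieces and the scratch-directory demo are made up for illustration:

```shell
# Delete zero-length pieces, which is roughly what GNU csplit's -z avoids
# creating in the first place.
remove_empty_pieces() {
    for piece in "$1"/xx*; do
        # -s is true only for files that exist and are larger than zero bytes.
        [ -s "$piece" ] || rm -f -- "$piece"
    done
}

# Demonstration in a scratch directory with invented sample pieces.
demo_dir=$(mktemp -d)
: > "$demo_dir/xx00"                            # empty piece before the first match
printf '<!DOCTYPE HTML>\n' > "$demo_dir/xx01"   # a piece with content
remove_empty_pieces "$demo_dir"
ls "$demo_dir"
```

On the real output this would be called on the directory holding the split files, for example remove_empty_pieces . from inside it.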

Erasmus Darwin
03-01-2010, 05:51 PM
Ok, here's a quick and ugly Perl script that can replace csplit:


#!/usr/bin/perl

use strict;
use warnings;

my $out_count = 0;
# Anything before the first <!DOCTYPE goes into 'initial_garbage'.
open(F, '>', 'initial_garbage')
    or die "Error creating output file: $!";
while (<>) {
    if (/^<!DOCTYPE/) {
        # Start a new output file; reopening F closes the previous one.
        open(F, '>', sprintf('xx%04d', $out_count++))
            or die "Error creating output file: $!";
    }
    print F $_;
}
close(F);


Save it to 'split_by_doctype.pl', and then do 'chmod 755 split_by_doctype.pl' to make it executable. Then step 1 would be:
./split_by_doctype.pl inputfile

The file 'initial_garbage' holds any data before the first <!DOCTYPE, so it will probably be empty. Anyway, just run this, then run the rename script, and you should be golden.

Stoid
03-01-2010, 06:37 PM
Oh my GOD... it did exactly the same thing the csplit did many times: I got back 2 files, one was 276kb, named xx0000, the other was 0kb, named initial_garbage.


ARG!

Stoid
03-01-2010, 06:40 PM
by the way, is the ./ the command to invoke any Perl script, or is this something specific to this?

Stoid
03-01-2010, 06:45 PM
Beowulff: I tried it completely stripped down:



Stoidbook:ptest Stoidyhome$ csplit 40ccba.txt '/<!DOCTYPE/'
0
278713


Same thing: two files, one empty, one a repeat of the original.

ARGH!!!!

What was I doing wrong with the "line_no" command?

beowulff
03-01-2010, 07:08 PM
Stoid wrote:
> Beowulff: I tried it completely stripped down:
>
> Same thing: two files, one empty, one a repeat of the original.
>
> ARGH!!!!
>
> What was I doing wrong with the "line_no" command?

If this is what you typed, it assumes that the file is at the root of your hard drive - is it?
You need to specify the full path to the file, which is what the dragging and dropping does.

Stoid
03-01-2010, 08:20 PM
beowulff wrote:
> If this is what you typed, it assumes that the file is at the root of your hard drive - is it?
> You need to specify the full path to the file, which is what the dragging and dropping does.

I've done it both ways, I was getting directory problems at first and sorted that out. It won't do anything at all if it can't find it. As long as it's doing something, even if it's wrong, I know that the directory isn't the problem.

beowulff
03-01-2010, 08:26 PM
Stoid wrote:
> I've done it both ways, I was getting directory problems at first and sorted that out. It won't do anything at all if it can't find it. As long as it's doing something, even if it's wrong, I know that the directory isn't the problem.

One issue may be line endings. Unix expects linefeed "\n". If you are using <cr>, then you might have problems with csplit.

If you want to PM me and send me the file, I can try it over here.
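
A quick way to check for this from the command line, sketched under the assumption that the file uses bare carriage returns (old Mac style) rather than CR+LF. The helper name count_byte and the sample text are illustrative only:

```shell
# Count how many times one byte occurs in a file.
# $1 = file, $2 = the byte written as a tr escape, such as '\r' or '\n'.
count_byte() {
    # Delete everything except the requested byte, then count what is left.
    # The final tr strips the padding some wc implementations add.
    tr -cd "$2" < "$1" | wc -c | tr -d ' '
}

# Invented CR-only sample: three "lines" and not a single linefeed.
sample=$(mktemp)
printf 'page one\rpage two\rpage three\r' > "$sample"
echo "CR: $(count_byte "$sample" '\r')  LF: $(count_byte "$sample" '\n')"

# Convert CR-only endings to LF. (For a CR+LF file, delete the CRs
# instead with: tr -d '\r')
tr '\r' '\n' < "$sample" > "$sample.unix"
echo "LF after conversion: $(count_byte "$sample.unix" '\n')"
```

If the LF count comes back as 0, line-oriented tools such as split -l and csplit effectively see the whole file as one long line, which would fit the "two files, one empty, one a repeat of the original" results earlier in the thread.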

Stoid
03-01-2010, 08:51 PM
beowulff wrote:
> One issue may be line endings. Unix expects linefeed "\n". If you are using <cr>, then you might have problems with csplit.


I mean absolutely no offense whatsoever... but can I blow you now? (I'm a woman, in case that helps. )

YOU RULE!!!!!!!

That was the problem!!! WOOHOO!!!!!!!!

I just saved it with Unix line endings via TextWrangler, and VOILA!

Now I have to try the rename script.... WOOHOO

(That is an enormously important thing for me to understand for future reference... awesome. YAY!)

Stoid
03-01-2010, 09:06 PM
Erasmus Darwin wrote:
> Standard caveats apply: It was a quick-and-dirty script. It doesn't do much in the way of error checking. If it gets confused, it'll either leave the file unrenamed (xx##) or rename it to something oddly long (40ccba1234-50ccba-5678.htm). The former would happen if it can't find any picture names at all, and the latter would happen if you have 2 different prefixes on your pictures.


Part one is finally working. So now I'm trying to get your script working (which is a wonderful thing because it is based on the ACTUAL contents of the file, not the assumed contents, which my long-way-round is, and could screw up) and this is what's happening so far...



mypoot:ptest mypoot$ ./rename_files.pl xx*
-bash: ./rename_files.pl: Permission denied
mypoot:ptest mypoot$./rename_files.pl xx
-bash: ./rename_files.pl: Permission denied
mypoot:ptest mypoot$chmod 755 rename_files.pl
mypoot:ptest mypoot$ ./rename_files.pl xx*
mypoot:ptest mypoot$ ./rename_files.pl xx
Error reading xx: No such file or directory at ./rename_files.pl line 8.
mypoot:ptest mypoot$


Because I am so incredibly clueless about all this, I didn't know for sure where to put the rename_files script, I assumed it should be in the same folder as the split files, so that's where it is.

And my files are indeed named xxdigitdigit, so I also assume your xx* means "all the files starting with xx", the asterisk being a wildcard.

I'm so close...

beowulff
03-01-2010, 09:07 PM
Stoid wrote:
> I mean absolutely no offense whatsoever... but can I blow you now? (I'm a woman, in case that helps. )
>
> YOU RULE!!!!!!!
>
> That was the problem!!! WOOHOO!!!!!!!!
>
> I just saved it with Unix line endings via TextWrangler, and VOILA!
>
> Now I have to try the rename script.... WOOHOO
>
> (That is an enormously important thing for me to understand for future reference... awesome. YAY!)

Glad that I could help.
As far as blowjobs go...let me ask my wife.

<Hon, I just helped some chick on the SDMB, and she wants to show her gratitude in a very special way - is that OK?>

Sorry, Stoid, I'm afraid that my wife says that BJs are just not an acceptable form of gratitude indication. But, I really appreciate the thought.:p

Erasmus Darwin
03-01-2010, 10:06 PM
Stoid wrote:
> by the way, is the ./ the command to invoke any Perl script, or is this something specific to this?

./ tells it to run a command in the current directory. If you have '.' in your PATH, then it's unnecessary. However, a lot of UNIX-based systems don't have '.' in the PATH for security reasons. I have no clue what OS X does.
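
Whether the ./ is needed on a given machine can be checked directly. A small sketch, in which dot_in_path is a hypothetical helper name; it treats an empty PATH entry the same as ".", since the shell reads an empty entry as the current directory:

```shell
# Succeeds if the given PATH-style string includes the current directory,
# either as "." or as an empty entry.
dot_in_path() {
    case ":$1:" in
        *:.:*|*::*) return 0 ;;
        *)          return 1 ;;
    esac
}

if dot_in_path "$PATH"; then
    echo "current directory is on PATH; a bare script name may work"
else
    echo "current directory is not on PATH; use ./script_name"
fi
```

Either way, typing ./rename_files.pl is harmless, so it is the safer habit.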

Stoid wrote:
> I'm so close...

Actually, I think you're done. Once you did the chmod 755 and ran it again, it looks like it worked. Do an 'ls' and see what the directory looks like now. Once it worked, it gave you an error when you tried it again as all the xx files had been renamed.

Stoid
03-01-2010, 10:32 PM
Erasmus Darwin wrote:
> Actually, I think you're done. Once you did the chmod 755 and ran it again, it looks like it worked. Do an 'ls' and see what the directory looks like now. Once it worked, it gave you an error when you tried it again as all the xx files had been renamed.

Sad to say, nope.

I ran the script through some debug tools and the same errors kept coming back. The syntax was ok, but it keeps spitting out the line 8 issue, saying:

"Error reading xx: No such file or directory at ./rename_files.pl line 8."

I only get the error when I try it with xx standing alone. When I put xx* it does nothing, no error, no changes, nothing. See: (I tried renaming the docs rrdoc and adding the extension... just in case.)



mypoot:test myhome$ ls
Working docs xx23 xx49 xx75
rename_files.pl xx24 xx50 xx76
split_by_doctype.pl xx25 xx51 xx77
xx00 xx26 xx52 xx78
xx01 xx27 xx53 xx79
xx02 xx28 xx54 xx80
xx03 xx29 xx55 xx81
xx04 xx30 xx56 xx82
xx05 xx31 xx57 xx83
xx06 xx32 xx58 xx84
xx07 xx33 xx59 xx85
xx08 xx34 xx60 xx86
xx09 xx35 xx61 xx87
xx10 xx36 xx62 xx88
xx11 xx37 xx63 xx89
xx12 xx38 xx64 xx90
xx13 xx39 xx65 xx91
xx14 xx40 xx66 xx92
xx15 xx41 xx67 xx93
xx16 xx42 xx68 xx94
xx17 xx43 xx69 xx95
xx18 xx44 xx70 xx96
xx19 xx45 xx71 xx97
xx20 xx46 xx72 xx98
xx21 xx47 xx73 xx99
xx22 xx48 xx74
mypoot:test myhome$ ./rename_files.pl xx*
mypoot:test myhome$ ./rename_files.pl xx44
mypoot:test myhome$ ./rename_files.pl /Users/myhome/Desktop/test/xx44
mypoot:test myhome$ id
uid=501(myhome) gid=501(myhome) groups=501(myhome),98(_lpadmin),81(_appserveradm),101(com.apple.access_screensharing),79(_appserverusr),80(admin),102(com.apple.access_ssh)
mypoot:test myhome$ pwd
/Users/myhome/Desktop/test
mypoot:test myhome$ ls -l rename_files.pl
-rwxr-xr-x@ 1 myhome myhome 566 Mar 1 19:26 rename_files.pl
mypoot:test myhome$ -d rename_files.pl
-bash: -d: command not found
mypoot:test myhome$ ./rename_files.pl xx*
mypoot:test myhome$ ./rename_files.pl xx
Error reading xx: No such file or directory at ./rename_files.pl line 8.
mypoot:test myhome$ ./rename_files.pl xx44
mypoot:test myhome$ perl rename_files.pl xx*
mypoot:test myhome$ ./rename_files.pl /Users/myhome/Desktop/test/xx17
mypoot:test myhome$ ./rename_files.pl xx17
mypoot:test myhome$ ./rename_files.pl rrdoc*
mypoot:test myhome$ ./rename_files.pl rrdoc
Error reading rrdoc: No such file or directory at ./rename_files.pl line 8.
mypoot:test myhome$ ls
Working docs rrdoc24.txt rrdoc50.txt rrdoc76.txt
rename_files.pl rrdoc25.txt rrdoc51.txt rrdoc77.txt
rrdoc00.txt rrdoc26.txt rrdoc52.txt rrdoc78.txt
rrdoc01.txt rrdoc27.txt rrdoc53.txt rrdoc79.txt
rrdoc02.txt rrdoc28.txt rrdoc54.txt rrdoc80.txt
rrdoc03.txt rrdoc29.txt rrdoc55.txt rrdoc81.txt
rrdoc04.txt rrdoc30.txt rrdoc56.txt rrdoc82.txt
rrdoc05.txt rrdoc31.txt rrdoc57.txt rrdoc83.txt
rrdoc06.txt rrdoc32.txt rrdoc58.txt rrdoc84.txt
rrdoc07.txt rrdoc33.txt rrdoc59.txt rrdoc85.txt
rrdoc08.txt rrdoc34.txt rrdoc60.txt rrdoc86.txt
rrdoc09.txt rrdoc35.txt rrdoc61.txt rrdoc87.txt
rrdoc10.txt rrdoc36.txt rrdoc62.txt rrdoc88.txt
rrdoc11.txt rrdoc37.txt rrdoc63.txt rrdoc89.txt
rrdoc12.txt rrdoc38.txt rrdoc64.txt rrdoc90.txt
rrdoc13.txt rrdoc39.txt rrdoc65.txt rrdoc91.txt
rrdoc14.txt rrdoc40.txt rrdoc66.txt rrdoc92.txt
rrdoc15.txt rrdoc41.txt rrdoc67.txt rrdoc93.txt
rrdoc16.txt rrdoc42.txt rrdoc68.txt rrdoc94.txt
rrdoc17.txt rrdoc43.txt rrdoc69.txt rrdoc95.txt
rrdoc18.txt rrdoc44.txt rrdoc70.txt rrdoc96.txt
rrdoc19.txt rrdoc45.txt rrdoc71.txt rrdoc97.txt
rrdoc20.txt rrdoc46.txt rrdoc72.txt rrdoc98.txt
rrdoc21.txt rrdoc47.txt rrdoc73.txt rrdoc99.txt
rrdoc22.txt rrdoc48.txt rrdoc74.txt split_by_doctype.pl
rrdoc23.txt rrdoc49.txt rrdoc75.txt
mypoot:test myhome$

Erasmus Darwin
03-01-2010, 10:52 PM
Huh. It looks like it's running, but it's not picking up the photo URLs to allow it to determine the new filename. It's specifically looking for something like:
HREF="photos/*.jpg"
(where * represents arbitrary text).

If your photo links are different, then it won't pick up the filename. It seemed to work okay when I tested it with the sample page you posted.
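
One way to narrow this down, as a sketch: count, per file, the lines that match the exact pattern the rename script looks for, alongside a looser case-insensitive count. The function name report_links and the two sample files are invented for the demo; nothing here is known about what the split files in the thread actually contain.

```shell
# Print how many lines of a file match the rename script's strict pattern
# and how many merely mention "photos/" in any letter case.
report_links() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    strict=$(grep -c 'HREF="photos/.*\.jpg"' "$1" || true)
    loose=$(grep -ci 'photos/' "$1" || true)
    echo "$1: strict=$strict loose=$loose"
}

# Two invented one-line samples: one in the expected form, one lower-cased.
good=$(mktemp)
odd=$(mktemp)
printf '%s\n' 'HREF="photos/40ccba-8621.jpg"><IMG SRC="thumbs/t40ccba-8621.jpg"' > "$good"
printf '%s\n' 'href="photos/40ccba-8621.jpg"><img src="thumbs/t40ccba-8621.jpg"' > "$odd"

report_links "$good"
report_links "$odd"
```

On the real files this could be run as: for f in xx*; do report_links "$f"; done. A file showing strict=0 but a non-zero loose count has photo links in some form the script does not recognise, and looking at one such line should show what differs.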