find eof, then process

mfilby · December 19, 2003, 4:43pm

Newbie question. I want to create a shell script that will only move/copy a file if it's determined that the eof string exists. This is to control files being uploaded via FTP. I don't want to move incomplete files, so my only thought is to check for eof, or compare file size every 15-30 seconds on the same file until size in bytes is the same and then move/copy the file. Some uploads occur over very slow connections, with large file sizes, and can take up to 10 minutes.

Anyone already have something like this? What commands would be best to use to facilitate this?

Optimus_P · December 19, 2003, 4:53pm

what do you mean EOF (end of file) string?

are you looking for something on the last line of the file?

if you are going the shell route you can use if [ "$(tail -n 1 "$file")" = "some_string" ]; then
........
fi

you might need a double [] vs the single

mfilby · December 19, 2003, 5:21pm

>> Originally posted by Optimus_P: what do you mean EOF (end of file) string?

Specifically, I'm working with pdf files, all of which end with EOF.

cat -v /testfile*.pdf on a complete file contains "EOF" at the end. I want to detect that to know if the file is completely uploaded before any cp/mv command on the same file.

Am I wrong to assume all files have an EOF?

fpmurphy · December 21, 2003, 6:48pm

>> Am I wrong to assume all files have an EOF?

Yes, you are.

oombera · December 21, 2003, 8:50pm

fpmurphy, could you possibly elaborate on your answer at all? :rolleyes:

Perderabo · December 21, 2003, 11:01pm

I rather liked fpmurphy's answer. But to spell it out...

Do a "tail /etc/passwd", "tail /etc/group", and "tail /etc/hosts". I'll bet that you can quickly locate a file that does not literally have "EOF\n" as the last four characters. Some files are even less than four characters in length. They don't pop out of existence or anything.

So yes, while pdf files end with the string "EOF\n", many files do not.

oombera · December 22, 2003, 9:10am

Guess I read too quickly.. didn't realize the OP was asking literally for a string that says EOF or that PDF files have that string at the end. Thanks, Perderabo.

Perderabo · December 22, 2003, 10:36am

Actually, I never realized that PDF's have that string at the end either. But I looked at a few and they all seem to have it. What I'm less clear about is if that string is guaranteed to appear only at the end. The OP seems to think so, but I'm not sure. Still, it is perhaps unlikely that ftp transfer would stall at exactly that point. So the technique might be good enough.

I like the flag file idea personally. Transfer a data file. Then transfer a zero-length "done" file. When the second file arrives, the first is clearly finished.
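
A minimal sketch of the receiving side of the flag file idea, under some assumptions that are not from the thread: each upload "some.pdf" gets its own flag named "some.pdf.done" (so several uploads can be in flight at once), and the function name and directory arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch: move an upload only once its flag file has arrived.
# Assumption: the flag for "some.pdf" is "some.pdf.done" in the same directory.

process_flagged() {
    incoming=$1                             # directory the FTP server writes to
    ready=$2                                # directory for finished files
    for flag in "$incoming"/*.done; do
        [ -e "$flag" ] || continue          # glob matched nothing
        data=${flag%.done}                  # strip the .done suffix
        if [ -f "$data" ]; then
            # Remove the flag only if the move succeeded.
            mv "$data" "$ready"/ && rm -f "$flag"
        fi
    done
}
```

Called from cron as something like process_flagged /ftp/incoming /ftp/ready (both paths hypothetical), a data file whose flag has not arrived yet is simply left alone until the next run.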

mfilby · December 22, 2003, 12:47pm

You lost me on that one. Transfer a zero-data file to indicate the file is complete?

Let me go back to an earlier assumption, EOF at the end of PDFs. I've found many PDFs that have two EOF strings. Looking at a very small sample (30 PDFs), more than half had two EOF strings. I have no idea what the implications are of this. So it makes me want to do a reverse text search on the file for the same string (which can be made even more concrete, as the EOF is always "%%EOF". %% because it's binary? I dunno). Something that can be done in a text editor, but not in a grep search.

Like I mentioned before, the only other method I can think to do this is to do a cyclical check on file size until file size is the same between checks. This isn't bullet proof, either, as an FTP upload could timeout, drop connection, and leave a partial file. I guess I could do the file size check, then double-check it with the %%EOF search.
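
The cyclical size check could be sketched roughly like this. The function names, the 15-second default and the retry limit are illustrative choices, and as noted above a stable size only shows that the upload stopped, not that it finished.

```shell
#!/bin/sh
# Sketch: succeed once a file's size is unchanged between two checks.

file_size() {
    [ -f "$1" ] || return 1
    # wc -c is portable; tr strips the padding some systems add.
    wc -c < "$1" | tr -d ' '
}

wait_until_stable() {
    file=$1
    interval=${2:-15}     # seconds between checks
    tries=${3:-40}        # give up after this many checks
    last=$(file_size "$file") || return 1
    while [ "$tries" -gt 0 ]; do
        sleep "$interval"
        now=$(file_size "$file") || return 1
        if [ "$now" = "$last" ]; then
            return 0      # unchanged for one full interval
        fi
        last=$now
        tries=$((tries - 1))
    done
    return 1              # still growing when we gave up
}
```

With the defaults this waits at most 40 x 15 seconds, which covers the 10-minute uploads mentioned earlier; a stalled connection would still pass, so a second check on the content is worth keeping.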

To be really specific, I work for a newspaper that has commercial printing clients that upload their print jobs via FTP. We then automatically download those PDFs to our RIPs for processing. With this method we remove hours of manual labor checking for and moving files, not to mention cut deadline times to a minimum and allow clients to upload files late into the night without having to have staff scheduled on our end. We currently do this on Apple hardware, some running OS X, others not, all using AppleScripts and some pretty crappy proprietary software from a newspaper-specific software company. It's less than perfectly reliable for a process that needs to be 99% reliable... so I figured a shell script in OS X would make a great deal more sense.

I'm sure nobody wanted to know that much...

oombera · December 22, 2003, 1:40pm

mfilby, the second zero-length file (a file with nothing in it) will only be transferred when the first file is completely finished transferring... therefore if you receive that second file, the first must've been received too.

your explanation of why you're doing this actually helps. now that you've explained more, it sounds like that would require your customers to send their pdf file and then send a second empty file which is probably not practical for them.

mfilby · December 22, 2003, 2:04pm

It's good to have a reason to be long-winded. Sometimes it's better to explain what you're actually trying to do than ask help for a solution that solves the wrong problem.

What about a combination of cyclical file-size checks, then check for %%EOF to remove the possibility of a partial file left from a disconnected FTP session or timeout?

I've always approached this with the method that once the file is uploaded and complete, it gets moved to a second "temp" directory that is checked by a local wget or curl script where it's downloaded from without a check.

Perderabo · December 22, 2003, 2:49pm

If you're simply doing a one-sided crc, that's silly. You could just loop checking the file length until that seems stable. If you can get the client to do a crc and send that to you then you compare the crc's to see if they're the same. That is actually a great solution. And if they send you a file length and a crc, that is even better.
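
A rough sketch of the length-plus-crc comparison, assuming (this is not something the thread specifies) that the client runs "cksum some.pdf > some.pdf.cksum" and uploads that sidecar file after the PDF. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: compare a file against the CRC and length a client sent.
# Assumption: the sidecar holds the output of the standard cksum utility,
# i.e. "<crc> <length> <name>" on one line.

verify_cksum() {
    file=$1
    sidecar=$2
    [ -f "$file" ] && [ -f "$sidecar" ] || return 1
    # Only the first two fields matter; the name the client saw may differ.
    read -r want_crc want_len rest < "$sidecar" || :
    [ -n "$want_crc" ] && [ -n "$want_len" ] || return 1
    # cksum reading stdin prints just "<crc> <length>".
    set -- $(cksum < "$file")
    [ "$1" = "$want_crc" ] && [ "$2" = "$want_len" ]
}
```

Because the sidecar is uploaded last, its arrival also acts as the flag file, so one extra upload gives both signals.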

But the flag file is so simple... Instead of
put some.pdf
your client does a
put some.pdf
put done
There is much less chance of the client making an error with a simple procedure like that.

I have to say that I don't like the check for EOF string idea. First, I still believe that it can appear earlier in the file. But more importantly, someday you will need to handle a *.doc or *.txt or a *.rtf or something else. And your script will need to be quickly changed to what I am now suggesting. Change happens very fast in this business.

mfilby · December 22, 2003, 5:29pm

I like the EOF string idea since I think it's clumsy to expect clients to upload a separate data file that they have no real interest in. It's one more requirement that's out of my control and another responsibility for an unsophisticated user (my apologies, they're really quite intelligent but wouldn't give 2 nickels about a separate data file). The idea is to make things simple for the client. They shouldn't have to do anything different.

I've been reviewing Adobe's PDF format reference, and indeed %%EOF can appear numerous times in the file and yet not truly indicate the end of a file. However, %%EOF will always be in the last 1024 bytes of the file:

3.4.4, "File Trailer"
17. Acrobat viewers require only that the %%EOF marker appear somewhere
within the last 1024 bytes of the file.

So again, if I had any way, even from the host of the ftp site, to scan a directory for files containing a %%EOF, beginning the search from the "theoretical" end of the file, I'd be a happy camper. Something as simple as "fgrep -l '%%EOF' /path/*.pdf", only recursively, and then set the resulting filenames to a variable that I could then process with mv or cp.

Am I being too simplistic? Take it easy on me, I'm from the Apple World of GUI ignorance.
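
One way to search only the tail of each file, sketched with standard tools: tail -c cuts out the last 1024 bytes and grep looks for the marker there, while find handles the recursion. The function names are made up for illustration. This is a heuristic rather than proof of completeness, since %%EOF can also occur earlier in a PDF and a truncated upload could happen to end just after one.

```shell
#!/bin/sh
# Sketch: list PDFs whose last 1024 bytes contain the %%EOF marker.

has_pdf_eof() {
    # True if %%EOF occurs in the last 1024 bytes of the file.
    # LC_ALL=C makes grep treat the binary data as plain bytes.
    tail -c 1024 "$1" 2>/dev/null | LC_ALL=C grep -q '%%EOF'
}

list_complete_pdfs() {
    # Print the path of every *.pdf under directory $1 that passes the check.
    find "$1" -type f -name '*.pdf' | while IFS= read -r f; do
        if has_pdf_eof "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Rather than collecting the names in a variable, which breaks on file names with spaces, the output can be piped straight into a loop, for example: list_complete_pdfs /path | while IFS= read -r f; do mv "$f" /ready/; done (both directories are placeholders).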