Error check for copying growing directories

I have a simple script that copies a directory from one place to another and then deletes the source.
I am facing a situation where new files get added after the script has started running. It's resulting in data loss.

Please suggest a way to avoid data loss. I googled a lot, but most of what I found are Perl solutions. I am looking for something in shell script.

What proportion of the files shows up as changed when you do 'ls -ltr'? How often do changes happen, and what is the pattern in which these files are updated?

A cyclic check on the files being transferred would be a better idea. Do you expect the files to change every second, every minute, or what? Calculate the mean of that interval and then start transferring them.

Secondly, you could try a register file where you write down the files transferred so far, and then skip them in the next run of the script.
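
The register-file idea could be sketched roughly as below. This is a minimal sketch, not a tested solution: the function name copy_unregistered and its three arguments (source directory, destination directory, register file) are made up for illustration, it handles only a flat directory, and it assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper: copy each regular file in $1 to $2 once,
# recording its name in the register file $3.
copy_unregistered() {
    src=$1 dest=$2 register=$3
    touch "$register" || return 1
    for file in "$src"/*
    do
        [ -f "$file" ] || continue
        name=${file##*/}
        # skip names recorded by an earlier run (-x whole line, -F literal string)
        grep -qxF -- "$name" "$register" && continue
        # record the name only if the copy succeeded
        cp -p "$file" "$dest"/ && printf '%s\n' "$name" >> "$register"
    done
}
```

Note that a register only prevents copying the same name twice. It does not tell you whether a file was still being written when it was copied.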

The files are posted at random intervals, so I can't be sure.

The requirement here is to clean up the source after a proper copy.

Don't cp, use mv instead. mv is an atomic command, deleting the source only if copy succeeded.

mv is safe if source and destination are in the same file system.
On different file systems, it must copy the data like cp.
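
One rough way to tell the two cases apart is to compare the mount points that df reports for each path. This is only a sketch: same_filesystem is a made-up name, and it assumes neither the file system name nor the mount point contains spaces. Bind mounts of one file system will be reported as different, which errs on the safe side.

```shell
#!/bin/sh
# Hypothetical helper: succeed if both paths are reported under the same mount point.
same_filesystem() {
    # df -P prints a header plus one line per operand; field 6 is "Mounted on"
    mp1=$(df -P "$1" | awk 'NR == 2 { print $6 }')
    mp2=$(df -P "$2" | awk 'NR == 2 { print $6 }')
    [ -n "$mp1" ] && [ "$mp1" = "$mp2" ]
}
```

A script could call same_filesystem /path/to/source /path/to/destination and use a plain mv only when it succeeds, falling back to copy, verify, and delete otherwise.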
--
It might help to only copy files whose ctime is more than 1 hour old.
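
A minimal sketch of that age filter, assuming a find that supports -cmin (GNU and BSD find do; POSIX does not require it). The function name copy_settled and its arguments are invented for illustration, and all matching files land directly in the destination directory, so any subdirectory structure is flattened.

```shell
#!/bin/sh
# Hypothetical helper: copy regular files under $1 into $2, but only those
# whose status last changed more than $3 minutes ago.
copy_settled() {
    src=$1 dest=$2 minutes=$3
    # -cmin +N is a GNU/BSD extension: ctime more than N minutes in the past
    find "$src" -type f -cmin +"$minutes" -exec cp -p {} "$dest"/ \;
}
```

For the one-hour threshold this would be called as copy_settled /path/to/source /path/to/destination 60. A file that has been quiet for an hour is probably complete, but that is a heuristic, not a guarantee.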

Agreed. But wouldn't it delete the source only if the copy across file systems succeeded?

No offense intended (just honesty), but your problem statement is useless. Given its utter lack of specificity, I'm surprised anyone invested any of their time in responding to it.

Accurate answers to the following questions will probably lead to a quick resolution:

  • What operating system are you using?
  • What are the exact commands used to add files to the current directory?
  • What are the exact commands used to copy the files to their new location?
  • Are these two directories part of the same filesystem?
  • What are the exact commands (if any) that are run as part of any subsequent clean up?
  • What exactly do you mean by data loss? Are entire files missing? Are you seeing partially complete files? Something else?

For all we know, your problem may be as simple as misusing 'rm -fr' when 'rmdir' is required.

In the future, if you would like accurate, focused assistance, save everyone (yourself included) time and be specific from the start.

Regards,
Alister

As another wild guess here, it sounds to me like ningy is starting to copy files while they are being written and then removes them at the source (while they are still being written). I think the code needs to be modified to be sure the file is complete before the move starts.

Thanks for the inputs everyone.

Yes, I was doing :

cp -rp source destination
ln -s  dest source
rm -rf source

But it failed when new files were still being written.

I have just started exploring the find command with the -newer option to check whether new files were added: I create a file just before the copy and check against it before removing the source. Not sure if that's the best thing to do, but it's a start at least.
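
A rough sketch of that marker-file check, under stated assumptions: copy_then_remove is an invented name, the marker lives in a temporary file created by mktemp, and -newer compares modification times, so a change that lands within the file system's timestamp granularity of the marker can be missed. There is also still a small window between the final check and the rm.

```shell
#!/bin/sh
# Hypothetical helper: copy directory $1 to $2, then remove $1 only if
# nothing inside it was modified after the copy started.
copy_then_remove() {
    src=$1 dest=$2
    # reference timestamp taken just before the copy begins
    marker=$(mktemp) || return 1
    if cp -rp "$src" "$dest" &&
       [ -z "$(find "$src" -newer "$marker" -print)" ]
    then
        rm -rf "$src"
        status=0
    else
        echo "copy failed or $src changed during the copy; source kept" >&2
        status=1
    fi
    rm -f "$marker"
    return "$status"
}
```

Because adding a file updates the modification time of its directory, the find also notices files that appeared after the copy had already passed that directory.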

Yes, much better.
The goal is to make the check as close as possible to the operation it protects.
I would even go for

find -type f -cmin +1 ... cp ...

if you have GNU find.
Perhaps the best approach is like your suggestion, but done twice:

cd /path/to/source || exit
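# create an empty file for a time comparison
 > findstart
# skip the two reference files themselves, or the loop would copy and delete them
find . -type f ! -name findstart ! -name copystart -print |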
while IFS= read -r file
do
 # don't copy file if newer
 find "$file" -newer findstart | grep . >/dev/null && continue
 # another reference file
  > copystart
 cp -rp "$file" /path/to/destination/ || continue
 # don't delete file if newer
 find "$file" -newer copystart | grep . >/dev/null && continue
 rm -f "$file"
done
rm -f findstart copystart

If I understand what you're saying, it won't solve your problem. You don't need to know if a file is new before you remove it; you need to know that a file is complete before you start copying it. You can only do that by having the server provide some indication that the data in the new file is complete. The client can't reliably know that the source file on the server is complete unless the server provides some unambiguous way to determine that.

What MadeInGermany recently proposed is a big step in the right direction, but there is still no guarantee that the process loading the file being copied will not have been sleeping or "swapped out" while the copy to the client was being processed.

You could easily eliminate all of your headaches if you had a directory on the same filesystem as "source" that was dedicated to files in flight. Since it knows when it's done with a file, the script writing the file should be in charge of mv'ing from the in-flight directory to "source". This way, every file in "source" is guaranteed to be complete.

In my opinion, this is the simplest and most robust solution. Anything else will be either more complicated or less dependable or both.
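
A minimal sketch of the writer side of that scheme. The function name publish and the directory arguments are hypothetical, and the sketch assumes the in-flight directory and the ready directory are on the same file system, so that mv is a rename rather than a copy.

```shell
#!/bin/sh
# Hypothetical writer-side helper: build the file in the in-flight directory
# ($1), then publish it under the name $3 in the ready directory ($2) with mv.
publish() {
    inflight=$1 ready=$2 name=$3
    tmp=$inflight/$name.$$
    # stdin is the file content; nothing appears in $ready until it is complete
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$ready/$name"
}
```

A producer would call it as, for example, some_command | publish /path/to/inflight /path/to/source report.txt, where some_command and the paths are placeholders. The copying script then only ever looks at the ready directory.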

Regards,
Alister