sftp - get newly created files on incremental basis

Hi,

We have an sftp server that creates files daily and keeps 6 months of files. We are building a daily job to get the files and load them into a database. My problem is how to get ONLY the files that were created after my last get. Let me provide some more details.

The example below shows the files available on the server as of day 1, 2 and 3.

Day1:
customer_data_2010-12-09.2010-12-10_03.15.01.zip
customer_data_2010-12-09.2010-12-10_03.15.01.zip
customer_data_2010-12-10.2010-12-10_19.15.02.zip
customer_data_2010-12-10.2010-12-10_19.15.02.zip

Day2:
customer_data_2010-12-09.2010-12-10_03.15.01.zip
customer_data_2010-12-09.2010-12-10_03.15.01.zip
customer_data_2010-12-10.2010-12-10_19.15.02.zip
customer_data_2010-12-10.2010-12-10_19.15.02.zip
customer_data_2010-12-10.2010-12-11_03.15.01.zip
customer_data_2010-12-10.2010-12-11_03.15.01.zip

Day3:
customer_data_2010-12-09.2010-12-10_03.15.01.zip
customer_data_2010-12-09.2010-12-10_03.15.01.zip
customer_data_2010-12-10.2010-12-10_19.15.02.zip
customer_data_2010-12-10.2010-12-10_19.15.02.zip
customer_data_2010-12-10.2010-12-11_03.15.01.zip
customer_data_2010-12-10.2010-12-11_03.15.01.zip
customer_data_2010-12-11.2010-12-11_19.15.01.zip
customer_data_2010-12-11.2010-12-11_19.15.01.zip
customer_data_2010-12-12.2010-12-13_03.15.01.zip
customer_data_2010-12-12.2010-12-13_03.15.01.zip

Day 1 is easy, as we can simply get all 4 files. Two more files were created on day 2, so we should get only those 2 onto the database server. On day 3, 4 more new files were created. Please let me know if you can think of an easy way to get only the files created on the sftp server since the last get. I appreciate your help on this.

Here is what I can think of:
Maintain a file, say "processed_file", that holds the names of all files processed so far. It will be empty on the very first day, and new file names get appended to it on each run.
The daily job script then lists ALL files available on the sftp server into a temp file, diffs the temp file against processed_file, and writes the result to a new "TO_BE_PROCESSED" file.
Finally, get all the files listed in "TO_BE_PROCESSED", process them as needed, and append each fetched file name to processed_file.
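
A minimal sketch of that bookkeeping, assuming the server listing has already been saved one name per line. The function name new_files and the file names in the usage comment are placeholders for illustration only:

```shell
#!/bin/sh
# new_files LISTING PROCESSED
# Print every name in LISTING that does not appear in PROCESSED.
# Both files hold one file name per line; PROCESSED may be empty or missing.
new_files() {
    listing=$1
    processed=$2
    # Create the bookkeeping file on the very first run.
    [ -f "$processed" ] || : > "$processed"
    # First file: remember every processed name.
    # Second file: print names not seen yet (this also drops repeated names).
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }
         !($0 in seen)       { seen[$0] = 1; print }' "$processed" "$listing"
}

# Example usage (all_files.txt would come from an sftp "ls -1"):
#   new_files all_files.txt processed_file > TO_BE_PROCESSED
#   ... fetch and load each file named in TO_BE_PROCESSED ...
#   cat TO_BE_PROCESSED >> processed_file
```

Appending to processed_file only after a file has been fetched and loaded means a failed run simply retries the same names next time.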

Thanks Anurag. Approach looks OK.
I am thinking more of finding all files created on the ftp server since our last get, based on timestamp. Any other thoughts?
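
If the second date-time in each name is the creation timestamp (an assumption based on the sample names, e.g. 2010-12-10_03.15.01 in customer_data_2010-12-09.2010-12-10_03.15.01.zip), a plain string comparison may be enough, because that format sorts chronologically. A rough sketch; the function name newer_than is made up for illustration:

```shell
#!/bin/sh
# newer_than STAMP
# Read file names on stdin and print those whose embedded creation stamp
# sorts after STAMP. Assumed name layout: <prefix>.<YYYY-MM-DD_HH.MM.SS>.zip
newer_than() {
    awk -v last="$1" '{
        name = $0
        sub(/\.zip$/, "", name)                  # drop the extension
        stamp = substr(name, length(name) - 18)  # last 19 characters
        # Only accept names that really end in a stamp, then compare as strings.
        if (stamp ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/ &&
            (stamp "") > (last ""))
            print $0
    }'
}

# Example usage (the stamp would be saved from the previous run):
#   newer_than "2010-12-10_19.15.02" < /tmp/filelist
```

A file whose stamp equals the saved value is skipped, so the saved value should be the largest stamp actually fetched. Passing an empty stamp selects every file that matches the layout.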

rsync

We do not have rsync installed on our servers. Any other thoughts from anyone else? I appreciate your help.

How about moving the files to a "processed" sub-directory on the sftp server?

I thought about this option, but could not find any command for moving files in the sftp help. mv does not work at the sftp prompt.

I also see one limitation with this approach. Let me explain.

Say there are 10 files on the ftp server. We get them using an mget call. While we are getting the files, maybe two more get added. After the get completes, we do the move call into the processed directory, which would also move the two files we never fetched. So we would miss those two files and never get them for processing. Am I missing something?

Yes, mget is a bad way to go.
You are better off doing an ls of the files first and then generating a script (from the file list) that gets and moves each file individually, without using wildcards.

Fetch each file to a .tfr file first and shell out to mv to give it its final name (this stops any local scripts from picking up files that are still being transferred).

Here is a quick script to get you going (no exit status checking or log file writing)

cat > /tmp/list.cmd$$ <<EOF
cd src_dir
ls -1
quit
EOF
# Run the listing first; the filter below is applied to its output
sftp -b /tmp/list.cmd$$ user@host > /tmp/list.tmp$$
# Remove all <CTRL-M> chars from filename
# Ignore any lines where name:
#     begins with . (dot) or / (slash)
#     contains a space (this also drops the "sftp> ..." command lines
#     that sftp echoes in batch mode)
#
sed -e '/ /d' -e '/^\./d' -e '/^\//d' -e 's/^M//g' /tmp/list.tmp$$ > /tmp/filelist
echo "cd src_dir" > /tmp/get.cmd$$
for file in `cat /tmp/filelist`
do
    echo "get $file $file.tfr" >> /tmp/get.cmd$$
    echo "!mv $file.tfr $file" >> /tmp/get.cmd$$
    echo "rename $file ../processed/$file" >> /tmp/get.cmd$$
done
echo "quit" >> /tmp/get.cmd$$
sftp -b /tmp/get.cmd$$ user@host > /tmp/filelist
rm /tmp/list.tmp$$ /tmp/list.cmd$$ /tmp/get.cmd$$

Note: ^M represents a <CR> character; type CTRL-V CTRL-M to enter it in vi.
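
The script leaves out error handling. As a rough sketch of how the batch generation and an exit-status check might look, assuming OpenSSH sftp (which stops a -b batch at the first failing get or rename and exits non-zero). build_get_batch is a hypothetical helper name; src_dir, ../processed and user@host are the placeholders from the script above:

```shell
#!/bin/sh
# build_get_batch FILELIST
# Print an sftp batch that fetches each listed file under a temporary
# .tfr name, renames it locally, then moves the remote copy to ../processed.
build_get_batch() {
    echo "cd src_dir"
    # Read line by line instead of word-splitting the output of cat.
    while IFS= read -r file
    do
        [ -n "$file" ] || continue      # skip blank lines
        echo "get $file $file.tfr"
        echo "!mv $file.tfr $file"
        echo "rename $file ../processed/$file"
    done < "$1"
    echo "quit"
}

# Example usage (not run here; it needs a real server):
#   build_get_batch /tmp/filelist > /tmp/get.cmd$$
#   if sftp -b /tmp/get.cmd$$ user@host
#   then echo "transfer ok"
#   else echo "sftp failed with status $?" >&2
#   fi
```

Because each rename follows its own get, a batch that stops partway leaves the unfetched files in src_dir for the next run.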

Thanks, Chubler, for the sample script. It works fine for my problem.
The only issue I can think of is how to ignore files that are incomplete or still being created. ls -1 lists all files present. Is there any way to skip those files? Please let me know.

Maybe use the find utility to only move files that are more than an hour old?

The find command does not work at the sftp prompt. It is not in the list of sftp commands. Am I missing something?

Yes, you are missing something. The find command is a separate utility and you need to modify the script supplied by Chubler_XL to utilize it. Man find(1) for more information. It is not hard to do.

fpmurphy, ls is supported via sftp, but find isn't. I suspect that ravi.videla doesn't have shell (ssh) access to the sftp server, so he probably can't run find there.

ravi.videla, do you have any control over the creation process? If so, a couple of suggestions:

  1. upload/build files in another directory (on the same filesystem) and then rename (mv) them after the upload/creation is done.
  2. upload/build with a different name (e.g. a .tfr extension) and rename after completion. You then exclude .tfr files by adding one more expression to the sed command in the script (same input and output files as before; the <CR> removal moves to the front so the $ anchor still matches):
    sed -e 's/^M//g' -e '/ /d' -e '/^\./d' -e '/\.tfr$/d' -e '/^\//d'

I am getting an "invalid command" error for find (as fpmurphy suggested) at the sftp prompt.
I do not have any control over the file creation process. All we have is the ftp server user/password and the directory from which we should get files. It looks like I should ask the source team to use a different file extension such as .tfr while they are creating files, and rename each file with its proper extension once it is completely created.

Ravi,

A workaround is to amend your code to create a reference file after every transfer, for example tra_`date`.

Then, when you initiate a transfer, run the find command with the -cnewer or -anewer option and the reference file name.

It would be something like:

find /directory_to_search -cnewer reference_file -type f | while read INPUT
do
    scp "$INPUT" abc@destination:/directory < /dev/null
done

Note: the < /dev/null (or the -n option, if you call ssh directly) is important; it keeps the command from consuming the rest of the file list on standard input.
Also write a one- or two-liner to deal with old reference files.

I have only ftp access to the file server and no ssh access, so I won't be able to use the find command.