Issue with ftp hanging

I could have sworn I posted on this issue earlier a couple of weeks ago but now cannot find the thread to add some updated info so I guess I'll start from scratch.

Running on Oracle Linux 5.6, 64-bit.

I have a weekly job that does an ftp 'mget' to copy a bunch of files from a production server to a test server. The two servers are in two different data centers. One DC has all of our test servers and theoretically would be our disaster recovery location. The other DC has all the production servers. The script has been working flawlessly for years - until .....

We relocated the production DC. With that relocation we got a less reliable, lower-bandwidth wireless (microwave?) link between the two DCs.

Since the DC move, I have not had a successful ftp operation. In every case, it will successfully connect and copy several files, but eventually it will get to a file and just hang. When it hangs, it is not on the first file and it is not on the last file, so the issue has nothing to do with authorization or fundamental connectivity, nor does it have to do with housekeeping of getting started with the overall operation or getting it all wrapped up at the end.

A couple of other 'odd' observations:

1 - For a given set of files to be transferred, once the operation hangs, any repeated attempts hang on the same file.

2 - Each week it is working with an entirely different (newly created since the last week) set of files, so obviously hangs on a different file than the previous week, but see point 1.

3 - I have a completely different pair of prod/test servers on which I occasionally have to do the same sort of ftp operation on demand rather than on schedule. It copies a much smaller set of files but exhibits the same 'hanging' behavior.

4 - In order to get the real work done, I switched from ftp to scp. With that I can get the files copied and do the work needed, but it is taking much longer. The ftp - when it worked - took between 90 and 120 minutes. With scp it is taking between 6 and 7 hours. I do not know whether this time difference is purely due to the bandwidth, that is, whether ftp would now take just as long if I could get it to work at all.

My primary question is what could be up with the 'hanging' issue in ftp, and how do I go about resolving it?

Does the moved DC have a new IP that has not been adjusted in /etc/hosts?

How do you define 'hang'? If a 5kb file takes 2 hrs, I would call that infinite.
As it is the same file (within the same week), how large is the file that hangs?
If it is larger than 4gb, what is the destination underlying filesystem (eg: ext4, ntfs, fat)?

not that much experience, but hth & gl

A couple of questions to consider:-

  • Is the file being read a pipe rather than a plain file by any chance?
    If so, it will be waiting for input and an end-of-file to be written to the pipe file.
  • Are there any errors returned or does the whole thing just stop?
  • Are there network errors? Small files might be okay, but larger ones might go extremely slowly (seemingly infinitely). I have experienced this when a switch forced a connection to 100Mb full-duplex but the card could only run at 10Mb half-duplex.

Maybe investigating these might help,
Robin

Since the problem occurred after relocation I'd be inclined to suspect the network. I'd write a script to ping the target, say, every 10 seconds (to avoid too much artificial network traffic) and leave it running in front of me. When the ftp 'hangs' is the ping still fast????
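
A minimal sketch of such a ping logger, assuming a Linux ping that accepts -c (count) and -W (reply timeout in seconds). The function name probe_once, the log file name pingwatch.log and the host name myprodserver are placeholders of mine, not anything from the real setup. Each probe is timestamped so the log can be lined up afterwards against the moment the ftp stalls.

```sh
#!/bin/sh
# Sketch only: print one timestamped line per ping probe.

# probe_once HOST: send a single ping and report OK (with the round-trip time) or LOST.
probe_once() {
    host=$1
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    # -c 1: one echo request; -W 5: wait at most 5 seconds for the reply.
    # "|| true" keeps a missing reply from aborting scripts run with "set -e".
    reply=$(ping -c 1 -W 5 "$host" 2>/dev/null | grep 'time=' || true)
    if [ -n "$reply" ]; then
        # Keep only what follows "time=", e.g. "0.155 ms".
        echo "$stamp $host OK ${reply##*time=}"
    else
        echo "$stamp $host LOST"
    fi
}

# Usage (hypothetical host and log file), one probe every 10 seconds:
#   while :; do probe_once myprodserver; sleep 10; done | tee -a pingwatch.log
```

A run of LOST lines starting when the ftp stops would point at the link; fast OK lines while the ftp sits idle would point at the ftp session itself.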

sea -

All of the files are in the 1 to 4 gb range. Most of them are right at 4gb. When tailing the log file, the ones that complete do so in a couple of minutes at most. I can't wait 'forever' but when the next one goes two hours without any progress, I'd call that a 'hang'. If it were the underlying FS, it should have presented problems prior to the DC move, and the scp copy should have the same issue. No, the only variable here is the network connection, and perhaps ftp's reaction to it. Given the apparent reduction in bandwidth, I'd expect it to be slower, but I wouldn't expect ftp to grind to a complete halt, while scp is able to do the same work.

---------- Post updated at 02:12 PM ---------- Previous update was at 02:06 PM ----------

Robin -
1 - no pipes, just plain files.
2 - no errors returned ... it just grinds to a halt. Before it does so, I tail -f the file and watch things go by. I can also see the time stamp on the file (ls -l) progressing every couple of minutes as it completes another file and writes the info about it. Then everything just stops moving.
3 - No network errors that I know how to log (I'm DBA, not Net Admin) but I'm willing to take a look if someone can tell me where.
3a - most of the files are at 4gb. Several of them move just fine, so it can't be file size alone, though I could see that being a 'necessary but not sufficient' component of the problem. I don't know about the possibility of the network speed and duplex switch, but I'll take that up with the Net Admin. Thanks for the idea.

---------- Post updated at 02:14 PM ---------- Previous update was at 02:12 PM ----------

Good idea. I'll try to set up a test.

@edstevens......We are discussing network performance here so, for the avoidance of doubt, there is no Windoze involved in this, is there? We are talking Oracle Linux here, yes?

No Windows at all (thankfully :)) Both servers are Oracle Linux.

---------- Post updated at 01:06 PM ---------- Previous update was at 08:09 AM ----------

Ok, here's what I did and what I found ..
First, a script to capture some good ping statistics

#!/bin/sh
# Collect ping statistics before and during the ftp job.
rm -f edspingtest.log    # -f: no error if the log does not exist yet
echo "starting first ping at $(date)"
ping -c 20 -i 10 myprodserver >> edspingtest.log
echo "Starting ftp job at $(date)" | tee -a edspingtest.log
# Run the ftp job in the background, capturing stderr as well as stdout.
nohup /u01/app/oracle/dba/eds_ftp_test > /backup/eds_ftp_test.log 2>&1 &
echo "starting second ping at $(date)"
ping -c 20 -i 10 myprodserver >> edspingtest.log

The summary stats of the first 20 pings (before launching the ftp):

rtt min/avg/max/mdev = 0.155/0.595/4.702/1.036 ms

And the 20 pings during the ftp:

rtt min/avg/max/mdev = 0.157/2.266/16.610/4.356 ms

So there was a significant difference. Not sure what to do with that information.

By way of comparison, a simple 'ping -c 50' from the same test server to a different test server (same data center, so not using the link between DCs) yielded these stats:

rtt min/avg/max/mdev = 0.081/0.124/1.136/0.171 ms
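
To put a number on the difference, here is a rough sketch that pulls the avg field out of summary lines like the ones above and computes the ratio. The helper names rtt_avg and slowdown are made up for illustration, and the parsing assumes the Linux "rtt min/avg/max/mdev = ..." summary format shown here.

```sh
#!/bin/sh
# Sketch only: compare average round-trip times from ping summary lines.

# rtt_avg LINE: print the avg value from an "rtt min/avg/max/mdev = a/b/c/d ms" line.
rtt_avg() {
    echo "$1" | awk -F' = ' '/^rtt/ { split($2, v, "/"); print v[2] }'
}

# slowdown BASE BUSY: print BUSY divided by BASE, to one decimal place.
slowdown() {
    awk -v base="$1" -v busy="$2" 'BEGIN { printf "%.1f\n", busy / base }'
}

# Usage with the figures quoted above:
#   idle=$(rtt_avg 'rtt min/avg/max/mdev = 0.155/0.595/4.702/1.036 ms')
#   busy=$(rtt_avg 'rtt min/avg/max/mdev = 0.157/2.266/16.610/4.356 ms')
#   slowdown "$idle" "$busy"
```

With those figures the average during the ftp comes out at roughly 3.8 times the idle average, and the idle cross-DC average at roughly 4.8 times the same-DC one.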

So at "the end of the day", it's obvious that the data link between the two DCs is rather slow, and there's nothing I can do about that. I can (and probably should) get the files copied with scp instead of ftp. But I still find it puzzling that the ftp should completely hang the way it does.

Yea, but.......yea, but......

It's obvious that pings will get slower when other traffic is on the link, ie, the ftp job, and that doesn't tell us anything.

If the ftp is running from node A to node B then I'd set up ping scripts in both directions. The question is what happens to ping response times when the ftp job 'hangs'? Do the pings stop completely (lost packets), or return to full speed (ftp is producing no traffic so ping is fast)?

At that time what speed are pings from node C to node A? And node C to node B? Does this show that the network interface on A or B has completely screwed up and not communicating at all? Or do both interfaces still work but ftp is hung?

Unfortunately, I think you need to run this test to destruction (until the ftp is completely stuffed and 'hung') and then see if you can get either or both interfaces to talk. If one node won't communicate then that's the node to investigate.

Also consider -

netstat -s | grep -i drop

to look for tcp drops - in general there should be few. Check on both servers.

Oops - this is a Red Hat derived system. I know it supports netstat, but I am only assuming it supports the -s option.
I do not know for certain.

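Try (substituting your own interface name; on Linux it is usually something like eth0 rather than ent0):-

ethtool -S eth0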

This will give you all sorts of statistics about the card including any dropped packets. What output do you get?
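
Since that output can run to dozens of counters, a small filter may help. This is only a sketch: the function name nonzero_errors is made up, eth0 is an assumed interface name, and the counter names vary from driver to driver, so the pattern below is a guess at the usual suspects rather than a complete list.

```sh
#!/bin/sh
# Sketch only: show error-like counters from "ethtool -S" output that are not zero.

# nonzero_errors: read "name: value" lines on stdin and print those whose name
# looks like an error/drop counter and whose value is greater than zero.
nonzero_errors() {
    awk -F': *' '
        tolower($1) ~ /err|drop|crc|coll|fifo/ && $2 + 0 > 0 {
            sub(/^[ \t]+/, "", $1)      # strip the leading indentation
            print $1 ": " $2
        }'
}

# Usage (interface name is an assumption):
#   ethtool -S eth0 | nonzero_errors
```

Running it on both servers before and after a hung transfer shows whether any of those counters moved.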

Robin

I think we're going to just have to put this one to bed. I ran another test, pinging both ways and also involving a third server. As expected, the average time of a ping became about 4 times slower during the ftp. I can't run it to any more 'destruction' than what I'm already doing. Watching stdout of the ftp, I can see it successfully pull several smaller files, then one of the c. 4gb files takes about 2 minutes, then on the next big file it hangs. I leave it trying for about 10 minutes before killing it. Past experience indicates that at this point, 10 minutes is no different than 10 hours. And yes, I have, in the past, left it up for as much as 16 hours.

I appreciate everyone's help, but I'm not sure what else can be done at this point.

Right, so after the ftp job is hung do the pings still work?

After the ftp job is hung can you still login to both systems?

After the ftp job is hung and you run another ftp job in the same direction does it run or hang immediately?

If you run the job reversed ('put' instead of 'get' or vice-versa) does it transfer all the files successfully without hanging?

More questions in addition to those from hicksd8 I'm afraid:-

  • When the FTP stops, is the ping response faster again or still about 4 times slower?
  • Can you leave the job that is stuck and start others that still work?
  • Can you transfer the file with the hash subcommand turned on first? It puts a # mark on the screen for each block of data transferred (typically 1024 bytes), so you can see whether it really stops or just goes very slowly.
  • Is the target file growing at all?
  • Is there something on the network that throttles your use?
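
For the question about the target file growing, something along these lines could be run on the receiving server while the ftp appears stuck. It is a sketch only; the function names size_of and growth_check and the example path are hypothetical.

```sh
#!/bin/sh
# Sketch only: check whether a file is still growing.

# size_of FILE: print the size of FILE in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

# growth_check FILE [SECONDS]: sample the size twice, SECONDS apart (default 60),
# and report whether the file grew in between.
growth_check() {
    file=$1
    interval=${2:-60}
    before=$(size_of "$file")
    sleep "$interval"
    after=$(size_of "$file")
    if [ "$after" -gt "$before" ]; then
        echo "growing: $((after - before)) bytes in ${interval}s"
    else
        echo "stalled at $after bytes"
    fi
}

# Usage (hypothetical path of the file currently being received):
#   growth_check /backup/incoming/bigfile.dmp 60
```

A file that still gains a few bytes per interval is a very slow transfer rather than a true hang, which is the same distinction the hash marks are meant to show.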

Sorry for all these, but it's difficult to spot anything yet.

Robin