data corruption with ftp transfer

Hi again,

First of all, thanks for your help with my last problem; that one is solved now.

But now I have another problem :slight_smile:
This time I transferred a big file, ~3.5 GByte, with ftp from a Sun machine to a Linux box running RedHat 7.3. The file received on the Linux box is corrupt. With smaller files there is no problem; I tested it with a 130 MB file.

On the Linux box I am using the default wu-ftpd server with the default configuration.
The only thing I changed is the server timeout, which I set to 24h.

thnx and best regards
Alex

What type of file is this large file? Are you using the correct file transfer type (ascii versus image)? What version of Solaris?

Is the ftp completing or failing while transmitting?

Is it stopping at exactly 2GB? This could also be a large file problem.

Hi,

The file transfer runs without error messages; only after the transfer has finished do you see a broken file.

The files are data files, and the Sun system is running SunOS 5.8.

We are using binary mode with ftp.

Alex

Is the file size the same on both systems after the ftp has completed? If not, how much are you missing?

If it's the same file size, try using md5 sums to make sure the file really is corrupt on the Linux box, so you can rule out any other strange problems.
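For example, something along these lines (a minimal sketch; "bigfile" is just a placeholder for your data file):

# on the Sun box (assuming an md5 tool such as GNU md5sum is installed):
md5sum bigfile
# on the Linux box, after the transfer:
md5sum bigfile
# if the two checksums differ, the received copy really is corrupt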

Just a thought..

/Peter

  • No, the corruption is not always at the same place in the file.

  • The file size is the same on both the source and the destination, but when you check the files with md5 you can see that they are different.

  • If you look into the corrupt file with a hex editor, you see a lot of zeros or parts of other files.

I had a problem just like that once.. the strange thing we found was that the disk was broken. fsck didn't report any errors (if you haven't run fsck on the disk, you should try that), but if we made a copy of the corrupted file and did an md5 check against the original, it didn't match.. :slight_smile: That was one weird problem.. You could give it a try: copy the file to the same disk under a different name and md5 check it, or have the Sun ftp the file to another disk on the Linux box and see if the file still gets corrupted.
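Roughly like this (bigfile is just a placeholder for the received file):

cp bigfile bigfile.copy          # copy on the same disk under a different name
md5sum bigfile bigfile.copy      # if these two checksums differ, the disk itself is suspect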

sounds like you have a corrupt disk..
/Peter

I don't know if that's already solved, but if not, I would check whether your NICs show any "ierrs". Sometimes data corruption occurs during packet transmission. We had some problems with that here. Here follows an example:
# netstat -ni
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 127.0.0.0 127.0.0.1 124727 0 124727 0 0 0
ge0 1500 172.19.148.0 172.19.151.85 410543681 95593 221423296 0 0 0
ge1 1500 172.19.148.0 172.19.151.84 259411750 0 161355918 0 0 0
ge2 1500 10.152.231.0 10.152.231.150 258896122 0 496691048 0 0 0
hme0 1500 20.10.1.0 20.10.1.40 781 0 3 0 0 0

On ge0 (a gigabit interface) we had some "ierrs" while ftp'ing, and that corrupted data. The FTP did not return any error and completed successfully, but we still had a corruption problem. As we have two NICs on the same network (Sun's IPMP), I forced my FTP connection to the other NIC (ge1) and the data corruption problem was gone. Maybe this helps.
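The corresponding counters on the Linux receiver are worth a look as well, for example (assuming the interface is eth0):

/sbin/ifconfig eth0
# check the "RX packets: ... errors: ... dropped: ... overruns: ... frame: ..."
# and "TX packets: ... errors: ..." lines; non-zero error counts point at the
# NIC, driver or cabling on the receiving side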

Htsubamoto

A few troubleshooting points:

1) Have you tried to gzip or tar the file before the transfer? (See the sketch after this list.)

2) Have you tried to copy it to another host to check whether it is a platform issue?

3) Can you establish an rlogin/rcp relationship and do a direct copy from one host to the other?

4) If possible, can you reset the NIC to see if that clears any errors? Of course this may interrupt normal data flow while it is resetting.
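For point 1, a minimal sketch (bigfile is just a placeholder name):

gzip -c bigfile > bigfile.gz     # on the sending side, before the transfer
# ... transfer bigfile.gz with ftp in binary mode ...
gzip -t bigfile.gz               # on the receiving side; -t verifies the archive's CRC
# a CRC error here confirms the data was damaged somewhere along the way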

Hi,

So, first of all, the problem is still alive :frowning: But I think we are heading in the right direction.

We were able to narrow down the problem; it looks like a Linux/ia64/NFS problem. We tried other platforms and didn't get this error.

The problem is only reproducible with our Linux ia64 file server.

At the moment, our development team is investigating the NFS code.

If someone is interested, I can post our solution to the problem once our development team sends it to me.

Regards
Alex

Hi again,

Thank you all very much; with your hints I was able to go in the right direction.

Next time, we will drink a beer together !!

Alex

Yes please post your answer.

I am sure that someone else may have this problem and need a starting point...

Hi all,

Today I received the answer from our development team about the data corruption problem.
The culprit is XFS: the XFS version we use has a bug. See the detailed answer below:

--- snip ---

We have now found the cause of the data corruption.
It is a bug in XFS itself.

The data corruption is triggered by a particular timing between the deletion of a block and the flush operation from memory.

To be prepared for the next I/O operation, XFS pre-allocates more data in memory
than is actually needed.
Sometimes unnecessary blocks are also removed from the pre-allocated data
in memory to use memory resources efficiently.
This happens when memory is fully utilized.

When this "pre-allocation" and "removal of unnecessary data"
occurs with certain timing, next data to be processed is written
in the area where just removed from the pre-allocation which shouldn't
happen. Since the data is written into the wrong place, the block
in the memory which should contain the data does not contain the
actual data which causes the data corruption when it is flushed to

The following conditions are needed for the bug to appear:

  • Memory is fully loaded
  • I/O is done via NFS
  • A large file (on the order of GB) is being written continuously and concurrently

--- snap ---
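For anyone who wants to check whether they hit the same conditions, here is a rough reproduction sketch (the paths are just placeholders, and the file server's memory has to be under heavy load, which is not shown here):

dd if=/dev/urandom of=/local/testfile bs=1M count=4096   # create a multi-GB reference file locally
md5sum /local/testfile
cp /local/testfile /mnt/nfs/testfile                     # write it over NFS to the XFS-backed export
md5sum /mnt/nfs/testfile                                 # a differing checksum reproduces the corruption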

Maybe someone has a similar problem, so thanks a lot to all and have a nice day.

Regards
Alex