ls returning differing results

Hi there.

I have several Raspberry Pis and I'm currently logged in (via ssh) to two of them: both run an identical OS, use the same user, and are both up to date. Both have the NAS mounted, but when I ls the same dir on the NAS, one lists all 1498 files while the other lists a random subset of about 30 files.

I had a problem yesterday where ls returned nothing but didn't error. I re-mounted the NAS and all was well again. The NAS doesn't have any power saving turned on.

All of the Pis are running code and exporting data to this dir, so it's kind of important that the mount is solid. My network is fine, it's all good kit and I have no problems with anything else. The only possible weak point is that the Pis are all WiFi-connected at the moment, but they're only 3 ft away from the router.

So, (1) why is the mount proving to be a bit flaky, (2) why would ls return random results, (3) if ls returns nothing, will data still be saved to the dir, and (4) what can I do to strengthen this weak point?

Many Thanks.

------ Post updated at 11:38 AM ------

As a quick follow up to this, what would be a safe way to refresh the problem mount?

Of course I could umount <share> and mount -a every hour or something, but how do I make sure my other processes aren't writing at that time? I can't stop the processes; I could potentially lose up to 5x12 hours of processing.

OK, there are several things to talk about here, some specifically addressing your problem, some of a more general nature. My hope is that the more general points will help you solve other problems better in the future, so bear with me:

First, your description of the problem would be better if you could show us some examples. In this case this means: a snippet of the file list on one system, with some of the files seen on the other system along with some of the files NOT seen there - we might find some differences between the two groups and deduce the problem from there. Instead of telling us "same OS" you might tell us exactly which version(s) you are using, because it might be that version 1.2.3.4 is known to have specific problems which were addressed in version 2.3.4.5. Right now we are just left to speculate.

One such speculation is: in NFS there is a difference between user@systemA and user@systemB. Even if they both have the same name they are still not regarded as the same account. Therefore it might be a privilege problem where user@systemA is allowed to access all files and user@systemB is only allowed access to a subset. Notice that the privilege comes from the NFS server - your NAS box - not the client (your pi-systems).

Questions: which NFS version do you use? NFSv3 works differently than NFSv4.
If it is NFSv4, have you set the NFS domain on all your systems?
Do the user accounts have the same UID too, or just the same name? (See the sketch below for how to check all three.)
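
A minimal sketch of how to check these from a shell on each Pi; the username pi and the idmapd.conf location are assumptions based on a stock Raspberry Pi OS install:

    # Which NFS version did each mount actually negotiate?
    nfsstat -m                  # or: mount | grep nfs

    # NFSv4 only: the ID-mapping domain must match on all systems
    grep -i domain /etc/idmapd.conf

    # Numeric UID/GID of the writing account ('pi' is assumed)
    id pi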

What exactly do you mean by "all was well"? Did all files show up on both systems? Did/does this problem occur only after the system has been running for some time? If yes, is it always after the same amount of time (which might point to some kind of timeout)?

The "weak point" you have identified is not weak per se in my view. If you connect the systems temporarily with a cable rather than WiFi, does the problem go away? You could test this. If it persists then the WiFi is not the culprit, otherwise we could further analyse the problem.

That depends on what the cause of your problem is. If it is - just making this up as an example - a crashing nfsd on the client side, then refreshing (restarting) it will solve the problem. On the other hand it might be a better solution to use another (version of the) nfsd instead of restarting the daemon. It is generally better to replace a failing part than to use a band-aid to make it "fail better".
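
To make that concrete, here is a hedged illustration of what a "refresh" could look like; the mount point /mnt/nas is invented, substitute your own:

    # See which processes still have files open on the mount
    fuser -vm /mnt/nas          # or: lsof +f -- /mnt/nas

    # If nothing important is writing, cycle the mount
    sudo umount /mnt/nas && sudo mount -a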

How would you lose 5x12 hours? If you disconnect one system you will only prevent this one system from writing, not the others, so IMHO you would lose only 1x12h at most. The answer to your question of how to prevent processes from writing is: in principle you can't - at least not if the process doesn't offer provisions for exactly this. If you have some amount of local storage large enough to buffer the time it takes to remount, you could create a FIFO (a named pipe), have the original process write to that, and use a second process to do the writing to the NAS, which you can stop and restart accordingly. This way you use OS resources to buffer your output. Notice that this is also a workaround, not a solution.
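
A minimal sketch of that FIFO idea, with all paths invented for illustration:

    # Create the named pipe on local storage
    mkfifo /home/pi/nas_buffer

    # The producing process writes to the pipe instead of the NAS:
    #     your_program > /home/pi/nas_buffer

    # A separate relay copies from the pipe onto the NAS mount; this is
    # the process you stop and restart around a remount:
    cat /home/pi/nas_buffer >> /mnt/nas/output.dat

Be aware that the kernel pipe buffer is small (typically 64 KiB), so the producer will block once it fills; if a remount takes longer than that allows, the relay would have to spool to a local file instead.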

Basically it is a healthy strategy to never use workarounds: a solved problem is not a problem whose symptoms have gone away, but a problem whose symptoms have gone away AND where we have understood the reason why they went away. This requires understanding why the problem occurred in the first place.

E.g. if you need a light to be on but it's off, this could have several reasons: lamp burned out, switch is off, loss of electrical current, etc. If it comes back on eventually you can rule out "lamp burned out", but all the other possible reasons could still apply. And as long as you don't know which one it was, it can come back at any time to haunt you. If you just close the case with "it's working again, so what", then chances are the light goes out at the least convenient moment possible. Since you still wouldn't know what caused it to go out back then, you still wouldn't know what to do to correct the problem. Hence, better to investigate immediately, because whenever the problem reoccurs there is a big chance that it is under even less convenient circumstances than right now. If, on the other hand, you found out that the current was first lost because of a burnt fuse and somebody replaced it, then you might immediately check whether the fuse has burnt out again and arrive at a solution much faster than if you had to start by looking for where the fuse is located in the first place.

I hope this helps.

bakunin

Do you mount the NAS via /etc/fstab?
Then use the mount options sync,hard,bg.
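
For example, a hypothetical /etc/fstab entry (server name, export path and mount point are invented):

    # sync = flush writes immediately, hard = retry I/O forever instead
    # of returning errors to your processes, bg = retry a failed mount
    # in the background at boot
    nas:/export/data  /mnt/nas  nfs  sync,hard,bg  0  0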

In case you see the random number of files again, please check with
ls -li and df .
and compare the output with the working situation. If the mount has silently dropped, df . will report the Pi's local filesystem instead of the NAS export, and ls will show whatever happens to be in the underlying mount-point directory.
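
A quick way to run that check, with the mount point /mnt/nas assumed for illustration:

    cd /mnt/nas
    df .            # the Filesystem column should name the NAS export,
                    # not the local SD card (e.g. /dev/mmcblk0p2)
    ls -li | head   # the same file should show the same inode number
                    # on both Pis, since NFS exposes the server's inodes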