Daily script taking increasingly longer each day

Hello,

I was wondering if anyone had an idea why a recurring script might take slightly longer each day.

The intent of the script is to run on a dedicated VM and connect to several thousand remote servers to assay their operating status by running a series of commands and storing the results locally. This script runs once daily.

On the VM, the script is run in parallel (using GNU parallel) to increase the rate at which the target servers are checked. The VM runs as many parallel connections as it can handle (resources maxed out), which equates to roughly 100 connections at any given time.
The first part of the script initiates an SSH connection and passes about 30 commands to the target server (basic commands such as top, cat ..., grep ..., etc). Results are stored in a local file.
The second part of the script initiates an SCP request and copies anywhere from zero to 30 gzip files to the local server, selected by their filename timestamps -- these files are logs of activity on the target servers, so more active servers have larger and more numerous gzipped log files.

I've set a timeout of 140 seconds in GNU parallel: if steps 1 and 2 combined take longer than 140 seconds, the job is ended and the run moves on. This number is generous, as the process shouldn't take longer than 100 seconds.
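To make the shape of the run concrete, here is a rough sketch of the sort of wrapper involved -- the user name, command list, paths, and host list file are placeholders, not the actual script:

# check_host takes one IP, runs the commands over a single SSH session,
# then pulls the matching gzipped logs with scp.
check_host() {
    ip="$1"
    out="/data/results/${ip}_sshresult_$(date +%Y%m%d).txt"
    ssh -o BatchMode=yes "monitor@${ip}" 'uptime; cat /proc/loadavg; df -h' > "$out"
    mkdir -p "/data/logs/${ip}"
    scp -q "monitor@${ip}:/var/log/app/*.gz" "/data/logs/${ip}/"
}
export -f check_host

# ~100 jobs at a time, kill any job that exceeds 140 seconds.
parallel -j 100 --timeout 140 check_host :::: hostlist.txt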

So far I've covered how the data is pulled. The data is stored in many .txt files on the local VM, each named with the IP address and date of creation -- this is necessary for later aggregation and analysis by our monitoring tool. I've also had to hash the files into directories based on their IP structure, e.g.:
IP 1.2.3.4 would end up in 1.2/1.2.3/1.2.3.4_sshresult_date.txt
This was done because the monitoring/aggregation tool had a difficult time reading from a single folder with such a large number of files. Once our monitoring tool reads the .txt files it also deletes them to prepare for the next day's run.
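For illustration, the hashing can be done along these lines (the base path is a placeholder, not the actual script):

# Build the hashed output path 1.2/1.2.3/1.2.3.4_sshresult_<date>.txt from an IP.
ip="1.2.3.4"
prefix2=$(echo "$ip" | cut -d. -f1-2)      # 1.2
prefix3=$(echo "$ip" | cut -d. -f1-3)      # 1.2.3
outdir="/data/results/${prefix2}/${prefix3}"
mkdir -p "$outdir"
outfile="${outdir}/${ip}_sshresult_$(date +%Y%m%d).txt"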

This is a big issue for me: when this process was first put in place, it took 5 hours to complete. In the past 2 months that number has climbed to 8.5 hours, with no changes to the script. I've added some extra logging to the SSH and SCP components of the script, and each day I can see the average SSH time increase by 0.5 to 1 second. The same goes for the SCP execution.

What I've tried:
1) Reinitializing the monitoring/aggregating tool (which also runs actively on the VM) in case it was causing a memory leak or file-locking issues.
2) Rebooting the server occasionally to clear memory, in case there was a general memory leak from any source.

Some possible explanations that I've thought of or have been suggested to me:
1) The inodes on the local VM may be out of whack due to the large number of files created and deleted each day. I've never had to deal with inode management, so I'm not sure whether this is plausible, or how to deal with it if it is the issue (a couple of quick checks are sketched after this list).
2) Perhaps the act of connecting by SSH and SCP each day is having some cumulative effect on the remote server that causes it to respond more slowly each day -- this would be the worst-case scenario. I don't see how initiating one SSH and one SCP session per day could affect the target server, but could this be possible if the target server is 'old/fragile'?
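On the inode question in 1), two quick checks that don't require any real inode 'management' (the result path is a placeholder):

# Free inode count on the filesystem holding the result files
# (IUse% near 100% would mean inode exhaustion):
df -i /data/results

# Size of the directory files themselves -- on most filesystems a directory
# that has held huge numbers of entries stays large even after deletes:
ls -ld /data/results /data/results/1.2 /data/results/1.2/1.2.3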

Sorry for the long post, I just wanted to make sure I gave enough detail. Please let me know if you have any questions and I'll do my best to answer.

Thanks for any help provided! I'm relatively new to large scale bash scripting and I want this thing to run efficiently.

Two comments

In performance tuning, 'getting slower' means you check I/O first. A priori it sounds like a disk efficiency problem. That steady increment suggests I/O is the most likely culprit: ever-larger files, more files, or poor directory lookup performance.

Example: huge numbers of files in a single directory degrade lookup performance. This is disk-hardware and filesystem dependent. We had a poster here years back who could not understand why it took ls 90 seconds to locate a file in a directory holding a million emails. The workaround was (and is) to create a multi-branched directory tree with far fewer entries per directory. The find command has similar problems. The tell for this is really large directory file sizes; check with ls -ld somedirectory. PS: directory files in most filesystems are not self-reorganizing -- they don't shrink -- so the smoking gun usually does not go away.
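A quick way to sweep a whole result tree for such bloated directory files (the path is a placeholder):

# List directory files larger than 1 MB -- candidates for the
# lookup-performance problem described above.
find /data/results -type d -size +1M -exec ls -ld {} \;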

Because this runs on lots of servers, you should use the very data you are collecting to see whether the problem is localized to a few servers or spread generally across all of them. It's not clear to me that you have done this.

Parallel also obscures granularity for process observation. One process that does something outrageous may not show itself right away. You may have to resort to running the time command on each process and looking for outliers. But first you must find at least one poster-child server that clearly runs slower today than it did a while back.
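For example, one way to get that granularity back (script name, host, and paths are placeholders): time a single host's check by hand, and let parallel's --joblog record per-job runtimes so outliers can be sorted out afterwards.

# Time one host's check outside of parallel, today vs. your notes from a month ago.
time ./check_host.sh 1.2.3.4

# Have GNU parallel log start time and runtime for every job,
# then sort by the JobRuntime column (longest first).
parallel -j 100 --timeout 140 --joblog /tmp/run_$(date +%Y%m%d).log \
    ./check_host.sh :::: hostlist.txt
sort -t$'\t' -k4,4nr /tmp/run_*.log | head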

Sounds like your next few Sundays (or whenever you can monopolize some servers) are spoken for.

One more thing. In thinking about the system design: why not put the onus of processing on each remote?

Each remote has a crontab script that runs the checks at 2:00 AM or whenever. It then sends a small scp-ed flag file to your local system saying 'come get your files, today's run took 03:14:10' or 'I have a problem' -- whatever you need to see.

Your local code just checks once a minute to see who has sent files. At the end of the run, it verifies that the required number of flag files exists, or that all the required servers have reported.
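A rough sketch of that division of labour, with all host names, paths, and script names made up purely for illustration:

# On each remote, a crontab entry such as:
#   0 2 * * * /usr/local/bin/collect_status.sh
# collect_status.sh gathers its own data locally, then announces completion:
scp /tmp/status_$(hostname).txt monitor-vm:/incoming/

# On the local VM, once a minute, see who has reported and who is missing:
ls /incoming/status_*.txt | wc -l
comm -23 <(sort expected_hosts.txt) \
         <(ls /incoming/ | sed -e 's/^status_//' -e 's/\.txt$//' | sort)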

Do not forget to include the monitoring box (Nagios or whatever) as part of the problem set. It could have problems, too.

Why would you do this? It is akin to instrumenting a code base deployed all over the place, which is what you need in order to start finding problems. We use a database to keep this stuff -- Oracle, in our case. SQL is an extremely efficient and powerful tool for scanning datasets for almost anything. Oracle dumps a daily control file for our local monitoring script, because we have a lot of 'if today is Tuesday and I like bacon then do this' kinds of ill-conceived business rules about monitoring.

Do you have shared resources like NFS?
Hundreds of parallel df commands can put load on the NFS server and increase execution time.

Run an ssh job manually with "time", stop your monitoring engine and run the ssh job again. Compare the execution times.
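For example (user, host, and commands are placeholders):

# With the monitoring engine running:
time ssh -o BatchMode=yes monitor@1.2.3.4 'uptime; cat /proc/loadavg' > /dev/null

# Stop the monitoring engine, run it again, and compare the 'real' figures:
time ssh -o BatchMode=yes monitor@1.2.3.4 'uptime; cat /proc/loadavg' > /dev/null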

I looked at trends for specific servers over time and there is no incremental pattern when looking at a single box -- it might be 20 seconds faster on the second day, 40 seconds slower on the third, etc. This variability is probably due to two factors: the general activity on the target box (if it's working hard it may respond more slowly), and the fact that the local process could be resource-starved because parallel is maxing out connections.

I would love to, but I'm restricted from writing anything to any of the remote servers. Another factor is that these servers constantly flip between up and down due to maintenance, so a crontab script would not run reliably.

The idea was to get ahead of potential problems or catch things that other monitoring tools weren't looking at. It is not the solution I wanted, but it's what someone higher up signed off on.

Yes, actually. The SCP segment of the code downloads to an NFS partition... I will investigate this.

Thank you for your suggestions! I'll let you know how things go when I dig into this.

So I adjusted the script so that it does not use the NFS storage partition at all, and I'm seeing roughly equivalent execution speed (no improvement). That leads me to believe NFS I/O was not the limiting factor here...

My known_hosts file is surprisingly large. The way I wrote the script, it would have plateaued over time (don't ask), so my question is: could a very large known_hosts file slow down SSH connection time? I would guess it could take longer to validate a host key when initiating an SSH connection...

What is "surprisingly large"? I don't think a few thousand entries, although needing their time to be checked, would noticeably increase the time to login. And, once you're in, the entries are not needed nor checked any more.

Maybe it was a newbie mistake, but I've now added "-o CheckHostIP=no" and "-o BatchMode=yes", and the script appears to be running about 30% faster than on previous days.

This is a great improvement, but if anyone can suggest anything else in the same vein that could be causing connection delays, I would appreciate it.
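For reference, the options in question in context (user, host, and paths are placeholders); ControlMaster/ControlPersist multiplexing would be one more knob in the same vein, since it lets the scp step reuse the ssh step's already-open connection:

# GSSAPIAuthentication=no      - skip Kerberos negotiation attempts
# ConnectTimeout=10            - fail fast on unreachable hosts
# ControlMaster/ControlPersist - keep one multiplexed connection per host
ssh -o CheckHostIP=no -o BatchMode=yes \
    -o GSSAPIAuthentication=no -o ConnectTimeout=10 \
    -o ControlMaster=auto -o ControlPath=/tmp/ssh-cm-%r@%h:%p -o ControlPersist=60 \
    monitor@1.2.3.4 'uptime'

# The scp step can then reuse the same master connection:
scp -o ControlPath=/tmp/ssh-cm-%r@%h:%p 'monitor@1.2.3.4:/var/log/app/*.gz' /data/logs/1.2.3.4/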

The known_hosts file was 350 MB. I understated the numbers earlier to simplify the question.