Hello,
I was wondering if anyone had an idea why a recurring script might take slightly longer each day.
The intent of the script is to run on a dedicated VM and connect to several thousand remote servers to assay their operating status by running a series of commands and storing the results locally. This script runs once daily.
On the VM, the script is run in parallel (using GNU parallel) to increase the rate at which the target servers are checked. The VM runs as many concurrent connections as it can handle (resources maxed out), which equates to roughly 100 connections at any given time.
The first part of the script initiates an SSH connection and passes about 30 commands to the target server (basic commands such as top, cat ..., grep ..., etc). Results are stored in a local file.
The second part of the script initiates an SCP transfer and copies anywhere from zero to 30 gzip files back to the local VM, selected by their filename timestamps -- these files are logs of activity on the servers, so more active servers have larger and more numerous gzipped log files.
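For reference, the per-host job is shaped roughly like this (a sketch only; check_commands.sh, the results/ layout, and the remote log path /var/log/app are hypothetical stand-ins for my actual names):

```shell
#!/usr/bin/env bash
# Sketch of the per-host job: part 1 is the SSH command batch,
# part 2 is the SCP log pull. Paths and filenames are placeholders.

# Build the local result filename from host and date.
result_file() {
  printf 'results/%s_sshresult_%s.txt' "$1" "$2"
}

run_checks() {
  local host=$1 today
  today=$(date +%Y%m%d)

  # Part 1: one SSH session runs the ~30 commands and captures the output.
  ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" 'bash -s' \
      < check_commands.sh > "$(result_file "$host" "$today")"

  # Part 2: pull today's gzipped logs (zero to ~30 files) in one SCP call.
  scp -o BatchMode=yes "$host:/var/log/app/*${today}*.gz" "results/$host/"
}
```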
I've set a timeout of 140 seconds in GNU parallel: if parts 1 and 2 combined take longer than that, the job is killed and parallel moves on. The limit is deliberately generous, as the whole process shouldn't take longer than 100 seconds.
So far I've covered how the data is pulled. The data is stored in many .txt files on the local VM, each named with the IP address and date of creation -- this is necessary for later aggregation and analysis by our monitoring tool. I've also had to hash the files into directories based on their IP structure, i.e.:
IP 1.2.3.4 would be stored under 1.2/1.2.3/1.2.3.4_sshresult_date.txt
This was done because the monitoring/aggregation tool had a difficult time reading from a single folder with such a large number of files. Once our monitoring tool reads the .txt files it also deletes them to prepare for the next day's run.
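The hashing itself is just string manipulation on the IP; something like this sketch (the helper name is mine):

```shell
#!/usr/bin/env bash
# Map an IP to its hashed results path:
#   1.2.3.4 -> 1.2/1.2.3/1.2.3.4_sshresult_DATE.txt
ip_to_path() {
  local ip=$1 date=$2 a b c
  IFS=. read -r a b c _ <<< "$ip"   # split the IP into octets
  printf '%s.%s/%s.%s.%s/%s_sshresult_%s.txt\n' \
    "$a" "$b" "$a" "$b" "$c" "$ip" "$date"
}

# usage: ip_to_path 1.2.3.4 20240101
```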
This is a big issue for me: when this process was first put in place, it took 5 hours to complete. In the past 2 months that number has climbed to 8.5 hours, with no changes to the script. I've added some extra logging to the SSH and SCP components of the script, and each day I can see the average SSH time increase by 0.5 to 1 second. The same goes for the SCP execution.
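The extra timing log is nothing fancy; roughly this (sketch, with my own field layout and a hypothetical timing.log name):

```shell
#!/usr/bin/env bash
# Wrap a command and append its wall-clock duration to timing.log.
# usage: timed <label> <command> [args...]
timed() {
  local label=$1; shift
  local start end rc
  start=$(date +%s)
  "$@"; rc=$?
  end=$(date +%s)
  printf '%s %s %ds rc=%d\n' "$(date +%F)" "$label" \
    "$((end - start))" "$rc" >> timing.log
  return "$rc"
}

# e.g. timed ssh ssh -o BatchMode=yes "$host" 'bash -s' < check_commands.sh
```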
What I've tried:
1) Reinitializing the monitoring/aggregating tool (which also sits actively on the VM) in case it was causing a memory leak or file-locking issues.
2) Rebooting the server occasionally to clear memory in case there was a general memory leak from some source.
Some possible explanations that I've thought of or have been suggested to me:
1) inodes on the local VM may be out of whack due to the large number of files created and deleted each day. I've never had to deal with inode management, so I'm not sure if this is plausible, or how to deal with it if it is the issue.
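For what it's worth, inode exhaustion seems easy to rule in or out; this is the kind of check I mean (sketch; point it at the filesystem and directory that actually hold the results tree):

```shell
#!/usr/bin/env bash
# Inode usage on the filesystem holding the current directory.
# If IUse% is near 100%, file creation slows down and eventually fails.
df -i .

# Count directory entries without sorting. Huge directories can make
# per-file operations slower, and on some filesystems a directory that
# once held many entries stays large even after the files are deleted.
find . -maxdepth 1 | wc -l
```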
2) Another thought: perhaps connecting each day by SSH and SCP is having some effect on the remote servers that causes them to respond more slowly over time -- this would be the worst-case scenario. I don't see how initiating one SSH and one SCP session per day could affect the target server, but could this be possible if the target server is 'old/fragile'?
Sorry for the long post, I just wanted to make sure I gave enough detail. Please let me know if you have any questions and I'll do my best to answer.
Thanks for any help provided! I'm relatively new to large scale bash scripting and I want this thing to run efficiently.