Script to Proceed to the Next IP if the current IP hangs

Hi there,

Say I have a list of IPs and I am running scripts on them. If the process hangs on one IP,
I want to continue with the rest of the IPs.

10.11.1.1
10.11.1.2
10.11.1.3
10.11.1.4
10.11.1.5
10.11.1.6 <-- Process Hangs here
10.11.1.7
10.11.1.8
10.11.1.9
10.11.1.10
10.11.1.11
10.11.1.12

-----------------------

10.11.1.7 <--- start a new process from this point onwards.
10.11.1.8
10.11.1.9
10.11.1.10
10.11.1.11
10.11.1.12

How can this be achieved?

What do you mean by "hangs"? Do you mean the timeout that occurs when a server is unreachable or down?

Are you referring to a shell script loop from within which ssh commands are used to perform remote tasks? If so you could try something like:

ssh -o ConnectTimeout=2

From man ssh_config:

     ConnectTimeout
             Specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout.  This value is used only when the target is down or
             really unreachable, not when it refuses the connection.

Note that, as it says, this timeout does not apply to hosts that refuse the connection.

Yes, when it times out, it will move on to the next item in the loop.
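To make the ConnectTimeout suggestion concrete, here is a minimal sketch of such a loop. The function name run_on_hosts, the file name ip-list.txt and the remote command are only placeholders, and it assumes key-based logins (BatchMode=yes stops ssh from waiting at a password prompt). Note that ConnectTimeout only limits the connection phase, not a remote command that hangs after the connection is up.

```shell
#!/bin/bash

# Hypothetical helper: run one remote command on every host in a list file.
# A host that is down or fails only costs the connect timeout, then the loop moves on.
run_on_hosts() {
    local list="$1"; shift
    local ip
    while IFS= read -r ip; do
        [ -z "$ip" ] && continue          # skip blank lines
        # -n keeps ssh from swallowing the rest of the host list on stdin
        if ssh -n -o BatchMode=yes -o ConnectTimeout=2 "$ip" "$@"; then
            echo "$ip: ok"
        else
            echo "$ip: failed or timed out, moving on"
        fi
    done < "$list"
}

# Example call (file name and command are placeholders):
#   run_on_hosts ip-list.txt uptime
```

The -n matters inside a while read loop: without it, ssh reads from the same stdin as the loop and the remaining IPs silently disappear.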

Hi,

Without knowing any actual details of what your script does or contains, it's hard to give you a definitively correct answer here. But in terms of a general principle, you could process each IP in a loop, run your script for each IP in the background, then wait a number of seconds before proceeding to the next one in the loop. That way at least you would be able to continue with each IP in the list.

So for example, something like this:

while IFS= read -r ip
do
        ./script.sh "$ip" &     # run in the background so a hang cannot block the loop
        sleep 300
done < ip-list.txt

Now I'm making a great many assumptions here, since you haven't given us any actual code of your own or any details about what precisely you're trying to do to each IP. But the above code fragment would iterate through every IP address in the file ip-list.txt and run the external script ./script.sh on it in the background. It would then pause for five minutes (300 seconds), and proceed regardless of the outcome with the next one in the list.

There are many potential problems with this approach, but this is about as generic a solution as I can suggest without anything detailed to actually go on. Hope this helps.
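Two of those problems are that the loop always waits the full five minutes even when the script finishes early, and that a hung script is left running. A rough sketch of a variation that addresses both is below: it polls the background job once a second and kills it when the limit is reached. The name run_limited and the return code 124 are my own choices for illustration (124 just mirrors what GNU timeout returns).

```shell
#!/bin/bash

# Hypothetical helper: run "$@" in the background for at most $1 seconds.
# Returns the command's own exit status, or 124 if it had to be killed.
run_limited() {
    local limit="$1"; shift
    local pid waited=0
    "$@" &
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do      # still running?
        if [ "$waited" -ge "$limit" ]; then
            kill "$pid" 2>/dev/null           # give up on it
            wait "$pid" 2>/dev/null           # reap it so no zombie is left
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"                               # collect the real exit status
}

# Example use with the loop above (script and file names are placeholders):
#   while IFS= read -r ip; do
#       run_limited 300 ./script.sh "$ip" || echo "$ip: failed or killed"
#   done < ip-list.txt
```

A plain kill sends SIGTERM; if the script ignores that, a follow-up kill -9 would be needed.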

Hi all,

I am actually looking at ways to restart a process if it hangs.
The line I highlighted in red (the unicornscan command) sometimes works, and then the script continues to the next line.
If it doesn't work, I would like a way to restart that line first before proceeding to the next line.
Hope you can advise.

while read ip; do
    echo -e "${BLUE}[+]${RESET} Scanning $ip for $proto ports..."

    # unicornscan identifies all open TCP ports
    if [[ $proto == "tcp" || $proto == "all" ]]; then 
        echo -e "${BLUE}[+]${RESET} Obtaining all open TCP ports using unicornscan..."
        echo -e "${BLUE}[+]${RESET} unicornscan -i ${iface} -r20000 -mT ${ip}:a -l ${log_dir}/udir/${ip}-tcp.txt"
        unicornscan -i ${iface} -mT ${ip}:a  -r20000 -l ${log_dir}/udir/${ip}-tcp.txt
        ports=$(cat "${log_dir}/udir/${ip}-tcp.txt" | grep open | cut -d"[" -f2 | cut -d"]" -f1 | sed 's/ //g' | tr '\n' ',')
        if [[ ! -z $ports ]]; then 
            # nmap follows up
            echo -e "${GREEN}[*]${RESET} TCP ports for nmap to scan: $ports"
            echo -e "${BLUE}[+]${RESET} nmap -e ${iface} ${nmap_opt} -oA ${log_dir}/ndir/${ip}-tcp -p ${ports} ${ip}"
            nmap -e ${iface} ${nmap_opt} -oA ${log_dir}/ndir/${ip}-tcp -p ${ports} ${ip}
        else
            echo -e "${RED}[!]${RESET} No TCP ports found"
        fi
    fi

Hi.

Rather than waiting for something to finish before starting the next task, I find that pdsh performing remote tasks in parallel is most useful for our situation. There is a timeout option (along with many other options).

Some details for pdsh (which calls pdsh.bin ):

pdsh    issue commands to groups of hosts in parallel (man)
Path    : /usr/bin/pdsh
Version : -2.31 (+debug)
Length  : 15 lines
Type    : Bourne-Again shell script, ASCII text executable
Shebang : #! /bin/bash
Repo    : Debian 8.9 (jessie) 
Home    : https://computing.llnl.gov/linux/pdsh.html (pm)

See man pdsh, and note that there are several alternative tools that turn up when searching for alternatives to pdsh -- for example see:

remote - What is a good modern parallel SSH tool? - Server Fault

remote access - Linux - Running The Same Command on Many Machines at Once - Server Fault

Best wishes ... cheers, drl

The problem is: we do not really know what "doesn't work" means. If it is that the quoted line just hangs and doesn't finish: start it in the background and have a wait command at the end collecting all the hanging processes. There are a lot of threads here dealing with exactly this problem.
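A small sketch of that background-plus-wait pattern follows. The function scan_one and the three addresses are stand-ins invented for the demo (it simply pretends that 10.11.1.6 fails); in the real script the unicornscan call would go there.

```shell
#!/bin/bash

# Stand-in for the real per-IP work; pretends that 10.11.1.6 fails.
scan_one() {
    sleep 1
    [ "$1" != "10.11.1.6" ]
}

ips="10.11.1.5 10.11.1.6 10.11.1.7"

# Start one background job per IP and remember which PID belongs to which IP.
jobs_started=""
for ip in $ips; do
    scan_one "$ip" &
    jobs_started="$jobs_started $!:$ip"
done

# Collect every job at the end and note the ones that failed.
failed=""
for entry in $jobs_started; do
    pid=${entry%%:*}
    ip=${entry#*:}
    if ! wait "$pid"; then
        failed="$failed $ip"
    fi
done
failed=${failed# }

echo "failed: $failed"
```

Keep in mind that wait blocks until the job ends, so a job that truly never finishes still needs some kind of time limit on top of this.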

If you mean by "doesn't work" that the process just comes back unsuccessfully: usually a process has a return code. You can query this return code and re-run the process if it is not zero (0 usually means it was successful and everything else some sort of failure).

Replace the quoted line with something like this:

MAXRETRIES=<some number>       # define this at the beginning globally

...

(( iCnt = MAXRETRIES ))
while ! unicornscan -i ${iface} -mT ${ip}:a  -r20000 -l ${log_dir}/udir/${ip}-tcp.txt && [ $iCnt -gt 0 ] ; do
     (( iCnt -= 1 ))
done

This will run the command and, if it fails, retry it up to MAXRETRIES times, stopping as soon as it succeeds or the retries run out.

I hope this helps.

bakunin

Hi,

I was actually thinking of 3 conditions:
1. Move on to the next line if it doesn't work or hangs
2. Restart until it works, then move to the next line
3. Wait for a number of seconds; if it hasn't moved to the next line by then, restart the current line (just in case, for whatever reason, there is no exit code)

Unfortunately in my case it still doesn't work after applying the code above. It still hangs.

Send exiting main didnt connect, exiting: system error Interrupted system call
Recv exiting main didnt connect, exiting: system error Interrupted system call

That doesn't sound like "hanging" (no more activity or reaction) but more like exiting with an error. It would be very surprising if NO error or exit code were given indicating what went wrong.

Definitely more info is necessary here.

Same for your three conditions. What in bakunin's proposal doesn't solve your problem? Please be way more informative!
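For what it's worth, the three conditions listed above can be combined if GNU coreutils timeout is available (an assumption on my part; it is not mentioned elsewhere in this thread). timeout kills the command after the limit and returns 124, so a hang turns into an ordinary failure that a retry loop can handle. A minimal sketch, where run_with_retry, MAXRETRIES=2 and LIMIT=1 are illustrative names and values only:

```shell
#!/bin/bash

MAXRETRIES=2    # assumed value: attempts per command
LIMIT=1         # assumed value: seconds before an attempt counts as hung

# Hypothetical helper: run "$@" under a time limit, restarting it on failure
# or timeout, and give up after MAXRETRIES attempts so the caller can move on.
# "$@" must be an external command, because timeout cannot run shell functions.
run_with_retry() {
    local attempt=1 rc
    while [ "$attempt" -le "$MAXRETRIES" ]; do
        timeout "$LIMIT" "$@"
        rc=$?
        [ "$rc" -eq 0 ] && return 0
        if [ "$rc" -eq 124 ]; then
            echo "attempt $attempt: timed out after ${LIMIT}s" >&2
        else
            echo "attempt $attempt: failed with exit code $rc" >&2
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example use inside the scan loop (the limit would need to be much larger there):
#   run_with_retry unicornscan -i "${iface}" -mT "${ip}:a" -r20000 \
#       -l "${log_dir}/udir/${ip}-tcp.txt" < /dev/null || echo "giving up on $ip"
```

The < /dev/null in the example is there because the scan runs inside a while read loop, and a command that reads stdin would otherwise eat the remaining IPs.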

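This is one way to trap a command that waits too long.

  1. set up a child process that signals the parent after a pre-set time.
  2. run the command
  3. clean up child

#!/bin/bash

# sleep for a while, then signal the main script
# SIGUSR1 is a signal that the system does not care about at all, you use it locally
# use the name rather than the number: it is 30 on some systems and 10 on Linux

naptime() {
    sleep 10         # take a nap
    kill -USR1 $$    # wake the main script ($$ is still the script's PID inside a background function, $PPID is not)
}

run_ssh()
{
    timed_out=0
    trap 'timed_out=1' USR1
    naptime &
    naptime_pid=$!
    # bash runs a trap only after the current foreground command returns,
    # so ssh is started in the background and waited for: wait is interrupted at once
    ssh myuser@somewhere.com 'ls myfile.txt' &   # you better be sure this command will complete on success in less than 10 seconds
    ssh_pid=$!
    wait $ssh_pid
    rc=$?
    if [ $timed_out -eq 1 ]; then
        echo "ssh took too long"
        kill $ssh_pid 2>/dev/null   # stop the stuck ssh
        return 1                    # return an error
    fi
    kill $naptime_pid   # clean up the child
    return $rc          # pass on the exit status of ssh
}

# -------- main
run_ssh   # will run for 10 seconds max
ssh_rc=$?
[ $ssh_rc -eq 0 ] && echo "things went fine"  || echo "oops, ssh failed or timed out"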

Also, I would suggest using ping first and then calling ssh only if the remote box turns out to be reachable. ping accepts a timeout (deadline) option.
Example:

ping -w [timeout in seconds]  -q remotenode
[ $? -eq 0 ]  &&  ssh me@somewhere command ||  echo 'failed to connect'

Actually I am ok with bakunin's proposal.
It's just that it still doesn't work. The line just freezes there and does not proceed to the next iteration. I have been waiting for it to finish executing.

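In this case, may I gently remind you of what I wrote above:

"If it is that the quoted line just hangs and doesn't finish: start it in the background and have a wait command at the end collecting all the hanging processes."

Save for the wrong usage of the progressive form, which should read "collect" instead of "collecting", I still stand by that.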

I hope this helps.

bakunin