File copy problem?

Hi All

I can't get my head around a problem I have with a control file.

The file controls a "listener" of sorts that listens on a named pipe. A script kicks off the listener in the background and passes it a control file in which the Status field is set to Pending. The script then waits for the listener to get up and running and change its status to "Listening" before launching another job that uses the named pipe to report its status.
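
For readers following along, the wait step described above might look roughly like this sketch. The function name wait_for_status, the one-second poll interval and the timeout argument are assumptions for illustration and are not taken from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a control file until its Status field has the wanted value.
# Usage: wait_for_status CTL_FILE WANTED TIMEOUT_SECONDS
wait_for_status() {
    local ctl_file=$1 wanted=$2 timeout=$3 waited=0 status
    while (( waited < timeout )); do
        # Read the value after "Status:"; empty if the file or the field is missing.
        status=$(sed -n 's/^Status://p' "$ctl_file" 2>/dev/null)
        [[ $status == "$wanted" ]] && return 0
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 1    # timed out
}
```

It would be called as something like wait_for_status "${MSG_CTL_FILE}" Listening 30, with the launcher treating a non-zero return as a timeout.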

The control file: -

Name:m_fatal_error
Pid:20755
JobFile:/home/sshtest/muse/run/demo/120929165000/m_control/4-m_job_data_10519
LogFile:/home/sshtest/muse/run/demo/120929165000/m_control/m_run_log
TmpFile:/home/sshtest/muse/run/demo/120929165000/m_tmp/1_u_listener_4498.tmp
JobHost:tx5xn
Start:12/09/29-16:50:01
Finish:
Pipe:4-m_msg_data_10519
Status:Pending
Constants:/home/brad/wip/muse_root/lib/muse.constants
sshtest@ubuntu-dt64:~$ cat /home/sshtest/muse/run/demo/120929165000/m_tmp/1_u_listener_4498.tmp

The listener uses a function (that works 90% of the time) to change the status to Listening -

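# Call made by the listener:
m_write_ctl_file_field ${C_MSG_JOB_STATUS} "Listening" "${MSG_CTL_FILE}"

# $1 = line number of the field, $2 = new value, $3 = control file
m_write_ctl_file_field()
{
        # Rebuild the target line as "<field name>:<new value>"
        TMP="$(sed -n "${1}p" "${3}" | cut -d":" -f1):${2}"
        for s in "${PIPESTATUS[@]}"; do [[ $s -eq 0 ]] || error_exit "Error:Pipe failed 2 (${SCRIPT})" ; done
        # Write the file with that line replaced to the tmp file, then copy it back
        awk -v ln="${1}" -v lo="${TMP}" '(NR == ln){print lo;next}{print $0}' "${3}" > "${U_MSG_TMPFILE}"
        cp "${U_MSG_TMPFILE}" "${3}" || error_exit "Error: cp failed in m_write_ctl_file_field"
}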

C_MSG_JOB_STATUS is the line number for the status field
MSG_CTL_FILE is the control file
and "Listening" is my change of status

When I time out waiting for the listener to change its status to Listening and then look in the tmp file this function uses, I see -

Name:m_fatal_error
Pid:
JobFile:/home/sshtest/muse/run/demo/120929165000/m_control/4-m_job_data_10519
LogFile:/home/sshtest/muse/run/demo/120929165000/m_control/m_run_log
TmpFile:/home/sshtest/muse/run/demo/120929165000/m_tmp/1_u_listener_4498.tmp
JobHost:tx5xn
Start:12/09/29-16:50:01
Finish:
Pipe:4-m_msg_data_10519
Status:Listening
Constants:/home/brad/wip/muse_root/lib/muse.constants

So the function has written it to the tmp file. The check on the exit status of the cp command doesn't seem to fail, or I would see an error log in /tmp. So I can't understand why the tmp file and the control file aren't the same.

Like I said, most of the time this code works.

I'm wondering if this is some sort of buffering problem and whether I need to adopt a safer method that explicitly closes the file or something?

Any help appreciated :slight_smile:

Steady
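
On the question of a safer method: on a local filesystem, data written by cp is visible to other processes once cp returns, so buffering is unlikely to be the explanation. What cp does do is truncate the destination and then refill it, so a process polling the control file can briefly see an empty or partial file. One common alternative is to build the new contents in a uniquely named temporary file in the same directory and then rename it over the original. The sketch below assumes that approach; write_ctl_field_atomic is a hypothetical name, not part of the original scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the original m_write_ctl_file_field.
# Usage: write_ctl_field_atomic LINE_NO VALUE CTL_FILE
write_ctl_field_atomic() {
    local line_no=$1 value=$2 ctl_file=$3 tmp
    # Unique temp file next to the control file, so mv is a rename on one filesystem.
    tmp=$(mktemp "${ctl_file}.XXXXXX") || return 1
    # On the target line keep the field name (text before the first colon)
    # and replace the value; copy every other line unchanged.
    if awk -v ln="$line_no" -v val="$value" '
            NR == ln {
                i = index($0, ":")
                name = (i ? substr($0, 1, i - 1) : $0)
                $0 = name ":" val
            }
            { print }' "$ctl_file" > "$tmp"
    then
        # Readers see either the old file or the new one, never a partial file.
        mv -f "$tmp" "$ctl_file"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Two caveats. mktemp creates the file readable and writable by its owner only, so the control file ends up with those permissions after the rename. And a rename does not stop two processes that each read, modify and write back the file from overwriting each other's change; that would need a lock around the whole update, for example with flock(1) where it is available. The sketch also tests awk's exit status directly rather than through PIPESTATUS, which after an assignment from $( ) reports only the substitution as a whole.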

Have you checked the timestamps on ${MSG_CTL_FILE} and ${U_MSG_TMPFILE}?
Are you sure that the cp in m_write_ctl_file_field() didn't succeed and then something else updated ${MSG_CTL_FILE} changing the status back to Pending?
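
A quick way to run that check is sketched below; compare_ctl_and_tmp is a hypothetical helper name and the messages are only suggestions. It may also be worth noting that in the two listings above Pid is 20755 in the control file but empty in the tmp file, which would be consistent with a second writer filling in Pid from a copy it read before the status changed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show both files' details and report how they relate.
# Usage: compare_ctl_and_tmp CTL_FILE TMP_FILE
compare_ctl_and_tmp() {
    local ctl_file=$1 tmp_file=$2
    ls -l "$ctl_file" "$tmp_file"
    if cmp -s "$ctl_file" "$tmp_file"; then
        echo "identical"
    elif [[ $ctl_file -nt $tmp_file ]]; then
        # Contents differ and the control file was modified later than the tmp file.
        echo "control file is newer: something wrote it after the tmp file was built"
    else
        # Contents differ and the control file has not been touched since.
        echo "control file is not newer: the copy did not run or did not take effect"
    fi
}
```

Running it as compare_ctl_and_tmp "${MSG_CTL_FILE}" "${U_MSG_TMPFILE}" right after a timeout should show which of the two cases applies.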

Hi Don

Thanks for getting back to me.

I'm sure nothing else is writing it back again, as each file has a unique name with a RND number appended. On top of that, I don't have any code writing Pending to it at the moment.

I have just run this in a loop and then run lsof, and I can see hundreds of open files in use by listeners that should be closed:

u_listene 29054    sshtest  mem       REG    8,1   134344  264820 /lib/i386-linux-gnu/ld-2.15.so
u_listene 29054    sshtest  mem       REG    8,1    13940  266883 /lib/i386-linux-gnu/libdl-2.15.so
u_listene 29054    sshtest  mem       REG    8,1  2932080  391615 /usr/lib/locale/locale-archive
u_listene 29054    sshtest  mem       REG    8,1    26256   16482 /usr/lib/i386-linux-gnu/gconv/gconv-modules.cache
u_listene 29054    sshtest    0r      CHR    1,3      0t0    4748 /dev/null
u_listene 29054    sshtest    1u      CHR  136,2      0t0       5 /dev/pts/2
u_listene 29054    sshtest    2u      CHR  136,2      0t0       5 /dev/pts/2
u_listene 29054    sshtest  255r      REG   8,17     3971 1705114 /home/brad/wip/muse_root/utils/u_listener~ (deleted)

The listener is u_listener and I have hundreds of files in use which probably explains why the problem is getting worse throughout the day... :smiley:

ps shows that I have lots of open instances of the process -

sshtest@ubuntu-dt64:~$ ps -ef | grep u_listen
sshtest   2523     1  0 18:27 pts/2    00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929182746/m_control/1-m_msg_data_31692
sshtest   3132     1  0 17:45 pts/2    00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929174550/m_control/4-m_msg_data_7666
sshtest   5562     1  0 16:15 pts/2    00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929161543/m_control/1-m_msg_data_23624
sshtest   5659     1  0 16:22 pts/2    00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929162238/m_control/6-m_msg_data_21829
sshtest   7384     1  0 16:29 pts/2    00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener

I have just introduced some functionality to exit in the event of fatal errors and this problem started when I was running some test jobs that failed.

I hope this means that I just need to extend the clean exit to explicitly kill off any listeners that haven't exited before the main application exits.

Cheers

Steady
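
One possible shape for that clean exit is an EXIT trap in the main script that kills any listener it started and that is still running. This is only a sketch: start_listener, cleanup_listeners and LISTENER_PIDS are hypothetical names, and it assumes the main script launches the listeners itself so that it knows their PIDs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: remember background listener PIDs and kill any that are
# still running when the main script exits, normally or after a fatal error.
LISTENER_PIDS=()

start_listener() {
    # "$@" is the listener command line, e.g. u_listener plus its control file.
    "$@" &
    LISTENER_PIDS+=("$!")
}

cleanup_listeners() {
    local pid
    for pid in "${LISTENER_PIDS[@]}"; do
        # kill -0 sends no signal; it only tests whether the process exists.
        if kill -0 "$pid" 2>/dev/null; then
            kill "$pid" 2>/dev/null || true
            wait "$pid" 2>/dev/null || true
        fi
    done
    LISTENER_PIDS=()
}

# Runs on any exit of this shell, including an explicit exit from an error handler.
trap cleanup_listeners EXIT
```

The trap does not fire if the main script itself is killed with SIGKILL, so stale listeners from such runs would still need to be cleaned up by hand.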

I assume that you've verified that your RND number generator hasn't generated the same number for two or more temp files. Random is not the same as different; and for an application like this, different is what you need. :wink:
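
If the RND number comes from something like bash's $RANDOM, which only yields values from 0 to 32767, repeats are quite possible over a day of runs. A minimal sketch of getting names that are different rather than random is to let mktemp choose them; new_ctl_file is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mktemp creates the file itself and fails rather than
# reuse an existing name, so two callers never get the same path.
# Usage: new_ctl_file DIR PREFIX   (prints the path of the newly created file)
new_ctl_file() {
    local dir=$1 prefix=$2
    mktemp "${dir}/${prefix}_XXXXXX"
}
```

For example, ctl=$(new_ctl_file "$control_dir" m_msg_data) would return a fresh path on every call, where control_dir stands for whatever directory the run uses.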