Hi All
I can't get my head around a problem I have with a control file.
The file is to control a "Listener" of sorts that listens on a named pipe. A script kicks off the listener in the background and passes it a control file. In the file it sets the Status field to pending. It then waits for the listener to get up and running and change its status to "Listening" before launching another job that uses the named pipe to report its status.
The control file: -
Name:m_fatal_error
Pid:20755
JobFile:/home/sshtest/muse/run/demo/120929165000/m_control/4-m_job_data_10519
LogFile:/home/sshtest/muse/run/demo/120929165000/m_control/m_run_log
TmpFile:/home/sshtest/muse/run/demo/120929165000/m_tmp/1_u_listener_4498.tmp
JobHost:tx5xn
Start:12/09/29-16:50:01
Finish:
Pipe:4-m_msg_data_10519
Status:Pending
Constants:/home/brad/wip/muse_root/lib/muse.constants
sshtest@ubuntu-dt64:~$ cat /home/sshtest/muse/run/demo/120929165000/m_tmp/1_u_listener_4498.tmp
The listener uses a function (that works 90% of the time) to change the status to Listening -
m_write_ctl_file_field ${C_MSG_JOB_STATUS} "Listening" "${MSG_CTL_FILE}"
m_write_ctl_file_field()
{
    TMP="$(sed -n ''${1}'p' ${3} | cut -d":" -f1):${2}"
    for s in ${PIPESTATUS[@]}; do [[ $s -eq 0 ]] || error_exit "Error:Pipe failed 2 (${SCRIPT})" ; done
    awk -v ln=${1} -v lo="${TMP}" '(NR == ln){print lo;next}{print $0}' "${3}" > "${U_MSG_TMPFILE}"
    cp "${U_MSG_TMPFILE}" "${3}" || error_exit "Error: cp failed in m_write_ctl_file_field"
}
C_MSG_JOB_STATUS is the line number of the Status field, MSG_CTL_FILE is the control file, and "Listening" is the new status.
After I time out waiting for the listener to change its status to Listening, the tmp file this function uses contains -
Name:m_fatal_error
Pid:
JobFile:/home/sshtest/muse/run/demo/120929165000/m_control/4-m_job_data_10519
LogFile:/home/sshtest/muse/run/demo/120929165000/m_control/m_run_log
TmpFile:/home/sshtest/muse/run/demo/120929165000/m_tmp/1_u_listener_4498.tmp
JobHost:tx5xn
Start:12/09/29-16:50:01
Finish:
Pipe:4-m_msg_data_10519
Status:Listening
Constants:/home/brad/wip/muse_root/lib/muse.constants
So the function has written it to the tmp file. The check on the exit status for the cp command doesn't seem to error or I would see an error log in /tmp. So I can't understand why the tmp file and the control file aren't the same....
Like I said, most of the time this code works.
I'm wondering if this is some sort of buffering problem and whether I need to adopt a safer method that explicitly closes the file or something???
Any help appreciated
Steady
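On the "safer method" question, one common approach is to write the new contents to a temporary file in the same directory as the control file and then rename it over the original with mv, so a reader polling the file sees either the old version or the new one and never a half-written copy. The sketch below is only an illustration under that assumption: write_ctl_field_atomic and its variable names are hypothetical and not part of the original scripts, and nothing here shows that cp is the actual cause of the problem described above.

```shell
#!/usr/bin/env bash
# Hypothetical replacement for m_write_ctl_file_field (names are assumptions).
# $1 = line number of the field, $2 = new value, $3 = control file.
write_ctl_field_atomic() {
    local line_no=$1 value=$2 ctl_file=$3 tmp
    # Create the temp file beside the control file so the final mv is a
    # rename within one filesystem rather than a copy.
    tmp=$(mktemp "${ctl_file}.XXXXXX") || return 1
    # On the target line keep everything up to the first ':' and append the
    # new value; every other line is printed unchanged.
    if awk -v ln="$line_no" -v val="$value" '
            NR == ln { i = index($0, ":"); if (i) $0 = substr($0, 1, i) val }
            { print }' "$ctl_file" > "$tmp" &&
       mv -f "$tmp" "$ctl_file"
    then
        return 0
    fi
    # Something failed: remove the leftover temp file and report the error.
    rm -f "$tmp"
    return 1
}
```

It would be called the same way as the existing function, for example write_ctl_field_atomic "${C_MSG_JOB_STATUS}" "Listening" "${MSG_CTL_FILE}". Note that mktemp creates the file with owner-only permissions, so the control file ends up with those permissions after the rename.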
Have you checked the timestamps on ${MSG_CTL_FILE} and ${U_MSG_TMPFILE}?
Are you sure that the cp in m_write_ctl_file_field() didn't succeed and then something else updated ${MSG_CTL_FILE} changing the status back to Pending?
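A quick way to check both points might look like the sketch below. It is a hypothetical helper (compare_ctl_and_tmp is not from the original scripts) that lists the modification times of the two files, reports whether their contents match, and notes when the tmp file is the newer of the two.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original scripts.
# $1 = control file, $2 = tmp file written by m_write_ctl_file_field.
compare_ctl_and_tmp() {
    local ctl_file=$1 tmp_file=$2
    # ls -l prints each file's last-modification time.
    ls -l "$ctl_file" "$tmp_file"
    # cmp -s is silent and returns 0 only when the files are byte-identical.
    if cmp -s "$ctl_file" "$tmp_file"; then
        echo "contents: identical"
    else
        echo "contents: different"
    fi
    # -nt is true when the first file has the newer modification time.
    if [[ "$tmp_file" -nt "$ctl_file" ]]; then
        echo "tmp file is newer than control file"
    fi
}
```

If the tmp file turns out to be newer, that would suggest the copy never reached the control file. If the control file is newer and still says Pending, that would point toward something rewriting it after the cp.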
Hi Don
Thanks for getting back to me.
I'm sure nothing else is writing it back again as each file has a unique name with a RND number appended. On top of that I don't have any code writing Pending to it at the moment.
I have just run this in a loop and then run an lsof and can see I have hundreds of open files in use by listeners that should be closed-
u_listene 29054 sshtest mem REG 8,1 134344 264820 /lib/i386-linux-gnu/ld-2.15.so
u_listene 29054 sshtest mem REG 8,1 13940 266883 /lib/i386-linux-gnu/libdl-2.15.so
u_listene 29054 sshtest mem REG 8,1 2932080 391615 /usr/lib/locale/locale-archive
u_listene 29054 sshtest mem REG 8,1 26256 16482 /usr/lib/i386-linux-gnu/gconv/gconv-modules.cache
u_listene 29054 sshtest 0r CHR 1,3 0t0 4748 /dev/null
u_listene 29054 sshtest 1u CHR 136,2 0t0 5 /dev/pts/2
u_listene 29054 sshtest 2u CHR 136,2 0t0 5 /dev/pts/2
u_listene 29054 sshtest 255r REG 8,17 3971 1705114 /home/brad/wip/muse_root/utils/u_listener~ (deleted)
The listener is u_listener and I have hundreds of files in use which probably explains why the problem is getting worse throughout the day...
ps shows that I have lots of open instances of the process -
sshtest@ubuntu-dt64:~$ ps -ef | grep u_listen
sshtest 2523 1 0 18:27 pts/2 00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929182746/m_control/1-m_msg_data_31692
sshtest 3132 1 0 17:45 pts/2 00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929174550/m_control/4-m_msg_data_7666
sshtest 5562 1 0 16:15 pts/2 00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929161543/m_control/1-m_msg_data_23624
sshtest 5659 1 0 16:22 pts/2 00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener /home/sshtest/muse/run/demo/120929162238/m_control/6-m_msg_data_21829
sshtest 7384 1 0 16:29 pts/2 00:00:00 /bin/bash /home/brad/wip/muse_root/utils/u_listener
I have just introduced some functionality to exit in the event of fatal errors and this problem started when I was running some test jobs that failed.
I hope this means that I just need to extend the clean exit to explicitly kill off listeners that haven't exited prior to exiting the main application.
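One way such a clean exit is often arranged in bash is to record the PID of every background listener as it is started and kill whatever is still alive from an EXIT trap. The sketch below assumes that approach; start_listener, cleanup_listeners and listener_pids are hypothetical names, not taken from the original scripts. A trap cannot run if the main script itself is killed with SIGKILL.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: track background listeners and stop them on exit.
listener_pids=()

# Run the given command in the background and remember its PID.
start_listener() {
    "$@" &
    listener_pids+=("$!")
}

# Send SIGTERM to every recorded listener that is still running, then reap it.
cleanup_listeners() {
    local pid
    for pid in "${listener_pids[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
        fi
    done
    listener_pids=()
}

# Run the cleanup whenever the main script exits, including via error_exit.
trap cleanup_listeners EXIT
```

The main script would then launch each listener with something like start_listener /path/to/u_listener "${MSG_CTL_FILE}" (path shown only as a placeholder) instead of backgrounding it directly.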
Cheers
Steady
I assume that you've verified that your RND number generator hasn't generated the same number for two or more temp files. Random is not the same as different; and for an application like this, different is what you need.
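If the names are currently built from a random number, mktemp is one way to get "different" rather than merely "random": it creates the file itself and fails instead of reusing an existing name. A minimal sketch, assuming a hypothetical wrapper called make_unique_file:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the original scripts.
# $1 = directory, $2 = name prefix. Prints the path of a newly created file.
make_unique_file() {
    local dir=$1 prefix=$2
    # mktemp replaces the trailing X's with a unique suffix and creates the
    # file, so two callers cannot be handed the same name.
    mktemp "${dir}/${prefix}.XXXXXX"
}
```

For example, tmpfile=$(make_unique_file "$run_dir/m_tmp" u_listener) would create and return a fresh file on every call, where run_dir stands in for whatever directory the run uses.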