Child killing parent process and how to set up SMF

Bryan.Eidson · September 7, 2012, 12:13pm

Hello,

A little background on what we are doing first. We are running several applications from a CLI, and not all of them are fully functional. They do on occasion core dump, which is not a problem in itself. We are running a service that takes a screen scrape of those apps and displays them in a more user-friendly Java window, so this service is the parent process of all of these applications when they are run in the GUI.

When one of the applications core dumps, the service goes down and restarts, and after four times in a close span, it doesn't come back up because it is "restarting too quickly." We can deal with the core dumps on their own, but having the whole process go down stops other users from running other applications through the GUI.

My question is how do I set it so when the child process errors out the parent will ignore it and keep on chugging. And if that is not possible, the next solution, although not ideal, would be to change it so the restarter does not decide that after four restarts it will throw the service into maintenance mode.

Any help or direction would be amazing.

Thank you,
Bryan

I have tried adding the following to the manifest, but it so far does not seem to make a difference.
<property_group name='startd' type='framework'>
    <propval name='ignore_error' type='astring'
        value='core,signal' />
    <propval name='duration' type='astring'
        value='transient' />
</property_group>

Also, this is all on Solaris 10, and the program is called infonet, if that is relevant.

jim_mcnamara · September 7, 2012, 12:30pm

In order to help we need:

What CLI are we talking about here?
Are you on Solaris 10?
What shell do you use?

Bryan.Eidson · September 7, 2012, 12:35pm

Thanks for your response.

Yes, I am using Solaris 10.

I use /bin/sh, some others use ksh.

I am not sure what you mean by "what CLI", but it is an old (~25 year?) ERP system that we fix in-house as it breaks.

jim_mcnamara · September 7, 2012, 2:12pm

So, you are asking us how to ignore signals in an old ERP. The ERP apps I have seen mostly run on their own in something like realtime and have a database.

Are there shell scripts that invoke your code (they would be called by the GUI app)?
If there are, then you have to modify the ones you run in DEV to add a trap command:

trap 'exit 1' 11

just returns an error when the process dumps core. NOTE: I am assuming you are not getting SIGILL or SIGBUS signals. Just segfaults.

Otherwise, should this html-like string

<propval name='ignore_error' type='astring'
value='core,signal' />

use the signal name(s)? core is not a signal name. SEGV or SIGSEGV is a signal name; it may also be 11, which is the signal number in Solaris for a segfault. (I am guessing here, not about signals but about your ERP GUI.)

Bryan.Eidson · September 7, 2012, 5:43pm

There aren't any shell scripts launching it, so I took the second route you suggested. I had some issues with svccfg importing my new manifest, but manually entering those three entries into the configuration instead of core seems to have done it, or at least allowed it to keep working past the 5 failures that previously put it in maintenance mode.

Many thanks!

Bryan.Eidson · September 13, 2012, 4:06pm

An update to where I am:

I made the above changes to the manifest and it would not import. I then used svccfg to manually add the startd and ignore_error entries. At the time I thought it was working.

Next I wanted to figure out why the manifest I was given was not working, which I will get back to. Also I wanted to make sure if the system restarted that the service would retain those settings. It turns out it did, but the service would not start.

The problem with the way the manifest was written was that it defined another instance that we weren't using, with everything else written inside of it. So there were no start and stop methods for the default instance, and my property_group settings were not being applied to the default. I rewrote the manifest by getting rid of the defined instance, declaring single_instance, and adding the property group to the default instance.

Swell, I thought I had solved the problem because the manifest imported and the service started fine. Except I am back at square one, because the settings

startd/ignore_error                astring  SIGSEGV,SIGABRT,SIGILL,SIGQUIT,SIGSYS

that are clearly showing up in svcprop do not seem to be doing what they are supposed to: the service went back into maintenance mode after 4 failures. I think the only reason it seemed to work before is that the missing start and stop methods made it impossible for the restarter to bring the service down. I am not sure how the system even allowed that to get in, but it's the only thing I could think of.

After all that, any idea how to stop it from going down after a failure?

---------- Post updated 09-13-12 at 01:06 PM ---------- Previous update was 09-12-12 at 01:44 PM ----------

Not sure if there is any extra information I can provide. It's not perfect, but I think I have purged the obvious flaws in the manifest:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!--
    Production Infonet Manifest
    July 23, 2012 03:31:42 PDT dgz
-->
<service_bundle type='manifest' name='site:infonet'>
<service
        name='site/infonet'
        type='service'
        version='1'>
        <single_instance />
        <dependency name='paths'
            grouping='require_all'
            restart_on='error'
            type='path'>
                <service_fmri value='file://localhost/opt/unisyn/infonet/' />
                <service_fmri value='file://localhost/opt/unisyn/infonet/lib/system.conf' />
        </dependency>
        <dependency name='network'
            grouping='require_any'
            restart_on='error'
            type='service'>
                <service_fmri value='svc:/network/service' />
        </dependency>
        <dependent
                name='infonet_multi-user'
                grouping='optional_all'
                restart_on='none'>
                <service_fmri value='svc:/milestone/multi-user' />
        </dependent>
        <!--
                The timeout needs to be large enough to wait for startup.
        -->
        <exec_method
            type='method'
            name='start'
            exec='/lib/svc/method/infonet start'
            timeout_seconds='60' />
        <exec_method
            type='method'
            name='stop'
            exec='/lib/svc/method/infonet stop'
            timeout_seconds='60' />
        <property_group name='startd' type='framework'>
                <propval name='ignore_error' type='astring'
                        value='SIGSEGV,SIGABRT,SIGILL,SIGQUIT,SIGSYS' />
        </property_group>
        <instance name='default' enabled='true'>
         <property_group name='startd' type='framework'>
           <propval name='ignore_error' type='astring'
                value='SIGSEGV,SIGABRT,SIGILL,SIGQUIT,SIGSYS' />
         </property_group>
        </instance>
        <template>
                <common_name>
                        <loctext xml:lang='C'>
                        Production Infonet (httpd)
                        </loctext>
                </common_name>
        </template>
 
</service>
</service_bundle>

Also, here is the svcprop output from after it went into maintenance mode:

# svcprop infonet:default
startd/ignore_error astring SIGSEGV,SIGABRT,SIGILL,SIGQUIT,SIGSYS
general/enabled boolean true
general/single_instance boolean true
paths/entities fmri [URL removed because < 5 posts]
paths/grouping astring require_all
paths/restart_on astring error
paths/type astring path
network/entities fmri svc:/network/service
network/grouping astring require_any
network/restart_on astring error
network/type astring service
dependents/infonet_multi-user fmri svc:/milestone/multi-user
start/exec astring /lib/svc/method/infonet\ start
start/timeout_seconds count 60
start/type astring method
stop/exec astring /lib/svc/method/infonet\ stop
stop/timeout_seconds count 60
stop/type astring method
tm_common_name/C ustring Production\ Infonet\ \(httpd\)
restarter/start_pid count 972
restarter/start_method_timestamp time 1347486995.549591000
restarter/start_method_waitstatus integer 0
restarter/transient_contract count
restarter/logfile astring /var/svc/log/site-infonet:default.log
restarter/auxiliary_state astring restarting_too_quickly
restarter/next_state astring none
restarter/state astring maintenance
restarter/state_timestamp time 1347487004.979156000
restarter/contract count
restarter_actions/refresh integer
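
One thing that may be worth checking, offered as an untested sketch: as far as svc.startd(1M) documents it, startd/ignore_error takes the event keywords core and signal rather than signal names, so a list such as SIGSEGV,SIGABRT,... may simply not be recognized by the restarter. Under that assumption, the instance-level property group would look like the fragment below. The instance name and layout are taken from the manifest above; nothing here has been verified against infonet.

```xml
<!-- Sketch only: swaps the signal-name list for the two event
     keywords that svc.startd(1M) is assumed to accept. -->
<instance name='default' enabled='true'>
  <property_group name='startd' type='framework'>
    <!-- core   : a process in the service's contract dumped core
         signal : a process was killed by a signal sent from outside
                  the service's contract -->
    <propval name='ignore_error' type='astring'
        value='core,signal' />
  </property_group>
</instance>
```

After re-importing the manifest, running svcadm refresh infonet:default followed by svcadm clear infonet:default should pick up the new property and take the instance out of maintenance. If the parent process itself exits when a child dies, the restarter will still see the service as failed, and ignore_error will not help with that case.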