An interesting script issue combined with crontab

Hello All,

I am finally posting an issue I faced last week, along with its solution. Let me explain it under a few headings.

Issue's background: It was a nice Tuesday. I went to the office as usual and started checking my emails and the work assigned to me. Suddenly a gentleman came to my desk, looking horrified. He is from another team and had probably heard my name from one of our common friends. After a warm, quick handshake he immediately told me that he was facing a PROD issue. As soon as I heard the word "PRODUCTION", my ears pricked up.

He explained it to me like a new OP in the UNIX & LINUX forums :smiley: e.g. the script was working before a server migration and stopped working after the migration to a new server (that was the only meaningful sentence I got from him).

I took him to a free meeting room and asked him to explain things there. (He explained the same thing again, but this time he showed me the script and its output from the previous server.) I took a deep breath and started to understand their workflow, which gives a very high-level view of what their script(s) are doing. After that I started looking at their old server, but it was already under a decommission request, so a few things were already gone from it (some access, mount points etc.). I will come to the crispy part of that later :).

At this point I knew what their output should look like, so I started my journey of fixing it.

Journey of fixing script: First of all I asked them whether all was well in the QA environment, and the answer was "YES". Then I verified the QA server myself, and yes, everything was working. Initially I decided not to compare the scripts, as there were many of them. Since that person had mentioned a server migration, the first thing that came to my mind was: do they still have the same filesystem names and user names as on the previous server? After checking the servers, it turned out that they had changed.

Then I used a few find commands, e.g. find -type f -exec grep -l "old_user" {} \+ and find -type f -exec grep -l "old_path" {} \+ . Guess what, the results were shocking: these guys had not changed the paths in the new place. Since we were already in an issue, their people allowed me to make changes (with a backup of everything, of course). There were many scripts, all related to each other, so I changed the values in every place.
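The two searches above can also be folded into a single pass. Here is a minimal sketch, assuming a throwaway demo directory created with mktemp; the file names a.ksh, b.ksh and c.ksh are made up for illustration, and old_user / old_path stand for whatever the real old values were.

```shell
#!/bin/sh
# Hypothetical demo tree; in real use, point SCRIPT_DIR at the directory holding the scripts.
SCRIPT_DIR=$(mktemp -d)
printf 'cd /home/old_user/app\n' > "$SCRIPT_DIR/a.ksh"
printf 'echo nothing to see\n'   > "$SCRIPT_DIR/b.ksh"
printf 'cp data /old_path/in\n'  > "$SCRIPT_DIR/c.ksh"

# -l prints only the names of matching files, each -e adds one pattern,
# and "{} +" hands many files to a single grep call.
stale_files=$(find "$SCRIPT_DIR" -type f -exec grep -l -e 'old_user' -e 'old_path' {} + | sort)

# Lists a.ksh and c.ksh (with their full paths), but not b.ksh.
printf '%s\n' "$stale_files"
```

Running the same search again after the edit is a quick way to confirm that no stale reference was missed.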

The time had come to run the script manually, and guess what, the fix passed with flying colors :). I was in "seventh heaven". Since the script does many different tasks at the DB and content management level, it took almost 2 hours to complete. Once they verified that all was well, I asked them to schedule it (whichever way they wanted, by Jenkins or by cron etc.). They told me they had checked the old prod server and the crontab entries were gone (maybe an effect of their decomm request, the above-mentioned crispy part :)). So I asked them to check in the QA environment. (At this point I had really lost faith that their QA was in sync with PROD, though I still thought it was worth a shot.) Yes, the entries were there, so I got to know that they wanted it to run every day at a specific time. Then I put in a simple cron entry, e.g. 12 12 * * * /actual/path/of/script.ksh .

I asked them to check (I intentionally set up the cron job to run after 30 minutes, so that I could go for lunch and see afterwards whether all was well).

After lunch I got the news that the script didn't kick off. I was surprised, as the crontab entry was perfect, and while checking the logs I found that cron had kicked it off, but there were NO other logs. So I put set -x at the start of the script and scheduled it again, because their script was NOT logging anything at all and there was no error handling whatsoever. When cron picked it up the next time, again NO errors showed up.
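When a cron job leaves no trace, it can help to send both stdout and stderr (which is where set -x writes its trace) to a file. A minimal sketch, assuming a temporary file from mktemp stands in for a fixed absolute log path; the echoed messages are placeholders, not output from the real script.

```shell
#!/bin/sh
# Hypothetical log location; a real job would use a fixed absolute path.
LOG_FILE=$(mktemp)

{
    set -x                                  # trace every command; the trace goes to stderr
    echo "job started in $(pwd)"            # record the working directory the job really has
    echo "simulated error message" >&2      # stands in for an error from the real work
    set +x
} >> "$LOG_FILE" 2>&1                       # stdout and stderr both land in the log

cat "$LOG_FILE"
```

The same redirection can be written on the crontab line itself, after the script path, which captures output even from a script that does no logging of its own.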

I was sure something fishy was going on again, possibly related to references to the OLD server. So I started comparing the QA and PROD scripts, and I was in huge shock when I saw that they were about 70 to 80% different (though their logic seemed to be the same, and trust me, their QA script was much better than PROD). By this point I knew that I was on my own.

I started reading the very first script, which was calling 5 to 6 more scripts. (Meanwhile I asked that person to take back the decomm request for the old server so that we could get some more information from it.) While checking the scripts I saw that there were many relative paths (most of the time there were NO absolute paths). I suspected this could be the culprit, so I put in multiple pwd commands, especially wherever their custom jars (Java code archives) were being called.

Believe it or not, I was shocked for almost 2 minutes when I saw the results: their pwd value was NOT changing at all. (They claimed that on the OLD server these relative paths worked. I believe that there the working directory was somehow getting set once the jars finished their work, but on this server that was not happening under cron. I came to know that cron never sources the user's dot profile, so the job does not get the full environment and paths of a login shell.) I was HAPPY that I had found it, so I immediately changed all the relative paths from ../../bla/bla/bla to /actual/path/bla/bla . It took some time because there were many paths. (I also proposed that they add some logging to the scripts and, most importantly, create a variable file, so that in future they need not change any script(s) for paths etc. I guess they may be working on that :).)
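The effect is easy to reproduce outside cron. The sketch below is only an illustration, assuming a made-up layout (bin/ and conf/ under a temporary directory) and made-up script names. It also shows one common alternative to hard-coding absolute paths, which is not what I did here: deriving the base directory from the script's own location.

```shell
#!/bin/sh
# Hypothetical layout: bin/ holds the scripts, conf/ holds a file they read.
APP_DIR=$(mktemp -d)
mkdir -p "$APP_DIR/bin" "$APP_DIR/conf"
echo "setting=1" > "$APP_DIR/conf/app.conf"

# Fragile: the relative path is resolved against the caller's working directory.
cat > "$APP_DIR/bin/fragile.sh" <<'EOF'
#!/bin/sh
cat ../conf/app.conf
EOF

# Sturdier: build an absolute base from the script's own location first.
cat > "$APP_DIR/bin/sturdy.sh" <<'EOF'
#!/bin/sh
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd) || exit 1
cat "$SCRIPT_DIR/../conf/app.conf"
EOF

# Start both from an unrelated directory, the way a scheduler would.
fragile_out=$(cd / && sh "$APP_DIR/bin/fragile.sh" 2>/dev/null) || true
sturdy_out=$(cd / && sh "$APP_DIR/bin/sturdy.sh")

echo "fragile: '$fragile_out'"   # empty, the file was not found
echo "sturdy:  '$sturdy_out'"    # setting=1
```

Both scripts work when started from bin/, which is why this kind of problem tends to stay hidden until the job is moved to a scheduler.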

The good time had come to run the script again from crontab. I set it up to run after 2 minutes, and the script ran successfully and things went well.

Learnings: There were a lot of learning points from this episode:

  • The BEST one for me: our QA and PROD environments should always be in sync.
  • NEVER EVER decomm. a server without confirming that things are going well on the new one.
  • Be very careful with relative paths, as they can be tricky under cron.
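The variable file I proposed to them could look roughly like this. It is a sketch only: the file name env.conf and the variables APP_HOME, JAR_DIR and LOG_DIR are invented for illustration, and a temporary directory stands in for a fixed absolute location.

```shell
#!/bin/sh
# Stands in for a fixed absolute location such as the application's conf directory.
CONF_DIR=$(mktemp -d)

# Hypothetical variable file: the only place where paths are defined.
cat > "$CONF_DIR/env.conf" <<'EOF'
APP_HOME=/actual/path
JAR_DIR=$APP_HOME/jars
LOG_DIR=$APP_HOME/logs
EOF

# Every script sources the file by absolute path and stops if a value is missing.
. "$CONF_DIR/env.conf" || exit 1
: "${APP_HOME:?APP_HOME is not set}"

echo "jars live in $JAR_DIR"   # jars live in /actual/path/jars
```

After the next migration only this one file has to change, and the scripts themselves stay untouched.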

I thought I would share this with you folks. I would like to know your views/comments (if any). Keep learning and keep sharing knowledge :b:

Thanks,
R. Singh


I stopped using relative paths with find a long time ago (unless doing interactive work).

This approach is much safer and easier to maintain:

cd "$ABSOLUTE_PATH_DIR" && find . <further options and operands> || exit 1
cd -

This way, if the directory does not exist or permission is denied, an error is printed on stderr.
You will see this error in local mail when running from crontab (if stderr has not been redirected in the crontab line).

Do not use trailing slashes with directory variables.
Here is an example of things going haywire when you do that:

set -x
#DIR=/home/user # someone commented out this line, or made a mistake
# more lines of code, hundreds of them
cd $DIR/ && find . -type f .. || exit 1 # with DIR empty this expands to: cd / && find .. # enough said
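One way to guard against this, sketched under the assumption of a POSIX shell, is the ${VAR:?message} expansion, which aborts instead of silently expanding to nothing. The subshell and the guard_status variable are there only so the demonstration can carry on and show the result.

```shell
#!/bin/sh
unset DIR    # simulates the commented-out assignment

# ${DIR:?...} makes the shell stop when DIR is unset or empty,
# so cd never gets the chance to land in "/".
guard_status=0
( cd "${DIR:?DIR is not set}" && find . -type f ) 2>/dev/null || guard_status=$?

# A non-zero status here means find was never run.
echo "guarded command exit status: $guard_status"
```

In a real script the expansion would be used without the subshell, so the whole script stops at that line with the message on stderr.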

Decommission -> create a virtual machine out of it, if you can and it is not some obscure OS/hardware :wink:

Good things to do before a decommission (if you cannot virtualize the box in a lab):

  1. Copy all crontab and at entries, and the scripts they use.
  2. Run mount, save the output to a file and copy it somewhere safe.
  3. Run share and save the output to a file (if NFS or other network file systems are exported from the box).
  4. Write down the FC and LAN topology (WWNs, LAN port configuration etc.).
  5. Copy /etc/passwd, /etc/shadow, /etc/group and /etc/hosts (perhaps more, depending on the system).

All of the above takes a couple of minutes or less, and is golden for post-mortem analysis.
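Most of that list can be scripted. The following is a rough sketch, not a complete tool: the snapshot directory comes from mktemp and should be replaced with a location that is copied off the box, every command is allowed to fail because availability differs between systems, and the FC/LAN topology and the scripts referenced by the cron jobs still have to be collected by hand.

```shell
#!/bin/sh
# Hypothetical snapshot location; copy it somewhere safe afterwards.
SNAP_DIR=$(mktemp -d)

# 1. Scheduled jobs (crontab -l exits non-zero when the user has no crontab).
crontab -l > "$SNAP_DIR/crontab.txt" 2>&1 || true
if command -v atq >/dev/null 2>&1; then
    atq > "$SNAP_DIR/atq.txt" 2>&1 || true
fi

# 2. Mounted file systems.
mount > "$SNAP_DIR/mount.txt" 2>&1 || true

# 3. Exported file systems, only where the share command exists.
if command -v share >/dev/null 2>&1; then
    share > "$SNAP_DIR/share.txt" 2>&1 || true
fi

# 5. Account and host files; /etc/shadow is copied only if it is readable.
for f in /etc/passwd /etc/shadow /etc/group /etc/hosts; do
    if [ -r "$f" ]; then
        cp -p "$f" "$SNAP_DIR/" || true
    fi
done

ls -l "$SNAP_DIR"
```

Run as a normal user this captures only that user's crontab; the other users' crontabs need root or a run per user.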

Hope that helps
Regards
Peasant.


A trick I learned here (and I am ashamed that I can't remember from whom) is to always add a second file to grep when using it this way.

grep, when called with a single file, will not show the name of the file where it found something:

find /some/where -type f -exec grep "bla foo" {} \;
whatever bla foo something
another hit bla foo
....

But write it like this:

find /some/where -type f -exec grep "bla foo" /dev/null {} \;

and grep will add the file name at the beginning of each matching line.
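The difference is easy to reproduce. A small sketch, assuming a scratch directory from mktemp that holds a single made-up file, one.txt:

```shell
#!/bin/sh
DEMO_DIR=$(mktemp -d)
printf 'whatever bla foo something\n' > "$DEMO_DIR/one.txt"

# One file per grep call: only the matching line is printed.
without_name=$(find "$DEMO_DIR" -type f -exec grep "bla foo" {} \;)

# /dev/null as a second file: grep now prefixes each hit with the file name.
with_name=$(find "$DEMO_DIR" -type f -exec grep "bla foo" /dev/null {} \;)

echo "$without_name"   # whatever bla foo something
echo "$with_name"      # <DEMO_DIR>/one.txt:whatever bla foo something
```

/dev/null never matches anything, so it changes only how grep formats its output, not what it finds.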

bakunin
