Hi,
I've written a ksh script that reads a file and parses, filters, and formats each line. The script works as expected, but it runs for 24+ hours on a file with 2 million lines. Sometimes the input file has 10 million lines, which means it can run for more than 2 days and still not finish. And of course, the SAs have been chasing me because it shows up in top as running forever.
I need some advice on whether, instead of reading one line at a time, I could run an awk one-liner. I wish I could code it in Perl, but I'm not sure how. Most people say Perl is faster, but I don't know how to write Perl equivalents of the UNIX commands other than by calling system.
Anyway, hopefully I can interest someone in looking into this.
Below is the part of the script that takes the most time:
for LOG in *search_string_found.out
#for LOG in *xyz
do
server_db=`echo $LOG | awk -F"_" '{ print $1 }'`
server_app=`echo $LOG | awk -F"_" '{ print $2 }'`
echo "- [ `date` ] // `wc -l $LOG | awk '{ print $1 }'` lines ==> Processing $LOG // ${server_db} from ${server_app}"
#while IFS="*" read TS CS HOST RESULT SERVICE RETURNCODE
oIFS=$IFS
while read line
do
IFS="*"
echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
timestamp=`echo $TS | awk '{ print $2 }'`
year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
case $month in
"JAN" ) mm="01" ;;
"FEB" ) mm="02" ;;
"MAR" ) mm="03" ;;
"APR" ) mm="04" ;;
"MAY" ) mm="05" ;;
"JUN" ) mm="06" ;;
"JUL" ) mm="07" ;;
"AUG" ) mm="08" ;;
"SEP" ) mm="09" ;;
"OCT" ) mm="10" ;;
"NOV" ) mm="11" ;;
"DEC" ) mm="12" ;;
esac
TS2="$year-$mm-$day $timestamp"
# Look up each value by key name; the key order inside CONNECT_DATA varies between lines
program=`echo "$CS" | sed -n 's/.*(PROGRAM=\([^)]*\)).*/\1/p'`
user=`echo "$CS" | sed -n 's/.*(USER=\([^)]*\)).*/\1/p'`
service_name=`echo "$CS" | sed -n 's/.*(SERVICE_NAME=\([^)]*\)).*/\1/p'`
app_protocol=`echo $HOST | awk -F"(" '{ print $3 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
app_host=`echo $HOST | awk -F"(" '{ print $4 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
app_port=`echo $HOST | awk -F"(" '{ print $5 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
#echo "- line = $line"
#echo "- timestamp = $TS"
#echo " TS2 = $TS2"
#echo "- connectstring = $CS"
#echo " program = $program"
#echo " user = $user"
#echo " service_name = $service_name"
#echo "- host = $HOST"
#echo " app_protocol = $app_protocol"
#echo " app_host = $app_host"
#echo " app_port = $app_port"
#echo "- result = $RESULT"
#echo "- service = $SERVICE"
#echo "- returncode = $RETURNCODE"
#echo "-------------------------------------------------------------"
#echo
RETURNCODE=`echo $RETURNCODE | sed "s/ *//g"`
detail="$TS2^${server_db}^${server_app} = ${app_host}^$program^$user^${service_name}^$RETURNCODE^$line"
#echo "${detail}" | tee -a ${f_report}
echo "${detail}" >> ${f_report}
IFS=$oIFS
done < $LOG
done
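Part of the cost is probably the number of processes being started: every backquoted awk or sed call runs at least one external program, and the inner loop does that a couple of dozen times per line. As a small illustration of avoiding that, the file-name split at the top of the loop could be done with shell parameter expansion alone. This is only a sketch, and the file name below is a made-up example.

```shell
# Hypothetical file name following the <db>_<app>_search_string_found.out pattern
LOG=db01_app01_search_string_found.out

# Everything before the first underscore
server_db=${LOG%%_*}

# Drop the first field, then keep everything before the next underscore
rest=${LOG#*_}
server_app=${rest%%_*}

echo "$server_db $server_app"
```

The same idea (no external command per value) is what makes a single awk pass over the whole file so much cheaper than one awk per field.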
Below are example entries from the input file that the script reads. It has at least 2 million lines and can go up to 10 million. I've changed the entries as they are customer data.
04-MAR-2018 03:19:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60791)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:19:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60795)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:21:07 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickeyp0))(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.7)(PORT=14582)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:22:25 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickeyp0))(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.7)(PORT=15176)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:24:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60881)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:24:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60885)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:29:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60965)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:29:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60969)) * establish * test_app.abcde.xx.yy * 12514
04-MAR-2018 03:29:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=xyzimain)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60973)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:57:10 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickeyp0))(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.7)(PORT=24394)) * establish * test_app.abcde.xx.yy * 12514
What I want to do, in the simplest terms, is:
- Change the date format to YYYY-MM-DD, mainly because it sorts most conveniently.
- Extract some information from each line, i.e. host name, IP, program name, service name, return code, etc.
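A rough sketch of how these two steps might look in a single awk pass per file, so that no extra process is started per line. It assumes a POSIX awk (nawk, gawk, or /usr/xpg4/bin/awk on Solaris) and that the fields are always separated by " * ". The names reformat_log and val are made up for illustration. Values are looked up by key name (PROGRAM, USER, and so on) rather than by position, because the key order differs between the sample lines.

```shell
# Hypothetical helper: reformat one log file in a single awk pass.
# Usage: reformat_log <server_db> <server_app> <logfile>
reformat_log() {
    awk -F' [*] ' -v db="$1" -v app="$2" '
    # Return the value of "(KEY=value)" inside s, or "" when the key is absent.
    function val(s, key,    i, rest, j) {
        i = index(s, "(" key "=")
        if (i == 0) return ""
        rest = substr(s, i + length(key) + 2)
        j = index(rest, ")")
        return (j > 0) ? substr(rest, 1, j - 1) : rest
    }
    BEGIN {
        # Build the month lookup table once, before any line is read.
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", m, " ")
        for (i = 1; i <= n; i++) mm[m[i]] = sprintf("%02d", i)
    }
    {
        split($1, ts, " ")          # ts[1] = DD-MON-YYYY, ts[2] = HH:MM:SS
        split(ts[1], d, "-")        # d[1] = DD, d[2] = MON, d[3] = YYYY
        rc = $6; gsub(/ /, "", rc)  # strip blanks from the return code
        # Same layout as the detail line in the original loop.
        printf "%s-%s-%s %s^%s^%s = %s^%s^%s^%s^%s^%s\n",
            d[3], mm[d[2]], d[1], ts[2], db, app,
            val($3, "HOST"), val($2, "PROGRAM"), val($2, "USER"),
            val($2, "SERVICE_NAME"), rc, $0
    }' "$3"
}
```

Inside the for loop, the whole while/read block would then shrink to something like: reformat_log "$server_db" "$server_app" "$LOG" >> "$f_report"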
I then redirect these formatted records to a file that I can group by return code, or simply run through sort | uniq -c to show a count of occurrences.
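For reference, the counting step might look roughly like this. It is a sketch that assumes the report is "^"-delimited with the return code in field 7, as in the detail line built by the script above; count_by_rc is a made-up name.

```shell
# Hypothetical helper: count report records per return code, most frequent first.
# Usage: count_by_rc <report_file>
count_by_rc() {
    cut -d'^' -f7 "$1" | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by the return code, so a non-zero code such as 12514 stands out with its number of occurrences.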
Any advice is much appreciated. Thanks in advance.