Sed/awk command to convert number occurrences into date format and merge a set of lines

Chinmaya_Kabi · October 13, 2015, 3:19am

Hi,

I am stuck on a requirement for a file that has the format below.

20150812170500846959990854-25383-8.0.0
"ABC Report" hp96880
"4952"
20150812170501846959990854-25383-8.0.0 End of run
20150812060132846959990854-20495-8.0.0
"XYZ Report" vg76452
"1006962188"
20150812060141846959990854-20495-8.0.0
"ZZY Report" fu59172
20150812060147846959990854-20495-8.0.0 End of run

Each block follows the pattern below.
Line 1: Start Time
Line 2: Report Name and User
Line 3: Identifier
Line 4: End Time
In the sample above, the 2nd block is missing the End Time and the 3rd block is missing the Identifier.

The requirement is to

Convert all lines starting with "20" into date format, i.e. YYYY/MM/DD:HH:MM:SS.
Merge each block, from Start Time to End Time, into one line separated by commas.
Ignore blocks that don't have the End Time.
Add a blank space in a block that doesn't contain the Identifier.
If possible, separate Report Name and User Name with a comma.

The output should basically look like the below.

2015/08/12:17:05:00,"ABC Report",hp96880,"4952",2015/08/12:17:05:01
2015/08/12:06:01:41,"ZZY Report",fu59172,"",2015/08/12:06:01:47

I used a while loop with if tests to address the requirements, but the script slows down on large files, so I'm looking for a faster solution using sed or awk.
Can anyone please help me out here?

RudiC · October 13, 2015, 3:24am

Please use code tags as required by forum rules!

And, post your attempts so far.

Chinmaya_Kabi · October 13, 2015, 3:30am

Hi Rudi,

Sorry.

I had used the script below:

joinstr=""
HDate=""
userrpt=""
while read line
do
    printf "."
    # A line containing "End of run" closes the current block.
    EndHeader=`echo "$line" | grep -c "End of run"`
    if [ "$EndHeader" -eq 1 ]
    then
        # substr() positions start at 1 in awk, so the year is substr($1,1,4).
        HDate=`echo "$line" | awk 'BEGIN { FS=OFS="," } {$1=substr($1,1,4)"/"substr($1,5,2)"/"substr($1,7,2)":"substr($1,9,2)":"substr($1,11,2)":"substr($1,13,2);print}'`
        joinstr=$joinstr","$HDate
        echo "$joinstr" >> "$OUT_PATH/$OUT_FILE"
        joinstr=""
    else
        # A line starting with "20" (and not an end line) opens a new block.
        BegHeader=`echo "$line" | grep -c "^20"`
        if [ "$BegHeader" -eq 1 ]
        then
            HDate=`echo "$line" | awk 'BEGIN { FS=OFS="," } {$1=substr($1,1,4)"/"substr($1,5,2)"/"substr($1,7,2)":"substr($1,9,2)":"substr($1,11,2)":"substr($1,13,2);print}'`
            joinstr=$HDate
        else
            # Report/user or identifier line: turn the quote boundaries into commas.
            userrpt=`echo "$line" | sed 's/" /",/g' | sed 's/ "/,"/g'`
            joinstr=$joinstr","$userrpt
        fi
    fi
done < tempreportuserfile

But the script slows down and hence would like a faster solution.

RudiC · October 13, 2015, 3:55am

No surprise, you're creating 12 processes per line read. Try

awk '
function TMCVT(TStr)    {return substr(TStr,  1, 4) "/" substr(TStr,  5, 2) "/" substr(TStr,  7, 2) ":" \
                                substr(TStr,  9, 2) ":" substr(TStr, 11, 2) ":" substr(TStr, 13, 2) ":" \
                                substr(TStr, 15, 2)
                        }
                {CNT = split ($1, T, "-")
                 if (length (T[1]) == 26) TVAR = TMCVT(T[1])
                }

/End of run/    {print STRT, RPT, USR, ID, TVAR
                }
/^20/           {RPT = USR = ""
                 ID = "\" \""
                 STRT = TVAR
                 next
                }
NF == 1         {ID = $1
                 next
                }
                {for (i=1; i<NF; i++) RPT = RPT (RPT?FS:_) $i
                 USR = $NF
                }
' OFS="," file
2015/08/12:17:05:00:84,"ABC Report",hp96880,"4952",2015/08/12:17:05:01:84
2015/08/12:06:01:41:84,"ZZY Report",fu59172," ",2015/08/12:06:01:47:84
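
The output above carries an extra two-digit field after the seconds and a space inside the empty identifier, while the first post asks for YYYY/MM/DD:HH:MM:SS and "". A minimal sketch of a variant that aims at exactly the requested layout follows. It is not the code posted above: the function name merge_blocks, the helper stamp and the variable names are made up for illustration, and it assumes that every start and end line begins with "20", that the user is the last field of the report line, and that an awk with user-defined functions is available.

```shell
# merge_blocks is a hypothetical helper name; it reads the log from the
# files given as arguments, or from standard input when there are none.
merge_blocks() {
    awk -v OFS="," '
    # Format the first 14 digits (YYYYMMDDhhmmss) as YYYY/MM/DD:hh:mm:ss.
    function stamp(s) {
        return sprintf("%s/%s/%s:%s:%s:%s",
                       substr(s, 1, 4), substr(s, 5, 2), substr(s, 7, 2),
                       substr(s, 9, 2), substr(s, 11, 2), substr(s, 13, 2))
    }
    /^20/ {
        if (/End of run/) {
            # End line: print the block only if a start and a report were seen.
            if (start != "" && rpt != "")
                print start, rpt, usr, id, stamp($1)
            start = ""
        } else {
            # Start line: open a new block, dropping any unfinished one.
            start = stamp($1); rpt = usr = ""; id = "\"\""
        }
        next
    }
    NF == 1 { id = $1; next }    # a single token is taken as the identifier
    {
        # Report line: the last field is the user, the rest is the report name.
        usr = $NF
        rpt = $0
        sub(/[ \t]+[^ \t]+[ \t]*$/, "", rpt)
    }
    ' "$@"
}
```

Run it as merge_blocks file. On the sample from the first post this should print the two lines requested there, because a block that never reaches "End of run" is discarded as soon as the next start line arrives.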

Chinmaya_Kabi · October 13, 2015, 5:00am

Thanks Rudi,

However, I'm getting the errors below.

awk: syntax error near line 2
awk: bailing out near line 2

Also, could you explain it a bit? I'm quite new to these commands.

Regards,
Chinmaya

RudiC · October 13, 2015, 5:18am

Citing Don Cragun: "If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk."
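
If the same script has to run on several systems, the choice of awk can be probed instead of hard-coded. The following is only a sketch: pick_awk is a made-up helper name, and the candidate list is just the three names from the quote above. It treats a candidate as usable if it runs a tiny program that defines a function, which is the feature the script above depends on.

```shell
# pick_awk is a hypothetical helper: it prints the first candidate that
# can run a program containing a user-defined function.
pick_awk() {
    for candidate in /usr/xpg4/bin/awk nawk awk; do
        # Skip names that do not resolve to an executable.
        command -v "$candidate" >/dev/null 2>&1 || continue
        # Probe: an awk without function support fails with a syntax error.
        if "$candidate" '
            function probe(x) { return x }
            BEGIN { exit probe(0) }
        ' </dev/null >/dev/null 2>&1
        then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

A script could then start with AWK=`pick_awk` and call "$AWK" wherever it would otherwise call awk.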

Try a pure shell solution as well (it needs bash or ksh93 for the ${V1:0:4} substring expansion and [[ ]]):

while read V1 V2 V3 V4 REST
    do  TIM="${V1:0:4}/${V1:4:2}/${V1:6:2}:${V1:8:2}:${V1:10:2}:${V1:12:2}"
        [[ "$V1" == 20* ]] \
          &&    { [[ "$V2 $V3 $V4" == "End of run" ]] \
                  &&    { printf "%s,%s,%s,%s,%s\n" "$BEG" "$RPT" "$USR" "$ID" "$TIM"; } \
                  ||    { BEG="$TIM"; ID='" "'; continue; }   
                } 
        [[ "$V2" ]] || { ID=$V1; continue; }
        RPT="$V1 $V2"; USR="$V3"
    done < file
2015/08/12:17:05:00,"ABC Report",hp96880,"4952",2015/08/12:17:05:01
2015/08/12:06:01:41,"ZZY Report",fu59172," ",2015/08/12:06:01:47

It's quite difficult to sync in on those records with elements missing and fields consisting of several words. So the above is far from elegant and may benefit from some polishing...

Chinmaya_Kabi · October 13, 2015, 5:45am

Thanks Rudi.

That helped a great deal.
I'm now looking into additional cases where the file contains multiple identifiers, and I will try tweaking the code myself. I'll ask for your help if I get stuck.
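
For the multiple-identifier case, one possible direction is sketched below. The thread does not say what such input or its output should look like, so everything here is an assumption: that each extra identifier sits on its own single-token line inside the block, and that joining them with ";" in one output field is acceptable. The names merge_blocks_multi_id, stamp and ids are made up for illustration.

```shell
# merge_blocks_multi_id is a hypothetical helper; it assumes every
# identifier is a single token on its own line between start and end.
merge_blocks_multi_id() {
    awk -v OFS="," '
    # Format the first 14 digits (YYYYMMDDhhmmss) as YYYY/MM/DD:hh:mm:ss.
    function stamp(s) {
        return sprintf("%s/%s/%s:%s:%s:%s",
                       substr(s, 1, 4), substr(s, 5, 2), substr(s, 7, 2),
                       substr(s, 9, 2), substr(s, 11, 2), substr(s, 13, 2))
    }
    /^20/ {
        if (/End of run/) {
            # End line: emit the block, with "" when no identifier was seen.
            idout = (ids == "" ? "\"\"" : ids)
            if (start != "" && rpt != "")
                print start, rpt, usr, idout, stamp($1)
            start = ""
        } else {
            # Start line: open a new block and forget the previous one.
            start = stamp($1); rpt = usr = ids = ""
        }
        next
    }
    # Identifier line: append to the list, separated by ";" (an assumption).
    NF == 1 { ids = (ids == "" ? $1 : ids ";" $1); next }
    {
        # Report line: the last field is the user, the rest is the report name.
        usr = $NF
        rpt = $0
        sub(/[ \t]+[^ \t]+[ \t]*$/, "", rpt)
    }
    ' "$@"
}
```

If each identifier should instead produce its own output line, the print would have to move into the identifier rule, at the cost of not knowing the end time yet at that point.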