Shell script Help - Data cleansing

Hello community, I am getting log files from a system, and I need to cleanse the data and store it as txt files for reporting purposes. Since these files are generated on a Unix box, we have to write a shell script to handle the data cleansing.

Here is what the sample file data looks like:

InsertTime:201604070523 DocID:101
#headers 
'DocID: 101    MOVEABLE TOOLS:   2 QTY:    0     HELD TOOLS:   0 QTY:    0     BLOCKED TOOLS:   0 QTY:    0'
#columns  'TargetDoc' 'GRank' 'LRank' 'Priority' 'Loc ID'
#widths 12 3 3 12 25
#types 'STRING' 'INTEGER' 'INTEGER' 'STRING' 'STRING'
#rows
'aaaaa' '1' '1' 'Slow' '8gkahinka.01'
'aaaaa' '1' '0' 'Slow' '7nlafnjbaflnbja.01'

#blocked '' '' 
#rule 'Rule_Abcd'
#doc '101'
#station_type ' '
#queue_duration '1.09673e-05'
#process_duration '4.61456'
#ISS-DLIS-DIAGS
InsertTime:201604070523 DocID:102
#headers 
'DocID: 102    MOVEABLE TOOLS:   2 QTY:    0     HELD TOOLS:   0 QTY:    0     BLOCKED TOOLS:   0 QTY:    0'
#columns  'TargetDoc' 'Rank' 'Check Name' 'Loc ID'
#widths 12 3 3 12 25
#types 'STRING' 'INTEGER' 'INTEGER' 'STRING' 'STRING'
#rows
'aa' '1' 'xyz' '8gkahinka.01'
'aax' '1' 'none' '7nlafnjbaflnbja.01'

#blocked '' '' 
#rule 'Rule_Axf'
#doc '102'
#station_type ' '
#queue_duration '1.09673e-05'
#process_duration '4.61456'
#ISS-DLIS-DIAGS
InsertTime:201604070750 DocID:101
#headers 
'DocID: 101    MOVEABLE TOOLS:   2 QTY:    0     HELD TOOLS:   0 QTY:    0     BLOCKED TOOLS:   0 QTY:    0'
#columns  'TargetDoc' 'GRank' 'LRank' 'Priority' 'Loc ID'
#widths 12 3 3 12 25
#types 'STRING' 'INTEGER' 'INTEGER' 'STRING' 'STRING'
#rows
'xxxx' '1' '1' 'Slow' 'bjkkacka.01'
'yyyy' '1' '0' 'Slow' 'jiafjklas.001'

#blocked '' '' 
#rule 'Rule_Abcd'
#doc '101'
#station_type ' '
#queue_duration '1.09673e-05'
#ISS-DLIS-DIAGS

This is the raw data, and I need to write a shell script to cleanse it.

  1. Rows starting with # are comments and should be ignored, except for #columns.
  2. #columns gives the column names and #rows marks the start of the actual data.
  3. Unwanted data is highlighted in red and useful data in black.
  4. The header for the output file is always the union of all #columns names in the data, along with InsertTime and DocID.
  5. Assign the values according to the header and add the InsertTime & DocID values too.
  6. The data delimiter in the output file is |.
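A minimal sketch of rule 1 on its own (with inline sample data piped in; this is only a starting point, not the full solution) could look like:

```shell
# Rule 1 in isolation: drop every line starting with '#' except '#columns'.
# The 'InsertTime:...' lines and the quoted data rows pass through untouched.
printf '%s\n' \
    'InsertTime:201604070523 DocID:101' \
    '#rows' \
    "#columns 'TargetDoc' 'GRank'" \
    "'aaaaa' '1'" |
awk '/^#/ && !/^#columns/ {next} {print}'
```

This keeps the #columns lines available for the later field mapping while discarding the other comment lines.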

Here is the desired output:

InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||

What have you tried so far? Also, please use code tags.

Any attempts/ideas/thoughts from your side?

I am a newbie to shell scripting. I know basic UNIX commands and never had a chance to work with the advanced commands.

I have read about advanced commands online, but I don't know how to use them.

Also, please suggest a website or book for shell scripting.


I am a newbie to shell scripting. I am new to the community, and I will use code tags in my later posts.

Phew, what an ordeal! Far from elegant, utterly clumsy, but it does what is requested:

awk '
BEGIN           {ARGV[ARGC++] = ARGV[1]
                 HD[++HDCNT]  = "InsertTime"
                 HD[++HDCNT]  = "DocID"
                }

/^#col/         {FS = "\047"
                 $0 = $0
                 for (i=2; i<NF; i+=2)  {gsub (" ", "_", $i)
                                         if (FNR == NR) {if (!($i in X))        {X[$i]
                                                                                 HD[++HDCNT] = $i
                                                                                }
                                                        }
                                        }
                 FS = " "
                 $0 = $0
                 if (FNR != NR) {delete COL
                                 COL[1] = 1
                                 COL[2] = 2
                                 for (i=2; i<=NF; i++)  {for (j=3; j<=HDCNT; j++)       {if ($i == HD[j])       {COL[i+1] = j
                                                                                                                 break
                                                                                                                }
                                                                                        }
                                                        }
                                }
                }

FNR == NR       {next
                }

FNR == 1        {for (i=1; i<HDCNT; i++) printf "%s|", HD[i]
                 printf "%s%s", HD[HDCNT], ORS
                }

                {gsub (/\047/, "")
                }

/^#/ || /^ *$/ ||
/^DocID/        {next
                }

/^Inse/         {gsub (/(^| )[^ :]*:/, " ")
                 INS = $1
                 DOC = $2
                 next
                }

                {$1 = INS OFS DOC OFS $1
                 n = split ($0, T)
                 $0 = ""
                 OFS = "|"
                 for (i=1; i<=n; i++)   {$(COL[i]) = T[i]
                                         $HDCNT = $HDCNT
                                        }   
                 print
                 OFS = " "
                }
'  file
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc_ID|Rank|Check_Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||

We're running through the file twice, and I had to replace the spaces in the column headers with underscores to avoid additional field splits. Anybody out there want to try to pimp it up?
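For readers puzzled by the FS = "\047"; $0 = $0 idiom in the script above: assigning $0 to itself forces awk to re-split the current record with the just-changed field separator (\047 is the octal escape for a single quote). A small stand-alone sketch:

```shell
# Assigning $0 to itself re-splits the record with the new FS,
# so the single-quoted names on a '#columns' line become addressable fields.
echo "#columns  'TargetDoc' 'GRank'" |
awk '{FS = "\047"; $0 = $0; print $2 "|" $4}'
# prints: TargetDoc|GRank
```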


Well, one pass only, and slightly prettier:

awk '
BEGIN           {HD[++HDCNT]  = "InsertTime"
                 HD[++HDCNT]  = "DocID"
                }

/^#col/         {FS = "\047"
                 $0 = $0
                 for (i=2; i<NF; i+=2)  {gsub (" ", "_", $i)
                                         if (!($i in X))        {X[$i]
                                                                 HD[++HDCNT] = $i
                                                                }
                                        }
                 FS = " "
                 $0 = $0
                 RW[++RCNT] = $0
                }

/^#/       ||
/^ *$/     ||
/^.DocID/       {next
                }

/^Inse/         {gsub (/(^| )[^ :]*:/, " ")
                 INS = $1
                 DOC = $2
                 next
                }

                {gsub (/\047/, "")
                 LN[RCNT,++LCNT[RCNT]] = INS FS DOC FS $0
                }


END             {for (i=1; i<HDCNT; i++) printf "%s|", HD[i]
                 printf "%s%s", HD[HDCNT], ORS
                 OFS = "|"
                 for (r=1; r<=RCNT; r++)
                        {delete COL 
                         COL[1] = 1
                         COL[2] = 2
                         n = split (RW[r], T)
                         for (i=2; i<=n; i++) for (j=3; j<=HDCNT; j++) if (T[i] == HD[j]) COL[i+1] = j
                         for (l=1; l<=LCNT[r]; l++)
                                {n = split (LN[r,l], T)
                                 $0 = ""
                                  for (i=1; i<=n; i++) $(COL[i]) = T[i]
                                 $HDCNT = $HDCNT
                                 print
                                }
                        }
                }
'  file
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc_ID|Rank|Check_Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||

Hi pdathu,
Someone who has been posting questions to this forum for well over 4 years is not "a newbie who has never had a chance to work on the advanced commands".

You may have chosen to use this site as your unpaid programming staff instead of using it as a way to learn how to use basic UNIX, Linux, and BSD system utilities; but that doesn't qualify you to be someone who has never had a chance to work on the advanced commands.

In future threads, we will be very disappointed if you still claim to be a newbie and refuse to make any attempt to show us that you have invested any effort to solve problems on your own.

Hi RudiC,
I won't claim that this is any prettier than your suggestions (and it does read the input file twice), but it is a slightly different approach and doesn't need to convert spaces in field names to underscores. It does, however, depend on there being no spaces in any of the fixed header field names or data (i.e., those that are not defined by a "#columns" comment line), since it uses spaces and colons as field separators on the:

'FixedHeaderName1:FHN1data[ OtherFixedHeaderName:OTHERdata]...'

lines.
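As a stand-alone sketch of that colon/space split, using the first line of the sample data: splitting on the character class [: ] turns the Name:value pairs into alternating fields, where odd fields are names and even fields are values.

```shell
# Split on colons and spaces: odd fields are names, even fields are values.
echo 'InsertTime:201604070523 DocID:101' |
awk -F '[: ]' '{print $1 "=" $2 ", " $3 "=" $4}'
# prints: InsertTime=201604070523, DocID=101
```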

And, since it defines SQ outside of the awk script using "'" instead of using "\047", it will also work on mainframe systems using EBCDIC as well as on systems using codesets that are supersets of ASCII.
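A tiny stand-alone illustration of that -v SQ idiom, before the full script:

```shell
# The shell hands awk a literal single quote in SQ, so the script itself
# never needs the codeset-dependent "\047" octal escape.
echo "'hello'" | awk -v SQ="'" '{split($0, a, SQ); print a[2]}'
# prints: hello
```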

awk -v SQ="'" '
function AddHeader(name) {
	# Add a given header name (name) to the list of headers to be used by
	# this run of this script if this header name has not been seen before
	# and note the output field number to be used for this header name.
	if(name in OutField) return
	Hdr[++HdrCnt] = name
	OutField[name]
}
BEGIN {	# Add fixed header names.  Note that fixed header names must not
	# contain any spaces and the 1st fixed header name must appear as the
	# first field on any input line supplying fixed header data!
	AddHeader("InsertTime")
	AddHeader("DocID")

	# Set number of fixed header names.
	FixedHdrCnt = HdrCnt

	# Set output field separator.
	OFS = "|"
}
FNR == NR && $1 == "#columns" {
	# On the 1st pass through the input file, gather the header names to be
	# processed on this run.
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		AddHeader(a[i])
	next
}
FNR == 1 && NR > 1 {
	# We are starting the 2nd pass through the input file.  Print the
	# header.
	for(i = 1; i <= HdrCnt; i++)
		printf("%s%s", Hdr[i], (i == HdrCnt) ? ORS : OFS)
}
NR > FNR && $0 ~ ("^" Hdr[1] ":") {
	# Gather fixed header data for following row lines.
	n = split($0, a, /[: ]/)
	for(i = 1; i <= FixedHdrCnt; i++)
		Data[a[i * 2 - 1]] = a[i * 2]

	# Clear columnsFound so we do not try to process "#headers" data lines.
	columnsFound = 0
}
$1 == "#columns" {
	# Gather input field header names to be used for this section of the
	# input file.
	n = split($0, a, SQ)
	for(i = 2; i <= n;  i += 2)
		FieldName[i] = a[i]

	# Note that we have found headers for this section.  (There really
	# should be a separate check for a line starting with "#rows", but the
	# directions given state that that line has to be ignored.)
	columnsFound = 1
}
NR > FNR && $0 ~ ("^" SQ) && columnsFound {
	# Gather and print data from this input line:
	# Gather data...
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		Data[FieldName[i]] = a[i]

	# Print fixed data...
	for(i = 1; i <= FixedHdrCnt; i++)
		printf("%s%s", Data[Hdr[i]], OFS)

	# Print and delete line specific data...
	for(i = FixedHdrCnt + 1; i <= HdrCnt; i++) {
		printf("%s%s", Data[Hdr[i]], (i == HdrCnt) ? ORS : OFS)
		delete Data[Hdr[i]]
	}
}' file file

If anyone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

With the given sample input data, the above script produces the output:

InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||

as requested.


Hello Sir, I joined this community 4 years ago, but my responsibilities changed to a different OS & technology. Now I am working with the Unix OS & shell scripting.

Thanks for your code; it's working fine for some test cases.

Can you suggest any books or online sites to improve my Unix OS & scripting knowledge?