awk to get multiple strings in one variable

shunya · January 24, 2018, 12:03pm

I am processing a file using awk to get a few input variables which I'll use later in my script. I am learning to script using awk, so please point out any mistakes I made in my code. A sample of the file follows:

# cat junk1.jnk
  Folder1                    : test_file     (File)
                                test1_file    (File)
                                test2_file    (File)
   Lines (9):
    00140  Li                      CHAR                         188
    00141  Li                      CHAR                         188
    00142  Li                      CHAR                         188
    00143  Li                      CHAR                         188
    00144  Li                      CHAR                         188
    00145  Li                      CHAR                         375
    00146  Li                      CHAR                         375
    00147  Li                      CHAR                         375

I am trying to extract a comma-separated list of file names (identified by the last field, (File), in parentheses), followed by the number of lines, which is (9), and a comma-separated list of unique CHAR values: the last field of each line that starts with a hex value after the string "Lines (9):". I am using the following code. I get the file names and the line count, but I am unable to get the comma-separated list of unique CHAR values. In this case it should be 188,375.

cat junk1.jnk | awk 'BEGIN { printf ("%-23s %-4s %-5s\n", "File Names"," Lines", "CHARS")
printf ("%-23s %-4s %-5s\n", "--------------"," ----"," ------")}
{
if ($0 ~ /Folder1/){
FLAG=1
}

if (FLAG == 1) {
if (($0 ~/Folder/) || ($0 ~ /^[ \t]+|[ \t]+\(File\)$/) || ($0 ~ /Lines/) || ($1 ~ /^[0-9A-Fa-f]{5}+$/)) {
split ($0,VAL,FS)

if ($NF ~ /\(File\)/) {
CSG=$(NF-1);printf ("%s,", CSG)
}
if ($0 ~ /Lines/) {
## split ($0,VAL,FS)
        LN=VAL[2]
        LNN=(substr( LN,2,length(LN)-2))
}

if ($1 ~ /^[0-9A-Fa-f]{5}+$/) {
## split ($0,VAL,FS)
        CHR=VAL[NF]
        }
      }
   }
}
END {printf ("%s %s %s\n", CSG, (substr(LNN, 1, length(LNN)-1)), CHR)}'

My current output is as follows. As you can see, the only value I get for CHAR is the last one, 375. Also, can you help me understand why I am getting the file name test2_file twice (test2_file,test2_file)?

File Names               Lines CHARS
--------------           ----  ------
test_file,test1_file,test2_file,test2_file 9 375

I am expecting the following output:

File Names               Lines CHARS
--------------              ----  ------
test_file,test1_file,test2_file  9 188,375

As usual, you guys are rock stars, and I would appreciate your help.

rdrtx1 · January 24, 2018, 2:25pm

awk 'BEGIN {
   printf ("%-40s %-5s %-15s\n", "File Names","Lines", "CHARS")
   printf ("%-40s %-5s %-15s\n", "--------------","-----","------")
}

$NF ~ /\(File\)/ {
   CSG=CSG $(NF-1) ","
}

$0 ~ /Lines/ {
   gsub("[^0-9]", "")
   LNN=$1
}

$1 ~ /^[0-9A-Fa-f]+$/ && length($1)==5 {
   if (! c[$NF]) CHR=CHR $NF ","
   c[$NF]=$NF
}

END {
   sub(",*$", "", CSG)
   sub(",*$", "", CHR)
   printf ("%-40s %-5s %-15s\n", CSG, LNN, CHR)
}' junk1.jnk

shunya · January 24, 2018, 3:28pm

Hi rdrtx1...this is superb!

Can you educate me a little bit about the following lines?

gsub("[^0-9]", "")

if (! c[$NF]) CHR=CHR $NF ","
   c[$NF]=$NF

Thank you for your help!

rdrtx1 · January 24, 2018, 3:46pm

gsub("[^0-9]", "") # eliminate all non-digits
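To see what that call does: with only two arguments, gsub() works on the whole record $0, and awk re-splits the fields once $0 has changed, which is why $1 holds the count afterwards. A minimal sketch, assuming the "Lines (9):" line from the sample (the shell variable out is only there for illustration):

```shell
out=$(printf '   Lines (9):\n' | awk '
/Lines/ {
   gsub("[^0-9]", "")   # no third argument, so $0 itself is edited
   print $1             # $0 is now "9", so $1 is "9" as well
}')
echo "$out"   # 9
```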

if (! c[$NF]) CHR=CHR $NF ","
   c[$NF]=$NF

# if last field was not stored in c array then add to CHR string (eliminate duplicates)

Better yet, use if (! ($NF in c)) CHR=CHR $NF "," in case the $NF values include zero: a stored 0 tests as false, so ! c[$NF] would add it again.
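A small sketch of the difference, using made-up input in which the value 0 appears twice (the data and the shell variable names are assumptions for illustration, not taken from the original file):

```shell
data='0
0
188'

# Truthiness test: c["0"] holds 0, which is false, so the second 0 is added again.
by_value=$(printf '%s\n' "$data" | awk '
{ if (! c[$NF]) CHR = CHR $NF ","; c[$NF] = $NF }
END { sub(",*$", "", CHR); print CHR }')

# Membership test: only asks whether the index exists, whatever value it holds.
by_index=$(printf '%s\n' "$data" | awk '
{ if (! ($NF in c)) CHR = CHR $NF ","; c[$NF] = $NF }
END { sub(",*$", "", CHR); print CHR }')

echo "$by_value"   # 0,0,188
echo "$by_index"   # 0,188
```

With the sample file the two forms behave the same, since neither 188 nor 375 is zero or empty.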

Don_Cragun · January 24, 2018, 7:22pm

There is no reason to use cat to feed data to awk; awk is perfectly capable of reading files on its own. Using cat causes all of the data to be read and written an extra time, consumes more system resources, and slows down your script.
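As a quick illustration, the two invocations below give the same result; the second simply names the file as an operand to awk. This is only a sketch: the temporary file and the variable names are assumptions made for the example.

```shell
tmp=$(mktemp)
printf '  Folder1 : test_file (File)\n   test1_file (File)\n' > "$tmp"

# Extra process, and the data is copied through a pipe:
with_cat=$(cat "$tmp" | awk '$NF == "(File)" { print $(NF - 1) }')

# awk opens the file itself:
direct=$(awk '$NF == "(File)" { print $(NF - 1) }' "$tmp")

rm -f "$tmp"
echo "$direct"
```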

Note that in the line CSG=$(NF-1);printf ("%s,", CSG) in your code, you are careful to print each filename value (followed by a comma) when you find one. But you then print the last filename found a second time when you get to the END clause in your awk script, which is why test2_file appears twice.

You don't do that with the values you store in the CHR variable: each new value overwrites the previous one, so you print just the last value found instead of all of them. And there isn't any check in your code to look for matching values to eliminate duplicates.
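Both effects can be reproduced with a stripped-down version of that logic. This is a sketch with made-up input lines, not your exact script:

```shell
out=$(printf 'a (File)\nb (File)\n0001 x 188\n0002 x 375\n' | awk '
$NF == "(File)" { CSG = $(NF - 1); printf("%s,", CSG); next }  # prints "a," then "b,"
                { CHR = $NF }                                  # overwrites: 188, then 375
END             { printf("%s %s\n", CSG, CHR) }')               # CSG still holds "b"
echo "$out"   # a,b,b 375
```

The last filename shows up twice because it is printed once when found and once more in END, and only 375 survives because each assignment to CHR replaces the previous value.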

You might also have noticed that your two heading lines do not line up with each other or with the data line that you print at the end.

The code rdrtx1 suggested accumulates the comma-separated value strings, always adding a comma to the end of the string when a new value is added, and then removes the last comma in the END clause. That code also lines up header columns and data columns as long as the list of filenames isn't more than 40 characters long.

The following code self-adjusts the headings to match the data found in the file being processed. It takes a shortcut, assuming that no field will contain data that is longer than 61 characters. If your real data will have one or more fields longer than that, the DASHES variable needs to have more dashes added to its value, or the second printf in the END clause needs to be replaced by three loops that print as many dashes as are needed for each of the three headings. (I will leave that adjustment as an exercise for the reader.)

It also uses a function to add values to the two string variables, and only adds a comma as a subfield separator when the string isn't empty to start with.

awk '
function AddVal(Value, String) {
	# Add "Value" to a comma-separated value string identified by "String"
	# or, if it does not already exist, create it.
	String = ((String == "" ? "" : String ",")) Value

	# Return the new value for "String".
	return(String)
}

$NF == "(File)" {
	# Add a filename to the CSG variable.
	CSG = AddVal($(NF - 1), CSG)	
	next
}

$1 == "Lines" {
	# Grab the number of lines to be reported.
	match($0, /[[:digit:]]+/)	# I assume this is a decimal number.
	LNN = substr($0, RSTART, RLENGTH)
	next
}

$1 ~ /^[[:xdigit:]]{5}$/ {
	# We found a 5 hexadecimal digit string in $1, determine if we have
	# seen the value in the last field before...
	if ($NF in seen)
		next	# We have seen it, move on to the next input record.
	# We have not seen it before.  Note that we have seen it now...
	seen[$NF]
	# and add this value to the CHR variable.
	CHR = AddVal($NF, CHR)
}

END {	# Set DASHES to a long string of dashes...
	DASHES = "-------------------------------------------------------------"
	# Calculate the longest string to be printed in the filenames field...
	fnl = ((l1 = length("File Names")) > (l2 = length(CSG))) ? l1 : l2
	# and in the lines field...
	ll = ((l1 = length("Lines")) > (l2 = length(LNN))) ? l1 : l2
	# and in the CHARS field.
	vall = ((l1 = length("CHARS")) > (l2 = length(CHR))) ? l1 : l2

	# Print the two line header adjusted to fit the actual data.
	printf("%-*.*s %-*.*s %-*.*s\n", fnl, fnl, "File Names",
	    ll, ll, "Lines", vall, vall, "CHARS")
	printf("%-*.*s %-*.*s %-*.*s\n", fnl, fnl, DASHES,
	    ll, ll, DASHES, vall, vall, DASHES)
	# Print the accumulated data.
	printf ("%*.*s %*.*s %*.*s\n", fnl, fnl, CSG,
	    ll, ll, LNN, vall, vall, CHR)
}' junk1.jnk

The code above produces the output:

File Names                      Lines CHARS  
------------------------------- ----- -------
test_file,test1_file,test2_file     9 188,375

while the code suggested by rdrtx1 produces the output:

File Names                               Lines CHARS          
--------------                           ----- ------         
test_file,test1_file,test2_file          9     188,375

and with a different input file containing:

  Folder1                    : test_file     (File)
                                test1_file    (File)
                                test2_file    (File)
                                test3_file    (File)
   Lines (8):
    00140  Li                      CHAR                         188
    00141  Li                      CHAR                         188
    00142  Li                      CHAR                         190
    00143  Li                      CHAR                         190
    00144  Li                      CHAR                         192
    00145  Li                      CHAR                         375
    00146  Li                      CHAR                         375
    00147  Li                      CHAR                         395

the code above produces the output:

File Names                                 Lines CHARS              
------------------------------------------ ----- -------------------
test_file,test1_file,test2_file,test3_file     8 188,190,192,375,395

while the code suggested by rdrtx1 would produce the output:

File Names                               Lines CHARS          
--------------                           ----- ------         
test_file,test1_file,test2_file,test3_file 8     188,190,192,375,395

Hopefully, these two suggestions will give you some ideas you can use as you hone your awk expertise.
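For the exercise mentioned above (fields longer than the fixed DASHES string), one possible approach is a small helper that builds a run of dashes of whatever length is needed. This is only a sketch: Dashes() is a hypothetical function name, not part of either script in this thread, and the BEGIN block hard-codes a sample value purely for demonstration.

```shell
out=$(awk '
# Hypothetical helper (not part of the code in this thread):
# return a string of n dashes, built one character at a time.
function Dashes(n,    s) {
	while (n-- > 0)
		s = s "-"
	return s
}
BEGIN {
	CSG = "test_file,test1_file,test2_file"
	print Dashes(length(CSG))	# as many dashes as the data is wide
	print CSG
}')
echo "$out"
```

In the END clause of the longer script, calls such as Dashes(fnl), Dashes(ll), and Dashes(vall) could then stand in for the three DASHES arguments.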

shunya · January 25, 2018, 10:57am

Awesome Don! You explained each and every line ... This is very helpful. Thank you!