I am processing a file with awk to extract a few input variables that I'll use later in my script. I am still learning to script with awk, so please point out any mistakes I made in my code. A sample of the file follows:
# cat junk1.jnk
Folder1 : test_file (File)
test1_file (File)
test2_file (File)
Lines (9):
00140 Li CHAR 188
00141 Li CHAR 188
00142 Li CHAR 188
00143 Li CHAR 188
00144 Li CHAR 188
00145 Li CHAR 375
00146 Li CHAR 375
00147 Li CHAR 375
I am trying to extract a comma-separated list of the file names (identified by the last field being "(File)"), followed by the number of lines (9 here), and a comma-separated list of the unique CHAR values, that is, the last field of each line that starts with a hex value after the string "Lines (9):". I am using the following code. I get the file names and the line count, but I am unable to get the comma-separated list of unique CHAR values. In this case it should be 188,375.
My current output is as follows. As you can see, the only value I get for CHAR is the last one, 375. Could you also help me understand why I am getting the file name test2_file twice (test2_file,test2_file)?
There is no reason to use cat to feed data to awk; awk is perfectly capable of reading files on its own. Using cat causes all of the data to be read and written an extra time, consumes more system resources, and slows down your script.
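As a small illustration of the point about cat, here is a sketch that runs the same awk program both ways. The temporary file, its two sample lines, and the variable names (tmpfile, direct, piped) are made up for this example and are not taken from your script.

```shell
# Create a small sample file (made-up contents, for illustration only).
tmpfile=$(mktemp)
printf '%s\n' 'test1_file (File)' 'test2_file (File)' > "$tmpfile"

# Preferred: name the file as an operand and let awk open it itself.
direct=$(awk '$NF == "(File)" { print $1 }' "$tmpfile")

# Same result, but with an extra process and an extra copy of the data.
piped=$(cat "$tmpfile" | awk '$NF == "(File)" { print $1 }')

rm -f "$tmpfile"
echo "$direct"
```

Both forms give the same output; the first simply does less work.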
Note that in your code you are careful to print each filename value (followed by a comma) when you find one. But you then also print the last filename found when you get to the END clause in your awk script, which is why test2_file shows up twice.
You don't do that with the values you store in the CHR variable, so you just print the last value found instead of all of them. And there isn't any check in your code to look for matching values to eliminate duplicates.
You might have also noticed that your two heading lines don't line up with each other nor with the data line that you print at the end.
The code rdrtx1 suggested accumulates the comma-separated value strings, always adding a comma to the end of the string when a new value is added, and then removes the last comma in the END clause. That code also lines up header columns and data columns as long as the list of filenames isn't more than 40 characters long.
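The trailing-comma technique described above can be sketched roughly as follows. This is not rdrtx1's actual code; the input values are fed in with printf, and the names list, seen, and chars are my own choices for this illustration.

```shell
# Feed a few sample CHAR values, one per line (made-up input).
chars=$(printf '%s\n' 188 188 375 375 | awk '
!($1 in seen) {
	# First time we see this value: remember it...
	seen[$1]
	# and append it to the list, always followed by a comma.
	list = list $1 ","
}
END {
	# Remove the single trailing comma before printing.
	sub(/,$/, "", list)
	print list
}')
echo "$chars"
```

With this input the list comes out as 188,375. The trade-off is one extra sub() call in the END clause instead of a test for an empty string every time a value is added.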
The following code self-adjusts the headings to match the data found in the file being processed. It takes a shortcut by assuming that no field will contain data longer than 61 characters. If your real data will have one or more fields longer than that, the DASHES variable needs to have more dashes added to its value, or the second printf in the END clause needs to be replaced by three loops that print as many dashes as are needed for each of the three headings. (I will leave that adjustment as an exercise for the reader.)
It also uses a function to add values to the two string variables, and it only adds a comma as a subfield separator when the string isn't empty to start with.
awk '
function AddVal(Value, String) {
	# Add "Value" to a comma-separated value string identified by "String"
	# or, if it does not already exist, create it.
	String = (String == "" ? "" : String ",") Value
	# Return the new value for "String".
	return(String)
}
$NF == "(File)" {
	# Add a filename to the CSG variable.
	CSG = AddVal($(NF - 1), CSG)
	next
}
$1 == "Lines" {
	# Grab the number of lines to be reported.
	match($0, /[[:digit:]]+/)	# I assume this is a decimal number.
	LNN = substr($0, RSTART, RLENGTH)
	next
}
$1 ~ /^[[:xdigit:]]{5}$/ {
	# We found a 5 hexadecimal digit string in $1, determine if we have
	# seen the value in the last field before...
	if($NF in seen)
		next	# We have seen it, move on to the next input record.
	# We have not seen it before.  Note that we have seen it now...
	seen[$NF]
	# and add this value to the CHR variable.
	CHR = AddVal($NF, CHR)
}
END {
	# Set DASHES to a long string of dashes...
	DASHES = "-------------------------------------------------------------"
	# Calculate the longest string to be printed in the filenames field...
	fnl = ((l1 = length("File Names")) > (l2 = length(CSG))) ? l1 : l2
	# and in the lines field...
	ll = ((l1 = length("Lines")) > (l2 = length(LNN))) ? l1 : l2
	# and in the CHARS field.
	vall = ((l1 = length("CHARS")) > (l2 = length(CHR))) ? l1 : l2
	# Print the two line header adjusted to fit the actual data.
	printf("%-*.*s %-*.*s %-*.*s\n", fnl, fnl, "File Names",
	    ll, ll, "Lines", vall, vall, "CHARS")
	printf("%-*.*s %-*.*s %-*.*s\n", fnl, fnl, DASHES,
	    ll, ll, DASHES, vall, vall, DASHES)
	# Print the accumulated data.
	printf("%*.*s %*.*s %*.*s\n", fnl, fnl, CSG,
	    ll, ll, LNN, vall, vall, CHR)
}' junk1.jnk
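If your fields can be longer than the fixed DASHES string, one possible way to do the adjustment mentioned above is to build each run of dashes in a loop. The sketch below is only an illustration: Dashes() is a hypothetical helper that is not part of the script above, and the three widths are hard-coded sample values standing in for fnl, ll, and vall as the END clause would compute them.

```shell
out=$(awk '
# Dashes(n): hypothetical helper returning a string of n dashes.
# The extra parameter s is the usual awk idiom for a local variable.
function Dashes(n,    s) {
	s = ""
	while (n-- > 0)
		s = s "-"
	return s
}
BEGIN {
	# Sample widths; in the real script these come from the END clause.
	fnl = 10; ll = 5; vall = 7
	# This printf would replace the one that prints substrings of DASHES.
	printf("%s %s %s\n", Dashes(fnl), Dashes(ll), Dashes(vall))
}')
echo "$out"
```

Because each run of dashes is built to the exact width needed, there is no longer any upper limit on the field length.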