Using awk, for records that share the same field 2, I am trying to replace all the shorter substrings in field 3 with the longest string.
In order to get the following (changes in bold):
However, although I think I understand the following lines individually, I cannot see how they fit into the whole script:
# When reading the input file for the first time, create an array based on $1 for which the values are $3 if length($3)>a[$1] or b[$1] if not.
# How come this line does not trigger an error since a[$1] is not yet defined?
b[$1]=length($3)>a[$1]?$3:b[$1]
# Defining a[$1]
# if length($3) > a[$1], then a[$1] equals length($3), otherwise equals a[$1]
a[$1]=length($3)>a[$1]?length($3):a[$1]
As for your question: a[$1] does not throw an error because awk never requires a variable to be initialized. When an uninitialized variable or array element is used in a condition or expression, its value is treated as the empty string (0 in a numeric context), so there is no error.
Here is a detailed explanation of my solution above:
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this awk code here.
FS=OFS="|" ##Setting FS and OFS as pipe here.
}
FNR==NR{ ##FNR==NR is TRUE only while Input_file is being read for the first time.
b[$1]=length($3)>a[$1]?$3:b[$1] ##Array b, indexed by $1: if the length of $3 is greater than a[$1], store $3 itself, else keep the OLD value.
a[$1]=length($3)>a[$1]?length($3):a[$1] ##Array a, indexed by $1: if the length of $3 is greater than a[$1], store length($3), else keep the OLD value. Array a therefore holds the greatest length seen so far for each $1, to be used later in the condition.
next ##next skips all further statements and moves on to the next record.
}
length($3)<a[$1]{ ##Second read of Input_file: checking if the length of the 3rd field is less than the value of array a with index $1.
$3=b[$1] ##Setting current $3 to value of array b with index of $1 here.
}
1 ##1 will print edited/non-edited values of current line.
' Input_file Input_file ##Mentioning Input_file 2 times here.
An uninitialized variable (or array element) evaluates to 0 in a numeric context and to "" in a string context.
In this case, since 0 is the minimum possible string length, the default of 0 is exactly what is needed.
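To see that behaviour for yourself, here is a small sketch. The names x and a["k"] are made up for the demonstration and are never assigned anywhere in the program:

```shell
# x and a["k"] are hypothetical names that are never assigned
out=$(awk 'BEGIN {
  if (x == 0)  print "numeric: x == 0"      # numeric context: treated as 0
  if (x == "") print "string: x is empty"   # string context: treated as ""
  print "length: " length(a["k"])           # length of the empty string
  print "compare: " (5 > a["k"])            # numeric comparison 5 > 0
}')
printf '%s\n' "$out"
```

This should print "numeric: x == 0", "string: x is empty", "length: 0" and "compare: 1", with no error or warning about the missing values.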
With two passes through the input file one only needs one array that holds the longest string:
awk '
BEGIN { FS=OFS="|" }
# NR == FNR when reading the 1st file
NR == FNR {
# 1st file
# a[$2] holds the longest $3
if (length(a[$2]) < length($3)) a[$2]=$3
# jump to next input cycle, do not run the following code
next
}
# 2nd file, here: pass 2
{
# always update $3
$3=a[$2]
print
}' Input_file Input_file
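A quick way to try the two-pass version is sketched below. The sample records are invented for illustration (they are not the data from the original post), and the temporary file stands in for Input_file:

```shell
# Hypothetical sample data written to a temporary file
tmp=$(mktemp)
printf '%s\n' '1|A|foo' '2|A|foobar' '3|B|x' '4|A|fo' '5|B|xyz' > "$tmp"

# Same logic as above; the file is named twice so awk reads it twice
out=$(awk '
BEGIN { FS=OFS="|" }
NR == FNR {
  # pass 1: remember the longest $3 for each $2
  if (length(a[$2]) < length($3)) a[$2]=$3
  next
}
{
  # pass 2: replace $3 and print
  $3=a[$2]
  print
}' "$tmp" "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

With this sample, every A record should end up with foobar in field 3 and every B record with xyz.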
Similar to post#4, here is a version with a single pass through the input file: every line is stored in a line[] array, and the END section prints the lines in a loop.
awk '
BEGIN { FS=OFS="|" }
{
# store $0 in line[1..]
line[NR]=$0
# a[$2] holds the longest $3
if (length(a[$2]) < length($3)) a[$2]=$3
}
END {
for (n=1; n<=NR; n++) {
# restore $0
$0=line[n]
# always update $3
$3=a[$2]
print
}
}
' Input_file
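Because the one-pass version reads its input only once, it should also work when the data arrives on a pipe, which the two-pass version cannot do since it needs a file it can read twice. A minimal sketch, again with invented sample records rather than the original data:

```shell
# Hypothetical sample data fed through a pipe instead of a file
out=$(printf '%s\n' '1|A|foo' '2|A|foobar' '3|B|x' '4|A|fo' '5|B|xyz' |
awk '
BEGIN { FS=OFS="|" }
{
  line[NR]=$0                                    # keep every record
  if (length(a[$2]) < length($3)) a[$2]=$3       # longest $3 per $2
}
END {
  for (n=1; n<=NR; n++) {
    $0=line[n]                                   # re-split the stored record
    $3=a[$2]                                     # replace $3 with the longest
    print
  }
}')
printf '%s\n' "$out"
```

The trade-off is memory: the whole input is held in line[] until END, so for very large files the two-pass version may be the safer choice.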