Replace substring by longest string in common field (awk)

beca123456 · January 14, 2020, 5:44am

Hi,

Let's say I have a pipe-separated input like so:

name_10|A|BCCC|cat_1
name_11|B|DE|cat_2
name_10|A|BC|cat_3
name_11|B|DEEEEEE|cat_4

Using awk, for records that share the same field 2, I am trying to replace every shorter field 3 value with the longest field 3 value of that group.
This should give the following (field 3 changes on lines 2 and 3):

name_10|A|BCCC|cat_1
name_11|B|DEEEEEE|cat_2
name_10|A|BCCC|cat_3
name_11|B|DEEEEEE|cat_4

Here is the start of my code so far, but I am stuck:

echo -e "name_10|A|BCCC|cat_1\nname_11|B|DE|cat_2\nname_10|A|BC|cat_3\nname_11|B|DEEEEEE|cat_4" |
awk '
BEGIN{FS="|"}
{
    if(a[$2] < length($3)){
        a[$2]=$3
    }
}
END{
    for(i in a){
        print i FS a
    }
}'

RavinderSingh13 · January 14, 2020, 6:07am

Hello beca123456,

Could you please try the following?

awk 'BEGIN{FS=OFS="|"} FNR==NR{b[$1]=length($3)>a[$1]?$3:b[$1];a[$1]=length($3)>a[$1]?length($3):a[$1];next} length($3)<a[$1]{$3=b[$1]} 1'  Input_file  Input_file

A non-one-liner form of the solution is:

awk '
BEGIN{
  FS=OFS="|"
}
FNR==NR{
  b[$1]=length($3)>a[$1]?$3:b[$1]
  a[$1]=length($3)>a[$1]?length($3):a[$1]
  next
}
length($3)<a[$1]{
  $3=b[$1]
}
1
'   Input_file  Input_file

Output will be as follows.

name_10|A|BCCC|cat_1
name_11|B|DEEEEEE|cat_2
name_10|A|BCCC|cat_3
name_11|B|DEEEEEE|cat_4
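
Note that the code above keys its arrays on $1, while the question groups records by field 2; with the sample data the two fields happen to vary together, so the output is the same. A minimal sketch of the same two-pass idea keyed on $2 could look like this (longest_by_f2, len and val are hypothetical names chosen for illustration, not part of the posted solution):

```shell
# Sketch only: two-pass approach keyed on field 2.
# longest_by_f2 is a hypothetical helper; it expects a regular file as its argument.
longest_by_f2() {
  awk '
  BEGIN { FS = OFS = "|" }
  FNR == NR {                                # first pass over the file
    if (length($3) > len[$2]) {              # longer $3 than seen so far for this $2
      len[$2] = length($3)                   # remember its length
      val[$2] = $3                           # and the string itself
    }
    next
  }
  length($3) < len[$2] { $3 = val[$2] }      # second pass: replace shorter values
  1                                          # print every line
  ' "$1" "$1"
}
```

It would be called as longest_by_f2 Input_file. The difference only shows up when records with the same field 2 carry different values in field 1.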

Thanks,
R. Singh

beca123456 · January 14, 2020, 6:55am

Brilliant !

However, although I think I understand the following lines, I cannot place them in the context:

# When reading the input file for the first time, create an array based on $1 for which the values are $3 if length($3)>a[$1] or b[$1] if not.
# How come this line does not trigger an error since a[$1] is not yet defined?
b[$1]=length($3)>a[$1]?$3:b[$1]

# Defining a[$1]
# if length($3) > a[$1], then a[$1] equals length($3), otherwise equals a[$1]
a[$1]=length($3)>a[$1]?length($3):a[$1]

RudiC · January 14, 2020, 6:56am

Try also

awk -F\| '
        {LN[NR] = $0
         L      = length($3)
         if (L>MX[$2])  {MX[$2] = L
                         D3[$2] = $3
                        }
        }
END     {for (n=1; n<=NR; n++)  {$0 = LN[n]
                                 $3 = D3[$2]
                                 print
                                }
        }
' OFS=\| file
name_10|A|BCCC|cat_1
name_11|B|DEEEEEE|cat_2
name_10|A|BCCC|cat_3
name_11|B|DEEEEEE|cat_4

On awk versions that do not preserve NR's value in the END section, you'll need a temporary variable to carry its value over.
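
For such versions, one possible shape of the temporary-variable workaround is sketched below. It is the same program as above with the record count copied into a variable on every line; cnt and the wrapper name longest_keep_count are hypothetical names used only for this illustration.

```shell
# Sketch only: copy NR into a temp variable so END does not depend on NR.
# longest_keep_count is a hypothetical wrapper; it expects a file as its argument.
longest_keep_count() {
  awk -F'|' '
        { LN[NR] = $0                       # remember every input line
          L = length($3)
          if (L > MX[$2]) { MX[$2] = L      # new longest length for this key
                            D3[$2] = $3 }   # and the string that has it
          cnt = NR                          # temp copy of the record count
        }
  END   { for (n = 1; n <= cnt; n++) { $0 = LN[n]; $3 = D3[$2]; print } }
  ' OFS='|' "$1"
}
```

After the last record cnt holds the total number of lines read, so the loop in END no longer refers to NR at all.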

RavinderSingh13 · January 14, 2020, 7:10am

Thank you

As for your question: a[$1] does not throw an error because an uninitialized variable (or array element) in awk is treated as empty (null) when it is used in a condition or expression, so no error is raised.

I am adding a detailed explanation of my solution above:

awk '                                                ##Starting awk program from here.
BEGIN{                                               ##Starting BEGIN section of this awk code here.
  FS=OFS="|"                                         ##Setting FS and OFS as pipe here.
}
FNR==NR{                                             ##Checking condition if FNR==NR which will be TRUE when first time Input_file is being read.
  b[$1]=length($3)>a[$1]?$3:b[$1]                    ##Setting array b with index $1: if length of $3 is greater than a[$1] then store $3, else keep the OLD value in it.
  a[$1]=length($3)>a[$1]?length($3):a[$1]            ##Setting array a with index $1: if length of $3 is greater than a[$1] then store length($3), else keep the OLD value in it. Array a holds the longest length seen so far for index $1, to be used later in the condition.
  next                                               ##next will skip all further statements from here.
}
length($3)<a[$1]{                                    ##Checking condition if length of 3rd field is less than value of array a with index $1 then
  $3=b[$1]                                           ##Setting current $3 to value of array b with index of $1 here.
}
1                                                    ##1 will print edited/non-edited values of current line.
'  Input_file Input_file                             ##Mentioning Input_file 2 times here.

MadeInGermany · January 14, 2020, 7:54am

An uninitialized variable (or array element) evaluates to 0 in numeric context and to "" in string context.
In this case 0 is a perfect starting value, since it is the smallest possible string length.
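
A small sketch to see this behaviour directly; the array a and key "x" are never assigned, and uninit_demo is just a hypothetical wrapper name for the illustration:

```shell
# Sketch only: an unset array element compared in numeric and string context.
# uninit_demo is a hypothetical name; the awk program needs no input.
uninit_demo() {
  awk 'BEGIN {
    if (a["x"] == 0)  print "numeric context: 0"       # compares as the number 0
    if (a["x"] == "") print "string context: empty"    # compares as the empty string
    print length(a["x"])                               # length of the empty string
  }'
}
```

Running uninit_demo prints both messages followed by 0, and no error is raised for the unset element.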

With two passes through the input file one only needs one array that holds the longest string:

awk '
BEGIN { FS=OFS="|" }
# NR == FNR when reading the 1st file
NR == FNR {
# 1st file
# a[$2] holds the longest $3
  if (length(a[$2]) < length($3)) a[$2]=$3
# jump to next input cycle, do not run the following code
  next
}
# 2nd file, here: pass 2
{
# always update $3
  $3=a[$2]
  print
}' Input_file Input_file

And, similar to post #4, with one pass through the input file: everything is read into a line[] array and printed in a loop in the END section.

awk '
BEGIN { FS=OFS="|" }
{
# store $0 in line[1..]
  line[NR]=$0
# a[$2] holds the longest $3
  if (length(a[$2]) < length($3)) a[$2]=$3
}
END {
  for (n=1; n<=NR; n++) {
# restore $0
    $0=line[n]
# always update $3
    $3=a[$2]
    print
  }
}
' Input_file