Replace one column from fixed width file with another column from another file

pchang · March 9, 2020, 12:59pm

Hi Forum.

I tried to search online for the solution but most of the examples I found the data that I'm trying to manipulate doesn't quite match (col pos#134, 12 bytes) and my code is not working as expected.

Format of the output file should remain the same and if the data value from File 1 is not found in File 2, it should replaced with blank spaces.

Sample Input Data - File 1:

1GTD         03/03/20R20200303010004170USD
23923                     3684761733     000000378183JAN20 ABC                     01/31/20NET20       02/20/20  3605        1007  DR606951-000  1020  Software                    N/A            GST_SELF
23923                     3684787963     000001258323JAN20 BBB C                   01/31/20NET20       02/20/20  3605        1007  DR606951-000  1020  Software                    N/A            GST_SELF
26288                     40169          000000250000PRO SERV                      12/31/18NET20       01/20/19  3605        1007  DR607650-000  1020  Software                     N/A            GST_SELF
26288                     INV-600        000000400000PRO SERV REMOTE               05/31/19NET20       06/20/19  3605        1007  DR607650-000  1020  Software                     N/A            GST_SELF
26731                     26955          000003519000MAR20-FEB21 PRO               01/23/20NET20       02/12/20  3605        1007  DR162010-000  1007  Software                     N/A            GST_SELF
9GTD         03/03/20R20200303010004000000005805506000000000000000000005

Sample Input Data - File 2:

606951-000|7543
607650-000|9654
100145-050|

Output expected:

1GTD         03/03/20R20200303010004170USD
23923                     3684761733     000000378183JAN20 ABC                     01/31/20NET20       02/20/20  3605        1007  DR7543        1020  Software                     N/A            GST_SELF
23923                     3684787963     000001258323JAN20 BBB C                   01/31/20NET20       02/20/20  3605        1007  DR7543        1020  Software                     N/A            GST_SELF
26288                     40169          000000250000PRO SERV                      12/31/18NET20       01/20/19  3605        1007  DR9654        1020  Software                     N/A            GST_SELF
26288                     INV-600        000000400000PRO SERV REMOTE               05/31/19NET20       06/20/19  3605        1007  DR            1020  Software                     N/A            GST_SELF
26731                     26955          000003519000MAR20-FEB21 PRO               01/23/20NET20       02/12/20  3605        1007  DR            1007  Software                     N/A            GST_SELF
9GTD         03/03/20R20200303010004000000005805506000000000000000000005

cat File1.txt | while read line
do
  code=`echo "${line}" | awk '{print substr($0,134,12)}' | awk '{ gsub(/[ ]+/,""); print }'` 
  cn=$(awk -v CID=$code '$1==CID {print $2}' FS=\| File2.txt)
  awk -v CN=$cn 'BEGIN {FIELDWIDTHS="1 10 15 15 12 30 8 12 8 2 12 6 2 12 6 15 14 15 8"} {$14=CN} 1'
done

Any help would be greatly appreciated.

Thank you.

nezabudka · March 9, 2020, 3:16pm

Hi
try this

awk '
NR==FNR  {pat["DR"$1]=$2; next}
NF>12    { sub($(NF-4),"DR" pat[$(NF-4)])}
1' FS='|' File2.txt FS='[[:blank:]]+' File1.txt

pchang · March 9, 2020, 3:54pm

Thanks nezabudka for your response. Your code yielded a very close results - only thing is that the update column value is not 12 bytes long (like the original).

Also, if I read your code correctly, you are looking for a "DR" string - it doesn't always have to be DR - could be something else. Also, what would happen if DR value appears in a different column position.

1GTD         03/03/20R20200303010004170USD
23923                     3684761733     000000378183JAN20 ABC                     01/31/20NET20       02/20/20  3605        1007  DR7543  1020  Software                     N/A            GST_SELF
23923                     3684787963     000001258323JAN20 BBB C                   01/31/20NET20       02/20/20  3605        1007  DR7543  1020  Software                     N/A            GST_SELF
26288                     40169          000000250000PRO SERV                      12/31/18NET20       01/20/19  3605        1007  DR9654  1020  Software                     N/A            GST_SELF
26288                     INV-600        000000400000PRO SERV REMOTE               05/31/19NET20       06/20/19  3605        1007  DR9654  1020  Software                     N/A            GST_SELF
26731                     26955          000003519000MAR20-FEB21 PRO               01/23/20NET20       02/12/20  3605        1007  DR  1007  Software                     N/A            GST_SELF
9GTD         03/03/20R20200303010004000000005805506000000000000000000005

nezabudka · March 9, 2020, 5:00pm

Hi, @pchang
NF>12 We select for modification only those lines where there are more than 12 fields(columns).
In other words, you can NF>3 to cut off the first and last lines and avoid the error output.
Because the first and last line contains only 2 fields, which means if we subtract from 2-4 (FN-4)
we get an error - the fields with the number $(-2) cannot exist.
We select only 5 from the end of the fields for substitution (NF-4), which means
that the appearance of the 'DR' in other fields will remain untouched.
If a different letter prefix is expected in the fourth field from the end
then the program must be rewritten

pchang · March 9, 2020, 5:18pm

Hi nezabudka.

This is an external file that we receive from a third party vendor and there's no guarantee that we will always have a space in between the fields (so NF-4 might not always work correctly).

I think it would be better to look for field value starting at column position#134 for 12 bytes and replace that value. Then we wouldn't need to be concerned if it's "DR" or "CR" or something else.

Unfortunately, I'm stuck on how to go about writing the code.

Let me know if you need any other clarifications.

Thanks
Paul

RudiC · March 9, 2020, 5:49pm

Adapting nezabudka's proposal, (untested):

awk '
NR==FNR    {pat[$1] = $2
            next
           }

           {$0 = substr ($0, 1, 133) sprintf ("%12s", pat[substr ($0, 134, 12)]) substr ($0, 147)
           }
1
' FS='|' File2.txt  File1.txt

nezabudka · March 10, 2020, 2:47am

awk '
NR==FNR {pat[$1]=$2; next}
NF>12   {b=a=$(NF-4)
         sub(/[0-9-]+$/, "", a)
         sub(/^[[:alpha:]]+/, "", b)
         sub($(NF-4), a pat)}
1' FS='|' File2.txt FS='[[:blank:]]+' File1.txt

--- Post updated at 10:30 ---

Hi, @pchang
I did not notice your comment. I think it would be more correct to finalize the program from @RudiC

--- Post updated at 10:47 ---

I seem to have counted correctly

NR==FNR    {pat[$1] = $2
            next
           }
           {$0 = substr ($0, 1, 133) pat[substr ($0, 134, 10)] substr($0, 144)
           }
1 ' FS='|' File2.txt  File1.txt

RudiC · March 10, 2020, 5:32am

The OP specified

, that's what the formatted sprintf is for. As the sample data are just 10 byte long, not 12, we need to wait until resolved by the OP, we can just guess. Try this simplified but more flexible version

awk -vPOS=134 -vLEN=12 '
NR==FNR    {pat[$1] = $2
            next
           }

           {print substr ($0, 1, POS-1) sprintf ("%*s", LEN, pat[substr ($0, POS, LEN)]) substr ($0, POS+LEN)
           }
' FS='|' File2.txt  File1.txt

nezabudka · March 10, 2020, 8:34am

Hi, @RudiC

awk '
NR==FNR {pat[$1] = $2
         next
        }
        {print substr($0, 134, 12)
        }
' FS='|' File2.txt File1.txt | cat -vet

606951-000  $
606951-000  $
607650-000  $
607650-000  $
162010-000  $

awk '
NR==FNR {pat[$1] = $2
         next
        }
        {print pat[substr($0, 134, 12)]
        }
' FS='|' File2.txt File1.txt | cat -vet
$
...
$

awk '
NR==FNR {pat[$1] = $2
         next
        }
        {print pat[substr($0, 134, 10)]
        }
' FS='|' File2.txt File1.txt | cat -vet
7543$
7543$
9654$
9654$
$

Then it means so?

pat[$1"  "] = $2

pchang · March 10, 2020, 9:43am

Thank you to Nezabudka and Rudi C for your inputs.

After doing some additional testing, here's the final code after some tweaking:

awk '
NR==FNR    {pat[$1] = $2
            next
           }
           {$0 = substr ($0, 1, 133) sprintf ("%-12s", pat[substr ($0, 134, 12)]) substr($0, 146)
           }
1 ' FS='|' File2.txt  File1.txt