Issue with awk when joining two files when field has '-' hyphen

Dear Community;

I need to join two files but I am facing issues.

1st file has multiple columns. Primary (1st) columns has unique values. There are other columns out of which some has non-ascii characters as well (other language).
Example File below:

 
1-1001JRL,BiRecurring
1-1001W5O,QuadRecurring
1-1002KQK,SpclRecurring
1-1002KR9,1-Recurring
1-1002KRN,2-Recurring
1-1002KS1,One-Time
1-1003BAY,Recurring

2nd file has multiple columns as well. Primary (1st) columns has repeated values.
Example File below:

 
1-1001JRL,abc-1
1-1001JRL,abc-2
1-1001JRL,abc-3
1-1001JRL,abc-4
1-1001JRL,abc-5
1-1001JRL,abc-6
1-1001JRL,abc-7
1-1001JRL,abc-8
1-1001W5O,def-1
1-1001W5O,def-2
1-1001W5O,def-3
1-1001W5O,def-4
1-1001W5O,def-5

I need to join these two files. Expected Output:

 
1-1001JRL,abc-1,1-1001JRL,BiRecurring
1-1001JRL,abc-2,1-1001JRL,BiRecurring
1-1001JRL,abc-3,1-1001JRL,BiRecurring
1-1001JRL,abc-4,1-1001JRL,BiRecurring
1-1001JRL,abc-5,1-1001JRL,BiRecurring
1-1001JRL,abc-6,1-1001JRL,BiRecurring
1-1001JRL,abc-7,1-1001JRL,BiRecurring
1-1001JRL,abc-8,1-1001JRL,BiRecurring
1-1001W5O,def-1,1-1001W5O,QuadRecurring
1-1001W5O,def-2,1-1001W5O,QuadRecurring
1-1001W5O,def-3,1-1001W5O,QuadRecurring
1-1001W5O,def-4,1-1001W5O,QuadRecurring
1-1001W5O,def-5,1-1001W5O,QuadRecurring
1-1002KQK,SpclRecurring
1-1002KR9,1-Recurring
1-1002KRN,2-Recurring
1-1002KS1,One-Time
1-1003BAY,Recurring

I used below awk commands:

awk 'NR==FNR {T[$1]=FS $1; next} {print $0 T[$1]}' FS="," File1 File2 
awk -F"," 'FNR==NR{A[$1]=$1;B[$1]=$0;next} ($1 in A){print $0 FS B[$1]} !($1 in A){print $0}' File1 File2 

Both the commands are only printing output of 2nd file as the condition for matching the fields are not macthing. I tried using the files after removing all non-ascii columns by just keeping the primary columns as well and it did not work.
I am suspecting that it is because of the presence of spcl character '-' hyphen.

Kindly assist and suggest.

Best Regards//

awk -F, 'NR==FNR {T[$1]=$2;B[$1]=$2; next} ($1 in T) { $0 = $0 FS $1 FS T[$1] ;delete B[$1]} END { for(i in B) print i FS B } 1' file1 file2

And I also made a mistake in the name of the array at the output. Fixed T on B

1 Like

Similar but slightly simplified approach:

awk -F, '
NR==FNR         {T[$1] = $0
                 next
                }
($1 in T)       {$0 = $0 FS T[$1]
                 B[$1]
                }
END             {for (t in T) if (!(t in B)) print T[t]
                }
1
' file[12]

Be aware that the order that the residual T elements are printed in is arbitrary by awk implementation. If you need them in the order given in file1, additional measures must be taken.

2 Likes

Dear Both;

Many thanks for the response. Unfortunately - these commands are not working as expected.

Input File 1:

$ cat test2.txt
1-1001JRL,Recurring
1-1001W5O,One-Time
1-KHC3G1,Recurring
1-JKP5QY,One-Time
1-13M7ORR,One-Time
1-LXZSK3,Recurring
1-80QXOZ,One-Time
3-83ODZZ,Recurring
1-VKWGEP,Recurring
3-2FCQ4Y,Recurring
1-W65E6A,Recurring
1-13M7BLZ,Recurring
3-14UC92,Recurring
1-QRMHWU,One-Time

Input File 2:

cat test1.txt
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O

Output:

 awk -F, 'NR==FNR{T[$1] = $0;next}($1 in T){$0 = $0 FS T[$1]B[$1]}END{for (t in T) if (!(t in B)) print T[t]}1' test2.txt test1.txt
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001JRL
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-1001W5O
1-VKWGEP,Recurring
1-W65E6A,Recurring
3-83ODZZ,Recurring
1-QRMHWU,One-Time
1-KHC3G1,Recurring
1-13M7ORR,One-Time
1-JKP5QY,One-Time
1-1001W5O,One-Time  --> This line was not matched.
1-1001JRL,Recurring --> This line was not matched.
3-14UC92,Recurring
3-2FCQ4Y,Recurring
1-13M7BLZ,Recurring
1-LXZSK3,Recurring
1-80QXOZ,One-Time

Expected Output:

1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-VKWGEP,Recurring
1-W65E6A,Recurring
3-83ODZZ,Recurring
1-QRMHWU,One-Time
1-KHC3G1,Recurring
1-13M7ORR,One-Time
1-JKP5QY,One-Time
3-14UC92,Recurring
3-2FCQ4Y,Recurring
1-13M7BLZ,Recurring
1-LXZSK3,Recurring
1-80QXOZ,One-Time

DON'T manipulate scripts without EXACTLY knowing what you are doing, here: condensing a multi line script into a one liner. By sheer accident, your unfortunate manipulation still works for me:

awk -F, 'NR==FNR{T[$1] = $0;next}($1 in T){$0 = $0 FS T[$1]B[$1]}END{for (t in T) if (!(t in B)) print T[t]}1' file[12]
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001JRL,1-1001JRL,Recurring
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-1001W5O,1-1001W5O,One-Time
1-VKWGEP,Recurring
1-QRMHWU,One-Time
1-13M7BLZ,Recurring
1-JKP5QY,One-Time
1-LXZSK3,Recurring
1-KHC3G1,Recurring
3-14UC92,Recurring
1-80QXOZ,One-Time
1-W65E6A,Recurring
3-83ODZZ,Recurring
1-13M7ORR,One-Time
 3-2FCQ4Y,Recurring

Please show the result of running any script AS GIVEN! on the samples in post #1. If using a different sample set, make sure the structure is identical to the original samples.

1 Like

I made a mistake in the name of the variable on the output and fixed it, but it is better to correct the code from RudiC
Copy my code else and replace files places