Replace and add line in file with line in another file based on matching string

Hi,

I want to achieve something similar to what described in another post:

The difference is I want to add the line if the pattern is not found.

File 1:

A123, valueA, valueB
B234, valueA, valueB
C345, valueA, valueB
D456, valueA, valueB
E567, valueA, valueB
F678, valueA, valueB

File 2:

C345, valueX, valueY
D456, valueX, valueY
G789, value A, valueB

Output:

A123, valueA, valueB
B234, valueA, valueB
C345, valueX, valueY
D456, valueX, valueY
E567, valueA, valueB
F678, valueA, valueB
G789, value A, valueB

I am wondering how you can achieve it using awk.
The code:

awk -F, 'NR==FNR{a[$1]=$0;next;}a[$1]{$0=a[$1]}1' file2 file1

can only replace the value with existing pattern. How can I change it to add the line?

Please help!

Try this:

awk -F, '
        NR == FNR {
                A[$1] = $0
                next
        }
        {
                B[$1] = $0
        }
        A[$1] {
                $0 = A[$1]
        }
        END {
                for ( k in A )
                {
                        if ( ! ( k in B ) )
                                print A[k]
                }
        }
        1
' file2 file1
2 Likes
 awk -F"," 'NR==FNR{a[$1]=$0;next}
{print $0;delete a[$1]}
END {for (i in a){print a}}' File1 file2 | sort
1 Like

Try

awk -F, 'NR==FNR{A[$1]++;print;next} !A[$1]++' file1 file2

@pamu
This does not give correct output as requested, value from file2 is missing.

awk -F, 'NR==FNR{A[$1]++;print;next} !A[$1]++' f1 f2
A123, valueA, valueB
B234, valueA, valueB
C345, valueA, valueB
D456, valueA, valueB
E567, valueA, valueB
F678, valueA, valueB
G789, value A, valueB

Highlighted in red line belongs to file2 :slight_smile:

Please check with OP's output

Hi Pamu,

value from file2 is require if matching. In your case, it's printing from file1

Thanks
Pravin

request

A123, valueA, valueB
B234, valueA, valueB
C345, valueX, valueY
D456, valueX, valueY
E567, valueA, valueB
F678, valueA, valueB
G789, value A, valueB

Replacing file1 and file2 in the awk seems to give correct result.

awk -F, 'NR==FNR{A[$1]++;print;next} !A[$1]++' file2 file1 | sort
A123, valueA, valueB
B234, valueA, valueB
C345, valueX, valueY
D456, valueX, valueY
E567, valueA, valueB
F678, valueA, valueB
G789, value A, valueB
1 Like

Messed up with requirement :o

awk -F, 'NR==FNR{A[$1]=$0;next}{if(A[$1]){print A[$1];delete A[$1]}else{print}}
    END{for(i in A){if(A){print A}}}' file2 file1

Thanks for changing the code Jotne, which also works with changing file sequence. :slight_smile:

1 Like

Thanks so much, guys. You guys are genius!!
All the solutions work. But I need to study each of them to see how they work.
Awk code is so hard to understand. It's like a magic to me.

awk Tutorial

PS I have changed !A[$1]++ to !A[$1] , there are no need to imcrement value for the second part

awk -F, 'NR==FNR{A[$1]++;print;next} !A[$1]' f2 f1 | sort

Some more easy to read

awk -F, '
	NR==FNR {
		A[$1]++
		print
		next}
	!A[$1]
	' f1 f2 | sort
cat f2	
C345, valueX, valueY
D456, valueX, valueY
G789, value A, valueB
cat f1
A123, valueA, valueB
B234, valueA, valueB
C345, valueA, valueB
D456, valueA, valueB
E567, valueA, valueB
F678, valueA, valueB

When you run awk with multiple files, it will run once for every line in all files.
NR is line nuber that increase all the time trough all files
FNR is the line number for one single file and is reset on the next file
To show this run

awk -F, '{print $0,NR,FNR}' f2 f1
C345, valueX, valueY 1 1
D456, valueX, valueY 2 2
G789, value A, valueB 3 3
A123, valueA, valueB 4 1
B234, valueA, valueB 5 2
C345, valueA, valueB 6 3
D456, valueA, valueB 7 4
E567, valueA, valueB 8 5
F678, valueA, valueB 9 6

Hence when run NR==FNR it will only do some finction on the first file in the line f2
So this code will be run in f2 {A[$1]++;print;next}
The reason that only this code is run on f2 and no other code is the next .

Part 1
Now we look at the array.
A[$1]++
It stores every value of column 1 in f2 into an array named A
The ++ ads a value to every array, and since all value of column 1 in f2 is unique the ++ can be change to =1
So A[$1]++ can be replaced by A[$1]=1
This gives

A[C345]=1
A[D456]=1
A[G789]=1

print
Here we print all post of the f2 , and this is what we want. Print value of f2 , even if there are same index in f1

C345, valueX, valueY
D456, valueX, valueY
G789, value A, valueB

next
Stop doing more on this record and go to next record and start code from start.

Part 2
When the 3 record of f2 are done, start working on file f1
Since NR==FNR is no more true on f1 , skip this part and run
!A[$1]
We continue working on same array A
the $1 now handles column 1 in f1
This is test part, and its tru only if array is 0 due to the !
Run in this on f1 we get this, since we do not add any value to it.

A[A123]=0
A[B234]=0
A[C345]=1 That it got form f2
A[D456]=1 That it got form f2
A[E567]=0
A[F678]=0

! neglect the value and print only those with value 0
the !A[$1] has no action {} , so it does the default, print record $0 from f1 - {print $0}

A123, valueA, valueB
B234, valueA, valueB
E567, valueA, valueB
F678, valueA, valueB

After awk
The final sort gives

A123, valueA, valueB
B234, valueA, valueB
C345, valueX, valueY
D456, valueX, valueY
E567, valueA, valueB
F678, valueA, valueB
G789, value A, valueB

I do hope this give some light on what awk does :slight_smile:

1 Like

Jotne,

This is an excellent excellent tutorial of awk! This the first time I actually understand how the code works instead of just copy and paste it in my script. I couldn't be more thankful.

Thanks for all the effort that you spent to come up with such a nice explanation.

You are the hero!!!!