awk Tutorial
PS I have changed !A[$1]++
to !A[$1]
, there are no need to imcrement value for the second part
awk -F, 'NR==FNR{A[$1]++;print;next} !A[$1]' f2 f1 | sort
Some more easy to read
awk -F, '
NR==FNR {
A[$1]++
print
next}
!A[$1]
' f1 f2 | sort
cat f2
C345, valueX, valueY
D456, valueX, valueY
G789, value A, valueB
cat f1
A123, valueA, valueB
B234, valueA, valueB
C345, valueA, valueB
D456, valueA, valueB
E567, valueA, valueB
F678, valueA, valueB
When you run awk
with multiple files, it will run once for every line in all files.
NR
is line nuber that increase all the time trough all files
FNR
is the line number for one single file and is reset on the next file
To show this run
awk -F, '{print $0,NR,FNR}' f2 f1
C345, valueX, valueY 1 1
D456, valueX, valueY 2 2
G789, value A, valueB 3 3
A123, valueA, valueB 4 1
B234, valueA, valueB 5 2
C345, valueA, valueB 6 3
D456, valueA, valueB 7 4
E567, valueA, valueB 8 5
F678, valueA, valueB 9 6
Hence when run NR==FNR
it will only do some finction on the first file in the line f2
So this code will be run in f2
{A[$1]++;print;next}
The reason that only this code is run on f2
and no other code is the next
.
Part 1
Now we look at the array.
A[$1]++
It stores every value of column 1
in f2
into an array named A
The ++
ads a value to every array, and since all value of column 1
in f2
is unique the ++
can be change to =1
So A[$1]++
can be replaced by A[$1]=1
This gives
A[C345]=1
A[D456]=1
A[G789]=1
print
Here we print all post of the f2
, and this is what we want. Print value of f2
, even if there are same index in f1
C345, valueX, valueY
D456, valueX, valueY
G789, value A, valueB
next
Stop doing more on this record and go to next record and start code from start.
Part 2
When the 3 record of f2
are done, start working on file f1
Since NR==FNR
is no more true on f1
, skip this part and run
!A[$1]
We continue working on same array A
the $1
now handles column 1
in f1
This is test part, and its tru only if array is 0
due to the !
Run in this on f1
we get this, since we do not add any value to it.
A[A123]=0
A[B234]=0
A[C345]=1 That it got form f2
A[D456]=1 That it got form f2
A[E567]=0
A[F678]=0
!
neglect the value and print only those with value 0
the !A[$1]
has no action {}
, so it does the default, print
record $0
from f1
- {print $0}
A123, valueA, valueB
B234, valueA, valueB
E567, valueA, valueB
F678, valueA, valueB
After awk
The final sort
gives
A123, valueA, valueB
B234, valueA, valueB
C345, valueX, valueY
D456, valueX, valueY
E567, valueA, valueB
F678, valueA, valueB
G789, value A, valueB
I do hope this give some light on what awk does