Help with file processing using awk

julearn · March 5, 2016, 12:35am

Hello all, I'm new to awk programming and have taught myself a few things about processing a file and dealing with duplicate lines, but I've run into a scenario I don't know how to handle. Here it is:

Input file:

user  role
-----  ----
AAA  add
AAA  delete
BBB  delete
CCC  delete
DDD  add
BBB  add

Expected output:

user role
----  -----
AAA  add
BBB  add
CCC  delete
DDD  add

As per my expected output, if the same user has two roles (add and delete), only "add" should be kept for that user and the "delete" row ignored; if a user has just one role (add or delete), it should be printed as is. Note: the lines can appear in any order.

Thanks for guiding me toward the right approach!

Scrutinizer · March 5, 2016, 2:05am

Hi, try:

awk '!($1 in A) || $2=="add" {A[$1]=$2} END{for(i in A) print i, A[i]}' file

or if

user  role
-----  ----

is part of the file:

awk 'FNR<3{print; next} !($1 in A) || $2=="add" {A[$1]=$2} END{for(i in A) print i, A[i]}' file

julearn · March 5, 2016, 11:04am

Thank you, much appreciated. It works like a charm.
But can you let me know how to print two more columns accordingly?
When I tried printing $3 and $4, it always printed the final column value.

<<ignore the header>>

user role  location eid
-------------------------
AAA  add  UK  1
AAA  delete  US  4
BBB  delete  CA  5
CCC  delete  AU  2
DDD  add  FR  3
BBB  add  IN  4

expected output :

user role  location  eid
-------------------------
AAA  add  UK  1
BBB  add  IN  4
CCC  delete  AU  2
DDD  add  FR  3

MadeInGermany · March 5, 2016, 1:00pm

Store the entire line, $0, in the array. Because the key $1 is part of $0, there is no need to print it separately in the END block.

awk '!($1 in A) || $2=="add" {A[$1]=$0} END{for(i in A) print A[i]}' file
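
A small sketch of why printing $3 and $4 from the END block did not give per-user values: in POSIX-conforming awks (gawk and mawk behave this way) the fields still hold the last input record when END runs, so only what was saved in the array belongs to each user. The file name demo.txt and its contents are made up for illustration, and the output is piped through sort only because for (i in A) visits keys in no particular order.

```shell
# Hypothetical sample data in a scratch file (demo.txt is an assumed name).
printf '%s\n' 'AAA add UK 1' 'AAA delete US 4' 'BBB delete CA 5' 'BBB add IN 4' > demo.txt

# Problematic: $3 and $4 in END come from the last record read, not from each user.
wrong=$(awk '!($1 in A) || $2=="add" {A[$1]=$2} END{for(i in A) print i, A[i], $3, $4}' demo.txt | sort)

# Fix: keep the whole line per user and print that instead.
right=$(awk '!($1 in A) || $2=="add" {A[$1]=$0} END{for(i in A) print A[i]}' demo.txt | sort)

printf 'fields read in END:\n%s\nwhole line stored:\n%s\n' "$wrong" "$right"
```

With the stored-line version each user keeps its own location and eid columns.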

RudiC · March 5, 2016, 1:09pm

Does it have to be entirely in awk? Try:

{ read A; read B; printf "%s\n" "$A" "$B";  sort; } < file | awk '!T[$1]++'
user role location eid
-------------------------
AAA add UK 1
BBB add IN 4
CCC delete AU 2
DDD add FR 3

MadeInGermany · March 5, 2016, 1:24pm

sort is a nice idea - add comes before delete when sorted.
If the key field is precisely specified, even sort -u should work:

sort -k 1,1 -u file

Cannot test it right now...might need the second key field to be added...

looney · March 5, 2016, 2:22pm

{ read A; read B; printf "%s\n" "$A" "$B";  sort; } < file | awk '!T[$1]++'

Hi RudiC,

Could you please explain the code and the flow.

Don_Cragun · March 5, 2016, 4:48pm

The first part:

{ read A; read B; printf "%s\n" "$A" "$B";  sort; } < file

reads the first line from file into the shell variable A ( read A ) and the second line from file into the shell variable B ( read B ), reprints those two lines unchanged ( printf "%s\n" "$A" "$B" ), and sorts the remaining lines from file ( sort ). This sorts the data in file while keeping the headers (unsorted) at the start of the output.
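
To see that first part on its own, here is a minimal sketch using a made-up file (demo_sort.txt and its contents are assumptions for illustration). It uses read -r, which differs from the original only in that backslashes in the header lines are kept literally, and LC_ALL=C so the sort order does not depend on the locale.

```shell
# Hypothetical input: two header lines followed by unsorted data lines.
printf '%s\n' 'user role' '---- ----' 'CCC delete' 'AAA delete' 'AAA add' > demo_sort.txt

# Read and reprint the two header lines, then let sort consume the rest of the file.
sorted=$( { read -r A; read -r B; printf '%s\n' "$A" "$B"; LC_ALL=C sort; } < demo_sort.txt )

printf '%s\n' "$sorted"
```

The headers stay on top and "AAA add" lands ahead of "AAA delete", which is what the awk filter in the second part relies on.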

All of the output from the first part is then piped ( | ) into the second part:

awk '!T[$1]++'

which creates an array indexed by the first field on each input line ( T[$1] ) and sets the value of that array element to the number of times that index has been seen (after using the previous value of the element). In awk , statements consist of a condition section and an action section. In this case the condition is !T[$1]++ and there is no action section. When there is no action section, the default action (print the line) is performed if, and only if, the condition evaluates to a non-zero numeric value or a non-empty string value. Since undefined elements of an array are treated as a zero or an empty string (depending on context), the condition !T[$1]++ is evaluated as follows the first time a line is seen with a given string in the first field:

T[$1] is seen as a zero value
!T[$1] yields a value of 1 for the condition
T[$1]++ increments the value of the array element to 1

and since the condition's value is non-zero, the line is printed. On subsequent lines with the same first field the evaluation is:

T[$1] is seen as a non-zero value (a count of the number of times this value has been seen before)
!T[$1] yields a value of 0 for the condition
T[$1]++ increments the value of the array element

and since the condition's value is zero, the line is not printed.
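
The evaluation described above can be made visible with a small trace. This is only a sketch: the variable names before and keep are made up for the illustration, and the three input lines are sample data.

```shell
# For each line, show the count stored in T[$1] before the increment and the
# value of the condition !T[$1]++ that decides whether the line would be printed.
trace=$(printf '%s\n' 'AAA add' 'AAA delete' 'BBB delete' |
  awk '{ before = T[$1] + 0; keep = !T[$1]++; print $0, "before=" before, "keep=" keep }')

printf '%s\n' "$trace"
```

Lines with keep=1 are the ones awk '!T[$1]++' would print; the second AAA line gets keep=0 and is dropped.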

Note that the sort in the first part grouped all lines with the same 1st field together and, if both an "add" and a "delete" appeared for that value, the line(s) with "add" for that 1st field value will come before any line(s) with "delete".

Nice, simple, straight-forward magic. The only likely problem with this approach is that if the data in your file contains a 1st field value that is identical to the 1st field in one of the header lines, those data lines will be lost. But, as long as you don't have a user named "user", that shouldn't be a problem for you.
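
If a user named "user" ever did show up, one possible workaround (a sketch, not part of the original suggestion) is to pass the two header lines through without recording them in the array. The file demo_hdr.txt and its contents are hypothetical and stand in for the already sorted output of the first part.

```shell
# Hypothetical, already sorted input that contains a user literally named "user".
printf '%s\n' 'user role' '---- ----' 'AAA add' 'AAA delete' 'user delete' > demo_hdr.txt

# Original filter: the data line for "user" collides with the header key and is lost.
lossy=$(awk '!T[$1]++' demo_hdr.txt)

# Variant: print the first two lines untouched and only deduplicate from line 3 on.
safe=$(awk 'NR < 3 { print; next } !T[$1]++' demo_hdr.txt)

printf 'original filter:\n%s\nheader-aware filter:\n%s\n' "$lossy" "$safe"
```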

rdrtx1 · March 6, 2016, 9:51am

try also:

awk 'NR==FNR {a[$1]++; next} $2 ~ /add/ {print; a[$1]=0} ++b[$1]==a[$1]' infile infile

Don_Cragun · March 6, 2016, 3:05pm

Note that if you don't care about the order of the data lines in the output, you can also try something like:

awk '
NR < 3 {# Copy header lines unchanged.
	print
	next
}
!($1 in A) && $2 == "add" {
	# Print 1st occurrence of an "add" line for each user.
	print
	A[$1]
	next
}
!($1 in A) && !($1 in D) && $2 == "delete" {
	# If user does not have an "add" line yet and does not have a "delete"
	# line yet, capture the 1st "delete" line.
	D[$1] = $0
}
END {	# After we reach EOF on the input file, print captured "delete" lines
	# for users who did not have an "add" line.
	for(u in D)
		if(!(u in A))
			print D[u]
}' file

which doesn't need sort and only reads the input file once. (It will print all users with "add" in the 2nd field in the order in which they were first found in the input file followed by users who have no "add" but do have a "delete" in the 2nd field in random order.) With your latest sample input file, the output will be:

user role  location eid
-------------------------
AAA  add  UK  1
DDD  add  FR  3
BBB  add  IN  4
CCC  delete  AU  2

Scrutinizer · March 6, 2016, 3:12pm

Combining MadeInGermany's modification in post #4 of my suggestion with the bit that saves the header, you could also try:

awk 'FNR<3{print; next} !($1 in A) || $2=="add" {A[$1]=$0} END{for(i in A) print A[i]}' file
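
Since for (i in A) visits the users in an unspecified order, the data lines may come out in a different order than in the expected output. If the order matters, one option is to remember each user the first time it appears and loop over that list in END. The following is a sketch under that assumption; the array name order, the counter n, and the file name demo_order.txt are made up for the illustration.

```shell
# Sample data from the thread, written to a scratch file.
printf '%s\n' 'user role location eid' '-------------------------' \
  'AAA add UK 1' 'AAA delete US 4' 'BBB delete CA 5' \
  'CCC delete AU 2' 'DDD add FR 3' 'BBB add IN 4' > demo_order.txt

ordered=$(awk '
FNR < 3 { print; next }                   # header lines pass through unchanged
!($1 in A) { order[++n] = $1 }            # record each user the first time it appears
!($1 in A) || $2 == "add" { A[$1] = $0 }  # first line wins unless an "add" line shows up
END { for (k = 1; k <= n; k++) print A[order[k]] }
' demo_order.txt)

printf '%s\n' "$ordered"
```

This prints the users in the order in which they were first seen in the input (AAA, BBB, CCC, DDD for the sample data).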