Code to exclude lines with similar values

Tzole · March 6, 2013, 12:48pm

Hi!!!

I have a problem with a text file. For example:

File:

CATEGORY OF XXX
  AAA    1          XXX     BBB     CCC
  AAA    1          XXX     DDD     EEE
  AAA    1          XXX     FFF     GGG
  AAA    1          XXX     KKK     LLL
  AAA    1          XXX     MMM     NNN
  
CATEGORY OF YYY
  AAA    1          YYY     OOO    PPP
  AAA    1          YYY     DDD    EEE
  AAA    1          YYY     QQQ    RRR

When I analyze the XXX category, I don't want the lines whose values also appear in the YYY category.
So the output would be:

CATEGORY OF XXX
  AAA     1          XXX     BBB     CCC
  AAA     1          XXX     FFF     GGG
  AAA     1          XXX     KKK     LLL
  AAA     1          XXX     MMM     NNN

(without the second line).

Any suggestions??? Thank you in advance

rdrtx1 · March 6, 2013, 1:17pm

try:

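awk '
# Pass 1 (first read of infile): remember every line from the other categories
NR==FNR {if ($3!=cat) a[$1$2$4$5]=$0; next}
# Pass 2: print lines whose last field is the category (the header line)
$NF==cat
# Pass 2: print rows of the category unless the same values were seen in pass 1
$3==cat {if (!a[$1$2$4$5]) print }
' cat="XXX" infile infile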
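
A sketch of trying the command on the sample from the question, for anyone who wants to reproduce the result. The wrapper name `filter_category`, the `sample` variable and the temporary file are made up for this illustration; the awk program itself is unchanged. The file is passed twice because awk reads it in two passes, and `NR==FNR` is only true while the first copy is being read.

```shell
# Sample data from the question, kept in a variable for the demo.
sample='CATEGORY OF XXX
  AAA    1          XXX     BBB     CCC
  AAA    1          XXX     DDD     EEE
  AAA    1          XXX     FFF     GGG
  AAA    1          XXX     KKK     LLL
  AAA    1          XXX     MMM     NNN

CATEGORY OF YYY
  AAA    1          YYY     OOO    PPP
  AAA    1          YYY     DDD    EEE
  AAA    1          YYY     QQQ    RRR'

# Hypothetical wrapper: filter_category CATEGORY FILE
filter_category() {
  awk '
  NR==FNR {if ($3!=cat) a[$1$2$4$5]=$0; next}
  $NF==cat
  $3==cat {if (!a[$1$2$4$5]) print }
  ' cat="$1" "$2" "$2"
}

tmp=$(mktemp)
printf '%s\n' "$sample" > "$tmp"
filter_category XXX "$tmp"   # prints the XXX header and the four rows not shared with YYY
rm -f "$tmp"
```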

Scrutinizer · March 6, 2013, 1:27pm

@rdrtx1: It is better to use SUBSEP to separate the fields in the index of the array.

a[$1,$2,$4,$5]

In the sample the fields all happen to have the same length, but if they vary in length, one value may "blur" into another and produce unexpected results.
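
A small sketch of that collision, using made-up two-field rows rather than the sample data: `AB C1` and `A BC1` are different rows, but both concatenate to `ABC1`. With the comma form awk joins the fields with SUBSEP, so the keys stay distinct. The function name `count_keys` is hypothetical.

```shell
# Hypothetical demo: count distinct keys built by plain concatenation
# and by the comma (SUBSEP) form for two different rows.
count_keys() {
  printf 'AB C1\nA BC1\n' | awk '
    { plain[$1 $2]++; safe[$1, $2]++ }   # concatenated key vs SUBSEP key
    END {
      np = 0; for (k in plain) np++
      ns = 0; for (k in safe) ns++
      print "concatenated keys:", np
      print "SUBSEP keys:", ns
    }'
}

count_keys
# concatenated keys: 1
# SUBSEP keys: 2
```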

hanson44 · March 6, 2013, 2:43pm

Here is a possibility. Instead of going through machinations with complex scripts, improve the file format first. The "CATEGORY OF XXX" header lines are redundant, because the category is already in field #3, and they make the file harder to process. I know you didn't ask for a different file format! But I think this is a better solution, because it makes the file easier to deal with. Suggested new data file format:

  AAA    1          XXX     BBB     CCC
  AAA    1          XXX     DDD     EEE
  AAA    1          XXX     FFF     GGG
  AAA    1          XXX     KKK     LLL
  AAA    1          XXX     MMM     NNN
  AAA    1          YYY     OOO    PPP
  AAA    1          YYY     DDD    EEE
  AAA    1          YYY     QQQ    RRR
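
Getting from the original file to this format could be sketched as below, assuming the header lines always start with the words CATEGORY OF and that blank lines carry no meaning. The function name `strip_headers` is hypothetical.

```shell
# Hypothetical helper: drop blank lines and "CATEGORY OF ..." header lines from stdin.
strip_headers() {
  awk 'NF && !($1 == "CATEGORY" && $2 == "OF")'
}

# Only the two data rows are printed.
strip_headers <<'EOF'
CATEGORY OF XXX
  AAA    1          XXX     BBB     CCC

CATEGORY OF YYY
  AAA    1          YYY     OOO    PPP
EOF
```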

Sort on field #4 (BBB).
Run uniq with the option to limit comparison to fields #4 and #5.
The uniq step will get rid of the "DDD EEE" duplication.
Sort on field #3, to put the categories back in order.
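
One way the steps above might look as a pipeline, as a sketch rather than a tested recipe for real data. Two assumptions go beyond the description: the whitespace is squeezed first, because uniq compares the text after the skipped fields character by character and the sample rows are spaced differently; and uniq is run with -u, which drops every copy of a repeated row, since plain uniq would keep one of the two "DDD EEE" rows. The function name `drop_shared_rows` is hypothetical. Note that a row repeated inside a single category would be dropped as well, and the output columns are separated by single spaces.

```shell
# Hypothetical helper: reads header-less rows on stdin.
drop_shared_rows() {
  awk '{ $1 = $1 } 1' |          # rebuild each line with single spaces
    LC_ALL=C sort -k4,4 -k5,5 |  # bring rows with equal fields 4 and 5 together
    uniq -u -f 3 |               # skip 3 fields; keep only rows that are not repeated
    LC_ALL=C sort -k3,3 -k4,4    # back to category order
}

drop_shared_rows <<'EOF'
  AAA    1          XXX     BBB     CCC
  AAA    1          XXX     DDD     EEE
  AAA    1          XXX     FFF     GGG
  AAA    1          XXX     KKK     LLL
  AAA    1          XXX     MMM     NNN
  AAA    1          YYY     OOO    PPP
  AAA    1          YYY     DDD    EEE
  AAA    1          YYY     QQQ    RRR
EOF
```

On the sample, both "DDD EEE" rows disappear; picking out one category afterwards is a matter of filtering on field #3.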

Tzole · March 6, 2013, 3:01pm

@rdrtx1 it works!!! Thank you so much!!

I will also look at the other useful suggestions from @Scrutinizer and @hanson44.