Filter uniq field values (non-substring)

yifangt · May 7, 2014, 5:16pm

Hello, I want to filter column based on string value. All substring matches are filtered out and only unique master strings are picked up.
infile:

1 abcd
2 abc
3 abcd
4 cdef
5 efgh
6 efgh
7 efx
8 fgh

Outfile:

1 abcd
4 cdef
5 efgh
7 efx

I have tried

awk '!a[$2]++; match(a[$2], $2)>0' infile

Apparently the match part is wrong that should be a loop to go through all the members already saved in a[].

awk '!a[$2]++; END{
for (i in a)
{
if ( match(a, $2)>0) {
print $0;
}
}
}' infile

but this is not working.
Thanks a lot!

vgersh99 · May 7, 2014, 5:32pm

I think that's what you're after:

awk '{for(i in a) if(i ~ $2 )next;a[$2]}1' myFile

yifangt · May 7, 2014, 5:50pm

Exactly! Thanks a lot!

alister · May 7, 2014, 5:51pm

That approach only tests potential substrings against some of the strings which precede it. As a consequence, the first string in the file is always included.

Consider the result if the first line in the OP's data sample were deleted.

Regards,
Alister

yifangt · May 7, 2014, 6:02pm

Alister, I did not get full of your point. Could you give any example? Thanks

vgersh99 · May 7, 2014, 6:08pm

very good point - thank you!
here's a diff approach:

awk '{for(i in a) if ($2 ~ i) delete a;a[$2]=$0}END {for (i in a) print a}' myFile

yifangt · May 7, 2014, 6:15pm

Did a test:

awk '{for(i in a) if ($2 ~ i) delete a;a[$2]=$1}END {for (i in a) print a,i}' infile

8 fgh
3 abcd
6 efgh
4 cdef
7 efx

8 fgh is a substring of 6 efgh More ideas?

vgersh99 · May 7, 2014, 6:31pm

sorry, mah bad:

 awk '{for(i in a) {if (i~$2) next;if ($2 ~ i) delete a};a[$2]=$0}END {for (i in a) print a}' myFile

alister · May 7, 2014, 11:01pm

Since the strings tested aren't regular expressions, using the regular expression operator is, at best, unnecessarily expensive. At worst, if the strings are allowed to contain regular expression metacharacters, it can lead to an erroneous result.

I suggest using index() instead. For non-trivial data sets, it will also speed things up dramatically.

Testing a near-worst case scenario. The file contains 1501 lines and only the last line contains a string which is a substring of another. Note that while gawk is used, testing with mawk and nawk showed similar improvements:

$ yes | awk 'NR>1500 {exit} {print NR, NR+1000} END {print NR, 25}' > 1501_1.txt
$ tail -n5 1501_1.txt 
1497 2497
1498 2498
1499 2499
1500 2500
1501 25
$ time gawk '{for(i in a) {if (i~$2) next;if ($2 ~ i) delete a};a[$2]=$0}END {for (i in a) print a}' 1501_1.txt | tail -n5
638 1638
269 1269
228 1228
639 1639
229 1229

real	0m10.462s
user	0m10.149s
sys	0m0.276s
$ time gawk '{for(i in a) {if (index(i,$2)) next;if (index($2, i)) delete a};a[$2]=$0}END {for (i in a) print a}' 1501_1.txt | tail -n5
638 1638
269 1269
228 1228
639 1639
229 1229

real	0m0.895s
user	0m0.892s
sys	0m0.004s

Regards,
Alister

yifangt · May 8, 2014, 12:01pm

Thanks!
I get it now, actually I have hundred thousand lines.

I just thought another scenario: Is possible to do with two columns? i.e. any substring of the same column (but need both col2 and col4 at the same time), should be skipped.

infile:
1 abcd    idx01    ijklm
2 abc    idx03    klm
3 abcd    idx05    jkl
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
6 efg    idx09    abc
7 efx    idx11    abcd
8 fgh    idx12    bcd

output:
1 abcd    idx01    ijklm
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
7 efx    idx11    abcd

I tried using two arrays to loop:

gawk '{for(i in a) {if (index(i,$2)) next; for(j in b) {if (index(j,$4)) next; if (index($2, i) && index($4, j)) delete a}; a[$2]=$0; b[$4]=$4}} END {for (i in a) print a}' a[$2]=$0} END {for (i in a) print a}'  infile

But there was no output. The second loop seems of problem, any suggestions please? Thanks a lot!

alister · May 8, 2014, 12:59pm

That's the type of information that should always be mentioned in the initial post. Please keep that in mind going forward.

yifangt:

I just thought another scenario: Is possible to do with two columns? i.e. any substring of the same column (but need both col2 and col4 at the same time), should be skipped.
infile:
1 abcd    idx01    ijklm
2 abc    idx03    klm
3 abcd    idx05    jkl
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
6 efg    idx09    abc
7 efx    idx11    abcd
8 fgh    idx12    bcd
output:
1 abcd    idx01    ijklm
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
7 efx    idx11    abcd
I tried using two arrays to loop:
gawk '{for(i in a) {if (index(i,$2)) next; for(j in b) {if (index(j,$4)) next; if (index($2, i) && index($4, j)) delete a}; a[$2]=$0; b[$4]=$4}} END {for (i in a) print a}' a[$2]=$0} END {for (i in a) print a}'  infile
But there was no output. The second loop seems of problem, any suggestions please? Thanks a lot!

I haven't given it much thought, but upon cursory examination, your logic is definitely very flawed. If i is in a, you jump to the next line. That's wrong. If I understood the task, before you can skip a line, both i must be in a and j must be in b.

Once included, if later lines prove that an earlier line's $2 and $4 are substrings, then that earlier line must be excluded. Note that an earlier line's $2 and $4 may be disqualified by different subsequent lines, so you must track that as well.

Was there some copy-paste malfunction in your post? There are two END sections, nearly identical, which doesn't make sense (multiple END pattern-action pairs are allowed, but in this case I don't see the point of them).

One simple, if not optimal, way to solve the problem is to handle each column individually and then join the results.

Regards,
Alister

yifangt · May 8, 2014, 2:11pm

You are right: If i is in a, you jump to the next line. That's wrong. ... before you can skip a line, both i must be in a and j must be in b. ...Was there some copy-paste malfunction in your post? Sorry for that!

This part is not what I want:
One simple, if not optimal, way to solve the problem is to handle each column individually and then join the results.
My logic is only if i in a is substring of $2 and j in b is substring of $4 at the same time, that line should be skipped. Delete a [i]will skip the whole line as a[$2]=$0. If $2 and $4 are handled separately before joined later, some lines are deleted but should not! For example:

1 abcd    idx01    ijklm 
2 abc    idx03    klm 
3 abcd    idx05    jkl
...
9 abcd    idx05   abcd

Line 9 should not be deleted even $2 is identical to that in Line 1, as $4 at Line 9 is not a substring of $4 in Line 1! No cross comparing between $2 and $4! I thought of combine the two columns to have a single one, which does not work either obviously, as not all the string start with the same char to have substring. I have hard time to catch array in awk. Thanks a lot!

vgersh99 · May 8, 2014, 2:51pm

a bit verbose and most likely not optimal:

awk '
   {idx=$2 SUBSEP $4}
   {
      for(i in a) {
        split(i,tA,SUBSEP)
        if (index(tA[1],$2) && index(tA[2],$4)) 
           next
        if (index($2, tA[1]) && index($4,tA[2])) 
           delete a
      }
      a[idx]=$0
   }
END {
   for (i in a) print a
}' myFile

can probably generalize this to specify any number of fields AND-ed together.....

alister · May 8, 2014, 3:40pm

vgersh99:

a bit verbose and most likely not optimal:
awk '
   {idx=$2 SUBSEP $4}
   {
   for(i in a) {
   split(i,tA,SUBSEP)
   if (index(tA[1],$2) && index(tA[2],$4)) 
   next
   if (index($2, tA[1]) && index($4,tA[2])) 
   delete a
   }
   a[idx]=$0
   }
END {
   for (i in a) print a
}' myFile 
can probably generalize this to specify any number of fields AND-ed together.....

That's not correct because it only accounts for cases where one line's field pair matches another line's field pair. The case where a line's fields are substrings of fields of two different lines is unaccounted for.

Regards,
Alister

vgersh99 · May 8, 2014, 3:44pm

hmmm... have problems visualizing this. Could you provide a sample, pls.
It works for the OP's sample... Just trying to see the boarder conditions......

alister · May 8, 2014, 4:52pm

After re-reading the OP's previous two posts, I'm not certain if I correctly understood the task. If I am mistaken, apologies for the noise.

What I have in mind is the following scenario:

1 aaaa 10 aaaa
2 bbbb 20 bbbb
3 aaa 30 bbb

At the time that I wrote my previous post, I was under the impression that line 3 should be excluded because $2 is a substring of line1 and $4 of line 2.

However, your interpretation may well be correct. Perhaps the substring matches must be constrained to the same line. I am no longer confident in my assumption. Both of the fields of the lines excluded from the sample data in post #10 match the same preceding line, but nothing in that data set precludes independent column matching.

Perhaps I read too much into it.

Regards,
Alister

yifangt · May 8, 2014, 5:54pm

There should not be cross column comparison. Line 3 3 aaa 30 bbb is unique as its field2 and field4 are not substring of their corresponding field of any previous lines at the same time.

1 aaaa 10 aaaa
3 aaa  30 bbb

Only the script is slow for my 76,000 lines, still running after ~1hr on Linux 2.6.32-431.5.1.el6.x86_64. Thank you very much, both of you!

vgersh99 · May 8, 2014, 6:27pm

see if this makes it faster - getting away from the associate array and split-ing....:

awk '{
      for(i=1;i<=c;i++) {
        if (index(a,$2) && index(b,$4))
           next
        if (index($2, a) && index($4,b[iA])) {
           delete a
           delete b
        }
      }
      a[++c]=$2
      b[c]=$4
      all[c]=$0
   }
END {
   for (i=1; i in all;i++) print all
}' myFile

probably there's a better way to handle delete-d array elements that doesn't create 'holes' to be iterated over and over again, but... First let's see if this change makes any difference

alister · May 8, 2014, 6:52pm

If you add an array of keys, for the added memory, it's possible to completely bypass a scan of a and b:

awk '{
    if (($2 SUBSEP $4) in k) next
    for(i=1;i<=c;i++) {
...

Regards,
Alister

vgersh99 · May 8, 2014, 6:52pm

actually, it should be as simple as this for the sample file provided:

awk  '{
      for(i=1;i<=c;i++)
        if (index(a,$2) && index(b,$4))
           next
      a[++c]=$2
      b[c]=$4
      all[c]=$0
   }
END {
   for (i=1; i in all;i++) print all
}' myFile

If you had a better, more representative data sample, maybe we could tweak the above - it works for the provided sample.