Speed : awk command to count the occurrences of fields from one file present in the other file

Hi,

file1.txt

AAA
BBB
CCC
DDD

file2.txt

abc|AAA|AAAabcbcs|fnwufnq
bca|nwruqf|AAA|fwfwwefwef
fmimwe|BBB|fnqwufw|wufbqw
wcdbi|CCC|wefnwin|wfwwf
DDD|wabvfav|wqef|fwbwqfwfe

I need the count of occurrences of each row of file1.txt in file2.txt.
Required output:

AAA 2
BBB 1
CCC 1
DDD 1

These are sample files.

I cannot copy the original contents here as it is sensitive data.
The original file1.txt has 11142 lines.
The original file2.txt has 602866 lines.

For this scenario I have written some awk commands,
but they are very slow. Please help me improve the speed of the commands.

1)

awk '{ if(NR==FNR) { arr1[FNR]=$0; count1++; next }
       count=0
       for(i=1; i<=count1; i++) { if (arr1[i] ~ $0) { count++ } }
       print $0, count
     }' file2.txt file1.txt

In practice this took 1 1/2 hours to execute.

2)

awk 'NR==FNR{ arr[count]=$0; count++ }
     NR!=FNR{ for(i=0; i<count; i++)
                if( index($0, arr[i]) != 0 ) { arr_count[i]++ }
            }
     END{ for(i=0; i<count; i++)
            print arr[i], arr_count[i]+0
        }' file1.txt file2.txt

This took around 1 hour to execute.

Please advise how to improve the speed of the awk.

If a line in file2.txt has 2 matching patterns, what do you want to do?
Let's say the second line in file2.txt is:

bca|BBB|AAA|fwfwwefwef

I would use a slightly different approach

awk 'NR==FNR{C[$1]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C[i]}' file1 FS=\| file2

You could also do it the other way around, but that will use a lot more memory, while the order will be preserved:

awk 'NR==FNR {for (i=1; i<=NF; i++) C[$i]++; next} $1 in C{print $1, C[$1]}' FS=\| file2 file1

If possible, try to use mawk, since that will usually be fastest.
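To illustrate the approach, here is a sketch that recreates the sample files from this thread (the file names `file1`/`file2` are just the ones used above) and runs the field-hashing one-liner. The `sort` at the end is only there because `for (i in C)` iterates in an unspecified order:

```shell
# Recreate the sample data from the thread.
printf '%s\n' AAA BBB CCC DDD > file1
printf '%s\n' \
  'abc|AAA|AAAabcbcs|fnwufnq' \
  'bca|nwruqf|AAA|fwfwwefwef' \
  'fmimwe|BBB|fnqwufw|wufbqw' \
  'wcdbi|CCC|wefnwin|wfwwf' \
  'DDD|wabvfav|wqef|fwbwqfwfe' > file2

# Hash the file1 keys once, then do one O(1) array lookup per field
# of file2 instead of looping over all file1 lines for every record.
awk 'NR==FNR{C[$1]=0; next}
     {for (i=1; i<=NF; i++) if ($i in C) C[$i]++}
     END{for (i in C) print i, C[i]}' file1 FS=\| file2 | sort
```

This prints the required output (AAA 2, BBB 1, CCC 1, DDD 1). The speedup over the nested-loop attempts comes from replacing the 11142-iteration inner loop with a single hash lookup per field.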

---
A very different approach:

tr -s '|' '\n' < file2 | grep -xFf file1 - | sort | uniq -c 

or in bash / ksh93 you could also use

grep -xFf file1 <(tr -s '|' '\n' < file2) | sort | uniq -c
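For reference, the pipeline run against the same sample data (file names as used in the thread) looks like this. Note that `uniq -c` puts the count before the value, so the output format differs from the awk versions:

```shell
# Same sample files as in the thread.
printf '%s\n' AAA BBB CCC DDD > file1
printf '%s\n' \
  'abc|AAA|AAAabcbcs|fnwufnq' \
  'bca|nwruqf|AAA|fwfwwefwef' \
  'fmimwe|BBB|fnqwufw|wufbqw' \
  'wcdbi|CCC|wefnwin|wfwwf' \
  'DDD|wabvfav|wqef|fwbwqfwfe' > file2

# Split every |-separated field onto its own line, keep only exact
# whole-line matches against file1 (-x whole line, -F fixed strings,
# -f patterns from file), then count the survivors.
tr -s '|' '\n' < file2 | grep -xFf file1 - | sort | uniq -c
```

Because of `-x`, a field like `AAAabcbcs` does not count toward `AAA`.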

@SriniShoo
It will not be the case; that will not happen with the real data, so don't consider that scenario.

awk 'NR == FNR {a[$1]; next} {for(x in a) {if($0 ~ "(^|\\|)" x "(\\||$)") {a[x]++; next}}} END {for(x in a) print x, a[x]}' file1 file2
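The regex anchors `(^|\|)` and `(\||$)` require the key to be bounded by start/end of line or a `|` delimiter, so substrings inside a larger field are not counted. A quick check, using a cut-down version of the thread's sample data:

```shell
printf '%s\n' AAA > file1
printf '%s\n' 'abc|AAA|AAAabcbcs|fnwufnq' 'AAAabcbcs|xyz' > file2

# "|AAA|" on line 1 matches; the bare substring "AAA" inside
# "AAAabcbcs" on line 2 does not, because no delimiter follows it.
awk 'NR == FNR {a[$1]; next}
     {for (x in a) if ($0 ~ "(^|\\|)" x "(\\||$)") {a[x]++; next}}
     END {for (x in a) print x, a[x]}' file1 file2
```

This prints `AAA 1`. One caveat: this variant still loops over all file1 keys for each line of file2 (exiting early on the first match via `next`), so with 11142 keys it will likely be slower than the array-lookup versions; it also counts each file2 line at most once, which is what the "don't consider that scenario" answer above asks for.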

Hello,

The following may help with the same:

awk 'NR==FNR{a[$1]=$1;next} {for(i=1;i<=NF;i++){if($i == a[$i]){v[$i]++}}} END{for(l in v){print l OFS v[l]}}' file1 FS="|" file2

The output will be as follows.

AAA 2
BBB 1
CCC 1
DDD 1

Thanks,
R. Singh

AAA
 
BBB
CCC
123
A12
32B
Z H4
M H4

Do you mean if there is an empty line in the first file?

It will produce an extra 0 value with the first suggestion. You can get rid of it like so:

awk 'NR==FNR{if(NF)C[$1]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C[i]}' file1 FS=\| file2

It will make no difference with the second suggestion.

With the grep / pipeline approaches one would need to get rid of the empty lines:

grep -xFf <(awk NF file1) <(tr -s '|' '\n' < file2) | sort | uniq -c 

---
By the way, with the latter approaches a sort is also required. I have added it to my post.

Sorry, forget about the empty lines.
Your awk command is very fast; it took just 4 seconds :) thanks.
I am talking about the Z M4, Z H4 kind of data in file1,
because the for loop in the END part is not displaying all the values from file1, like Z M4, and some data is missing.

OK, I see. Try with $0:

awk 'NR==FNR{C[$0]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C[i]}' file1 FS=\| file2
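Keying the array on `$0` stores the whole file1 line (spaces and all), whereas `C[$1]` kept only the first word, so a key like `Z H4` was stored as just `Z` and never matched. A small check with made-up data of that shape (the sample lines below are hypothetical):

```shell
printf '%s\n' 'AAA' 'Z H4' 'M H4' > file1
printf '%s\n' 'abc|Z H4|AAA' 'M H4|zz|AAA' > file2

# $0 keeps the full file1 line as the array key; file2 is split on
# "|", so a field like "Z H4" stays intact and matches its key.
awk 'NR==FNR{C[$0]=0; next}
     {for (i=1; i<=NF; i++) if ($i in C) C[$i]++}
     END{for (i in C) print i, C[i]}' file1 FS=\| file2 | sort
```

This prints `AAA 2`, `M H4 1`, `Z H4 1`, i.e. the multi-word keys are now counted.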

Superb, thanks a lot, Scrutinizer.