mdkm
September 4, 2014, 4:23am
1
Hi,
file1.txt
AAA
BBB
CCC
DDD
file2.txt
abc|AAA|AAAabcbcs|fnwufnq
bca|nwruqf|AAA|fwfwwefwef
fmimwe|BBB|fnqwufw|wufbqw
wcdbi|CCC|wefnwin|wfwwf
DDD|wabvfav|wqef|fwbwqfwfe
i need the count of rows of file1.txt present in the file2.txt
required output:
AAA 2
BBB 1
CCC 1
DDD 1
these are sample files
i cannot copy the original contents here as its a sensitive data
original file1.txt lines are 11142
original file2.txt lines are 602866
for this scenario i have written some awk commands
but they are very slow, please help me to improve the speed of the command
1)
awk '{ if(NR==FNR) { arr1[FNR]=$0;count1++;next }
if(NR!=FNR) { count=0
for(i=1;i<=count;i++) { if (arr1 ~ $0)
{ count ++ }
}
print $0,count
}
} file2.txt file1.txt
in the real time this took 1 1/2 hour to execute
2)
awk 'NR==FNR{ arr[count]=$0;count++ }
NR!=FNR{ for(i=0;i<count;i++)
{
if( index($0,arr)!=0 ) { arr_count++ }
}
}
END{ for(i=0;i<count;I++)
{
print arr,arrcount
}
}' file1.txt file2.txt
this took around 1 hour to execute
please advice to improve the speed of the awk
If a line in file2.txt
has 2 matching patterns, what do you want to do
let's say, second line in file2.txt
is
bca|BBB|AAA|fwfwwefwef
I would use a slightly different approach
awk 'NR==FNR{C[$1]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C}' file1 FS=\| file2
You could also do it the other way around, but that will use a lot more memory, while the order will be preserved..:
awk 'NR==FNR {for (i=1; i<=NF; i++) C[$i]++; next} $1 in C{print $1, C[$1]}' FS=\| file2 file1
If possible, try to use mawk
since that will usually be fastest..
---
A very different approach:
tr -s '|' '\n' < file2 | grep -xFf file1 - | sort | uniq -c
or in bash / ksh93 you could also use
grep -xFf file1 <(tr -s '|' '\n' < file2) | sort | uniq -c
1 Like
mdkm
September 4, 2014, 4:50am
4
@ SriniShoo SriniShoo
It will not be the case
that will not happen in the real time , so dont consider that scenario
awk 'NR == FNR {a[$1]; next} {for(x in a) {if($0 ~ "(^||)" x "(||$)") {a[x]++; next}}} END {for(x in a) print x, a[x]}' file1 file2
1 Like
Hello,
Following may help in same.
awk 'NR==FNR{a[$1]=$1;next} {for(i=1;i<=NF;i++){if($i == a[$i]){v[$i]++}}} END{for(l in v){print l OFS v[l]}}' file FS="|" file2
Output will be as follows.
AAA 2
BBB 1
CCC 1
DDD 1
Thanks,
R. Singh
1 Like
mdkm
September 5, 2014, 12:19am
7
AAA
BBB
CCC
123
A12
32B
Z H4
M H4
Do you mean if there is an empty line in the first file?
It will produce an extra 0 value with the first suggestion. You can get rid of it like so:
awk 'NR==FNR{if(NF)C[$1]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C}' file1 FS=\| file2
It will make no difference with the second suggestion.
With the grep / pipeline approaches one would need to get rid of the empty lines:
grep -xFf <(awk NF file1) <(tr -s '|' '\n' < file2) | sort | uniq -c
---
By the way, with the latter approaches a sort is also required. Added it to my post..
mdkm
September 5, 2014, 7:38am
9
sorry, forget about the empty lines.
your awk command is too fast , it just took 4 seconds thanks .
i am talking about the Z M4 , Z H4 kind data in file1 ,
because the for loop in the end part is not displaying all the values from the file1 like Z M4, and some data is missing.
OK, I see. Try with $0 :
awk 'NR==FNR{C[$0]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C}' file1 FS=\| file2
1 Like
mdkm
September 5, 2014, 8:03am
11
superb, thanks alot scrutinizer.