I have a set of simple, one-column text files (thousands of them).
file1:
a
b
c
d
file2:
b
c
d
e
and so on. Another file (s.txt in my script below) contains a collection of word sets, one set per row:
b d
b c d e
I have to find out whether each row's set of words is present in or absent from each of the given files. So the output would be in matrix form (file * set), like:
1 0
1 1
I have the following code in bash, which works correctly, but its computational cost becomes very high as the number of files and the size of the sets grow. Any suggestion for a faster way of checking for the words is much appreciated.
My code segment:
#!/bin/sh
rm -f feat.txt
touch feat.txt
rm -f temp.txt
touch temp.txt
# read the rows of the set file s.txt and put each row's words into a separate file
labels=1
while read myline
do
echo $myline > temp.txt
cat temp.txt|awk '{for (i=1;i<=NF;i++) print $i}'|sort|uniq > l$labels.txt
labels=`expr $labels + 1`
done < s.txt
p=`wc -l s.txt| awk '{print $1}'`   # p = number of sets (rows of s.txt)
q=`expr $p - 1`
a=1
c=`expr $a + $q`   # c equals p, so the outer loop runs once per set
while [ $a -le $c ]
do
rm -f a.txt
touch a.txt
fileno=1
while [ $fileno -le 1000 ]   # the data files are named 1.txt .. 1000.txt
do
# list the words of set $a that are missing from file $fileno.txt
cat l$a.txt|fgrep -xvf $fileno.txt| awk '{printf ($1 " ")}' >> a.txt
echo >>a.txt
fileno=`expr $fileno + 1`
done
# an empty line means no word of the set was missing from that file
cat a.txt |awk '{if ($0 != "") print "0"; else print "1"}'>c.txt
paste c.txt feat.txt > feat1.txt
cat feat1.txt >feat.txt
a=`expr $a + 1`
done
The script has many unnecessary calls to external commands like sort, uniq, wc, expr...
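Every expr, for instance, costs a fork, while POSIX shell arithmetic does not, and wc can read from a redirection so the extra awk call disappears. Two drop-in replacements, using the same variables as in your script:

labels=$((labels + 1))   # instead of: labels=`expr $labels + 1`
p=$(wc -l < s.txt)       # instead of: p=`wc -l s.txt | awk '{print $1}'`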
Some blocks could be rewritten as a single call to awk. For example, the first while loop, which splits the rows of s.txt into the l*.txt files:
awk '{
    f = "l" NR ".txt"
    split("", a)    # reset the seen-word list for each row
    for (i = 1; i <= NF; i++) if (!a[$i]++) print $i > f
    close(f)
}' s.txt
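This single awk process creates every l<row>.txt file and removes duplicate words on the way, so the per-row sort | uniq pipeline and the temp.txt file disappear. The words come out in input order rather than sorted, which does not matter here because fgrep -f pays no attention to the order of its patterns.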
I even believe that the entire job can be done by a single awk program, without calling any external program at all, but the requirements are not very clear to me.
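If I have read the requirements right, here is a minimal sketch of such a program. It assumes, as in your script, that the sets are the rows of s.txt and that the data files are named 1.txt through 1000.txt; it prints one row per data file with one 0/1 column per set, in set order:

files=''
i=1
while [ "$i" -le 1000 ]; do files="$files $i.txt"; i=$((i + 1)); done

awk '
NR == FNR {                       # first named file: s.txt, one set per row
    nsets = FNR
    for (i = 1; i <= NF; i++)
        if (!((nsets, $i) in set)) { set[nsets, $i] = 1; need[nsets]++ }
    next
}
FNR == 1 && started { row() }     # a new data file begins: print the row
                                  # for the previous one
{
    started = 1
    if (!($1 in seen)) {          # count each word only once per file
        seen[$1] = 1
        for (s = 1; s <= nsets; s++)
            if ((s, $1) in set) hit[s]++
    }
}
END { row() }                     # row for the last data file
function row(    s, out) {
    for (s = 1; s <= nsets; s++) {
        out = out (s > 1 ? " " : "") (hit[s] == need[s] ? 1 : 0)
        hit[s] = 0
    }
    print out
    split("", seen)               # empty the per-file word list
}
' s.txt $files

A set scores 1 for a file when the number of distinct set words seen in that file equals the size of the set. Two caveats: an empty data file produces no row in this sketch, and the row order simply follows the order of the file arguments.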