Search for common words in a set of files

Hi,

I have a set of simple, one-column text files (thousands of them).
file1:
a
b
c
d
file2:
b
c
d
e
and so on. There is a collection of words in another file:
b d
b c d e
I have to find out, for each file and each row of that set file, whether all the words of the row are present in the file. So the output would be in matrix form (file x set), with 1 for present and 0 for absent:
1 0
1 1
I have the following code in bash, which works, but its computational cost grows very quickly as the number of files and sets increases. Any suggestion for a faster way to check for the words is much appreciated.

My code segment:

#!/bin/sh
rm -f feat.txt
touch feat.txt
rm -f temp.txt
touch temp.txt
# read the rows of the set file s.txt and put each row into a separate file
labels=1
while read myline
  do
   echo $myline > temp.txt
   cat temp.txt|awk '{for (i=1;i<=NF;i++) print $i}'|sort|uniq > l$labels.txt
   labels=`expr $labels + 1`
  done < s.txt
# p = number of sets; the loop counter a runs from 1 to p
p=`wc -l s.txt| awk '{print $1}'`
q=`expr $p - 1`
a=1
c=`expr $a + $q`
while [ $a -le $c ]
  do
  rm -f a.txt
  touch a.txt
  fileno=1
  while [ $fileno -le 1000 ]
   do
     # words of set a that are NOT in $fileno.txt (empty line = all present)
     cat l$a.txt|fgrep -xvf $fileno.txt| awk '{printf ($1 " ")}' >> a.txt
     echo >>a.txt
     fileno=`expr $fileno + 1`
   done
  # empty line -> 1 (set present in that file), non-empty -> 0 (set absent)
  cat a.txt | awk '{if ($0 != "") print "0"; else print "1"}' > c.txt
  paste c.txt feat.txt > feat1.txt
  cat feat1.txt >feat.txt
  a=`expr $a + 1`
 done
 

Thank you in advance

You use far too many external commands where they are not necessary. They slow down the script by a large factor.

You are calling four external commands (rm and touch twice each); you don't need any:

> feat.txt
> temp.txt

You don't need cat and uniq. And tr will be faster than awk.

tr -s ' ' '\n' < temp.txt | sort -u
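
For example, given a row like "b d b" in temp.txt, the pipeline prints each distinct word once, sorted (a quick illustration):

$ printf 'b d b\n' > temp.txt
$ tr -s ' ' '\n' < temp.txt | sort -u
b
d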

The shell can do integer arithmetic; you don't need expr:

labels=$(( labels + 1 ))

You don't need awk:

p=$( wc -l < s.txt )
q=$(( $p - 1 ))
c=$(( $a + $q ))

You don't need rm and touch.

> a.txt

Get rid of all instances of cat and expr from the rest of the script.
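
For instance, the inner loop could be written with no cat or expr at all (a sketch with the same logic, just fewer processes):

fileno=1
while [ $fileno -le 1000 ]
 do
   fgrep -xvf $fileno.txt l$a.txt | awk '{printf ($1 " ")}' >> a.txt
   echo >> a.txt
   fileno=$(( fileno + 1 ))
 done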

The script has many unnecessary calls to external commands like sort, uniq, wc, expr...
Some blocks could be rewritten as a single call to awk. For example:

while read myline
  do
   echo $myline > temp.txt
   cat temp.txt|awk '{for (i=1;i<=NF;i++) print $i}'|sort|uniq > l$labels.txt
   labels=`expr $labels + 1`
  done < s.txt
awk '{
  split("", a)                     # clear the seen-words array for each row
  f = "l" NR ".txt"
  for (i = 1; i <= NF; i++) if (!a[$i]++) print $i > f
  close(f)
}' s.txt

I even believe that the entire work can be done in one single awk program, without the need to call external programs, but the requirements are not very clear.
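
A rough sketch of what that single awk program might look like, assuming one word per line in the data files, one set per row of s.txt, and that a set counts as present only when every one of its words occurs in the file (that is my reading of the requirements; note that an empty data file would be skipped, a known awk quirk):

awk '
NR == FNR {                      # first file: s.txt, one word set per row
    nsets = NR
    nwords[NR] = NF
    for (i = 1; i <= NF; i++) set[NR, i] = $i
    next
}
FILENAME != prev {               # starting a new data file
    if (prev != "") report()
    prev = FILENAME
    split("", seen)              # clear the per-file word list
}
{ seen[$1] = 1 }                 # record each word of the current file
END { if (prev != "") report() }

function report(   s, i, ok, row) {
    row = ""
    for (s = 1; s <= nsets; s++) {
        ok = 1
        for (i = 1; i <= nwords[s]; i++)
            if (!(set[s, i] in seen)) { ok = 0; break }
        row = row (s > 1 ? " " : "") ok
    }
    print row                    # one row per file, one column per set
}' s.txt 1.txt 2.txt 3.txt       # ... list all the data files here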

awk 'FNR==NR { for (i=1; i<=NF; i++) a[$i]++; next }
     { if (a[$0]) print FILENAME, 1; else print FILENAME, 0 }' coll file*

Output:

file1 0
file1 1
file1 1
file1 1
file2 1
file2 1
file2 1
file2 1

-Devaraj Takhellambam

Thanks, everybody. Your suggestions are excellent; I was able to improve the code a great deal. Thanks again.