I have a set of simple, one-column text files (thousands of them).
file1:
a
b
c
d
file2:
b
c
d
e
and so on. Another file (s.txt in my script below) contains a collection of word sets, one set per row:
b d
b c d e
I have to find out whether each row's set of words is present in or absent from each of the given files. So the output would be in matrix form (file * set), like:
1 0
1 1
I have the following code in bash, which works correctly, but its computational cost becomes very high as the number of files and the size of the sets grow. Any suggestion for a faster way of checking for the words is much appreciated.
My code segment:
#!/bin/sh
rm -f feat.txt
touch feat.txt
rm -f temp.txt
touch temp.txt
# read the rows of the set file s.txt and put each row's words into a separate file
labels=1
while read myline
do
echo $myline > temp.txt
cat temp.txt|awk '{for (i=1;i<=NF;i++) print $i}'|sort|uniq > l$labels.txt
labels=`expr $labels + 1`
done < s.txt
p=`wc -l s.txt| awk '{print $1}'`   # p = number of sets (rows of s.txt)
q=`expr $p - 1`
a=1
c=`expr $a + $q`   # c equals p, so the outer loop runs once per set
while [ $a -le $c ]
do
rm -f a.txt
touch a.txt
fileno=1
while [ $fileno -le 1000 ]   # the data files are named 1.txt .. 1000.txt
do
# list the words of set $a that are missing from file $fileno.txt
cat l$a.txt|fgrep -xvf $fileno.txt| awk '{printf ($1 " ")}' >> a.txt
echo >>a.txt
fileno=`expr $fileno + 1`
done
# an empty line means no word of the set was missing from that file
cat a.txt |awk '{if ($0 != "") print "0"; else print "1"}'>c.txt
paste c.txt feat.txt > feat1.txt
cat feat1.txt >feat.txt
a=`expr $a + 1`
done
The script has many unnecessary calls to external commands like sort, uniq, wc, expr...
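Every expr, for instance, costs a fork, while POSIX shell arithmetic does not, and wc can read from a redirection so the extra awk call disappears. Two drop-in replacements, using the same variables as in your script:

labels=$((labels + 1))   # instead of: labels=`expr $labels + 1`
p=$(wc -l < s.txt)       # instead of: p=`wc -l s.txt | awk '{print $1}'`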
Some blocks could be rewritten as a single call to awk. For example, the first while loop, which splits the rows of s.txt into the l*.txt files:
awk '{
    f = "l" NR ".txt"
    split("", a)    # reset the seen-word list for each row
    for (i = 1; i <= NF; i++) if (!a[$i]++) print $i > f
    close(f)
}' s.txt
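This single awk process creates every l<row>.txt file and removes duplicate words on the way, so the per-row sort | uniq pipeline and the temp.txt file disappear. The words come out in input order rather than sorted, which does not matter here because fgrep -f pays no attention to the order of its patterns.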
I even believe that the entire job can be done by a single awk program, without calling any external program at all, but the requirements are not very clear to me.
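If I have read the requirements right, here is a minimal sketch of such a program. It assumes, as in your script, that the sets are the rows of s.txt and that the data files are named 1.txt through 1000.txt; it prints one row per data file with one 0/1 column per set, in set order:

files=''
i=1
while [ "$i" -le 1000 ]; do files="$files $i.txt"; i=$((i + 1)); done

awk '
NR == FNR {                       # first named file: s.txt, one set per row
    nsets = FNR
    for (i = 1; i <= NF; i++)
        if (!((nsets, $i) in set)) { set[nsets, $i] = 1; need[nsets]++ }
    next
}
FNR == 1 && started { row() }     # a new data file begins: print the row
                                  # for the previous one
{
    started = 1
    if (!($1 in seen)) {          # count each word only once per file
        seen[$1] = 1
        for (s = 1; s <= nsets; s++)
            if ((s, $1) in set) hit[s]++
    }
}
END { row() }                     # row for the last data file
function row(    s, out) {
    for (s = 1; s <= nsets; s++) {
        out = out (s > 1 ? " " : "") (hit[s] == need[s] ? 1 : 0)
        hit[s] = 0
    }
    print out
    split("", seen)               # empty the per-file word list
}
' s.txt $files

A set scores 1 for a file when the number of distinct set words seen in that file equals the size of the set. Two caveats: an empty data file produces no row in this sketch, and the row order simply follows the order of the file arguments.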