Putting together all values from different files into one file

Hi All,

This is what I am trying to achieve, so far to no avail.

I have three sets of files:

  1. One big dictionary file which looks like this:
apple
orange
computer
pear
country
  2. Some thousands of text files, named 1.dat, 2.dat, 3.dat, etc.

The text files look like this (assume this to be 1.dat):

apple
computer
country
  3. Another set of files (with extension .num and the same in number as the .dat files above), which contain numbers instead of words. These numbers are values corresponding to the words in the .dat files. This means that for the above 1.dat, my 1.num would look like this:
0.33
2.3
0.84

The same goes for 2.dat, 3.dat, and the rest: 2.dat has a 2.num, 3.dat has a 3.num, and so on.

Now, I want to bring everything together into one file so that I can see all the values at once rather than opening several files.

This is what I wish to achieve:

1 1 0.33
3 1 2.3
5 1 0.84

The above output says that apple's position in the dictionary file is 1 (the first column), it comes from 1.dat (the second column; I have removed the .dat extension), and its value is 0.33 (obtained from the 1.num file).
Similarly, computer is at position 3 in the dictionary file, it comes from 1.dat, and its value is 2.3 from the 1.num file. country is the 5th word in the dictionary file, it comes from 1.dat, and its value is 0.84 in 1.num. The same goes for 2.dat and 2.num, 3.dat and 3.num, and so on.

I have code for this, but it struggles to do what I wish to achieve: it first creates a huge matrix from the files and then converts that to my format, and when I create the matrix, memory blows up and my computer hangs. So I want to bypass the matrix-creation step (see the sketch after the code below).

Code to create matrix:

awk 'NR==FNR{                      # first file: the dictionary
       A[$1]=NR                    # word -> position in the dictionary
       next
     }
     !n{n=NR-1}                    # dictionary size (number of matrix rows)
     FNR==1{                       # starting a new .dat file
       ++m                         # one more matrix column
       close(f)                    # close the previous .num file
       f=FILENAME
       sub(/\.dat/,x,f)            # strip the extension (x is empty)
       k=f                         # column index = numeric base name
       f=f".num"                   # the matching .num file
     }
     {
       getline v<f                 # value paired with the current word
       B[A[$1],k]=v
     }
     END{
       for(i=1;i<=n;i++){
         for(j=1;j<=m;j++)printf "%s ",B[i,j]?B[i,j]:0
         print x                   # x is empty: just end the row
       }
     }' dictionary *.dat

and then I convert it to my format like this:

awk -v nc=6000 -v nr=60000 '
# nc = number of columns (.dat files), nr = number of rows (dictionary words)
{ for (col=1; col<=NF; col++) matrix[NR,col] = $col }   # stores the whole matrix in memory
END {
    for (col=1; col<=nc; col++)     # column first, so the output is grouped by file
      for (row=1; row<=nr; row++)
        if (matrix[row,col])
          print row, col, matrix[row,col]
}
' matrix.mtx

where matrix.mtx is my huge matrix file.
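
To show what I mean by bypassing it: this second step does not really need to hold the matrix, since each cell can be printed as soon as it is read. Something like this sketch (untested; the sort is only needed to group the output by file, i.e. by column):

awk '{
  for (col=1; col<=NF; col++)      # stream one matrix row at a time
    if ($col != 0)
      print NR, col, $col          # word position, file number, value
}' matrix.mtx | sort -k2,2n -k1,1n

But the first step still builds the huge matrix file, which is the part I really want to avoid.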

I am using BASH on Linux.

You can avoid loading the "X.dat" and "X.num" files into arrays with the code below:

#!/bin/ksh
typeset -i mCnt=1
while [[ ${mCnt} -le ${Number_Files} ]]; do   # Number_Files = count of .dat/.num pairs
  echo "Now working on <${mCnt}> files:"
  paste -d' ' ${mCnt}.dat ${mCnt}.num > Tmp_Dat_Num   # pair each word with its value

  <insert the dictionary look up code here>

  mCnt=${mCnt}+1   # arithmetic, thanks to typeset -i
done

This will make it easier for you.
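
For the lookup itself, one possibility would be something like this (just a sketch; Tmp_Dat_Num holds "word value" pairs because of the paste above, and combined.out is only a placeholder name):

awk -v fileno=${mCnt} '
  NR==FNR  { pos[$1]=NR; next }           # dictionary: word -> position
  pos[$1]  { print pos[$1], fileno, $2 }  # position, file number, value
' dictionary Tmp_Dat_Num >> combined.out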


This is very easy with perl. It should also be easy with awk if you first build files 1.common, 2.common, 3.common, etc. with, for example, this format (having the numbers as the first field allows words with spaces):

0.33 apple
2.3 computer
0.84 country

It is not hard to paste all your files:

for num in `seq 1 $max`; do
  paste $num.num $num.dat >$num.common
done

Then (paste joins the two fields with a tab by default, hence the -F'\t'):

awk -F'\t' '
  NR == FNR { a[$1]=NR }                # dictionary: word -> position
  NR != FNR {
    sub(".common", "", FILENAME)        # keep only the numeric base name
    print a[$2], FILENAME, $1           # position, file number, value
  }
' dictionary *.common
1 1 0.33
3 1 2.3
5 1 0.84
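
And if you would rather not create the intermediate .common files at all, the same lookup can read each .num file with getline while scanning the .dat files. A sketch (untested, assumes the numeric base names):

awk '
  NR==FNR { a[$1]=NR; next }            # dictionary: word -> position
  FNR==1 {                              # starting a new .dat file
    close(numfile)                      # close the previous .num file, if any
    base=FILENAME
    sub(/\.dat$/, "", base)
    numfile=base ".num"
  }
  (getline v < numfile) > 0 { print a[$1], base, v }
' dictionary *.dat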