Creating a matrix from files.

shoaibjameel123 · January 31, 2011, 3:32am

I need to create a large matrix to feed to MATLAB for processing. The difficulty is building that matrix, because my data is scattered across many files.

I have one big dictionary file that holds one word per line, like

apple
orange
pineapple

I have some 400 files (all with a *.txt extension) that contain one word per line, like:

apple
computer
orange
glass

I have another 400 files (all with a *.dat extension) that contain numbers. These numbers are the values of the words stored in the matching *.txt files: a file named 1.txt has a corresponding file 1.dat, and the first word in 1.txt has its value on the first line of 1.dat. This means that if 1.txt has

apple
orange

and 1.dat has

2
3

then the value of apple is 2 and the value of orange is 3.
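As a rough illustration of this pairing, here is a minimal Python sketch. The function name pair_words_with_values is hypothetical, and the values are kept as strings because the thread does not say whether they are always integers.

```python
def pair_words_with_values(words, values):
    """Pair the i-th word with the i-th value, as N.txt is paired with N.dat."""
    # zip stops at the shorter list, so an unmatched trailing line is ignored.
    return dict(zip(words, values))


# Lines of 1.txt and 1.dat from the example above.
pairs = pair_words_with_values(["apple", "orange"], ["2", "3"])
print(pairs)
```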

The task: I need to look up apple in the dictionary file and put 2 next to it, look up orange and put 3 next to it, put 0 next to every other dictionary word, and then store the result in the matrix file. Each pair of files produces one column this way.

This means my first column holds the values for the words in 1.txt (with the numbers from 1.dat), my second column holds the values for the words in 2.txt (with the numbers from 2.dat), and so on for all 400 files, so the last column comes from the 400th file. The matrix therefore has 400 columns and as many rows as there are words in the dictionary.
The file matrix should look like this (a small example consisting of 3 files and 6 words in the dictionary):

Frankly, I have no idea how to do it with awk or sed.

Scrutinizer · January 31, 2011, 6:50am

Try something like this:

First read the contents of the dictionary file into a 1-dimensional array, recording each word's position within the dictionary. This position will be the row number in the final 2-dimensional array. Then, for every line of every .txt file, read the word and use it to look up the row position in the 1-dimensional array. The column number is the name of the file without .txt. Read the numerical value from the same line of the corresponding .dat file and store it in a 2-dimensional array at the row and column just determined.

After the array is filled, loop over every row and column and print the value. If no value is set, print a 0.
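The steps above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names build_matrix and write_matrix are hypothetical, the files are assumed to be named 1.txt/1.dat up to N.txt/N.dat in one directory, and values are copied through as text without numeric conversion.

```python
from pathlib import Path


def build_matrix(dictionary_path, data_dir, num_files):
    """Return one row per dictionary word and one column per file pair."""
    # Row index of each dictionary word (blank lines and repeats are skipped).
    row_of = {}
    for line in Path(dictionary_path).read_text().splitlines():
        word = line.strip()
        if word and word not in row_of:
            row_of[word] = len(row_of)

    # Start with all zeros, so words absent from a file keep the value 0.
    matrix = [["0"] * num_files for _ in row_of]

    for col in range(1, num_files + 1):
        txt = (Path(data_dir) / f"{col}.txt").read_text().splitlines()
        dat = (Path(data_dir) / f"{col}.dat").read_text().splitlines()
        # The i-th word of N.txt takes the i-th value of N.dat.
        for word, value in zip(txt, dat):
            row = row_of.get(word.strip())
            if row is not None:  # ignore words missing from the dictionary
                matrix[row][col - 1] = value.strip()
    return matrix


def write_matrix(matrix, out_path):
    """Write the matrix as space-separated rows, one row per line."""
    lines = [" ".join(row) for row in matrix]
    Path(out_path).write_text("\n".join(lines) + "\n")
```

With the 400 file pairs in the current directory, a call such as write_matrix(build_matrix("dictionary", ".", 400), "matrix.txt") would produce the matrix file. Keeping the whole matrix in memory is a simplification that suits a dictionary of moderate size.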

Klashxx · January 31, 2011, 12:46pm

If Perl is ok you can try:

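#!/usr/bin/perl
use strict;
use warnings;

my $numFiles = 400;

# Read the dictionary (first argument), skipping blank lines.
my $dicFile = shift or die "Usage: $0 dicFile\n";
open(my $DIC, "<", $dicFile) || die "Cannot open $dicFile: $!";
my @dic;
while ( <$DIC> ) {
   chomp;
   push @dic, $_ unless /^\s*$/;
}
close($DIC);

# %txt maps each word to an array of values indexed by file number.
my %txt;
for my $fic ( 1 .. $numFiles ) {
   open(my $DAT, "<", "$fic.dat") || die "Cannot open $fic.dat: $!";
   chomp(my @dat = <$DAT>);
   close($DAT);

   open(my $TXT, "<", "$fic.txt") || die "Cannot open $fic.txt: $!";
   my $cont = 0;
   while ( <$TXT> ) {
      chomp;
      # Line N of the .txt file takes its value from line N of the .dat file.
      $txt{$_}[$fic] = $dat[$cont] if length;
      $cont++;
   }
   close($TXT);
}

# One row per dictionary word, one column per file; missing values print as 0.
for my $word ( @dic ) {
   for ( 1 .. $numFiles ) {
      my $v = ($txt{$word}[$_] || 0) + 0;
      print "$v ";
   }
   print "\n";
}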

Usage:

script dicFile

Scrutinizer · January 31, 2011, 1:32pm

Or awk:

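awk 'NR==FNR{
       A[$1]=NR                  # dictionary word -> row number
       next
     }
     !n{n=NR-1}                  # number of dictionary rows
     FNR==1{                     # first line of each .txt file
       ++m                       # count the columns
       close(f)                  # close the previous .dat file
       f=FILENAME
       sub(/\.txt$/,x,f)         # strip the extension (x is empty)
       k=f                       # column key, e.g. 1 for 1.txt
       f=f".dat"
     }
     {
       getline v<f               # value on the matching line of the .dat file
       B[A[$1],k]=v
     }
     END{
       for(i=1;i<=n;i++){
         for(j=1;j<=m;j++)printf "%s ",B[i,j]?B[i,j]:0
         print x
       }
     }' dictionary *.txt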