Creating matrix from folders and subfolders

newbie83 · December 17, 2012, 10:43pm

Hello,

Please help me with the following problem. I need to produce one big matrix file from several files at different directory levels.
If it helps, the index folder holds the chromosome index and
the data folder holds the values for each chromosome.

There are 2 folders at the same level, index and data.
The index folder has multiple files named chr1, chr2, etc.
The data folder has many subfolders. Each subfolder has multiple files named chr1, chr2, etc., with the same names
as the files in the index folder. A file and its namesakes all have the same number of rows,
so if chr1 in index has 5 rows, chr1 in every subfolder of data also has 5 rows.

The output should be one big matrix in a nested format: the first column (from row 2 on) holds the file names,
the second column holds the index values, and the first row (from column 3 on) holds the names of the corresponding subfolders in the data folder.

All files have 1 column and multiple rows containing only integers.

Index folder 

chr1

1
2
3
5
6

chr2

1
2
3
4
5
7

chr3

1
5
7


Data Folder 

Subfolder1

chr1

1
0
1
0
0

chr2

0
1
0
1
0
1

chr3

0
0
2

Subfolder2

chr1

1
1
2
2
3

chr2

1
3
4
6
0
0


chr3

1
0
0

Output

		Subfolder1	Subfolder2
chr1	1	1	1
chr1	2	0	1
chr1	3	1	2
chr1	5	0	2
chr1	6	0	3
chr2	1	0	1
chr2	2	1	3
chr2	3	0	4
chr2	4	1	6
chr2	5	0	0
chr2	7	1	0
chr3	1	0	1
chr3	5	0	0
chr3	7	2	0

Chubler_XL · December 18, 2012, 12:12am

How about using awk:

find index data -type f -print | awk '
# Index files: remember each file name and load its rows into I[]
/^index/ {
   FL=$0
   n=split(FL,p,"/");
   F[++files]=p[n]
   n=0
   while ((getline < FL) > 0) {
       I[F[files],++n]=$0
       C[F[files]]=n
   }
   close(FL)
}
# Data files: remember the subfolder and load the rows into D[]
/^data/ { FL=$0
   n=split(FL,p,"/");
   subdir=p[n-1]
   S[subdir]=1
   file=p[n]
   n=0
   while ((getline < FL) > 0)
      D[file,subdir,++n]=$0
   close(FL)
}
# Header row first, then one output row per index row
END{
    printf "\t"
    for(subdir in S) printf "\t%s",subdir;
    printf "\n"
    for(i=1;i<=files;i++) {
        for(c=1;c<=C[F[i]];c++) {
            printf "%s\t%s",F[i],I[F[i],c]
            for(subdir in S) printf "\t%s",D[F[i],subdir,c];
            printf "\n"
        }
    }
}'

newbie83 · December 18, 2012, 8:29am

This works like a charm with the sample data, but with the actual data it is taking forever: 40 minutes in and it hasn't written a single output row. I guess I will have to wait it out. Thanks again.

If it isn't too much trouble, is there a more efficient way?

---------- Post updated at 09:29 AM ---------- Previous update was at 02:38 AM ----------

Update: 6 hours in, still no output lines. The data and the code are fine;
it's just that the data is too big, 22 GB to be precise.

Any suggestions on how to speed things up?

Chubler_XL · December 18, 2012, 12:29pm

Wow, 22 GB is a lot of data. I'm assuming you're using GNU awk, or it would probably have fallen over by now.

The solution does need to read in all the data files before any output starts. I would assume the final output phase will be quite quick, so don't get too worried that no output has appeared yet.

What is the total number of index files and the total number of subdirectories?

I can think of another method to solve this problem if those file/subdir counts aren't too massive, but I have some other stuff to do for the next 3 hours or so. I'll start working on it for you then.

newbie83 · December 18, 2012, 1:25pm

There are 13 files in index and 83 subfolders in data with 13 files each.
The size of the files is what is causing the long run time.
Please take your time; it's not a matter of life and death to get this done in the next few hours.
I can't thank you enough for your help.

Chubler_XL · December 18, 2012, 3:36pm

As you only have 83 subfolders, gawk should be able to keep all the files open as it works (ulimit -n controls how many files gawk can have open). This will reduce the memory requirements from tens of GB to a few KB, and you should get output pretty much straight away.

Many traditional awks have a limit of 15 open files, and if that is the only awk you have, you're out of luck with this solution.
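Before running it, a quick check along these lines may tell you whether the open-file limit is high enough. It is only a sketch: enough_fds is a made-up helper name, it assumes your shell supports ulimit -n (bash, dash and ksh do), and the extra 4 descriptors (index file plus stdin/stdout/stderr) are a rough allowance, not an exact figure.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the soft open-file limit leaves room for
# one data file per subfolder, plus the index file and stdin/stdout/stderr.
enough_fds() {
    need=$(( $1 + 4 ))
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$need" ]
}

# Example with the 83 subfolders mentioned above
if enough_fds 83; then
    echo "open-file limit looks high enough"
else
    echo "raise the limit first, e.g. ulimit -n 256"
fi
```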

find index data -type f -print | awk -F/ '
# Only collect names here; the files themselves are read in END
/^index/ { F[++files]=$NF }
/^data/  { S[$(NF-1)] }
END {
    printf "\t"
    for(subdir in S) printf "\t%s",subdir;
    printf "\n"
    for(i=1;i<=files;i++) {
        # One index row and the matching row of every data file per pass
        while((getline < ("index/"F[i])) > 0) {
            printf "%s\t%s",F[i],$0
            for(subdir in S) {
                getline < ("data/"subdir"/"F[i])
                printf "\t%s",$0
            }
            printf "\n"
        }
        close("index/"F[i])
        for(subdir in S) close("data/"subdir"/"F[i]);
    }
}'
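If the awk open-file limit turns out to be a problem, something along these lines with paste might be worth a try. It is a different technique from the script above, not a rewrite of it: paste joins line N of every file it is given, so it also avoids loading everything into memory. build_matrix is a name made up for this sketch, it assumes the index/ and data/<subfolder>/ layout from the first post, and it assumes your paste can open one file per subfolder at once (I have not checked that on every system).

```shell
#!/bin/sh
# Sketch: build the matrix with paste instead of awk getline.
# Run from the directory that contains index/ and data/.
build_matrix() {
    # Header row: two empty cells, then one column per subfolder
    printf '\t'
    for d in data/*/; do
        d=${d%/}
        printf '\t%s' "${d##*/}"
    done
    printf '\n'

    for idx in index/*; do
        name=${idx##*/}
        # Collect the namesake files in the same subfolder order as the header
        set --
        for d in data/*/; do
            set -- "$@" "$d$name"
        done
        # Join the rows side by side, then prefix each row with the file name
        paste "$idx" "$@" | awk -v f="$name" '{ print f "\t" $0 }'
    done
}

# Usage: build_matrix > matrix.txt
```

Note that files and subfolders come out in shell glob order here (chr1, chr10, chr11, ... chr2), which can differ from the order find gives you.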

newbie83 · December 18, 2012, 6:25pm

The code runs fine with the sample, but with the actual data it just prints the first row, the subfolder names. Let me play around with the code a little and try to find out what's happening.

Chubler_XL · December 18, 2012, 6:39pm

Here is a little script I built to generate testing data (the solution also seems to work with my generated data).

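#!/bin/bash
# Generate random test data: 13 index files and 83 subfolders of matching data files
mkdir -p index data
for f in {1..13}
do
     c=$((RANDOM%8+2))    # 2 to 9 rows per file
     echo "chr$f - $c records"
     for((l=0;l<c;l++))
     do
         echo $((RANDOM%10)) >> "index/chr$f"
     done
     for s in {1..83}
     do
         [ "$f" -eq 1 ] && mkdir -p "data/Subfolder$s"
         for((l=0;l<c;l++))
         do
             echo $((RANDOM%10)) >> "data/Subfolder$s/chr$f"
         done
     done
done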
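The getline version relies on every data file having exactly as many rows as its namesake in index, so it may be worth confirming that on the real data too. A rough sketch of such a check follows; check_rows is a made-up name and it assumes the same index/ and data/<subfolder>/ layout.

```shell
#!/bin/sh
# Sketch: report any data file whose row count differs from its index namesake.
# Returns 0 if everything matches, 1 otherwise.
check_rows() {
    status=0
    for idx in index/*; do
        name=${idx##*/}
        want=$(( $(wc -l < "$idx") ))      # arithmetic strips any padding from wc
        for f in data/*/"$name"; do
            got=$(( $(wc -l < "$f") ))
            if [ "$got" -ne "$want" ]; then
                echo "$f: $got rows, expected $want"
                status=1
            fi
        done
    done
    return "$status"
}

# Usage: check_rows && echo "all row counts match"
```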

newbie83 · December 19, 2012, 9:45am

This is working fine now. I restarted the machine; sometimes these things work.

Awesome!!

Update: it took 4 hours to process the 22 GB!

SOLVED!