Extract data according to keys from filename mentioned in file

ritakadm · March 3, 2015, 11:49am

Hello experts, I want to join a file with files whosE names are mentioned in one of the columns of the same file.
File 1

t1,a,b,file number 1
t1,a,c,file number 1
t2,c,d,file number 2
t2,c,e,file number 2
t2,c,f,file number 2
t2,c,g,file number 2
t3,e,f,file number 3

file number 1

t,h1,h2,c1,c2,c3
t1,a,b,v1,v2,v3
t1,x,y,g1,g2,g3
t1,a,b,v4,v5,v6
t1,x,y,h1,h2,h3
t1,a,b,v7,v8,v9
t1,a c,t1,t2,t3
t1,a,c,t2,t3,t4

file number 2

t,h1,h2,c11,c12,c13
t2,c,d,v11,v12,v13
t2,x,y,g11,g12,g13
t2,c,e,v14,v15,v16
t2,c,f,h11,h12,h13
t2,c,g,v17,v18,v19

file number 3

t,h1,h2,c12,c22,c23
t3,e,f,v31,v32,v33
t3,x,y,g11,g12,g13
t3,e,f,v34,v35,v36
t3,e,f,h31,h32,h33
t3,e,f,v37,v38,v39

I want to read each line of File, make a key of cols 2 and 3, and then extract data for the key from the filename mentioned in the 4th column. The filenames have spaces as I have shown. Also I want the headers in the output.

out_t1

t,h1,h2,c1,c2,c3
t1,a,b,v1,v2,v3
t1,a,b,v4,v5,v6
t1,a,b,v7,v8,v9
t1,a c,t1,t2,t3
t1,a,c,t2,t3,t4

out_t2

t,h1,h2,c11,c12,c13
t2,c,d,v11,v12,v13
t2,c,e,v14,v15,v16
t2,c,f,h11,h12,h13
t2,c,g,v17,v18,v19

out_t3

t,h1,h2,c12,c22,c23
t3,e,f,v31,v32,v33
t3,e,f,v34,v35,v36
t3,e,f,h31,h32,h33
t3,e,f,v37,v38,v39

This is what I want to do:-

awk -F, 'NR==FNR {k[$2$3]=$4;next} $2$3 in k { print }' OFS="," file1  $4  >>  out_$1

Not sure how to take out the 4th column and 1st column from first file, also to extract the header.

Please assist.

Scrutinizer · March 3, 2015, 1:07pm

I am assuming there is a missing comma at line 7 in file number one. Try:

awk -F, '
  NR==FNR {
    A[$0]
    next
  }
  FNR==1 {
    close(f)
    f="out_t" ++c
  }
  FNR==1 || $1 FS $2 FS $3 FS FILENAME in A {
    print>f
  }
' "file 1" "file number "*

RudiC · March 3, 2015, 1:39pm

Try also

awk     'FNR==1         {FNo++
                         if (FNo < 4) print > "out_t" substr(FILENAME, length(FILENAME))}

         FNo < 4        {DT[NR]=$0","FILENAME
                         MX=NR
                         next 
                        }

                        {for (i=1; i<=MX; i++)
                                if (DT ~ $2"[, ]"$3".*"$NF"$")
                                        {sub (/,[^,]*$/, "", DT)
                                         print DT > "out_t" substr ($4, length($4))
                                        }
                        }
        ' FS="," "file number "[123] file1

ritakadm · March 3, 2015, 3:11pm

Many thanks, and sorry about the missing comma.. Your code works great with the example files. My actual file names do not have pattern like "File number N", can I just point to the directory and replace the "file number "* by the absolute path of the directory like "./*" ?

Also, the output file name is generated from the first column of File 1, so I have modified your autoincrement file name to pick up from the first column..

Please let me know if you see any syntax or logical mistakes.

awk -F, '
  NR==FNR {
    A[$0]
    B[$2 FS $3]=$1
    next
  }
  FNR==1 || $1 FS $2 FS $3 FS FILENAME in A {
    print> out_B[$2 FS $3]
  }
' "file 1" "/DIRpath/*"