Extract columns where header matches a given string

flotsam · March 17, 2011, 4:05pm

Hi,

I'm having trouble pulling out columns where the headers match a file of key ID's I'm interested in and was looking for some help.

file1.txt

file2.txt

I Name 1 1 2 2 3 3 ... 34 34... 56 56... 84 84... 350 350... 
M 1      A A A A B B    A   A      A  B     B  A       A    A
M 2      A A B A B B    B   A      A  B     B  A       A    A
M 3      A A A A B B    A   A      A  B     A  A       B    A
M 4      A B A A B B    A   B      A  A     B  A       A    A

I would like to create a new file where the columns from file2 match ids from file 1, so:

I Name 34 34 56 56 84 84 350 350...9420 9420
M 1      A   A   A  B  B  A    A    A  ...  B     A
M 2      B   A   A  B  B  A    A    A  ...  B     A
M 3      A   A   A  B  A  A    B    A  ...  B     B

I initially tried the following:

% awk '{print $1}'  file1.txt | while read i; do awk 'NR=1;{for (j=0;j<=NF;j++) {if ($j == $i) break; }} {printf "%s ",$i}' file2.txt;done > file3.txt

But I'm obviously doing something wrong as I get the entire matrix "X" printed first then on subsequent lines I get the whole matrix printed twice

% vi file3.txt

X
X X
X X
X X
...
X

Thanks in advance.

vgersh99 · March 17, 2011, 4:45pm

nawk -f flot.awk file1.txt file2.txt

flot.awk:

FNR==NR {f1[$1];next}
FNR==1{
       for(i=1;i<=NF;i++) {
         f2=$i
         if ($i in f1)
            printf("%c%s", c++?OFS:"", $i)
       }
       print ""
       next
}
{
  b=0
  for(i=1;i<=NF;i++)
    if (f2 in f1)
       printf("%c%s", b++?OFS:"",$i)
  print ""
}

sk1418 · March 17, 2011, 5:26pm

it may be done with this trick too, you could try:

save following statement into a file, let's say, myawk.sh

awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'

then execute this statement: assume file1 & file2 are your two files

sh myawk.sh<file2 |awk 'NR==FNR{a[$1]=1;} NR>FNR{if ($1 in a )print $0}' file1 - |sh myawk.sh

flotsam · March 17, 2011, 5:38pm

Thanks for the quick reply unfortunately your example doesn't work.

I tried to run flot.awk and got the following

A single entry for the column header ~100 columns and then the entire matrix ~10000 columns * n rows below the header. What I was hoping to get was a matrix of 200 columns * n rows.

Any idea why your script may be only matching the header once instead of the two times I expect and why it's printing the entire matrix following that? The input file2 is separated by whitespace, does it need -F ' ' somewhere?

Thanks again

vgersh99 · March 17, 2011, 5:41pm

Hmm..... strange - worked with my sample files.
Could you provide the sample files you were using?
Tailor it down to be small enough and be representative as well - say 5 records in file1 and 20 columns in file2.

flotsam · March 17, 2011, 6:34pm

Thanks for the replies

@sk1418
I tried your example, but never received anything back from stdout after running the scripts. I've tried to run the first awk command inputing file2.txt and received some output but from that point on the computer just hangs.

@vgersh99
Here's an example of the files. And I appologise, I'm not sure if a file should be code tagged?

file1.txt:

file2.txt

I Name 021071 021071 021215 021215 104069 104069 124236 124236 212439 212439 109706 109706 215398 215398 113889 113889 272728 272728 360976 360976 
M 10080_CO B B B B B B B B B B B B B B B B A A B B
M 10068_CO A A A A A A B B A A A A A A A A A A A A
M 12187_ND B B B B B B B B B B B B B B B B B B B B
M GGA_0061 A A A A A B B B A B B B A B A A A B A B
M GGA_0013 A B A B A B A B A B A B A B B B A A A B
M GGA_0024 A A A B A A A A A A A B A A A A A B A A
M GGA_0025 B B B B B B B B B B B B B B B B B B B B

sk1418 · March 17, 2011, 6:41pm

hi, i tried your examples:

ArchT60 23:39:39 /tmp
kent$ sh tawk.sh<f2 |awk 'NR==FNR{a[$1]=1;} NR>FNR{if ($1 in a )print $0}' f1 - |sh tawk.sh 
I Name 104069 104069 109706 109706 113889 113889 
M 10080_CO B B B B B B 
M 10068_CO A A A A A A 
M 12187_ND B B B B B B 
M GGA_0061 A B B B A A 
M GGA_0013 A B A B B B 
M GGA_0024 A A A B A A 
M GGA_0025 B B B B B B

if the myawk.sh doesn't work there, try this:

awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' f2 |awk 'NR==FNR{a[$1]=1;} NR>FNR{if ($1 in a )print $0}' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'

actually the myawk.sh just for save some words. as you can see above statement, same part of codes duplicated at the beginning and end.

vgersh99 · March 17, 2011, 6:46pm

here's the output I get - looks fine to me - unless I'm missing something in your explanation:

I Name 104069 104069 109706 109706 113889 113889
M 10080_CO B B B B B B
M 10068_CO A A A A A A
M 12187_ND B B B B B B
M GGA_0061 A B B B A A
M GGA_0013 A B A B B B
M GGA_0024 A A A B A A
M GGA_0025 B B B B B B

flotsam · March 18, 2011, 4:30pm

Interesting. that it's worked for you both, but not for me. I've tried to run your examples as shell scripts or using the awk -F option and I get really weird results if any results at all? However, when I tried to run the stand alone awk code that sk1418 gave as a second example it works fine on the small test example. I'll try to run this on the larger file and look into why the scripts aren't working when I use .awk and .sh
Thank you very much for your help!

---------- Post updated 03-18-11 at 03:30 PM ---------- Previous update was 03-17-11 at 06:17 PM ----------

Hi,

I've been mucking around with your solutions to my problem today and have noticed that in both cases the output is ordered from greatest to smallest. Is there anyway to keep the original order from file1.txt? It's not exactly clear to me where the sorting of id's occurs?

Thanks again.

sk1418 · March 18, 2011, 4:52pm

hi, check my output in previous post, the output did keep the file1 order, didn't it? what did u mean " the output is ordered from greatest to smallest."?

pravin27 · March 18, 2011, 4:52pm

Could this help you?

 awk 'NR==FNR{a[$1]++;next} {if(FNR==1){for(i=1;i<=NF;i++){if(a[$i]){printf $i" ";b=$i}}}else{printf "\n";for(j=1;j<=NF;j++){if(b[j]) {printf $j" "}}}}END {printf "\n"}' file1 file2

flotsam · March 18, 2011, 5:44pm

Hi,
Given the following input files:

file1.txt

file2.txt

I Name    41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 41112086 41112086 41113889 41113889 41114003 41114003 41114656 41114656 41115162 41115162 41115561 41115561 41115979 41115979 41116248 41116248 41130607 41130607 41130611 41130611 41131240 41131240 41132167 41132167 41133800 41133800 41134462 41134462 41134623 41134623 42135335 42135335 42137664 42137664 42143490 42143490 42144170 42144170 42144339 42144339 42144650 42144650 42145389 42145389 42146088 42146088 42146090 42146090 42146879 42146879 42148154 42148154 43161219 43161219 43162207 43162207 43163011 43163011 43163878 43163878 43164830 43164830 43165768 43165768 43166228 43166228 43166330 43166330 43167557 43167557 43180900 43180900 43181675 43181675 43182287 43182287 43184255 43184255 43184401 43184401 
M 1080_COI    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 10668_CO    B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1218_ND    B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1546_CY    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1626_ND    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1637_ND    B B B B B B B B B B B B B B B B B B B B B B B B A A B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 5831_ND2    A A A A A A A A A A A A A A A A A A A A A A A A 0 0 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 
M 8472_CO2    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M GGal006    A A A A A B A B A B A A A A A B A B A A A B A B A A A B A B A A A B B B B B A B A A B B A B A A A B A B A A A A B B A B A B A B A A B B A B A B B B A B A B A A A B A B A B A A

I get the following output:

@sk1418 & vgersh99

41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 43162207 43162207 43163011 43163011 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
A A A A A A A A A A A A 
B B B B B B B B B B B B 
A A A A A B A B A B A A

@pravin27

41110441 41110441 41111192 41111192 43162207 43162207 43163011 43163011 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
A A A A A A A A 
B B B B B B B B 
A B A B A B A A

In the first case it seems like it's using file2 to order the output? Any thoughts on how I can keep the output in the order of file1?

vgersh99 · March 18, 2011, 5:47pm

file1 defines the list of columns to be matched.
file2's header lines defines the column names.
The script simply matches the the list from file1 to the header names in file2 in WHATEVER order the columns are defined in files2.

There's no "sorting" that's being done. The 'valid/matched' columns from file2 are being output in the SAME original order they appeared in file2.

flotsam · March 18, 2011, 6:23pm

Thanks vgersh99,

I think I realised that and pointed it out in the last line of my post. I hadn't anticipated the problem when I originally posted, but require output to match the order of file1

sk1418 · March 18, 2011, 6:32pm

flotsam:

Hi,
Given the following input files:

file1.txt
41109297 
41109706 
43162207
41109808
41109377
41110441
41111192
43163011
43162367
file2.txt
I Name    41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 41112086 41112086 41113889 41113889 41114003 41114003 41114656 41114656 41115162 41115162 41115561 41115561 
In the first case it seems like it's using file2 to order the output? Any thoughts on how I can keep the output in the order of file1?

hi there, i think i understood what you mean. You want to get those relevant columns AND change the column order against the order defined in file1. E.g in your pasted output, 43162207 column should be before 41110441. right?

then take a look this, if it is what you want. (with output)

ArchT60 /tmp/test
kent$ awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' f2 |awk 'BEGIN{i=j=m=n=1;}
NR==FNR{a[$1]=$1;aa[i++]=$1;} 
NR>FNR{if ($1 in a ){b[n++]=$0;bb[m++]=$1;}} 
END{
la=length(a); lb=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa) print b[j]; } } 
}' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' 
b=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa) print b[j]; } } }' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' 

41109297 41109297 41109706 41109706 43162207 43162207 41110441 41110441 41111192 41111192 43163011 43163011 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
A A A A A A A A A A A A 
B B B B B B B B B B B B 
A A A A A B A B A B A A

flotsam · March 18, 2011, 6:49pm

Hi,
Given the following input files:

file1.txt

file2.txt

I Name    41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 41112086 41112086 41113889 41113889 41114003 41114003 41114656 41114656 41115162 41115162 41115561 41115561 41115979 41115979 41116248 41116248 41130607 41130607 41130611 41130611 41131240 41131240 41132167 41132167 41133800 41133800 41134462 41134462 41134623 41134623 42135335 42135335 42137664 42137664 42143490 42143490 42144170 42144170 42144339 42144339 42144650 42144650 42145389 42145389 42146088 42146088 42146090 42146090 42146879 42146879 42148154 42148154 43161219 43161219 43162207 43162207 43163011 43163011 43163878 43163878 43164830 43164830 43165768 43165768 43166228 43166228 43166330 43166330 43167557 43167557 43180900 43180900 43181675 43181675 43182287 43182287 43184255 43184255 43184401 43184401 
M 1080_COI    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 10668_CO    B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1218_ND    B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1546_CY    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1626_ND    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1637_ND    B B B B B B B B B B B B B B B B B B B B B B B B A A B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 5831_ND2    A A A A A A A A A A A A A A A A A A A A A A A A 0 0 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 
M 8472_CO2    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M GGal006    A A A A A B A B A B A A A A A B A B A A A B A B A A A B A B A A A B B B B B A B A A B B A B A A A B A B A A A A B B A B A B A B A A B B A B A B B B A B A B A A A B A B A B A A

I get the following output:

@sk1418 & vgersh99

41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 43162207 43162207 43163011 43163011 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
A A A A A A A A A A A A 
B B B B B B B B B B B B 
A A A A A B A B A B A A

@pravin27

41110441 41110441 41111192 41111192 43162207 43162207 43163011 43163011 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
A A A A A A A A 
B B B B B B B B 
A B A B A B A A

In the first case it seems like it's using file2 to order the output? Any thoughts on how I can keep the output in the order of file1?

---------- Post updated at 05:49 PM ---------- Previous update was at 05:41 PM ----------

Here is a python script that will do what I was after if anyone is interested, it's cumbersome and will probably be incredibly slow, but should work

#!/usr/local/python
import sys


def argChk(input):
        warn='Takes id file and extracts matching columns from file 2 if headers match... USAGE: ./grabCol.py file1 file2 outfile'
        if '-h' in input:
                print warn
        elif len(input) == 3:
                print warn
        else:
                return input

def getPos(id,col):
        coLst=col.split()
        posLst=[]
        for i in id:
                for j in range(0,len(coLst),1):
                        if i.strip() == coLst[j]:
                                posLst.append(j)
        return posLst


############################MAIN#############################
if __name__ == '__main__':
        args=argChk(sys.argv)
        idFile=open(args[1],'rU')
        id=idFile.readlines()
        idFile.close()
        colFile=open(args[2],'rU')
        col=colFile.readlines()
        colFile.close()
        oFile=open(args[3],'w')
        posLst=getPos(id,col[0])
        oStr=''
        for i in col:
                iLst=i.split()
                for j in posLst:
                        oStr=oStr+iLst[j]+' '
                oStr=oStr+'\n'
        oFile.write(oStr)
        oFile.close()

sk1418 · March 18, 2011, 6:50pm

hi, have you tried my solution? does it work there? the post before your python post.

btw, your python codes lost all indents.

flotsam · March 18, 2011, 7:06pm

@sk1418
Yeah that's exactly what I was after! Sorry I didn't articulate it well when I first posted. I tried to just copy and paste your example, but am coming up with an error message.

Is there something missing on the second line around b=length?

The python script seems to be doing it's job, but I think it's going to be very slow and these are some very large files.. Thanks again for all your help.

awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' file2.txt |awk 'BEGIN{i=j=m=n=1;}NR==FNR{a[$1]=$1;aa[i++]=$1;} NR>FNR{if ($1 in a ){b[n++]=$0;bb[m++]=$1;}} END{la=length(a); lb=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa) print b[j]; } } }' file1.txt - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'
b=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa) print b[j]; } } }' file1.txt - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'


-bash: syntax error near unexpected token `('

sk1418 · March 18, 2011, 7:09pm

ok, i just reformatted a little bit on my code, maybe something went wrong then. I paste again, the one line version. looks ugly but should work. just try it:

awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' f2 |awk 'BEGIN{i=j=m=n=1;}NR==FNR{a[$1]=$1;aa[i++]=$1;} NR>FNR{if ($1 in a ){b[n++]=$0;bb[m++]=$1;}} END{la=length(a); lb=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa) print b[j]; } } }' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'

flotsam · March 18, 2011, 7:18pm

Works perfectly thank you very much! I'll race it against the python script!