Extract columns based on header

aec · April 22, 2013, 2:06pm

Hi to all,

I have two files. File1 has no header, two columns:

sample1 A
sample2 B
sample3 B
sample4 C
sample5 A
sample6 D
sample7 D

File2 has a header, except for the first 3 columns (chr,start,end). "sample1" is the header for the 4th ,5th ,6th columns, "sample2" is the header for 7th ,8th 9th columns and so on.

sample1 sample2 sample3 sample4 sample5 sample6 sample7
chr start end a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a6 b6 c6 a7 b7 c7
chr start end d1 e1 f1 d2 e2 f2 d3 e3 f3 d4 e4 f4 d5 e5 f5 d6 e6 f6 d7 e7 f7

If I would like to extract from File2 just those samples that in File1 have B and D, the output would be:

sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7 
chr start end d2 e2 f2 d3 e3 f3 d6 e6 f6 d7 e7 f7

Thanks,

---------- Post updated at 08:06 PM ---------- Previous update was at 06:32 PM ----------

thanks for editing my post and write it in code tags

Don_Cragun · April 22, 2013, 5:10pm

aec:

Hi,

I have a file like (no header, two columns):
sample1 A
sample2 B
sample3 B
sample4 C
sample5 A
sample6 D
sample7 D
....
I have another file like (with header, multiple rows and cols, each sample has 3 values a,b,c):

(no header first 3 cols)
sample1 sample2 sample3 
chr start end a1 b1 c1 a2 b2 c2 a3 b3 c3 ....
chr start end ....
If second field in file1 matches B or D, then extract columns with that sample.

Example of output:
(no header first 3 cols)
sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7 
chr start end ....
Thanks,

---------- Post updated at 08:06 PM ---------- Previous update was at 06:32 PM ----------

thanks for editing my post and write it in code tags

I don't understand your sample. Your "another file" is described two ways:

(with header, multiple rows and cols, each sample has 3 values a,b,c)

and

(no header first 3 cols)

and I don't see any way to turn:

sample2 B
sample3 B
sample6 D
sample7 D

into:

sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7 
chr start end ....

nor why you describe this as:

(no header first 3 cols)

when the 1st line sort of looks like a header and the 2nd line has 15 columns.

The .... doesn't appear to be a literal string, but I can't figure out what additional data you want if the .... is intended to be an ellipsis indicating that more data is to follow.

Please try again to clearly explain how the input files are to be transformed into the output you want.

hanson44 · April 22, 2013, 5:18pm

Yes, I also find it hard to understand exactly what you are trying to do. Please try again.

aec · April 23, 2013, 9:51am

Hope now is clearer

Don_Cragun · April 23, 2013, 3:45pm

Yes, it is much clearer now.

I think the following shell and awk script does what you want. It was written and tested using a Korn shell, but should work with any other POSIX conforming shell (such as bash). If you are using a Solaris/SunOS system, change the script to use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk :

#!/bin/ksh
# Usage: tester key...
# This script will look for any "key" given on the command line as a value in
# field 2 of File1 and then extract data from File2 based on the headings
# specified by field 1 of File1.
if [ $# -lt 1 ]
then    echo "Usage; ${0##*/} key..." >&2
        exit 1
fi
errf=stderr.$$
if ! printf "%s\n" "$@" | awk -v errf="$errf" '
FNR == 1 {
        # Increment input file count.
        # f == 1 => List of keys.
        # f == 2 => File1
        # f == 3 => File2
        f++
}
f == 1 {keys[$0]        # Save the keys
        next
}
f == 2 && $2 in keys {  # Save headers to be extracted from File2.
        headers[$1]
        next
}
f == 3 && FNR == 1 {    # 1st line of File2; create list of sets of fields to
                        # extract from remaining lines
        fields[1]       # Always keep 1st three fields
        fields[2]
        fields[3]
        for(i = 1; i<= NF; i++) {
                if($i in headers) {
                        # We have a matching header, add corresonding fields to
                        # the extraction list.
                        fields[3 * i + 1]
                        fields[3 * i + 2]
                        fields[lf = 3 * i + 3]  # Save last field #.
                        printf("%s%s", nh++ ? " " : "", $i)
                }
        }
        if(nh == 0) {   # No headers found.  Print diagnostic message.  It is
                        # being saved in a file since there is no portable way
                        # to direct output from awk to stderr.  The calling
                        # shell script will copy the message to stderr and add
                        # the script name to the start of the diagnostic.
                print ": Headings identified by keys not found in File2." > errf
                exit 1
        }
        print ""        # Add a trailing newline to the output file headings.
        next
}
{       for(i = 1; i <= NF; i++)        # Print selected fields from remaining
                if(i in fields) # lines in File2.
                        printf("%s%s", $i, i == lf ? "\n" : " ")
}' - File1 File2 > File3
then    # awk failed, print diagnostic message.
        printf "%s" ${0##*/} >&2
        cat $errf >&2
        rm $errf
        exit 1
fi

If you save the above code in a file named tester , make it executable with:

chmod +x tester

and execute it with the desired keys ( B and D ) in your example:

./tester B D

or:

./tester D B

when File1 and File2 contain the data specified in your example; File3 will contain:

sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7
chr start end d2 e2 f2 d3 e3 f3 d6 e6 f6 d7 e7 f7

which matches the output you said you wanted except that there is no space at the end of the 2nd line of output here even though there was a trailing space at the end of the 2nd line of File3 in you example.

I hope the comments are sufficient to explain how it works. If it doesn't work or you still have questions, let us know what diagnostics are produced or what shell or awk commands you can't decipher.