I have two files. File1 has no header, two columns:
sample1 A
sample2 B
sample3 B
sample4 C
sample5 A
sample6 D
sample7 D
File2 has a header, except for the first 3 columns (chr,start,end). "sample1" is the header for the 4th ,5th ,6th columns, "sample2" is the header for 7th ,8th 9th columns and so on.
when the 1st line sort of looks like a header and the 2nd line has 15 columns.
The .... doesn't appear to be a literal string, but I can't figure out what additional data you want if the .... is intended to be an ellipsis indicating that more data is to follow.
Please try again to clearly explain how the input files are to be transformed into the output you want.
I think the following shell and awk script does what you want. It was written and tested using a Korn shell, but should work with any other POSIX conforming shell (such as bash). If you are using a Solaris/SunOS system, change the script to use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk :
#!/bin/ksh
# Usage: tester key...
# This script will look for any "key" given on the command line as a value in
# field 2 of File1 and then extract data from File2 based on the headings
# specified by field 1 of File1.
if [ $# -lt 1 ]
then echo "Usage; ${0##*/} key..." >&2
exit 1
fi
errf=stderr.$$
if ! printf "%s\n" "$@" | awk -v errf="$errf" '
FNR == 1 {
# Increment input file count.
# f == 1 => List of keys.
# f == 2 => File1
# f == 3 => File2
f++
}
f == 1 {keys[$0] # Save the keys
next
}
f == 2 && $2 in keys { # Save headers to be extracted from File2.
headers[$1]
next
}
f == 3 && FNR == 1 { # 1st line of File2; create list of sets of fields to
# extract from remaining lines
fields[1] # Always keep 1st three fields
fields[2]
fields[3]
for(i = 1; i<= NF; i++) {
if($i in headers) {
# We have a matching header, add corresonding fields to
# the extraction list.
fields[3 * i + 1]
fields[3 * i + 2]
fields[lf = 3 * i + 3] # Save last field #.
printf("%s%s", nh++ ? " " : "", $i)
}
}
if(nh == 0) { # No headers found. Print diagnostic message. It is
# being saved in a file since there is no portable way
# to direct output from awk to stderr. The calling
# shell script will copy the message to stderr and add
# the script name to the start of the diagnostic.
print ": Headings identified by keys not found in File2." > errf
exit 1
}
print "" # Add a trailing newline to the output file headings.
next
}
{ for(i = 1; i <= NF; i++) # Print selected fields from remaining
if(i in fields) # lines in File2.
printf("%s%s", $i, i == lf ? "\n" : " ")
}' - File1 File2 > File3
then # awk failed, print diagnostic message.
printf "%s" ${0##*/} >&2
cat $errf >&2
rm $errf
exit 1
fi
If you save the above code in a file named tester , make it executable with:
chmod +x tester
and execute it with the desired keys ( B and D ) in your example:
./tester B D
or:
./tester D B
when File1 and File2 contain the data specified in your example; File3 will contain:
which matches the output you said you wanted except that there is no space at the end of the 2nd line of output here even though there was a trailing space at the end of the 2nd line of File3 in you example.
I hope the comments are sufficient to explain how it works. If it doesn't work or you still have questions, let us know what diagnostics are produced or what shell or awk commands you can't decipher.