Hi guys,
I couldn't find a solution to this problem. If anyone knows one, please help me out.
Your guidance is highly appreciated.
I have two files.
FILE1 has the following 7 columns (the " - " has been added only to make the columns easier to see; in the real file the columns are separated by a single space)
155.34 - leg - 1 - 344 - TC200232 - 292 - 930
152.88 - leg - 1 - 344 - TC215306 - 2 - 743
123.94 - leg - 1 - 344 - TC210135 - 423 - 1148
FILE2
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC210135.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
I want an output like this, FILE3. It is the same as FILE2, but each line starting with '>' should also contain (region 292 to 930 of SEQ), where 292 and 930 are columns 6 and 7 of the FILE1 row with the same ID, e.g. TC200232 (present in both files).
>TC200232.pep (region 292 to 930 of SEQ)
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC210135.pep (region 423 to 1148 of SEQ)
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep (region 2 to 743 of SEQ)
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
Use nawk or /usr/xpg4/bin/awk on Solaris:
awk 'NR == FNR { _[$5] = $6" to "$7; next }
($2 in _ && $0 = $0" (region "_[$2]" of SEQ)")||1
' file1 FS='[>.]' file2
I must say it was brilliant. Thanks.
I'd be happy if you could explain the code a little bit.
Thanks in advance.
I'll try.
While reading the first non-empty input file (NR == FNR; check the man pages for the meaning of these internal AWK variables), build the associative array named _: the $5 value is the key, and the array value is the value of $6, the string " to " and the value of $7 joined together.
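A quick way to see what NR == FNR does, if the man page is not at hand: the sketch below prints both counters for two throwaway files (a.txt, b.txt and nr_fnr.out are made-up names, not part of the problem above). NR keeps counting across all input while FNR restarts at 1 for each file, so the two are equal only while the first non-empty file is being read.

```shell
# Two tiny throwaway files; the names are made up for this demo.
printf 'x\ny\n' > a.txt
printf 'p\nq\nr\n' > b.txt

# Print the file name and both counters for every record.
# NR  = record number across all input files
# FNR = record number within the current file
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' a.txt b.txt > nr_fnr.out
cat nr_fnr.out
```

For the lines of b.txt, NR runs from 3 to 5 while FNR runs from 1 to 3, so the NR == FNR block is skipped for them.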
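While reading the next input file (where FS='[>.]' splits a header such as >TC200232.pep so that $2 is the ID), if $2 is one of the keys of the _ associative array, append to the current record the string " (region ", the value of _[$2] (remember the _ array?) and the string " of SEQ)". The modified record is printed because the assignment itself evaluates to a non-empty string, which counts as true. Otherwise just print the record: || 1 is a logical OR with 1, which in the AWK language means true and therefore triggers the default action, which is to print the current record.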
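If the compact form is hard to read, here is roughly the same logic spelled out with an explicit if and a readable array name. This is only a sketch: region, demo1.txt, demo2.txt and demo3.txt are names picked for the illustration, and the sample data is copied from the first post with single spaces between the columns.

```shell
# Sample data from the first post (demo file names are made up).
cat > demo1.txt <<'EOF'
155.34 leg 1 344 TC200232 292 930
152.88 leg 1 344 TC215306 2 743
123.94 leg 1 344 TC210135 423 1148
EOF

cat > demo2.txt <<'EOF'
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC210135.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
EOF

awk '
# First file: remember "start to end" under the ID found in column 5.
NR == FNR { region[$5] = $6 " to " $7; next }

# Second file: FS is now [>.], so a header ">TC200232.pep" gives $2 = "TC200232".
# Sequence lines contain neither ">" nor ".", so their $2 is empty.
{
    if ($2 in region)
        $0 = $0 " (region " region[$2] " of SEQ)"
    print
}
' demo1.txt FS='[>.]' demo2.txt > demo3.txt
cat demo3.txt
```

The FS='[>.]' argument sits between the two file names on purpose: awk processes it only after it has finished the first file, so the first file is still split on whitespace.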
Hope this helps.
Thanks for making a tough effort. It has really helped me.
Hi,
There is a problem: I'm getting the same header (line beginning with >) for two records in file2.
This time, though, the order of the IDs (e.g. TC200232) is the same in both files. So I just want to add columns 6 and 7 of file1 to the headers of file2 one after the other, without matching the IDs, as shown in the output file.
FILE1
155.34 - leg - 1 - 344 - TC200232 - 292 - 930
152.88 - leg - 1 - 344 - TC200232 - 2 - 743
123.94 - leg - 1 - 344 - TC215306 - 423 - 1148
FILE2
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC200232.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
OUTPUT FILE
>TC200232.pep (region 292 to 930 of SEQ)
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC200232.pep (region 2 to 743 of SEQ)
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep (region 423 to 1148 of SEQ)
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
Thanks
awk 'NR == FNR { _[++c] = $6" to "$7; next }
(/^>/ && $0 = $0" (region "_[++c]" of SEQ)")||1
' file1 FS='[>.]' c=0 file2
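Spelled out, the positional version could look like the sketch below. It is an illustration rather than a drop-in replacement: it uses two separate counters (n for file1 rows, h for headers; both names are made up here) instead of resetting c on the command line, and the END check that warns about a count mismatch is an extra that the one-liner above does not have. pos1.txt, pos2.txt and pos.out are hypothetical file names; the data is the sample from the post above.

```shell
# Sample data from the second question (file names are made up).
cat > pos1.txt <<'EOF'
155.34 leg 1 344 TC200232 292 930
152.88 leg 1 344 TC200232 2 743
123.94 leg 1 344 TC215306 423 1148
EOF

cat > pos2.txt <<'EOF'
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC200232.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
EOF

awk '
# First file: store the region of the n-th row under the index n.
NR == FNR { region[++n] = $6 " to " $7; next }

# Second file: the h-th header gets the region of the h-th row.
/^>/ {
    ++h
    if (h in region)
        $0 = $0 " (region " region[h] " of SEQ)"
}
{ print }

# Extra safety net (not in the one-liner): complain if the counts differ.
END {
    if (h != n)
        print "warning: " n " rows but " h " headers" | "cat 1>&2"
}
' pos1.txt pos2.txt > pos.out
cat pos.out
```

Because nothing is looked up by ID here, the result is only right as long as both files really list the records in the same order.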
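nawk '
# file1: columns are separated by single spaces, so the default FS works
# and no temporary copy is needed.
NR == FNR {
    arr[$5] = sprintf(" (region %s to %s of SEQ)", $6, $7)
    next
}
# file2 header, e.g. ">TC200232.pep": cut the ID out of the line itself
# instead of changing FS mid-stream (a new FS only applies to later records).
/^>/ {
    id = substr($0, 2)
    sub(/\..*/, "", id)
    if (id in arr)
        $0 = $0 arr[id]
}
{ print }
' file1 file2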