To get an output by combining fields from two different files

smriti_shridhar · September 26, 2008, 6:56am

Hi guys,
I couldn't find a solution to this problem. If anyone knows one, please help me out.
Your guidance is highly appreciated.

I have two files -

FILE1 has the following 7 columns (the - has been added only to make the columns easier to see; in the real file the columns are separated by a single space)

155.34 - leg - 1 - 344 - TC200232 - 292 - 930
152.88 - leg - 1 - 344 - TC215306 - 2 - 743
123.94 - leg - 1 - 344 - TC210135 - 423 - 1148

FILE2
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC210135.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR

I want an output like this - FILE3 - which is the same as FILE2, except that each line starting with '>' should also contain (region 292 to 930 of SEQ), where 292 and 930 are columns 6 and 7 of the FILE1 line with the same ID, e.g. TC200232 (present in both files)

>TC200232.pep (region 292 to 930 of SEQ)
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC210135.pep (region 423 to 1148 of SEQ)
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep (region 2 to 743 of SEQ)
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR

radoulov · September 26, 2008, 7:06am

Use nawk or /usr/xpg4/bin/awk on Solaris:

awk 'NR == FNR { _[$5] = $6" to "$7; next }
($2 in _ && $0 = $0" (region "_[$2]" of SEQ)")||1
' file1 FS='[>.]' file2

smriti_shridhar · September 26, 2008, 7:21am

I must say it was brilliant.. Thanks..

I'll be happy if you can explain the code a little bit.

Thanks in advance.

radoulov · September 26, 2008, 11:39am

I'll try.
While reading the first non-empty input file (NR == FNR; check the man pages for the meaning of these built-in AWK variables), build the associative array named _ : the key is the value of $5, and the array value is the value of $6, the string " to " and the value of $7 joined together.
While reading the next input file, if $2 is one of the keys of the _ associative array (FS='[>.]' is set for file2, so $2 of a header like >TC200232.pep is TC200232), append to the current record the string " (region ", the value of _[$2] (remember the _ array?) and the string " of SEQ)". Either way the record is then printed: || 1 is a logical OR with 1, which in the AWK language means true and therefore triggers the default action, which is to print the current record.
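
For anyone who finds the one-liner dense, the same logic can be written out long-hand. This is only a sketch: the names region and id, and the use of substr()/sub() in place of the FS='[>.]' trick, are choices made here and not part of the original answer. It assumes file1 is space-separated with the ID in column 5, as described in the question, and it runs in a scratch directory created with mktemp.

```shell
# Work in a scratch directory so no existing files are overwritten.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample data from the question (the real file1 is space-separated).
cat > file1 <<'EOF'
155.34 leg 1 344 TC200232 292 930
152.88 leg 1 344 TC215306 2 743
123.94 leg 1 344 TC210135 423 1148
EOF

cat > file2 <<'EOF'
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC210135.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
EOF

awk '
NR == FNR {                      # true only while reading the first file
    region[$5] = $6 " to " $7    # ID -> "start to end"
    next                         # skip the rules below for file1 lines
}
/^>/ {                           # header line in file2
    id = substr($0, 2)           # drop the leading ">"
    sub(/\..*/, "", id)          # drop ".pep", leaving the bare ID
    if (id in region)
        $0 = $0 " (region " region[id] " of SEQ)"
}
{ print }                        # print every file2 line, changed or not
' file1 file2 > file3

cat file3
```

Because the lookup is by ID, the order of the lines in file1 does not matter, but each ID must appear only once there; a repeated ID would simply overwrite the earlier region.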

Hope this helps.

smriti_shridhar · October 3, 2008, 7:15am

Thanks for taking the trouble to explain. It has really helped me.

smriti_shridhar · October 22, 2008, 1:22am

Hi,

There is a problem: I'm getting the same header (line beginning with >) for two records in file2.

But this time the order of the IDs (e.g. TC200232) is the same in both files. So I just want to add columns 6 and 7 of file1 to the headers of file2 one after the other, without matching the IDs, as shown in the output file.

FILE1

155.34 - leg - 1 - 344 - TC200232 - 292 - 930
152.88 - leg - 1 - 344 - TC200232 - 2 - 743
123.94 - leg - 1 - 344 - TC215306 - 423 - 1148

FILE2

>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC200232.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR

OUTPUT FILE

>TC200232.pep (region 292 to 930 of SEQ)
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC200232.pep (region 2 to 743 of SEQ)
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep (region 423 to 1148 of SEQ)
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR

Thanks

radoulov · October 22, 2008, 4:52am

awk 'NR == FNR { _[++c] = $6" to "$7; next }
(/^>/ && $0 = $0" (region "_[++c]" of SEQ)")||1
' file1 FS='[>.]' c=0 file2
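
A long-hand sketch of the same positional approach may make the two counters easier to follow. The variable names (region, n, i) are illustrative and not from the answer above. It assumes the n-th line of file1 belongs to the n-th > header of file2, as stated in the question, and that file1 is space-separated. Headers beyond the last file1 line are left unchanged.

```shell
# Work in a scratch directory so no existing files are overwritten.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample data from the question; note the repeated ID TC200232.
cat > file1 <<'EOF'
155.34 leg 1 344 TC200232 292 930
152.88 leg 1 344 TC200232 2 743
123.94 leg 1 344 TC215306 423 1148
EOF

cat > file2 <<'EOF'
>TC200232.pep
AYNGFNNSNIIRDGVAIINSSGALKLTNRSYNVIGHAFHPNPVPIFNSSTKNVTSFSTYF
VFAIVPLEKTSGGFGFA
>TC200232.pep
GFGDFGKDSNFESQIALYGDAKVVNGGIQMSGSMGFSAGRILNKKPFKLIDGNPRKMVSF
SLHFVFSLSRENGDGFAFVMVPIGYPFDVFDGGSFGLLGNRKMKFLAVEFDTFMDEKYGD
VNDNHVGVDLSS
>TC215306.pep
PRLKQDLTLVGSVIVSDEKKSVQIPDPEREGDDLKHLVGRAIYSSPIR
EOF

awk '
NR == FNR {                      # first file: store regions in line order
    region[++n] = $6 " to " $7
    next
}
/^>/ && ++i <= n {               # the i-th header gets the i-th region
    $0 = $0 " (region " region[i] " of SEQ)"
}
{ print }                        # print every file2 line, changed or not
' file1 file2 > file3

cat file3
```

Using separate counters for the two files (n while storing, i while reading headers) avoids the need to reset a shared counter with c=0 on the command line.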

summer_cherry · October 22, 2008, 6:03am

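# Assumes file1 literally contains the " - " separators shown in the post
# (skip the sed step and -F"-" if the real file is space-separated).
# Lookup is by ID, so this covers the unique-ID case only.
sed 's/ //g' file1 > file1.tmp
nawk -F"-" '{
if(NR==FNR)
	arr[$5]=sprintf(" (region %s to %s of SEQ)",$6,$7)
else
	print $0""arr[$2]
}' file1.tmp FS='[>.]' file2
rm file1.tmp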

smriti_shridhar · October 22, 2008, 6:21am

Thanks radoulov.

and

Thanks summer_cherry.