Hi ctsgnb and zaxxon, thanks a lot for replying. What I need is to group the IDs based on sequence identity, for example:
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
The output should look like this:
A D E
B C
Note: this means that sequences A, D and E belong together because they share more than 80% identity. In the same way, B and C are close because of their identity.
Sorry for my bad English! :o
If the line order of the output doesn't matter, you can try something like this:
(you may want to change 0.8 to 80, depending on the format of your input file)
awk 'NR>1&&$3>=0.8{A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A[i]}' yourfile
$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
$ awk 'NR>1&&$3>=80{A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A[i]}' f2
A D E
B C
D E
$
Thanks for taking the time to reply to me, I am very grateful.
Now it looks better! But I have to ignore the third line (D E): my original file is very, very big, so I would get repeated information in my output.
awk 'NR>1&&$3>=80{A[$1]=$1;B[A[$1]]=(B[A[$1]]?B[A[$1]]:$1)" "$2}END{for(i in A) print B[A[i]]}' test.tttt
And this is the output:
A D E
B C
D E
--> D E is not a separate group; it is normally part of group 1 (A D E). I don't know if I'm being clear.
What I want to do afterwards is take the IDs of every group and use BioPerl to look up the corresponding FASTA files in a database. So I need an output with just two lines (for this example).
Thanks
You can get all the individual pairs that have an identity of 80 or more (each pair listed once) with the following:
$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
A B 70
A C 50
A D 90
A E 40
$ awk 'NR>1&&$3>=80{i=$1" "$2;j=$2" "$1;t=i<j?i:j;C[t]}END{for(k in C) print k}' f2
A D
A E
B C
D E
$
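NOTE that this code assumes that an A D association is the same as a D A association; the two IDs are simply displayed in ascending order.
Consider the following example (f3 has no header line, so NR>1 skips its first row, A B 10; that is harmless here because 10 is below the threshold):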
$ cat f3
A B 10
B A 80
C D 70
E D 90
D B 80
A D 10
D A 93
$ awk 'NR>1&&$3>=80{i=$1" "$2;j=$2" "$1;t=i<j?i:j;C[t]}END{for(k in C) print k}' f3
A B
A D
B D
D E
$
---------- Post updated at 04:48 PM ---------- Previous update was at 04:24 PM ----------
You can also try the following code:
$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
A B 70
A C 50
A D 90
A E 40
$ awk 'NR>1&&$3>=80{x=$1" "$2;for(i in A) {if (A[i]~x) next};A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A[i]}' f2
A D E
B C
---------- Post updated at 05:00 PM ---------- Previous update was at 04:48 PM ----------
To avoid the same $2 appearing more than once within a group, you can also try:
awk 'NR>1&&$3>=80{A[$1]=(A[$1]?A[$1]:$1)(A[$1]~$2?z:" "$2)}END{for(i in A) print A[i]}' yourfile
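One caveat with the one-liner above: the `A[$1]~$2` test is a regular-expression match, so an ID that happens to be a substring of an ID already in the group (say chr1 inside chr10) would be treated as already present and silently dropped. A minimal sketch of an exact-match variant follows; the file name pairs.txt, the chrN IDs and the `seen` array are made up for illustration and are not from the original posts.

```shell
# Hypothetical sample data: chr1 is a substring of chr10.
cat > pairs.txt <<'EOF'
ID1 ID2 Identity
chr2 chr10 90
chr2 chr1 85
chr2 chr10 88
EOF

# seen[ID1,ID2] records each exact pair, so no regex match is involved
# and a repeated pair is appended only once.
out=$(awk 'NR>1 && $3>=80 && !(($1,$2) in seen) {
    seen[$1,$2]
    A[$1] = (A[$1] ? A[$1] : $1) " " $2
}
END { for (i in A) print A[i] }' pairs.txt)
echo "$out"
```

On this sample the sketch prints `chr2 chr10 chr1`, whereas the regex-based test would print only `chr2 chr10`, because "chr1" matches inside "chr10".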
Thanks a lot, it works! But when I run the code on the initial file that I posted in my first message, it doesn't work. :( I have never used awk before, so I must learn it. Is it possible to simply replace the A in your code with the name of my first column? Also, can this code work on very large data, or is it only adapted to this specific case?
Could you please repost an example input file, as well as an example of the corresponding output you expect?
Should we assume that a link between two chromosomes has no "order" (A-D can be considered the same as D-A)?
(Or should it be treated like a vector, so that the direction A-D vs D-A does matter?)
This is one group, even if the IDs in bold characters don't share more than 80% identity.
A very simple case is when you have an A--B--C association where A and C don't share enough identity to be grouped directly, but the chain still forms one continuous group. I don't know if I'm being clear, ctsgnb.
Thanks again for your help
I'm sorry, I think that sometimes I'm not very clear!
What I want is to group together chromosome sequences that are very close, based on sequence identity. The number of output lines will depend on the number of groups that the code finds. Did you understand me or not?
In the last example the output must be one line.
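The A--B--C requirement described above amounts to chaining links: two IDs end up in the same group as soon as a path of pairs at or above the threshold connects them. One possible way to sketch this in awk is a small union-find (disjoint-set) structure. This is only an illustration under assumptions, not code from the earlier replies: the file chain.txt, the `min` variable and the helper functions `find` and `add` are hypothetical names, links are treated as unordered (A-B is the same as B-A), and IDs that never reach the threshold with any partner are not printed.

```shell
# Hypothetical sample: A-C is below the threshold, but A-B and B-C
# chain all three IDs into one group.
cat > chain.txt <<'EOF'
ID1 ID2 Identity
A B 85
B C 90
A C 40
D E 98
F G 10
EOF

groups=$(awk -v min=80 '
# find(x): return the representative (root) of the group containing x
function find(x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]]   # path halving keeps chains short
        x = parent[x]
    }
    return x
}
# add(x): register a new ID as its own group, remembering input order
function add(x) {
    if (!(x in parent)) { parent[x] = x; order[++n] = x }
}
NR > 1 && $3 + 0 >= min {
    u = $1 ""; v = $2 ""                # force string comparison of IDs
    add(u); add(v)
    a = find(u); b = find(v)
    if (a != b) parent[b] = a           # merge the two groups
}
END {
    # collect members per root, in order of first appearance
    for (i = 1; i <= n; i++) {
        r = find(order[i])
        if (r in members) members[r] = members[r] " " order[i]
        else { members[r] = order[i]; roots[++m] = r }
    }
    for (j = 1; j <= m; j++) print members[roots[j]]
}' chain.txt)
echo "$groups"
```

On this sample the sketch prints `A B C` and then `D E`. Memory use grows with the number of distinct IDs that pass the threshold rather than with the number of lines, which may matter for a very big input file.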