Matrix parsing help !

Hello every body ! I'm a new in this forum and beginner in Perl scripting and I have some problems :(:(:(! I have a big file like that :

ID1                   ID2                       Identity 
chromosome07_194379   chromosome01_168057       0.975
chromosome01_100293   chromosome01_168057       0.969
chromosome01_100293   chromosome07_194379       0.969
chromosome01_29385    chromosome01_168057       0.856
chromosome01_29385    chromosome07_194379       0.856
chromosome01_29385    chromosome01_100293       0.861
chromosome08_116839   chromosome01_168057       0.78
chromosome08_116839   chromosome01_100293       0.786
chromosome08_116839   chromosome01_293853       0.946

The three column are separated by tabulation (\t)

I want to cluster the IDs that share a identity more than 0.8 using Perl scripting, can someone help me ?
Thanks a lot in advance for your help

What does "cluster" stand for? sort? print only those? ?? Maybe post an example of the expected output using code tags.

1 Like

Something like :

nawk 'NR==1||$3>0.8' yourfile

?

If not, please give more clue as requested by Zaxxon.

1 Like

hi ctsgnb and zaxxon thanks a lot for replying. what i need is group the id basing in the identity sequence for exemple :

ID1 ID2  Identity
A    B      70
A    C      50
A    D      90
A    E      80
B    C      95
B    D      66
B    E      47
C    D      35
C    E      25
D    E      98

The output will be like that :

A D E
B C

Note : It means that the sequence A D and E are together because they share more than 80 of identity . In the same way B and C are closed because of their identity.
Sorry for my bad english ! :o

If the line order of the output doesn't matter you can give a try with something like:

(you may want to change your 0.8 to 80 depending on the format of your input file)

awk 'NR>1&&$3>=0.8{A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A}' yourfile
$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
$ awk 'NR>1&&$3>=80{A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A}' f2
A D E
B C
D E
$

Thanks ctsgnb I try your code and the output is like that :

A C D E
B C D E
C D E
D E

I updated my previous post : you should adapt the threshold from 0.8 to 80 and give another try

Thanks to take time re reply me I am very grateful.
Now it's seem better ! but the third line DE I have to ignore it because my original file is very very big ! I will have repeated information in my output.

-- deleted --

This is your code :

awk 'NR>1&&$3>=80{A[$1]=$1;B[A[$1]]=(B[A[$1]]?B[A[$1]]:$1)" "$2}END{for(i in A) print B[A]}' test.tttt 

And this is the output :

A D E
B C
D E

--> The DE is not a single group it's normally a part of the group 1 (ADE) I don't now if I'm clear
What i want to do after it's to get every group ID and using Bioperl to check the corresponding fasta files in a database. So i need just a output with two line (for this exemple).
Thanks

you can get all single pairs belonging to at least one group that is 80 or more with the following :

$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
A B 70
A C 50
A D 90
A E 40
$ awk 'NR>1&&$3>=80{i=$1" "$2;j=$2" "$1;t=i<j?i:j;C[t]}END{for(k in C) print k}' f2
A D
A E
B C
D E
$

NOTE that this code assume that an A D association is just another D A association, letters are just displayed from lower to higher :
consider the following example :

$ cat f3
A B 10
B A 80
C D 70
E D 90
D B 80
A D 10
D A 93
$ awk 'NR>1&&$3>=80{i=$1" "$2;j=$2" "$1;t=i<j?i:j;C[t]}END{for(k in C) print k}' f3
A B
A D
B D
D E
$

---------- Post updated at 04:48 PM ---------- Previous update was at 04:24 PM ----------

you can also try the following code

$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
A B 70
A C 50
A D 90
A E 40
$ awk 'NR>1&&$3>=80{x=$1" "$2;for(i in A) {if (A~x) next};A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A}' f2
A D E
B C

---------- Post updated at 05:00 PM ---------- Previous update was at 04:48 PM ----------

To avoid that a same $2 appear more than once within a group you can also try :

awk 'NR>1&&$3>=80{A[$1]=(A[$1]?A[$1]:$1)(A[$1]~$2?z:" "$2)}END{for(i in A) print A}' yourfile

Not sure to get what final result you expected.

Thanks a lot it works !! but when i use the code for my initial file that i post in the first message it don't work ): ! I never use before the awk code i must learn it. It is possible to just change the A in your code with the noun of my first column ? Other thing this code can work with a very big data ? or just adapted for this specific case ?

Did you make sure you've used the right threshold in your code (depending on your input file) ?
0.8 vs 80

you are right sir I dont !! the output is not good unfortunately :

chromosome01_100293 chromosome01_168057 chromosome07_194379
chromosome01_29385 chromosome01_168057 chromosome07_194379 chromosome01_100293
chromosome08_116839 chromosome01_293853

---------- Post updated at 04:30 PM ---------- Previous update was at 04:26 PM ----------

the chromosome01_100293 is present for exemple in the line 1 and the line 2 in the same time

Which code did you use ?

Could you please repost an example of input file as well as an example of the corresponding output file you expect ?

Should we assume that link between 2 chromosome have no "order" (A-D could be considered like D-A) ?
(or should it be considered like a vector so that the way A-D vs D-A does matter ?)

OK sir ctsgnb ! The input is (just the beginning because the original file contain more than 100,000 lines ! ):

chromosome07_194379   chromosome01_168057       0.975
chromosome01_100293   chromosome01_168057       0.969
chromosome01_100293   chromosome07_194379       0.969
chromosome01_29385    chromosome01_168057       0.856
chromosome01_29385    chromosome07_194379       0.856
chromosome01_29385    chromosome01_100293       0.861
chromosome08_116839   chromosome01_168057       0.78
chromosome08_116839   chromosome01_100293       0.786
chromosome08_116839   chromosome01_293853       0.946

and the output file must be like that :

chromosome07_194379 chromosome01_168057 chromosome01_100293 chromosome01_29385 chromosome08_116839 chromosome01_293853

This is one group even if the IDs in bold charachter don't share more than 80% of identity
a very simple case is when you have A--B--C association but the A and C don't share enough identity to be considered together but is one continue group . I don't now if i'm clear ctsgnb
Thanks again for your help

... so now the output must only be 1 line ?

Could a D-A sequence be considered the same as a A-D sequence or do the order matter ?

Do the order in which the line appear in the input file matter ?

or could we re-arrange the sorting ?

Can you write us the pseudo-code just to clarify what define the "grouping" and the logic behind ?

I'm sorry I think that sometimes I'm not very clear !
What I want is to group together chromosome sequences that are very closed basing on the identity sequence. The number of lines will depend of the number of group that the code will defined. Did you understand me or not ?
In the last example the output must be one line

Let's say that you have

  E-F
  | |
A-B-C-D
  |
  G

If we consider that all the
X-Y
and
X
|
Y
means that they match 80% and more.

what should be considered as a group ?

---------- Post updated at 07:05 PM ---------- Previous update was at 07:02 PM ----------

How shoud that group be defined : in one line ?

A B C D E F G ?

Or in 4 lines :
B C F E
A B
B G
C D

Or in another way ?

Exactly A B C D E F G are a group ! and the group should be defined in one line .
Thanks sir