Grouping matches by cols

gbalsu · September 9, 2008, 6:50pm

Dear all
I have a large file with about 10 million lines.
The first two columns hold matching partners.
For example:
A A
A B
B B

or

A A
B A
B B

The matches may be separated by an unknown number of lines.

My intention is to group them and add a "group" value in the last col.

For example

A A A
A B A
B B A

or

A A A
B A A
B B A

Rest assured that only one of A B and B A will be present and not both.
A may have matches in addition to B, and any number of them. But in all cases I would like to name the group after the first partner of the first instance, i.e. A in this case.
Any help will be highly appreciated.

cfajohnson · September 9, 2008, 7:33pm

How do you determine the group value? Why is the third line not B B B?

It would be helpful if you provided more examples from the file.

It might also help if you posted some real data in addition to the abbreviated, single-letter data.

gbalsu · September 9, 2008, 7:39pm

Group value is determined by the first pair to be detected by the script.

If A A was the first pair, A is the first group value.
If A B was the first pair, A is the first group value.
If B A was the first pair, B is the first group value.
If B B was the first pair, B is the first group value.

I am sorting a large gene comparison data set. To us it hardly matters which member names the "group", as long as the members are highly identical, as the results indicate. This is only one of several analysis steps in my project.

Here is one set of instances of my data.

NC_002662.1|:1000271-1001206 NC_002662.1|:1000271-1001206 100.00 936 0 0 1 936 1 936 0.0 1814
NC_002662.1|:1000271-1001206 NC_008527.1|:1000752-1001687 88.60 947 86 21 1 936 1 936 0.0 957
NC_008527.1|:1000752-1001687 NC_008527.1|:1000752-1001687 100.00 936 0 0 1 936 1 936 0.0 1754

Annihilannic · September 9, 2008, 7:53pm

So it seems like the "group value" is always the same as the first field? If that's the case, why do you need to add another field?

gbalsu · September 9, 2008, 8:03pm

No, if it was the first field all the time, I would never have posted this.
I kindly request you to look at my input again - if A B was encountered previously, when you next see B B it needs to be assigned to A.

I wanted to only provide a simple example but I guess I made it too simple and now appear not so smart.

Let's add some more.

Input

A A
A X
C D
E F
X L
A B
O O
P P
M N
B B

Output

A A A
A X A
C D C
E F E
X L X
A B A
O O O
T X X
E E E
P P P
M N M
B B A

My apologies, this is literally the first time I am posting questions in a programming forum. Please help me with further queries as you deem necessary.

Annihilannic · September 9, 2008, 9:00pm

Try this:

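awk '
        # First field already has a group: reuse it.
        $1 in group {
                print $0,group[$1]
                if ($2 in group) {
                        # Both fields are known; warn if their groups disagree.
                        if (group[$1] != group[$2]) {
                                print $1" and "$2" are already in different groups!"
                        }
                } else {
                        group[$2]=group[$1]
                }
                next
        }
        # Only the second field has a group: the first field joins it.
        $2 in group {
                print $0,group[$2]
                group[$1]=group[$2]
                next
        }
        # Neither field seen before: start a new group named after the first field.
        {
                group[$1]=$1
                group[$2]=$1
                print $0,group[$1]
        }
' inputfile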

I think you forgot to include the "T X" and "E E" lines in your example input data.

Note that the output is slightly different, e.g. T X A, not T X X because X is already in group A:

A A A
A X A
C D C
E F E
X L A
A B A
O O O
T X A
E E E
P P P
M N M
B B A

cfajohnson · September 9, 2008, 9:20pm

So what is the rule for determining the group?

When I "next see B B"? I haven't seen it before.

Why is that T X X and not T X T?

Why is that last line B B A and not B B B?

Does this do what you want?

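awk '
# Reuse the group already recorded for field 1, else for field 2;
# otherwise field 1 names a new group.
{ group = (x[$1]) ? x[$1] : (x[$2]) ? x[$2] : $1 }
{ print $0, group }
# Record the group for any field that does not have one yet.
!x[$1] { x[$1] = group }
!x[$2] { x[$2] = group }
' "$FILE"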


gbalsu · September 10, 2008, 2:09am

Thank you both, Annihilannic and cfajohnson. I cannot try your code now but will do so first thing in the morning.

The results are from pairwise comparisons of genes - a table of gene1 (A or B or first column) matching gene2 (B or A or second col) by a particular cutoff % identity. I filtered an analysis of highly similar genes. Based on many years of doing gene analysis daily I have a reasonable idea that above this cutoff the gene functions are either very similar or identical.

The rule is that the first time a pair is seen, the first element of the pair becomes the name of the group. I am just using a FIFO scheme here. It really does not matter scientifically whether A or B (gene1 or gene2) gets assigned here. It matters however that once a group label has been identified that label is consistently used so that the same gene is not assigned to a different group. (Annihilannic trapped my mistakes smartly, very nice of you, special thanks. I will learn to stop working when my eyes are really blurry and my brain is fried.)

In my first example, we have A A, A B, and B B. This is because A matches B; otherwise we would only have A A and B B.
Since we see A A first in the list (A matches itself), we assign the group to be A. Reading further, we get to either B B or A B. But if A and B match, A B or B A will come before B B. So B will be assigned to group A, because A was seen first and already has a label, and when we later see B matching itself we need to assign B B to group A as well.
Alternatively, B B will be seen without either A B or B A (if B does not match A, in which case we only have A A and B B) and hence will be assigned to group B.
So even if B matches itself (B B), it also matches A (when you see either A B or B A), and A B is already assigned to A, so B's group will be A. If in a real example the lines appear in the order B B, A B, A A, no harm done: B will be the group label. So it would not matter scientifically even if we reversed the process and used the second column as the label, but we would then need to use the same grouping (and/or process of determining the grouping) consistently for other matches to both A and B as we move along.

Sorry, this "lecture" was unintended.
More questions? I will be very happy to answer.

gbalsu · September 10, 2008, 2:15am

In the third paragraph in my previous post, in the last sentence (prior to the parentheses), by "same gene" I intend to say "a gene previously assigned in a pair to a group". Sorry again.

Annihilannic · September 10, 2008, 2:22am

You can edit posts here to correct them, you know. I'd be lost without that facility...

It makes sense to me... personally I think I would use an entirely different name for the groups to avoid confusing myself, maybe a group number. So, for example, A, B and X would be in group 1. That way you don't associate A with group 1 any more than you would X. But the end result is the same...
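
A minimal sketch of the numbered-group idea, assuming the same first-seen rule as the scripts earlier in the thread. The function name number_groups and the Group_ prefix are hypothetical, chosen only for illustration, and like those scripts it does not merge two groups that turn out to be linked later.

```shell
# Hypothetical helper: label each pair with a numeric group ID
# instead of the name of the first gene seen.
number_groups() {
    awk '
    {
        if ($1 in group)      g = group[$1]   # first field already grouped
        else if ($2 in group) g = group[$2]   # second field already grouped
        else                  g = ++n         # new pair: open a new group
        # Record the group for any field that does not have one yet.
        if (!($1 in group)) group[$1] = g
        if (!($2 in group)) group[$2] = g
        print $0, "Group_" g
    }' "$@"
}

# Demo on a few of the single-letter pairs from the thread.
printf '%s\n' 'A A' 'A X' 'C D' 'X L' 'A B' 'B B' | number_groups
```

For the six sample pairs piped in at the end, C D should be labelled Group_2 and every other line Group_1.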

gbalsu · September 10, 2008, 11:58am

Yeah, using a number for the group was where I originally wanted to be, but the problem is that it tells a biologist little beyond the fact that "you just reached Group_10000289".
My next step (which I know how to do) is to substitute the group IDs with the name of the gene. Then a biologist can look for the "geneX" of interest to see how many groups the gene has and how many members each group has. And once I have merged that information I can use it to create interfaces for my analysis where people can come and query, manipulate, add to, or analyze the results.
Again, thanks for the valuable discussions. This helps me keep my project ideas clear in both the near and long term.
More after trying the code.

gbalsu · September 10, 2008, 2:09pm

Ok, now I have tried both scripts on my initial inputs (the alphabet pairs) and they both produced identical results.
The logical next step was to do the same on a much smaller subset of my comparison data that I am familiar with, to make sure I was getting what I expected from the scripts.
I hit a snag here. I saw that the outputs were slightly different and had errors. The issue was that the data had names like NC_008527.1|:1225155-1226045 and NC_008527.1|:c900661-899771. Note the difference in the two names: |: and |:c. This was causing issues when the groups were being assigned. I hypothesized that it was only this difference causing trouble, based on other group assignments to pairs without |:c in the names.
I was then able to write a Perl script to move the c from |:c to the end of the name and re-sort the data. When I used this newly sorted data, both scripts produced accurate and identical results.
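
The Perl script itself was not posted. As a rough sketch of the renaming it describes, here is an awk equivalent, assuming the marker appears at most once per name and only in the first two columns. The function name move_c_marker is hypothetical, and the re-sorting step is not shown.

```shell
# Hypothetical helper: move the "c" that follows "|:" to the end of the name
# in the first two columns.
move_c_marker() {
    awk '
    {
        for (i = 1; i <= 2; i++) {
            # sub() returns 1 only when "|:c" was found and replaced by "|:".
            if (sub(/[|]:c/, "|:", $i)) {
                $i = $i "c"
            }
        }
        print
    }' "$@"
}

# Demo using the two names quoted above (their pairing here is made up).
printf '%s\n' 'NC_008527.1|:1225155-1226045 NC_008527.1|:c900661-899771' | move_c_marker
```

Note that assigning to a field makes awk rebuild the line with single spaces between columns, so any other spacing in a changed line is lost.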
Considering that I am a newbie to awk (I have used it for less than 10 hours now), I earnestly appreciate the favor that you both, cfajohnson and Annihilannic, have done me.
Call me old-fashioned, but I never forget the least tidbit of help anybody has ever given me. Many, many grateful thanks!!
One last question: is there a way to use a tab instead of a space as the separator before the added group name?

cfajohnson · September 10, 2008, 2:51pm

{ printf "%s\t%s\n", $0, group }
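
For context, a minimal sketch of where that printf line could sit in a complete script. It assumes the same first-seen rule as the earlier scripts, uses "in" tests rather than truthiness checks, and the function name group_pairs_tab is hypothetical.

```shell
# Hypothetical wrapper: first-seen grouping with a tab before the group column.
group_pairs_tab() {
    awk '
    {
        # Reuse the group of field 1, else of field 2, else start a new one.
        group = ($1 in x) ? x[$1] : (($2 in x) ? x[$2] : $1)
        if (!($1 in x)) x[$1] = group
        if (!($2 in x)) x[$2] = group
        # The original line is kept as-is; only the new column is tab-separated.
        printf "%s\t%s\n", $0, group
    }' "$@"
}

# Demo on the first example from the thread.
printf '%s\n' 'A A' 'A B' 'B B' | group_pairs_tab
```

Only the separator before the added column changes; the existing columns keep whatever spacing they had in the input.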