Awk command to get unique lines in a file with multiple columns

sangeeta · February 8, 2024, 1:39pm

I have two files, file1 and file2.
File1:
Protein1 Protein2
Streb.10G021010.1 : Streb.9G023710.1
Streb.9G019140.1 : Streb.7G013440.1
Streb.10G021010.1 : Streb.9G023710.1
Streb.2G015700.1 : Streb.9G023710.1
Streb.3G019820.1 : Streb.7G013440.1
Streb.3G008920.1 : Streb.1G025210.1
Streb.9G019140.1 : Streb.3G014030.1
Streb.1G034750.1 : Streb.9G009640.1
Streb.1G035920.1 : Streb.3G016240.1
Streb.2G040440.1 : Streb.7G013440.1
Streb.1G041180.1 : Streb.7G013440.1
Streb.2G035340.1 : Streb.10G024960.1
Streb.3G008920.1 : Streb.9G028230.1
Streb.1G040670.1 : Streb.2G014140.1
Streb.3G019820.1 : Streb.3G014030.1
Streb.1G000350.1 : Streb.2G032090.1
Streb.2G022000.1 : Streb.4G006020.1
Streb.10G022300.1 : Streb.9G018870.1
Streb.1G040670.1 : Streb.2G014140.1

File2:
|protein|domain|

|Streb.9G000290.1|PF00574.18|
|Streb.9G000290.1|PF01343.13|
|Streb.9G025660.1|PF00069.20|
|Streb.9G025660.1|PF07714.12|
|Streb.9G011140.1|PF00388.14|
|Streb.9G011140.1|PF00387.14|
|Streb.9G011140.1|PF00168.25|
|Streb.9G011140.1|PF09279.6|
|Streb.9G011140.1|PF03998.8|
|Streb.9G023250.1|PF13976.1|
|Streb.9G023250.1|PF00665.21|
|Streb.9G024400.1|PF03619.11|
|Streb.9G014700.1|PF05078.7|
|Streb.9G014700.1|PF12430.3|
|Streb.9G008200.1|PF01926.18|
|Streb.9G008200.1|PF02421.13|
|Streb.9G008200.1|PF08701.6|
|Streb.9G008200.1|PF00009.22|
|Streb.9G008200.1|PF03193.11|

Expected result:

I have to extract all the protein-domain interactions for each Protein1:Protein2 pair with respect to Protein1. Protein1 may have more than one domain interaction, and we have to extract all of them. I have used this awk command:
awk 'FNR==NR {a[$1]++; next} a[$1]' protein protein-domain > me.txt
The final result comes out sorted and the Protein1:Protein2 interactions get disturbed. I want the domain interactions with respect to Protein1 without disturbing the protein-protein interactions.
Kindly help me with this.

munkeHoller · February 8, 2024, 2:20pm

@sangeeta, welcome. Please show the actual 'Expected result' (at least part of it), not a verbal description.
Also, enclose your code/data in triple back ticks - use the menu options in the dialog that you type into for this ``` your code / data here ```

thks

sangeeta · February 9, 2024, 6:33am

protein-protein interaction file 1:

Protein1	Protein2
Streb.10G021010.1	Streb.9G023710.1
Streb.9G019140.1	Streb.7G013440.1
Streb.2G015700.1	Streb.9G023710.1
Streb.3G019820.1	Streb.7G013440.1
Streb.3G008920.1	Streb.1G025210.1
Streb.9G019140.1	Streb.3G014030.1
Streb.1G034750.1	Streb.9G009640.1
Streb.1G035920.1	Streb.3G016240.1
Streb.2G040440.1	Streb.7G013440.1
Streb.1G041180.1	Streb.7G013440.1
Streb.2G035340.1	Streb.10G024960.1
protein-domain file 2:
protein	domain
---	---
Streb.9G000290.1	PF00574.18
Streb.9G000290.1	PF01343.13
Streb.9G025660.1	PF00069.20
Streb.9G025660.1	PF07714.12
Streb.9G011140.1	PF00388.14
Streb.9G011140.1	PF00387.14
Streb.9G011140.1	PF00168.25
Streb.9G011140.1	PF09279.6
Streb.9G011140.1	PF03998.8
Streb.9G023250.1	PF13976.1
Streb.9G023250.1	PF00665.21
Streb.9G024400.1	PF03619.11

Below is my expected result, in the format I need, for the files mentioned above:

Column 1	Column 2	Column 3	Column 4
domain	Protein1	Protein2	domain
PF00651.26	Streb.10G021010.1	Streb.9G023710.1	PF00176.18
PF04614.7
PF00240.18	Streb.9G019140.1	Streb.7G013440.1	NA
PF11976.3
PF14560.1
PF13019.1
PF10302.4
PF13881.1
PF11069.3
PF13860.1
PF11470.3
PF13750.1
PF13291.1
PF03587.9
PF05213.7
PF11620.3
PF08337.7
PF07708.6
PF12636.2
PF00651.26	Streb.2G015700.1	Streb.9G023710.1	PF00176.18
PF02731.10
PF00240.18	Streb.3G019820.1	Streb.7G013440.1	NA
PF11976.3
PF14560.1
PF13019.1
PF10302.4
PF13881.1
PF11069.3
PF04452.9
PF13860.1
PF11470.3
PF13291.1
PF13750.1
PF01556.13
PF08337.7
PF11620.3
PF07708.6
PF02933.12
PF00651.26 Streb.10G021010.1 Streb.9G023710.1 PF00176.18
PF04614.7
PF00240.18 Streb.9G019140.1 Streb.7G013440.1 NA
PF11976.3
PF14560.1
PF13019.1
PF10302.4
PF13881.1
PF11069.3
PF13860.1
PF11470.3
PF13750.1
PF13291.1
PF03587.9
PF05213.7
PF11620.3
PF08337.7
PF07708.6
PF12636.2
PF00651.26 Streb.2G015700.1 Streb.9G023710.1 PF00176.18
PF02731.10
PF00240.18 Streb.3G019820.1 Streb.7G013440.1 NA
PF11976.3
PF14560.1
PF13019.1
PF10302.4
PF13881.1
PF11069.3
PF04452.9
PF13860.1
PF11470.3
PF13291.1
PF13750.1
PF01556.13
PF08337.7
PF11620.3
PF07708.6
PF02933.12

Paul_Pedant · February 9, 2024, 9:47am

One issue is probably that the "test" a[$1] is not a test: it has the undesirable side-effect of creating the entry, as it would with a[$1] = "";.

The correct test is ($1 in a).
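A minimal sketch of the difference, using a made-up two-line input (the keys x and y are hypothetical and are never stored in the array). It counts how many elements the array holds after each kind of test:

```shell
#!/bin/sh
# Hypothetical input: two keys that are never assigned to the array a.

# Reading a[$1] as a "test" creates an empty element for every key it sees.
by_reference=$(printf 'x\ny\n' | awk '
    { if (a[$1]) hits++ }
    END { n = 0; for (k in a) n++; print n }')

# The membership test ($1 in a) leaves the array untouched.
by_membership=$(printf 'x\ny\n' | awk '
    { if ($1 in a) hits++ }
    END { n = 0; for (k in a) n++; print n }')

echo "elements after a[\$1] test:  $by_reference"
echo "elements after (\$1 in a):   $by_membership"
```

With the reference form the array ends up holding both keys; with the membership form it stays empty.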

I don't see anything in your code which deals with multi-column output. You seem to be outputting each line from protein-domain as it comes in. If you want multi-column output, you need to accumulate data for each row and output them later, either in an END section, or on a break in some column (which means your inputs need to be sorted).
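As a rough sketch of the "accumulate, then output in END" approach: the input below is invented (keys p1, p2 and values d1..d3 are placeholders, not your data), and joining with a comma is only an assumption about the layout you want. It keeps the keys in first-seen order, so nothing needs to be sorted.

```shell
#!/bin/sh
# Hypothetical two-column input: key, value. Keys may repeat.
result=$(printf 'p1 d1\np2 d2\np1 d3\n' | awk '
    {
        if ($1 in vals) {
            vals[$1] = vals[$1] "," $2      # append to an existing row
        } else {
            order[++n] = $1                 # remember first-seen order of keys
            vals[$1] = $2
        }
    }
    END {
        # Emit one line per key, in the order the keys first appeared.
        for (i = 1; i <= n; i++) print order[i] "\t" vals[order[i]]
    }')

echo "$result"
```

The order[] array is what stops the output from being shuffled: a plain for (k in vals) loop visits keys in an unspecified order.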

sangeeta · February 9, 2024, 10:03am

Thank you for your valuable input. The main issue I found after sorting is that I lost the protein-protein interactions in the file. Any suggestion on how to sort without disturbing the Protein1-Protein2 interactions? My coding is not very good. Please help me with this. Thank you.

sangeeta · February 9, 2024, 10:04am

Can you improve this command to extract my final output?

Paul_Pedant · February 9, 2024, 9:41pm

Thanks for the detailed new data. It shows the format of the output you wish, and the input format. But there are many issues which mean that your precise requirements are undefined.

(a) The initial post implied that File 1 used a colon as a separator, and File 2 used a vertical bar as a separator (and had an empty first field too). The more recent post shows text in a grid format, without showing what the separators are.

(b) I did some analysis of the second data set. Naturally, I looked at the first output record, first field, which contains PF00651.26, wondering by what logic that entry was constructed. I found that the string PF00651.26 does not occur anywhere in the input files. So I don't know where that output comes from. Some of the grid output rows also seem to be folded over two lines.

I then ran a frequency count for every value of the form PF.... in all the data. There are 12 such values in File 2, and 24 unique entries in Results. The Results actually repeats 19 PF-type values 4 times over, and there are four rows where the link is made to a column 4 value.

I discovered that none of the 12 input values appears in the output, and none of the 19 output values appears in the input. That means (i) this data is incapable of demonstrating the logic behind any requirements, and (ii) it is also incapable of being used to test any actual code. Basically, it seems to be randomly generated.

(c) As a side question: is the protein file symmetrical: that is, for Streb.9G019140.1 : Streb.3G014030.1, does the domain interaction work as both Streb.9G019140.1 => Streb.3G014030.1 and Streb.3G014030.1 => Streb.9G019140.1.

(d) By what logic can some Column 4 entries be NA (presumably Not Applicable)? If a pair of proteins does not link two domains, surely there is nothing to report.

A better description of the logical steps that join any two domains, and some internally consistent test data, would be appreciated.
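For anyone wanting to repeat the cross-check in (b), here is a sketch of one way to do it. The two small files are invented stand-ins (not the posted data), and the pattern for a "PF-type value" is my assumption about what those identifiers look like. It counts each PF value per file and lists the values that occur in only one of them.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical stand-ins for the protein-domain input and the expected output.
printf 'P1\tPF00001.1\nP1\tPF00002.1\nP2\tPF00001.1\n' > "$tmp/input.txt"
printf 'PF00003.1\tP1\tP2\tPF00001.1\nPF00003.1\tP2\tP1\tNA\n' > "$tmp/expected.txt"

report=$(awk '
    {
        for (i = 1; i <= NF; i++) {
            if ($i !~ /^PF[0-9]+\.[0-9]+$/) continue   # assumed PF value shape
            if (FNR == NR) seen_in[$i]++               # first file: the input
            else           seen_out[$i]++              # second file: expected output
        }
    }
    END {
        for (v in seen_out)
            if (!(v in seen_in)) print v, seen_out[v], "only-in-expected"
        for (v in seen_in)
            if (!(v in seen_out)) print v, seen_in[v], "only-in-input"
    }' "$tmp/input.txt" "$tmp/expected.txt" | sort)

echo "$report"
rm -rf "$tmp"
```

Any line tagged only-in-expected is a value the code could never produce from that input, which is the situation described in (b).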

EDIT: I took another look at the protein-domain input. Although it has 12 Streb entries, there are only five different values: the other seven are duplicates. None of the values matches any line in the protein-protein file. That adds a few more questions:

(e) Are multiple paths between the same PF domains valid and/or expected? Your explanation seems to expect the outcome to be unique, but the data do not support that.

(f) Should the output be sorted or grouped to clarify the results?

(g) Are we restricted to cases where there are exactly two proteins in the chain? What stops one, three or more proteins forming a chain?

(h) Should there be some level of validation on the inputs?

Obviously I have no idea what your basic field of research might be, but I am (way too) familiar with tracing connectivity in mesh-connected high-voltage networks. Change domain to substation, and protein to cable, and the underlying path-finding problem is identical.

sangeeta · February 13, 2024, 7:10am

File1:
ID
1
2
3
4
5
6
7
File2:
5 AC
2 EE
1 DC
3 HG
5 LS
2 MH
6 UV
Expected Result:
1 DC
2 MH
2 EE
3 HG
5 AC
5 LS
6 UV
I want to sort File2 according to the ID order in File1.
Is there any command with which I can do this?
Thanks in advance.

MadeInGermany · February 13, 2024, 7:33am

That means you first process file 2, put it in memory for fast lookup, then process file 1 and lookup/print.

Paul_Pedant · February 13, 2024, 10:29am

That is fairly easy, and it does not require sorting. We can just set up a dictionary and pull back the letters as we need them.

Notice this preserves the order of the keys in File1, and then the order of the selected text for that key within File2. It also fails to report that there are no matches for File1 4 and 7: the result of the split() is an empty array, so the for loop does zero iterations.

$ head File1 File2
==> File1 <==
ID
1
2
3
4
5
6
7

==> File2 <==
5 AC
2 EE
1 DC
3 HG
5 LS
2 MH
6 UV
$ cat Merge
#! /bin/bash --

Awk='
FNR == NR { X[$1] = X[$1] " " $2; next; }
{
    split (X[$1], V);
    for (j = 1; j in V; ++j) printf ("%s %s\n", $1, V[j]);
}
'
    awk "${Awk}" File2 File1

$ ./Merge
1 DC
2 EE
2 MH
3 HG
5 AC
5 LS
6 UV
$

So, exactly how does that relate to your protein/domain problem? Both the inputs to that already have two columns. You cannot expect to use those PF domains to match with, because they only occur in one of the files. So you presumably want to match the Streb values from protein-domain with the first column of protein-protein (or maybe the second column, or both columns -- who can tell ?)

Given your expected results posted 4 days ago, I assume you are trying to make 3 columns at this point, like:

PF00574.18 Streb.9G000290.1 Streb.3G016240.1

So we change the key column from $1 to $2, and index both protein columns, and try that.

#! /bin/bash --

Awk='
FNR == NR {
#.. Index both columns to each other.
    X[$1] = X[$1] " " $2;
    X[$2] = X[$2] " " $1;
    next;
}
{
    domain = $2; protein = $1;
    if (protein in X) {
        split (X[protein], V);
        for (j = 1; j in V; ++j)
            printf ("%s\t%s\t%s\n", domain, protein, V[j]);
    } else {
        printf ("%s\t%s\t%s\n", domain, protein, "No Matches");
    }
}
'
    awk "${Awk}" protein-protein.orig protein-domain.orig

$ ./Merger
domain	protein	No Matches
PF00574.18	Streb.9G000290.1	No Matches
PF01343.13	Streb.9G000290.1	No Matches
PF00069.20	Streb.9G025660.1	Streb.1G034750.1
PF07714.12	Streb.9G025660.1	Streb.1G034750.1
PF00388.14	Streb.9G011140.1	No Matches
PF00387.14	Streb.9G011140.1	No Matches
PF00168.25	Streb.9G011140.1	No Matches
PF09279.6	Streb.9G011140.1	No Matches
PF03998.8	Streb.9G011140.1	No Matches
PF13976.1	Streb.9G023250.1	No Matches
PF00665.21	Streb.9G023250.1	No Matches
PF03619.11	Streb.9G024400.1	No Matches
$

Why only two matches? Actually, the test data you provided produces no matches: it is completely inadequate. I sneaked an extra line into protein-protein.orig:

Streb.1G034750.1    Streb.9G025660.1

just to test that the script works.

Now would be a good time to re-read my post about why your results cannot be produced from your input, and maybe answer some of the questions in there.