Molecular biologist requires help re: search / replace script

Monday April 07, 2008

Hello - I was wondering if someone could help me? I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:

$ awk -f search_replace.awk <file to be searched>

I have a tab-delimited table of data (text), essentially as follows (for simplicity),

a pp b
a pp c
a pp d
a pp e
a pp b
a pp e
a gi b
a pp a
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.

a pp b
a pp b
a pp b
b pp a
a pp b
b pp a

should be reduced to a single line,

a pp b

(since "b pp a" is 'the same' as "a pp b").

Is this clear?

Additionally, my actual file contains further columns (fields per row); I would like to ignore (but keep) these extra fields, searching and replacing based only on fields $1, $2, $3 (in the example above). I think that it is possible to restrict search / replace operations to specified fields.

Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1; 2; 3; 4; etc.) indicating the number of occurrences in the source file, with 1 = no duplicates, 2 = one duplicate, etc.

In the six-line example above, this would be

6 a pp b

I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?

FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program. In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the two genes. Note that the relationship "pp" is different from "gi"; thus

a pp b

is different from

a gi b

The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), then I get extra lines appearing in my analysis program (Cytoscape), which complicates the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape according to how many times each relationship was reported, from various assays (different types of experiments; independent analyses).

If anyone could suggest some solutions, that would be *very* much appreciated!

Thanking you all in advance,

Sincerely, Greg S. :slight_smile:

Can you come up with a canonicalization for all these fields? Even if you can't keep them in canonical form, that would basically solve the immediate problem.

Let's say, for example, always sort them so $1 is alphabetically before $3. So swap any fields where $3 is before $1, then sort -u and all that.

awk '$3 < $1 { t = $1; $1 = $3;  $3 = t } { print }'

Extending this to keep counts for each canonical form in awk itself should be fairly trivial.

awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } '

Addendum: If you have additional fields you need to key on just $1$2$3 instead of $0, of course. Then you can't keep them in the weighted summary, though.

Hi era - Thank you for your reply. In my example (above), the middle column (field $2) will be either "pp" or "gi" or "tf" (I didn't mention "tf," for simplicity) - at least for now.

Fields $1 and $3 will be genes that are uniquely named, sometimes containing a dash (e.g., YPR016W-A) but mostly of the form YGR110W, etc. (I work with yeast.) Each name is a unique "systematic" name (one per gene). None of the gene names contain spaces, thankfully.

I believe this is what you meant by canonicalization: Canonicalization - Wikipedia, the free encyclopedia

Could you kindly elaborate on the method(s) that you alluded to in your reply?

Thanks, era! Greg :slight_smile:

Did you try the code I posted? Does it do roughly what you asked for? If not, can you tell what more you need?

Canonical representation basically means you come up with a "standard" which all other representations can be converted into. (I wrote that before looking at the first sentence in the Wikipedia article -- honest.)

What I suggested was to simply use a canonicalization which says they should be in alphabetic order, so you can compare them head to head; in other words, find a way to mask off anything which introduces an artificial difference when you know two forms to be identical. Now if you rewrite them in this way, then simple string equality can be used to test for equivalence.

The code I posted already does this, but doesn't handle lines with more than three fields. Just add a print in the ++a[stuff] part if you want to keep them and add a frequency count at the end. You need a way to separate the frequency count from the raw data, though, so I'd just keep them in separate files, and not add that print statement.
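
Something along these lines would keep the two in separate files (just a sketch, untested; the file names are placeholders):

awk -F "\t" 'BEGIN { OFS = "\t" }
$3 < $1 { t = $1; $1 = $3; $3 = t }      # canonical order for the gene pair
{ print > "canonical.txt"                # full lines, rewritten in canonical form
  a[$1 "\t" $2 "\t" $3]++ }              # frequency count per triplet
END { for (k in a) print a[k], k > "counts.txt" }' yourdata.txt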

PS. The wikipedia link is slightly broken. You want Canonicalization - Wikipedia, the free encyclopedia

Hello again ... Yes, I'm following what you're saying (era) - I tried out the code you suggested. I'll explore this approach further - thanks!

I'd also be interested in additional ideas (for comparison) ...

Greg :slight_smile:

script:

awk '{
     	if($3 > $1) {printf("%s %s %s", $3, $2, $1)}     	
     	else {printf("%s %s %s", $1, $2, $3)}     	
     	for(i=4; i<=NF; i++) {printf(" %s", $i)}
     	printf("\n")
     } ' filename  | \
     awk '{ 
     		arr[$1 $2 $3]++;
     		if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0}
     	  }
     	  END {
     	     for(i in arr)
     	       {print arr[i], m[i]}
     	  }'	>  newfilename

Just to finish that dangling remark about more fields after $3

awk -F "\t" '$3 < $1 { t = $1; $1 = $3; $3 = t }
{ a[$1 "\t" $2 "\t" $3]++ }
END { for (k in a) { print a[k], k } } '

I added the -F "\t" to explicitly make this tab-delimited, and changed the array to key on $1 $2 $3 joined by tabs. Hope your awk can do that too.

Like I wrote above, this summarizes the counts only, and I'd keep it that way. It might be useful to convert all of your data to the canonical format, which the first script I posted should already do for you. (Maybe add the -F "\t" there too.)
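
With tabs, that first (conversion-only) pass would be roughly:

awk -F "\t" 'BEGIN { OFS = "\t" } $3 < $1 { t = $1; $1 = $3; $3 = t } { print }'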

Hope this helps.

jim: I don't understand. What happened to z gi t 60 for example?

greg: just to clarify, do you want weights for the triplets, or for the entire lines? I assumed triplets, but I guess jim understood differently.

Hi jim & era: I appreciate both your solutions very much! The summary below is rather long, but it illustrates my working through the problem - please be patient! :wink:

I think that we are close; however, the final output is still not quite what I want, as described below. Basically, I need to find duplicates (a pp b = b pp a, etc.) and summarize and count them (2 a pp b), including the unique lines (1 a pp a). However, I'd like this sorting, etc. to be based on three specific columns (here, $1, $2, $3 - in the source file), independently of any of the other fields.

I managed to save jim's awk script as a shell script/file, "search_replace.awk.sh", executable from the command line. For reference, I am using Cygwin on Windows XP Pro at work; I am an Ubuntu Linux user at home.

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)} else {printf("%s %s %s", $1, $2, $3)} for(i=4; i<=NF; i++) {printf(" %s", $i)} printf("\n") }' dummy_test_duplicates_file_2.txt | awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

which is executed as follows:

$ sh search_replace.awk.sh

(Question: Is it possible to specify the input file from the command line, rather than within the script?)

This also works (same code, split over different lines)

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

but not this (identical code, split differently)

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt |
awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

$ sh search_replace.awk.sh
search_replace.awk.sh: line 5: $'\r': command not found

... Adding a backslash ( printf("\n") }' dummy_test_duplicates_file_2.txt | \ ) should - but does not - correct the problem ...

Ultimately, I modified this "search_replace.awk.sh" script by adding

BEGIN {OFS=FS="\t"}

to try to ensure tab-delimited input/output:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

I also compared the output from jim's code versus that from era's awk command,

$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt

Using my source file (to be searched) - "dummy_test_duplicates_file.txt" (21 lines) - I get the following:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

$ cat dummy_test_duplicates_file.txt | sort
a gi b
a pp a
a pp b
a pp b
a pp c
a pp d
a pp e
a pp e
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t

$ sh search_replace.awk.sh
3 y gi t
1 u gi t
2 d pp a
1 z pp t
2 z gi t
2 v gi t
2 e pp a
1 a pp a
1 w gi t
1 b gi a
3 b pp a
1 x gi t
1 c pp a

Compare to:

$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
1 t gi x
1 t gi v
1 t gi y
1 t gi z
1 a pp a
2 t gi y
2 a pp b
1 t gi z
1 a pp c
1 a pp d
1 a pp b
2 a pp e
1 a gi b
1 a pp d
1 t gi u
1 t pp z
1 t gi v
1 t gi w

This also works:

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt

Interestingly, adding a second ( BEGIN {OFS=FS="\t"} ),

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

gives the same output, formatted slightly differently and in a different line order:

$ sh search_replace.awk.sh
2 z gi t
2 d pp a
3 y gi t
1 c pp a
1 x gi t
3 b pp a
1 w gi t
1 a pp a
2 v gi t
1 b gi a
1 z pp t
1 u gi t
2 e pp a

$ sh search_replace.awk.sh | sort
1 a pp a
1 b gi a
1 c pp a
1 u gi t
1 w gi t
1 x gi t
1 z pp t
2 d pp a
2 e pp a
2 v gi t
2 z gi t
3 b pp a
3 y gi t

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

====================

So far, this "appears" to be working great! :slight_smile:

However, when I add a fourth column (to see the effect - remember I want to sort, count "duplicates," etc. based only on three defined columns / fields), saved as "dummy_test_duplicates_file_3.txt"

$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

$ cat dummy_test_duplicates_file_3.txt | sort
a gi b I
a pp a don't
a pp b This
a pp b text
a pp c column
a pp d contains
a pp e some
a pp e that
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y

[Here I changed the source file in the shell script from "dummy_test_duplicates_file.txt" to "dummy_test_duplicates_file_3.txt".]

$ sh search_replace.awk.sh | sort
1 a pp a don't
1 b gi a I
1 b pp a This
1 b pp a text
1 b pp a want
1 c pp a column
1 d pp a contains
1 d pp a to
1 e pp a some
1 e pp a that
1 u gi t affect
1 v gi t
1 v gi t the
1 w gi t search/replace
1 x gi t operations.
1 z pp t
2 z gi t
3 y gi t

This (above) really isn't what I am looking for - note that "a pp b" (or "b pp a" in the second example) is listed separately, three times, in the two different solutions (outputs) above. The contents of the fourth column ($4 in the input file) are affecting the search / replace operation.

Additionally - what is the purpose of the "%s" (string) printf format specification? It looks like it is being filled in with the field variables (e.g. printing fields $1, $2, $3 as strings)?

I also decided to change the "order" of the search fields $3, $1:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($1 > $3) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_3.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

e.g. if $1 (b) > $3 (a), print $3 (a) $2 $1 (b)

Thank you both once again for your different, unique solutions - I'd like to get this sorted out using one (or both - this is educational) methods!

Sincerely, Greg :slight_smile:

Absolutely. The command-line arguments are available in the shell script as $1, $2, $3, etc.; "$@" (with the quotes) is all the arguments. This is confusing if you've been playing a lot with awk, where $1, $2, ... are the fields of the current line.
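
For example, a wrapper along these lines (a rough sketch) would let you run it as "sh search_replace.awk.sh somefile.txt":

#!/usr/bin/bash
# "$1" below is the first command-line argument (the input file name),
# not an awk field; "$@" would forward all arguments instead.
awk -F "\t" 'BEGIN { OFS = "\t" }
$3 < $1 { t = $1; $1 = $3; $3 = t }
{ a[$1 "\t" $2 "\t" $3]++ }
END { for (k in a) print a[k], k }' "$1"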

I would speculate that this is a Cygwin problem. \r is the DOS carriage return character. Maybe you could save as a Unix file and try again.

Like I think I was saying before, replace $0 with $1 $2 $3 (or even $1 "\t" $2 "\t" $3) in a[$0]++

Correct - "%s" in printf is the string format specifier; each %s is replaced by the corresponding argument (here the fields $1, $2, $3), printed as a string.
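
For illustration, inside awk these two lines do (nearly) the same thing:

printf("%s %s %s\n", $1, $2, $3)   # each %s is filled in with the corresponding field, as a string
print $1, $2, $3                   # print joins the fields with OFS and appends a newline itself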

I hope I managed to address all your questions. Maybe an executive summary would help (-:

Run dos2unix on that file, which may have been constructed under a DOS/Windows environment; you may find it behaves better.
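
E.g. (assuming dos2unix is available under Cygwin; plain tr works as a fallback):

dos2unix search_replace.awk.sh
# or strip the carriage returns by hand:
tr -d '\r' < search_replace.awk.sh > tmp.sh && mv tmp.sh search_replace.awk.sh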

Wednesday April 09, 2008

Quote: Originally Posted by gstuart
Code:

$ awk 'BEGIN {OFS=FS="\t"}
$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ }
END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want

[ Reply - era: ] "Like I think I was saying before, replace $0 with $1 $2 $3 (or even $1 "\t" $2 "\t" $3) in a[$0]++"
---------------------
Wow - Thanks era! This is wonderful - I'm not sure that I would have figured this out, in a timely manner! I'm going to read up a bit on awk and arrays. I think that this will be *very* helpful!

$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y

If I understand this correctly, keying the array on four or more fields (in this example) "screws up" the search for duplicates, due to non-identity of the remaining fields. Short of concatenating these fields, e.g. from the above output,

1 a pp b This
1 a pp b text
1 a pp b want

as

3 a pp b This; text; want

I don't see any "simple" work-around (?!) (though see the sketch a bit further below) ... However, I can live with this, as I am critically interested in finding / counting / parsing duplicates among the $1, $2, $3 fields (in the input file: Gene A, interaction type, Gene B), and having the output list the unique relationships with the duplicate counts ("weights")! This is perfectly acceptable to me, and the code provided by era (modified slightly, here),

awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort > out.tab

is simple and elegant! :slight_smile:
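
(For the record, the "3 a pp b This; text; want"-style concatenation mentioned above could probably be done with a second array - an untested sketch, assuming the extra text sits in $4 as in my test file:)

awk 'BEGIN { OFS = FS = "\t" }
$3 < $1 { t = $1; $1 = $3; $3 = t }
{ key = $1 "\t" $2 "\t" $3
  a[key]++
  if (NF >= 4 && $4 != "")
    extra[key] = (extra[key] == "" ? $4 : extra[key] "; " $4)   # collect the 4th-column text per triplet
}
END { for (k in a) print a[k], k, extra[k] }' dummy_test_duplicates_file_3.txt | sort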

This is my understanding of era's code:

$ awk 'BEGIN
{OFS=FS="\t"} # BEGIN block: output and input field separators = tabs
$3 < $1 # pattern: matches only lines where $3 sorts alphabetically before $1 ...
{ t = $1; $1 = $3; $3 = t } # ... and its action swaps the two gene fields, e.g. "b gi a" ---> "a gi b"
{ a[$1 "\t" $2 "\t" $3]++ } # the tab-joined triplet is the array key (subscript); ++ adds 1 to the
# count stored under that key every time the triplet is seen
END

{ for (k in a) { print a[k], k } } ' # awk's "for (k in a)" loop: k takes each key of array a in turn
# (in no particular order), and the stored count is printed, followed by the key itself
# a = the array built above (keyed on the three fields, in this example)
dummy_test_duplicates_file_3.txt | sort # input (source) file; dump to screen, sorted via pipe
# (or " > output_file.txt" - to save the output directly to a file)

How is this thing actually counting the "duplicates"? *Following* this declaration, the array itself must somehow be keeping track of the duplicates? ... I need to read up on awk and arrays ...
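
(I think the answer is that a[key]++ itself does the counting: the first time a given key - e.g. "a<TAB>pp<TAB>b" - is seen, awk creates that array element at 0 and the ++ bumps it to 1; every later line with the same key bumps it again, so by the END block each unique triplet maps to its number of occurrences. A toy example to convince myself - the two output lines may appear in either order:)

$ printf 'apple\npear\napple\n' | awk '{ count[$1]++ } END { for (w in count) print count[w], w }'
2 apple
1 pear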

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k] } } ' dummy_test_duplicates_file_3.txt | sort
1
1
1
1
1
1
1
2
2
2
2
3
3

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print k } } ' dummy_test_duplicates_file_3.txt | sort
a gi b
a pp a
a pp b
a pp c
a pp d
a pp e
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort > out.tab

$ cat out.tab
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

NOTE that the print command (above) adds a "new" (fourth) field, in the first column position - the frequency (duplicates) count, or "weight":

$ awk ' { print ($0) } ' out.tab
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

$ awk ' { print ($1) } ' out.tab
1
1
1
1
1
1
1
2
2
2
2
3
3

$ awk ' { print ($2) } ' out.tab
a
a
a
t
t
t
t
a
a
t
t
a
t

$ awk ' { print ($3) } ' out.tab
gi
pp
pp
gi
gi
gi
pp
pp
pp
gi
gi
pp
gi

$ awk ' { print ($4) } ' out.tab
b
a
c
u
w
x
z
d
e
v
z
b
y

$ awk ' { print ($5) } ' out.tab

$ awk ' { print ($"[1-NF]") } ' out.tab
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

THANK YOU all, once again, for your help - very much appreciated.

Sincerely, Greg S. :slight_smile: