Search, replace string in file1 with string from (lookup table) file2?

gstuart · April 10, 2008, 3:24pm

Hello: I have another question. Please consider the following two sample, tab-delimited files:

File_1:

Abf1 YKL112w
Abf1 YAL054c
Abf1 YGL234w
Ace2 YKL150w
Ace2 YNL328c
Cup9 YDR441c
Cup9 YDR442w
Cup9 YEL040w
...

File_2:

...
ABF1 YKL112W
ACE2 YLR131C
CUP9 YPL177C
...

File_2 is a "lookup table"; I want to replace $1 in File_1 with the matching $2 field in File_2, additionally adding a middle column containing the string "tf", and a column of "ones" ("1" in the first column position), all tab-delimited.

Additionally, it would be ideal if case could be ignored for the search / replace, and if all letters in the output were converted to uppercase ([a-z] to [A-Z]).

FYI, these are yeast genes; in addition to numbers and letters, some of the genes will contain dashes (e.g., YBR162W-A), but none will contain commas, semicolons, spaces, etc.

Output File_3:

1 YKL112W tf YKL112W
1 YKL112W tf YAL054C
1 YKL112W tf YGL234W
1 YLR131C tf YKL150W
1 YLR131C tf YNL328C
1 YLR131C tf YLR439W
1 YPL177C tf YDR441C
1 YPL177C tf YDR442W
1 YPL177C tf YEL040W
...

This is related to (but different from) my earlier query,

http://www.unix.com/shell-programming-scripting/60159-molecular-biologist-requires-help-re-search-replace-script.html#post302183287

Here, the first column is a "dummy" weight value, to maintain "field compatibility" with my earlier file, as shown in this example:

1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y
...

Ultimately, I will end up with a file like this, with $1 = weight, $2 = gene1, $3 = association, $4 = gene2:

1 YKL112W tf YKL112W
1 YKL112W tf YAL054C
1 YKL112W tf YGL234W
1 YLR131C tf YKL150W
1 YLR131C tf YNL328C
1 YLR131C tf YLR439W
1 YPL177C tf YDR441C
1 YPL177C tf YDR442W
1 YPL177C tf YEL040W
...
1 YBL012C gi YCL045C
1 YBL012C pp YBL012C
5 YBL012C pp YHR039C-A
1 YLR363W-A gi YNL143C
4 YLR363W-A gi YPR123C
1 YLR363W-A gi YLR467W
1 YLR363W-A pp YNR073C
2 YBL012C pp YGL232W
2 YBL012C pp YOR102W
2 YLR363W-A gi YFL066C
2 YLR363W-A gi YNR073C
3 YBL012C pp YCL045C
3 YLR363W-A gi YKL100C
...

Thank you - Once again, *very* much appreciated!

Sincerely, Greg S.

Franklin52 · April 11, 2008, 10:08am

This should give the desired output:

 awk '
FNR==NR{a[tolower($1)]=$2;next} 
tolower($1) in a{print "1 " a[tolower($1)] " tf " toupper($2)}
' "File_2" "File_1"

Regards
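The request asked for tab-delimited output, while the command above joins the fields with single spaces. A minimal sketch of a tab-delimited variant follows, assuming both inputs are strictly tab-separated; the scratch directory and the sample rows are only there to make the example self-contained.

```shell
# Sketch only: recreate small sample inputs in a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf 'Abf1\tYKL112w\nAce2\tYKL150w\nCup9\tYDR441c\n' > "$tmp/File_1"
printf 'ABF1\tYKL112W\nACE2\tYLR131C\nCUP9\tYPL177C\n' > "$tmp/File_2"

# Same lookup logic as the command above; OFS="\t" makes print join its
# comma-separated arguments with tabs instead of spaces.
result=$(awk 'BEGIN { FS = OFS = "\t" }
FNR == NR { a[tolower($1)] = toupper($2); next }   # first file: build the lookup table
tolower($1) in a { print 1, a[tolower($1)], "tf", toupper($2) }
' "$tmp/File_2" "$tmp/File_1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Rows of File_1 whose first field has no entry in the lookup table are silently dropped, exactly as in the original command.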

gstuart · April 11, 2008, 2:32pm

This is absolutely wonderful! ...

Here is my understanding of Franklin52's code:

Unix Manuals - AWK Reference

# == means "is equal to"

tolower(string): Return the string with all upper case characters replaced with their lower case equivalents.

toupper(string): Return the string with all lower case characters replaced with their upper case equivalents.

FNR: Record number within the current input file (it restarts at 1 for each new file).

NR: Total number of records read so far, across all input files.
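To see how the two counters behave when awk reads two files in a row, here is a small sketch. The file names "lookup" and "data" and their contents are made up for the illustration.

```shell
# Sketch only: two tiny scratch files (hypothetical names and contents).
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/lookup"
printf 'p\nq\nr\n' > "$tmp/data"

# Print both counters for every record, plus which file the FNR == NR test implies.
counters=$(awk '{ print NR, FNR, (FNR == NR ? "lookup" : "data") }' "$tmp/lookup" "$tmp/data")

printf '%s\n' "$counters"
rm -rf "$tmp"
```

NR keeps counting across files while FNR restarts at 1, so FNR == NR holds only while the first file is being read. That is why a block guarded by FNR == NR runs for the lookup file only, where it stores values in an array for later use rather than changing any input. One caveat: if the first file is empty, the test is also true for the second file.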

Thus, the above script translates (? - please correct me if I am mistaken) as

awk '
FNR==NR{a[tolower($1)]=$2;next}

while the record number (line) equals the total number of records (is true), do all of the following:
get $1 (the common gene name - converted to LOWERcase - required since the corresponding field in File_1 is lowercase; otherwise, it will fail to "match" - Linux is case-sensitive) in the lookup file (File_2), set (change it) to the (already uppercase) systematic gene name ($2) in the same lookup table, then read the next record number (line);

tolower($1) in a{print "1 " a[tolower($1)] " tf " toupper($2)}

now, for each $1 in File_2 (now set to uppercase $2, from the lookup table), in the second file (File_1, the one to be converted), print
"1", $2 from File_2; "tf", $2 from File_1 (returned as uppercase, to convert the trailing lowercase c, w, -a, etc.)

' "File_2" "File_1"

File_1 = file to be processed (converted)
File_2 = "lookup file" ("common_to_systematic.tab")

?!

This works brilliantly!! Thank you so much, Franklin52!!

Have a super weekend! ... Greg

RickR · February 6, 2009, 9:37am

Is it possible to modify the script above to rename files based on a lookup table?

e.g.:
Current New
A87324.jpg A1372365.jpg
A89732.jpg A98274.jpg
A130347.jpg A73689.jpg
...

Thanks,

Rick

vgersh99 · February 6, 2009, 9:49am

#!/bin/ksh

while read current new x
do
   mv "${current}" "${new}"
done < /path/to/lookupFile
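A slightly more defensive variant of the loop above, as a sketch: it skips the "Current New" header row from the example table, ignores entries whose source file is missing, and refuses to overwrite an existing target. The scratch directory, the file name lookup.txt, and the whitespace-separated two-column layout are assumptions made for the illustration.

```shell
# Sketch only: scratch directory with two sample files and a lookup table.
tmp=$(mktemp -d)
touch "$tmp/A87324.jpg" "$tmp/A89732.jpg"
printf 'Current New\nA87324.jpg A1372365.jpg\nA89732.jpg A98274.jpg\nA130347.jpg A73689.jpg\n' > "$tmp/lookup.txt"

renamed=0
while read -r current new rest
do
    # Skip the header row of the lookup table.
    if [ "$current" = "Current" ]; then continue; fi
    # Skip entries whose source file does not exist.
    if [ ! -e "$tmp/$current" ]; then continue; fi
    # Never overwrite an existing target.
    if [ -e "$tmp/$new" ]; then continue; fi
    mv -- "$tmp/$current" "$tmp/$new" && renamed=$((renamed + 1))
done < "$tmp/lookup.txt"

echo "renamed $renamed file(s)"
# The scratch directory can be removed afterwards with: rm -rf "$tmp"
```

File names containing spaces would need a different separator in the lookup table, since read splits on whitespace.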

allrise123 · May 23, 2009, 4:10pm

hi guys!!

I am new to shell scripting. I wanted to know about the sed command and how it works.

Here is what I want to do: I want to search for the original string in the export.txt file, which is:
export mibs =\opt\mymibs\

I want to replace it with
export mibs =\opt\new_mibs\

Please help with it

thanks in advance

allrise123 · May 23, 2009, 4:12pm

hi vgersh99,
could u solve my above query? thanks

devtakh · May 23, 2009, 11:22pm

Duplicate post:

sed 's|export mibs =\\opt\\mymibs\\|export mibs =\\opt\\new_mibs\\|g' export.txt
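A backslash is the escape character in both the pattern and the replacement of a sed substitution, so each literal backslash has to be written as \\. A minimal sketch on a scratch copy of the file (the temporary path is hypothetical):

```shell
# Sketch only: write the sample line to a scratch file.
tmp=$(mktemp -d)
printf '%s\n' 'export mibs =\opt\mymibs\' > "$tmp/export.txt"

# Each \\ stands for one literal backslash; | is used as the delimiter
# so the pattern does not need any further escaping.
updated=$(sed 's|\\opt\\mymibs\\|\\opt\\new_mibs\\|' "$tmp/export.txt")

printf '%s\n' "$updated"
rm -rf "$tmp"
```

sed writes the result to standard output and leaves the input file untouched; redirect the output to a new file to keep it.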

-Devaraj Takhellambam

aenagy · June 7, 2009, 1:41pm

I am trying to modify the code above for a similar situation. I have two input files. The first file (Datastores) is a CSV with friendly names in the first column and UUIDs in the second column. The second file (VMs) is a list of files with the full path using the UUID. For example:

----- datastores.csv -----
friendly name 1, UUID1
friendly name 2, UUID2
friendly name 3, UUID3
etc
----- datastores.csv -----

----- VMs.txt -----
/folder/UUID3/vm1.vmx
/folder/UUID2/vm2.vmx
/folder/UUID1/vm3.vmx
/folder/UUID3/vm4.vmx
etc
----- VMs.txt -----

What I am looking for is output that looks like this:

----- output.txt -----
/folder/friendly name 3/vm1.vmx
/folder/friendly name 2/vm2.vmx
/folder/friendly name 1/vm3.vmx
/folder/friendly name 3/vm4.vmx
etc
----- output.txt -----

The sample awk is not intuitive for me, even after reading the other explanation and going over the O'Reilly pocket reference. The case of either input does not need to be changed; if there is a problem with case matching, then I have other issues to deal with.

Thanks for your help in advance.

summer_cherry · June 8, 2009, 6:11am

nawk 'NR==FNR {
_[toupper($1)]=$2
next
}
NR!=FNR{
	printf("1 %s tf %s\n",_[toupper($1)],$2)
}' file1 file2
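For the datastore question above, the same two-file idea can be applied to path components instead of whole columns. The following is only a sketch under stated assumptions: datastores.csv has exactly one ", " separator per line (no commas inside friendly names), the UUID appears as a complete path component in VMs.txt, and neither file has trailing whitespace. The sample data and the scratch directory are made up.

```shell
# Sketch only: recreate the two sample inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'friendly name 1, UUID1\nfriendly name 2, UUID2\nfriendly name 3, UUID3\n' > "$tmp/datastores.csv"
printf '/folder/UUID3/vm1.vmx\n/folder/UUID2/vm2.vmx\n/folder/UUID1/vm3.vmx\n' > "$tmp/VMs.txt"

mapped=$(awk '
FNR == NR {                       # first file: map UUID to friendly name
    split($0, kv, /, */)
    name[kv[2]] = kv[1]
    next
}
{                                 # second file: rebuild each path piece by piece
    n = split($0, part, "/")
    out = part[1]
    for (i = 2; i <= n; i++)
        out = out "/" ((part[i] in name) ? name[part[i]] : part[i])
    print out
}' "$tmp/datastores.csv" "$tmp/VMs.txt")

printf '%s\n' "$mapped"
rm -rf "$tmp"
```

Path components that have no entry in the CSV are passed through unchanged, so lines with an unknown UUID still appear in the output.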