Help with awk program

dietmar13 · October 22, 2013, 7:37am

I have two files. The first looks like this (file1):

novelMiR_892    novelMiR_891,
novelMiR_852    
novelMiR_893    
novelMiR_1661    
novelMiR_854    
novelMiR_1210    
novelMiR_1251    
novelMiR_855    
novelMiR_1252    
novelMiR_897    novelMiR_2336,novelMiR_2335,

The second looks like this (file2):

>novelMiR_891
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_852
HHHHGGGFFFDD

Now I want to rename every ">header" in file2 whose name appears after the first name on a line of file1, replacing it with that first name. This is what I want (file3):

>novelMiR_892 (renamed)
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_852
HHHHGGGFFFDD

The first header is renamed because file1 lists novelMiR_891 on the same line as novelMiR_892.

My attempt (which does NOT work):

awk 'NR==FNR{n[$1]=$1","$2;next} { $1 ~ ">" ;
name=substr($1,2,length($1)-1); getline seq; 
{for (i in n) if(n[i] ~ /'"$name"'/) names=i} print names "\n" seq > "file3" }' file1 file2

explanation:

First I create an array in which each element holds all names of a line joined by "," and is indexed by the name I want to use:

n[novelMiR_892] = novelMiR_892,novelMiR_891,

Then I read file2 one header at a time (without the ">") together with its sequence, and check whether the name occurs in one of the elements of the n array. If it does, the index of that element should be printed as the new name.

But I always get only the first name for all sequences:

>novelMiR_892
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_892
HHHHGGGFFFDD

Where is my mistake?

Akshay_Hegde · October 22, 2013, 8:09am

Try

$ cat file1
novelMiR_892    novelMiR_891,
novelMiR_852    
novelMiR_893    
novelMiR_1661    
novelMiR_854    
novelMiR_1210    
novelMiR_1251    
novelMiR_855    
novelMiR_1252    
novelMiR_897    novelMiR_2336,novelMiR_2335,

$ cat file2
>novelMiR_891
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_852
HHHHGGGFFFDD

$ awk -F'[ ,]' 'FNR==NR{a=$1;$1="";gsub(" ",">");Arr[">"$0]=a;next}{for(i in Arr)$1=(i~$1)?">"Arr[i]:$1}1'  file1 file2

Result:

>novelMiR_892
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_852
HHHHGGGFFFDD

Scrutinizer · October 22, 2013, 8:49am

@OP: You are not using a next statement, it seems you have the two files reversed, and /'"$name"'/ uses the shell variable "$name" rather than the awk variable name.
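
To illustrate the last point: a value can be handed to awk with -v and compared as an ordinary awk variable, with no quote splicing inside the program. The following is only a sketch; the function name check and the variable list are made up for illustration, and index() is used instead of a regex match so that only whole comma-delimited names count.

```sh
# Hypothetical value, only for illustration: one joined entry as built by the OP.
list="novelMiR_892,novelMiR_891,"

check() {
    # $1 is the name to look up; -v turns it into the awk variable "name".
    awk -v name="$1" -v list="$list" 'BEGIN {
        # Compare whole comma-delimited entries, so that novelMiR_89
        # does not count as a match for novelMiR_891.
        if (index("," list, "," name ",")) { print "found" }
        else { print "not found" }
    }'
}

check novelMiR_891    # prints: found
check novelMiR_89     # prints: not found
```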

Alternatively try:

awk 'NR==FNR{for(i=2; i<=NF; i++) if($i)A[">" $i]=$1; next} $1 in A{$1=A[$1]}1' FS='[ \t]*|,' file1 file2

dietmar13 · October 22, 2013, 9:52am

Scrutinizer's solution works nearly perfectly; only the ">" is missing on all renamed headers.

As I don't understand the solution at all, I can't add it myself!

What does the 1 at the end mean? Does it print the line?

Akshay's solution nearly works.

These are only test data, but their structure is exactly that of the real files.

The renamed headers start with an unusual character and look like this:

 >novelMiR_11    novelMiR_10

Akshay_Hegde · October 22, 2013, 10:00am

I think you missed the ">". Suppose we take one more file, say file3:

$ cat file3
>novelMiR_891
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_852
HHHHGGGFFFDD
>novelMiR_2336
Test1 - Check
>novelMiR_2335
Test2 -Check

The result is:

novelMiR_892
AAAABBBCCCDDD
>novelMiR_892
BBBCCCDDDEEEF
>novelMiR_852
HHHHGGGFFFDD
novelMiR_897
Test1 - Check
novelMiR_897
Test2 -Check

Modified version of Scrutinizer's code

$ awk 'NR==FNR{for(i=2; i<=NF; i++) if($i)A[">" $i]=$1; next} $1 in A{$1=">"A[$1]}1' FS='[ \t]*|,' file1 file3
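
Since the OP mentioned not understanding the one-liner, the same logic can be spelled out over several lines. This is only a sketch: the array name alias, the separator '[ \t,]+', the substr call and the scratch directory are assumptions made for illustration, and the data is a shortened copy of the samples above with one made-up sequence (TTTT).

```sh
# Work in a scratch directory with a shortened copy of the sample data.
dir=$(mktemp -d)
printf '%s\n' 'novelMiR_892    novelMiR_891,' 'novelMiR_852    ' \
    'novelMiR_897    novelMiR_2336,novelMiR_2335,' > "$dir/file1"
printf '%s\n' '>novelMiR_891' 'AAAABBBCCCDDD' '>novelMiR_892' 'BBBCCCDDDEEEF' \
    '>novelMiR_852' 'HHHHGGGFFFDD' '>novelMiR_2335' 'TTTT' > "$dir/file2"

result=$(awk -F'[ \t,]+' '
    # First file: map every alias (fields 2..NF) to the name in field 1.
    NR == FNR {
        for (i = 2; i <= NF; i++)
            if ($i != "") alias[$i] = $1
        next                      # do not fall through to the rules below
    }
    # Second file: rewrite a header line if its name is a known alias.
    /^>/ {
        name = substr($0, 2)      # header text without the leading ">"
        if (name in alias) $0 = ">" alias[name]
    }
    { print }                     # print every line, changed or not
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"
rm -r "$dir"
```

Lines that are not headers listed as aliases are printed exactly as they were read.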

Scrutinizer · October 22, 2013, 10:16am

Thanks, yes I left out a ">", so Akshay posted a correction. This would be an alternative:

awk 'NR==FNR{for(i=2; i<=NF; i++) if($i)A[">" $i]=">" $1; next} $1 in A{$1=A[$1]}1' FS='[ \t]*|,' file1 file2

Perhaps this is a bit clearer:

awk 'NR==FNR{for(i=2; i<=NF; i++) if($i)A[$i]=$1; next} $2 in A{$2=A[$2]}1' FS='[ \t]*|,' file1 FS=\> OFS=\> file2
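
The assignments FS=\> OFS=\> in that command are processed when awk reaches them in the argument list, that is, after file1 has been read and before file2 is opened. With ">" as the separator, a header such as >novelMiR_891 has an empty $1 and the bare name in $2. A minimal sketch of the mechanism, using two made-up files (a.txt and b.txt are hypothetical names):

```sh
dir=$(mktemp -d)
printf 'x,y\n' > "$dir/a.txt"      # comma-separated
printf 'p:q\n' > "$dir/b.txt"      # colon-separated

# FS is "," while a.txt is read and ":" while b.txt is read.
second=$(awk '{ print $2 }' FS=, "$dir/a.txt" FS=: "$dir/b.txt")

printf '%s\n' "$second"
rm -r "$dir"
```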

--

Yes, the 1 means print the record (in this case, the entire line).
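
A small sketch of that shorthand: an awk rule has the form pattern { action }, and when the action is left out awk prints the record. The pattern 1 is always true, so a bare 1 behaves like { print }. The input lines below are made up.

```sh
# Both commands print every input line unchanged.
short=$(printf 'a b\nc d\n' | awk '1')
long=$(printf 'a b\nc d\n' | awk '{ print }')

# A record modified by an earlier rule is printed in its modified form.
changed=$(printf 'a b\nc d\n' | awk '$1 == "a" { $1 = "x" } 1')

printf '%s\n' "$short" "$long" "$changed"
```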