Text processing using awk

dovah · July 16, 2014, 2:15pm

I have two tab-delimited files (the first column is the primary key):

File 1 (there are multiple rows sharing the same key, I cannot merge them)

A    28,29,30,31
A    17,18,19
B    11,13,14,15
B    8,9

File 2 (there is only one row for a given key)

A    2,8,18,30,31
B    3,11

I'd like to append a tab-separated star to a row of File 1 for each element of its second column that also appears in the second column of File 2 under the same key.

The output should look like:

A    28,29,30,31        **
A    17,18,19        *
B    11,13,14,15        *
B    8,9

I've been trying to write an awk solution, but I'm stuck. Please let me know if you have an idea of how I could approach this.

RudiC · July 16, 2014, 2:27pm

Please show us your awk approach.

dovah · July 16, 2014, 5:43pm

Something like this. But it really needs a fix; it doesn't give the expected output.

$ awk '
    FNR == NR {
        a[$1] = $2;
        next;
    }
    {
        split($2, b, ",");
        split(a[$1], c, ",");
        for (i in b) {
            if (b[i] in c) {
                printf("%s %s\t*\n", $1, a[$1]);
                next;
            }
        }
        print $1, a[$1];
    }
' file1 file2

Thanks.

Scrutinizer · July 17, 2014, 1:23am

You were on the right track. Here is an approach with two-dimensional arrays:

awk '{split($2,F,/,/)} NR==FNR{for(i in F) A[$1,F[i]]; next} {for(i in F) if(($1,F[i]) in A) $3=$3 "*"}1' FS='\t' OFS='\t' file2 file1

or

awk '{split($2,F,/,/); for(i in F) if(NR==FNR){A[$1,F[i]]} else if(($1,F[i]) in A) $3=$3 "*"}NR>FNR' FS='\t' OFS='\t' file2 file1
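
For anyone who wants to try the idea on the sample data, here is a spelled-out sketch of the same technique. It is not a replacement for the one-liners above. The scratch directory and the names `seen`, `stars` and `out` are made up for illustration, and it assumes `mktemp -d` is available on your system. The file names and contents are taken from the first post.

```shell
# Recreate the sample files from the first post in a scratch directory.
dir=$(mktemp -d)
printf 'A\t28,29,30,31\nA\t17,18,19\nB\t11,13,14,15\nB\t8,9\n' > "$dir/file1"
printf 'A\t2,8,18,30,31\nB\t3,11\n' > "$dir/file2"

out=$(awk '
    BEGIN { FS = OFS = "\t" }

    # Both files: break the comma-separated second column into F[1..n].
    { n = split($2, F, ",") }

    # First file on the command line (file2): remember every (key, value) pair.
    NR == FNR {
        for (i = 1; i <= n; i++) seen[$1, F[i]] = 1
        next
    }

    # Second file (file1): collect one star per value also listed in file2
    # under the same key, and add a third column only if there was a match.
    {
        stars = ""
        for (i = 1; i <= n; i++)
            if (($1, F[i]) in seen) stars = stars "*"
        if (stars != "") $3 = stars
        print
    }
' "$dir/file2" "$dir/file1")

printf '%s\n' "$out"
```

Note that file2 is read first, so its pairs are already stored when the rows of file1 are checked. The subscript `seen[$1, F[i]]` joins the key and the value into a single index, which is why the 8 listed under A in file2 does not mark the row `B    8,9`. With the sample data this should print the four rows from the first post, with a single tab before the stars.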