awk to update file with partial matching line in another file and append text

cmccabe · April 13, 2019, 7:58am

In the awk below I am trying to copy each matching line in f2 into $3 of f1 if $2 of f1 appears somewhere in that line of f2. There will always be a match (usually more than 1) and my actual data is much larger (several hundred lines) in both f1 and f2. When the line from f2 is pasted into $3 of f1, /test/id/$1_raw.file_fixed.txt is appended to the end of it. Most of that suffix is static text; only the value from $1, which comes after the third /, changes. Thank you :).

f1

xyxy_0268 0000-yyyy
xyxy_0270 1111-xxxx
R_0000_00_02_00_45_32_xxxx_x0-0000-100-x0.0_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

f2

xyxy_0268 0000-yyyy /path/to/the/xxx/data/0000-yyyy_v1_0000-yyyy_RNA_v1/190326-Control_v1_20190328071906449 /path/to/the/xxx/data/00-0000_xxxx-03_v1/00-0000_xxxx-03_v1_20190322115521953
xyxy_0270 1111-xxxx /path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953

desired

xyxy_0268 0000-yyyy /path/to/the/xxx/data/0000-yyyy_v1_0000-yyyy_RNA_v1/190326-Control_v1_20190328071906449/test/id/xyxy_0268_raw.file_fixed.txt
xyxy_0270 1111-xxxx /path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953/test/id/xyxy_0270_raw.file_fixed.txt
R_0000_00_02_00_45_32_xxxx_x0-0000-100-x0.0_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

awk

awk 'NR==FNR {id[$2]; next} $2 in id' f1 f2 | awk '1;$NF=/$[id]/{ print "/test/id/$[id]_raw.file_fixed.txt"}' > out

Don_Cragun · April 13, 2019, 9:46pm

I'm not sure that I am following what you are trying to do.

We can easily produce the output you say you want from your two sample input files with just:

awk 'FNR == NR{print; next} NF == 1' f2 f1

or even:

awk 'FNR == NR; NF == 1' f2 f1

Scrutinizer · April 14, 2019, 2:00am

Hi,

Try:

awk 'NR==FNR {id[$2]=$3; next} $2 in id{$3=id[$2] "/test/id/" $1 "_raw.file_fixed.txt"}1' f2 f1
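
One way to check this against the sample data from the first post is to recreate the files in a scratch directory and compare the result with the desired output. This is only a sketch: the scratch directory and the file name desired are assumptions made for the check, not part of the original question.

```shell
# Work in a scratch directory so no existing f1/f2/out is overwritten.
cd "$(mktemp -d)"

cat > f1 <<'EOF'
xyxy_0268 0000-yyyy
xyxy_0270 1111-xxxx
R_0000_00_02_00_45_32_xxxx_x0-0000-100-x0.0_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
EOF

cat > f2 <<'EOF'
xyxy_0268 0000-yyyy /path/to/the/xxx/data/0000-yyyy_v1_0000-yyyy_RNA_v1/190326-Control_v1_20190328071906449 /path/to/the/xxx/data/00-0000_xxxx-03_v1/00-0000_xxxx-03_v1_20190322115521953
xyxy_0270 1111-xxxx /path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953
EOF

cat > desired <<'EOF'
xyxy_0268 0000-yyyy /path/to/the/xxx/data/0000-yyyy_v1_0000-yyyy_RNA_v1/190326-Control_v1_20190328071906449/test/id/xyxy_0268_raw.file_fixed.txt
xyxy_0270 1111-xxxx /path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953/test/id/xyxy_0270_raw.file_fixed.txt
R_0000_00_02_00_45_32_xxxx_x0-0000-100-x0.0_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
EOF

# f2 is read first: remember $3 keyed by $2. Then rewrite $3 of matching f1 lines.
awk 'NR==FNR {id[$2]=$3; next} $2 in id{$3=id[$2] "/test/id/" $1 "_raw.file_fixed.txt"}1' f2 f1 > out

# diff prints nothing and returns 0 when the two files are identical.
diff desired out && echo "out matches desired"
```

The third line of f1 has only one field, so $2 is empty, is not a key in id, and the line is printed unchanged by the trailing 1.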

But you wrote that $2 of f1 can be in the line in f2 "somewhere", so in that case you may need to try something like:

awk 'NR==FNR {for(i=1; i<=NF; i++) id[$i]=$3; next} $2 in id{$3=id[$2] "/test/id/" $1 "_raw.file_fixed.txt"}1' f2 f1

cmccabe · April 16, 2019, 5:47pm

I made a typo in f2.
It should just be one column of multiple strings.

/path/to/the/xxx/data/0000-yyyy_v1_0000-yyyy_RNA_v1/190326-Control_v1_20190328071906449
/path/to/the/xxx/data/00-0000_xxxx-03_v1/00-0000_xxxx-03_v1_20190322115521953
/path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953

So I thought I had adjusted the script below correctly to capture the partial match; that is, $2 will be somewhere in that long string. Thank you :).

awk

awk 'NR==FNR {for(i=1; i<=NF; i++) id[$i]=$1; next} $2 in id{$3=id/[$2]/ "/test/id/" $1 "_raw.file_fixed.txt"}1' f2 f1
awk 'NR==FNR {for(i=1; i<=NF; i++) id[$i]=$1; next} $2 in id{$3=id/[$i]/ "/test/id/" $1 "_raw.file_fixed.txt"}1' f2 f1

Don_Cragun · April 19, 2019, 12:40am

It still isn't clear to me what you are trying to do, but I assume that the above code is giving errors for trying to divide by an element of an unnamed array and then trying to divide the result of that by a non-numeric string. But, that problem may be hidden by the fact that no value in $2 in f1 (i.e., 0000-yyyy , 1111-xxxx , or the empty string from the 3rd line in f1 ) ever appears as a field in f2 . And, therefore, the condition $2 in id is never true.

And, since there is only one field in every line in f2 , the only array subscripts in the id[] array are complete lines from f2 . Why run a loop from 1 through NF when NF is always 1?
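
The difference between an exact subscript test and a substring test can be seen in a small sketch like the one below. It uses one path from the corrected f2 and one $2 value from f1; the variable names key, k, and result are made up for the illustration.

```shell
result=$(
printf '%s\n' '/path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953' |
awk '
{ id[$1] = $1 }                 # same array the loop builds when NF is 1
END {
    key = "1111-xxxx"           # a $2 value from f1
    # "in" only asks whether key is exactly one of the subscripts
    print ((key in id) ? "in: found" : "in: not found")
    # index() asks whether key occurs anywhere inside a subscript
    for (k in id)
        if (index(k, key)) print "index: found"
}'
)
printf '%s\n' "$result"
```

With that input this should print "in: not found" followed by "index: found", which is why the $2 in id condition never fires on the one-column f2.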

Since you say that the corrected f2 only contains one string, the first part of the code in both awk scripts above:

NR==FNR {for(i=1; i<=NF; i++) id[$i]=$1; next}

could more simply be written as:

NR==FNR {id[$0]=$0; next}

or as:

NR==FNR {id[$1]=$1; next}

which would both produce exactly the same id[] arrays.

Since there are no matching lines, I still can't make any sense of the description of what you are trying to combine from matching lines in the two files???
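
If the intent is that $2 from f1 should be found anywhere inside a line of the one-column f2, then one possible sketch, using a substring test instead of an array subscript test, is shown below. It rests on assumptions that are not confirmed in this thread: that only the first path in f2 containing $2 should be used (which is what the desired output in the first post suggests), that f1 lines with fewer than two fields are passed through unchanged, and that the array name path and the counter n are free to choose. The sample files are recreated in a scratch directory so that nothing existing is overwritten.

```shell
# Work in a scratch directory so no existing f1/f2/out is overwritten.
cd "$(mktemp -d)"

cat > f1 <<'EOF'
xyxy_0268 0000-yyyy
xyxy_0270 1111-xxxx
R_0000_00_02_00_45_32_xxxx_x0-0000-100-x0.0_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
EOF

cat > f2 <<'EOF'
/path/to/the/xxx/data/0000-yyyy_v1_0000-yyyy_RNA_v1/190326-Control_v1_20190328071906449
/path/to/the/xxx/data/00-0000_xxxx-03_v1/00-0000_xxxx-03_v1_20190322115521953
/path/to/the/xxx/data/1111-xxxx-03_v1/1111-xxxx-03_v1_20190322115521953
EOF

awk '
NR == FNR { if (NF) path[++n] = $1; next }   # first file (f2): keep paths in input order
NF >= 2 {                                    # f1 lines that have a value in $2
    for (i = 1; i <= n; i++)
        if (index(path[i], $2)) {            # plain substring test, no regex involved
            $3 = path[i] "/test/id/" $1 "_raw.file_fixed.txt"
            break                            # assumption: first matching path wins
        }
}
1                                            # print every f1 line, changed or not
' f2 f1 > out

cat out
```

index() is used rather than the ~ operator so that characters such as . in $2 are compared literally and not treated as regular expression metacharacters. With several hundred lines in each file the nested loop should still be quick, but that has not been measured here.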