awk match to update contents of file

cmccabe · September 6, 2016, 4:31pm

I am trying to match $1 in file1 with $2 in file2 . If a match is found then $3 and $4 of file2 are copied to file1 . Both files are tab-delimeted and I am getting a syntax error and would also like to update file1 in-place without creating a new file, but am not sure how. Thank you :).

file1

file2

...    789    yyy    xxx    zzz    11111
...    111    xxx    yyy    aaa    22222
...    222    zzz    xxx    yyy    33333

awk

awk -F'\t' 'NR==FNR{A[$1]=$2; next} $2 in A{$3=A[$3] $3 in A{$4=A[$4]}}1' file1 file2 > output
awk: cmd. line:1: NR==FNR{A[$1]=$2; next} $2 in A{$3=A[$3] $3 in A{$4=A[$4]}}1
awk: cmd. line:1:

desired output (file1 updated to)

...    789    yyy    xxx
...    111    xxx    yyy
...    222    zzz    xxx

Don_Cragun · September 6, 2016, 5:10pm

So, to be clear, your desired output seems to indicate that lines in file1 that do not have a match in file2 are to be deleted and lines that do have a match are to be replaced by the first four fields in file2??? I didn't see anything about deleting non-matching lines in your description and I didn't see anything about copying field 1 from file2 to file1 in your description.

cmccabe · September 6, 2016, 5:50pm

The lines in file1 that do not have a match in file2 just need to be copied over. They do not need to be deleted.

updated output

123
456
...    789    yyy    xxx
...    111    xxx    yyy
...    222    zzz    xxx
333
444
555

Thank you and I apologize for that over-sight :).

RudiC · September 6, 2016, 6:07pm

Try

awk 'NR == FNR {T[$2] = $1 FS $2 FS $3 FS $4; next} $1 in T {$1 = T[$1]} 1' FS="\t" file2 file1
123
456
...	789	yyy	xxx
...	111	xxx	yyy
...	222	zzz	xxx
333
444
555

awk can't modify in place, you need to go via a temp file.

Don_Cragun · September 6, 2016, 8:01pm

At the expense of a little bit of memory during runtime, we can avoid a temp file:

awk '
BEGIN {	FS = OFS = "\t"
}
NR == 1 {
	outfile = FILENAME
}
FNR == NR {
	o[i[++ic] = $1] = $1
}
{	if($2 in o)
		o[$2] = $1 OFS $2 OFS $3 OFS $4
}
END {	for(j = 1; j <= ic; j++)
		print o[i[j]] > outfile
}' file1 file2

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

cmccabe · September 7, 2016, 11:00am

Thank you both very much :).

---------- Post updated at 10:00 AM ---------- Previous update was at 09:08 AM ----------

Since it is possible for the value in $1 to be non-unique I added o[i[++ic] = $1,$2,$3] =$1 multiple fields to the array, but the file is empty. Thank you :).

file1

123     1     2
456     a     b
456     x     y
789     x     y

file2

456     x     y     z     1
789     x     y     z     2

awk

awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1,$2,$3] =$1        
}
{if($2 in o)
o[$2] = $1 OFS $2 OFS $3 OFS $4
}
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2

desired result

123     1     2
456     a     b
456     x     y     z
789     x     y     z

looney · September 7, 2016, 11:04am

Both the codes giving me output below

I have recently installed Ubuntu in my laptop , do i need to update anything? . Please help. And the last code by cmccabe is giving blank lines in file1.

vgersh99 · September 7, 2016, 11:10am

cmccabe:

Thank you both very much :).

---------- Post updated at 10:00 AM ---------- Previous update was at 09:08 AM ----------

Since it is possible for the value in $1 to be non-unique I added o[i[++ic] = $1,$2,$3] =$1 multiple fields to the array, but the file is empty. Thank you :).

file1
123     1     2
456     a     b
456     x     y
789     x     y
file2
456     x     y     z     1
789     x     y     z     2
awk
awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1,$2,$3] =$1        
}
{if($2 in o)
o[$2] = $1 OFS $2 OFS $3 OFS $4
}
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2
desired result
123     1     2
456     a     b
456     x     y     z
789     x     y     z

what do you think is happening here?

o[i[++ic] = $1,$2,$3] =$1

cmccabe · September 7, 2016, 11:23am

o[i[++ic] = $1,$2,$3] =$1

$1,$2,$3 are being stored in array o , I think? Thank you

or should that be

o[i[++ic] = $1,$2,$3] ={[$1,$2,$3]}

.

RavinderSingh13 · September 7, 2016, 11:34am

Hello cmccabe,

You could split the arrays like as follows to make it easier.
o[i[++ic] = $1,$2,$3] =$1 could be written as:

i[++ic] = $1,$2,$3   
o[i[ic]]=$1

Following is the explanation too for o[i[++ic] = $1,$2,$3] =$1 (Only for understanding I have split it into 2 steps).

i[++ic] = $1,$2,$3       
array named i whose value is increasing number of variable named i and it's value is $1,$2,$3.
o[i[++ic] = $1,$2,$3] =$1
array named o whose index is value of array i (whose index is increasing value if variable named ic(each time it will be increased by 1.)) and it's value is the value of field 1st.

Thanks,
R. Singh

cmccabe · September 7, 2016, 12:55pm

Thanks you, but I am getting a syntax error... do I need to enclose the array in {} ? Thank you :).

awk '
BEGIN {    FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
i[++ic] = $1,$2,$3
o[i[++ic] = $1,$2,$3] =$1
}
{    if($2 in o)
o[$2] = $1 OFS $2 OFS $4 OFS $5 OFS $50 OFS $51 OFS $52 OFS $53
}
END {    for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' match.txt M28189_val
awk: cmd. line:8: i[++ic] = $1,$2,$3
awk: cmd. line:8:             ^ syntax error

Don_Cragun · September 7, 2016, 12:56pm

Hi cmccabe,
Let us go back to post #1 in this thread where you said that you were matching field #1 in file1 with field #2 in file2 . Then let us go to post #6 where your new data has all numeric values in field #1 in file1 and all alphabetic values in field #2 in file2 . Do you think that part of your problem might be that there are no matching fields?

In my code, the indexes in the array i[] are input line numbers and the values assigned to elements in array i[] are the keys (the field #1 value in file1 which will also be the field #2 value in matching lines in file2 ) used to lookup values in the array o[]. Since there are no cases in any of your sample input where field #2 in file2 matches fields 1, 2, AND 3 in file1 ; you get an empty output line for each line in the starting contents of file1 .

Given that the format of your input files no longer matches the format described in post #1 in this thread, I have no idea what you are trying to do with the new index values you are using. Maybe if you go back to the start and describe the contents of both input files, the keys used to match lines in the two files, and the fields to be included in the updated file1 for lines that have no matching line in file2 and the fields to be included in the updated file1 for lines that do have a matching line in file2 .

If keys in file1 are not unique and you do not want identical output for those lines that have a matching key in file2 , you need to clearly describe what output is supposed to be produced for each line in file1 with an identical key.

And, for the record, the standard way to assign a multi-dimensional index value to an array element would be:

o[i[++ic] = $1 SUBSEP $2 SUBSEP $3] = "whatever you want to print if no match is found for the 3 fields in file2 that match the 1st 3 fields of file1"

cmccabe · September 7, 2016, 1:15pm

Here are a few lines of the tab-delimited input that I use (the actual files are several million lines).

file1

Match:
68521889    C    T
167099158    A    G
18122506    G    A

Basically, I just am trying to use $1,$2, and $3 as the unique key, and use that to lookup in file2 , also tab-delimeted. I apologize for the confusion and appreciate the help. Since my files are rather large I was trying to be brief, but I can see that's no help

awk

awk '
BEGIN {    FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 FS $2 FS $3] =$1
}
{    if($2 in o)
o[$2] = $1 OFS $2 OFS $4 OFS $5 OFS $50 OFS $51 OFS $52 OFS $53
}
END {    for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2

The awk runs but the result is just $1 of file1

RavinderSingh13 · September 7, 2016, 1:51pm

cmccabe:

Thanks you, but I am getting a syntax error... do I need to enclose the array in {} ? Thank you :).

awk '
BEGIN {    FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
i[++ic] = $1,$2,$3
o[i[++ic] = $1,$2,$3] =$1
}
{    if($2 in o)
o[$2] = $1 OFS $2 OFS $4 OFS $5 OFS $50 OFS $51 OFS $52 OFS $53
}
END {    for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' match.txt M28189_val
awk: cmd. line:8: i[++ic] = $1,$2,$3
awk: cmd. line:8:             ^ syntax error

Hello cmccabe,

Off course it will give an error because it is allowed to seperate variables inside the array's index not while defining their values(though I am not at all sure what you are trying to do here), so in case you want to save an array's value like above then you could do following like:

i[++ic] = $1","$2","$3

I would like to request you here, please rephrase your requirement here as it is not clear at all this time what you are trying to do(for me at least), Solutions were given for your previous reuirement, so now if you have a different requirement then please rephrase it with sample Input_file and all rules with expected output too, I hope this helps you.

Thanks,
R. Singh

cmccabe · September 7, 2016, 2:22pm

Basically, I just am trying to use $1,$2, and $3 as the unique key, and use that to lookup in $2, $3, $4 in file2 , also tab-delimited. If a match is foung then it prints out the selected fields, if it does not match then it can just skip it. I apologize for the confusion and appreciate the help. Since my files are rather large I was trying to be brief, but I can see that's no help

file1

Match:
68521889    C    T
167099158    A    G
18122506    G    A

file2

....
....
.... 68521889    C    T     1   2
.... 167099158    A    G  1    2
.... 18122506    G    A    1    2

awk

awk '
BEGIN {    FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
i[++ic] = $1","$2","$3 
}
{    if($2 in o)
o[$2] =  $2 OFS $3 OFS $4
}
END {    for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2

desired result (updated file1)

68521889    C    T     1
167099158    A    G  1
18122506    G    A    1

RavinderSingh13 · September 7, 2016, 3:00pm

Hello cmccabe,

Let's say we have following Input_files:

cat Input_file1
Match:
68521889    C    T
167099158    A    G
18122506    G    A
212121313  Q    W
234324234  t    w

cat Input_file2
.... 68521889    C    T     1   2
.... 167099158    A    G  1    2
.... 18122506    G    A    1    2

Then following is the code on same.

awk 'NR==1{out_file=FILENAME;next} FNR==NR{A[$1,$2,$3]=$0;next} (($2,$3,$4) in A){Q=Q?Q ORS A[$2,$3,$4] OFS $5:A[$2,$3,$4] OFS $5} END{print Q > out_file}' OFS="\t" Input_file1  Input_file2

Output will be stored into Input_file1 as follows(you could set field seprator to tab if you have tab delimited Input_files).

68521889    C    T	1
167099158    A    G	1
18122506    G    A	1

Thanks,
R. Singh

cmccabe · September 7, 2016, 4:08pm

I have been trying to understand how the awk @Don Cragun posted. I have modified it slightly and it is very close to working, but if a match is not found then nothing needs to happen or that line is skipped and left unchanged. Currently the lines that do not match are removed and replaced with a null value. I think the line in bold needs to have something added to it, but I am not sure what.

awk

awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 FS $2 FS $3]     
}
{if($2 FS $4 FS $5 in o)
o[$2 FS $4 FS $5] = $2 OFS $4 OFS $5 OFS $50 OFS $51 OFS $52 OFS $53
}
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2

Original file1 before awk was run

Match:
68521889    C    T
167099158    A    G
18122506    G    A

current output

68521889    C    T    GOOD    50    het    4
167099158    A    G    GOOD    210    hom    55
18122506    G    A    GOOD    189    het    8

desired output

Match:
68521889    C    T    GOOD    50    het    4
167099158    A    G    GOOD    210    hom    55
18122506    G    A    GOOD    189    het    8

Thank you all for your help, explanations, and patience

Don_Cragun · September 7, 2016, 6:17pm

cmccabe:

I have been trying to understand how the awk @Don Cragun posted. I have modified it slightly and it is very close to working, but if a match is not found then nothing needs to happen or that line is skipped and left unchanged. Currently the lines that do not match are removed and replaced with a null value. I think the line in bold needs to have something added to it, but I am not sure what.

awk
awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 FS $2 FS $3]     
}
{if($2 FS $4 FS $5 in o)
o[$2 FS $4 FS $5] = $2 OFS $4 OFS $5 OFS $50 OFS $51 OFS $52 OFS $53
}
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2
Original file1 before awk was run
Match:
68521889    C    T
167099158    A    G
18122506    G    A
current output
68521889    C    T    GOOD    50    het    4
167099158    A    G    GOOD    210    hom    55
18122506    G    A    GOOD    189    het    8
desired output
Match:
68521889    C    T    GOOD    50    het    4
167099158    A    G    GOOD    210    hom    55
18122506    G    A    GOOD    189    het    8
Thank you all for your help, explanations, and patience

Hi cmccable,
I would have hoped that by now you would know that if you can't decipher my code and you tell me you can't understand it, I would be happy to supply a commented version....

Yes, if you do not assign any value to o[key] , lines for keys that are not found in file2 will be printed as empty lines. If you want unmatched lines to be left unchanged, then o[i[line#] = key] must be set to the original input line (i.e., $0 ) from file1 .

With such abbreviated data from file1 and no sample file2 data, I can only make wild guesses. But, I think you're getting close. Try:

awk '
BEGIN {	FS = OFS = "\t"
}
NR == 1 {
	outfile = FILENAME
}
FNR == NR {
	o[i[++ic] = $1 OFS $2 OFS $3] = $0
	next
}
{	if($2 OFS $4 OFS $5 in o)
		o[$2 OFS $4 OFS $5] = $2 OFS $4 OFS $5 OFS $50 OFS $51 OFS $52 OFS $53
}
END {	for(j = 1; j <= ic; j++)
		print o[i[j]] > outfile
}' file1 file2

or, slightly more compactly, but with commentary added:

awk '
BEGIN {	FS = OFS = "\t"
}
NR == 1 {
	# Save the output file pathname from the pathname naming the 1st input
	# file.
	outfile = FILENAME
}
FNR == NR {
	# For each line in the 1st input file set i[line#] to the three fields
	# in the 1st input file (and keep a count of the number of lines in
	# that file (ic) that make up that key, and set o[key] to the entire
	# input line.
	o[i[++ic] = $1 OFS $2 OFS $3] = $0
	next	# This line was missing in my original script, but since there
		# was only one field in file1 and there were no empty lines in
		# file1, it didn't affect the final output.  With the new input
		# file formats, this line must be included to guarantee correct
		# results.
}
{	# Set key to the key fields from file2.  If this key is present in the
	# o[] array, set o[key] to the desired otuput for this key.
	if((key = $2 OFS $4 OFS $5) in o)
		o[key] = key OFS $50 OFS $51 OFS $52 OFS $53
}
END {	# Now that we have processed all of the data in both input files,
	# overwrite the 1st input file with the desired output replacing each
	# line in that file with its original contents (if there was no match)
	# or the collected fields from the 2nd input file (if there was a
	# match).
	for(j = 1; j <= ic; j++)
		print o[i[j]] > outfile
}' file1 file2

since we have no file2 to use as sample data, both of these are obviously untested. I know that all of the tabs I have in my code don't matter to awk , but for my sanity (and for the mental health of anyone who may try to decipher this code later), please do not remove them.

cmccabe · September 10, 2016, 8:50am

Thank you all for your help and explanations.

Mannu2525 · September 13, 2016, 4:09pm

If I assume your sample data then below code might work in this case-

awk -F '\t' 'NR==FNR{A[$1]=$0; next} $2 in A{print $0;}1' file1 file2 | uniq -c | awk '$1 != "1" {print $4"\t"$5}' > outfile

HTH!!

Regards,
Mannu