I am trying to match $1 in file1 with $2 in file2 . If a match is found then $3 and $4 of file2 are copied to file1 . Both files are tab-delimeted and I am getting a syntax error and would also like to update file1 in-place without creating a new file, but am not sure how. Thank you :).
So, to be clear, your desired output seems to indicate that lines in file1 that do not have a match in file2 are to be deleted and lines that do have a match are to be replaced by the first four fields in file2??? I didn't see anything about deleting non-matching lines in your description and I didn't see anything about copying field 1 from file2 to file1 in your description.
---------- Post updated at 10:00 AM ---------- Previous update was at 09:08 AM ----------
Since it is possible for the value in $1 to be non-unique I added o[i[++ic] = $1,$2,$3] =$1 multiple fields to the array, but the file is empty. Thank you :).
I have recently installed Ubuntu in my laptop , do i need to update anything? . Please help. And the last code by cmccabe is giving blank lines in file1.
You could split the arrays like as follows to make it easier. o[i[++ic] = $1,$2,$3] =$1 could be written as:
i[++ic] = $1,$2,$3
o[i[ic]]=$1
Following is the explanation too for o[i[++ic] = $1,$2,$3] =$1 (Only for understanding I have split it into 2 steps).
i[++ic] = $1,$2,$3
array named i whose value is increasing number of variable named i and it's value is $1,$2,$3.
o[i[++ic] = $1,$2,$3] =$1
array named o whose index is value of array i (whose index is increasing value if variable named ic(each time it will be increased by 1.)) and it's value is the value of field 1st.
Hi cmccabe,
Let us go back to post #1 in this thread where you said that you were matching field #1 in file1 with field #2 in file2 . Then let us go to post #6 where your new data has all numeric values in field #1 in file1 and all alphabetic values in field #2 in file2 . Do you think that part of your problem might be that there are no matching fields?
In my code, the indexes in the array i[] are input line numbers and the values assigned to elements in array i[] are the keys (the field #1 value in file1 which will also be the field #2 value in matching lines in file2 ) used to lookup values in the array o[]. Since there are no cases in any of your sample input where field #2 in file2 matches fields 1, 2, AND 3 in file1 ; you get an empty output line for each line in the starting contents of file1 .
Given that the format of your input files no longer matches the format described in post #1 in this thread, I have no idea what you are trying to do with the new index values you are using. Maybe if you go back to the start and describe the contents of both input files, the keys used to match lines in the two files, and the fields to be included in the updated file1 for lines that have no matching line in file2 and the fields to be included in the updated file1 for lines that do have a matching line in file2 .
If keys in file1 are not unique and you do not want identical output for those lines that have a matching key in file2 , you need to clearly describe what output is supposed to be produced for each line in file1 with an identical key.
And, for the record, the standard way to assign a multi-dimensional index value to an array element would be:
o[i[++ic] = $1 SUBSEP $2 SUBSEP $3] = "whatever you want to print if no match is found for the 3 fields in file2 that match the 1st 3 fields of file1"
Here are a few lines of the tab-delimited input that I use (the actual files are several million lines).
file1
Match:
68521889 C T
167099158 A G
18122506 G A
Basically, I just am trying to use $1,$2, and $3 as the unique key, and use that to lookup in file2 , also tab-delimeted. I apologize for the confusion and appreciate the help. Since my files are rather large I was trying to be brief, but I can see that's no help
Off course it will give an error because it is allowed to seperate variables inside the array's index not while defining their values(though I am not at all sure what you are trying to do here), so in case you want to save an array's value like above then you could do following like:
i[++ic] = $1","$2","$3
I would like to request you here, please rephrase your requirement here as it is not clear at all this time what you are trying to do(for me at least), Solutions were given for your previous reuirement, so now if you have a different requirement then please rephrase it with sample Input_file and all rules with expected output too, I hope this helps you.
Basically, I just am trying to use $1,$2, and $3 as the unique key, and use that to lookup in $2, $3, $4 in file2 , also tab-delimited. If a match is foung then it prints out the selected fields, if it does not match then it can just skip it. I apologize for the confusion and appreciate the help. Since my files are rather large I was trying to be brief, but I can see that's no help
file1
Match:
68521889 C T
167099158 A G
18122506 G A
file2
....
....
.... 68521889 C T 1 2
.... 167099158 A G 1 2
.... 18122506 G A 1 2
cat Input_file1
Match:
68521889 C T
167099158 A G
18122506 G A
212121313 Q W
234324234 t w
cat Input_file2
.... 68521889 C T 1 2
.... 167099158 A G 1 2
.... 18122506 G A 1 2
I have been trying to understand how the awk@Don Cragun posted. I have modified it slightly and it is very close to working, but if a match is not found then nothing needs to happen or that line is skipped and left unchanged. Currently the lines that do not match are removed and replaced with a null value. I think the line in bold needs to have something added to it, but I am not sure what.
Hi cmccable,
I would have hoped that by now you would know that if you can't decipher my code and you tell me you can't understand it, I would be happy to supply a commented version....
Yes, if you do not assign any value to o[key] , lines for keys that are not found in file2 will be printed as empty lines. If you want unmatched lines to be left unchanged, then o[i[line#] = key] must be set to the original input line (i.e., $0 ) from file1 .
With such abbreviated data from file1 and no sample file2 data, I can only make wild guesses. But, I think you're getting close. Try:
or, slightly more compactly, but with commentary added:
awk '
BEGIN { FS = OFS = "\t"
}
NR == 1 {
# Save the output file pathname from the pathname naming the 1st input
# file.
outfile = FILENAME
}
FNR == NR {
# For each line in the 1st input file set i[line#] to the three fields
# in the 1st input file (and keep a count of the number of lines in
# that file (ic) that make up that key, and set o[key] to the entire
# input line.
o[i[++ic] = $1 OFS $2 OFS $3] = $0
next # This line was missing in my original script, but since there
# was only one field in file1 and there were no empty lines in
# file1, it didn't affect the final output. With the new input
# file formats, this line must be included to guarantee correct
# results.
}
{ # Set key to the key fields from file2. If this key is present in the
# o[] array, set o[key] to the desired otuput for this key.
if((key = $2 OFS $4 OFS $5) in o)
o[key] = key OFS $50 OFS $51 OFS $52 OFS $53
}
END { # Now that we have processed all of the data in both input files,
# overwrite the 1st input file with the desired output replacing each
# line in that file with its original contents (if there was no match)
# or the collected fields from the 2nd input file (if there was a
# match).
for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' file1 file2
since we have no file2 to use as sample data, both of these are obviously untested. I know that all of the tabs I have in my code don't matter to awk , but for my sanity (and for the mental health of anyone who may try to decipher this code later), please do not remove them.