I have two files as below. I need to compare field 2 of file 1 against field 1 of file 2 and field 5 of file 1 against filed 2 of file 2. If both matches , then create a result file 1 with first file data and if not matches , then create file with first fie data. Please help me in achieving this.
Please use code tags - not ICODE tags - for code AND DATA as required by forum rules!
I'm a bit surprised by your result files - Madras Bangalore should show up in yes.txt as it perfectly fits, and Chennai blr should not as the upper case C does not exist in file1.
Trying to adapt your code snippet as far as possible and correcting your input files, I came up with
awk 'NR==FNR {a[$1,$2]; next}
($2 SUBSEP $5 in a) {print $0 > "yes.txt"; next}
{print $0 > "no.txt" }
' file2 file1
cf *.txt
no.txt:
5467 Chennai city 2456 ooty
yes.txt:
1234 Chennai kovai 1256 blr
2331 Madras Coimbatore 3234 Bangalore
Thank you rudiC and Aksay for such a quick reply..
Yes, the files are case sensitive and I did a typo mistake..
Now the actual issue over here is file 1 will have 20 million records and file 2 will have 30 thousand records... So will it cause performance issue if we go with this comparison?
Earlier I have achieved this with two while loops however it took hell lot of time. So now came to awk part.. Actually the if condition will be checking like below..
The suggestion I provided was first posted in your other thread where the example you provided seemed to require case insensitive matching.
Now that you have clearly specified that you want case sensitive matching, RudiC's suggestion seems to be what you need (assuming you don't really want a space in the filename no.txt ). Although the sample code you provided used ~ for matching instead of testing for equality; so I'm not sure what you mean by match. If file2 contains the line:
York oo
should that match the line:
123 Yorkshire x y boot
in file1 in addition to matching the line:
456 York a b oo
? If so, RudiC's suggestion won't work, and anything we come up with will be significantly slower.
You say that you have code that works, but runs too slow; but you haven't shown us any code that would work at all. So, it is hard to guess at how many orders of magnitude faster RudiC's suggestion will run than your current code. If you have a shell script with a nested loop that is executing awk 600 billion times (30,000 * 20,000,000) and you want exact matches on fields (rather than substring matches), I wouldn't be surprised if RudiC's suggestion will run faster by a factor of well over a billion.
Why haven't you tried his code to see how well it works for you???
The elegant solution is to use tools that make delimited files act like SQL tables and do an SQL join.
You can scan one file putting the concatenated keys into an associative array (awk, bash, ksh, etc.), then process the second file looking up the keys in the array. Associative arrays are hash lookup tables, keyed with strings.
A more UNIX flavored solution is to cut out the columns to compare into two files or bash/ksh pipe streams '<(...)', sort them unique and match them with 'comm'. Now you have the key values to filter the input file. Sort the input file on the same columns so matching is a simple merge.
If you create modified input files with the keys copied or moved on the front and then a delimiter and sort them, you can use 'join'. Since join only handles one join column, you need a different delimiter to make it join on multiple columns. If concatenating columns, make sure each column has values formatted all to the same length.
Above doesn't work. Neither awk nor shell have "and" or "or". Replace by "&&" and "||" and try again. Why do you "need to impelement the if" ? Was your specification not describing the whole picture?
RudiC gave you code that you say works. Now you want us to redo that working code to use code that is incomplete and contains many syntax errors.
WHY do you NEED to change what RudiC suggested?
Is this a homework assignment?
You implied earlier that you had code using nested loops that gives you the correct results but is too slow to work on your actual data. Please show us your working code!
And, please clarify what you mean by match. Do you mean the two strings being matched are equal? Or, do you mean that a string in file1 matches a string in file2 if the string in file1 is contained in the string in file2 ?