Retrieving the entries in a main file that match keys from a search file

csim_mohan · July 31, 2014, 1:37pm

I have two files:
file 1:

hello.com    neo.com,japan.com,example.com
news.net    xyz.com, telecom.net, highlands.net, software.com
example2.com    earth.net, abc.gov.uk

file 2:

neo.com
example.com
abc.gov.uk

File 2 holds the search keys to look for in file 1. If any search key is found in a row of file 1, the output should contain that row with only the matching keys; otherwise it should contain just the first column of that row, like this:

hello.com    neo.com, example.com
news.net
example2.com    abc.gov.uk

I tried this:

awk 'NR==FNR{a[NR]=$0;next}      {l=$1;for(x in a)if($0~a[x]){l=$0;break}print l}' file2 file1

it gave me the result like this:

hello.com   neo.com,japan.com,example.com 
news.net 
example2.com    earth.net, abc.gov.uk

Any idea how to fix this? File 1 is huge, about 2 GB, so what would be an efficient way to write this?

Corona688 · July 31, 2014, 1:45pm

Checking the entire array for the right key is kind of overkill; awk can do that automatically if you store it correctly. Use "hello.com" as the index, not the entire line, and ("hello.com" in A) will evaluate to true or false depending on whether that key was stored.
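To make the index idea concrete, here is a minimal sketch. The function name key_lookup, the file names keys.txt and names.txt, and the test value neoXcom are all made up for illustration; only the `in` operator and the NR==FNR idiom come from the thread.

```shell
# key_lookup is a hypothetical name; it builds two tiny demo files and compares them.
key_lookup() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' neo.com example.com abc.gov.uk > "$dir/keys.txt"
    printf '%s\n' neo.com neoXcom abc.gov.uk > "$dir/names.txt"
    awk '
    NR == FNR { seen[$1]; next }    # first file: merely referencing seen[$1] creates the key
    { print $1, (($1 in seen) ? "found" : "missing") }   # second file: one lookup per line, no loop
    ' "$dir/keys.txt" "$dir/names.txt"
    rm -rf "$dir"
}
key_lookup
```

Note that `in` compares whole strings, so neoXcom is reported as missing. A regex test such as $0 ~ "neo.com" would accept it, because an unescaped dot matches any character.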

You don't want to store the entire 2GB file in memory... I'm guessing file2 is the smaller one, let's store that.

awk 'NR==FNR { A[$1] ; next } $1 in A { print A[$1] }' file2 file1

If this doesn't work, your input data may not be quite what it looks like. Already I can see that it has inconsistent spacing all over.
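One way to check what the separators really are is to make the whitespace visible. This is only a sketch: show_whitespace is a hypothetical name and the sample line is made up, but the `l` command of sed is standard and prints tabs as \t and marks each line end with $.

```shell
# show_whitespace is a hypothetical name; the input line is invented for the demo.
show_whitespace() {
    # One tab after column 1, one space after the comma.
    printf 'news.net\txyz.com, telecom.net\n' | sed -n l
}
show_whitespace
```

On real data something like `sed -n l file1 | head` should be enough to tell tabs from runs of spaces.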

csim_mohan · July 31, 2014, 2:04pm

column 1 and column 2 are tab separated. The code you provided doesn't produce any output.

Corona688 · July 31, 2014, 2:09pm

Sorry, I made a mistake.

I should point out, you gave me example data which produces no output, though! None of the headers in file2 match file1.

awk 'NR==FNR { A[$1] ; next } $1 in A' file2 file1

csim_mohan · July 31, 2014, 2:56pm

If you look for the file2 search keys in column 2 of file1, you will see the matches. I need to retrieve those.

Don_Cragun · July 31, 2014, 4:04pm

Note that neither the sample input nor the sample output you provided contained any tab characters. The following will work with fields separated by any combination of one or more commas, spaces, and tabs and produce output that separates the 1st two columns of the output with a tab and separates subsequent fields with a comma followed by a space.

awk -F '[ \t,]' '
FNR == NR {
	a[$1]
	next
}
{	o = $1
	c = 0
	for(i = 2; i <= NF; i++)
		if($i in a)
			o = o (c++ ? ", " : "\t") $i
	print o
}' file2 file1

With the sample input files you provided, it produces the output:

hello.com	neo.com, example.com
news.net
example2.com	abc.gov.uk

Note that if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
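If the real data is strictly tab-separated between column 1 and column 2, as stated earlier in the thread, a variant can split only the second column on commas. This is a sketch under that assumption, not a replacement for the script above; filter_keys is a hypothetical helper name. Only the keys are held in memory, and the large file is read one line at a time.

```shell
# filter_keys is a hypothetical helper; usage: filter_keys KEYFILE DATAFILE
filter_keys() {
    awk -F '\t' '
    FNR == NR { keys[$1]; next }            # first file: remember each search key
    {
        out = $1                            # always keep column 1
        sep = "\t"                          # tab before the first match
        n = split($2, item, /[ ]*,[ ]*/)    # column 2: split on commas, ignoring spaces around them
        for (i = 1; i <= n; i++)
            if (item[i] in keys) {
                out = out sep item[i]
                sep = ", "                  # comma and space before later matches
            }
        print out
    }' "$1" "$2"
}

# Demo with the sample rows from the question, written with real tabs.
dir=$(mktemp -d)
printf '%s\n' neo.com example.com abc.gov.uk > "$dir/file2"
printf '%s\t%s\n' \
    hello.com 'neo.com,japan.com,example.com' \
    news.net 'xyz.com, telecom.net, highlands.net, software.com' \
    example2.com 'earth.net, abc.gov.uk' > "$dir/file1"
filter_keys "$dir/file2" "$dir/file1"
rm -rf "$dir"
```

Because the field separator is a tab, a key in file2 is taken as the whole line, so trailing spaces in file2 would prevent a match.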

Corona688 · July 31, 2014, 4:28pm

I see what you mean now. You don't want to match the first column, but any of the columns after.

The inconsistent format of the data you posted becomes a problem again, then. Is it tab-separated, comma-separated, space-separated, or -- as it seems here -- all three?

csim_mohan · July 31, 2014, 6:21pm

@Don Cragun it worked like a charm. Thank you very much for such a clear explanation with the code, amazing!