How to get the lines unique to file1 (not in file2) and save the output?

Please help,

file1.txt

1
2
3
4
5

file2.txt

3
4
5
6
7

All I need is this result.txt

1
2

The code below works with only a few records, but the problem is that when I run it on a large file of about 2 million records, the result is the same as file1:

1
2
3
4
5
This is the command I ran:

awk 'NR==FNR{a[$0];next}!($0 in a)' file1.txt file2.txt | xargs echo > result.txt

Question:
Why does it work with a few records but not with more than 2 million records?
Is there a timeout set on the command?
What could be the solution to this?

Thank you so much.

If this is what you want, then you have to compare file2.txt against file1.txt (i.e., swap the file order in your awk command).

Also, if you have only one field in each record, it is better to use $1 instead of $0 to avoid problems with trailing blanks, if any:

awk 'NR==FNR{a[$1];next}!($1 in a)' file2.txt file1.txt
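With the sample data above, that prints 1 and 2; to save the output (a sketch, keeping the result.txt name from your post and using a plain redirection):

awk 'NR==FNR{a[$1];next}!($1 in a)' file2.txt file1.txt > result.txt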

Also note that awk does have a memory limitation, so it might throw an error when it exceeds the limit.

Simply use comm (be sure that both files are sorted; otherwise comm produces wrong output):

comm -23 file1.txt file2.txt

Hi, thanks for the reply, but the result is still the same. I tried this command:

comm -2 -3 <(sort file1.txt) <(sort file2.txt) | xargs > result.txt

How do I set a timeout limit on this command? Or is there any other way to solve my problem?

Thanks so much,

Did you try this?

sort file1.txt > file1
sort file2.txt > file2
comm -23 file1 file2 > result

Hi doganaym,

Yes, but it is still the same. :frowning:

Thanks,

---------- Post updated at 06:48 PM ---------- Previous update was at 02:43 PM ----------

Can anyone help me with this?

Thank you so much,
Richie

Did this help?

sort file1 file2 | uniq -u

Thanks for the reply,

Yes, but the result I need is the lines unique to file1 that are not in file2.

Sometimes you need "export LC_ALL=C" to get sort to do binary order for comm.
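For example, combined with the comm steps above (a sketch; same file names as earlier in the thread):

export LC_ALL=C
sort file1.txt > file1
sort file2.txt > file2
comm -23 file1 file2 > result.txt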

richmac,
check this out, with the above data,

grep -v -w -f file2 file1
1
2

Enjoy..

comm and sort are stable on large data, whereas grep gets slower as it stores more lines and may blow up if it hits a 4 GB address limit while loading file2 into virtual memory. grep also has to check for regex metacharacters at some point, which is wasted effort on pure data, if not a threat to data integrity; fgrep / grep -F is faster and more data-stable.
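For instance, the fixed-string variant of the earlier grep suggestion would be (a sketch, assuming GNU grep, where -F can be combined with -v, -w and -f):

grep -F -v -w -f file2 file1 > result.txt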

Awk and bash can do a hash search, which does not have speed problems with large files and can save the sorting step, but it still has to put file2 into virtual memory.
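A minimal sketch of the bash hash-search idea (assuming bash 4+ for associative arrays; the array name "seen" is just illustrative):

# load every line of file2 as a key in an associative array
declare -A seen
while IFS= read -r line; do
  seen["$line"]=1
done < file2

# print only the file1 lines whose key was never stored
while IFS= read -r line; do
  [[ -z ${seen["$line"]+x} ]] && printf '%s\n' "$line"
done < file1 > result.txt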

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" -h

or

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" --speed-large-files

Since diff does not assume the files are ordered, it will search around for missing lines, even if half-heartedly, which might not scale well performance-wise. It should be durable with large files, though.

Hi All,

I tried

grep -v -w -f file2 file1 | xargs > result.txt

The problem is that it has already been 2 hours since I started running that command, because, as I said, it is 2 million records, and it still has not finished.

I also tried

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" -h

or

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" --speed-large-files

But unfortunately the problem still exists; the result is still the same as file1.
:frowning:

Thanks so much for the reply. But still no luck. :frowning:

PS: comm expects unique lines, too.

comm -23 <(LC_ALL=C sort -u file1) <(LC_ALL=C sort -u file2)

Hi DGPickett,

Still no success.

Thanks,

>> because, as I said, it is 2 million records.
2 million records are going to take time, depending on the horsepower of the system, unless you figure out better code that may speed it up.

My server specs are 4 cores and 16 GB of memory. Yes, that's why I opened this up on the forum,

Thanks so much rveri.