How to get the lines unique to file1 (not in file2) and save the output?

Please help,

file1.txt

1
2
3
4
5

file2.txt

3
4
5
6
7

All I need is this result.txt

1
2

The code below works with only a few records, but the problem is that when I run it on a large file of about 2 million records, the result is the same as file1:

1
2
3
4
5
This is the command I ran:

awk 'NR==FNR{a[$0];next}!($0 in a)' file1.txt file2.txt | xargs echo > result.txt

Question:
Why does it work with a few records but not with more than 2 million records?
Is there a timeout set on the command?
What could be the solution to this?

Thank you so much.

If this is what you want, then you have to compare file2.txt against file1.txt (i.e., swap the file order in your awk command).

Also, if you have only one field in each record, it is better to use $1 instead of $0 to avoid problems with trailing blanks, if any:

awk 'NR==FNR{a[$1];next}!($1 in a)' file2.txt file1.txt
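With the sample data above, that prints 1 and 2; to save the output (a sketch, keeping the result.txt name from your post and using a plain redirection):

awk 'NR==FNR{a[$1];next}!($1 in a)' file2.txt file1.txt > result.txt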

Also note that awk does have a memory limitation, so it might throw an error when it exceeds the limit.

Simply use comm (be sure that both files are sorted; otherwise comm produces wrong output):

comm -23 file1.txt file2.txt

Hi, thanks for the reply, but the result is still the same. I tried this command:

comm -2 -3 <(sort file1.txt) <(sort file2.txt) | xargs > result.txt

How do I set a timeout limit on this command? Or is there any other way to solve my problem?

Thanks so much,

Did you try this?

sort file1.txt > file1
sort file2.txt > file2
comm -23 file1 file2 > result

Hi doganaym,

Yes, but it is still the same. :frowning:

Thanks,

---------- Post updated at 06:48 PM ---------- Previous update was at 02:43 PM ----------

Can anyone help me with this?

Thank you so much,
Richie

Did this help?

sort file1 file2 | uniq -u

Thanks for the reply,

Yes, but the result I need is the lines unique to file1 that are not in file2.

Sometimes you need "export LC_ALL=C" to get sort to do binary order for comm.
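For example, combined with the comm steps above (a sketch; same file names as earlier in the thread):

export LC_ALL=C
sort file1.txt > file1
sort file2.txt > file2
comm -23 file1 file2 > result.txt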

richmac,
check this out, with the above data,

grep -v -w -f file2 file1
1
2

Enjoy..

comm and sort are stable on large data, whereas grep gets slower as it stores more lines and may blow up if it hits a 4 GB address limit while loading file2 into virtual memory. grep also has to check for regex metacharacters at some point, which is wasted effort on pure data, if not a threat to data integrity; fgrep / grep -F is faster and more data-stable.
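For instance, the fixed-string variant of the earlier grep suggestion would be (a sketch, assuming GNU grep, where -F can be combined with -v, -w and -f):

grep -F -v -w -f file2 file1 > result.txt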

Awk and bash can do a hash search, which does not have speed problems with large files and can save the sorting step, but it still has to put file2 into virtual memory.
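A minimal sketch of the bash hash-search idea (assuming bash 4+ for associative arrays; the array name "seen" is just illustrative):

# load every line of file2 as a key in an associative array
declare -A seen
while IFS= read -r line; do
  seen["$line"]=1
done < file2

# print only the file1 lines whose key was never stored
while IFS= read -r line; do
  [[ -z ${seen["$line"]+x} ]] && printf '%s\n' "$line"
done < file1 > result.txt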

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" -h

or

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" --speed-large-files

Since diff does not assume the files are ordered, it will search around for missing lines, even if half-heartedly, which might not scale well performance-wise. It should be durable with large files, though.

Hi All,

I tried

grep -v -w -f file2 file1 | xargs > result.txt

The problem is that it has already been 2 hours since I started running that command, because, as I said, it is 2 million records, and it still has not finished.

I also tried

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" -h

or

diff file1 file2 --old-line-format="%L" --new-line-format="" --unchanged-line-format="" --speed-large-files

But unfortunately the problem still exists; the result is still the same as file1.
:frowning:

Thanks so much for the reply. But still no luck. :frowning:

PS: comm expects unique lines, too.

comm -23 <(LC_ALL=C sort -u file1) <(LC_ALL=C sort -u file2)

Hi DGPickett,

Still no success.

Thanks,

>> because, as I said, it is 2 million records.
2 million records are going to take time, depending on the horsepower of the system, unless you figure out better code that may speed it up.

My server specs are 4 cores and 16 GB of memory. Yes, that's why I opened this up on the forum,

Thanks so much rveri.