Script to find NOT common strings in two files

Hi all,

I'd like your help or any advice on the following:

I have two files, file1 and file2, which have some entries in common. Part of the contents of file1 also appears in file2:

file1:
errormsgadmin
esdp
esgservices
esignipa
iprice
ipvpn
irm
ishare
james
jasper
jcms

file2:
esolutions
mboards
metric
metrics
mib
mms
errormsgadmin
esdp
esgservices
esign
esolutions
esolutions
ess
ewebsync
framework
gchwf
global

I would like to write a script that displays everything that is in file2 but not in file1. In other words, it should remove from file2 every entry that also appears in file1.

The output would look like this:

esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

Any suggestion is welcome. Thank you.

Try:

comm -13 <(sort file1) <(sort file2) | uniq
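For anyone unfamiliar with comm: it compares two sorted inputs and prints three columns (lines only in the first, lines only in the second, lines in both). The -13 option suppresses columns 1 and 3, leaving what is only in the second input. A minimal sketch on made-up data (a.txt and b.txt are hypothetical names, not the poster's files):

```shell
# Hypothetical scratch data; a.txt and b.txt are illustrative names only.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/a.txt"   # already sorted
printf '%s\n' banana cherry date  > "$dir/b.txt"   # already sorted

# No options: column 1 = only in a.txt, column 2 = only in b.txt, column 3 = in both.
comm "$dir/a.txt" "$dir/b.txt"

# -13 suppresses columns 1 and 3, leaving lines found only in the second file.
only_in_b=$(comm -13 "$dir/a.txt" "$dir/b.txt")
echo "$only_in_b"

rm -rf "$dir"
```

The sketch uses pre-sorted temp files so that it also runs in shells without <(...).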

I can nearly get your sample output. Because file2 contains duplicate records, it has to be cleaned up first:

sort <file2 | uniq > file3
grep -F -v -f file1 file3

esign
esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

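I get an extra record, "esign", in my output. It is in file2 but not in file1 (file1 contains "esignipa"), so it seems to be missing from your sample output.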
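One caveat with the grep approach: without -x, grep -F matches substrings, so an entry such as "metric" in the exclusion list would also drop "metrics". It happens not to bite with the files above, but a sketch of the difference on hypothetical data (exclude.txt and all.txt are made-up names) might look like this:

```shell
# Hypothetical data chosen to show the substring problem.
dir=$(mktemp -d)
printf '%s\n' metric > "$dir/exclude.txt"
printf '%s\n' metric metrics mib > "$dir/all.txt"

# Substring matching: the pattern "metric" also removes "metrics".
loose=$(grep -F -v -f "$dir/exclude.txt" "$dir/all.txt")

# Whole-line matching with -x: only the exact line "metric" is removed.
strict=$(grep -F -x -v -f "$dir/exclude.txt" "$dir/all.txt")

printf 'without -x:\n%s\n' "$loose"
printf 'with -x:\n%s\n' "$strict"

rm -rf "$dir"
```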

diff file1 file2

Both files should be sorted first for the comparison to be meaningful.

OR

awk 'NR==FNR{a[$1]++;next} !a[$1]' file1 file2 | sort | uniq
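In case the awk idiom is unclear: NR==FNR is true only while awk reads the first file, so that block records every line of file1; for the second file, a line is printed only if it was not recorded. As a sketch (not the poster's exact command), the duplicates can also be dropped inside awk, which keeps file2's original order. The array names skip and seen are my own choice, and the data is a shortened version of the sample. Note that the idiom assumes the first file is not empty.

```shell
# Shortened sample data, for illustration only.
dir=$(mktemp -d)
printf '%s\n' errormsgadmin esdp > "$dir/file1"
printf '%s\n' esolutions mms esdp esolutions ess > "$dir/file2"

result=$(awk '
    NR == FNR { skip[$0]; next }      # first file: remember each line
    !($0 in skip) && !seen[$0]++      # second file: print each non-excluded line once
' "$dir/file1" "$dir/file2")
echo "$result"

rm -rf "$dir"
```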

It does not work as expected.

[root@linux ~]# cat file1
errormsgadmin
esdp
esgservices
esignipa
iprice
ipvpn
irm
ishare
james
jasper
jcms
[root@linux ~]# cat file2
esolutions
mboards
metric
metrics
mib
mms
errormsgadmin
esdp
esgservices
esign
esolutions
esolutions
ess
ewebsync
framework
gchwf
global
[root@linux ~]# comm -13 <(sort file1) <(sort file2) | uniq
esign
esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

What output are you getting? What operating system are you using?

@bartus11
What Operating System and Shell are you using?
P.S. I'm glad someone else gets the extra record in their output!

Oracle Enterprise Linux 5.4 and bash.

Oracle Enterprise Linux == RHEL, correct?

Hi hnux,

Maybe:

nawk 'NR==FNR{a[$1]++}{if(!a[$1]) b[$0]=$0}END{for (c in b) {print c}}' file1 file2
ewebsync
global
metrics
mms
framework
ess
esign
mib
gchwf
mboards
esolutions
metric

Or printing alphabetically:

nawk 'NR==FNR{a[$1]++}{if(!a[$1]) b[$0]=$0}END{for (c in b) {print c}}' file1 file2 | sort
esign
esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

Hope it helps.

Close, comm needs unique:

comm -13 <(sort -u file1) <(sort -u file2)

Of course, <(...) only works on Solaris and similar Linux/UNIX flavors that provide /dev/fd/#, which is a real shame, because I love it! Elsewhere you can use a named pipe ("mknod <path> p") or maybe coshell.
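The named pipe route could be sketched roughly as follows. This assumes mkfifo is available (it does the same job as "mknod <path> p"); the names p1 and p2 and the shortened sample data are made up for illustration.

```shell
# Shortened sample data in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' esdp errormsgadmin > "$dir/file1"
printf '%s\n' mms esdp ess mms > "$dir/file2"

# Create the two named pipes (hypothetical names p1 and p2).
mkfifo "$dir/p1" "$dir/p2"

# Each writer blocks until comm opens the matching pipe for reading.
sort -u "$dir/file1" > "$dir/p1" &
sort -u "$dir/file2" > "$dir/p2" &

missing=$(comm -13 "$dir/p1" "$dir/p2")
wait                      # reap the two background sorts
echo "$missing"

# Unlike the shell-managed version, cleanup is up to the script.
rm -rf "$dir"
```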

It was for some time; recently OEL started to go its own way: "Oracle Linux 6 and the Red Hat compatible kernel" - c0t0d0s0.org

Thank you, it works well.

I'm using SunOS 5.10.

I tried again and it works correctly.

Thank you very much for your help.

It works on any system with bash installed. It is a feature of bash, not the system.

I wonder how bash does that! For ksh users, support comes and goes, and if it were easy to provide everywhere, I'd think David Korn would go that way. Using truss/tusc/strace, I see that bash manages named pipes for this (with a doubled / in the path and no pipe cleanup; I just emailed the bash devs and DGK):

$ truss -faelpo /tmp/bash.tr bash -c 'comm -13 <(sort -u .profile) <(sort -u .profile)'
$ view /tmp/bash.tr
 .
 .
 .
[12884]{17507} lstat64("/var/tmp//sh-np-1300956853", 0x7b0f6418) ERR#2 ENOENT
[12884]{17507} mknod("/var/tmp//sh-np-1300956853", S_IFIFO|0600, 0 0x000000) = 0
 .
 .
 .
[12884]{17507} execve(0x4003efc8, 0x4003ec68, 0x4003e008)  [entry]
                              argv[0] @ 0x4003dda8: "comm"
                              argv[1] @ 0x400212d8: "-13"
                              argv[2] @ 0x4003ef88: "/var/tmp//sh-np-1300956853"
                              argv[3] @ 0x4003edc8: "/var/tmp//sh-np-3600645176"
 .
 .
 .
[12884]{17507} open("/var/tmp//sh-np-1300956853", O_RDONLY|O_LARGEFILE, 0666) = 7
[12885]{17510} open("/var/tmp//sh-np-1300956853", O_WRONLY|O_LARGEFILE, 0166600) = 6
 .
 .
 .
[12885]{17510} dup2(6, 1) ................................ = 1
 
and no cleanup:
 
$ ls -l /var/tmp/sh-np-*
prw-------   1 nbkodln    develop          0 Mar 24 09:52 /var/tmp/sh-np-1300956853
prw-------   1 nbkodln    develop          0 Mar 24 09:47 /var/tmp/sh-np-1300959986
prw-------   1 nbkodln    develop          0 Mar 24 09:46 /var/tmp/sh-np-1300964951
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-1300966577
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-1300973486
prw-------   1 nbkodln    develop          0 Mar 24 09:48 /var/tmp/sh-np-1300985557
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-3600617851
prw-------   1 nbkodln    develop          0 Mar 24 09:46 /var/tmp/sh-np-3600620375
prw-------   1 nbkodln    develop          0 Mar 24 09:47 /var/tmp/sh-np-3600639559
prw-------   1 nbkodln    develop          0 Mar 24 09:52 /var/tmp/sh-np-3600645176
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-3600657288
prw-------   1 nbkodln    develop          0 Mar 24 09:48 /var/tmp/sh-np-3600666071
$

I prefer <() to >(), as the latter spawns a background job with a job ID display and such. I am a big fan of pipeline parallelism and low latency through pipes, with no scripted temp files to cause name collisions or need cleanup.

The named pipe sits in the middle: nicer than temp files, but it demands pre-creation and, hopefully, cleanup. Also, named pipes can persist with a leftover process waiting in vain for a partner. They are more appropriate in a service paradigm.

Sorry, my mistake:

"Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files."

I can confirm, my /tmp directory here has dozens of sh-np-* files (I'm on version 3.2.16(1))

edit:

Yep, found it. You can tell the bash devs in your email if you like:
There are a whole heap of calls to unlink_fifo_list() in execute_cmd.c that need the test defined(HAVE_DEV_FD) changed to !defined(HAVE_DEV_FD).

I referred them here -- why copy when you can point, eh?

Really, they also need to kill what has the pipe open, like:

kill -9 $(fuser pipe_path) 2>/dev/null

That might be more cleanup than the /dev/fd/# versions do, but cleanup is good.

Still, it'd be nice if every UNIX had file descriptors in the file tree, as I also get lots of mileage out of /dev/stdin, /dev/stderr and /dev/stdout to get things back onto pipes, or into one log, with commands that have their heads in the file-not-pipe sand. I guess you can use <(cat) or >(cat), but that costs an extra process and adds delay.
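As an illustration of the /dev/stdin trick with the command from this thread, here is a rough sketch. It assumes the system provides /dev/stdin; the file name file1.sorted and the shortened data are made up. (comm itself also accepts "-" for standard input, so this is mainly useful for commands that insist on a file name.)

```shell
# Shortened sample data; file1 is sorted once into a scratch file.
dir=$(mktemp -d)
printf '%s\n' esdp errormsgadmin | sort -u > "$dir/file1.sorted"
printf '%s\n' mms esdp ess mms > "$dir/file2"

# comm wants file names, so hand it /dev/stdin to read from the pipe.
via_stdin=$(sort -u "$dir/file2" | comm -13 "$dir/file1.sorted" /dev/stdin)
echo "$via_stdin"

rm -rf "$dir"
```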