Script to find NOT common strings in two files

hnux · March 22, 2011, 3:05pm

Hi all,

I'd like your help or advice on the following:

I have two files, file1 and file2, which have some information in common. The contents of file1 are a subset of the contents of file2:

file1:
errormsgadmin
esdp
esgservices
esignipa
iprice
ipvpn
irm
ishare
james
jasper
jcms

file2:
esolutions
mboards
metric
metrics
mib
mms
errormsgadmin
esdp
esgservices
esign
esolutions
esolutions
ess
ewebsync
framework
gchwf
global

I would like to write a script that displays everything that is in file2 but not in file1. In other words, it should remove from file2 every entry that also appears in file1.

The output would look like this:

esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

Any suggestion is welcome, thank you.

bartus11 · March 22, 2011, 3:17pm

Try:

comm -13 <(sort file1) <(sort file2) | uniq

methyl · March 22, 2011, 3:25pm

I can nearly get your sample output, despite the duplicate records in file2. I had to clean up file2 first.

sort <file2 | uniq > file3
grep -F -v -f file1 file3

esign
esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

I get a record called "esign" in my output.
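One caveat with grep -F -f: without -x the patterns match as substrings, so a short name in file1 would also remove longer names in file2 that contain it. That does not bite with the data posted here (the longer name "esignipa" is in file1, not file2), but the reverse case would. A minimal sketch, assuming made-up sample entries and a throwaway temp directory (neither is from the thread):

```shell
#!/bin/sh
# Illustrative data only: "esign" in the exclude list is a substring of "esignipa".
tmp=$(mktemp -d)
printf '%s\n' esign > "$tmp/exclude"
printf '%s\n' esign esignipa ess > "$tmp/all"

# Substring match: "esignipa" is dropped as well, because it contains "esign".
loose=$(grep -F -v -f "$tmp/exclude" "$tmp/all")

# -x requires the whole line to match, so only the exact entry "esign" is dropped.
exact=$(grep -F -x -v -f "$tmp/exclude" "$tmp/all")

printf 'without -x:\n%s\nwith -x:\n%s\n' "$loose" "$exact"
rm -rf "$tmp"
```

So "grep -F -x -v -f file1 file3" is probably the safer form when the entries are whole lines.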

Shell_Life · March 22, 2011, 3:40pm

diff file1 file2

Both files must be sorted.
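Note that plain diff output contains change markers and lines from both files, so it needs filtering to match the requested output. A minimal sketch, assuming the default ("normal") diff output format, in which lines present only in the second file are prefixed with "> " (the temp directory and the shortened sample entries are illustrative):

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' errormsgadmin esdp esgservices esignipa > "$tmp/file1"
printf '%s\n' esolutions mboards errormsgadmin esdp esign esolutions > "$tmp/file2"

# Sort and de-duplicate both inputs first, as diff compares line by line in order.
sort -u "$tmp/file1" > "$tmp/s1"
sort -u "$tmp/file2" > "$tmp/s2"

# Keep only the "> " lines (present in file2 only) and strip the marker.
only_in_file2=$(diff "$tmp/s1" "$tmp/s2" | sed -n 's/^> //p')

printf '%s\n' "$only_in_file2"
rm -rf "$tmp"
```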

pravin27 · March 22, 2011, 3:49pm

OR

awk 'NR==FNR{a[$1]++;next} !a[$1]' file1 file2 | sort | uniq
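If the original order of file2 matters, the duplicates can be dropped inside awk instead of piping through sort and uniq. A minimal sketch of that variation, assuming whole-line comparison ($0 rather than $1); the array names skip and seen and the shortened sample data are illustrative:

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' errormsgadmin esdp esgservices esignipa > "$tmp/file1"
printf '%s\n' esolutions mboards errormsgadmin esdp esign esolutions ess > "$tmp/file2"

# While reading file1 (NR==FNR), remember each line.
# While reading file2, print a line only if it was not in file1
# and has not already been printed.
result=$(awk 'NR==FNR { skip[$0]; next } !($0 in skip) && !seen[$0]++' \
    "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The output keeps the order of first appearance in file2, so it is not sorted.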

hnux · March 22, 2011, 3:52pm

It does not work as expected.

bartus11 · March 22, 2011, 4:01pm

[root@linux ~]# cat file1
errormsgadmin
esdp
esgservices
esignipa
iprice
ipvpn
irm
ishare
james
jasper
jcms
[root@linux ~]# cat file2
esolutions
mboards
metric
metrics
mib
mms
errormsgadmin
esdp
esgservices
esign
esolutions
esolutions
ess
ewebsync
framework
gchwf
global
[root@linux ~]# comm -13 <(sort file1) <(sort file2) | uniq
esign
esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

What output are you getting? What operating system are you using?

methyl · March 22, 2011, 4:05pm

@bartus11
What Operating System and Shell are you using?
Ps. I'm glad someone else gets the extra record in their output!

bartus11 · March 22, 2011, 4:11pm

Oracle Enterprise Linux 5.4 and bash.

jim_mcnamara · March 22, 2011, 4:14pm

Oracle Enterprise Linux == RHEL, correct?

cgkmal · March 22, 2011, 4:45pm

Hi hnux,

Maybe:

nawk 'NR==FNR{a[$1]++}{if(!a[$1]) b[$0]=$0}END{for (c in b) {print c}}' file1 file2
ewebsync
global
metrics
mms
framework
ess
esign
mib
gchwf
mboards
esolutions
metric

Or printing alphabetically:

nawk 'NR==FNR{a[$1]++}{if(!a[$1]) b[$0]=$0}END{for (c in b) {print c}}' file1 file2 | sort
esign
esolutions
ess
ewebsync
framework
gchwf
global
mboards
metric
metrics
mib
mms

Hope it helps.

DGPickett · March 22, 2011, 4:46pm

Close, comm needs unique:

comm -13 <(sort -u file1) <(sort -u file2)

Of course, the <(...) only works on Solaris and similar LINUX/UNIX flavors that have /dev/fd/#, which is a real shame, because I love it! You can use a named pipe ("mknod <path> p") or maybe coshell.
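For shells without <(...), the named pipe approach mentioned above can be sketched roughly as follows. This assumes mkfifo and mktemp are available (mkfifo is the usual front end to the mknod call); the pipe names p1 and p2, the temp directory and the shortened sample data are illustrative:

```shell
#!/bin/sh
# Use one collation for both sort and comm so they agree on ordering.
LC_ALL=C
export LC_ALL

tmp=$(mktemp -d)
# Remove the pipes and sample files when the script exits.
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' errormsgadmin esdp esgservices esignipa > "$tmp/file1"
printf '%s\n' esolutions mboards errormsgadmin esdp esign esolutions > "$tmp/file2"

mkfifo "$tmp/p1" "$tmp/p2"

# Each sort runs in the background and blocks until comm opens its pipe.
sort -u "$tmp/file1" > "$tmp/p1" &
sort -u "$tmp/file2" > "$tmp/p2" &

# -13 suppresses "only in file1" and "in both", leaving "only in file2".
result=$(comm -13 "$tmp/p1" "$tmp/p2")
wait

printf '%s\n' "$result"
```

The trap takes care of the cleanup that scripted named pipes otherwise leave to the caller.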

bartus11 · March 22, 2011, 4:48pm

It was for some time; recently OEL started to go its own way - Oracle Linux 6 and the Red Hat compatible kernel - c0t0d0s0.org

hnux · March 22, 2011, 5:45pm

Thank you, it works well.

---------- Post updated at 05:45 PM ---------- Previous update was at 05:38 PM ----------

I'm using SunOS 5.10.

I tried again and it works correctly.

Thank you very much for your help.

cfajohnson · March 22, 2011, 8:26pm

It works on any system with bash installed. It is a feature of bash, not the system.

DGPickett · March 24, 2011, 10:03am

I wonder how bash does that! For ksh users, it comes and goes, and if it was easy to have all the time, I'd think David K would go that way. Using truss/tusc/strace, I see bash is managing named pipes for this (and too many /, no pipe cleanup? -- I just emailed the bash devs and DGK.):

$ truss -faelpo /tmp/bash.tr bash -c 'comm -13 <(sort -u .profile) <(sort -u .profile)'
$ view /tmp/bash.tr
 .
 .
 .
[12884]{17507} lstat64("/var/tmp//sh-np-1300956853", 0x7b0f6418) ERR#2 ENOENT
[12884]{17507} mknod("/var/tmp//sh-np-1300956853", S_IFIFO|0600, 0 0x000000) = 0
 .
 .
 .
[12884]{17507} execve(0x4003efc8, 0x4003ec68, 0x4003e008)  [entry]
                              argv[0] @ 0x4003dda8: "comm"
                              argv[1] @ 0x400212d8: "-13"
                              argv[2] @ 0x4003ef88: "/var/tmp//sh-np-1300956853"
                              argv[3] @ 0x4003edc8: "/var/tmp//sh-np-3600645176"
 .
 .
 .
[12884]{17507} open("/var/tmp//sh-np-1300956853", O_RDONLY|O_LARGEFILE, 0666) = 7
[12885]{17510} open("/var/tmp//sh-np-1300956853", O_WRONLY|O_LARGEFILE, 0166600) = 6
 .
 .
 .
[12885]{17510} dup2(6, 1) ................................ = 1
 
and no cleanup:
 
$ ls -l /var/tmp/sh-np-*
prw-------   1 nbkodln    develop          0 Mar 24 09:52 /var/tmp/sh-np-1300956853
prw-------   1 nbkodln    develop          0 Mar 24 09:47 /var/tmp/sh-np-1300959986
prw-------   1 nbkodln    develop          0 Mar 24 09:46 /var/tmp/sh-np-1300964951
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-1300966577
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-1300973486
prw-------   1 nbkodln    develop          0 Mar 24 09:48 /var/tmp/sh-np-1300985557
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-3600617851
prw-------   1 nbkodln    develop          0 Mar 24 09:46 /var/tmp/sh-np-3600620375
prw-------   1 nbkodln    develop          0 Mar 24 09:47 /var/tmp/sh-np-3600639559
prw-------   1 nbkodln    develop          0 Mar 24 09:52 /var/tmp/sh-np-3600645176
prw-------   1 nbkodln    develop          0 Mar 24 09:45 /var/tmp/sh-np-3600657288
prw-------   1 nbkodln    develop          0 Mar 24 09:48 /var/tmp/sh-np-3600666071
$

I prefer the <() to the >(), as the latter spawns a background job with job id display and such. I am a big fan of pipeline parallelism, low latency through pipes and no scripted temp files to have name collisions and cleanup.

The named pipe is in the middle, nicer than temp files but demanding pre-creation and, hopefully, cleanup. Also, named pipes can persist with a left over process waiting in vain for a partner. They are more appropriate in a service paradigm.

cfajohnson · March 24, 2011, 3:15pm

Sorry, my mistake:

"Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files."
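A quick way to see which of the two mechanisms a given bash build uses is to print the path that <(...) expands to instead of reading from it. A minimal sketch, assuming bash is on the PATH; the variable names path and mechanism are illustrative:

```shell
#!/bin/sh
# <(...) expands to a file name; echo prints that name rather than opening it.
path=$(bash -c 'echo <(true)')

case $path in
    /dev/fd/*) mechanism="/dev/fd" ;;
    *)         mechanism="named pipe (FIFO)" ;;
esac

printf 'process substitution expands to %s (%s)\n' "$path" "$mechanism"
```

On a /dev/fd system the path looks like /dev/fd/63; on a build using FIFOs it is a temp file name such as the sh-np-* paths in the truss output above.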

Chubler_XL · March 24, 2011, 4:50pm

I can confirm, my /tmp directory here has dozens of sh-np-* files (I'm on version 3.2.16(1))

edit:

Yep, found it. You can tell the bash devs in your email if you like:
There is a whole heap of calls to unlink_fifo_list() in execute_cmd.c that need to have the test defined(HAVE_DEV_FD) changed to !defined(HAVE_DEV_FD).

DGPickett · March 25, 2011, 4:46pm

I referred them here -- why copy when you can point, eh?

Really, they also need to kill what has the pipe open, like:

kill -9 $(fuser pipe_path) 2>/dev/null

That might be more cleanup than the /dev/fd/# versions do, but cleanup is good.

Still, it'd be nice if all UNIX had fd in file tree, as I also get lots of mileage out of /dev/stdin, /dev/stderr and /dev/stdout, to get things back onto pipes or into one log with commands that have their heads in the file-no-pipe sand. I guess you can <(cat) or >(cat), but that is a waste and delay.
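As an illustration of the /dev/stdin trick applied to the original question, one of the two comm inputs can come straight off a pipe, which leaves only a single sorted temp file. A minimal sketch, assuming the system provides /dev/stdin (the temp directory and the shortened sample data are illustrative):

```shell
#!/bin/sh
# Use one collation for both sort and comm so they agree on ordering.
LC_ALL=C
export LC_ALL

tmp=$(mktemp -d)
printf '%s\n' errormsgadmin esdp esgservices esignipa > "$tmp/file1"
printf '%s\n' esolutions mboards errormsgadmin esdp esign esolutions > "$tmp/file2"

# file1 is sorted into a regular file; file2 arrives on the pipe and is
# handed to comm by name as /dev/stdin.
sort -u "$tmp/file1" > "$tmp/s1"
result=$(sort -u "$tmp/file2" | comm -13 "$tmp/s1" /dev/stdin)

printf '%s\n' "$result"
rm -rf "$tmp"
```

comm also accepts "-" for standard input, so /dev/stdin is mainly useful with commands that insist on a file name.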