Comparing 2 huge text files

I have these 2 files:

k5login

sanwar@systems.nyfix.com
jjamnik@systems.nyfix.com
nisha@SYSTEMS.NYFIX.COM
rdpena@SYSTEMS.NYFIX.COM
service/backups-ora@SYSTEMS.NYFIX.COM
ivanr@SYSTEMS.NYFIX.COM
nasapova@SYSTEMS.NYFIX.COM
tpulay@SYSTEMS.NYFIX.COM
rsueno@SYSTEMS.NYFIX.COM
peterd@SYSTEMS.NYFIX.COM
casehan@SYSTEMS.NYFIX.COM
akrapivi@SYSTEMS.NYFIX.COM
....

access.ldif

dn: uid=jjamnik,ou=People,dc=prod,dc=nyfix,dc=com
cn: Josh Jamnik
objectClass: account
objectClass: posixAccount
objectClass: top
userPassword:: e0tFUkJFUk9TfWpqYW1uaWtAU1lTVEVNUy5OWUZJWC5DT00=
loginShell: /bin/bash
gidNumber: 409
gecos: Josh Jamnik
structuralObjectClass: account
entryUUID: b8a40bfa-3056-102a-89a3-93bef80cd7c4
creatorsName: cn=Manager,dc=prod,dc=nyfix,dc=com
createTimestamp: 20060212210311Z
uid: jjamnik
uidNumber: 6503
homeDirectory: /home/prodbus/jjamnik
entryCSN: 20090819203245Z#00000e#00#000000
modifiersName: cn=Manager,dc=prod,dc=nyfix,dc=com
modifyTimestamp: 20090819203245Z

dn: uid=nishap,ou=People,dc=prod,dc=nyfix,dc=com
cn: Nisha Patel
objectClass: account
objectClass: posixAccount
objectClass: top
userPassword:: e0tFUkJFUk9TfW5pc2hhcEBTWVNURU1TLk5ZRklYLkNPTQ==
loginShell: /bin/bash
gidNumber: 409
gecos: Nisha Patel
structuralObjectClass: account
entryUUID: 874cc37a-3057-102a-89a9-93bef80cd7c4
creatorsName: cn=Manager,dc=prod,dc=nyfix,dc=com
createTimestamp: 20060212210858Z
uid: nishap
uidNumber: 6506
homeDirectory: /home/prodeng/nishap

dn: uid=sanwar,ou=People,dc=prod,dc=nyfix,dc=com
cn: Sohel Anwar
objectClass: account
objectClass: posixAccount
objectClass: top
userPassword:: e0tFUkJFUk9TfXNhbndhckBTWVNURU1TLk5ZRklYLkNPTQ==
loginShell: /bin/bash
uidNumber: 6514
gecos: Sohel Anwar
structuralObjectClass: account
entryUUID: 1078797a-305b-102a-89bb-93bef80cd7c4
creatorsName: cn=Manager,dc=prod,dc=nyfix,dc=com
createTimestamp: 20060212213417Z
uid: sanwar
gidNumber: 410
homeDirectory: /home/network/sanwar
entryCSN: 20090610030006Z#000000#00#000000
modifiersName: cn=Manager,dc=prod,dc=nyfix,dc=com
modifyTimestamp: 20090610030006Z

I need to compare k5login against the access.ldif file. The output should list the entries whose uid does not exist in access.ldif, like this:

rdpena@SYSTEMS.NYFIX.COM
service/backups-ora@SYSTEMS.NYFIX.COM
ivanr@SYSTEMS.NYFIX.COM
nasapova@SYSTEMS.NYFIX.COM
tpulay@SYSTEMS.NYFIX.COM
rsueno@SYSTEMS.NYFIX.COM
peterd@SYSTEMS.NYFIX.COM
casehan@SYSTEMS.NYFIX.COM
akrapivi@SYSTEMS.NYFIX.COM

I hope someone can come up with a solution for this.

:slight_smile:

awk 'NR==FNR{sub("@.*","");a[$1];next}/^uid:/&&!($2 in a)' k5login access.ldif

... oops, no, that one displays the entries from access.ldif that do not exist in k5login ...

... here you go, this gets the entries from k5login that don't exist in access.ldif:

awk 'NR==FNR{if (/^uid:/) a[$2];next}{sub("@.*","");if(!($1 in a)) print $1}' access.ldif k5login

If you need to display the full mail address :

nawk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' access.ldif k5login

If you are running on a SunOS / Solaris platform, use nawk or /usr/xpg4/bin/awk instead of awk.

... note that your example contains two different names: nisha (in k5login) and nishap (in access.ldif), so nisha is reported as missing.
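For anyone reading along, here is the same one-liner spread over several lines with comments. It is only a sketch: the two sample files are cut-down, made-up copies placed in a scratch directory (the `tmp` and `missing` names are just for illustration), and plain `awk` is assumed to be a POSIX-style awk (use nawk or /usr/xpg4/bin/awk on Solaris).

```shell
# Scratch copies of the two files, trimmed to a few lines (illustrative data).
tmp=$(mktemp -d)
cat > "$tmp/access.ldif" <<'EOF'
dn: uid=jjamnik,ou=People,dc=prod,dc=nyfix,dc=com
uid: jjamnik

dn: uid=nishap,ou=People,dc=prod,dc=nyfix,dc=com
uid: nishap

dn: uid=sanwar,ou=People,dc=prod,dc=nyfix,dc=com
uid: sanwar
EOF
cat > "$tmp/k5login" <<'EOF'
sanwar@systems.nyfix.com
jjamnik@systems.nyfix.com
nisha@SYSTEMS.NYFIX.COM
rdpena@SYSTEMS.NYFIX.COM
EOF

missing=$(awk '
    # First file (access.ldif): NR == FNR is only true while reading it.
    # Remember the value of every line that starts with "uid:".
    NR == FNR { if (/^uid:/) seen[$2]; next }

    # Second file (k5login): strip "@REALM" from a copy of the line and
    # print the untouched line when that uid was never seen.
    {
        x = $0
        sub("@.*", "", x)
        if (!(x in seen)) print $0
    }
' "$tmp/access.ldif" "$tmp/k5login")

printf '%s\n' "$missing"
rm -rf "$tmp"
```

With this sample data the lines for nisha and rdpena are printed, because neither uid appears on a `uid:` line of the ldif file.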

Thanks for your reply, but I'm getting only one result. Can we use awk?

(csi15,root)# nawk 'NR==FNR{sub("@.*","");a[$1];next}/^uid:/&&!($2 in a)' k5login access.ldif
uid: nishap

I'm getting this syntax error:

(csi15,root)# pwd
/usr/xpg4/bin
(csi15,root)# awk 'NR==FNR{sub("@.*","");a[$1];next}/^uid:/&&!($2 in a)' k5login access.ldif
awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: bailing out near line 1

When you change into that directory, you have to put ./ in front of the command to execute the file from the directory you are in. Otherwise the shell just runs whichever awk it finds first via your PATH variable. Also, ctsgnb said to use nawk; I'm not sure whether there is an awk link or binary in that directory.
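A small demonstration of that PATH behaviour, as a sketch only: it creates a throwaway directory containing a fake script named `awk` (purely hypothetical, it just prints a message) and assumes that "." is not listed in your PATH and that the temporary directory allows execution.

```shell
# Throwaway directory with a fake "awk" script inside it.
tmp=$(mktemp -d)
printf '#!/bin/sh\necho "local copy"\n' > "$tmp/awk"
chmod +x "$tmp/awk"

cd "$tmp" || exit 1
# Bare name: looked up through PATH, so the system awk runs,
# even though a file called awk sits in the current directory.
via_path=$(awk 'BEGIN { print "found via PATH" }')
# Explicit ./ prefix: runs the file in the current directory.
via_dot=$(./awk)
printf '%s\n%s\n' "$via_path" "$via_dot"

cd - >/dev/null
rm -rf "$tmp"
```

The same applies inside /usr/xpg4/bin: either call ./awk there, or stay where your data files are and use the full path /usr/xpg4/bin/awk.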

I updated my previous post,

please try

/usr/xpg4/bin/awk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' access.ldif k5login

or

nawk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' access.ldif k5login

or

awk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' access.ldif k5login

Make sure you are in the directory where your access.ldif and k5login files are located, and use nawk or /usr/xpg4/bin/awk if you are on a SunOS / Solaris platform.

@ctsgnb ... awesome! you make my life easier... life is great ! :b:

I will review my awk, since I have already forgotten it.

thanks!

A simpler solution :wink:

# egrep -v $(sed -n 's/uid: \(.*\)/\1/p' access.ldif |sed ':a;N;s/\n/|/;ta') k5login
nisha@SYSTEMS.NYFIX.COM
rdpena@SYSTEMS.NYFIX.COM
service/backups-ora@SYSTEMS.NYFIX.COM
ivanr@SYSTEMS.NYFIX.COM
nasapova@SYSTEMS.NYFIX.COM
tpulay@SYSTEMS.NYFIX.COM
rsueno@SYSTEMS.NYFIX.COM
peterd@SYSTEMS.NYFIX.COM
casehan@SYSTEMS.NYFIX.COM
akrapivi@SYSTEMS.NYFIX.COM
....

@ygemici

I got the following error :

# nawk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' f2 f1
nisha@SYSTEMS.NYFIX.COM
rdpena@SYSTEMS.NYFIX.COM
service/backups-ora@SYSTEMS.NYFIX.COM
ivanr@SYSTEMS.NYFIX.COM
nasapova@SYSTEMS.NYFIX.COM
tpulay@SYSTEMS.NYFIX.COM
rsueno@SYSTEMS.NYFIX.COM
peterd@SYSTEMS.NYFIX.COM
casehan@SYSTEMS.NYFIX.COM
akrapivi@SYSTEMS.NYFIX.COM
# egrep -v $(sed -n 's/uid: \(.*\)/\1/p' f2 | sed ':a N;s/\n/|/;ta') f1
Label too long: :a N;s/\n/|/;ta
# uname -a
SunOS <anonymized> 5.10 Generic_141414-01 sun4u sparc SUNW,Sun-Fire-V490
#

I get the same error with a semicolon before the N:

# egrep -v $(sed -n 's/uid: \(.*\)/\1/p' f2  |sed ':a;N;s/\n/|/;ta') f1
Label too long: :a;N;s/\n/|/;ta

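Hmm, yes, it is Solaris :rolleyes: Its sed takes everything after the ':' up to the end of the line as the label name, so each command needs its own -e.
I modified it a bit :slight_smile: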

# egrep -v $(sed -n 's/uid: \(.*\)/\1/p' access.ldif |sed -e ':a' -e '$!N;s/\n/|/' -e 'ta') k5login

@ygemici :

Yup, those '-e' options make it work just fine. :slight_smile:

But

1) ... I wonder whether it would still work if the files are huge, so that the uid1|uid2|... string becomes very long.

2) By the way, with overlapping names like nisha and nishap, a grep -v nisha would also filter out nishap, which is not the intended behaviour ...
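One possible way around both points, sketched here with made-up sample data: write one anchored pattern per uid (for example `^nisha@`) into a file and hand that file to grep with -f, so the list never has to fit on a command line and nisha can no longer match nishap. The file name `patterns` is an assumption for illustration. On Solaris the plain /usr/bin/grep may not accept -f, in which case /usr/xpg4/bin/grep would be the one to try. A uid containing regex characters such as "." would still be treated as a pattern.

```shell
# Illustrative data: nisha exists in the ldif file, nishap does not.
tmp=$(mktemp -d)
cat > "$tmp/access.ldif" <<'EOF'
uid: nisha
uid: sanwar
EOF
cat > "$tmp/k5login" <<'EOF'
sanwar@systems.nyfix.com
nisha@SYSTEMS.NYFIX.COM
nishap@SYSTEMS.NYFIX.COM
rdpena@SYSTEMS.NYFIX.COM
EOF

# Turn every "uid: name" line into the anchored pattern "^name@".
sed -n 's/^uid: \(.*\)$/^\1@/p' "$tmp/access.ldif" > "$tmp/patterns"

# Print the k5login lines that match none of the patterns.
result=$(grep -v -f "$tmp/patterns" "$tmp/k5login")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this data nishap and rdpena are printed, while nisha and sanwar are filtered out. The awk version above in the thread does an exact lookup in an array, so it is not affected by either problem.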

hi ctsgnb,

I have an additional requirement, and here it goes:

awk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' access.ldif k5login

From the above script, I need to save the output to a temporary file, let's say k5login-temp, and then compare that file against another ldif file, let's say bny.ldif. I'm having a hard time modifying the script.
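A minimal sketch of that two-step run, assuming bny.ldif uses the same `uid: name` lines as access.ldif (the thread does not show that file). The sample data and the `prog` variable are made up for illustration, and `print $0` is used in place of `print $1`, which is equivalent for these one-word lines.

```shell
# Illustrative data in a scratch directory.
tmp=$(mktemp -d)
printf 'uid: jjamnik\nuid: sanwar\n' > "$tmp/access.ldif"
printf 'uid: rdpena\n' > "$tmp/bny.ldif"
printf '%s\n' sanwar@systems.nyfix.com rdpena@SYSTEMS.NYFIX.COM \
    ivanr@SYSTEMS.NYFIX.COM > "$tmp/k5login"

# The awk program from the thread, kept in a variable so it can be reused.
prog='NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $0}'

# Step 1: entries of k5login missing from access.ldif go to k5login-temp.
awk "$prog" "$tmp/access.ldif" "$tmp/k5login" > "$tmp/k5login-temp"

# Step 2: of those, keep the ones that are also missing from bny.ldif.
still_missing=$(awk "$prog" "$tmp/bny.ldif" "$tmp/k5login-temp")
printf '%s\n' "$still_missing"
rm -rf "$tmp"
```

Here only ivanr is left at the end, since rdpena is found in the second ldif file.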

ah... i think i got it..testing now...

hi Awk Masters,

awk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(!(x in a)) print $1}' access.ldif k5login

Using the script above, instead of printing the output, I would like to automatically delete from the k5login file those entries that do not exist in the ldif file.

Can anyone revise the script above?
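One way this might be done, reading the request as "keep only the k5login lines whose uid is present in the ldif file": flip the test from `!(x in a)` to `(x in a)` and write the result back over k5login, working from a backup copy. This is a sketch with made-up sample data; the `.bak` name is an assumption, and awk cannot safely read and overwrite the same file in one step, hence the copy.

```shell
# Illustrative data in a scratch directory.
tmp=$(mktemp -d)
printf 'uid: jjamnik\nuid: sanwar\n' > "$tmp/access.ldif"
printf '%s\n' sanwar@systems.nyfix.com rdpena@SYSTEMS.NYFIX.COM \
    jjamnik@systems.nyfix.com > "$tmp/k5login"

# Keep a backup, then rebuild k5login from it.
cp "$tmp/k5login" "$tmp/k5login.bak"

# Same program as before, but printing the lines whose uid IS in the ldif.
awk 'NR==FNR{if (/^uid:/) a[$2];next}{x=$0;sub("@.*","",x);if(x in a) print $0}' \
    "$tmp/access.ldif" "$tmp/k5login.bak" > "$tmp/k5login"

kept=$(cat "$tmp/k5login")
printf '%s\n' "$kept"
rm -rf "$tmp"
```

With this data the rdpena line is dropped and the other two remain. It is worth checking the new file against the backup before deleting the backup.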

Thanks in advance