Is there a UNIX command that can compare fields of files with differing number of fields?

Hi,

Below are the sample files. x.txt is from an Excel file that is a list of users from Windows and y.txt is a list of database account.

$ head -500 x.txt y.txt
==> x.txt <==
TEST01  APP_USER_PROFILE
USER03  APP_USER_PROFILE
TEST02  APP_USER_EXP_PROFILE
TEST04  APP_USER_PROFILE
USER01  APP_USER_PROFILE
USER02  APP_USER_EXP_PROFILE
USER04  APP_USER_PROFILE
MINNIE  APP_USER_EXP_PROFILE
XXXX02  APP_USER_PROFILE
XXXX04  APP_USER_PROFILE
ABCD01  APP_USER_PROFILE
ABCD02  APP_USER_PROFILE
TEST03  APP_USER_PROFILE
ABCD03  APP_USER_PROFILE
ZZZZ03  APP_USER_PROFILE
PLUTO   APP_USER_PROFILE
ABCD04  APP_USER_PROFILE

==> y.txt <==
TEST01  APP_USER_PROFILE        LOCKED  2016-08-05=14:00
USER03  APP_USER_PROFILE        OPEN
TEST02  APP_USER_EXP_PROFILE    EXPIRED 2017-01-01=13:00
TEST04  APP_USER_PROFILE        LOCKED  2017-12-25=15:00
USER01  APP_USER_PROFILE        EXPIRED 2017-02-14=14:00
USER02  APP_USER_EXP_PROFILE    LOCKED  2018-12-25=11:00
USER04  APP_USER_PROFILE        OPEN
XXXX01  APP_USER_EXP_PROFILE    LOCKED  2016-04-01=12:00
XXXX02  APP_USER_PROFILE        LOCKED  2019-01-01=01:00
XXXX04  APP_USER_PROFILE        OPEN
ABCD01  APP_USER_PROFILE        OPEN
ABCD02  APP_USER_PROFILE        LOCKED  2019-01-23=15:00
TEST03  APP_USER_PROFILE        OPEN
ABCD03  APP_USER_PROFILE        OPEN
ZZZZ03  APP_USER_PROFILE        OPEN
ABCD04  APP_USER_PROFILE        OPEN
MICKEY  APP_USER_PROFILE        LOCKED  2018-04-01=16:00
DONALD  APP_USER_PROFILE        OPEN

And at the moment I am using the script below to check whether each line of x.txt exist in y.txt or not. That is whether a Windows account in x.txt exist as a database account in y.txt

$ cat x.ksh
#!/bin/ksh
#

search_list="x.txt"
search_from="y.txt"

while read list
do
   if [[ ! -z `grep "^$list" ${search_from}` ]] ; then
      echo "- FOUND ==> `grep "^$list" ${search_from}`"
   else
      echo "- NOT FOUND !!! ${list}"
   fi
done < ${search_list}

Sample run of the script below:

$ ./x.ksh
- FOUND ==> TEST01      APP_USER_PROFILE        LOCKED  2016-08-05=14:00
- FOUND ==> USER03      APP_USER_PROFILE        OPEN
- FOUND ==> TEST02      APP_USER_EXP_PROFILE    EXPIRED 2017-01-01=13:00
- FOUND ==> TEST04      APP_USER_PROFILE        LOCKED  2017-12-25=15:00
- FOUND ==> USER01      APP_USER_PROFILE        EXPIRED 2017-02-14=14:00
- FOUND ==> USER02      APP_USER_EXP_PROFILE    LOCKED  2018-12-25=11:00
- FOUND ==> USER04      APP_USER_PROFILE        OPEN
- NOT FOUND !!! MINNIE  APP_USER_EXP_PROFILE
- FOUND ==> XXXX02      APP_USER_PROFILE        LOCKED  2019-01-01=01:00
- FOUND ==> XXXX04      APP_USER_PROFILE        OPEN
- FOUND ==> ABCD01      APP_USER_PROFILE        OPEN
- FOUND ==> ABCD02      APP_USER_PROFILE        LOCKED  2019-01-23=15:00
- FOUND ==> TEST03      APP_USER_PROFILE        OPEN
- FOUND ==> ABCD03      APP_USER_PROFILE        OPEN
- FOUND ==> ZZZZ03      APP_USER_PROFILE        OPEN
- NOT FOUND !!! PLUTO   APP_USER_PROFILE
- FOUND ==> ABCD04      APP_USER_PROFILE        OPEN

Converting x.txt and y.txt to pipe delimited files and sort | uniq so that they now look as below.

x.txt:
ABCD01|APP_USER_PROFILE
ABCD02|APP_USER_PROFILE
ABCD03|APP_USER_PROFILE
ABCD04|APP_USER_PROFILE
MINNIE|APP_USER_EXP_PROFILE
PLUTO|APP_USER_PROFILE
TEST01|APP_USER_PROFILE
TEST02|APP_USER_EXP_PROFILE
TEST03|APP_USER_PROFILE
TEST04|APP_USER_PROFILE
USER01|APP_USER_PROFILE
USER02|APP_USER_EXP_PROFILE
USER03|APP_USER_PROFILE
USER04|APP_USER_PROFILE
XXXX02|APP_USER_PROFILE
XXXX04|APP_USER_PROFILE
ZZZZ03|APP_USER_PROFILE

y.txt:
ABCD01|APP_USER_PROFILE|OPEN|
ABCD02|APP_USER_PROFILE|LOCKED|2019-01-23=15:00
ABCD03|APP_USER_PROFILE|OPEN|
ABCD04|APP_USER_PROFILE|OPEN|
DONALD|APP_USER_PROFILE|OPEN|
MICKEY|APP_USER_PROFILE|LOCKED|2018-04-01=16:00
TEST01|APP_USER_PROFILE|LOCKED|2016-08-05=14:00
TEST02|APP_USER_EXP_PROFILE|EXPIRED|2017-01-01=13:00
TEST03|APP_USER_PROFILE|OPEN|
TEST04|APP_USER_PROFILE|LOCKED|2017-12-25=15:00
USER01|APP_USER_PROFILE|EXPIRED|2017-02-14=14:00
USER02|APP_USER_EXP_PROFILE|LOCKED|2018-12-25=11:00
USER03|APP_USER_PROFILE|OPEN|
USER04|APP_USER_PROFILE|OPEN|
XXXX01|APP_USER_EXP_PROFILE|LOCKED|2016-04-01=12:00
XXXX02|APP_USER_PROFILE|LOCKED|2019-01-01=01:00
XXXX04|APP_USER_PROFILE|OPEN|
ZZZZ03|APP_USER_PROFILE|OPEN|

Is there a UNIX command of some sort that can be used to maybe do the same as what the script is doing to simply display matching and non-matching first and second field between the two files? I can't use diff because the two files are completely different given they both don't have the same number of fields and will never be.

The script works as I want it to be, it just took a while for where the number of lines is in the thousands instead of a few lines like it is in the sample file.

I thought I saw a question about the same in the forum but can't find it now.

Try

grep -vffile1 file2
XXXX01  APP_USER_EXP_PROFILE    LOCKED  2016-04-01=12:00
MICKEY  APP_USER_PROFILE        LOCKED  2018-04-01=16:00
DONALD  APP_USER_PROFILE        OPEN

and

grep -vf<(cut -d" " -f1-3 file2) file1
MINNIE  APP_USER_EXP_PROFILE
PLUTO   APP_USER_PROFILE

Hi.

Using RudiC's technique for cut , also consider diff and comm :

#!/usr/bin/env bash

# @(#) s1       Demonstrate comparison of selected fields, diff, comm.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C diff comm

pl " Input data files data[12], pasted for brevity:"
paste data[12] | expand -30

pl " Results, diff:"
diff data1 <(cut -d "|" -f1,2 data2)

pl " Results, comm:"
comm -3 data1 <(cut -d "|" -f1,2 data2)

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
bash GNU bash 4.3.30
diff (GNU diffutils) 3.3
comm (GNU coreutils) 8.23

-----
 Input data files data[12], pasted for brevity:
ABCD01|APP_USER_PROFILE       ABCD01|APP_USER_PROFILE|OPEN|
ABCD02|APP_USER_PROFILE       ABCD02|APP_USER_PROFILE|LOCKED|2019-01-23=15:00
ABCD03|APP_USER_PROFILE       ABCD03|APP_USER_PROFILE|OPEN|
ABCD04|APP_USER_PROFILE       ABCD04|APP_USER_PROFILE|OPEN|
MINNIE|APP_USER_EXP_PROFILE   DONALD|APP_USER_PROFILE|OPEN|
PLUTO|APP_USER_PROFILE        MICKEY|APP_USER_PROFILE|LOCKED|2018-04-01=16:00
TEST01|APP_USER_PROFILE       TEST01|APP_USER_PROFILE|LOCKED|2016-08-05=14:00
TEST02|APP_USER_EXP_PROFILE   TEST02|APP_USER_EXP_PROFILE|EXPIRED|2017-01-01=13:00
TEST03|APP_USER_PROFILE       TEST03|APP_USER_PROFILE|OPEN|
TEST04|APP_USER_PROFILE       TEST04|APP_USER_PROFILE|LOCKED|2017-12-25=15:00
USER01|APP_USER_PROFILE       USER01|APP_USER_PROFILE|EXPIRED|2017-02-14=14:00
USER02|APP_USER_EXP_PROFILE   USER02|APP_USER_EXP_PROFILE|LOCKED|2018-12-25=11:00
USER03|APP_USER_PROFILE       USER03|APP_USER_PROFILE|OPEN|
USER04|APP_USER_PROFILE       USER04|APP_USER_PROFILE|OPEN|
XXXX02|APP_USER_PROFILE       XXXX01|APP_USER_EXP_PROFILE|LOCKED|2016-04-01=12:00
XXXX04|APP_USER_PROFILE       XXXX02|APP_USER_PROFILE|LOCKED|2019-01-01=01:00
ZZZZ03|APP_USER_PROFILE       XXXX04|APP_USER_PROFILE|OPEN|
                              ZZZZ03|APP_USER_PROFILE|OPEN|

-----
 Results, diff:
5,6c5,6
< MINNIE|APP_USER_EXP_PROFILE
< PLUTO|APP_USER_PROFILE
---
> DONALD|APP_USER_PROFILE
> MICKEY|APP_USER_PROFILE
14a15
> XXXX01|APP_USER_EXP_PROFILE

-----
 Results, comm:
        DONALD|APP_USER_PROFILE
        MICKEY|APP_USER_PROFILE
MINNIE|APP_USER_EXP_PROFILE
PLUTO|APP_USER_PROFILE
        XXXX01|APP_USER_EXP_PROFILE

Best wishes ... cheers, drl

Note that this way, the spacing needs to be exactly the same for this to work. Also note that had PLUTO been present in y.txt it would still have been either listed as absent, because of the cut utility's single space separator, if the spacing were different, or listed as present on the basis of column 1 only and the exactly two spaces after it.

IMO, a more robust way is to use the spacing tolerance of for example awk plus exact string matching :

awk 'NR==FNR{A[$1,$2]; next} !(($1,$2) in A)' y.txt x.txt

or the other way around:

awk 'NR==FNR{A[$1,$2]; next} !(($1,$2) in A)' x.txt y.txt
1 Like