Compare - 1st col of file

Hi,

I have two different files: one has two columns and the other has only one column. I would like to compare the first column of the first file with the data in the second file and write a third file containing the data that is not common to them.

First file:
NEIS0MDL-00022|060406043A
NEIS2FTE-00111|060406043A
NEIS2FTE-00112|060406043A
NEIS2FTE-00113|060406043A
NEIS2FTE-00114|060406043A
NEIS2FTE-00115|060406043A

Second File:
NEIS0MDL-00022
NEIS2FTE-00111
NEIS2FTE-00112
NEIS2FTE-00113
NEIS2FTE-00114
NEIS2FTE-211

Third File:
NEIS2FTE-211
NEIS2FTE-00115

Kindly help me to achieve the above.

Thanks in Advance,

Regards,

nawk -F'|' 'FNR==NR {f1[$1];next} !($1 in f1)' file1 file2
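For the record, with the sample files above this one-liner prints only the entry unique to the second file:

NEIS2FTE-211

While reading file1 (FNR==NR) it stores column 1 in the array f1; for file2 it prints the lines whose first field is not in f1.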

Hi, thanks for the reply.

With the above command, it writes the output with the data from the second file (the single-column file) that is not present in the first file. It misses the data that is not common but is present only in the first file (the two-column file).
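One way to catch both directions in a single run could be a symmetric variant of that one-liner (a sketch; the output order of for (k in ...) is not guaranteed):

awk -F'|' 'NR==FNR{a[$1];next} {b[$1]}
END{for(k in a) if(!(k in b)) print k
    for(k in b) if(!(k in a)) print k}' file1 file2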

You don't need to use awk or nawk. It is like killing a mosquito with a sledgehammer.

cat file1 file2 | cut -f1 -d \| | sort | uniq -u
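cut keeps the first |-separated field of both files, sort brings identical keys together so that uniq -u can then keep only the lines occurring exactly once. With the sample files this prints, in sort order:

NEIS2FTE-00115
NEIS2FTE-211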

This is what your sample (third) file is.
Please provide the desired/correct output given your 2 sample files.

I think this is working fine, but could you please explain it? Does the "sort" modify the order of the output?


Still the desired output is the same: the third file is expected to have the data that is not common to both. The entries NEIS2FTE-211 and NEIS2FTE-00115 are not common, so they are in the third file. Kindly let me know if I'm not clear.

But the use of 4 external programs is not the most efficient way; try this:

awk -F"|" 'NR==FNR{a[$1];next}
$1 in a{a[$1]++;next}
{print}
END{for(i in a){if(!a[i]){print i}}}
' Firstfile Secondfile

This also works fine.
Sorry, could I have one small change in how the output is written?
The data missing from the first file should go into one file, while the data missing from the second file should go into another file.

Should be something like this:

awk -F"|" 'NR==FNR{a[$1];next}
$1 in a{a[$1]++;next}
{print > "NotInFirstfile"}
END{for(i in a){if(!a[i]){print i}}}
' Firstfile Secondfile > NotInSecondfile
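With the sample files, the two output files should then end up as:

$ cat NotInFirstfile
NEIS2FTE-211
$ cat NotInSecondfile
NEIS2FTE-00115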

@Franklin52,

I would really appreciate it if you could explain how it works.
Trying to learn...

I have not specified any file name, but it still works fine... how is that possible?
How does it identify the file names for creating the above two new files?

Thanks,

awk -F"|" 'NR==FNR{a[$1];next}
$1 in a{a[$1]++;next}
{print > "NotInFirstfile"}
END{for(i in a){if(!a[i]){print i}}}
' Firstfile Secondfile > NotInSecondfile

Explanation:

-F"|" 

Set the field separator to |.

'NR==FNR{a[$1];next}

While we read the first file, define an element of array a with the first field as its index.

The next lines are for processing the second file:

$1 in a{a[$1]++;next}

If the first field is defined in array a, increase the value of that array element by 1 (the line is present in both files) and read the next line.

{print > "NotInFirstfile"}

If the first field is NOT defined in array a, print the line to the file "NotInFirstfile".
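In awk, print > "name" creates (and truncates) a file called name in the current directory the first time it is used and keeps appending to it for the rest of the run, so no output file has to be declared anywhere else. A minimal demonstration, with outfile as a purely illustrative name:

echo "hello" | awk '{print > "outfile"}'   # creates ./outfile containing hello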

END{for(i in a){if(!a[i]){print i}}}

At last we print the indices of array a whose value was never increased while reading the second file, i.e. the entries present only in the first file.

' Firstfile Secondfile > NotInSecondfile

Firstfile and Secondfile are the input files; the prints of the END section are redirected to the file NotInSecondfile.
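Putting it all together, here is the same script with inline comments (file names as used above):

awk -F"|" '
NR==FNR { a[$1]; next }                # first file: remember column 1
$1 in a { a[$1]++; next }              # second file: key exists in both, mark it
        { print > "NotInFirstfile" }   # key not in a: second file only
END     { for (i in a) if (!a[i]) print i }   # never marked: first file only
' Firstfile Secondfile > NotInSecondfile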

Regards

awk is the least efficient program to use. If you look at the awk binary, it is half as big as ksh, which means you are loading it on top of the shell you are already using. On top of that, it makes code hard to debug and read, and prevents programmers from gaining in-depth experience with UNIX commands. It was developed at a time when only sh was available and sh could not process and format character strings. That need vanished with the advent of ksh and bash. Awk right now is a crutch for people who never really learned UNIX commands.

This is the character count of the binaries:

# cd /usr/bin
# wc -c ksh
  171412 ksh
# wc -c awk
   80184 awk
# wc -c sort
    5816 sort
# wc -c uniq
   10036 uniq
# wc -c cut
    9928 cut

Which of these takes more computer resources?

I am an awk *and* ksh user and I tend to pick the right tool for the job. And for the question raised in the OP, awk *is* the right tool. Try to achieve the same result with ksh - or any other shell - with just one line of code. Oh, and awk will be _much_ faster too.

Except that the syntax is simpler without awk. Mine was also one line of code, and faster. You can test that with the command "time". As you noticed, I did not have to explain the syntax to the user.

Where's your code to produce the 2 desired files?

Well, you will be disappointed. I just ran a benchmark on files with 13000 lines each and here are the results:

jeanluc@ibm:~/scripts/test$ time nawk -F'|' 'FNR==NR {f1[$1];next} !($1 in f1)' file1 file2 > /dev/null

real	0m0.261s
user	0m0.248s
sys	0m0.008s


jeanluc@ibm:~/scripts/test$ time mawk -F'|' 'FNR==NR {f1[$1];next} !($1 in f1)' file1 file2 > /dev/null

real	0m0.093s
user	0m0.080s
sys	0m0.008s


jeanluc@ibm:~/scripts/test$ time cat file1 file2 | cut -f1 -d \| | sort | uniq -u > /dev/null

real	0m0.943s
user	0m0.888s
sys	0m0.052s
jeanluc@ibm:~/scripts/test$ 

In your solution you are using three additional external programs: cat, sort and uniq. The penalty for your system (memory and CPU wise) is higher than with a single awk run.
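That said, the cat can be dropped, since cut accepts file names directly; and note that sort -u would not be a substitute for uniq -u here, as sort -u keeps one copy of every line while uniq -u keeps only the lines that occur exactly once:

cut -f1 -d \| file1 file2 | sort | uniq -u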

From the example he shows, he needed only the third file, with the first-column values that occur only once across both files.
If one needed the full line entries from both files, this will do:

for i in `cat file1 file2 | cut -f1 -d \| | sort | uniq -u`
do
  grep -h "$i" file1 >> fil1
  grep -h "$i" file2 >> fil2
done

If one wants to save the output, one can redirect it to some file. It still runs faster than awk and it is self-explanatory.
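One caveat with the loop above: an unanchored grep can over-match when one key is a prefix of another. A safer variant (same logic, anchored matches, assuming the keys contain no regex metacharacters) might be:

for i in `cut -f1 -d \| file1 file2 | sort | uniq -u`
do
  grep -h "^$i|" file1 >> fil1   # two-column file: key followed by the delimiter
  grep -hx "$i" file2 >> fil2    # one-column file: whole-line match
done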

Just realised that I used vgersh99's solution. Here are the updated results with Franklin52's one:

with nawk:
real 0m0.279s
user 0m0.272s
sys 0m0.008s

with mawk:
real 0m0.141s
user 0m0.084s
sys 0m0.016s

with the cat | cut | sort | uniq
real 0m0.943s
user 0m0.888s
sys 0m0.052s

ripat, this is interesting. Which system are you using and which shell?

Linux and ksh. But in this case I don't think that the type of shell is relevant as all solutions are using external programs. I ran that test on large files as one can assume that the OP was just giving a sample and will be working on larger files.