I have two different files: one has two columns and the other has only one column. I would like to compare the first column of the first file with the data in the second file, and write a third file containing the data that is not common to them.
First file:
NEIS0MDL-00022|060406043A
NEIS2FTE-00111|060406043A
NEIS2FTE-00112|060406043A
NEIS2FTE-00113|060406043A
NEIS2FTE-00114|060406043A
NEIS2FTE-00115|060406043A
Second File:
NEIS0MDL-00022
NEIS2FTE-00111
NEIS2FTE-00112
NEIS2FTE-00113
NEIS2FTE-00114
NEIS2FTE-211
Third File:
NEIS2FTE-211
NEIS2FTE-00115
With the above command, the output only contains the data from the second file (the single-column file) that is not present in the first file. It is missing the data that is not common but present in the first file (the two-column file).
I think this is working fine, but could you please explain it? Does "sort" modify the order of display?
---------- Post updated at 12:04 PM ---------- Previous update was at 12:02 PM ----------
The desired output is still the same: the third file is expected to hold the data that is not common to both. The entries 211 and 00115 are not common, so they go into the third file. Kindly let me know if I'm not being clear.
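One way to get that symmetric difference is the sort/uniq pipeline discussed later in this thread; a self-contained sketch using the sample data above (the names file1/file2/file3 are just placeholders):

```shell
# Sample data from the thread
cat > file1 <<'EOF'
NEIS0MDL-00022|060406043A
NEIS2FTE-00111|060406043A
NEIS2FTE-00112|060406043A
NEIS2FTE-00113|060406043A
NEIS2FTE-00114|060406043A
NEIS2FTE-00115|060406043A
EOF
cat > file2 <<'EOF'
NEIS0MDL-00022
NEIS2FTE-00111
NEIS2FTE-00112
NEIS2FTE-00113
NEIS2FTE-00114
NEIS2FTE-211
EOF

# Keep only first-column keys that occur exactly once across both files.
# cut leaves single-column lines from file2 untouched; after sort, keys
# present in both files appear twice and uniq -u drops them.
cut -d'|' -f1 file1 file2 | sort | uniq -u > file3
```

Note that `uniq -u` prints lines that occur exactly once, which is what makes this a symmetric difference rather than a plain de-duplication.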
This also works fine.
I'm sorry, could I ask for one small change in how the output is written? The data missing from the first file should go into one file, while the data missing from the second file should go into another file.
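A sketch of one way to split the two directions into separate files with awk (the output names missing_in_file1 and missing_in_file2 are my own, not from the thread):

```shell
# Sample data from the thread
cat > file1 <<'EOF'
NEIS0MDL-00022|060406043A
NEIS2FTE-00111|060406043A
NEIS2FTE-00112|060406043A
NEIS2FTE-00113|060406043A
NEIS2FTE-00114|060406043A
NEIS2FTE-00115|060406043A
EOF
cat > file2 <<'EOF'
NEIS0MDL-00022
NEIS2FTE-00111
NEIS2FTE-00112
NEIS2FTE-00113
NEIS2FTE-00114
NEIS2FTE-211
EOF

awk -F'|' '
  FNR==NR { f1[$1]; next }    # first file: remember column 1
  {                           # second file: record key, report if unknown
    f2[$1]
    if (!($1 in f1)) print $1 > "missing_in_file1"
  }
  END {                       # keys of file1 never seen in file2
    for (k in f1) if (!(k in f2)) print k > "missing_in_file2"
  }
' file1 file2
```

With the sample data, missing_in_file1 gets NEIS2FTE-211 and missing_in_file2 gets NEIS2FTE-00115.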
I would really appreciate it if you could explain how it works.
Trying to learn....
I have not specified any file name, but it still works fine... how is that possible? How does it identify the source files for creating the two new files above?
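For what it's worth: awk treats every argument after the program text as an input file, and a redirection like `print > "name"` is handled by awk itself, which opens and creates that file on first write — the shell never sees it. A minimal sketch (the names in.txt and out.txt are made up):

```shell
printf 'a\nb\n' > in.txt

# awk reads in.txt (named on its command line) and itself creates out.txt
# the first time the print statement writes to it; the file stays open
# for subsequent prints until awk exits.
awk '{ print $0 > "out.txt" }' in.txt
```

After this runs, out.txt contains the same two lines as in.txt, even though no shell `>` redirection ever mentioned out.txt.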
awk is the least efficient program to use. If you look at the awk binary, it is half as big as ksh, which means you are loading it on top of the shell you are already using. On top of that, it makes code hard to debug and read, and it prevents programmers from gaining in-depth experience with UNIX commands. It was developed at a time when only sh was available and the shell could not process and format character strings. That need vanished with the advent of ksh and bash. awk right now is a crutch for people who never really learned UNIX commands.
I am an awk *and* ksh user, and I tend to pick the right tool for the job. For the question raised in the OP, awk *is* the right tool. Try to achieve the same result with ksh - or any other shell - in just one line of code. Oh, and awk will be _much_ faster too.
Except the syntax is simpler without awk. Mine was also one line of code, and faster. You can test that with the "time" command. As you noticed, I did not have to explain the syntax to the user.
Well, you will be disappointed. I just ran a benchmark on files with 13,000 lines each, and here are the results:
jeanluc@ibm:~/scripts/test$ time nawk -F'|' 'FNR==NR {f1[$1];next} !($1 in f1)' file1 file2 > /dev/null
real 0m0.261s
user 0m0.248s
sys 0m0.008s
jeanluc@ibm:~/scripts/test$ time mawk -F'|' 'FNR==NR {f1[$1];next} !($1 in f1)' file1 file2 > /dev/null
real 0m0.093s
user 0m0.080s
sys 0m0.008s
jeanluc@ibm:~/scripts/test$ time cat file1 file2 | cut -f1 -d \| | sort | uniq -u > /dev/null
real 0m0.943s
user 0m0.888s
sys 0m0.052s
jeanluc@ibm:~/scripts/test$
In your solution you are using three different external programs: cat, sort and uniq. The cat, BTW, is unnecessary, since sort accepts multiple file names directly (sort file1 file2 | uniq -u). The penalty for your system (memory- and CPU-wise) is higher than with a single awk run.
From the example he shows, he needed only the third file, containing the first-column values that occur only once across both files.
If one needs the full line entry from both files, this will do:
for i in `cat file1 file2 | cut -f1 -d \| | sort | uniq -u`
do
    grep -h "^$i" file1 >> fil1
    grep -h "^$i" file2 >> fil2
done
If one wants to save the output, one can redirect it to some file. It still runs faster than awk, and it is self-explanatory.
Linux and ksh. But in this case I don't think the type of shell is relevant, as all the solutions use external programs. I ran the test on large files, since one can assume the OP was just giving a sample and will be working with larger files.