I have 2 files: 1.txt (10,000 lines of text) and 2.txt (7,500 lines of text).
Both files have similar as well as dissimilar entries.
Is there a way I can perform the following operations:
Generate a file which will have all similar lines.
Generate a file which will have all dissimilar lines.
On my part, I ran the following command to generate a file which will have all the dissimilar lines:
Note that grep -Fvf 1.txt 2.txt will just give you a list of the lines in 2.txt that are not in 1.txt. To also get the lines in 1.txt that are not in 2.txt, you'll need a second grep.
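For reference, this is how the two grep invocations behave on a pair of tiny sample files (the file contents here are made up for illustration, standing in for the real 1.txt and 2.txt):

```shell
# Hypothetical sample data:
printf 'apple\nbanana\ncherry\n' > 1.txt
printf 'banana\ncherry\ndate\n'  > 2.txt

grep -Fvf 1.txt 2.txt   # lines of 2.txt not found in 1.txt -> date
grep -Fvf 2.txt 1.txt   # lines of 1.txt not found in 2.txt -> apple
```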
Could you please try following and let me know if this helps you.
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print $1 >> "similar_ones.txt";delete A[$1];next} !($1 in A){print $1 >> "dissimilar_ones.txt"} END{for(i in A){print A[i] >> "dissimilar_ones.txt"}}' Input_file1 Input_file2
The above will create 2 files named similar_ones.txt and dissimilar_ones.txt, which will be as follows.
cat similar_ones.txt
1
3
8
cat dissimilar_ones.txt
x
z
m
0
4
6
f
2
g
EDIT: Adding a non-one-liner form of the solution now.
awk 'FNR==NR{                  # first pass: store every key from Input_file1
A[$1]=$1;
next
}
($1 in A){                     # key present in both files
print $1 >> "similar_ones.txt";
delete A[$1];                  # so it is not reported again in the END block
next
}
!($1 in A){                    # key present only in Input_file2
print $1 >> "dissimilar_ones.txt"
}
END{                           # keys left in A were present only in Input_file1
for(i in A){
print A[i] >> "dissimilar_ones.txt"
}
}
' Input_file1 Input_file2
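As a quick sanity check, the script can be run on two tiny hypothetical input files (note that the lines printed from the END block come out in arbitrary order, since for(i in A) walks the array in an unspecified order, so the outputs are sorted here for display):

```shell
# Hypothetical sample inputs:
printf '1\n3\nx\nz\n' > Input_file1
printf '1\n3\n0\n'    > Input_file2
rm -f similar_ones.txt dissimilar_ones.txt   # the script appends, so start clean

awk 'FNR==NR{A[$1]=$1;next}
($1 in A){print $1 >> "similar_ones.txt";delete A[$1];next}
!($1 in A){print $1 >> "dissimilar_ones.txt"}
END{for(i in A){print A[i] >> "dissimilar_ones.txt"}}' Input_file1 Input_file2

sort similar_ones.txt      # 1 and 3 appear in both files
sort dissimilar_ones.txt   # 0, x and z each appear in only one file
```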
NOTE: The file dissimilar_ones.txt will hold the difference of both files; that is, it will contain the lines that are in Input_file1 and NOT in Input_file2,
plus the lines that are in Input_file2 and NOT in Input_file1.
Note that joining the two grep commands with && won't look for lines in 2.txt that are not in 1.txt if there aren't any lines in 1.txt that are not in 2.txt, because grep exits with a non-zero status when it finds nothing. It would seem that:
grep -Fvf 2.txt 1.txt ; grep -Fvf 1.txt 2.txt
would be more likely to give you what you want.
Note also that if you are only looking for complete line matches when looking for matching lines, you probably also want complete line matches when looking for non-matching lines. That would be:
grep -Fxvf 2.txt 1.txt; grep -Fxvf 1.txt 2.txt
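The substring issue is easy to demonstrate with made-up data: without -x, the line 100 is wrongly suppressed because it merely contains the pattern 10 as a substring.

```shell
# Hypothetical one-line files:
printf '10\n'  > a.txt
printf '100\n' > b.txt

grep -Fvf  a.txt b.txt || true   # prints nothing: "100" contains "10" as a
                                 # substring (and grep exits non-zero here)
grep -Fxvf a.txt b.txt           # prints 100: only a whole-line match suppresses it
```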
Is there some reason why you would expect that:
grep -Fxf 2.txt 1.txt
and:
grep -Fxf 1.txt 2.txt
would produce different output (other than the order in which the matching lines are found)? These commands both print the lines that are present in both files. Why would the lines that are found in 1.txt and found in 2.txt be different from the lines that are found in 2.txt and found in 1.txt?
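That symmetry is easy to check on hypothetical data: both commands print the same set of lines, just in the order of the file being scanned.

```shell
# Hypothetical sample data:
printf 'a\nb\nc\n' > 1.txt
printf 'c\na\nd\n' > 2.txt

grep -Fxf 2.txt 1.txt   # a, c  (in 1.txt order)
grep -Fxf 1.txt 2.txt   # c, a  (in 2.txt order)
```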
As long as no line in either of your input files is a substring of a line in the other input file AND there is at least one line in 1.txt that is not present in 2.txt, you will get away with using the 3 grep commands above to do what you are trying to do.
If either of those conditions is violated, you will not get the correct results with the above code; but the changes I suggested in post #6 for the 1st two grep commands (adding -x and separating them with ; instead of &&) take care of both problems.
But, unless there are duplicated lines in one or both of your input files, the single awk script RavinderSingh13 suggested will be faster (needing only 1 process instead of 3, and reading the 17,500 lines of input from your two files once instead of three times). If what you're saying is that you want an output file named 3.txt instead of dissimilar_ones.txt, I would assume that you understand you can change the string "dissimilar_ones.txt" in the two places it appears in Ravinder's awk script to change the name of that output file.
If there are duplicated lines in your input files that need to be preserved, Ravinder's suggested awk script could be modified slightly to handle that case (which was not mentioned in your requirements and input samples) and still get the speed improvements afforded by executing fewer processes and only needing to read your input files once.
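One way such a modification could look (a sketch, not Ravinder's code: it counts occurrences instead of deleting keys, so duplicates in either file are preserved, and it keys on $0 so whole lines are compared; the sample data is made up):

```shell
# Hypothetical inputs with duplicated lines:
printf 'a\na\nb\nc\n' > Input_file1
printf 'a\nb\nb\nd\n' > Input_file2
rm -f similar_ones.txt dissimilar_ones.txt   # the script appends, so start clean

awk '
FNR==NR { cnt[$0]++; next }        # count every line of the first file
cnt[$0] > 0 {                      # line still has unmatched copies in file 1
    print >> "similar_ones.txt"
    cnt[$0]--
    next
}
{ print >> "dissimilar_ones.txt" } # line occurs only (or more often) in file 2
END {                              # leftover counts: copies only in file 1
    for (l in cnt)
        while (cnt[l]-- > 0)
            print l >> "dissimilar_ones.txt"
}' Input_file1 Input_file2
```

With these inputs, similar_ones.txt holds one a and one b (the matched copies), while dissimilar_ones.txt holds the second a from Input_file1, the second b from Input_file2, plus c and d.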
Of course, as always, if you're using a Solaris/SunOS system, you'd need to change awk in Ravinder's script to /usr/xpg4/bin/awk or nawk.