Generate separate files with similar and dissimilar contents

Hello experts,

I have two files, 1.txt (10,000 lines of text) and 2.txt (7,500 lines of text).
Both files have similar as well as dissimilar entries.
Is there a way I can perform the following operations:

  1. Generate a file which will have all similar lines.
  2. Generate a file which will have all dissimilar lines.

For my part, I ran the following command to generate a file containing the dissimilar lines:

fgrep -v -f 1.txt 2.txt > 3.txt

Example of file 1.txt

 1
 2
 4
 6
 8
 3
 g
 f

Example of file 2.txt

 1
 x
 z
 3
 m
 0
 8

Could you please help with both of these queries?

Thank you.

Regards,
Haider

First of all, I would recommend that you read the man page of grep and understand how the "-v" switch works. You'll get your answer. :)

Note that grep -Fvf 1.txt 2.txt will just give you the list of lines in 2.txt that are not in 1.txt. To also get the list of lines in 1.txt that are not in 2.txt, you'll need a second grep.
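For instance, here is a sketch using the sample data from this thread (recreating 1.txt and 2.txt in the current directory), with one grep for each direction of the difference:

```shell
# Recreate the sample files from this thread.
printf '%s\n' 1 2 4 6 8 3 g f > 1.txt
printf '%s\n' 1 x z 3 m 0 8 > 2.txt

# Lines in 1.txt that are not in 2.txt
grep -Fvf 2.txt 1.txt

# Lines in 2.txt that are not in 1.txt
grep -Fvf 1.txt 2.txt
```

With this sample data, the first grep prints 2, 4, 6, g, f and the second prints x, z, m, 0.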

Thanks for the idea. I tried the following command to get the combined dissimilar elements from both files:

grep -Fvf 2.txt 1.txt && grep -Fvf 1.txt 2.txt
2
4
6
g
f
x
z
m
0

With regard to similar elements, I tried the following commands:

grep -Fxf 2.txt 1.txt
1
8
3

grep -Fxf 1.txt 2.txt
1
3
8

The result is the same in both cases.
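As an aside (not something used in this thread): if the order of the output lines does not matter, the standard comm utility can produce both the intersection and the differences from sorted copies of the files. A sketch with the same sample data:

```shell
# Build sorted copies of the thread's sample data (comm requires sorted input).
printf '%s\n' 1 2 4 6 8 3 g f | sort > 1.sorted
printf '%s\n' 1 x z 3 m 0 8 | sort > 2.sorted

comm -12 1.sorted 2.sorted   # lines common to both files
comm -3  1.sorted 2.sorted   # lines unique to one file (two columns)
```

Here comm -12 prints 1, 3, 8, matching the grep -Fxf result above.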

Hello H squared,

Could you please try the following and let me know if it helps you.

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print $1 >> "similar_ones.txt";delete A[$1];next} !($1 in A){print $1 >> "dissimilar_ones.txt"} END{for(i in A){print A[i] >> "dissimilar_ones.txt"}}'  Input_file1   Input_file2

The above will create two files named similar_ones.txt and dissimilar_ones.txt, which will look as follows.

cat similar_ones.txt
1
3
8

cat dissimilar_ones.txt
x
z
m
0
4
6
f
2
g

EDIT: Adding an expanded (non-one-liner) form of the solution now.

awk 'FNR==NR{
             A[$1]=$1;
             next
            }
     ($1 in A){
                print $1 >> "similar_ones.txt";
                delete A[$1];
                next
              }
     !($1 in A){
                print $1 >> "dissimilar_ones.txt"
               }
     END{
                for(i in A){
                                print A[i] >> "dissimilar_ones.txt"
                           }
        }
    '  Input_file1   Input_file2

NOTE: The file dissimilar_ones.txt will have the symmetric difference of the two files; that is, it will have the contents which are in Input_file1 and NOT in Input_file2, plus the contents which are in Input_file2 and NOT in Input_file1.

Thanks,
R. Singh


Note that:

grep -Fvf 2.txt 1.txt && grep -Fvf 1.txt 2.txt

won't look for lines in 2.txt that are not in 1.txt if there aren't any lines in 1.txt that are not in 2.txt: grep exits with a non-zero status when it finds no matching lines, so && skips the second command. It would seem that:

grep -Fvf 2.txt 1.txt ; grep -Fvf 1.txt 2.txt

would be more likely to give you what you want.

Note also that if you are only looking for complete line matches when looking for matching lines, you probably also want complete line matches when looking for non-matching lines. That would be:

grep -Fxvf 2.txt 1.txt; grep -Fxvf 1.txt 2.txt
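To see why -x matters here, a small sketch (the file names p.txt and q.txt and their contents are my own illustration, not from the thread):

```shell
# "10" contains "1" as a substring, so without -x it is wrongly filtered out.
printf '%s\n' 1 2 > p.txt
printf '%s\n' 10 2 > q.txt

grep -Fvf p.txt q.txt || true   # prints nothing: "10" removed by substring match
grep -Fxvf p.txt q.txt          # prints: 10
```

With -x, a pattern line only suppresses a data line that matches it in its entirety.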

Is there some reason why you would expect that:

grep -Fxf 2.txt 1.txt

and:

grep -Fxf 1.txt 2.txt

would produce different output (other than the order of the matching lines found)? These commands both print lines that are present in both files. Why would lines that are found in 1.txt and also in 2.txt differ from lines that are found in 2.txt and also in 1.txt?

The output of scenario 1 (dissimilar elements) can be redirected to a file as:

 { grep -Fvf 2.txt 1.txt && grep -Fvf 1.txt 2.txt; } > 3.txt

For scenario 2, it is easier, as the output contents are the same either way:

grep -Fxf 1.txt 2.txt > 3.txt

As long as no line in either of your input files is a substring of a line in the other input file AND there is at least one line in 1.txt that is not present in 2.txt, you will get away with using the 3 grep commands above to do what you are trying to do.

If any of the conditions listed above is violated, you will not get correct results with the above code. But the changes I suggested in post #6 for the first two grep commands:

(grep -Fxvf 2.txt 1.txt; grep -Fxvf 1.txt 2.txt) > 3.txt

would give you correct results.

But, unless there are duplicated lines in one or both of your input files, the single awk script RavinderSingh13 suggested will be faster (needing only 1 process instead of 3, and reading the 17,500 lines of input from your two files once instead of three times). If what you're saying is that you want an output file named 3.txt instead of dissimilar_ones.txt, I would assume that you understand that you can change the string "dissimilar_ones.txt" in two places in Ravinder's awk script to change the name of that output file.

If there are duplicated lines in your input files that need to be preserved, Ravinder's suggested awk script could be modified slightly to handle that case (which was not mentioned in your requirements and input samples) and still get the speed improvements afforded by executing fewer processes and reading your input files only once.
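One possible modification along those lines (my own sketch, not from the thread; the sample files a.txt and b.txt and the output names similar.txt and dissimilar.txt are hypothetical): count occurrences instead of storing a single copy, so each duplicate is matched or reported individually:

```shell
# Hypothetical sample files containing duplicated lines.
printf '%s\n' 1 1 2 8 > a.txt
printf '%s\n' 1 8 8 x > b.txt

awk 'FNR==NR{cnt[$0]++; next}                # pass 1: count each line of a.txt
     ($0 in cnt && cnt[$0] > 0){             # pass 2: consume one matching copy
         print > "similar.txt"; cnt[$0]--; next
     }
     {print > "dissimilar.txt"}              # line of b.txt with no copy left
     END{for (l in cnt)                      # leftover (unmatched) copies of a.txt
             while (cnt[l]-- > 0) print l > "dissimilar.txt"
     }' a.txt b.txt
```

With this data, similar.txt holds 1 and 8 (one copy each), while dissimilar.txt holds the extra 8 and the x from b.txt plus the extra 1 and the 2 from a.txt.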

Of course, as always, if you're using a Solaris/SunOS system, you'd need to change awk in Ravinder's script to /usr/xpg4/bin/awk or nawk .