Diff two files with threshold value

I have two big files, each with thousands of lines.
I have to sort them on two key fields and then diff the files.

If the integer value of one of the columns is less than or greater than 1, it should ignore it.

For example:
File1

abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.6    

File2

abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.5
uvw|1000|abe|5.6

Expected diff output

uvw|1000|abe|5.6

The commands I tried are below:

sort file1 > sortfile1; sort file2 > sortfile2; diff sortfile1 sortfile2

Output:

2c2,3
< pqr|8000|kapi|4.6
---
> pqr|8000|kapi|4.5
> uvw|1000|abe|5.6

Hi,

Your explanation is not clear to me:

Do both files have the same number of lines?

Which fields?

I don't understand this.

I don't understand that output. Lines that exist in one file but not in the other?

Regards,
Birei

Hi, thanks for your response.

1) The two files have different numbers of lines.
file1 may have 50 thousand records and file2 may have 51 thousand records.

2) I have to sort the files on the second and third columns.
For example, I have to first sort file1 and file2 on the second column (7000) and the third column (jhon):

abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.6

3) file1 has

pqr|8000|kapi|4.6

file2 has

pqr|8000|kapi|4.5

There is a difference of 0.1 in the fourth column (4.6 and 4.5).
The diff should ignore the line if the difference in the fourth column is 0.1.

4) If the difference is more than 0.1, it should be reported in the output.
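The sort on the second and third columns described in point 2 could be sketched as follows. This is only a minimal sketch: it assumes the second column is numeric and the third is text, and sort_by_keys is a made-up helper name, not something from this thread.

```shell
#!/bin/sh
# Sketch: sort a pipe-delimited file on column 2 (numeric), then column 3.
# sort_by_keys is a hypothetical helper name, not something from the thread.
sort_by_keys() {
    # -t '|'  : use | as the field separator
    # -k2,2n  : first key is field 2 only, compared numerically (assumption)
    # -k3,3   : second key is field 3 only, compared as text
    LC_ALL=C sort -t '|' -k2,2n -k3,3 "$1"
}

# Sample data from the first post.
tmpdir=$(mktemp -d)
printf '%s\n' 'abc|7000|jhon|2.3' 'xyz|9000|sam|6.7' 'pqr|8000|kapi|4.6' > "$tmpdir/file1"

sort_by_keys "$tmpdir/file1"
# prints:
# abc|7000|jhon|2.3
# pqr|8000|kapi|4.6
# xyz|9000|sam|6.7
```

Both files would be sorted the same way before comparing them. A plain diff of the sorted files still reports the 4.6 / 4.5 line, so the threshold in points 3 and 4 needs a separate step.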

Therefore, the output should be:

1.- Lines that exist only in one of the two files.
2.- Lines whose first three columns are the same but whose fourth-column values differ by something other than 0.1 or -0.1. In that case, which file's line should be written to the output?

Regards,
Birei

Thanks, Birei, for the quick response.
Yes, you are correct.

The fourth column is a real number, and if the difference is 0.1 or -0.1 it is acceptable, so the line does not need to be in the output.

If the difference is more than 0.1, the line should be in the output.
Let me put it in an example:

1) file1

abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.6
lmn|3000|kapi|4.6

2) file 2

abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.5
lmn|3000|kapi|4.1

Expectation:
1) The third row has a difference of 0.1, so it is acceptable and is not required in the output.

2) The fourth row has a difference of more than 0.1 (it is 0.5), so it should be in the output.
Output:

> lmn|3000|kapi|4.1
< lmn|3000|kapi|4.6

Do you want output if the line occurs in only one file, or only if it appears in both files and has a difference of more than 0.1 in column 4?

Yes, only if it appears in both files.

Try the following Perl script:

$ cat file1
abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.6
lmn|3000|kapi|4.6
$ cat file2
abc|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.5
lmn|3000|kapi|4.1
$ cat script.pl
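use warnings;
use strict;

@ARGV == 2 or die qq[Usage: perl $0 file1 file2\n];

my $num_file = 1;   # which input file is being read (1 or 2)
my %line;           # key: all fields but the last, value: last field in file1

while ( <> ) {
        chomp;
        my @f = split /\|/;
        my $key = "@f[0..$#f-1]";   # all fields except the last one

        # First file: remember the last field for each key.
        if ( $num_file == 1 ) {
                $line{ $key } = $f[-1];
                next;
        }

        # Second file: report keys present in both files whose last fields
        # differ by more than 0.1 (the 1e-9 absorbs floating-point rounding,
        # e.g. 2.4 - 2.3 is slightly larger than 0.1 in binary arithmetic).
        if ( exists $line{ $key } and abs( $line{ $key } - $f[-1] ) > 0.1 + 1e-9 ) {
                printf "> %s\n< %s\n",
                        join( "|", @f[0..$#f-1], $line{ $key } ),
                        join( "|", @f[0..$#f-1], $f[-1] );
        }
} continue {
        # Move on to the next file when the current one is exhausted.
        ++$num_file if eof;
}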
$ perl script.pl
Usage: perl script.pl file1 file2
$ perl script.pl file1 file2
> lmn|3000|kapi|4.6
< lmn|3000|kapi|4.1

Regards,
Birei
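For readers who do not know Perl, the same idea can be sketched with awk. This is only an illustration under assumptions: threshold_diff is a made-up helper name, the key is assumed to be exactly the first three fields, and keys are assumed to be unique within each file.

```shell
#!/bin/sh
# Sketch: report lines whose first three fields match in both files
# but whose fourth field differs by more than a threshold.
# threshold_diff is a hypothetical helper name.
threshold_diff() {
    awk -F'|' -v limit=0.1 '
        # First file: remember the fourth field, keyed on fields 1-3.
        NR == FNR { seen[$1 FS $2 FS $3] = $4; next }
        # Second file: compare only when the key was seen in the first file.
        {
            key = $1 FS $2 FS $3
            if (key in seen) {
                d = seen[key] - $4
                if (d < 0) d = -d
                # 1e-9 absorbs floating-point rounding error
                if (d > limit + 1e-9)
                    printf "> %s%s%s\n< %s\n", key, FS, seen[key], $0
            }
        }
    ' "$1" "$2"
}

# Sample data from the posts above.
tmpdir=$(mktemp -d)
printf '%s\n' 'abc|7000|jhon|2.3' 'xyz|9000|sam|6.7' \
    'pqr|8000|kapi|4.6' 'lmn|3000|kapi|4.6' > "$tmpdir/file1"
printf '%s\n' 'abc|7000|jhon|2.3' 'xyz|9000|sam|6.7' \
    'pqr|8000|kapi|4.5' 'lmn|3000|kapi|4.1' > "$tmpdir/file2"

threshold_diff "$tmpdir/file1" "$tmpdir/file2"
# prints:
# > lmn|3000|kapi|4.6
# < lmn|3000|kapi|4.1
```

As in the Perl script, the line marked ">" carries the value from the first file and the line marked "<" the value from the second. No sorting is needed here because the lookup is by key, not by position.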

It works perfectly for the fourth column.

But it doesn't catch differences in columns 1 to 3.

For example:
file1
pqr|7000|jhon|2.3
xyz|9000|sam|6.7
pqr|8000|kapi|4.5
lmn|3000|kapi|4.6

file 2

pqr|7000|jhon|2.3
xyz|12000|sam|6.7
pqr|8000|kapi|4.5
lmn|3000|kapi|4.1

If the second column of the second row is different, it doesn't work:
file1
xyz|9000|sam|6.7

file2
xyz|12000|sam|6.7

Seven posts trying to make clear what you are trying to get, and I am still confused. I think we are not understanding each other, and it may be my problem. I will quote:

You said:

To the question from spynappels:

your answer is:

The program compares the first three fields of each line in both files. Only if they are the same does it compare the fourth column. In your last example the second field is different (9000 vs. 12000). For the program those lines are different, so there is nothing to compare and nothing to send to output.

You post different input with each message. Please try to help me a little and tell me what 'same line in both files' means to you. Say what to do when field1, field2, or field3 is different in the two files. Say what to do if there is a line in one file that doesn't exist in the other one (all fields different), and post an example input file with all cases and exactly the expected output.

Regards,
Birei.
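To illustrate the point about keys: if lines whose first three fields have no match in the other file should also be reported (the thread never settles this, so treat it as an assumption), a sketch with awk could look like the one below. unmatched_keys is a made-up helper name, and keys are assumed to be unique within each file.

```shell
#!/bin/sh
# Sketch: list lines whose key (fields 1-3) appears in only one file.
# unmatched_keys is a hypothetical helper name.
unmatched_keys() {
    awk -F'|' '
        # First file: remember each full line under its key.
        NR == FNR { only1[$1 FS $2 FS $3] = $0; next }
        # Second file: a key seen before is common to both files.
        {
            key = $1 FS $2 FS $3
            if (key in only1)
                delete only1[key]
            else
                print "only in file2: " $0
        }
        # Whatever is left was never matched by the second file.
        END {
            for (key in only1)
                print "only in file1: " only1[key]
        }
    ' "$1" "$2"
}

# Sample data from the last example (9000 in file1, 12000 in file2).
tmpdir=$(mktemp -d)
printf '%s\n' 'pqr|7000|jhon|2.3' 'xyz|9000|sam|6.7' \
    'pqr|8000|kapi|4.5' 'lmn|3000|kapi|4.6' > "$tmpdir/file1"
printf '%s\n' 'pqr|7000|jhon|2.3' 'xyz|12000|sam|6.7' \
    'pqr|8000|kapi|4.5' 'lmn|3000|kapi|4.1' > "$tmpdir/file2"

unmatched_keys "$tmpdir/file1" "$tmpdir/file2"
# prints:
# only in file2: xyz|12000|sam|6.7
# only in file1: xyz|9000|sam|6.7
```

With more than one unmatched line in the first file, the order of the "only in file1" lines is not guaranteed, because awk does not define the traversal order of an array.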

Birei, sorry for not being clear.

The script works fine.
Perl is new to me, so I was not able to translate the logic.
But this script works fine for me :)

Thanks, Birei, for all your effort.

It does not matter.

You can safely say so if the program doesn't work well, but please understand that it is sometimes inconvenient to change the code when the requirements are not clear from the beginning.

Regards,
Birei