Compare data in two files

jiam912 · February 15, 2015, 2:49pm

Gents,

Can you help please

I have a data base with lot information (file2) and I have some data in (file1) to compare.

Then the comparison should be done using the following keys:

Example ( values from file1 )
key1 = columns from 20-34 substr($0,20,15)

66705.00  19793

key2 = columns from 40-80 substr($0,40,41)

   1  2071  66500.00  19604.00  20016.001

Then using this keys need to compare that all information for key1 and range in key2. Matches in file2

The key1 is the same during 18 rows and then it change to other value, but the key2 changes every row.

So, the thing is to compare that all information in file1 match in file2..and when there is some difference..

Example in
file1

X  4960     22110  66705.00  19793.001 3441 40001  66704.00  19610.00  20016.001

file2

X              1   66705.00  19793.00  3441 36441  66704.00  19610.00  20016.001

values doesn't match in 40001 and 36441

The desired output will be:

66705.00 19793 has error.

If everything is ok not output will appears.

Please help me with this.

RudiC · February 16, 2015, 6:08am

You may want to try this one:

awk     'NR==FNR        {T[substr ($0, 20, 15) ".*" substr ($0, 40, 41)]=NR; next}
                        {for (t in T)   {if ($0 ~ t) {FOUND=1; break}
                                        }
                         if (!FOUND) print $3, $4 " has error."
                         FOUND=0
                        }
        ' /tmp/file1.txt /tmp/file2.txt

jiam912 · February 16, 2015, 6:31am

Hi RudiC

I did something like this and I got the desired output.

awk '{if(/^X/)print $0}' file1.txt > tmpX1
awk '{if(/^X/)print $0}' file2.txt > tmpX2
awk 'FNR == NR {c[substr($0,20,15)substr($0,52,5) ]++;next}c[(substr($0,20,15)substr($0,52,5))]-- == 1' tmpX1 tmpX2 >  tmpX3

cat tmpX1 | awk '{

				ln=substr($0,20,5);
				pt=substr($0,30,5);
				spr=substr($0,39,42);
				rpl=substr($0,52,5);
				rpl=substr($0,52,5);
				tape=$2;
				ffid=substr($0,8,8);
				printf( "%5d%5d%5d %s %4d %4d\n",ln,pt,rpl,spr,tape,ffid)}' | sort -k1n > tmp1

cat tmpX3 | awk '{

				ln=substr($0,20,5);
				pt=substr($0,30,5);
				spr=substr($0,39,42);
				rpl=substr($0,52,5);
				printf( "%5d%5d%5d %s\n",ln,pt,rpl,spr)}' | sort -k1n > tmp2


awk 'NR==FNR {a[substr($0,1,58)];next}!(substr($0,1,58) in a)' tmp2 tmp1 > tmp3

awk '{printf (" %5d/%5d has errors\n", substr($1,1,5),substr($1,6,5))}' tmp3		

rm -f *tmp*

Output

66705/19793 has errors

I will try your code now..

RudiC · February 16, 2015, 7:38am

Not all versions of cut offer the --output-delimiter=STRING option - mine does:

cut -c20-34,40-81 --output-delimiter=".*" /tmp/file2.txt | grep -vf- /tmp/file1.txt 
X  4960     22110  66705.00  19793.001 3441 40001  66704.00  19610.00  20016.001

---------- Post updated at 13:38 ---------- Previous update was at 13:34 ----------

and pipe it through cut again to only receive the two values you need:

 | cut -c20-34
66705.00  19793

jiam912 · February 16, 2015, 8:31am

Rudi C,
That cut version does not works for me

The previuos code works but take a lot of time and looks like as the files are not sorted as I did in my script .. your code got more data in the output

Many thanks for your help

RudiC · February 16, 2015, 9:12am

The awk script prints all lines in file2 that don't have a counterpart in file1. If it needs to be the other way round, exchange the file names, but check the printed fields, too.