Compare Two files and populate difference records into another file with difference column numbers

srivenumuthyalapati · July 27, 2023, 5:29pm

Hi Folks,

I need to compare Two files in Unix and populate the difference record with diff column numbers. Please help in creation of scripts to do this logic.

Example:
Left.dat

123|Chennai|IBM|782.0
786|Bangalore|TCS|999.00
124|Chennai|TCS|456

Right.dat

786|Bangalore|TCS|999.00
124|Chennai|TCS|456
123|Chenai|IBM|782
223|Kolkata|CGI|786

Output:

123|Chennai|IBM|782.0|2,4
223|Kolkata|CGI|786|1,2,3,4

Please note that LAST COLUMN will be the diff columns number.

Thanks in advance!!

vgersh99 · July 27, 2023, 5:59pm

welcome to the community, @srivenumuthyalapati.
Please do format your postings (data/code samples) using markdown code tags described here. I've edited your initial post for now - please do so going forward.

As far as your data samples and the desired output goes... I cannot correlate your sample input with your desired output. Take the output case of 123|Chennai|IBM|782.0|2,4. How did you arrive at 2,4? These are supposed to be the column numbers containing "different value". Why is it 2,4?
Can you walk us through your "diff" algorithm, please.

Also... anything you have tried on your own and got stuck somehow?

srivenumuthyalapati · July 28, 2023, 5:05am

Thanks for the response @vgersh99

Yeah - 2,4 are the column number which has different values.

With the help of below command, we can able to identified the different records.

awk -F "|" 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}; Left.dat Right.dat

I'm not able to proceed to populate the column number for different values.

Can you please help.

munkeHoller · July 28, 2023, 6:42am

diff/sdiff may help simplify your processing, see below, obviously the output from diff/sdiff needs massaging to get your desired result.

diff Left.dat Right.dat |grep '^[<>]'
< 123|Chennai|IBM|782.0
> 123|Chenai|IBM|782
> 223|Kolkata|CGI|786

#
sdiff -sl <(sort l.orig)  <(sort r.orig)
123|Chennai|IBM|782.0					      |	123|Chenai|IBM|782
							      >	223|Kolkata|CGI|786

srivenumuthyalapati · July 28, 2023, 8:38am

Thanks for the response @munkeHoller

Yeah - we can identified the differences using diff command - this is another option.

But mainly, concern about how to populate column number for difference records.

Any thought on this? please advise.

MadeInGermany · July 28, 2023, 11:27am

So the corresponding record in Right.dat is the first one with the same value in field 1.

When the corresponding record is found, a for loop through the fields will get the differences.

vgersh99 · July 28, 2023, 12:33pm

I'd start with trying to answer some of the questions asked previously in this post.
And then we/you can take things from there.

munkeHoller · July 28, 2023, 1:06pm

I've a [potential] solution, will post once at home

not the prettiest/shortest, but hopefully you can understand.

cat gendiffs.awk
/^< / {
	if ( p1 == "" ) # initial condition
	{
		p1=$0
		sub(/^< /,"", p1 )  # trim diff output markers
		split(p1,p1Fld,"|")
		next;
	}
	
	p1=p2=""; delete p1Fld; delete p2Fld	
}
/^> / { # 2nd part
	diffs=""# temp field list buffer
	p2=$0
	sub(/^> /,"", p2 ) # trim diff output markers
	
	if ( p1 == "" ) # p2 is unique!
	{
		printf("%s|", p2 )
		for( fld=1; fld <= split($0,p2Fld,"|"); fld++ )
			diffs=diffs "" (diffs!=""?",":"") fld
	}
	else
	{
		printf("%s|", p1 )
		for( fld=1; fld <= split(p2,p2Fld,"|"); fld++ )
			if ( length(p1Fld[fld]) != length(p2Fld[fld]) || p1Fld[fld] != p2Fld[fld] )
				diffs=diffs "" (diffs!=""?",":"") fld
	}
	
	print diffs;
	p1=p2=""; delete p1Fld; delete p2Fld	
}	

#
# test
#
diff Left.dat Right.dat | awk -f script.awk
123|Chennai|IBM|782.0|2,4
223|Kolkata|CGI|786|1,2,3,4

MadeInGermany · July 28, 2023, 5:28pm

This one directly runs on the input files:

awk -F"|" '
  # File1 : store lines in c[ ] indexed by $1
  NR==FNR { c[$1]=$0; next }
  {
    # Split the corresponding stored line to d[ ]
    split(c[$1], d, FS)
    # For loop to find and collect deviating fields
    out=sep=""
    for (i=1; i<=NF; i++)
      if ($i"" != d[i]"") { out=(out sep i); sep="," }
    # Appended "" casts to a string so a 4.0 differs from a 4
  }
  # Print the line plus the collected string
  out!="" { print ($0 FS out) }
' Left.dat Right.dat

system · May 23, 2024, 6:20pm

This topic was automatically closed 300 days after the last reply. New replies are no longer allowed.