How to replace the complex strings from a file using sed or awk?

Badhrish · February 20, 2015, 2:21am

Dear All,

I have a requirement to find the differences between two files and generate a discrepancy report from them as an HTML page. I prefer diff -y file1 file2 because its side-by-side layout makes it easy to spot discrepancies within records as well as records unique to either file. Here is how it looks.
File1:

ABCD*DEFG~ 
HI*JK~
LMN*OP~

File2:

ABCD*DEFG~
HIH*JK~
LMN*OP~
FGH*NM~

Output is :

ABCD*DEFG~                                                   ABCD*DEFG~
HI*JK~                                                        |  HIH*JK~
                                                                  > XY*Z~
LMN*OP~                                                        LMN*OP~
                                                                  > FGH*NM~

I need to wrap the lines that contain bad data in HTML tags (as prefix and suffix) without altering the indentation of the output:

ABCD*DEFG~             ABCD*DEFG~
<font color="red">HI*JK~  | HIH*JK~</font>
<font color="red">            > XY*Z~</font>
LMN*OP~                            LMN*OP~
<font color="red">            > FGH*NM~</font>

I cannot use | or > as the field separator (FS) in awk or as a delimiter in sed, since my actual files might also contain those characters. Please suggest the best way to overcome this.

RudiC · February 20, 2015, 3:29am

With a recent bash providing "process substitution", you could try

awk     'FNR==NR {T[$0]; next}
         $0 in T {print "<font color=\"red\">" $0 "</font>"; next}
         1
        ' <(diff -y --suppress-common-lines file[12]) <(diff -y file[12])
<font color="red">ABCD*DEFG~                               |    ABCD*DEFG~</font>
<font color="red">HI*JK~                                  |    HIH*JK~</font>
LMN*OP~                                LMN*OP~
<font color="red">                                  >    FGH*NM~</font>

Badhrish · February 20, 2015, 5:24am

Hi Rudi, thanks for the reply, but I am not getting any output from this code. Please check whether I am missing something.

My file: diffoutput.txt

ABCD*DEFG~                                                  ABCD*DEFG~
HI*JK~                                                        | HIH*JK~
                                                              > XY*Z~
LMN*OP~                                                         LMN*OP~
                                                              > FGH*NM~

Awk code: awktest.sh

#! /bin/bash

awk     'FNR==NR {T[$0]; next}
         $0 in T {print "<font color=\"red\">" $0 "</font>"; next} 1 ' diffoutput.txt

RudiC · February 20, 2015, 5:38am

No surprise, as you're not printing anything. That awk needs two files: first the result of diff -y --suppress-common-lines file[12], second the result of diff -y file[12].
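For a shell without process substitution, the same two-input idea can be sketched with ordinary files. This is only an illustration: the function name mark_diff_lines and the file names in the usage comment are assumptions, and the test FILENAME == ARGV[1] is swapped in for FNR==NR so that an empty first file (no differences at all) still lets the second file print.

```shell
#!/bin/sh
# Sketch: tag the lines of a full side-by-side diff that also occur in the
# "--suppress-common-lines" diff. mark_diff_lines is a hypothetical name.
mark_diff_lines() {
    # $1 = file holding the output of: diff -y --suppress-common-lines file1 file2
    # $2 = file holding the output of: diff -y file1 file2
    awk 'FILENAME == ARGV[1] {T[$0]; next}   # first file: remember differing lines
         $0 in T {print "<font color=\"red\">" $0 "</font>"; next}
         1                                    # second file: pass other lines through
        ' "$1" "$2"
}

# Usage (assumes a diff that supports -y, such as GNU diff):
#   diff -y --suppress-common-lines file1 file2 > changed.txt
#   diff -y file1 file2 > all.txt
#   mark_diff_lines changed.txt all.txt
```

Both diff runs should use the same width, otherwise the lines stored from the first file will not match the lines of the second.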

Badhrish · February 20, 2015, 6:49am

My bad. I altered it and the command worked like a beauty. But it is messing up the indentation of the final output: when I view it in an HTML page, the formatting is gone.
Each record has to be kept on one line rather than spilling onto the next. I used --width=100 to keep each record in its position, but in vain.
Is there a way to preserve the layout? Hope I am not asking too much!

ABCD*DEFG~                                                      ABCD*DEFG~
<font color="red">HI*JK~                                                              | HIH*JK~</font>LMN*OP~                                                           LMN*OP~
<font color="red">                                                            > FGH*NM~</font>
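Two things might be at play here, though the altered command is not shown, so this is a guess. If the awk action uses printf without a trailing "\n", the tagged line runs straight into the next record; print adds the newline. And a browser collapses runs of spaces unless the text sits inside a <pre> element. A minimal sketch of the second point, where wrap_pre is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: wrap already-tagged diff lines in <pre> so a browser keeps the columns.
# wrap_pre is a hypothetical helper; it reads stdin or the files given as arguments.
wrap_pre() {
    printf '<html><body><pre>\n'
    cat "$@"          # the tagged lines, one record per line
    printf '</pre></body></html>\n'
}

# Usage:
#   mark_the_lines_somehow | wrap_pre > report.html
```

GNU diff also has -t (--expand-tabs), which may help if the side-by-side output contains tabs that a browser renders at a different width.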

drl · February 20, 2015, 12:16pm

Hi.

Also a few available utilities:

#!/usr/bin/env bash

# @(#) s1	Demonstrate colorize diff output.
# ANSIfilter:
# André Simon - Startseite

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C diff colordiff ansifilter

pl " Input data files data?:"
head data?

pl " Results:"
diff -y --suppress-common-lines data? |
colordiff |
ansifilter -B   # -B bbcode; -H html; -L latex, etc.

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
diff (GNU diffutils) 2.8.1
colordiff diff (GNU diffutils) 2.8.1
ansifilter - ( local: ~/executable/ansifilter, 2014-01-28 )

-----
 Input data files data?:
==> data1 <==
ABCD*DEFG~ 
HI*JK~
LMN*OP~

==> data2 <==
ABCD*DEFG~
HIH*JK~
LMN*OP~
FGH*NM~

-----
 Results:
ABCD*DEFG~                            | ABCD*DEFG~
HI*JK~                                | HIH*JK~
                                      > FGH*NM~

Best wishes ... cheers, drl

Badhrish · February 24, 2015, 4:05am

I am not good with C. Could you please let me know exactly where to feed the input files into your code? Also, is there a way to colour the text exactly at the position where the discrepancy is? Something like below:

ABCD*DEFG~                            | ABCDE*DEFG~
HI*JK~                                | HIH*JK~
                                      > FGH*NM~

drl · February 24, 2015, 8:43am

Hi, Badhrish.

The heart of the solution is these lines:

diff -y --suppress-common-lines data? |
colordiff |
ansifilter -B   # -B bbcode; -H html; -L latex, etc.

The input files are provided just as you did with your diff, except that I called them data1 and data2, and I used the shell meta-character "?" to allow expansion of those filenames.

If you are just looking at the output at a terminal, then ansifilter is not required. I used it to produce bbcode markup to paste here. There are other uses, for example if you would be including the output in an HTML email message.

Nothing occurs to me off-hand, but a Google search might be useful. If I get some time, I'll look into it.

Best wishes ... cheers, drl

Badhrish · February 25, 2015, 9:53am

Thanks for your time DRL. Please let me know if you can crack it. I am also trying my best here.

I came across this error while running this code. Please suggest how to fix it.

$ ./newfilecompare.bash

-----
 Input data files data?:
==> File1.txt <==
ABCD*DEFG~
HI*JK~
LMN*OP~

==> File2.txt <==
ABCD*DEFG~
HIH*JK~
LMN*OP~
FGH*NM~

-----
 Results:
./newfilecompare.bash: line 21: colordiff: command not found
./newfilecompare.bash: line 22: ansifilter: command not found

---------- Post updated at 08:23 PM ---------- Previous update was at 01:11 PM ----------

Hi Rudi, this time my input is XML, so the font tags get bypassed in my HTML report. So I first converted my difference report into HTML using enscript; in its output all "<" and ">" symbols are converted to their underlying entity codes.

Input1:

<LIST>                                  
<ControlSegment                         
ISACONTROLNUMBER="58677398"             
GSCONTROLNUMBER="58677398"              
groupControlNumber="58677398"           
time="21:31:03.0130000-08:00"  />				
</LIST>

Input2:

<LIST>
<ControlSegment
ISACONTROLNUMBER="58677399"     
GSCONTROLNUMBER="58677399"      
groupControlNumber="58677399"   
time="21:31:03.2570000-08:00" /> 
entityIdentifierCode2=""
</LIST>

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Enscript Output</TITLE>
</HEAD>
<BODY>
<A NAME="top">
<A NAME="file1">
<H1>xmlcompare.txt</H1>

<PRE>
<LIST>                                                          				<LIST>
<ControlSegment                                                           				 <ControlSegment
ISACONTROLNUMBER="58677398"                                                           |     ISACONTROLNUMBER="58677399"                     
GSCONTROLNUMBER="58677398"                                                            |     GSCONTROLNUMBER="58677399"        
groupControlNumber="58677398"                                                         |       groupControlNumber="58677399"                 
time="21:31:03.0130000-08:00"  />                                                  |       time="21:31:03.2570000-08:00" />      
												  >    entityIdentifierCode2=""
</LIST>												</LIST>

</PRE>
<HR>
<ADDRESS>Generated by <A HREF="http://www.iki.fi/~mtr/genscript/">GNU enscript 1.6.4</A>.</ADDRESS>
</BODY>
</HTML>

Now I fed the HTML file to the awk script, and in the output every single line was wrapped in <font> tags (this time the symbols are not replaced by their entity codes, so they appear as actual HTML tags), and when viewed in the browser all of the data appears red. Is there a better way to handle this? I am really stuck here. Thank you.

awk     'FNR==NR {T[$0]; next}
         $0 in T {printf "\"<font color=\"red\">\"" $0 "\"</font>\""; next} 1 ' <(cat `pwd`/editedfile.html) <(cat `pwd`/editedfile.html)

drl · February 25, 2015, 9:55am

Hi.

The message colordiff: command not found means just that. It may be on your system, but you have not included its location into your PATH variable -- a set of locations in which the shell looks for commands. Another reason might be that it is not installed on your system. Try running the command:

which colordiff

which on my main system produces:

/usr/bin/colordiff

On another system to which I have access, for example:

which colordiff

produces:

because, although it is available for that system, I have not installed it from the repository. On that system:

yum info colordiff

produces:

Loaded plugins: fastestmirror, refresh-packagekit, security
Loading mirror speeds from cached hostfile
 * base: mirrors.gigenet.com
 * centosplus: mirrors.cmich.edu
 * contrib: mirror.ubiquityservers.com
 * epel: ftp.osuosl.org
 * extras: mirror.us.leaseweb.net
 * updates: ftp.osuosl.org
Available Packages
Name        : colordiff
Arch        : noarch
Version     : 1.0.9
Release     : 3.el6
Size        : 23 k
Repo        : epel
Summary     : Color terminal highlighter for diff files
URL         : http://colordiff.sourceforge.net/
License     : GPLv2+
Description : Colordiff is a wrapper for diff and produces the same output but
            : with pretty syntax highlighting.  Color schemes can be customized.

You have not mentioned what system you are working on, so I cannot help further with this issue. In the worst case, the command may not be available at all for your hardware/software platform, or you may need to ask the system administrator (SA) to install it. On many Linux systems, if you are the SA, you may be able to install it yourself. Since it has been available on sourceforge, it may also be possible to install it in your personal files and use it from there.
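A script can also check for the optional tools itself before using them. The following is a rough sketch, not a tested recipe for any particular platform: have and render_diff are hypothetical names, command -v is the POSIX way to ask whether the shell can find a command, and -H is the ansifilter HTML switch mentioned in the script comment earlier in this thread.

```shell
#!/bin/sh
# Sketch: use colordiff and ansifilter only when the shell can find them.
# have and render_diff are hypothetical helper names.
have() {
    command -v "$1" >/dev/null 2>&1
}

render_diff() {
    # $1, $2 = the two files to compare
    if have colordiff && have ansifilter; then
        diff -y --suppress-common-lines "$1" "$2" | colordiff | ansifilter -H
    else
        echo "colordiff/ansifilter not found; plain diff output follows" >&2
        diff -y --suppress-common-lines "$1" "$2"
    fi
}

# Usage:
#   render_diff data1 data2 > report.html
```

Note that diff itself exits with status 1 when the files differ, so the exit status of render_diff should not be read as an error indicator.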

As I mentioned earlier, the ansifilter is not strictly necessary unless you intend to use the colored results other than on the terminal.

If all this is too much to take in or too much work for this task, then the other previous solutions may be a better use of your time.

I have found something which does illustrate character-level differences (insertions, deletions, replacements), but not in color. An advantage is that it is a shell script, so you would probably be able to use it easily, but no color is involved (although an enterprising person might be able to add color).
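As a rough idea of how colour could be added at the character level, one simple approach is to strip the longest common prefix and suffix of the two versions of a record and tag whatever is left in the middle. This is a sketch of that technique only, not the script referred to above; mark_changed_span is a hypothetical name, and it handles a single changed region per line.

```shell
#!/bin/sh
# Sketch: tag only the part of the new text that differs from the old text,
# found via the longest common prefix and suffix.
# mark_changed_span is a hypothetical helper name.
mark_changed_span() {
    # $1 = old text, $2 = new text (passed through the environment so that
    # backslashes in the data are not interpreted by awk)
    OLD="$1" NEW="$2" awk 'BEGIN {
        a = ENVIRON["OLD"]; b = ENVIRON["NEW"]
        if (a == b) { print b; exit }
        la = length(a); lb = length(b)
        # p = length of the common prefix
        p = 0
        while (p < la && p < lb && substr(a, p + 1, 1) == substr(b, p + 1, 1)) p++
        # s = length of the common suffix that does not overlap the prefix
        s = 0
        while (s < la - p && s < lb - p && substr(a, la - s, 1) == substr(b, lb - s, 1)) s++
        mid = substr(b, p + 1, lb - p - s)
        printf "%s<font color=\"red\">%s</font>%s\n", substr(b, 1, p), mid, substr(b, lb - s + 1)
    }'
}

# Usage:
#   mark_changed_span 'HI*JK~' 'HIH*JK~'
```

The data is not HTML-escaped here, so records containing "<", ">" or "&" would need escaping before the tags are added.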

Best wishes ... cheers, drl

( Edit 1: correct minor typos )

RudiC · February 25, 2015, 10:27am

What happens if you apply my UNALTERED script to the two files that contain the results of the two diff operations?
Don't cat the input files, don't cat the diff results; awk is well able to read those results.
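To see why feeding the same file twice tags every line: the first pass stores every line of that file in T, so on the second pass every line is found there. A small sketch that only counts the lines the idiom would tag (count_marked is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: count how many lines of the second input the two-file awk idiom would tag.
# count_marked is a hypothetical helper name.
count_marked() {
    # $1 = first input (lines to look for), $2 = second input (lines to print)
    awk 'FNR==NR {T[$0]; next}    # first input: remember each line
         $0 in T {n++}            # second input: count lines seen before
         END {print n + 0}' "$1" "$2"
}

# Usage:
#   count_marked changed.txt all.txt   # only the differing lines are counted
#   count_marked all.txt all.txt       # every line is counted
```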

Badhrish · February 26, 2015, 1:46am

Hi DRL, sorry that I forgot to mention my system info. It is GNU/Linux, and I found that the colordiff package is not present. Since I am focusing on multi-platform compatibility (UNIX/Linux), I would not be able to use this command in my script. I really appreciate your help thus far.

---------- Post updated at 12:16 PM ---------- Previous update was at 11:58 AM ----------

Hi Rudi, your awk works perfectly for my XML files too. But when I convert my final text file into an HTML report using the enscript command, all "<" and ">" symbols are converted to the underlying codes "&lt;" and "&gt;". This affects the font tags too (which is not desired):

&lt;font color="red"&gt; and &lt;/font&gt;

Hence in the browser they appear as is. That is why I applied the enscript command first and then ran awk over the resulting file, to get proper output like below, but this time, as I mentioned earlier, all the lines are affected.

<font color="red"> and </font>
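One way around this ordering problem, sketched here under the assumption that the report is built by hand rather than by enscript, is to escape the data first and add the real tags afterwards. html_escape is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: escape HTML metacharacters in the data, so that tags added later
# are the only real markup. html_escape is a hypothetical helper name.
html_escape() {
    # "&" must be replaced first so the entities added afterwards stay intact.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Usage (assumes a diff that supports -y, such as GNU diff):
#   diff -y --suppress-common-lines file1 file2 | html_escape > changed.txt
#   diff -y file1 file2 | html_escape > all.txt
#   then run the two-file awk on changed.txt and all.txt
```

The escaped lines are longer as text, but inside a <pre> element the browser renders each entity as a single character, so the columns should still line up.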

Badhrish · March 13, 2015, 5:22am

Hello World, I have managed to encode the XML data (using sed replacements), which displays the XML data as is in the browser while still applying the HTML attributes properly. I hope this will be useful for others who come across such a requirement. Thanks to Rudi and DRL for their code snippets.

awk 'FNR==NR {T[$0]; next}
     $0 in T {printf "<font color=\"red\">" $0 "</font>"; next} 1 ' \
  <(diff --width=220 -y --suppress-common-lines FILE[12] |
    sed -e "s/\(.\)\([A_Za-z0-9]*\)\(>\)\(.\)/\1\2\&gt;\4/g" \
        -e "s/\(.\)\([A_Za-z0-9]*\)\(>\)/\1\2\&gt;/g" \
        -e "s/\(<\)\([A_Za-z0-9]*\)\(.*\)/\&lt;\2\3/g" \
        -e "s/\(.\)\(<\)\([A_Za-z0-9]*\)\(.\)/\1\&lt;\3\4/g") \
  <(diff --width=$WIDTH -y FILE[12] |
    sed -e "s/\(.\)\([A_Za-z0-9]*\)\(>\)\(.\)/\1\2\&gt;\4/g" \
        -e "s/\(.\)\([A_Za-z0-9]*\)\(>\)/\1\2\&gt;/g" \
        -e "s/\(<\)\([A_Za-z0-9]*\)\(.*\)/\&lt;\2\3/g" \
        -e "s/\(.\)\(<\)\([A_Za-z0-9]*\)\(.\)/\1\&lt;\3\4/g") |
  sed -r 's/([^^>])(<font>)/\1\n\2/g;s/(<\/font>)([^$>])/\1\n\2/g' > XMLTEMP1.txt