Sort html based on .jar, .war file names and still keep text within three groups.

The zipdiff GNU EAR comparison tool produces HTML output divided into three sections: "Added", "Removed", and "Changed". I want the entries in each section to be sorted by jar or war file name.

<html>
<body>
<table>
<tr>
<td class="diffs" colspan="2">Added </td>
</tr>
<tr><td>
<ul>
<li>jar1.jar/com/aaa/bbbb/cc/file1.class</li>
<li>jar1.jar/com/aaa/bbbb/cc/file3.class</li>
<li>jarname.jar/com/aaa/bbbb/cc/dd/filename.class</li>
<li>jarname.jar/com/aaa/bbbb/cc/ee/af/fileblala.class</li>
</ul>
<tr>
<td class="diffs" colspan="2">Removed </td>
</tr>
<tr><td>
<ul>
<li>jar3.war/com/aaa/bbbb/cc/file5.class</li>
<li>jarbla.jar/com/aaa/bbbb/cc/file6.class</li>
<li>jarblabla.war/com/aaa/bbbb/cc/ee/fa/afd/filenamefa.class</li>
<li>jar3.war/com/aaa/bbbb/cc/affa/faf/wrw/filenaa.class</li>
</ul>
<tr>
<td class="diffs" colspan="2">chagned </td>
</tr>
<ul>
<tr><td>
<li>jar4.jar/com/aaa/bbbb/cc/filefsaf.class</li>
<li>jarfsadf.war/com/aaa/bbbb/cc/filedfasf.class</li>
<li>jar4.jar/com/aaa/bbbb/cc/file11.class</li>
<li>jardfasdf.war/com/aaa/bbbb/cc/rr/ryy/filedfasf.class</li>
</ul>
</td>
</tr>
</table>

The expected output has the entries sorted by .jar and .war file names within the
"Added, Removed, Changed" sections. I think awk or sed can do this with inline replacing.

<html>
<body>
<table>
<tr>
<td class="diffs" colspan="2">Added </td>
</tr>
<tr><td>
<ul>
<li>jar1.jar/com/aaa/bbbb/cc/file1.class</li>
<li>jar1.jar/com/aaa/bbbb/cc/file3.class</li>
<li>jarname.jar/com/aaa/bbbb/cc/dd/filename.class</li>
<li>jarname.war/com/aaa/bbbb/cc/ee/af/fileblala.class</li>
</ul>
<tr>
<td class="diffs" colspan="2">Removed </td>
</tr>
<tr><td>
<ul>
<li>jar3.war/com/aaa/bbbb/cc/file5.class</li>
<li>jar3.war/com/aaa/bbbb/cc/affa/faf/wrw/filenaa.class</li>
<li>jarbla.jar/com/aaa/bbbb/cc/file6.class</li>
<li>jarblabla.war/com/aaa/bbbb/cc/ee/fa/afd/filenamefa.class</li>
</ul>
<tr>
<td class="diffs" colspan="2">chagned </td>
</tr>
<ul>
<tr><td>
<li>jar4.jar/com/aaa/bbbb/cc/filefsaf.class</li>
<li>jar4.jar/com/aaa/bbbb/cc/file11.class</li>
<li>jarfsadf.war/com/aaa/bbbb/cc/filedfasf.class</li>
<li>jardfasdf.war/com/aaa/bbbb/cc/rr/ryy/filedfasf.class</li>
</ul>
</td>
</tr>
</table>

Making a few wild assumptions:

  1. The pathnames of the files in all three lists have exactly six components. The first component is of the form jarDIGITS.SUFFIX, where DIGITS is a string of one or more decimal digits and SUFFIX is either jar or war; the second through fifth components are the same in all of the pathnames; and the sixth component is of the form fileDIGITS2.class, where DIGITS2 is another string of one or more decimal digits.
  2. The first component in the pathnames will not have the same digits string for both a .jar directory name and a .war directory name.
  3. Even though the <ul> HTML tag might be misplaced in some groups, an </ul> HTML tag will always appear on the first line after the lines containing the <li> HTML tags for each table.
  4. And, the <li> HTML tag always appears at the start of a line.

The following works with your provided sample input:

awk '
BEGIN {	cmd = "sort -t/ -k1.8,1n -k6.5,6n"
}
/<li>/ {print | cmd
	next
}
/<\/ul>/ {
	close(cmd)
}
1' file.html

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
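
To make the two sort keys less mysterious, here is a small sketch that feeds a few made-up lines (hypothetical names that fit the assumptions above) straight to the same sort command. With -t/ the field separator is a slash; -k1.8,1n starts the first key at the 8th character of field 1, just past "<li>jar", and -k6.5,6n starts the second key at the 5th character of field 6, just past "file". Both keys are compared numerically.

```shell
# Hypothetical sample lines; the names are made up for illustration.
sorted=$(printf '%s\n' \
	'<li>jar10.jar/com/aaa/bbbb/cc/file2.class</li>' \
	'<li>jar2.war/com/aaa/bbbb/cc/file10.class</li>' \
	'<li>jar2.war/com/aaa/bbbb/cc/file9.class</li>' |
	sort -t/ -k1.8,1n -k6.5,6n)
printf '%s\n' "$sorted"
```

Because the comparison is numeric, the jar2 lines come before the jar10 line, and file9 comes before file10.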

Don,
I have corrected my input. I only need to sort on the first column (i.e., the .jar/.war names);
sort -t/ -k1 is working fine.

This code prints only the lines that start with "<li>". How can I keep the HTML around them?
The output is divided into three groups: Added/Removed/Changed. Could you try to keep these?

In the code I suggested:

awk '
BEGIN {	cmd = "sort -t/ -k1.8,1n -k6.5,6n"
}
/<li>/ {print | cmd
	next
}
/<\/ul>/ {
	close(cmd)
}
1' file.html

The 1 on the last line of the awk code prints the lines that you say are not being printed. I can only assume that you did not copy that part of my suggestion into the code you used.

With your new requirements, the line in my suggestion:

BEGIN {	cmd = "sort -t/ -k1.8,1n -k6.5,6n"

can be simplified to just:

BEGIN {	cmd = "sort"

With this change to my suggestion and your new sample input, the output produced is:

<html>
<body>
<table>
<tr>
<td class="diffs" colspan="2">Added </td>
</tr>
<tr><td>
<ul>
<li>jar1.jar/com/aaa/bbbb/cc/file1.class</li>
<li>jar1.jar/com/aaa/bbbb/cc/file3.class</li>
<li>jarname.jar/com/aaa/bbbb/cc/dd/filename.class</li>
<li>jarname.jar/com/aaa/bbbb/cc/ee/af/fileblala.class</li>
</ul>
<tr>
<td class="diffs" colspan="2">Removed </td>
</tr>
<tr><td>
<ul>
<li>jar3.war/com/aaa/bbbb/cc/affa/faf/wrw/filenaa.class</li>
<li>jar3.war/com/aaa/bbbb/cc/file5.class</li>
<li>jarbla.jar/com/aaa/bbbb/cc/file6.class</li>
<li>jarblabla.war/com/aaa/bbbb/cc/ee/fa/afd/filenamefa.class</li>
</ul>
<tr>
<td class="diffs" colspan="2">chagned </td>
</tr>
<ul>
<tr><td>
<li>jar4.jar/com/aaa/bbbb/cc/file11.class</li>
<li>jar4.jar/com/aaa/bbbb/cc/filefsaf.class</li>
<li>jardfasdf.war/com/aaa/bbbb/cc/rr/ryy/filedfasf.class</li>
<li>jarfsadf.war/com/aaa/bbbb/cc/filedfasf.class</li>
</ul>
</td>
</tr>
</table>

which seems to meet your requirements.
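
One caveat: a plain sort compares whole lines, so entries from the same archive are also reordered by the rest of their path (compare the jar3.war lines above with the expected output in the question). If the original order within each archive should be kept, a command along the following lines might be closer. This is only a sketch: it assumes a sort that supports -s (stable sort), which GNU and BSD sort do but POSIX does not require, and it sets LC_ALL=C purely so the result does not depend on the locale.

```shell
# Sketch only: -s (stable) is a GNU/BSD sort extension, not POSIX.
# The lines are the "Removed" group from the sample input.
stable=$(printf '%s\n' \
	'<li>jar3.war/com/aaa/bbbb/cc/file5.class</li>' \
	'<li>jarbla.jar/com/aaa/bbbb/cc/file6.class</li>' \
	'<li>jarblabla.war/com/aaa/bbbb/cc/ee/fa/afd/filenamefa.class</li>' \
	'<li>jar3.war/com/aaa/bbbb/cc/affa/faf/wrw/filenaa.class</li>' |
	LC_ALL=C sort -s -t/ -k1,1)
printf '%s\n' "$stable"
```

Here only the first slash-separated field is used as the key, and lines with equal keys stay in input order, so file5.class remains ahead of filenaa.class. To try it in the awk script, the same command string would go into cmd.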


Thanks, Don. My mistake: I did not include the 1. It's working like magic. What do close(cmd) and 1 do? Do you mind explaining how this works?

Note that awk statements take the form:

condition {action}

where condition is evaluated for each input line and, if it yields a value of TRUE, a non-zero numeric value, or a non-empty string value, the actions specified by action are performed for that input line. If no condition is specified, the given action is performed for every line. If a condition is specified and no action is given, the input line is printed if the condition evaluates to true.
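
As a small illustration of those three forms (a made-up example with made-up input, separate from the script under discussion), the following awk program has one statement with a condition and an action, one with an action only, and one with a condition only:

```shell
# Made-up three-line input to show each statement form.
forms=$(printf '%s\n' alpha beta gamma |
awk '
/beta/ {print "matched: " $0}	# condition and action
{count++}			# no condition: runs for every line
count == 3			# no action: print the line if true
')
printf '%s\n' "$forms"
```

Only the beta line triggers the first statement, the counter is bumped for every line, and the bare condition prints the third line (gamma) unchanged. The lone 1 in the script is the same idea with a condition that is always true.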

awk '			# Use awk to interpret the following script...

BEGIN {	cmd = "sort"	# Before any lines are read from the input file, define
}			# cmd to be the command through which lines containing
			# "<li>" will be piped.

/<li>/ {print | cmd	# Send any line containing "<li>" to the sort command.
	next		# Restart processing with the next input line (skipping
			# later steps in this script for the current input
			# line).
}
/<\/ul>/ {		# When a line is found containing "</ul>", close the
	close(cmd)	# pipe to the sort command (forcing sort to print the
}			# lines that have been written to it in sorted order).

1			# Since no action is specified for this condition and
			# the default action is to print the current input line,
			# print the current input line.

' file.html		# End the awk script and specify the input file(s) to be
			# processed.
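
To see the effect of close(cmd) in isolation, here is a stripped-down sketch with made-up input: two groups of words, each ended by a </ul> line. Everything except the </ul> lines is piped to sort, and the </ul> lines themselves are dropped to keep the output short.

```shell
# Made-up input: two groups, each ended by a </ul> line.
grouped=$(printf '%s\n' zebra yak '</ul>' bee ant '</ul>' |
awk '
BEGIN {	cmd = "sort"
}
/<\/ul>/ {
	close(cmd)	# sort prints this group now; the next
	next		# print | cmd starts a fresh sort
}
{print | cmd}
')
printf '%s\n' "$grouped"
```

Each group is sorted on its own, so the output is yak, zebra, ant, bee. Without the close(cmd) call there would be a single sort for the whole input, and all four words would be sorted together and printed only when awk exits.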

Does this answer your questions?
