Deleting table cells in a script

I'd like to use sed or awk for this, but I'm weak on both, and on regular expressions too. I'm looking for a way to count to the 7th table data cell within each table row and, when that condition is met, delete "<td>and everything in between</td>". The table header starts on the same line every time, so that part I can delete easily with sed.

Stumped on how to get rid of the other data in that column. Also, the table the script retrieves may vary in length and this is the reason why I'd like it scripted as I've described. If you have any better ideas, I'm open to them.

For those who like GUIs, here's a simple diagram:

qrstuvwxyz
qrstuvwxyz
qrstuvwxyz
qrstuvwyz
qrstuvwyz
qrstuvwyz

Do you mean *every* 7th occurrence of <td> within *every* <tr> in a file?

Almost there...

sed -n '/<tr>/,/<\/tr> {
           s/.*<tr>//
           s/<\/tr>.*//
           p
           }' /path/to/my/file

That's correct. If that condition is met I need it to do the above. Thanks.

This is the simplest answer: if the table data is always the same and appears ONLY in that 7th cell and nowhere else, you could use this:

 
cat myfile |sed 's|<td>x</td>||g'

NOTE
I am using "|" as the delimiter and not "/", so I do not have to escape the "/" in "</td>"...
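
A quick way to see the delimiter choice in action. The one-line sample row is made up for the demo, not taken from the real page:

```shell
# Hypothetical sample row, not from the real page.
row='<tr><td>a</td><td>x</td><td>b</td></tr>'

# With "/" as the delimiter, the "/" in "</td>" must be escaped.
with_slash=$(printf '%s\n' "$row" | sed 's/<td>x<\/td>//g')

# With "|" as the delimiter, no escaping is needed.
with_pipe=$(printf '%s\n' "$row" | sed 's|<td>x</td>||g')

printf '%s\n' "$with_slash" "$with_pipe"
```

Both commands should print the same row with the "x" cell gone.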

However, if other table data fields contain the same text and you wish to keep those cells, you could use a loop to parse the tokens and count the "<td>" occurrences, then test whether this cell is the one you are looking for before handing it to sed.

SIMPLE EXAMPLE OF LOOP:

OUTFILE=/some/file
TD=0
CT=0
cat myfile |while read LINE
do
    # Check to see if the LINE is non-empty, and has a <td> tag in it.
    if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ]
    then
        # Increase the TD counter by 1
        CT=`echo "$CT+1" |bc`
        
        # Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
        if [ "$CT" -eq 6 ]
        then
            # Use sed to remove this TD tag
            echo $LINE |sed 's|<td>x</td>||' >> $OUTFILE
        else 
            echo $LINE >> $OUTFILE
        fi
    else
        echo $LINE >> $OUTFILE
    fi
    
    # If we are leaving a table row then we need to reset the TD counter!
    if [ -n "$LINE" -a `echo $LINE |grep "</tr>"` != "" ]
    then
        CT=0
    fi
done

Note that this does NOT account for multiple "<td>" tags on one line.
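
If several cells can share a line, one possible way to count them is awk's gsub(), which returns the number of substitutions it made. This is only a sketch on made-up input; it counts the tags and does not delete anything:

```shell
# Hypothetical input: the first row has three cells on one line.
sample='<tr><td>a</td><td>b</td><td>c</td></tr>
<tr>
<td>d</td>
</tr>'

# gsub() returns the number of replacements; replacing each match
# with itself ("&") leaves the line unchanged but yields the count.
counts=$(printf '%s\n' "$sample" | awk '{ n = gsub(/<td>/, "&"); print n }')

printf '%s\n' "$counts"
```

The per-line counts could then be added to CT instead of always adding 1.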

Ok I think I follow and I'm going to combine with my sed script to test. Please review it below as I've made some revisions. Remember, my goal is to remove everything between the table data tags and the content within will vary, it'll never be the same.

#!/bin/sh

#TD=0
CT=0
cat oldfile.html |while read LINE
do
    # Check to see if the LINE is non-empty, and has a <td> tag in it.
    if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ]
    then
        # Increase the TD counter by 1
        CT=`echo "$CT+1" |bc`
        
        # Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
        if [ "$CT" -eq 6 ]
        then
            # Use sed to remove this TD tag AND everything in between
            echo $LINE |sed -n '/<tr>/,/<\/tr> {
					    s/.*<tr>//
					    s/<\/tr>.*//
					    p
					    }' >> newfile.html
        else 
            echo $LINE >> newfile.html
        fi
    else
        echo $LINE >> newfile.html
    fi
    
    # If we are leaving a table row then we need to reset the TD counter!
    if [ -n "$LINE" -a `echo $LINE |grep "</tr>"` != "" ]
    then
        CT=0
    fi
done

But I have two questions regarding your script. What is the var TD for? Notice I've commented it out for now. Lastly, since you set CT to 0, on the very first count it will be 0, not 1, correct?

OK I created a file named testHTM.txt with the following

$ cat testHTM.txt
<table>
<tr>
<td>x</td>
<td>y</td>
<td>z</td>
</tr>
</table>


then I ran this against that file

$ OUT=`cat testHTM.txt` && echo $OUT |sed 's/<td>[x-z]<\/td>//1'
<table> <tr>  <td>y</td> <td>z</td> </tr> </table>

$ OUT=`cat testHTM.txt` && echo $OUT |sed 's/<td>[x-z]<\/td>//2'
<table> <tr> <td>x</td>  <td>z</td> </tr> </table>

$ OUT=`cat testHTM.txt` && echo $OUT |sed 's/<td>[x-z]<\/td>//3'
<table> <tr> <td>x</td> <td>y</td>  </tr> </table>


Notice the number at the end of the sed expression, which sets the occurrence you wish to change: setting it to 1 removes "<td>x</td>", setting it to 2 removes "<td>y</td>", and so on.

Placing this code inside a loop that checks whether you are between "<tr>" and "</tr>" tags, and setting it to handle occurrence 7, would do it.
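
If each row happens to sit on a single line, a sketch of that idea needs no loop at all, because sed applies the numeric flag to each line separately. The sample row and the [^<]* pattern (cell text with no nested tags) are assumptions made for the demo:

```shell
# Hypothetical row with eight simple cells on one line.
row='<tr><td>q</td><td>r</td><td>s</td><td>t</td><td>u</td><td>v</td><td>w</td><td>x</td></tr>'

# Only lines containing <tr> are edited; the trailing 7 tells sed to
# replace just the 7th match on that line.
result=$(printf '%s\n' "$row" | sed '/<tr>/ s|<td>[^<]*</td>||7')

printf '%s\n' "$result"
```

Rows with fewer than seven cells are left untouched, since there is no 7th match to replace.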

Oops, yeah, TD was an extra that I never used, so it is safe to comment out or remove.

and the CT question:

    if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ]
    then
        # Increase the TD counter by 1
        CT=`echo "$CT+1" |bc`
        
        # Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
        if [ "$CT" -eq 6 ]

You should change "-eq 6" to "-eq 7" as we increment before we test... again oops
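
A tiny sketch of why the number is 7: the counter is bumped before it is compared. The loop over eight dummy cells is made up, and it uses the shell's own $((...)) arithmetic rather than bc, which is a substitution on my part:

```shell
# Count hypothetical cells the way the loop does: add 1, then compare.
CT=0
hit=""
for cell in 1 2 3 4 5 6 7 8; do
    CT=$((CT + 1))            # increment first ...
    if [ "$CT" -eq 7 ]; then  # ... so the 7th cell is seen when CT is 7
        hit=$cell
    fi
done
printf 'matched cell %s\n' "$hit"
```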

Oddly, on my Linux box, I cannot make your code work...

echo $LINE |sed -n '/<tr>/,/<\/tr> {
					    s/.*<tr>//
					    s/<\/tr>.*//
					    p
					    }' >> newfile.html

unless I add a trailing "/" before the "{"...

echo $LINE |sed -n '/<tr>/,/<\/tr>/ {
					    s/.*<tr>//
					    s/<\/tr>.*//
					    p
					    }' >> newfile.html

Just an FYI :slight_smile:

Also

echo $LINE

yields very different results than its "quoted" counterpart...

echo "$LINE"
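
A small demonstration of the difference, using a made-up line with extra spaces (assuming a Bourne-style shell such as sh or bash):

```shell
# Hypothetical line with leading spaces and a run of inner spaces.
LINE='   <td>a    b</td>'

# Unquoted: the shell splits the value into words and echo rejoins
# them with single spaces, so the original spacing is lost.
unquoted=$(echo $LINE)

# Quoted: the value is passed through as one argument, spacing intact.
quoted=$(echo "$LINE")

printf '[%s]\n[%s]\n' "$unquoted" "$quoted"
```

Unquoted expansion is also subject to filename globbing, so a line containing "*" could come out as a list of file names.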

k, I have a few issues/questions at this point. When I run my script it errs:

[: !=: unexpected operator
[: !=: unexpected operator 

It seems that part of ddreggors' code is breaking in the test command. I'm going to read "man test" to see if I can dig up something there, here, or elsewhere.

Secondly, and please bear with me on this as I'm still learning: what can I do to tell the script to 'do nothing and keep going' instead of echo "blah" in my loop? I feel like I'm just filling in the blanks here because I'm stumped, since I'm sure that if I leave it out, it'll break. Would the solution be to just echo to /dev/null? :confused:

Third, ddreggors, I'm looking around right now, but if I'm going to use your sed example I'll need an expression a little more complex than yours, since the range of characters goes beyond just [x-z]. I think what I need is [a-zA-Z0-9]. It also needs to include "(|)|:|.|,|/" (parentheses, colons, periods, commas, slashes, if I noted that right). I'll try with my own sed example first, then explore later if need be.
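
For what it's worth, a sketch of two ways such a character class could be written. The cell text is invented, and both patterns assume the cell holds no nested tags:

```shell
# Hypothetical cell whose text mixes letters, digits and punctuation.
cell='<td>(Curtis Blow): CASE 12, a/b.</td>'

# Inside [...] the "|" is an ordinary character, not "or", so the
# characters are simply listed; the trailing * repeats the class.
listed=$(printf '%s\n' "$cell" | sed 's|<td>[a-zA-Z0-9 ():.,/]*</td>||')

# A negated class says "anything up to the next tag" instead.
negated=$(printf '%s\n' "$cell" | sed 's|<td>[^<]*</td>||')

printf '[%s] [%s]\n' "$listed" "$negated"
```

Without the "*", a bracket expression matches exactly one character, so only single-character cells would be removed.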

#!/bin/sh

#TD=0
CT=0
cat status.html |while read LINE
do
    # Check to see if the LINE is non-empty, and has a <td> tag in it.
    if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ] ; then
        # Increase the TD counter by 1
        CT=`echo "$CT+1" |bc`
        
        # Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
        if [ "$CT" -eq 6 ] ; then
            # Use sed to remove this TD tag AND everything in between
            echo $LINE |sed -n '/<tr>/,/<\/tr> {
                                            s/.*<tr>//
                                            s/<\/tr>.*//
                                            p
                                            }' >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
        else 
            echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
        fi
    else
        echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
    fi
    
    # If we are leaving a table row then we need to reset the TD counter!
    if [ -n "$LINE" -a `echo $LINE |grep "</tr>"` != "" ] ; then
                CT=0
    else
    	echo "No reset"
	fi
	
    if [ -n "$LINE" -a `echo $LINE |grep "</html>"` != "" ] ; then 
                mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 status.html
		else	
			echo "Not done yet, keep going" 
		fi

done

OK change it some to do this:

    # If we are leaving a table row then we need to reset the TD counter!
    TEST=`echo $LINE |grep '</tr>'`
    if [ -n "$TEST" ] ; then
                CT=0
    else
    	echo "No reset"
	fi
	
    TEST=`echo $LINE |grep '</html>'`
    if [ -n "$TEST" ] ; then 
                mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 status.html
		else	
			echo "Not done yet, keep going" 
		fi

That should fix that, and as for the sed expression, I was sure you WOULD have to change that as I am not sure we have ever seen the exact pattern you are looking for. If you did post that pattern I missed it, sorry.
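
As far as I can tell, the original test broke because the backtick output was unquoted: when grep matches nothing, the expansion vanishes and "[" is left with -n "$LINE" -a != "", hence "unexpected operator". Another way to do the same check, without grep and without an else branch, is a case pattern. The helper name has_tag is made up for this sketch:

```shell
# has_tag LINE TAG: succeed when TAG occurs somewhere in LINE.
# (Hypothetical helper name; case does the substring match without grep.)
has_tag() {
    case $1 in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

CT=5
for LINE in '<td>x</td>' '</tr>' ''; do
    # No else branch is needed: when the test fails, nothing happens.
    if has_tag "$LINE" '</tr>'; then
        CT=0
    fi
done
printf 'CT=%s\n' "$CT"
```

If an else branch is wanted for readability, ":" is the shell's built-in no-op and can stand in for the echo.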

The script doesn't err, but the sed isn't clearing the cells. Here's what I found when I ran it manually on the file:

# cp status.html teststatus.html
# OUT=`cat teststatus.html` && echo $OUT |sed 's/<td>[a-zA-Z0-9|(|)]<\/td>//'
OUT=<?xml: Command not found.
# grep -n "<?xml" teststatus.html
1:<?xml version="1.0" encoding="iso-8859-1"?>
# 

Within the script that brings the page into my box I added:

sed -e '1d' <status.html > ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 ;
mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 sentinalstatus.html

I go to test again and now it complains about something else:

# OUT=`cat teststatus.html` && echo $OUT |sed 's/<td>[a-zA-Z0-9|(|)]<\/td>//'
OUT=<!DOCTYPE: Command not found.
# 
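
The "Command not found." wording looks like a csh-family login shell, where OUT=... is not an assignment. That is a guess from the message, not something stated in the thread. Either way, one sketch that sidesteps the variable is to let sed read the file itself. The file contents below are made up, and the setup lines are sh syntax; only the sed line is the point:

```shell
# Build a small stand-in file (hypothetical content).
tmp=$(mktemp)
printf '%s\n' '<?xml version="1.0"?>' '<tr><td>a</td><td>b</td></tr>' > "$tmp"

# No variable, no echo: sed reads the file itself, so the shell never
# sees the "<" and "?" characters in the page.
result=$(sed 's|<td>a</td>||' "$tmp")

rm -f "$tmp"
printf '%s\n' "$result"
```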

I'm probably missing quotes somewhere I figure. Tried adding them to the var but it doesn't work. Below is an update of what I have so far.

#!/bin/sh

#TD=0
CT=0
cat status.html |while read LINE
do
    # Check to see if the LINE has a closing </td> tag in it.
	TD=`echo $LINE |grep '</td>'`
	if [ -n "$TD" ] ; then
        # Increase the TD counter by 1
        CT=`echo "$CT+1" |bc`
        
        # Check to see if the TD counter is at 7 (we increment before we test)
        if [ "$CT" -eq 7 ] ; then
            # Use sed to remove this TD tag AND everything in between
            echo $LINE |sed 's/<td>[a-zA-Z0-9|(|)]<\/td>//' >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
        else 
            echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
        fi
    else
        echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
    fi
    
    # If we are leaving a table row then we need to reset the TD counter!
    TR=`echo $LINE |grep '</tr>'`
    if [ -n "$TR" ] ; then
                CT=0
    else
    	echo "" > /dev/null
	fi
	
    HTML=`echo $LINE |grep '</html>'`
    if [ -n "$HTML" ] ; then 
        mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 status.html
	else	
		echo "" > /dev/null
	fi

done

On the command line, give this a try:

# export OUT=`cat teststatus.html`
# echo "$OUT" |sed 's/<td>[a-zA-Z0-9|(|)]<\/td>//'

Notice the quotes around the variable in the echo line (echo "$OUT")

Tried, but it didn't work. Assuming the below were in a file by itself, if I can get sed to empty it out then I should be OK.

<TD ALIGN=CENTER>
<A HREF=addcomment.pl?type=li&serv_ip=1.30.33.2 onclick="NewWindow(this.href,'name','500','300','yes');return false;"><I>(Curtis Blow)</I>: CASE IN QUEUE - RAID REBOOT<BR>
<A HREF=/server/singleserveruptime.pl?server_ip=1.30.33.2&time_period=1&days=&start=&end=&submit=Submit><font size=1><i>Click To See Uptime/Assign History</i></font></A></A>
</TD>

As you can see I'm dealing with characters like ? < > , . = & ' ; ( ) / _ etc.

This should do it...

#!/bin/sh

IN=0
CT=0
OUTFILE="TestHTML.out"
: > "$OUTFILE" # Always start with a fresh, empty file (echo would leave a blank first line)

cat TestHTML.htm |while read LINE
do
    # If we are entering a table row then we need to reset the TD counter
    TR=`echo $LINE |grep -i '<tr'`
    if [ -n "$TR" ]
    then
        CT=0
    else
        echo "" > /dev/null
    fi

    # Check to see if the LINE is non-empty, and has an opening td tag in it.
    TD=`echo $LINE |tr -d '\n' |grep -i '<td'`
    if [ -n "$TD" ]
    then
        # We are inside a td tag.
        IN=1
    fi

    # Check to see if the LINE is non-empty and has a closing td tag in it.
    ENDTD=`echo $LINE |tr -d '\n' |grep -i '/td>'`
    if [ -n "$ENDTD" ]
    then
        # We are leaving a td tag.
        IN=0
        # Increase the TD counter by 1
        CT=`echo "$CT+1" |bc`
    fi

    if [ "$IN" -eq 1 -a "$CT" -eq 6 -a -z "$ENDTD" ]
    then
        # Use sed to remove this TD tag AND everything in between
        echo $LINE |tr -d '\n' |sed 's/.*//' >> $OUTFILE
    elif [ "$IN" -eq 0 -a "$CT" -eq 7 ]
    then
        # We may (or may not) have an opening and closing td tag in 1 line.
        TMP=`echo $LINE |tr -d '\n' |sed 's/<TD.*//'`
        echo $TMP |sed 's/.*\/TD>//' >> $OUTFILE
    else
        echo $LINE >> $OUTFILE
    fi
done

This was tested against this file (TestHTML.htm):

<HTML>
<BODY>

<TABLE>
<TR>
<TD>Table Data1</TD>
<TD>Table Data2</TD>
<TD>Table Data3</TD>
<TD>Table Data4</TD>
<TD>Table Data5</TD>
<TD>Table Data6</TD>
<TD ALIGN=CENTER>
<A HREF=addcomment.pl?type=li&serv_ip=1.30.33.2 onclick="NewWindow(this.href,'name','500','300','yes');return false;"><I>(Curtis Blow)</I>: CASE IN QUEUE - RAID REBOOT<BR>
<A HREF=/server/singleserveruptime.pl?server_ip=1.30.33.2&time_period=1&days=&start=&end=&submit=Submit><font size=1><i>Click To See Uptime/Assign History</i></font></A></A>
</TD>
</TR>
</TABLE>

<!-- COMMENT -->

<TABLE>
<TR>
<TD>Table Data1</TD>
<TD>Table Data2</TD>
<TD>Table Data3</TD>
<TD>Table Data4</TD>
<TD>Table Data5</TD>
<TD>Table Data6</TD>
<TD ALIGN=CENTER>
<A HREF=addcomment.pl?type=li&serv_ip=1.30.33.2 onclick="NewWindow(this.href,'name','500','300','yes');return false;"><I>(Curtis Blow)</I>: CASE IN QUEUE - RAID REBOOT<BR>
<A HREF=/server/singleserveruptime.pl?server_ip=1.30.33.2&time_period=1&days=&start=&end=&submit=Submit><font size=1><i>Click To See Uptime/Assign History</i></font></A></A>
</TD>
</TR>
</TABLE>

<!-- COMMENT -->

</BODY>
</HTML>

and the resulting file (TestHTML.out):

<HTML>
<BODY>

<TABLE>
<TR>
<TD>Table Data1</TD>
<TD>Table Data2</TD>
<TD>Table Data3</TD>
<TD>Table Data4</TD>
<TD>Table Data5</TD>
<TD>Table Data6</TD>

</TR>
</TABLE>

<!-- COMMENT -->

<TABLE>
<TR>
<TD>Table Data1</TD>
<TD>Table Data2</TD>
<TD>Table Data3</TD>
<TD>Table Data4</TD>
<TD>Table Data5</TD>
<TD>Table Data6</TD>

</TR>
</TABLE>

<!-- COMMENT -->

</BODY>
</HTML>
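
Since the original question asked about awk, here is a rough sketch of the same counting idea in awk. It is not the script above. It assumes every opening <td tag starts on its own line, as in the test file, and it drops the 7th cell's lines entirely rather than leaving blank ones. The sample input is a shortened stand-in:

```shell
# Hypothetical stand-in for TestHTML.htm: one row, 7th cell spans lines.
sample='<TABLE>
<TR>
<TD>1</TD>
<TD>2</TD>
<TD>3</TD>
<TD>4</TD>
<TD>5</TD>
<TD>6</TD>
<TD ALIGN=CENTER>
<A HREF=x>seven</A>
</TD>
<TD>8</TD>
</TR>
</TABLE>'

result=$(printf '%s\n' "$sample" | awk '
{
    low = tolower($0)                       # match tags case-insensitively
    if (low ~ /<tr/) ct = 0                 # new row: restart the cell count
    if (low ~ /<td/) {                      # a cell opens on this line
        ct++
        if (ct == 7) skip = 1               # 7th cell: start dropping lines
    }
    if (!skip) print
    if (skip && low ~ /<\/td>/) skip = 0    # its closing tag ends the drop
}')

printf '%s\n' "$result"
```

If other cells shared a line with the 7th one they would be dropped too, so this only fits pages laid out like the test file.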

dd, sorry I didn't get back sooner but I just wanted to let you know that the script worked like a charm. Thanks again.

Great! Glad I could help. :slight_smile: