I'd like to use sed or awk to do this, but I'm weak on both, and on regular expressions as well. I'm looking for a way with sed or awk to count to the 7th table data cell within a table row and, once that condition is met, delete "<td>and everything in between</td>". The table header starts on the same line each time, so that part I can delete easily with sed.
I'm stumped on how to get rid of the other data in that column. Also, the table the script retrieves may vary in length, which is why I'd like it scripted as I've described. If you have any better ideas, I'm open to them.
For those who like GUIs, here's a simple diagram:
This is the simplest answer: if the cell content is always the same and appears ONLY in that 7th table data cell and nowhere else, you could use this:
cat myfile |sed 's|<td>x</td>||g'
NOTE
I am using "|" as the delimiter instead of "/" so that I do not have to escape the "/" in "</td>".
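To see that the delimiter choice only affects escaping and not the result, here is a small sketch comparing the two spellings. The sample row is made up for illustration, and the input comes from printf rather than from myfile.

```shell
#!/bin/sh
# Hypothetical sample row, not taken from the real page.
row='<tr><td>a</td><td>x</td><td>b</td></tr>'

# "|" as the delimiter: the "/" in "</td>" needs no escaping.
with_pipe=$(printf '%s\n' "$row" | sed 's|<td>x</td>||g')

# "/" as the delimiter: the "/" in "</td>" has to be written as "\/".
with_slash=$(printf '%s\n' "$row" | sed 's/<td>x<\/td>//g')

echo "$with_pipe"     # <tr><td>a</td><td>b</td></tr>
echo "$with_slash"    # <tr><td>a</td><td>b</td></tr>
```

Any character that does not appear in the pattern or the replacement can serve as the delimiter.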
However, if other table data fields contain the same text and you wish to keep those other cells, you may wish to use a loop that parses the tokens, counts the "<td>" occurrences, and then tests whether the current cell is the one you are looking for before handing the line to sed.
SIMPLE EXAMPLE OF LOOP:
OUTFILE=/some/file
TD=0
CT=0
cat myfile |while read LINE
do
# Check to see if the LINE is non-empty, and has a <td> tag in it.
if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ]
then
# Increase the TD counter by 1
CT=`echo "$CT+1" |bc`
# Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
if [ "$CT" -eq 6 ]
then
# Use sed to remove this TD tag
echo $LINE |sed 's|<td>x</td>||' >> $OUTFILE
else
echo $LINE >> $OUTFILE
fi
else
echo $LINE >> $OUTFILE
fi
# If we are leaving a table row then we need to reset the TD counter!
if [ -n "$LINE" -a `echo $LINE |grep "</tr>"` != "" ]
then
CT=0
fi
done
Note that this does NOT account for multiple "<td>" tags on one line.
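If several "<td>" tags can share a line, one option is to let awk walk through the tags on each line instead of counting lines. The following is only a sketch under stated assumptions: strip_nth_cell is a hypothetical helper name, every tag is assumed to open and close on a single line (the cell content itself may span lines), and nested tables inside the dropped cell are not handled.

```shell
#!/bin/sh
# strip_nth_cell N: hypothetical helper; reads HTML on stdin and drops the
# Nth <td>...</td> of every table row (default N is 7).
strip_nth_cell() {
    awk -v n="${1:-7}" '
    {
        out = ""; rest = $0
        cut = skip      # 1 if this line starts inside the dropped cell
        while (match(rest, /<[^>]*>/)) {
            text = substr(rest, 1, RSTART - 1)
            tag  = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            low  = tolower(tag)
            if (!skip) out = out text
            if (low ~ /^<tr[ >]/) ct = 0          # new row: reset the count
            if (low ~ /^<td[ >]/ && ++ct == n) skip = cut = 1
            if (!skip) {
                out = out tag
            } else if (low ~ /^<\/td/) {
                skip = 0                          # the dropped cell ends here
            }
        }
        if (!skip) out = out rest
        # Skip lines that are empty only because something was cut from them.
        if (out != "" || !cut) print out
    }'
}

# Usage with a made-up row: drop the 2nd cell.
printf '%s\n' '<tr><td>a</td><td>b</td><td>c</td></tr>' | strip_nth_cell 2
# prints: <tr><td>a</td><td>c</td></tr>
```

Because the counter is reset on every "<tr" tag, the same call works whether a row sits on one line or is spread over several.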
Ok I think I follow and I'm going to combine with my sed script to test. Please review it below as I've made some revisions. Remember, my goal is to remove everything between the table data tags and the content within will vary, it'll never be the same.
#!/bin/sh
#TD=0
CT=0
cat oldfile.html |while read LINE
do
# Check to see if the LINE is non-empty, and has a <td> tag in it.
if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ]
then
# Increase the TD counter by 1
CT=`echo "$CT+1" |bc`
# Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
if [ "$CT" -eq 6 ]
then
# Use sed to remove this TD tag AND everything in between
echo $LINE |sed -n '/<tr>/,/<\/tr>/ {
s/.*<tr>//
s/<\/tr>.*//
p
}' >> newfile.html
else
echo $LINE >> newfile.html
fi
else
echo $LINE >> newfile.html
fi
# If we are leaving a table row then we need to reset the TD counter!
if [ -n "$LINE" -a `echo $LINE |grep "</tr>"` != "" ]
then
CT=0
fi
done
But I have two questions regarding your script. What is the variable TD for? Notice I've commented it out for now. Lastly, since you set CT to 0, on the very first count it will be 0, not 1, correct?
Notice the setting for the occurrence you wish to change at the end of the line: setting it to handle occurrence 1 removes "<td>x</td>", setting it to handle occurrence 2 removes "<td>y</td>", and so on.
Placing this code inside a loop that checks whether you are between the "<tr>" and "</tr>" tags, and setting it to handle occurrence 7, would do it.
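A minimal sketch of that occurrence flag, assuming a made-up one-line row and the [x-z] class mentioned later in the thread; it is not necessarily the exact command referred to above:

```shell
#!/bin/sh
# Hypothetical one-line row with three cells.
row='<td>x</td><td>y</td><td>z</td>'

# The number after the last delimiter picks which match on the line
# gets replaced; all other matches are left alone.
first_gone=$(printf '%s\n' "$row" | sed 's|<td>[x-z]</td>||1')
second_gone=$(printf '%s\n' "$row" | sed 's|<td>[x-z]</td>||2')

echo "$first_gone"     # <td>y</td><td>z</td>
echo "$second_gone"    # <td>x</td><td>z</td>
```

The flag counts matches per input line, so it only reaches the 7th cell of a row when the whole row is on one line.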
Oops, yeah, TD was an extra that I never used, so it is safe to comment out or remove.
As for the CT question:
if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ]
then
# Increase the TD counter by 1
CT=`echo "$CT+1" |bc`
# Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
if [ "$CT" -eq 6 ]
You should change "-eq 6" to "-eq 7" as we increment before we test... again oops
It seems that part of ddreggors's code is breaking at the test command. I'm going to read man test to see if I can dig up something there, here, or elsewhere.
Secondly, and please bear with me on this as I'm still learning: what can I do to tell the script to 'do nothing and keep going' instead of echo "blah" in my loop? I feel like I'm just filling in the blanks here, because I'm sure that if I leave it out, it'll break. Would the solution be to just echo to /dev/null?
Third, ddreggors, I'm still looking around, but if I'm going to use your sed example I'll need an expression a little more complex than yours, since the range of characters goes beyond just [x-z]. I think what I need is [a-zA-Z0-9], and it also needs to include "(|)|:|.|,|/" (parentheses, colons, periods, commas, slashes, if I noted that right). I'll try with my own sed example first, then explore later if need be.
#!/bin/sh
#TD=0
CT=0
cat status.html |while read LINE
do
# Check to see if the LINE is non-empty, and has a <td> tag in it.
if [ -n "$LINE" -a `echo $LINE |grep "<td>"` != "" ] ; then
# Increase the TD counter by 1
CT=`echo "$CT+1" |bc`
# Check to see if the TD counter is at 6 (we are at 7th TD as the counter starts at 0 not 1)
if [ "$CT" -eq 6 ] ; then
# Use sed to remove this TD tag AND everything in between
echo $LINE |sed -n '/<tr>/,/<\/tr>/ {
s/.*<tr>//
s/<\/tr>.*//
p
}' >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
else
echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
fi
else
echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
fi
# If we are leaving a table row then we need to reset the TD counter!
if [ -n "$LINE" -a `echo $LINE |grep "</tr>"` != "" ] ; then
CT=0
else
echo "No reset"
fi
if [ -n "$LINE" -a `echo $LINE |grep "</html>"` != "" ] ; then
mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 status.html
else
echo "Not done yet, keep going"
fi
done
# If we are leaving a table row then we need to reset the TD counter!
TEST=`echo $LINE |grep '</tr>'`
if [ -n "$TEST" ] ; then
CT=0
else
echo "No reset"
fi
TEST=`echo $LINE |grep '</html>'`
if [ -n "$TEST" ] ; then
mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 status.html
else
echo "Not done yet, keep going"
fi
That should fix the test. As for the sed expression, I expected that you WOULD have to change it, since I don't think we have ever seen the exact pattern you are looking for. If you did post that pattern, I missed it, sorry.
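For background on why the earlier one-line test breaks: when grep prints nothing, the unquoted backticks expand to nothing at all, so test is left with a dangling "!=" and reports an error; when grep prints a line with several words, test receives too many arguments. Another way around this, sketched below, is to test grep's exit status directly. The sketch also shows ":" (the shell's built-in no-op) for the "do nothing and keep going" question, although an if without an else branch is valid too. has_tag is a hypothetical helper name, and the values of CT and LINE are made up.

```shell
#!/bin/sh
# has_tag LINE TAG: hypothetical helper; succeeds if LINE contains TAG.
# -q prints nothing, -i ignores case, -F treats TAG as a fixed string.
has_tag() {
    printf '%s\n' "$1" | grep -qiF -- "$2"
}

CT=3
LINE='</tr>'

# No else branch is needed when there is nothing to do.
if has_tag "$LINE" '</tr>'; then
    CT=0
fi

# If an else branch is wanted anyway, ":" does nothing and succeeds.
if has_tag "$LINE" '</html>'; then
    echo "end of document"
else
    :
fi

echo "CT=$CT"    # CT=0
```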
I figure I'm probably missing quotes somewhere. I tried adding them to the variable, but it doesn't work. Below is an update of what I have so far.
#!/bin/sh
#TD=0
CT=0
cat status.html |while read LINE
do
# Check to see if the LINE has a closing </td> tag in it.
TD=`echo $LINE |grep '</td>'`
if [ -n "$TD" ] ; then
# Increase the TD counter by 1
CT=`echo "$CT+1" |bc`
# Check to see if the TD counter is at 7 (we increment before we test, so 7 means the 7th TD)
if [ "$CT" -eq 7 ] ; then
# Use sed to remove this TD tag AND everything in between
echo $LINE |sed 's/<td>[a-zA-Z0-9|(|)]<\/td>//' >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
else
echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
fi
else
echo $LINE >> ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3
fi
# If we are leaving a table row then we need to reset the TD counter!
TR=`echo $LINE |grep '</tr>'`
if [ -n "$TR" ] ; then
CT=0
else
echo "" > /dev/null
fi
HTML=`echo $LINE |grep '</html>'`
if [ -n "$HTML" ] ; then
mv ztmp.Ps23zp2s.2-Fpps3-wmmm0dss3 status.html
else
echo "" > /dev/null
fi
done
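One likely reason the sed expression in the script above never matches: a bracket expression such as [a-zA-Z0-9|(|)] stands for exactly one character unless it is followed by a repetition such as "*", and a "|" inside brackets is a literal pipe rather than "or". A hedged sketch of an alternative, assuming the cell contains no nested tags, the tags are plain lowercase "<td>" without attributes, and the whole row is on one line (the row itself is made up):

```shell
#!/bin/sh
# Hypothetical row, all on one line; the 7th cell holds free-form text.
row='<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>Up (since 10:42), load 0.5/1.0</td><td>8</td></tr>'

# [^<]* means "any run of characters other than <", so it covers letters,
# digits, spaces, parentheses, colons, periods, commas and slashes.
# The trailing 7 limits the substitution to the 7th match on the line.
cleaned=$(printf '%s\n' "$row" | sed 's|<td>[^<]*</td>||7')

echo "$cleaned"
# <tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>8</td></tr>
```

Matching "anything up to the next <" avoids having to list every character the cell might contain.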
#!/bin/sh
IN=0   # 1 while we are inside a td tag that has not been closed yet
CT=0   # number of td tags closed so far in the current row
OUTFILE="TestHTML.out"
: > "$OUTFILE" # Start with a fresh, empty file always
while IFS= read -r LINE
do
    # If we are entering a table row then we need to reset the TD counter.
    TR=`echo "$LINE" |grep -i '<tr'`
    if [ -n "$TR" ]
    then
        CT=0
    fi
    # Check to see if the LINE has an opening td tag in it.
    TD=`echo "$LINE" |grep -i '<td'`
    if [ -n "$TD" ]
    then
        # We are inside a td tag.
        IN=1
    fi
    # Check to see if the LINE has a closing td tag in it.
    ENDTD=`echo "$LINE" |grep -i '/td>'`
    if [ -n "$ENDTD" ]
    then
        # We are leaving a td tag.
        IN=0
        # Increase the TD counter by 1
        CT=$((CT + 1))
    fi
    if [ "$IN" -eq 1 ] && [ "$CT" -eq 6 ] && [ -z "$ENDTD" ]
    then
        # We are inside the 7th td tag and it does not close on this line,
        # so use sed to blank out the whole line.
        echo "$LINE" |tr -d '\n' |sed 's/.*//' >> "$OUTFILE"
    elif [ "$IN" -eq 0 ] && [ "$CT" -eq 7 ]
    then
        # We may (or may not) have an opening and closing td tag in 1 line.
        # Match either case, to agree with the grep -i tests above.
        TMP=`echo "$LINE" |sed 's/<[Tt][Dd].*//'`
        echo "$TMP" |sed 's/.*\/[Tt][Dd]>//' >> "$OUTFILE"
    else
        echo "$LINE" >> "$OUTFILE"
    fi
done < TestHTML.htm