Remove strings within range using sed

ksk · July 24, 2010, 2:08pm

Hey folks

I have a big file that contains junk data between the tags <point> and </point> and I need to delete it (including `<point>' and `</point>').

i.e.

a = 1
<point>
 123123
 2342352
 234231
 234256
</point>
print a

needs to become

a = 1
print a

I'm certain that this is a one-liner in sed, but I'm not familiar enough with the language to know how to do it.

guruprasadpr · July 24, 2010, 2:21pm

Hi

sed '/point/,/\/point/d' file

Guru.
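
Note that the pattern /point/ above starts a range on any line containing the word "point", not only on the tag itself. Below is a minimal sketch that matches the literal tags instead. It assumes each tag sits on its own line, as in the sample from the question; the temporary file and the variable names `sample` and `result` are only there for illustration.

```shell
# Sample input from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
a = 1
<point>
 123123
 2342352
 234231
 234256
</point>
print a
EOF

# Delete every line from an opening <point> tag through the closing </point> tag.
# Including the angle brackets keeps lines that merely contain the word
# "point" from starting a range.
result=$(sed '/<point>/,/<\/point>/d' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

To run it against a real file, replace "$sample" with the file name and redirect the output to a new file rather than editing in place until the result has been checked.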

ygemici · July 24, 2010, 4:21pm

# sed -n '1p;$p' infile
a = 1
print a

walid2mi · July 24, 2010, 4:51pm

sed -n '/point/,/\/point/!p' file

ygemici · July 24, 2010, 6:13pm

# sed -n '/[a-z][^<point>]/p' infile
a = 1
print a

---------- Post updated at 01:13 AM ---------- Previous update was at 01:05 AM ----------

# sed '/<point>/,${;/print/!d}' infile
a = 1
print a

kurumi · July 24, 2010, 10:17pm

#!/bin/bash

declare -i flag
flag=0
while read -r LINE
do
  case "$LINE" in
   *"</point>"*)
      LINE=${LINE##*</point>}
      flag=0
      ;;
   *"<point>"*)
      LINE=${LINE%%<point>*}
      echo "$LINE"
      flag=1
      ;;
  esac
  [[ $flag = 1 ]] && continue
  [[ $flag = 0 ]] && echo "$LINE"
done < "file"

# cat file
a = 1
some text here , don't delete <point>
 123123
 2342352
 234231
 234256
</point> some text here  too that cannot be deleted
print a

linux$ ./myscript.sh
a = 1
some text here , don't delete
 some text here  too that cannot be deleted
print a

ksk · July 25, 2010, 12:57am

Thanks everyone.

That works perfectly well for the filetype I mentioned. Now, the problem is that I have to take out the line immediately preceding the first pattern as well.

That is, if the file is

a = 1
b = 1
<point>
124134
123123
42352
</point>
print a
c = 1

I need

a = 1
print a
c = 1

Can you suggest the modification?

guruprasadpr · July 25, 2010, 1:30am

Hi

awk 'NR!=1 && !f{ if($0 ~ /point/)f=1;else if(x)print x;}{if ($0 ~ /\/point/){f=0;x="";}else x=$0}END{print x}' f=0 file

Guru.
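
An alternative sketch for the same requirement (drop the <point> block plus the single line just before it) is to hold every line back by one and discard the held line when <point> shows up. It assumes the tags sit on their own lines and that blocks are not nested; `prev`, `have` and `skip` are illustrative variable names, and the sample data is the one from the post above.

```shell
# Sample input from the follow-up question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
a = 1
b = 1
<point>
124134
123123
42352
</point>
print a
c = 1
EOF

result=$(awk '
  /<point>/ { skip = 1; have = 0; next }           # discard the held line, start skipping
  skip      { if (/<\/point>/) skip = 0; next }    # skip up to and including the closing tag
  have      { print prev }                         # held line was not followed by <point>
            { prev = $0; have = 1 }                # hold the current line back by one
  END       { if (have) print prev }               # flush the last held line
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because only one line is ever held back, this removes exactly one preceding line per block; removing more would need a longer buffer.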

ksk · July 25, 2010, 12:37pm

Thanks Guru. I ran the awk command you suggested. It certainly removes what I wanted, but it also seems to be knocking out some other parts of the file that are essential.

I tried presenting the problem at its simplest, so as not to burden you guys with unnecessary complications, but I see that it would be better if I explained the whole thing.

The file that I am trying to edit is a KML file. It stores GPS coordinates and is interpreted by both Google Earth and Google Maps to draw maps. The file contains many folders, some called `Waypoints', some called `Tracks', some called `Points' etc. The `Points' folder contains extra waypoints that do not form a part of the track. These only clutter up the map, so I want to nuke the entire Points folder. Sadly, the declaration of folder names is slightly clumsy in KML. A file looks like this

<Folder>
  <name>Waypoints</name>
  # SOME ESSENTIAL STUFF
  # SOME ESSENTIAL STUFF
</Folder>
<Folder>
  <name>Tracks</name>
  # SOME ESSENTIAL STUFF
  # SOME ESSENTIAL STUFF
  <Folder>
    <name>Points</name>
    # JUNK
    # JUNK
    # JUNK
  </Folder>
  # SOME ESSENTIAL STUFF
</Folder>

What I need to remove is all instances of the Points folder (there is more than one), i.e. from <Folder> to </Folder> when the name tag value is `Points'. Another way to look at it would be to remove <Folder> to </Folder> when the first line after <Folder> reads <name>Points</name>.

I hope this helps

Thanks in advance!

---------- Post updated at 10:07 PM ---------- Previous update was at 11:45 AM ----------

No joy?

Christoph_Spohr · July 25, 2010, 5:18pm

Hi,

try:

awk '/Folder>/{s=$0;t=getline;if ($0 ~ /Points/){f=0} else {if (t) print s;f=1}}f' file

Output:

<Folder>
  <name>Waypoints</name>
  # SOME ESSENTIAL STUFF
  # SOME ESSENTIAL STUFF
</Folder>
<Folder>
  <name>Tracks</name>
  # SOME ESSENTIAL STUFF
  # SOME ESSENTIAL STUFF
  </Folder>
  # SOME ESSENTIAL STUFF
  </Folder>

HTH Chris

ksk · July 26, 2010, 9:55am

Hi Christoph

That takes out what I need, but again, it seems to delete parts of the script that it isn't meant to. For some reason, it removes the first 100-odd lines of script as well. I would post the input and output files, but they're huge. Would it help if I showed you both?

Thanks
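
One possible way to approach the Points-folder removal described earlier in the thread is to hold each opening <Folder> line until the next line is known, and to count nesting depth while skipping so that the matching </Folder> ends the skip. This is only a sketch under stated assumptions: every tag is on its own line, the opening tag is written exactly as <Folder> with no attributes, and <name>Points</name> is the line immediately after it. The names `held`, `have`, `skip` and `depth` are illustrative, and the sample is the simplified layout from the thread, not a real KML file.

```shell
# Simplified sample layout from the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Folder>
  <name>Waypoints</name>
  # SOME ESSENTIAL STUFF
  # SOME ESSENTIAL STUFF
</Folder>
<Folder>
  <name>Tracks</name>
  # SOME ESSENTIAL STUFF
  # SOME ESSENTIAL STUFF
  <Folder>
    <name>Points</name>
    # JUNK
    # JUNK
    # JUNK
  </Folder>
  # SOME ESSENTIAL STUFF
</Folder>
EOF

result=$(awk '
  skip {                                           # inside a Points folder
    if (/<Folder>/) depth++                        # a nested folder opens
    if (/<\/Folder>/ && --depth == 0) skip = 0     # the matching close ends the skip
    next
  }
  have {                                           # previous line was an opening <Folder>
    have = 0
    if (/<name>Points<\/name>/) { skip = 1; depth = 1; next }
    print held                                     # not a Points folder, so keep the tag
  }
  /<Folder>/ { held = $0; have = 1; next }         # hold the tag until the name is seen
  { print }
  END { if (have) print held }                     # flush a trailing held tag
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Unlike a plain range match, the depth counter keeps the outer Tracks folder's closing tag in place. A real KML file may format its tags differently, in which case an XML-aware tool would be the safer choice.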