sed awk question

I have a string from which I need to remove everything that is not within <>. For example:

this is a <test> of removing <text> outside brackets

output should be:

<test> <text>

or:

test text

I can use either of the two outputs, but so far I have not had much luck removing all of the other text. The closest I have gotten is with the following awk command:

awk 'BEGIN{RS="<";FS="> "}{print $1}' filename

this outputs the following:

this is a
test
text

I cannot get rid of the beginning of the line, and the output shows up on multiple lines, which will not work because I am trying to assign the output to a single string variable.

I also tried the following awk:

awk -F'<|>' '{print $2}' filename

this outputs the following:

test

better but I am missing the second field.

When using sed I can remove all the text inside the <>

sed 's/<[^>]*>//g'

but not the reverse.

I think I am close but can not get it. Any help would be appreciated.

Do you only want the result on one line?
For that purpose, try this:

$more file
<test> <text>
$awk 'BEGIN{RS="<";FS=">"}{printf $1" " } END {printf "\n"} ' file
test text

There is probably an easier way than running two sed commands, but this works:

echo "<test> <text>" | sed 's/<//g' | sed 's/>//g'

Output:

test text

chipcmc,

your update fixes my problem of having the output on a single line, but I still have the beginning of the line to deal with:

awk 'BEGIN{RS="<";FS=">"}{printf $1" " } END {printf "\n"} ' filename

output

this is a  test text

Padow,

Your sed commands only remove the < and > characters from the string:

 cat filename | sed 's/<//g' | sed 's/>//g'

output

this is a test of removing text outside brackets

To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags [code] and [/code] by hand.)

Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut-and-paste, edit any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums

vgersh99,

I updated my initial post to follow the correct format.

Thanks

What makes this difficult is having multiple occurrences of the bracketed fields per line. This rules out cut and makes awk very difficult to use. I'm sure some shell super-guru can accomplish this in a shell script, but it may be easier just to step this one up to perl or python.

Padow,

The example I provided has two occurrences of <> within the same line. The final script will need to work with much more complex lines, some containing 5 or 6 <>. I'm not opposed to using perl, and I also don't mind doing this in multiple commands: for example, using one command to remove the beginning of the line up to the first <, then using the awk command I have to delete the rest of the data in the line. This is the approach I am trying now. If someone has a perl command (or commands) to do the same, I would be willing to try it.

Thanks

nawk -F'<|>' '{ for(i=2;i<=NF;i+=2) printf("%s%c", $i, (i==NF-1)?RS:OFS)}' myFile

awk 'BEGIN{RS="<";FS="> "}{print $1}' filename
  • sets the record separator to "<" and the field separator to "> ". So, from
this is a <test> of removing <text> outside brackets

you have a set of records:

this is a 
test> of removing 
text> outside brackets
  • and you are printing the first field, the text before the ">", so it is obvious why you get the output you get.
    To drop the first 'record' of your breakdown by "<", test NR and act only when it is not 1: NR>1 or if (NR>1):
src> echo "some text <frst> more text <scnd> ending"| nawk 'BEGIN{FS=">"; RS="<";} NR>1 {print $1}'
frst
scnd
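
Since the goal was a single string variable, the NR>1 idea above can be combined with printf so that everything lands on one line. This is only a sketch: the sample line is hard-coded, the variable names line, result and sep are made up for illustration, and plain awk is assumed to behave like nawk here.

```shell
# Sample line from the original question (hard-coded for illustration).
line='this is a <test> of removing <text> outside brackets'

# RS="<" starts a new record at every "<"; FS=">" makes $1 the text before ">".
# NR > 1 skips the leading record, i.e. the text before the first "<".
# sep is empty for the first printed field and a space afterwards.
result=$(printf '%s\n' "$line" |
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR > 1 { printf "%s%s", sep, $1; sep = " " }
         END    { printf "\n" }')

printf '%s\n' "$result"    # prints: test text
```

Note that a stray "<" with no closing ">" would make the rest of that record show up as $1, so this assumes the brackets are balanced.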

The '<|>' construction is new to me, but I guess it means the field separator is '<' OR '>'.
And you are printing only the second field, so what else would you expect?!
With -F'<|>' you should print only the even-numbered fields:

echo "some text <frst< else >scnd> end"| nawk -F'<|>' '{for (i=1;i<=NF;i++) if (i%2==0) print $i}'
frst
scnd
  • but now awk does not care which of the two separators it meets: note that I mixed them up in the input on purpose, and it is not a problem for this code.
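
To see why the even-numbered fields are the bracketed ones, it may help to print every field together with its index. A small sketch, using the line from the original question and an illustrative variable named fields:

```shell
# Sample line from the original question (hard-coded for illustration).
line='this is a <test> of removing <text> outside brackets'

# With -F'<|>' both "<" and ">" split the line, so the fields alternate:
# odd fields are the text outside the brackets, even fields the text inside.
fields=$(printf '%s\n' "$line" |
    awk -F'<|>' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
# 1=[this is a ]
# 2=[test]
# 3=[ of removing ]
# 4=[text]
# 5=[ outside brackets]
```

The alternation only holds when the line does not start with "<" after some other separator mix-up and every "<" has its ">", so treat it as an assumption about the input.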

Thanks, both the vgersh99 and alex_5161 solutions work.

For sed:

echo "some text <frst.> else <scnd> end"| sed 's/[^<]*[<]*\([^>]*\)>*/\1/g'
frst.scnd
  • the text matched between \( and \) is saved by sed as \1 (a second \(\) pair would be saved as \2, and so on), and that saved text is used as the replacement.
    That is the part you were removing in your post.
    Now we only need to remove everything else.
    So, the part before the opening angle bracket: [^<]* - any run of characters other than '<'.
    Next, the '<' is put in a bracket expression so that it can be followed by '*', meaning 0 or more times: that is needed to remove the trailing part of the line, which has no '<'.

I hope it is understandable.
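
If the first output format from the question is wanted (brackets kept, one space between them), a variation on the same idea might look like the sketch below. It is not the command posted above: here the brackets are moved inside the \( \) group, a space is appended after each \1, and a second expression trims the final space. The variable names are illustrative.

```shell
# Sample line from the original question (hard-coded for illustration).
line='this is a <test> of removing <text> outside brackets'

# First expression: drop the text around each <...> and keep the bracketed
# part (\1) followed by a space. Second expression: remove the trailing space.
result=$(printf '%s\n' "$line" |
    sed -e 's/[^<]*\(<[^>]*>\)[^<]*/\1 /g' -e 's/ $//')

printf '%s\n' "$result"    # prints: <test> <text>
```

A line that contains no <...> at all does not match the first expression, so it would pass through unchanged rather than come out empty.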

A snippet that can be added to something bigger:

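{
    # split the line on whitespace into array a; i is the number of words
    i = split($0, a)
    for (n = 1; n <= i; n++)
    {
        # print only the words that contain a <...> pair
        if (match(a[n], /<.*>/) > 0)
            print a[n]
    }
}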
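
The snippet above splits on whitespace, so it relies on the bracketed text containing no spaces and prints one match per line. A possible variation, sketched here under the assumption that the brackets are balanced, walks along the line with match() and collects every <...> into one string. The names rest, out and result are made up for illustration.

```shell
# Sample line from the original question (hard-coded for illustration).
line='this is a <test> of removing <text> outside brackets'

# match() sets RSTART and RLENGTH for the first <...> in rest; the text between
# the brackets is appended to out, then rest is cut down to what follows it.
result=$(printf '%s\n' "$line" |
    awk '{
        rest = $0
        out = ""
        while (match(rest, /<[^>]*>/)) {
            out = out (out == "" ? "" : " ") substr(rest, RSTART + 1, RLENGTH - 2)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }')

printf '%s\n' "$result"    # prints: test text
```

Because the loop runs until no match is left, the number of bracketed fields per line does not matter.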

use below:-

nawk -v FS="<|>" '{print $2,$4}'

BR

this works (on your input)... although probably not the best

echo "this is a <test> of removing <text> outside brackets" | tr '<' '\012' | grep ">" | cut -d">" -f1 | tr '\012' ' '
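
The same pipeline idea can possibly be shortened when grep supports the -o option (print only the matching part). That option is an extension found in GNU and BSD grep, not something POSIX requires, so this sketch assumes one of those implementations. The variable names are illustrative.

```shell
# Sample line from the original question (hard-coded for illustration).
line='this is a <test> of removing <text> outside brackets'

# grep -o prints each <...> match on its own line (assumes GNU or BSD grep);
# paste -s joins those lines into one, separated by a single space.
result=$(printf '%s\n' "$line" | grep -o '<[^>]*>' | paste -s -d ' ' -)

printf '%s\n' "$result"    # prints: <test> <text>
```

This keeps the brackets in the output, which the original question listed as an acceptable format.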

how about perl:

while(<DATA>){
	chomp;
	my @tmp=$_=~/(<[^>]*>)/g;
	print join " ",@tmp;
	print "\n";
}
__DATA__
this is a <test> of removing <text> outside brackets
this <is> a <test> of removing <text> outside brackets