I have a string from which I need to remove everything that is not within <>. For example:
this is a <test> of removing <text> outside brackets
output should be:
<test> <text>
or:
test text
I can use either of the two outputs, but so far I have not had much luck removing all of the other text. The closest I have gotten is with the following awk command:
awk 'BEGIN{RS="<";FS="> "}{print $1}' filename
this outputs the following:
this is a
test
text
I cannot get rid of the beginning of the line, and the output shows up on multiple lines, which will not work because I am trying to assign the output to a single string variable.
I also tried the following awk:
awk -F'<|>' '{print $2}' filename
this outputs the following:
test
Better, but I am missing the second bracketed field.
When using sed I can remove all the text inside the <>:
sed 's/<[^>]*>//g'
but not the reverse.
I think I am close but can not get it. Any help would be appreciated.
What makes this difficult is having multiple occurrences of the bracketed fields per line. This rules out cut and makes awk very difficult to use. I'm sure some shell super-guru can accomplish this in a shell script, but it may be easier just to step this one up to perl or python.
The example I provided has two occurrences of <> within the same line. The final script will need to work with much more complex lines, some containing 5 or 6 <>. I'm not opposed to using perl, and I also don't mind doing this in multiple commands: for example, using one command to remove the beginning of the line up to the first <, then using the awk command I have to delete the rest of the data in the line. This is the approach I am trying now. If someone has a perl command (or commands) to do the same, I would be willing to try it.
Your first awk command sets the record separator to "<" and the field separator to "> ". So, from
this is a <test> of removing <text> outside brackets
you have a set of records:
this is a
test> of removing
text> outside brackets
and you are printing the first field of each record, the part before the ">". So it is obvious why you get the output you get.
To drop the first 'record' of this "<" breakdown, test NR and print only when it is not 1: NR>1 or if (NR>1):
src> echo "some text <frst> more text <scnd> ending"| nawk 'BEGIN{FS=">"; RS="<";} NR>1 {print $1}'
frst
scnd
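Since the goal was a single string in a variable, here is a minimal sketch of the same NR>1 idea that joins the pieces with spaces instead of printing one per line. The variable names line, result and sep are only for illustration, and plain awk is assumed to behave like nawk here.

```shell
# Sample line from the original question.
line='this is a <test> of removing <text> outside brackets'

# RS="<" starts a new record at every "<"; FS=">" makes $1 the bracket content.
# NR > 1 skips the text before the first "<"; printf joins the pieces with spaces.
result=$(printf '%s\n' "$line" |
  awk 'BEGIN { RS = "<"; FS = ">" }
       NR > 1 { printf "%s%s", sep, $1; sep = " " }
       END    { print "" }')

echo "$result"   # prints: test text
```

Because RS is no longer the newline, a multi-line input would be joined into one long string as well, which may or may not be what you want.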
The '<|>' construction is new to me, but I guess it means the field separator is '<' OR '>'.
And you are printing only the second field, so that is all you get.
With -F'<|>' the bracketed text lands in the even-numbered fields, so print only those:
echo "some text <frst< else >scnd> end"| nawk -F'<|>' '{for (i=1;i<=NF;i++) if (i%2==0) print $i}'
frst
scnd
But note that awk now does not care which of the two separators it sees: I mismatched them in the input above, and that is not a problem for this code.
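As a rough sketch of the even-fields approach that keeps each input line on one output line, the loop can build a string and print it once. I use the bracket expression [<>] as the separator, which should be equivalent to '<|>'; out and sep are just illustrative names.

```shell
# Second sample line, with three bracketed fields.
line='this <is> a <test> of removing <text> outside brackets'

# Split on either "<" or ">"; with well-formed input the even-numbered
# fields (2, 4, 6, ...) are the bracket contents.
result=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{
    out = ""; sep = ""
    for (i = 2; i <= NF; i += 2) { out = out sep $i; sep = " " }
    print out
  }')

echo "$result"   # prints: is test text
```

The same caveat applies as above: this relies on the brackets being balanced and not nested, because awk cannot tell an opening bracket from a closing one here.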
echo "some text <frst.> else <scnd> end"| sed 's/[^<]*[<]*\([^>]*\)>*/\1/g'
frst.scnd
The text matched between \( and \) is saved by sed as \1 (a second \(\) pair would be saved as \2, and so on), and that saved text is what gets used in the substitution.
That bracketed text is what the sed command in your post removed; now we need to keep it and remove everything else instead.
So, the part before the opening angle bracket: [^<]* matches everything that is not '<'.
Next, '<' is put in a bracket expression so that '*' can follow it, meaning zero or more occurrences. That is needed so the pattern also matches, and removes, the ending part of the line, which has no '<'.
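The sed above runs the captured pieces together (frst.scnd). A possible variant, offered only as a sketch, requires a real '<' and '>' around the capture and also swallows the text after each '>' so a space can be put between the pieces. The names line and result are illustrative.

```shell
line='some text <frst.> else <scnd> end'

# First expression: for each <...>, drop the text before and after it,
# keep the content (\1) and append a space.
# Second expression: trim the trailing space left after the last piece.
result=$(printf '%s\n' "$line" |
  sed -e 's/[^<]*<\([^>]*\)>[^<]*/\1 /g' -e 's/ $//')

echo "$result"   # prints: frst. scnd
```

Unlike the original pattern, this one leaves a line with no <...> at all unchanged, because nothing matches. Replacing \1 with <\1> in the first expression would keep the brackets in the output.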
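In Perl, you can collect every <...> match on each line and join the matches with spaces:
while (<DATA>) {
    chomp;
    # In list context, a /g match returns every captured <...> on the line.
    my @tmp = $_ =~ /(<[^>]*>)/g;
    print join(" ", @tmp), "\n";
}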
__DATA__
this is a <test> of removing <text> outside brackets
this <is> a <test> of removing <text> outside brackets
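
For comparison, the same "collect every <...> match" idea can be sketched without Perl, assuming your grep supports the -o option (GNU and BSD grep do, but it is not in POSIX). The names line and result are illustrative.

```shell
line='this <is> a <test> of removing <text> outside brackets'

# grep -o prints each match of <...> on its own line;
# paste -s joins those lines into one, separated by spaces.
result=$(printf '%s\n' "$line" | grep -o '<[^>]*>' | paste -sd' ' -)

echo "$result"   # prints: <is> <test> <text>
```

Note that paste -s joins the matches from the whole input, so this fits the single-string case rather than a file where each line must stay separate.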