<Country code="US"><tag>adsf</tag><tag>bdfs</tag></Country><Country code="CA"><tag>asdf</tag><tag>bsdf</tag></Country>
I want to grab the country code, save it, then put each new "<..." on its own line with the country code added to the beginning of each.
So, the above would become:
Country - US
US adsf
US bdfs
Country - CA
CA asdf
CA bsdf
I do this:
sed 's/<Country code="\(..\)">/<Country - \1>/g;s/</\n&/g' inputfile.txt | grep Country
<Country - US>
</Country>
<Country - CA>
</Country>
But if I try to add a reference to \1 to the second substitution, I get an error:
sed 's/<Country code="\(..\)">/<Country - \1>/g;s/</\n\1&/g' inputfile.txt | grep Country
sed: -e expression #1, char 54: invalid reference \1 on `s' command's RHS
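What you need is not a (non-persistent) backreference but the (persistent) "hold space". A backreference like \1 is only valid inside the s command whose pattern captured it, which is why your second substitution fails. The hold space is a buffer that survives from one input line to the next: you can copy text into it, append to it, or pull text back out of it. I suggest you read the man page of sed to acquaint yourself with the concept.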
It is best to start by thinking through what you need to do for every type of line, i.e.:
a) lines with "Country code=" in them: load the hold space with the country code
b) all lines:
b1) split into tags (starting with "<")
b2) remove the tags themselves (as I took it from your sample output)
b3) add the content of the hold space to the beginning of the pattern space
Note that you make your life a lot easier by using multi-line sed programs. In principle you can do the above in a one-liner too, just as you can write a C program on one line, but it would be very hard to read, even harder to understand and nigh impossible to debug.
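The steps above could be sketched as a multi-line program like the one below. It is only one possible layout, not the definitive solution: the function name split_by_country is made up for illustration, the country code is assumed to be exactly two characters, and because the data arrives on a single line a first sed splits it in front of every "<" before the second sed does the real work.

```shell
# Sample data from the question, all on one line.
printf '%s\n' '<Country code="US"><tag>adsf</tag><tag>bdfs</tag></Country><Country code="CA"><tag>asdf</tag><tag>bsdf</tag></Country>' > inputfile.txt

# split_by_country is a hypothetical helper name, not part of sed.
split_by_country() {
    # b1) start a new line in front of every "<"
    sed 's/</\
</g' "$1" |
    sed -n '
    # a) country line: reduce it to the code, save the code, print a header
    /^<Country code="\(..\)">/ {
        s/^<Country code="\(..\)">.*/\1/
        h
        s/^/Country - /
        p
    }
    # b2) tag line: strip the opening tag, b3) put the saved code in front
    /^<tag>/ {
        s/^<tag>//
        G
        s/\(.*\)\n\(.*\)/\2 \1/
        p
    }
    '
}

split_by_country inputfile.txt
```

Closing tags and empty lines match neither block, and since sed runs with -n they are simply never printed.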
The second s command resets the references, so the \1 from the first s is gone by then.
A \1 only works there if the search pattern of the second s is empty, i.e. taken from the previous match. But then how do you get the g option to act on that match?
The following works with one s, but the number of <tag>s must be static, e.g. 2.
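A minimal sketch of what such a single substitution could look like, assuming every country holds exactly two <tag> elements and nothing else. The function name two_tag_split is made up for illustration, and the short second s is only there to drop the newline left over after the last country.

```shell
# Sample data from the question, all on one line.
printf '%s\n' '<Country code="US"><tag>adsf</tag><tag>bdfs</tag></Country><Country code="CA"><tag>asdf</tag><tag>bsdf</tag></Country>' > inputfile.txt

# two_tag_split is a hypothetical helper name, not part of sed.
two_tag_split() {
    # \1 = country code, \2 = first tag text, \3 = second tag text.
    # All three are used inside the same s command, so they are valid.
    sed 's/<Country code="\(..\)"><tag>\([^<]*\)<\/tag><tag>\([^<]*\)<\/tag><\/Country>/Country - \1\
\1 \2\
\1 \3\
/g
s/\n$//' "$1"
}

two_tag_split inputfile.txt
```

A country with one or three tags would not match the pattern at all and would pass through unchanged, which is the limitation mentioned above.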
And another one that works like the previous suggestions.
sed 's/</\
/g' inputfile.txt | sed '
\#^Country code="\(.*\)".*#{s##\1#;h;}
\#^tag>#!d
G;s#^tag>\(.*\)\n\(.*\)#\2 \1#
'
Because the first sed splits the input into lines, the second sed can handle any number of <tag>s. It will also tolerate (discard) other stuff in between (for example a <b> tag).
Printing a "Country - XX" header line is left as an exercise.
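One possible answer to that exercise, as a sketch rather than the author's own solution: after saving the code with h, turn the line into the header and use b (branch to the end of the script) so that the delete on the following line is skipped and the header gets printed. The function name split_with_header is made up for illustration.

```shell
# Sample data from the question, all on one line.
printf '%s\n' '<Country code="US"><tag>adsf</tag><tag>bdfs</tag></Country><Country code="CA"><tag>asdf</tag><tag>bsdf</tag></Country>' > inputfile.txt

# split_with_header is a hypothetical helper name, not part of sed.
split_with_header() {
    sed 's/</\
/g' "$1" | sed '
\#^Country code="\(.*\)".*#{s##\1#;h;s#^#Country - #;b
}
\#^tag>#!d
G;s#^tag>\(.*\)\n\(.*\)#\2 \1#
'
}

split_with_header inputfile.txt
```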
sed 's/<Country code="\(..\)">/\nCountry - \1/g; s/^\n//; s|</[^>]*>||g; s/<[^>]*>/\n/g' file | sed '/Country - / {p; s///; x; d}; G; s/\(.*\)\n\(.*\)/\2 \1/'
Country - US
US adsf
US bdfs
Country - CA
CA asdf
CA bsdf
Thanks. I'll look into that. The data comes in all on one line, which is what made me think the backreference would work. I'll look at the hold space as an option.
Here is an example of how to use the hold space. It does not exactly solve your requirement but IMHO illustrates quite well how the mechanism works:
This is your input file:
=Chapter 1
line 1
line 2
line 3
=Chapter 2
line A
line B
line C
We want the "chapter headings" to appear after every line of that chapter like this:
line 1 (Chapter 1)
line 2 (Chapter 1)
line 3 (Chapter 1)
line A (Chapter 2)
line B (Chapter 2)
line C (Chapter 2)
The following sed-script will do this. I have added comments, which are NOT part of the script, so you should remove them when you try it:
sed '/^=/ { # only lines starting with "=" , do:
s/^=// # remove the leading equal sign
s/^/ (/ # add a leading " (" to the remainder
s/$/)/ # add a trailing ")"
h # copy the result to the hold space
d # delete the line from pattern space
}
G # all lines: append content of hold space to pattern space
s/\n// ' inputfile # remove EOL markers from inside lines
Notice that the last command is necessary: G does not just glue the hold space onto the line, it appends an EOL marker followed by the content of the hold space. A line in pattern space would look like:
<line content> # initially, after reading in the line (sed strips the trailing EOL)
<line content><EOL><hold space> # after appending the hold space content with G
and you need to get rid of the EOL in the middle so that the line, when printed, is not broken into two.
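If you want to see that embedded EOL for yourself, the l command prints the pattern space unambiguously, showing a newline as \n and marking the end with $. A small sketch, using a shortened two-line sample and a variable name (shown) chosen just for this illustration:

```shell
# Run the chapter script on a two-line sample, but instead of removing
# the newline after G, list the pattern space with "l".
shown=$(printf '=Chapter 1\nline 1\n' | sed -n '
/^=/ {
    s/^=/ (/
    s/$/)/
    h
    d
}
G
l
')
printf '%s\n' "$shown"
# prints: line 1\n (Chapter 1)$
```

The \n between "line 1" and " (Chapter 1)" is the EOL that the final s/\n// takes out.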