sed - use back reference in 2nd command

JenniferAmon · August 7, 2017, 3:48pm

I have data that looks like this:

<Country code="US"><tag>adsf</tag><tag>bdfs</tag></Country><Country code="CA"><tag>asdf</tag><tag>bsdf</tag></Country>

I want to grab the country code, save it, then put each tag on its own line with the country code added to the beginning of each.
So, the above would become:
Country - US
US adsf
US bdfs
Country - CA
CA asdf
CA bsdf

I do this:
 sed 's/<Country code="\(..\)">/<Country - \1>/g;s/</\n&/g' inputfile.txt | grep Country
<Country - US>
</Country>
<Country - CA>
</Country>

But if I try to add a reference to \1 to the second substitution, I get an error:

sed 's/<Country code="\(..\)">/<Country - \1>/g;s/</\n\1&/g' inputfile.txt | grep Country
sed: -e expression #1, char 54: invalid reference \1 on `s' command's RHS

Any ideas?

bakunin · August 7, 2017, 4:01pm

What you need is not a (non-persistent) backreference but the (persistent) "hold space": this is a buffer you can copy text into, append text to, or pull text back out of. I suggest you read the man page of sed to make yourself acquainted with the concept.

It is best to start by thinking through what you need to do for every type of line, i.e.:

a) lines with "<Country code=" in them: load hold space with country code
b) all lines:
b1) split into tags (starting with "<")
b2) remove the tags themselves (as I took it from your sample output)
b3) add the content of the hold space to the beginning of the pattern space

Note that you make your life a lot easier by using multi-line sed programs. In principle you can do the above in a one-liner too, just as you can write a C program on one line, but it would be very hard to read, even harder to understand and nigh impossible to debug.
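The steps above can be sketched as a multi-line sed program. This is only one possible reading of the plan, not a tested solution from this thread: the function name prefix_tags is made up for illustration, and the sketch assumes the data arrives on stdin as a single line shaped like the sample in the first post.

```shell
# prefix_tags is a hypothetical helper name; it reads the one-line data on stdin.
prefix_tags() {
  # First sed (b1): start a new line before every "<".
  # Second sed: steps a, b2 and b3 on the resulting one-tag-per-line stream.
  sed 's/</\
</g' | sed -n '
    # (a) Country line: reduce it to the code, keep a copy in the
    #     hold space, then print the header line.
    /^<Country code="\([^"]*\)">.*/{
      s//\1/
      h
      s/^/Country - /
      p
      d
    }
    # Drop closing tags and empty lines.
    /^<\//d
    /^$/d
    # (b2) Strip the opening tag and keep only its text.
    s/^<[^>]*>//
    # (b3) Append newline + hold space, then swap the two halves.
    G
    s/^\(.*\)\n\(.*\)$/\2 \1/
    p
  '
}

printf '%s\n' '<Country code="US"><tag>adsf</tag></Country>' | prefix_tags
# prints:
# Country - US
# US adsf
```

The split is done by a separate sed because the hold-space logic is much simpler when every tag already sits on its own line.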

I hope this helps.

bakunin

rdrtx1 · August 7, 2017, 4:07pm

sed 's/</\n</g' input_file | awk '
/<Country .*code=/ {
   country=$0;
   sub(".*code=\"*", "", country);
   sub("\".*", "", country);
   c=$0;
   sub(" .*", "", c);
   sub("< *", "", c);
   print c " - " country;
   next;
}
/<[^\/]*>/ {
   tag=$0;
   sub("<[^>]*> *", "", tag);
   print country, tag;
}
'

MadeInGermany · August 7, 2017, 4:36pm

Each s command sets its own references, so the 2nd s resets those of the 1st.
A reference to the 1st match only works if the 2nd search pattern is empty, i.e. reused from the previous match, but then how would one get the effect of the g option?
The following works with a single s, but the number of <tag>s must be fixed, e.g. 2.

sed 's#<Country code="\(..\)"><tag>\([^<]*\)</tag><tag>\([^<]*\)</tag></Country>#Country - \1\
\1 \2\
\1 \3\
#g' inputfile.txt
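A small illustration of the empty-pattern case mentioned above: an s command with an empty pattern reuses the last regular expression that was applied, groups included, so \1 is valid in its replacement. The function name reuse_demo is made up for this sketch.

```shell
# reuse_demo is a hypothetical name; it reads lines on stdin.
reuse_demo() {
  sed -n '
    # The address regex captures the two-letter code.
    /code="\(..\)"/{
      # Empty pattern: reuse the address regex, so \1 is the code.
      s//[\1]/
      p
    }
  '
}

printf '%s\n' '<Country code="US">' | reuse_demo
# prints: <Country [US]>
```

A second s with its own, different pattern would define its own groups instead, which is why the original one-liner fails with "invalid reference".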

ctac · August 7, 2017, 5:18pm

Hi
With awk

awk 'BEGIN{FS="[><]";RS="<Country code="}/^$/{next};{gsub("\"","",$1);print "Country - "$1"\n"$1" "$4"\n"$1" "$8 }' inputfile

MadeInGermany · August 8, 2017, 3:18am

And another one that works like the previous suggestions.

sed 's/</\
/g' inputfile.txt | sed '
\#^Country code="\(.*\)".*#{s##\1#;h;}
\#^tag>#!d
G;s#^tag>\(.*\)\n\(.*\)#\2 \1#
'

Because the first sed splits into lines, the second sed can take any number of <tag>s. Also it will tolerate (discard) other stuff in between (for example a <b> tag).
Printing a "Country - XX" header line is left as an exercise.
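One possible take on that exercise, as a sketch rather than the poster's own solution: after saving the code with h, turn the pattern space into the header and branch to the end of the script with b, so the header is printed and the "delete everything that is not a tag" rule is skipped. The function name with_header is made up here.

```shell
# with_header is a hypothetical name; it reads the one-line data on stdin.
with_header() {
  sed 's/</\
/g' | sed '
    \#^Country code="\(.*\)".*#{
      # Reduce the line to the code and save it in the hold space.
      s##\1#
      h
      # Build the header, then jump to the end so it gets printed.
      s#^#Country - #
      b
    }
    \#^tag>#!d
    G
    s#^tag>\(.*\)\n\(.*\)#\2 \1#
  '
}

printf '%s\n' '<Country code="CA"><tag>asdf</tag><b>x</b></Country>' | with_header
# prints:
# Country - CA
# CA asdf
```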

RudiC · August 8, 2017, 6:27am

Try also

sed 's/<Country code="\(..\)">/\nCountry - \1/g; s/^\n//; s|</[^>]*>||g; s/<[^>]*>/\n/g' file | sed '/Country - / {p; s///; x; d}; G; s/\(.*\)\n\(.*\)/\2 \1/'
Country - US
US adsf
US bdfs
Country - CA
CA asdf
CA bsdf

JenniferAmon · August 8, 2017, 8:05am

bakunin:

Thanks. I'll look into that. the data comes in all on one line, which is what allowed me to think the back reference would work. I'll look at the hold space as an option.

bakunin · August 9, 2017, 3:17am

Here is an example of how to use the hold space. It does not exactly solve your requirement but illustrates IMHO quite well how the mechanism works:

This is your input file:

=Chapter 1
line 1
line 2
line 3
=Chapter 2
line A
line B
line C

We want the "chapter headings" to appear after every line of that chapter like this:

line 1 (Chapter 1)
line 2 (Chapter 1)
line 3 (Chapter 1)
line A (Chapter 2)
line B (Chapter 2)
line C (Chapter 2)

The following sed-script will do this. I have added comments, which are NOT part of the script, so you should remove them when you try it:

sed '/^=/ {                     # only for lines starting with "=", do:
             s/^=//             #    remove the leading equal sign
             s/^/ (/            #    add a leading " (" to the remainder
             s/$/)/             #    add a trailing ")"
             h                  #    copy the result to the hold space
             d                  #    delete the pattern space and start the next cycle
          }
     G                          # all other lines: append a newline plus the hold space
     s/\n// ' inputfile         # remove that newline so the output stays on one line

Notice that the last line is necessary: G does not simply glue the hold space onto the line, it appends a newline followed by the hold space content. The pattern space holds the line without its trailing newline (sed adds that back when printing), so it looks like this:

<line content>                    # initially, after reading in the line
<line content><EOL><hold space>   # after G has appended the hold space

and you need to get rid of the EOL in the middle so that the line when printed is not broken into two.
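To see what the pattern space contains right after G, the l command can be used: it prints the pattern space unambiguously, showing an embedded newline as \n and marking the end with $. The following is a small sketch along the lines of the script above (without the parentheses, and with the made-up function name show_pattern_space):

```shell
# show_pattern_space is a hypothetical name; it reads lines on stdin.
show_pattern_space() {
  sed -n '
    /^=/{
      s/^=//
      h
      d
    }
    G
    # Print the pattern space in unambiguous form instead of as-is.
    l
  '
}

printf '%s\n' '=Chapter 1' 'line 1' | show_pattern_space
# prints: line 1\nChapter 1$
```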

I hope this helps.

bakunin

JenniferAmon · August 9, 2017, 8:44am

madeingermany:

And another one that works like the previous suggestions.
sed 's/</\
/g' inputfile.txt | sed '
\#^Country code="\(.*\)".*#{s##\1#;h;}
\#^tag>#!d
G;s#^tag>\(.*\)\n\(.*\)#\2 \1#
'
Because the first sed splits into lines, the second sed can take any number of <tag>s. Also it will tolerate (discard) other stuff in between (for example a <b> tag).
Printing a "Country - XX" header line is left as an exercise.

This works beautifully! Thank you. I'm going to go with this to start with, while I continue to look at and decipher the others.