Make awk gsub take value of for loop

bathtime · February 25, 2018, 7:03am

I am running Debian with the mksh shell; the script starts with #!/bin/mksh.

Here is one instance I am trying to match. There are other level and n values, but they must be gathered in numerical order or the program will not work properly:

level="0" n="0"

Here is my code which does not work:

{ for (a = 0; a <= 10; ++a)
     { gsub(/level="[a]" n="[0-9]"/, "") }
}

The code above never matches, so nothing is replaced, but the code below does. It is only a proof of concept that the text can be matched:

{ for (a = 0; a <= 10; ++a)
     { gsub(/level="0" n="[0-9]"/, "") }
}

So it seems that gsub is not taking the a parameter. How can I make gsub take the value of the for loop?

Thank you.

Btw, I have tried:

{ gsub(/level="[0-9]" n="[0-9]"/, "") }

This catches other instances before level="0" n="0".

RudiC · February 25, 2018, 7:39am

Regexes enclosed in slashes /.../ are regex constants, i.e. taken verbatim / literally. So a won't be replaced by its contents but matches "a".
Try to build the regex from partial string constants and variable contents, like e.g.

gsub("level=\"" a "\" n=\"[0-9]\"", "")

Not sure, though, what your intentions are with the square brackets around the a variable. And why not use a tailored regex (with "alternation") in lieu of the loop across a?
Some decent input (good and bad, i.e. to be matched or not) samples would help to understand what you're after.
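A minimal sketch of the difference, assuming a made-up one-line sample (the real input carries more attributes). The regex is assembled by string concatenation, so the current value of a ends up inside it; the loop stops at 1 here only to show that level="2" is left untouched.

```shell
# Hypothetical sample line, not taken from the real dictionary file.
line='level="0" n="0" level="1" n="5" level="2" n="7"'

out=$(printf '%s\n' "$line" | awk '{
    for (a = 0; a <= 1; ++a) {
        # Dynamic regex: a string built from constants plus the value of a.
        re = "level=\"" a "\" n=\"[0-9]\""
        gsub(re, "")
    }
    print
}')
printf '%s\n' "$out"
```

Only the level="0" and level="1" pairs are removed; with /level="[a]" .../ the bracket expression would instead look for the letter a.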

bathtime · February 25, 2018, 12:38pm

This was the code needed to make it work! Thank you!

The big issue was that it must match perfectly or it would mess up the printout by going past its point and overwriting other data. With the forum's help ;), I got that fixed (after 6hrs of droning at the comp). Maybe there is a very simple way, but none of the other offered solutions have worked so far, though I did learn new things from them.

The square brackets were just a beginner's mistake; I was trying to use them as some sort of way of inputting a substitution. :rolleyes:

If anyone is interested, a working solution is here. It's not the most efficient, but it works perfectly. If anyone wants all the code, just ask:

# Find 'sense id' number and store in a variable
{ /<sense id=\"n.*" level/; {vID = substr($2, 1, length($2)-1)}}

# Used for testing to see value:
# {print "vID: " vID}

# If matched then print section divider
{ for (vid = 0; vid <= 19; vid++){
    { for (vl = 0; vl <= 3; vl++){
        { for (vn = 0; vn <=25; vn++){

            {if (vn<=10) {vnx=vn    } }
            {if (vn==11) {vnx="I"    } }
            {if (vn==12) {vnx="II"    } }
            {if (vn==13) {vnx="III"    } }
            {if (vn==14) {vnx="IV"    } }
            {if (vn==15) {vnx="IV."    } }
            {if (vn==16) {vnx="V"    } }
            {if (vn==17) {vnx="V."  } }
            {if (vn==18) {vnx="A"  } }
            {if (vn==19) {vnx="B"  } }
            {if (vn==20) {vnx="C"  } }
            {if (vn==21) {vnx="D"  } }
            {if (vn==22) {vnx="E"  } }
            {if (vn==23) {vnx="F"  } }
            {if (vn==24) {vnx="G"  } }
            {if (vn==25) {vnx="H"  } }

                        # Used for testing to see values:
            # valuev="<sense " vID "." vid "\" level=\"" vl "\" n=\"" vnx "\" opt=\"n\">"; print valuev "\n"
            # {gsub("<sense " vID "." vid "\" level=\"" vl "\" n=\"" vnx "\" opt=\"n\">", vdefSep)}

                        # Everything is ready, so try to make a match!
            {gsub("<sense " vID "." vid "\" level=\"" vl "\" n=\"" vnx "\" opt=\"n\">", vdefSep)}

        }}
    }}
}}

# A sample of what I'm trying to match:
#
# <sense id="n1.0" level="0" n="0" opt="n">
# <sense id="n1.1" level="1" n="I" opt="n">
# <sense id="n1.2" level="2" n="A" opt="n">
# <sense id="n1.3" level="2" n="B" opt="n">
# <sense id="n1.4" level="3" n="1" opt="n">
# <sense id="n1.5" level="3" n="2" opt="n">
# <sense id="n1.6" level="1" n="II" opt="n">
# <sense id="n1.7" level="2" n="A" opt="n">
# <sense id="n1.8" level="3" n="1" opt="n">
# <sense id="n1.9" level="3" n="2" opt="n">
# <sense id="n1.10" level="2" n="B" opt="n">
# <sense id="n1.11" level="3" n="1" opt="n">
# <sense id="n1.12" level="3" n="2" opt="n">
# <sense id="n1.12" level="3" n="2" opt="n">
# <sense id="n1.13" level="3" n="3" opt="n">

Basically, it goes through every single possibility that exists. There must be a better way.

RudiC · February 25, 2018, 2:02pm

A few comments, although I'm afraid I didn't understand all the details of your code snippet and may not have interpreted all of them correctly, partly because the unusual indentation makes it hard to follow:

Although extra braces don't hurt and the parser will understand / eliminate them, too many of them make the code difficult to read. {if (vn<=10) {vnx=vn } } can be written as if (vn<=10) vnx=vn without sacrificing logic while improving readability.
For every single input line, you execute those nested loops 20 x 4 x 26, i.e. 2080 times, which is quite lengthy for more than a few input lines.
Instead of the 16 ifs for the vnx constants assignment, you could use an array.
You seem to execute 2080 gsubs on $0 with different patterns, each and every one overwriting the former ones; not sure if each of those really makes sense and is necessary.

I could imagine that if you explain your problem verbosely in plain English, supporting this with a few meaningful examples, people in here could come up with a tailored, crisp proposal on how to improve and accelerate the solution.

EDIT:
This

{ /<sense id=\"n.*" level/; {vID = substr($2, 1, length($2)-1)}}

is NOT a pattern {action} pair and will change vID with every new input line. Is that intended? Why then the /<sense id=\"n.*" level/ ?
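A small sketch of that distinction, assuming a made-up two-line input. A regex written as a statement inside the braces is evaluated and its result thrown away, so the action still runs for every line; a regex placed in front of the braces guards the action.

```shell
# Hypothetical input: only the second line is a <sense> tag.
input='<entry key="x">
<sense id="n1.0" level="0" n="0" opt="n">'

# Regex inside the action: result discarded, counter runs on every line.
always=$(printf '%s\n' "$input" | awk '{ /<sense id=/; n++ } END { print n }')

# Regex as the pattern: counter runs only on matching lines.
guarded=$(printf '%s\n' "$input" | awk '/<sense id=/ { n++ } END { print n }')

echo "unguarded: $always, guarded: $guarded"
```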

EDIT 2: After replacing the gsub with a print - just as a proof of concept - , this yields the identical output as your code above:

awk '
BEGIN   {split ("1 2 3 4 5 6 7 8 9 10 I II III IV IV. V V. A B C D E F G H", VNARR)
         VNARR[0] = 0
        }

        {vID = substr($2, 1, length($2)-1)

#               Used for testing to see value:
#               {print "vID: " vID}

#               If matched then print section divider
         for (vid = 0; vid <= 19; vid++)
           for (vl = 0; vl <= 3; vl++)
             for (vn = 0; vn <=25; vn++)        print "<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">", vdefSep
        }

' file

bathtime · February 25, 2018, 5:39pm

Yes, I will try to ensure all new code is trimmed down. For now, I don't want to mess with the other braces.

Yes, I know no other working alternative.

How eloquent that is! This is the type of thing I was looking for; so many lines saved!

I hadn't known that; my thought was that gsub only executed on a match? Maybe something like (and I've tried to make this work for a while):

# Make a variable for easy access:
IDvar="<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">"

# If that variable exists, then run gsub.
if (/IDvar/) gsub(IDvar, vdefSep)

Thank you. I try to keep speech to a min, so people aren't overwhelmed...

EDIT:
This
{ /<sense id=\"n.*" level/; {vID = substr($2, 1, length($2)-1)}}
is NOT a pattern {action} pair and will change vID with every new input line. Is that intended? Why then the /<sense id=\"n.*" level/ ?

It would actually be the same due to how the xml file stores things, but, that said, you are right; it is not necessary to run and rerun that variable, even if the info is constant. I've taken it out of the for loop and put it just before.

Here is the updated version:

#!/bin/mksh

# This program requires an xml dictionary file to run. If it is not on your machine,
# this program will automatically download it and store in ~/.config/latin/.

# Name this file as 'latin' and run:
#
# $ chmod +x latin
#
# To run:
# $ ./latin amo
#
# To enable internet auto-decline:
# $ ./latin -d amo
#
# To run with only auto-decline:
# $ ./latin -c amo
#
# Where 'amo' is the term searched.

searchTerm=$2

URL="http://www.perseus.tufts.edu/hopper/morph?l=$searchTerm&la=la"

wFIN='<h4 class="la">'
wFOUT='</h4>'
wDefIn='<span class="lemma_definition">'
wDefOut='</span>'
wFormIn='<td class="la">'$searchTerm'</td>'
wFormOut='<td style="font-size: x-small">'

## Code which connects to perseus to attain 1st per. sg. (needed as key for xml file)
if [[ ("$1" == "-d") ]]; then

	searchTerms=$(wget -q -O- "$URL" | mawk -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf substr($0,18, length($0)-22)"\n"; next;}')

elif [[ ("$1" == "-c") ]]; then

	wget -q -O- "$URL" | mawk -v vDefIn="$wDefIn" -v vDefOut="$wDefOut" -v vFormIn="$wFormIn" -v vFormOut="$wFormOut" -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf "\n[ " substr($0,18, length($0)-22)" ]"; next;}   $0 ~ vDefIn,$0 ~ vDefOut {{ if (!/>/) {{$1=$1}1; x+=1; print " "$0"";} }}   $0 ~ vFormIn,$0 ~ vFormOut {{ if (!/td /) {{$1=$1}1;   $0=substr($0,5, length($0)-9); print "-"$0; next;} } }'

else
	searchTerms=$1
fi

if [ "$1" == "-c" ]; then
	exit
fi

XMLfile=Perseus_text_1999.04.0060.xml
XMLdir=~/.config/latin/
XMLlink="http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus:text:1999.04.0060"

if [ ! -e $XMLdir$XMLfile ]; then
        echo "\nFile:" $XMLdir$XMLfile "not found.\n\nDownloading from" $XMLlink "...\n"
	mkdir -p ~/.config/latin
	wget -qO- $XMLlink | tr -d '\r' > $XMLdir$XMLfile
fi

for searchTerm in $searchTerms
do

#echo "Searching for:"$searchTerms

keyIn='key="'$searchTerm'"'	# Which tag shall be searched?
keyOut='</entry>'	#
tagIn='<'		# How are tags to be distinguished?
tagOut='>'		#
keySepA=''		# Separates the main word from its roots
keySepB=','		#
etySepA='['		# Etymology left
etySepB=']\n\n — '	# Etymology right
defSep='\n\n '          # Separates individual definitions
emSep='\n\n — '		# Separates em-dashes

#echo $keyIn

# First concatenate the result into a usable string
awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" ' $0 ~ vkeyIn, $0 ~ vkeyOut {printf $0; }' $XMLdir$XMLfile |
awk -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '

	# Separation after main key word
	{ gsub("<orth>", vkeySepA) }
	{ gsub("</orth>", vkeySepB) }

	# Add separation for several variations of definitions
	#{gsub(/<etym lang="la" opt="n">/, vetySepA)}
	# Testing
	{ gsub(/<sense id.*><etym lang="la" opt="n">/, vetySepA) }
	{ gsub(/<\/etym>\. —<\/sense>/, "]") }
	{ gsub(/<\/etym>\, <trans opt="n">/, vetySepB) }
	{ gsub(/<\/etym>\.—/, vetySepB) }
	{ gsub(/<\/etym>\. /, "]") }

	# Get rid of potential extra definition markers
	{ gsub(/\.—<\/sense>/, ".") }
	{ gsub(/\.— <\/sense>/, ".") }
	{ gsub(/\. — <\/sense>/, ".") }
	{ gsub(/<\/usg>—<\/sense>/, ".") }

	{ vID = substr($2, 1, length($2)-1) }

BEGIN   { split ("1 2 3 4 5 6 7 8 9 10 I II III IV IV. V V. A B C C. D E F G H", VNARR)
         VNARR[0] = 0
        }

        {

	#If matched then print section divider
	for (vid = 0; vid <= 19; vid++)
	  for (vl = 0; vl <= 3; vl++)
	    for (vn = 0; vn <=26; vn++) {

		#IDvar="<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">"
		#print IDvar

		gsub("<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">", vdefSep )

		}
	}

	# Add missing dot after gender
	{ gsub(/<\/gen>/, ". ") }

	# Collapse all remaining tags
	{ gsub(tagIn "[^" tagOut "]*" tagOut, "") }

	# Separate em-dash text
	{ if ((!/—\\,/) && (!/[a-zA-Z]—/) && (!/ —/)) {gsub (/—/, vemSep) }}
        { if ((!/—\\,/) ) {gsub (/\.—/, "." vemSep)}}
        { gsub (/ — /, vemSep)}
	{ if (!/—\\,/) {gsub (/\.—/, "." vemSep)}}

	# Remove double spaces and spaces between certain characters
	{ gsub(/ +/,  " ") }
	{ gsub(/ ,/,  ",") }
	{ gsub(/\( /, "(") }
	{ gsub(/ \)/, ")") }
	{ gsub(/ \./, ".") }
	{ gsub(/ \:/, ":") }
	{ gsub(/ \?/, "?") }
	{ gsub(/\� /, "�") }
	{ gsub(/ \�/, "�") }
	{ gsub(/^ /,  "" ) }
	{ gsub(/\.\.\. /, "...") }
	{ NF }

{ print "\n" $0 "\n" }'

done

I had made a version with such great notes, but upon finishing it, there was an error which I could not fix. Likely I lost a bracket somewhere.

Once again, thank you all. I am still (always) open to any other suggestions.

*EDIT*

Updated script: XML dictionary file is now automatically downloaded to ~/.config/latin/ if not present. There is no manual downloading required. Just run the script and all is done automatically.

RudiC · February 26, 2018, 3:19am

As already proposed, if you describe what is needed someone could come up with some nifty trick e.g. regex. Pls be aware that if the substitution has taken place in the first loop, another 2079 loops will be executed nevertheless.

I hadn't known that; my thought was that gsub only executed on a match? Maybe something like (and I've tried to make this work for a while):
# Make a variable for easy access:
IDvar="<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">"

# If that variable exists, then run gsub.
if (/IDvar/) gsub(IDvar, vdefSep)

gsub analyses the input line / variable char by char for a match, as do the matching operators, e.g. /.../, so testing for a match before calling gsub parses the input twice. Unnecessary, and costly, esp. for lengthy lines. If you're sure there is only one single match, use sub to stop after that match. BTW, /IDvar/ looks for exactly that literal string, "IDvar", verbatim.
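A short sketch of both points, assuming a single made-up input line. /IDvar/ searches for the five letters themselves, while $0 ~ IDvar uses the variable's contents as the regex. sub and gsub return the number of replacements made, so a separate match test in front of them is not needed.

```shell
# Hypothetical input line; the [SEP] replacement text is also made up.
out=$(printf '%s\n' '<sense id="n1.0" level="0" n="0" opt="n">text' | awk '{
    IDvar = "<sense id=\"n1\\.0\" level=\"0\" n=\"0\" opt=\"n\">"
    literal = (/IDvar/) ? "yes" : "no"      # looks for the letters I-D-v-a-r
    dynamic = ($0 ~ IDvar) ? "yes" : "no"   # uses the contents of IDvar as regex
    n = sub(IDvar, "[SEP]")                 # number of replacements: 0 or 1
    print literal, dynamic, n, $0
}')
printf '%s\n' "$out"
```

If the pattern were absent, sub would simply return 0 and leave $0 alone.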

I wasn't asking for a romantic novel, but for a meaningful explanation / formulation of the central problem(s).

There are editors in them 'thar hills that allow for checking for e.g. unpaired brackets, braces, parentheses. Clever indentation also helps.

bathtime · February 26, 2018, 5:40am

Issue fixed. With one line of code:

gsub(vdefTagIn "[^" vdefTagOut "]*" vdefTagOut, vdefSep)

where the tags were defined as '<sense' and '>'. No more need for 2079 loops of madness.
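For illustration, a sketch of that one-liner on a made-up line holding two sense tags. The variable names follow the script; the sample text and the ' | ' separator are assumptions chosen to keep the output on one line. The concatenated pattern is <sense[^>]*>, i.e. the opening text, then anything that is not the closing character, then the closing character.

```shell
# Hypothetical sample; the real script passes '\n\n ' as the separator.
line='<sense id="n1.0" level="0" n="0" opt="n">first<sense id="n1.1" level="1" n="I" opt="n">second'

out=$(printf '%s\n' "$line" |
awk -v vdefTagIn='<sense' -v vdefTagOut='>' -v vdefSep=' | ' '{
    # One pass replaces every sense tag, whatever its id, level and n values.
    gsub(vdefTagIn "[^" vdefTagOut "]*" vdefTagOut, vdefSep)
    print
}')
printf '%s\n' "$out"
```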

Anyways, nothing was wasted; all the ideas posted will help in future scripting.

As of now, I will be working on merging some gsub commands with regex tricks.

Oh-and about the braces, when I removed certain ones the program would not operate correctly; it would scatter text and such. I just added a brace between the beginning of the program (after the variables) and before { print $0 }, and I was able to remove all the other braces!

If anyone is interested:

latin:

#!/bin/mksh

# This program requires an xml dictionary file to run. If it is not on your machine,
# this program will automatically download it and store in ~/.config/latin/.

# Name this file as 'latin' and run:
#
# $ chmod +x latin
#
# To run:
# $ ./latin amo
#
# To enable internet auto-decline:
# $ ./latin -d amo
#
# To run with only auto-decline:
# $ ./latin -c amo
#
# Where 'amo' is the term searched.

key=$2

URL="http://www.perseus.tufts.edu/hopper/morph?l=$key&la=la"

wFIN='<h4 class="la">'
wFOUT='</h4>'
wDefIn='<span class="lemma_definition">'
wDefOut='</span>'
wFormIn='<td class="la">'$key'</td>'
wFormOut='<td style="font-size: x-small">'

## Code which connects to perseus to attain 1st per. sg. (needed as key for xml file)
if [[ ("$1" == "-d") ]]; then

	searchTerms=$(wget -q -O- "$URL" | mawk -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf substr($0,18, length($0)-22)"\n"; next;}')

elif [[ ("$1" == "-c") ]]; then

	wget -q -O- "$URL" | mawk -v vDefIn="$wDefIn" -v vDefOut="$wDefOut" -v vFormIn="$wFormIn" -v vFormOut="$wFormOut" -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf "\n[ " substr($0,18, length($0)-22)" ]"; next;}   $0 ~ vDefIn,$0 ~ vDefOut {{ if (!/>/) {{$1=$1}1; x+=1; print " "$0"";} }}   $0 ~ vFormIn,$0 ~ vFormOut {{ if (!/td /) {{$1=$1}1;   $0=substr($0,5, length($0)-9); print "-"$0; next;} } }'

else
	searchTerms=$1
fi

if [ "$1" == "-c" ]; then
	exit
fi

XMLfile=Perseus_text_1999.04.0060.xml
XMLdir=~/.config/latin/
XMLlink="http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus:text:1999.04.0060"

if [ ! -e $XMLdir$XMLfile ]; then
        echo "\nFile:" $XMLdir$XMLfile "not found.\n\nDownloading from" $XMLlink "...\n"
	mkdir -p ~/.config/latin
	wget -qO- $XMLlink | tr -d '\r' > $XMLdir$XMLfile
fi

for key in $searchTerms; do

keyIn='key="'$key'"'	# Which tag shall be searched?
keyOut='</entry>'	#
tagIn='<'		# How are tags to be distinguished?
tagOut='>'		#
defTagIn='<sense'	# How are definitions defined?
defTagOut='>'
keySepA=''		# Separates the main word from its roots
keySepB=','		#
etySepA='['		# Etymology left
etySepB=']\n\n — '	# Etymology right
defSep='\n\n '          # Separates individual definitions
emSep='\n\n — '		# Separates em-dashes

# First concatenate the result into a usable string
awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" ' $0 ~ vkeyIn, $0 ~ vkeyOut {printf $0; }' $XMLdir$XMLfile |
awk -v vdefTagIn="$defTagIn" -v vdefTagOut="$defTagOut" -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '
{
	# Separation after main key word
	gsub("<orth>", vkeySepA)
	gsub("</orth>", vkeySepB)

	# Add separation for several variations of definitions
	#gsub(/<etym lang="la" opt="n">/, vetySepA)
	gsub(/<sense id.*><etym lang="la" opt="n">/, vetySepA)
	gsub(/<\/etym>\. -<\/sense>/, "]")
	gsub(/<\/etym>\, <trans opt="n">/, vetySepB)
	gsub(/<\/etym>\.-/, vetySepB)
	gsub(/<\/etym>\. /, "]")

	# Get rid of potential extra definition markers
	gsub(/\.-<\/sense>/, ".")
	gsub(/\.- <\/sense>/, ".")
	gsub(/\. - <\/sense>/, ".")
	gsub(/<\/usg>-<\/sense>/, ".")
	gsub(/<\/usg> -<\/sense>/, ".")

	# Add missing dot after gender
	gsub(/<\/gen>/, ". ")

	# Collapse all definition tags and add formatting in their place
	gsub(vdefTagIn "[^" vdefTagOut "]*" vdefTagOut, vdefSep)

	# Collapse all remaining tags
	gsub(tagIn "[^" tagOut "]*" tagOut, "")

	# Separate em-dash text
	if ((!/-\\,/) && (!/[a-zA-Z]-/) && (!/ -/)) gsub (/-/, vemSep)
        if ((!/-\\,/) ) gsub (/\.-/, "." vemSep)
        gsub (/ - /, vemSep)
	gsub (/ -/, vemSep)
	if (!/-\\,/) gsub (/\.-/, "." vemSep)

	# Remove double spaces and spaces between certain characters
	gsub(/ +/,  " ")
	gsub(/ ,/,  ",")
	gsub(/\( /, "(")
	gsub(/ \)/, ")")
	gsub(/ \./, ".")
	gsub(/ \:/, ":")
	gsub(/ \?/, "?")
	gsub(/\� /, "�")
	gsub(/ \'/, "'")
	gsub(/^ /,  "" )
	gsub(/\.\.\. /, "...")

}

{ print "\n" $0 "\n" } '

done

Don_Cragun · February 26, 2018, 6:27am

You still have 2 invocations of awk when you only need 1 and you still have too many braces in your 2nd awk script. Try changing:

awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" ' $0 ~ vkeyIn, $0 ~ vkeyOut {printf $0; }' $XMLdir$XMLfile |
awk -v vdefTagIn="$defTagIn" -v vdefTagOut="$defTagOut" -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '
{
	# Separation after main key word
	gsub("<orth>", vkeySepA)
... ... ...
	gsub(/\.\.\. /, "...")

}

{ print "\n" $0 "\n" } '

to:

awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" -v vdefTagIn="$defTagIn" -v vdefTagOut="$defTagOut" -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '
$0 ~ vkeyIn, $0 ~ vkeyOut {
	# Separation after main key word
	gsub("<orth>", vkeySepA)
... ... ...
	gsub(/\.\.\. /, "...")
	print "\n" $0 "\n"
}' $XMLdir$XMLfile

It should give you exactly the same results with a single awk instead of two awks piped together.

bathtime · February 26, 2018, 7:18am

I've been wanting to merge the two awks into one, but have not been successful. This code does not make the program run as intended. It just gives a splash of ongoing text from the xml file.

The reason I used two awks is because the xml file has text that is all broken up between lines; I needed first to concatenate those lines into one line (only the ones of the key phrase), and then that line is easy to edit in the second awk; else, I would be having to edit one line of text between several lines, and that is beyond my knowledge at this point. If there is a way to first do one task (concatenate the text), and then do another (the rest of the text manipulation with that concatenated text), that would be great. I've tried several variations and have not been successful.

RudiC · February 26, 2018, 7:31am

Untested: Instead of printf $0 in the first awk script, concatenate $0 to a working variable, like WRK = WRK " " $0 , then assign WRK back to $0 for the further processing.
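A rough sketch of that idea on a tiny made-up input; the entries and the key are hypothetical. Lines inside the range are appended to WRK (without a separator here, to mirror the original printf $0), and the END block copies WRK back into $0 so the usual gsub calls can work on one long line.

```shell
# Hypothetical stand-in for the dictionary file.
input='<entry key="amo">
<orth>amo</orth>
</entry>
<entry key="other">
</entry>'

out=$(printf '%s\n' "$input" |
awk -v vkeyIn='key="amo"' -v vkeyOut='</entry>' '
# Collect every line of the wanted entry into one string.
$0 ~ vkeyIn, $0 ~ vkeyOut { WRK = WRK $0; next }

END {
    $0 = WRK                 # further processing now sees a single line
    gsub(/<[^>]*>/, "")      # e.g. strip all tags
    print
}')
printf '%s\n' "$out"
```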

Untested, and a hint only: Methinks replacing the above five lines with

# Get rid of potential extra definition markers
     gsub (/(\.|<\/usg>) ?— ?<\/sense>/,    ".")

yields the same result. Same might be true for other opportunities.
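A quick way to check such a merged pattern is to feed it one made-up line per variant. In this sketch a plain hyphen stands in for the dash character, which is an assumption; the five sample lines are hypothetical too.

```shell
# One hypothetical line for each of the five separate gsub patterns.
out=$(printf '%s\n' \
    'a.-</sense>' \
    'b.- </sense>' \
    'c. - </sense>' \
    'd</usg>-</sense>' \
    'e</usg> -</sense>' |
awk '{
    # Alternation covers both prefixes; " ?" makes each space optional.
    gsub(/(\.|<\/usg>) ?- ?<\/sense>/, ".")
    print
}')
printf '%s\n' "$out"
```

Each line should come out as its letter followed by a single dot.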

Don_Cragun · February 27, 2018, 2:34am

Did you figure out what needs to be done based on what RudiC suggested, or do you still need help completing it?

bathtime · February 27, 2018, 7:02am

Works perfectly!

awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" -v vdefTagIn="$defTagIn" -v vdefTagOut="$defTagOut" -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '
$0 ~ vkeyIn, $0 ~ vkeyOut { WRK = WRK $0; next; }END{

	$0 = WRK

	# Separation after main key word
	sub(/<orth>/, vkeySepA)
	sub(/<\/orth>/, vkeySepB)

... ... ...

	gsub(/^ /,  "" )
	gsub(/\.\.\. /, "...")

	print "\n" $0 "\n"

}  ' $XMLdir$XMLfile

Untested, and a hint only: Methinks replacing the above five lines with
# Get rid of potential extra definition markers
   gsub (/(\.|<\/usg>) ?- ?<\/sense>/,    ".")
yields the same result. Same might be true for other opportunities.

Hmmm, a 5:1 reduction in code-not bad. :rolleyes:

Yes, above. I've been reading the GNU Awk User's Guide (https://www.gnu.org/software/gawk/manual/gawk.html), so I haven't been doing as much coding.

Though atm, I do need help with using an array/variable in SUB. Rudi had pointed out in post #4 https://www.unix.com/303013671-post4.html a working solution which involved this, but I just could not break it down and make it execute properly:

I am trying to use an array of strings/variables (which is working fine) to insert into sub and be replaced (which is not working):

# Separation after main key word
# sub(/<orth>/, vkeySepA)

{ split ("<orth> "vkeySepA"", VNARR)
		VNARR[0] = 0

		# Will use when other issues are sorted
		# for (a = 1; a <=20; a+=2)

		# VNARR seems to print fine
		print "1: " VNARR[1] "\n2: " VNARR[2]

		# Faulty code below; does not match the data
		sub(VNARR[1], VNARR[2])

}

Don_Cragun · February 27, 2018, 11:41am

I'm very happy that you got this to work for you.

But, please stop trying to hide the logic in your code! Make it obvious. Change:

$0 ~ vkeyIn, $0 ~ vkeyOut { WRK = WRK $0; next; }END{

	$0 = WRK

to:

$0 ~ vkeyIn, $0 ~ vkeyOut { WRK = WRK $0; next}

END {
	$0 = WRK

Though atm, I do need help with using an array/variable in SUB. Rudi had pointed out in post #4 https://www.unix.com/303013671-post4.html a working solution which involved this, but I just could not break it down and make it execute properly:

I am trying to use an array of strings/variables (which is working fine) to insert into sub and be replaced (which is not working):
# Separation after main key word
# sub(/<orth>/, vkeySepA)

{ split ("<orth> "vkeySepA"", VNARR)
		VNARR[0] = 0

		# Will use when other issues are sorted
		# for (a = 1; a <=20; a+=2)

		# VNARR seems to print fine
		print "1: " VNARR[1] "\n2: " VNARR[2]

		# Faulty code below; does not match the data
		sub(VNARR[1], VNARR[2])

}

It isn't immediately obvious what isn't working in this example and you don't give us any indication of what you think it is doing wrong.

The call to split() could be rewritten more simply as:

split("<orth> "vkeySepA, VNARR)

and always give you identical results. Given that vkeySepA has been defined to be an empty string in your earlier code and assuming that it still is when you ran this, one might note that the call above would return 1 (not the 2 that you seem to be assuming). But since unassigned array elements (like any other unassigned variables) will have a 0 value if used as a number or an empty string value if used as a string, that won't make any difference in this case. Your calls to sub() with those array values should change the first occurrence of the string <orth> in $0 to an empty string.

Note that the <space> after <orth> will be treated as a field separator, not as part of the string to be replaced. Note also that with many of your search patterns (many of which contain <space>s) and replacement patterns (many of which contain <space>s), using code like the above will give you more than 2 fields in the created array unless you use a different array element separator and add an ERE to your split() call specifying the character(s) in your separator as the element separator.
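The behaviour described above can be seen in a small sketch. The "|" separator is an assumption picked for illustration; any character that never occurs in the patterns or replacements would do.

```shell
out=$(awk 'BEGIN {
    vkeySepA = ""                              # empty, as in the script
    n1 = split("<orth> " vkeySepA, A)          # default: whitespace separates
    n2 = split("<orth>|" vkeySepA, B, /[|]/)   # explicit ERE separator
    n3 = split("a b|c d", C, /[|]/)            # spaces stay inside elements
    # A[2] was never assigned, so it reads as an empty string.
    print n1, n2, n3, "[" A[2] "]", "[" C[1] "]"
}')
printf '%s\n' "$out"
```

With the default separator the trailing space is dropped and only one element is created; with the explicit separator the empty second element is kept and embedded spaces survive.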

bathtime · February 28, 2018, 5:13pm

Was not intentional. Just a thing of habit.

Yes, I have to be more clear. I got it working though. It was a silly mistake of not watching the correct text that I was replacing; it was actually working all along. :rolleyes:

Yes, I found this to be the case. I've decided to keep the code as it was for now; this is a little too deep for me atm and not entirely necessary for the program to work... As for the other code, I've shortened it up a bit.

Just a note: I've started to make this program in C++; it seems that it could benefit from the speed and features. Already, the C++ program can open a file and extract, replace, and print a few text combinations; so you may not see me posting for a little bit in the Shell Programming Forum, but I will still be using awk for the many things that it can do!

RudiC · March 1, 2018, 3:06am

This is my personal feeling: It is well known that any dedicated compiled program, be it C, C++, Pascal, or other, usually benefits from increased execution speed compared to e.g. scripts. But there is a tradeoff in terms of flexibility vs. e.g. awk or perl, esp. when it comes to text analysis and processing, and adapting / modelling algorithms, for which those were specifically targeted / designed.
I'd be very interested in any results comparing execution times of your C++ with an equivalent awk script, as they both will use the same regex system calls.

bathtime · March 12, 2018, 12:47pm

I hadn't seen this post, else I'd have responded sooner...

I'm not sure how to set a timer within the awk program to just time the replacing time itself, so the results will be based on the programs opening/searching for the key/replacing text/closing.

The word 'ad' was used as it has a long definition and is close to the beginning of the file (less search time).

The timer program:

#!/bin/mksh
#
# Run:
#
# $ ./bench.sh '<program> <parameters>' <number of iterations> 
#
# Ex.:
#
# $ ./bench.sh 'lat ad' 1000
#

echo "\nRunning \""$1"\" for" $2 "iterations.\n\nPlease wait...\n"

i=0

# -lt rather than -le: starting from i=0, the body then runs exactly $2 times
time while [ $i -lt $2 ]; do /home/user/scripts/$1; i=$(($i + 1)); done

The results:

Awk (regex):
1000x @ 2m08.22s real 2m04.66s user 0m05.01s system

C++ (regex):
1000x @ 15.24s real 0m13.44s user 0m01.53s system (8.4x faster than Awk)

C++ (custom search and replace - I'll post this code on the forum)
1000x @ 3.37s real 0m02.54s user 0m00.62s system (4.5x faster than C++ regex)

Timed within the programs themselves and only timing the search and replace process:

C++ (regex)
1000x @ 12.26s real 0m05.82s user 0m00.00s system
Ram used: 3mb

C++ (custom)
1000x @ 1.66s real 0m01.66s user 0m00.00s system
Ram used: > 1mb