awk Associative Array and/or Referring to Field by String (Nonconstant String Value)

jvoot · January 31, 2019, 11:22pm

I will start with an example of what I'm trying to do and then describe how I am approaching the issue.

File

 PS028,005 [JHRS-<Pr>] [ABC <Ob>]
 Lexeme     HRS       # M      #
 PhraseType  1(1:1) 7(7)
 PhraseLab  501[0]      503[0]
 ClauseType ZYq0

 PS028,005 [W-<Cj>] [L> <Ng>] [JBN-<Pr>] [XYZ <Ob>]
 Lexeme     W      # L>      # BNH      # M      #
 PhraseType  6(6) 11(11) 1(1:1) 7(7)
 PhraseLab  509[0]   510[0]    501[0]     503[0]
 ClauseType WxY0

Desired Output

 PS028,005 ABC

 PS028,005 XYZ

I would also be happy with the following where I can strip things off by piping into sed :

 PS028,005 [ABC <Ob>]

 PS028,005 [XYZ <Ob>]

In essence, when a line begins with /^ PS/ then print $1 of that line along with the string between strings "[" and "<Ob>]". I can use sed to get the string between "[" and "<Ob>]" but I cannot get $1 (when $1 ~/^ PS/) to print along with it.

I have attempted:

awk '/^ PS/{print $1, $(/\[.*\<Ob\>\]/)}' File

Here I am attempting to use a nonconstant field number, however this seems to print the entire line containing the matching string in question.

Another attempt has been this:

awk '/^PS/{a = $1; $2 = /\[.*\<Ob\>\]/}{print a,$2}' File

Finally, I have tried to utilize an array, and must admit that even after reading the man awk page, I still find these confusing.

awk 'BEGIN{a[NR]=$0}{if(/\[.*\<Ob\>\]/ in a && $1 ~/^ PS/) print}' File

Obviously, none of these has worked. I would greatly appreciate any help on what should be a relatively easy bit of code that I'm just not getting. Thanks in advance.

Scrutinizer · January 31, 2019, 11:43pm

Hi, try using the square brackets as field separators, for example:

awk -F '[][]' '
  $1~/^[ \t]*PS/ {
    for(i=2; i<=NF; i+=2)
      if($i~/<Ob>/) {
        split($i,F," ")
        print $1 F[1]
        next
      }
  }
' file

The code could perhaps be simplified if the file is always structured in a certain way, for instance if <Ob> always occurs in the last field:

awk -F '[][]' '
  $1~/^[ \t]*PS/ && $(NF-1)~/<Ob>/ {
    split($(NF-1),F," ")
    print $1 F[1]
  }
' file

And in which case you could probably also do it without adjusting the field separators:

awk '$1~/^PS/ && $NF~/<Ob>/ { 
  sub(/\[/,"",$(NF-1))
  print $1, $(NF-1)
}' file

jvoot · February 1, 2019, 12:06am

Deleted.

Don_Cragun · February 1, 2019, 12:09am

In your sample data, the [string <Ob>] always appears at the end of the line that starts with <space>s immediately followed by PS . Is that also true in your real data? If it is, we can simplify the code Scrutinizer suggested to something like:

awk '$1 ~ /^PS/ {sub(/\[/, "", $(NF - 1));print $1, $(NF - 1)}' file

or:

awk '$1 ~ /^PS/ {print $1, substr($(NF - 1), 2)}' file

Scrutinizer · February 1, 2019, 12:13am

I added some additional approaches to my post and removed an extra variable (used for debugging) from the first example. Note also that in your example there is a space between the brackets in the field separator that should not be there.

jvoot · February 1, 2019, 12:18am

Thanks so much Scrutinizer. It looked like it was printing out some manner of counter (possibly string length?) as the first field of every line. I adjusted your code slightly and also, for simplicity's sake, took out the leading space in the input file. I also needed to transcribe your code to a one-liner, as I was passing output into it via a pipe (I presented it as a file above for simplicity's sake).

Thus, your code transcribed awk -F '[][]' '{for(i=2; i<=NF; i+=2) if($i~/<Ob>/){split($i,F," "); print i,$1 F[1]; next}}' gave me this:

4  PS028,005 M
8  PS028,005 M

I adjusted to awk -F '[][]' '{for(i=2; i<=NF; i+=2) if($i~/<Ob>/){split($i,F," "); print $1 F[1]; next}}' and while I haven't investigated in detail, that seems to have done the trick. Thanks so much!

Unfortunately no Don, the string with <Ob> can appear anywhere in the line. Nevertheless, I did a bit of an adjustment to Scrutinizer's code and it seems to be working very well. Thank you so much Don.

Scrutinizer · February 1, 2019, 12:20am

You are welcome. You do not need to use a one-liner, BTW. You could do this:

INPUT |
awk ...
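
A minimal sketch of that layout, assuming a printf command stands in for whatever produces INPUT (the printf call and the out variable are illustrative only; the sample lines come from this thread). The awk program is the first approach from above, spread over several lines while still reading from the pipe:

```shell
# printf stands in for the real INPUT command (an assumption for illustration).
out=$(
  printf '%s\n' ' PS028,005 [JHRS-<Pr>] [ABC <Ob>]' ' Lexeme     HRS       # M      #' |
  awk -F '[][]' '
    # Only look at records whose first field starts with PS.
    $1 ~ /^[ \t]*PS/ {
      # Even-numbered fields hold the text between [ and ].
      for (i = 2; i <= NF; i += 2)
        if ($i ~ /<Ob>/) {
          split($i, F, " ")
          print $1 F[1]
          next
        }
    }
  '
)
printf '%s\n' "$out"
```

The shell keeps reading the awk program until the closing single quote, so the pipe symbol at the end of the first line is all that is needed to continue the command.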

Don_Cragun · February 1, 2019, 12:37am

One could still try:

awk '$1 ~ /^PS/ {for(i=3; i<=NF; i++) if($i == "<Ob>]"){print $1,substr($(i-1), 2); next}}' file

without needing to use split() (unless I misunderstood and you changed your input file format to remove the <space> before the <Ob>] ).

jvoot · February 1, 2019, 1:02am

I'm so sorry Scrutinizer, but as my input is many thousands of lines long, I did not notice a potential complicating issue that I was hoping you could help me address. There are times when the desired string between an initial "[" and "<Ob>]" contains a space.

So for example, given:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]
 Lexeme     KJ      # CM<      # QWL TXNWN J      #
 PhraseType  6(6) 1(1:2) 2(2.1,2.1,7)
 PhraseLab  509[0]    501[0]     503[0]
 ClauseType xQt0

Which I would pare down with INPUT | awk '$1 ~/^ PS/' to get:

PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]

In this case, the desired output would be:

PS028,006 QWL TXNWNJ-

or

PS028,006 [QWL TXNWNJ- <Ob>]

The code you helped me with only gives:

PS028,006 QWL

Again, I apologize that I did not see the possibility of the space within the desired string until I double-checked the output against INPUT | sed -e 's/.* \[\(.*\) <Ob>\].*/\1/' which gives me the desired string but not the $1 when $1 ~/^ PS/.

Would you be able to help me iron this out?

This works well, Don, except that I represented the desired output strings as "ABC" and "XYZ", which it seems you took as being three-character strings. I should have been more specific and said that "ABC" and "XYZ" represent strings of any length. Thus something like ["some amount of text" <Ob>] .

Don_Cragun · February 1, 2019, 1:35am

OK... One final attempt...

Based on your single sample latest input file, the following seems to do what you want and will at least show you lines where it wasn't able to match:

awk '
$1 ~ /^PS/ {
	if(match($0, /[[][^[]* <Ob>[]]/))
		print $1, substr($0, RSTART + 1, RLENGTH - 7)
	else
		print "No Match Found on line " NR, $0
}' file
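
To see where the offsets in the substr() call come from, here is a small sketch that prints RSTART and RLENGTH next to the extracted text for the sample line. The line and out shell variables are just illustrative names; the regular expression is the one used above:

```shell
line=' PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]'
out=$(
  printf '%s\n' "$line" |
  awk '{
    if (match($0, /[[][^[]* <Ob>[]]/)) {
      # The match spans "[" through " <Ob>]": skip 1 leading character and
      # drop 7 in total (1 for "[" plus 6 for " <Ob>]").
      print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 7)
    }
  }'
)
printf '%s\n' "$out"
```

For this line the match starts at character 33 and is 18 characters long, so the substr() call returns the 11 characters QWL TXNWNJ- including the embedded space.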

RudiC · February 1, 2019, 4:28am

Try also

awk -F"[][]" '/^ *PS.*<Ob>/ {sub(/ *<Ob>.*$/, ""); print $1, $NF}' file
 PS028,005  ABC 
 PS028,005  XYZ 
 PS028,006  QWL TXNWNJ-

jvoot · February 1, 2019, 7:30pm

That did it RudiC! Such a simple and elegant way to accomplish it! Thanks so much also to Scrutinizer and Don Cragun for your help!


If I may, could I please ask two questions about how this code is working? The first is about the field separator value. The man AWK page seems to only imply rather than being explicit that the use of the square brackets when setting the field separator from the command line tells AWK to interpret what is between them as a regex rather than simply a fixed string which would otherwise be indicated by "..."? Is this correct?

Secondly, since the value for FS has been set to "][", how come {print $1} does not print from the beginning of the line to the first instance of "][" but rather prints what would be $1 when FS is set to whitespace? In other words, given:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]
 Lexeme     KJ      # CM<      # QWL TXNWN J      #
 PhraseType  6(6) 1(1:2) 2(2.1,2.1,7)
 PhraseLab  509[0]    501[0]     503[0]
 ClauseType xQt0

Why does RudiC's code not give: PS028,006 [KJ <Cj> for {print $1} if FS is set to "]["?

Rather it gives the (desired) first field if FS was at default PS028,006 ?

Thanks again!

Don_Cragun · February 1, 2019, 8:20pm

Hi jvoot,
The standards clearly state that the value of the awk FS variable is an extended regular expression and it doesn't matter whether it is set using the -F option, using the -v option, using an assignment statement between pathname operands, or using an assignment statement in the awk script itself. When the ERE is set to [][] that is a bracket expression that specifies that the <open-square-bracket> character ( [ ) and the <close-square-bracket> character ( ] ) are each to be treated as separate field separators.

With the FS value RudiC used, field 1 is everything that appears in the record before the 1st open or close square bracket character (including the leading and trailing <space>). I chose to use the default FS value because I didn't think you wanted the leading and trailing <space> characters at the start of lines in your input data to be included in your output.

Hope this helps,
Don

RudiC · February 2, 2019, 4:20am

You are partly right: the field separator string is always interpreted as a regex. In Scrutinizer's proposal (from which I stole shamelessly), he uses the bracket expression [][] .
As man regex explains, to include a literal ] in a bracket expression, make it the first character of the list.

So awk splits the input line at any occurrence of either [ or ] .

BTW, awk 's default FS is a bracket expression regular expression (/[ \t\n]+/) by itself.

As for your second question: it does. Please apply what has been said to the respective line:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]
^          ^       ^ ^        ^ ^                ^--- last separator; $NF is empty
|          +-------+-+--------+-+-------------------- all FS
+---------------------------------------------------- field 1

Is that clearer now? If you want to remove the leading space from field 1, additional measures must be taken.
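
One way to check the diagram, sketched here with the sample line from this thread (the line and out shell variables are illustrative names only), is to print every field wrapped in parentheses so that spaces and empty fields become visible:

```shell
line=' PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]'
out=$(
  printf '%s\n' "$line" |
  awk -F '[][]' '{
    # Wrap every field in parentheses so spaces and empty fields are visible.
    for (i = 1; i <= NF; i++)
      printf "%d=(%s)\n", i, $i
  }'
)
printf '%s\n' "$out"
```

This should report seven fields: field 1 with its leading and trailing space, the bracket contents in fields 2, 4 and 6, a single space in fields 3 and 5, and an empty field 7 after the last ] .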

Don_Cragun · February 2, 2019, 5:20am

That the default FS is the regular expression /[ \t\n]+/ is a common misconception. With the input we have been discussing in this thread:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]

if that were the default ERE used for separating fields, the default first field would be the empty string before the space at the start of the line. But the actual default first field is PS028,006 (with no leading or trailing <space>s).

The actual default FS value is a single <space> character which is a regex that has a special meaning in awk (i.e., it does not have this special meaning in most other utilities). It is the only utility in the standards where <space> has this special meaning in an ERE used as a field separator. In awk , when an entire field separator ERE is a single <space> character, awk is required to skip leading and trailing <blank> and <newline> characters (where a <blank> character is any character in the current Locale's blank character class) and then fields shall be delimited by sets of one or more <blank> or <newline> characters. In the C and POSIX locales, a <blank> is either a <space> character or a <tab> character; in other locales additional characters may also be included in the list of characters in the blank character class (thereby being ignored at the start and end of a record and being treated as additional elements in field separators in other places).
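
The difference can be sketched by running the same line through awk twice, once with the default FS and once with the bracket expression given explicitly as an ERE (the shell variable names here are illustrative only):

```shell
line=' PS028,006 [KJ <Cj>]'
# Default FS (a single space): leading blanks are skipped.
default_first=$(printf '%s\n' "$line" | awk '{ print "(" $1 ")" }')
# An explicit ERE: the leading space is a separator, so field 1 is empty.
regex_first=$(printf '%s\n' "$line" | awk -F '[ \t\n]+' '{ print "(" $1 ")" }')
printf '%s\n%s\n' "$default_first" "$regex_first"
```

With the default FS the first field is (PS028,006); with the explicit ERE it is the empty string, shown as ().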

RudiC · February 2, 2019, 7:08am

Thanks, Don Cragun, for this clarification.
Indeed, man gawk is way more explicit about this than my man mawk, which I used in my above post; man gawk does not have the statement I relied on.

Scrutinizer · February 2, 2019, 8:14am

Note that with the default FS=" ", skipping and delimiting on both blanks and newlines used to be different in older POSIX implementations, where blanks were used but not newlines. mawk and gawk still support this older POSIX-defined behavior through special compatibility command-line options.

compare:

$> echo "1.   222   333.
444.   555.666" | mawk '{print $1}' RS=.
1
222
444
555
666
$>

to

$> echo "1.   222   333.
444.   555.666" | mawk -W posix_space '{print $1}' RS=.
1
222

444
555
666

$>

Likewise for gawk with the --posix option.

jvoot · February 4, 2019, 3:16pm

OK, thank you so much. I was under the impression that the field separator value was set to the *string* "][" rather than "]" or "[", thus I thought that $1 in the code would have been PS028,006 [KJ <Cj> , rather than PS028,006 . This was very helpful. Thank you for taking the time to explain this.