Context for use of [.symbol.] awk notation

Hi

Just wondering ... do you have an example of a context that demonstrates how the awk notation [.symbol.] can be used effectively?

Thx :rolleyes:

Without context I'm not sure what you mean.

["Associative arrays"]? Removing duplicates in a list of a few million unsorted items or less is one very common use. awk '! ($0 in A) { A[$0] ; print }'

/[rR]egular [eE]xpressions.?/ Any time you need to match a set of characters. Like stripping non-alphanumeric / non-space characters.

awk '{ gsub(/[^a-zA-Z0-9\r\n\t ]+/, ""); } 1'
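To make both one-liners concrete, here is a small sketch run against made-up sample input. The sample strings and the shell variable names `deduped` and `stripped` are only for illustration, and the array is called `seen` here instead of `A`:

```shell
# Keep only the first occurrence of each line, preserving input order.
# Referencing seen[$0] is enough to create the array element.
deduped=$(printf 'b\na\nb\nc\na\n' | awk '!($0 in seen) { seen[$0]; print }')
printf '%s\n' "$deduped"    # prints b, a, c on separate lines

# Strip every run of characters that is not alphanumeric or whitespace.
stripped=$(printf 'Hello, World! 42\n' | awk '{ gsub(/[^a-zA-Z0-9\r\n\t ]+/, "") } 1')
printf '%s\n' "$stripped"   # prints: Hello World 42
```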

Hi Corona,

Thanks for your time, but I already know how associative arrays work. :wink:

In fact I was referring to, and wondering about, the "collating" notation mentioned here

It says "A collating symbol is a multi-character sequence that should be treated as a unit"

So if [.my_word.] is processed more or less the same way as /my_word/, I don't see the added value of this specific notation, and I was wondering what is behind "treated as a unit" ...

If someone has a good example of a context in which this notation is necessary, I would be glad to have a look at it, because I think I'm missing something here. :o

Oh, that's a new one on me.

It looks like an internationalization feature: awk's equivalent of digraphs and trigraphs, multi-byte sequences that implement "extended" non-ASCII characters while still writing the program in pure ASCII. They're predefined, so [.STRING.] is meaningless, and there's a big list somewhere of which ASCII sequences translate to which Russian characters.

Of course, the list will be in Russian, so we ASCII-worlders probably don't know the right words to find it. It will also probably depend on being in the right extended-ASCII set, where the sequences have any meaning, and on using some Russian subset of awk. This feature is often not implemented unless it's really needed.

So to us, not that useful. To someone's special Russian awk in Russia, it might be indispensable.

In some languages (such as Welsh), the two-character sequence 'ch' is treated as a single collating element and sorts differently from the two single collating elements (and characters) 'c' and 'h'. I don't understand all of those rules, but when the sound made when pronouncing the characters is as it is when pronouncing "church", the collating element used is 'ch', and when the sound made is more like 'k' (as in "Christ"), the two collating elements 'c' and 'h' are used. If I understand it correctly, in a locale for Welsh, the RE [[.Ch.]] should match the "Ch" in "Church", but should not match the "Ch" in "Christ"; and the RE [[.C.][.h.]] should match the start of "Christ", but should not match the start of "Church".

In addition to the collating element bracket expressions, there are the more common character class bracket expressions like [[:alnum:]], which will match any alphabetic or numeric character. There are also the equivalence class expressions (also uncommon in English locales) like [[=e=]], which will match any character in the same equivalence class. For example, in various European language locales, [[=e=]] could match accented forms such as "è", "é", "ê", or "ë" in addition to matching "e".

And, of course, there are the matching list and non-matching list bracket expressions like [ch] (which matches a "c" or an "h") and [^ch] (which matches any single character that is not "c" and is not "h").
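The ordinary bracket expressions described above can be tried out without any special locale. This is only a sketch: the word list and the variable names are made up for illustration, LC_ALL=C is forced so the results don't depend on the caller's locale, and the [[:alnum:]] line assumes an awk that implements POSIX character classes (some very old awk builds do not):

```shell
words='chat
hat
cat
bat'

# Matching list: the first character must be "c" or "h", followed by "at".
c_or_h=$(printf '%s\n' "$words" | LC_ALL=C awk '/^[ch]at$/')
printf '%s\n' "$c_or_h"        # prints hat and cat ("chat" needs two letters before "at")

# Non-matching list: the first character must be neither "c" nor "h".
not_c_or_h=$(printf '%s\n' "$words" | LC_ALL=C awk '/^[^ch]at$/')
printf '%s\n' "$not_c_or_h"    # prints bat

# Character class: delete everything that is not alphabetic or numeric.
alnum_only=$(printf 'a1-b2_c3\n' | LC_ALL=C awk '{ gsub(/[^[:alnum:]]/, ""); print }')
printf '%s\n' "$alnum_only"    # prints a1b2c3
```

Note that [ch] always matches exactly one character, which is the difference from a collating element like [[.ch.]] in a locale that defines one.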

In what context would awk use collation, though? > < for strings, or does it have other meaning?

In standard awk, just for < and > on string operands. I believe gawk and some other versions of awk have extensions to the standards that provide built-in functions to sort arrays (which presumably would sort in collation order).
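A minimal sketch of the string comparison case, pinned to the C locale so the outcome is predictable (the variable names are just for illustration). In the C locale strings compare by byte value, so uppercase "B" (66) sorts before lowercase "a" (97):

```shell
# String constants on both sides force a string comparison, not a numeric one.
b_before_a=$(LC_ALL=C awk 'BEGIN { print ("B" < "a") }')
a_before_b=$(LC_ALL=C awk 'BEGIN { print ("a" < "B") }')
printf 'B < a: %s\na < B: %s\n' "$b_before_a" "$a_before_b"
# In the C locale this prints:
#   B < a: 1
#   a < B: 0
```

In a locale whose collation order puts "a" before "B", an awk that follows the standard here may give the opposite answers. Whether a particular awk actually consults the locale for < and > varies by implementation and mode, so it is worth testing on the awk you have.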
