How to lowercase the values in a column in awk and include a dynamic counter?

owwow14 · October 30, 2015, 6:58am

Hi,
I am trying to incorporate 2 functions into my `awk` command.
I want to lowercase Column 2 (which holds essentially the same information as Column 1, except that Column 1 should keep its capitalization), and I want a counter running from 0 to N that starts and ends at certain markers in my data.

The data (tab-separated) currently looks like this:

<s>
He	PRP	-
could	MD	-
tell	VB	-
she	PRP	-
was	VBD	-
teasing	VBG	-
him	PRP	-
.	.	.
</s>
<s>
He	PRP	-
kept	VBD	-
his	PRP$	-
eyes	NNS	-
closed	VBD	-
,	,	-
but	CC	-
he	PRP	-
could	MD	-
feel	VB	-
himself	PRP	-
smiling	VBG	-
.	.	.
</s>

The ideal output would be like this:

<s>
He	he	PRP	1
could	could	MD	2
tell	tell	VB	3
she	she	PRP	4
was	was	VBD	5
teasing	teasing	VBG	6
him	him	PRP	7
.	.	.	8
</s>
<s>
He	he	PRP	1
kept	kept	VBD	2
his	his	PRP$	3
eyes	eyes	NNS	4
closed	closed	VBD	5
,	,	,	6
but	but	CC	7
he	he	PRP	8
could	could	MD	9
feel	feel	VB	10
himself	himself	PRP	11
smiling	smiling	VBG	12
.	.	.	13
</s>

The 2-step `awk` that I am trying that does not work is this:

Step 1:

awk '!NF{$0=x}1' input | awk '{$1=$1; print "<s>\n" $0 "\t.\n</s>"}' RS=  FS='\n' OFS='\t-\n' > output

Here, I do not know how to make the "-" into a counter

and Step 2 (which directly gives me an error):

awk '{print $1 "\t" '$1 = tolower($1)' "\t" $2 "\t" $3}' input > output

Any suggestions on 1. how to solve the lowercasing and the counter, and 2. whether it is possible to combine these two steps?

Thank you in advance

pravin27 · October 30, 2015, 7:49am

awk '/^<s>/{i=0;print $1;next}
/<\/s>/{print $1;next}
{tmp=tolower($1); print $1,tmp,$2,i++;}' OFS="\t" filename
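
A minimal sketch of the same one-pass idea, with the counter starting at 1 as in the sample output rather than 0. It assumes tab-separated input and that the `<s>` and `</s>` markers sit alone on their lines; the `sample` and `result` variable names and the shortened input are made up for illustration.

```shell
# Two short sentences in the same layout as the data in the question.
sample=$(printf '<s>\nHe\tPRP\t-\ncould\tMD\t-\n</s>\n<s>\nShe\tPRP\t-\n</s>')

result=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = OFS = "\t" }               # read and write tab-separated fields
/^<\/?s>$/ { n = 0; print; next }       # marker line: reset the counter, pass it through
{ print $1, tolower($1), $2, ++n }      # token line: word, lowercased word, tag, count
')
printf '%s\n' "$result"
```

Using `++n` instead of `i++` is the only thing that moves the first count from 0 to 1; the placeholder third column of the input is simply not printed.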

RudiC · October 30, 2015, 8:08am

Not sure how to reconcile your written spec and the sample output. Do you mean you want to insert a field by copying tolower($1) between $1 and $2? And should the count info be the number of lines between <s> and </s>?

Assuming above thoughts to be true, try

awk '
        {if ($1 ~ "<\/?s>") ST = NR
         else   {$1=$1 OFS tolower($1)
                 $3=NR-ST
                }
        }
1
' OFS="\t" file
<s>
He      he      PRP     1
could   could   MD      2
tell    tell    VB      3
she     she     PRP     4
was     was     VBD     5
teasing teasing VBG     6
him     him     PRP     7
.       .       .       8
</s>
<s>
He      he      PRP     1
kept    kept    VBD     2
his     his     PRP$    3
eyes    eyes    NNS     4
closed  closed  VBD     5
,       ,       ,       6
but     but     CC      7
he      he      PRP     8
could   could   MD      9
feel    feel    VB      10
himself himself PRP     11
smiling smiling VBG     12
.       .       .       13
</s>
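
A small sketch of why the count lands in the last column in the script above: assigning a string that contains OFS to `$1` rebuilds the record but does not split it again, so the record still has three fields and `$3` is still the "-" placeholder. The one-line input and the `out` and `nf` names are only for illustration.

```shell
out=$(printf 'He\tPRP\t-\n' | awk '{
    $1 = $1 OFS tolower($1)   # $1 now holds "He<TAB>he"; the record is not re-split
    nf = NF                   # field count after the assignment
    $3 = 1                    # overwrites the "-" placeholder
    print nf                  # prints 3
    print                     # prints He, he, PRP, 1 separated by tabs
}' OFS='\t')
printf '%s\n' "$out"
```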

owwow14 · November 2, 2015, 9:51am

This solution works great for ASCII characters. However, I have some non-ASCII characters, and they do not convert correctly when using `tolower`. For instance, this is what happened:

<s>
Pero	pero	cc	0
lo	lo	da0000	1
más	m?s	rg	2
importante	importante	aq0000	3
,	,	fc	4
no	no	rn	5
sólo	s?lo	rg	6
desde	desde	sp000	7
la	la	da0000	8
visión	visi?n	nc0s000	9
de	de	sp000	10
una	una	di0000	11
parte	parte	nc0s000	12
</s>

Is there a way to maintain the non-ASCII characters when using `tolower`?

RudiC · November 2, 2015, 11:42am

With my awk version, tolower doesn't replace the non-ASCII character with a question mark; it just returns the character as is. Unfortunately, I didn't find an awk solution for your problem. But with bash, this might work:

X="más sólo"
echo ${X^^}
MÁS SÓLO
Y=${X^^}
echo ${Y,,}
más sólo

Don_Cragun · November 2, 2015, 3:20pm

It looks like there is a mismatch between the Locale that was used when creating the file and the Locale used when running your awk script.

Try running your awk script again with the LC_CTYPE environment variable set to a locale that uses the same character set used to write your file and that contains the accented characters in your file in class alpha.
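
A hedged sketch of what that could look like, assuming the file is UTF-8 encoded. Locale names differ between systems, so the snippet picks the first UTF-8 locale that `locale -a` reports and falls back to C if there is none; the `utf8_locale` and `lowered` names are made up here. Whether the accented letters actually get converted also depends on the awk implementation: some only convert ASCII letters whatever the locale is.

```shell
# Pick an installed UTF-8 locale (names such as en_US.UTF-8 or C.utf8 vary by system).
utf8_locale=$(locale -a 2>/dev/null |
    awk 'tolower($0) ~ /utf-?8/ && !found { print; found = 1 }')
: "${utf8_locale:=C}"    # assumption: fall back to the C locale if none is installed

# Run awk with that locale in effect for this one command only.
lowered=$(printf 'MÁS\tSÓLO\n' | LC_ALL="$utf8_locale" awk '{ print tolower($0) }')
printf '%s\n' "$lowered"
```

Setting `LC_ALL` overrides `LC_CTYPE`, so it is the blunter of the two; `LC_CTYPE="$utf8_locale"` is enough when `LC_ALL` is not already set in the environment.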

owwow14 · November 12, 2015, 7:23am

I know this is a zombie thread, but I just wanted to mention that setting the `LC` variable to UTF-8 completely solved the problem. Thank you!