Hi,
I am trying to incorporate 2 functions into my `awk` command.
I want to lowercase Column 2 (which holds essentially the same information as Col1, except that in Col1 I want to keep the capitalization), and I want a count from 0-N that begins and ends at certain markers that I have.
The data (tab-separated) currently looks like this:
<s>
He PRP -
could MD -
tell VB -
she PRP -
was VBD -
teasing VBG -
him PRP -
. . .
</s>
<s>
He PRP -
kept VBD -
his PRP$ -
eyes NNS -
closed VBD -
, , -
but CC -
he PRP -
could MD -
feel VB -
himself PRP -
smiling VBG -
. . .
</s>
The ideal output would be like this:
<s>
He he PRP 1
could could MD 2
tell tell VB 3
she she PRP 4
was was VBD 5
teasing teasing VBG 6
him him PRP 7
. . . 8
</s>
<s>
He he PRP 1
kept kept VBD 2
his his PRP$ 3
eyes eyes NNS 4
closed closed VBD 5
, , , 6
but but CC 7
he he PRP 8
could could MD 9
feel feel VB 10
himself himself PRP 11
smiling smiling VBG 12
. . . 13
</s>
The 2-step `awk` that I am trying that does not work is this:
Not sure how to reconcile your written spec and the sample output. Do you mean you want to insert a field holding tolower($1) between $1 and $2? And should the count info be the number of lines since <s>, within each <s> ... </s> block?
Assuming the above to be true, try
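awk '
{
    if ($1 ~ /^<\/?s>$/)
        ST = NR                    # marker line: remember its line number
    else {
        $1 = $1 OFS tolower($1)    # append a lowercase copy right after column 1
        $3 = NR - ST               # overwrite the original third field with lines since the marker
    }
}
1                                  # print every line, modified or not
' OFS="\t" file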
<s>
He he PRP 1
could could MD 2
tell tell VB 3
she she PRP 4
was was VBD 5
teasing teasing VBG 6
him him PRP 7
. . . 8
</s>
<s>
He he PRP 1
kept kept VBD 2
his his PRP$ 3
eyes eyes NNS 4
closed closed VBD 5
, , , 6
but but CC 7
he he PRP 8
could could MD 9
feel feel VB 10
himself himself PRP 11
smiling smiling VBG 12
. . . 13
</s>
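The written spec asks for a count starting at 0, while the sample output starts at 1. A minimal sketch of a 0-based variant, assuming tab-separated input with exactly three columns on token lines; the function name `number_sentences` is hypothetical and only wraps the `awk` call for reuse:

```shell
# Hypothetical helper: reads tab-separated token lines on stdin.
# Assumption: marker lines are exactly <s> or </s>.
number_sentences() {
  awk -F'\t' -v OFS='\t' '
    /^<\/?s>$/ { n = 0; print; next }    # marker line: reset the counter, pass it through
    { print $1, tolower($1), $2, n++ }   # token, lowercase copy, tag, 0-based index
  '
}

printf '<s>\nHe\tPRP\t-\ncould\tMD\t-\n</s>\n' | number_sentences
```

Changing `n++` to `++n` gives the 1-based count shown in the sample output.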
This solution works great for ASCII characters. However, I have some characters that are non-ASCII, and they do not convert correctly when using `tolower`. For instance, this is what happened:
<s>
Pero pero cc 0
lo lo da0000 1
más m?s rg 2
importante importante aq0000 3
, , fc 4
no no rn 5
sólo s?lo rg 6
desde desde sp000 7
la la da0000 8
visión visi?n nc0s000 9
de de sp000 10
una una di0000 11
parte parte nc0s000 12
</s>
Is there a way to maintain the non-ASCII characters when using `tolower`?
With my awk version, tolower doesn't replace the non-ASCII char with a question mark, but just returns the char as is. Unfortunately, I didn't find an awk solution for your problem. But, with bash, this might work:
It looks like there is a mismatch between the locale that was used when creating the file and the locale used when running your awk script.
Try running your awk script again with the LC_CTYPE environment variable set to a locale that uses the same character set used to write your file and that contains the accented characters in your file in class alpha.
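A rough sketch of that suggestion, under a few assumptions: the locale name `en_US.UTF-8` is only a guess at what is installed and at how the file is encoded, and the wrapper name `lower_with_locale` is hypothetical. `LC_ALL` is set rather than `LC_CTYPE` because `LC_ALL`, when present, overrides `LC_CTYPE`. Whether `tolower` then handles multibyte characters also depends on the awk implementation: gawk is locale-aware, while some others work on single bytes only.

```shell
# Hypothetical wrapper: run the lowercasing step under a chosen locale.
# $1 is a locale name (assumption: it is installed on this system).
lower_with_locale() {
  LC_ALL="$1" awk -F'\t' -v OFS='\t' '{ print $1, tolower($1) }'
}

# List installed locales to find one that matches the encoding of the file.
locale -a | grep -i -e 'utf' -e '8859' || true

# Example call with UTF-8 input; en_US.UTF-8 is an assumption, not a given.
printf 'M\303\201S\trg\n' | lower_with_locale en_US.UTF-8
```

If the named locale is not installed, awk typically falls back to the C locale and the accented character is left unconverted, so it is worth checking the `locale -a` listing first.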