Convert to upper case first letter of each word in column 2

cgkmal · May 1, 2010, 12:00am

Hi guys,

I have a comma-separated file. I'm trying to capitalize the first letter of each word in column 2 so that the column has a standard format.

I hope somebody can help me complete the sed or awk script below.

The file looks like this:
(Some lines in column 2 are all uppercase, some are lower case and some are mixed)

PRODUCT No.,SCIENCE BOOKS,DESCRIPTION
Product 1,PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687),Blah blah blah
Product 2,Dialogue concerning the two chief world systems (1632),Blah blah blah
Product 3,De Revolutionibus Orbium Coelestium (1543),Blah blah blah
Product 4,the voyage of the beagle (1845),Blah blah blah

With sed, I can change every letter to lower case and then change only the first letter of every word to upper case, but sed applies the change to all columns.

sed -e 's/.*/\L&/' -e 's/\<./\u&/g' file 

where:
sed 's/.*/\L&/' file   --> Changes to lower case
sed 's/\<./\u&/g' file --> Changes only first letter of each word to uppercase

Is there a way to tell sed to work on a specific column only?

With awk, I've been trying the gensub function, but I don't know how to
replace the matched text with the same text in upper case.

awk 'BEGIN{FS=OFS=","} NR>1{$2=tolower($2);$2=gensub(/\<[A-Za-z]/,"X","g",$2)} {print $0}'

Where:
NR>1{$2=tolower($2) --> Changes all within column 2 to lower case
$2=gensub(/\<[A-Za-z]/,"X","g",$2) --> Replaces with "X" the first letter of each word (regexp \<[A-Za-z]) within column 2.

Is there a way to use toupper with a remembered pattern?
(i.e., instead of "X", use toupper("\1") within gensub)

This awk script replaces the first letter with a constant (a capital X)
PRODUCT No.,SCIENCE BOOKS,DESCRIPTION
Product 1,Xhilosophiae Xaturalis Xrincipia Xathematica (1687),Blah blah blah
Product 2,Xialogue Xoncerning Xhe Xwo Xhief Xorld Xystems (1632),Blah blah blah
Product 3,Xe Xevolutionibus Xrbium Xoelestium (1543),Blah blah blah
Product 4,Xhe Xoyage Xf Xhe Xeagle (1845),Blah blah blah

Thanks in advance.

durden_tyler · May 1, 2010, 12:34am

I don't know about sed or awk, but Perl empowers you to make this change rather trivially -

$ 
$ 
$ cat -n f7
     1    PRODUCT No.,SCIENCE BOOKS,DESCRIPTION
     2    Product 1,PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687),Blah blah blah
     3    Product 2,Dialogue concerning the two chief world systems (1632),Blah blah blah
     4    Product 3,De Revolutionibus Orbium Coelestium (1543),Blah blah blah
     5    Product 4,the voyage of the beagle (1845),Blah blah blah
$ 
$ perl -F, -lane '$F[1]=~s/(\w+)/ucfirst(lc($1))/ge; print join(",",@F)' f7
PRODUCT No.,Science Books,DESCRIPTION
Product 1,Philosophiae Naturalis Principia Mathematica (1687),Blah blah blah
Product 2,Dialogue Concerning The Two Chief World Systems (1632),Blah blah blah
Product 3,De Revolutionibus Orbium Coelestium (1543),Blah blah blah
Product 4,The Voyage Of The Beagle (1845),Blah blah blah
$ 
$

tyler_durden

cgkmal · May 1, 2010, 12:55am

Hi durden_tyler,

Thanks for your reply.

Perl is an excellent way, but I don't know how to use it. I've tried your script and it works fine with the sample file. To use it on a real file that could have 40 columns, how can I change the script to work, for instance, only on column 25 or 35?

Thanks again.

durden_tyler · May 1, 2010, 1:08am

You'll have to learn it if you want to use it. I couldn't think of any other alternative.

Arrays in Perl have zero-based indexes. $F[1] is the 2nd element, which corresponds to the 2nd column of your file. On the same lines, $F[24] refers to the 25th element and $F[34] to the 35th.

tyler_durden

cgkmal · May 1, 2010, 1:48am

You're right, durden, it would be nice for me to learn it.

For the moment, I changed the script a little for a file of 40 columns separated by pipe "|", but in the output many columns have their content out of order; it looks like content has been moved to other columns.

The script I used is:

perl -F, -lane '$F[20]=~s/(\w+)/ucfirst(lc($1))/ge; print join("|",@F)' inputfile > output

What could it be?

Thanks in advance.

Franklin52 · May 1, 2010, 7:29am

You can play around with this example, adjust col to convert another column:

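awk -F, -v col=2 '
NR > 1{
  # lower-case the column, then split it into words
  n=split(tolower($col),a," ")
  # rebuild the column with each word capitalized
  $col=toupper(substr(a[1],1,1)) substr(a[1],2)
  for(i=2;i<=n;i++) {
    $col=$col " " toupper(substr(a[i],1,1)) substr(a[i],2)
  }
}1' OFS="|" infile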
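
On the original gensub question: the replacement argument of gensub is an ordinary string, so toupper("\\1") is evaluated before any matching takes place and cannot see the matched text. A loop with match() and substr() gets around that. A minimal sketch in POSIX awk, assuming comma-separated input with one header row; the names title_case_col and titlecase are made up for illustration:

```shell
# title_case_col N [FILE...]
# Hypothetical helper: title-case column N of a CSV, leaving the header row alone.
title_case_col() {
    col=$1
    shift
    awk -F, -v OFS=, -v col="$col" '
    # Return s lower-cased, with the first character of every word upper-cased.
    function titlecase(s,    out, word) {
        s = tolower(s)
        out = ""
        # match() locates the next run of letters or digits; RSTART and
        # RLENGTH give its position, so toupper() is applied to real text.
        while (match(s, /[a-z0-9]+/)) {
            word = substr(s, RSTART, RLENGTH)
            out = out substr(s, 1, RSTART - 1) toupper(substr(word, 1, 1)) substr(word, 2)
            s = substr(s, RSTART + RLENGTH)
        }
        return out s
    }
    NR > 1 { $col = titlecase($col) }
    { print }
    ' "$@"
}
```

Unlike the split-on-spaces approach, this keeps the original spacing and also capitalizes a word that follows punctuation such as "(".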

Scrutinizer · May 1, 2010, 8:50am

Or ksh:

#!/bin/ksh
typeset -L1 -u first
typeset -l rest
while IFS=, read field1 field2 fieldn; do
  titlefield2=""
  for i in $field2; do
    first=$i
    rest=${i#?}
    titlefield2="$titlefield2$first$rest "
  done
  printf "%s,%s,%s\n" "$field1" "${titlefield2% }" "$fieldn"
done < infile
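
Not every shell supports the typeset -L1 -u attributes used above. For those shells, the same loop can be sketched with cut and tr. This is a minimal sketch, assuming three comma-separated fields per line as in the sample; the function name title_case_field2 is made up for illustration, and the extra test after read is there so that a final line without a trailing newline is still processed:

```shell
# title_case_field2 < infile
# Hypothetical helper: title-case the second of three comma-separated fields.
title_case_field2() {
    # read fails on a last line that lacks a newline but still fills the
    # variables, so the -n test lets the loop body run for that line too.
    while IFS=, read -r field1 field2 fieldn || [ -n "$field1" ]; do
        titlefield2=""
        for word in $field2; do
            first=$(printf '%s' "$word" | cut -c1 | tr '[:lower:]' '[:upper:]')
            rest=$(printf '%s' "$word" | cut -c2- | tr '[:upper:]' '[:lower:]')
            titlefield2="$titlefield2$first$rest "
        done
        printf '%s,%s,%s\n' "$field1" "${titlefield2% }" "$fieldn"
    done
}
```

It spawns several processes per word, so it is only reasonable for small files.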

pseudocoder · May 1, 2010, 10:08am

I hope you will also give my solution a chance, as I have put a lot of effort into it.

capitalize

# capitalize each word in a string
function capitalize(input,    result, words, n, i, w)
{
	result = ""
	n = split(input, words, " ")
	for (i = 1; i <= n; i++) {
		w = words[i]
		w = toupper(substr(w, 1, 1)) substr(w, 2)
		if (i > 1)
			result = result " "
		result = result w
	}
	return result
}

# main program, for testing
{ print capitalize($0) }

$ awk -F"," '{$2=tolower($2); print $2}' file | nawk -f capitalize > tfile1
$ awk -F"," '{print $2}' file > tfile2
$ touch ofile
$ exec 3< tfile2
$ while read line1
do
read line2 <&3
sed ''/"$line2"/s//"$line1"/'' file | grep "$line1" >> ofile
done < tfile1
$ rm tfile1 tfile2

A question for the awk pros:
While I was trying to avoid creating temporary files, I managed to put the output of the first two commands in two variables

formatted=$(the first awk command here)
tosed=$(the second awk command here)

, but I didn't manage to split the content, so I'd really like to know how to put it in two shell arrays instead.

cgkmal · May 1, 2010, 5:52pm

Hi to all,

Franklin,

Thanks for your answer; the script works perfectly. I'll try to better understand how you used those functions. Great!

Scrutinizer,

I've tried your solution, but it does not seem to upper-case only the first letter of each word in column 2; instead it duplicates the text within column 2, and the last line isn't printed. What could it be? Below is the output I receive:

PRODUCT No.,SCIENCECIENCE BOOKSOOKS,DESCRIPTION
Product 1,PHILOSOPHIAEHILOSOPHIAE NATURALISATURALIS PRINCIPIARINCIPIA MATHEMATICAATHEMATICA (1687)1687),blah blah blah
Product 2,Dialogueialogue concerningoncerning thehe twowo chiefhief worldorld systemsystems (1632)1632),Blah blah blah
Product 3,Dee Revolutionibusevolutionibus Orbiumrbium Coelestiumoelestium (1543)1543),Blah blah blah

pseudocoder,

I really appreciate your effort in trying to help me. Thanks, and we certainly learn a lot from each contribution.

But I don't know why it doesn't work for me. I copied the "capitalized" function code into a file called "capitalized" and saved the input file in the same directory, named simply "file" as in your script. Then I ran the script up to the last line (rm tfile1 ...), and it creates the temp files with permission-denied access. They cannot be removed, and ofile is created with 0 KB. I'm not sure why this happens.

Thanks again to all for the great help.

Best regards,

durden_tyler · May 1, 2010, 10:49pm

cgkmal:

... considering a file of 40 columns separated by pipe "|", but in the output many columns have their content out of order; it looks like content has been moved to other columns.

The script I used is:
perl -F, -lane '$F[20]=~s/(\w+)/ucfirst(lc($1))/ge; print join("|",@F)' inputfile > output
What could it be?
...

The character following the "-F" switch is your file's delimiter. Since your file's delimiter was a comma (",") in your first post, I put a comma over there.

The first argument of the join() function is the delimiter you want in your output. Since your first post specified the delimiter of the output as a comma (","), I put a comma over there as well.

Based on the information above, you should be able to fix the one-liner to suit your needs.

tyler_durden

cgkmal · May 2, 2010, 2:13am

That's right, durden, I had done that; I mean, I replaced (-F,) with (-F"|"), in this way:

perl -F"|" -lane '$F[1]=~s/(\w+)/ucfirst(lc($1))/ge; print join("|",@F)' inputfile

And with this inputfile:

PRODUCT No.|SCIENCE BOOKS|DESCRIPTION|
Product 1|PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687)|blah blah blah|
Product 2|Dialogue concerning the two chief world systems (1632)|Blah blah blah|
Product 3|De Revolutionibus Orbium Coelestium (1543)|Blah blah blah|
Product 4|the voyage of the beagle (1845)|Blah blah blah |

I get every letter piped.

$ perl -F"|" -lane '$F[1]=~s/(\w+)/ucfirst(lc($1))/ge; print join("|",@F)' inputfile
P|R|O|D|U|C|T| |N|o|.|||S|C|I|E|N|C|E| |B|O|O|K|S|||D|E|S|C|R|I|P|T|I|O|N||
P|R|o|d|u|c|t| |1|||P|H|I|L|O|S|O|P|H|I|A|E| |N|A|T|U|R|A|L|I|S| |P|R|I|N|C|I|P|I|A| |M|A|T|H|E|M|A|T|I|C|A| |(|1|6|8|7|)|||b|l|a|h| |b|l|a|h| |b|l|a|h||
P|R|o|d|u|c|t| |2|||D|i|a|l|o|g|u|e| |c|o|n|c|e|r|n|i|n|g| |t|h|e| |t|w|o| |c|h|i|e|f| |w|o|r|l|d| |s|y|s|t|e|m|s| |(|1|6|3|2|)|||B|l|a|h| |b|l|a|h| |b|l|a|h||
P|R|o|d|u|c|t| |3|||D|e| |R|e|v|o|l|u|t|i|o|n|i|b|u|s| |O|r|b|i|u|m| |C|o|e|l|e|s|t|i|u|m| |(|1|5|4|3|)|||B|l|a|h| |b|l|a|h| |b|l|a|h||
P|R|o|d|u|c|t| |4|||t|h|e| |v|o|y|a|g|e| |o|f| |t|h|e| |b|e|a|g|l|e| |(|1|8|4|5|)|||B|l|a|h| |b|l|a|h| |b|l|a|h| ||

Again thanks for your help.

Regards.

pseudocoder · May 2, 2010, 8:35am

Very strange, unfortunately I can't tell why that happens.
It pretty certainly does not have anything to do with the code.
In practice, even "echo test > tfile" would then produce a file you cannot access.
Logically, ofile is empty because the while loop had no access to tfile1 and tfile2.
It would be very helpful to know what happens if you redirect the output of Franklin52's code into a file ("> tfile").
You should then run into the same permission-denied issue.
Please let me know.

cgkmal · May 2, 2010, 5:36pm

Hi pseudocoder,

I've run your code afresh. This time it again generates the 3 files (ofile, tfile1 and tfile2) with 0 KB, but without the permission-denied problem. I redirected the output of Franklin's code to each of the 3 files above, and they were filled with the correct output.

Strange, like you said; I don't know what could have happened.

Thanks for your help again.

pseudocoder · May 2, 2010, 6:04pm

I almost can't believe it...
I know Franklin's code is definitely the best solution for you, but for better understanding I'd like to know what happens if you just run

awk -F"," '{$2=tolower($2); print $2}' file | nawk -f capitalize

and

awk -F"," '{print $2}' file

cgkmal · May 2, 2010, 7:15pm

I ran these commands and they worked; the only change I made was to use gawk or awk instead of nawk (I don't have it installed).

$ awk -F"|" '{print $3}' file
SCIENCE BOOKS
PHILOSOPHIAE NATURALIS PRINCIPIA MATHEMATICA (1687)
Dialogue concerning the two chief world systems (1632)
De Revolutionibus Orbium Coelestium (1543)
the voyage of the beagle (1845)

$ awk -F"|" '{$3=tolower($3); print $3}' file | gawk -f capitalize
Science Books
Philosophiae Naturalis Principia Mathematica (1687)
Dialogue Concerning The Two Chief World Systems (1632)
De Revolutionibus Orbium Coelestium (1543)
The Voyage Of The Beagle (1845)

I think this proves your code works fine; I don't know why running it the other way failed for me.

Anyway, thanks for your willingness to help.

Best regards,

pseudocoder · May 2, 2010, 7:43pm

All right, I'm glad we could work it out.
Thank you for testing and giving me feedback.

durden_tyler · May 3, 2010, 11:27am

That's wrong, cgkmal. The "-F" switch is followed by the pattern to split on. It's a regular expression pattern and not a literal string. You can enclose it within double-quotes or single-quotes, but that doesn't make it a literal string. The character "|" has a special meaning in regular expression terminology; it is the "OR" operator.

If you want to specify the literal pipe character as the delimiter, then you'll have to remove the special meaning of "OR" operator. And you do that by escaping the OR operator like so "\|".

So your command line should be:

perl -F"\|" -lane '$F[1]=~s/(\w+)/ucfirst(lc($1))/ge; print join("|",@F)' inputfile

An unescaped "|", when supplied as the delimiter to the split function, is an alternation of two empty patterns, so it matches the empty string, just like a null pattern. The net effect is that the line is split on each character. Note how the operation on $F[1] affected only the second character of each line: after the split, $F[1] was the character "R" or "r" of your inputfile.

The one-liner below should make that clear:

$
$
$ echo "abcd|1234" | perl -F"|" -lane 'print $F[1]'
b
$
$ # In this one-liner, the input was split on each character due to "|"
$ # So, $F[1] was the second element of array @F i.e. "b"
$
$
$ echo "abcd|1234" | perl -F"\|" -lane 'print $F[1]'
1234
$
$ # In this one-liner, the input was split correctly due to "\|"
$ # So, $F[1] was the second element of array @F i.e. "1234"
$

HTH,
tyler_durden