awk string-function

1in10 · July 1, 2014, 7:54pm

Sorry to step onto holy ground again as a mere technical user, asking and learning. After trying strings and arrays I decided to go for an if-else-if ladder for a database, just because it looks a little easier to me, but as it happens, my result is not the desired one.
So here is the scheme for the if-else-if ladder in awk. Given a database.txt with four rows, the fourth one is my aim.
The last two lines of database.txt:

webster    3:48, 29.06.2014 Jun
webster    3:50, 29.06.2014 Jun

That's the scheme for the if-else-if-ladder

 if(conditional-expression1)
    {action1 ; action 2;}
    else if(conditional-expression2)
    {action1 ; action 2;}
    else if(conditional-expression3)
    {action1 ; action 2;}
    .
    .
    else
    action n;

That's my first step into it.

#declaring myarray as row four with the months
awk '(string["JanFebMarAprMayJunJulAugSepOctNovDec"]=$4); 

#defining mysubstring to be substring(named submonth) of string with startposition and length
mysubstring=submonth[string("JanFebMarAprMayJunJulAugSepOctNovDec",1,3)],

#guess here my trouble starts, no matter what if-condition, the interpreter complains about a syntax error
#it should start in row four, position one, maximumlength, matching for example [Jun]
#choosing submonth as a term, because substring is a named function
#so I put the following line, not sure about it, 'cause it doesn't work
b=split(submonth,array,search)
search="submonth"
(mysubstring($4,1,3)==submonth[Jun])
#and by now it gets something like a for-condition rather than the if-else-if-ladder
#followed by the two actions. 
($4>max) max==$4 {OFMT="%6f"; ORS ":"; print max; sum=sum+$4};
{OFMT="%-7.4f\n"; ORS ":"; print sum/NR}  ; '                 database.txt

But sadly, the output tells me something like this: there is no function defined for string. How do I define it? I guess it would be split. When I use split, the interpreter tells me that 'search' is still not a definition for the string.

Can anybody give me a hint on that? Regards, and thanks in advance.

Chubler_XL · July 1, 2014, 11:08pm

Firstly I think you need to get some terms right. You talk about "rows" but I feel you mean fields. For the last line of your database.txt we have:

field 1 = "webster"
field 2 = "3:50,"
field 3 = "29.06.2014"
field 4 = "Jun"

index() is the function that matches strings and returns the integer position of the string within the text. I can demonstrate this with these simple awk programs:

$ awk 'BEGIN { print index("JanFebMarAprMayJunJulAugSepOctNovDec", "Jan") }'
1

$ awk 'BEGIN { print index("JanFebMarAprMayJunJulAugSepOctNovDec", "Jun") }'
16

$ awk 'BEGIN { print index("JanFebMarAprMayJunJulAugSepOctNovDec", "Zoo") }'
0

So here you can see we are getting close to finding the month number for a string and zero for an invalid month.

From here we could add a couple of dummy characters (ones we never expect to see in field number 4 like space) to the front of the string. Now index of Jan will be 3 and Feb 6 and Mar 9, etc. so if we divide by 3 we will have the month number:

$ awk 'BEGIN { print index("  JanFebMarAprMayJunJulAugSepOctNovDec", "Oct") / 3}'
10

Now let's put this together into a working demo awk program with its own user-defined get_month_num() function. This should help you get along the way to doing what you want:

awk '
function get_month_num(mth) {
    return index("  JanFebMarAprMayJunJulAugSepOctNovDec", mth) / 3
}
{
    print "Month number for row " NR " is " get_month_num($4)
    if (get_month_num($4) > 5) {
        print "  This month is later than May"
    } else {
        print "   This month is May or earlier"
    }
}' database.txt
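
The same function can also drive the if / else if ladder asked about in the first post. This is only a minimal sketch, assuming the four-field layout shown above and English abbreviations. The sample lines are piped in instead of being read from database.txt, the "Mar" and "Xyz" rows are made up for illustration, and the length and remainder checks are an addition that rejects partial matches such as "anF".

```shell
#!/bin/sh
# Sketch: if / else-if ladder driven by the month number of field 4.
# Sample rows are inlined; "Xyz" is a made-up invalid month.
out=$(printf '%s\n' \
    'webster 3:48, 29.06.2014 Jun' \
    'webster 9:00, 01.03.2014 Mar' \
    'webster 1:00, 01.01.2014 Xyz' |
awk '
function get_month_num(mth,    n) {
    # Only 3-character names can be valid abbreviations.
    if (length(mth) != 3) return 0
    n = index("  JanFebMarAprMayJunJulAugSepOctNovDec", mth)
    # A real abbreviation starts at position 3, 6, 9, ...
    return (n % 3 == 0) ? n / 3 : 0
}
{
    m = get_month_num($4)
    if (m == 0)
        print $4 ": not a month"
    else if (m <= 5)
        print $4 ": month " m ", May or earlier"
    else
        print $4 ": month " m ", later than May"
}')
printf '%s\n' "$out"
```

Testing for 0 first keeps unknown month names out of the "May or earlier" branch.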

Don_Cragun · July 1, 2014, 11:13pm

The first code segment in your script, (string["JanFebMarAprMayJunJulAugSepOctNovDec"]=$4), defines an array named string. On each line read from database.txt it sets the array element whose subscript is the string of abbreviated month names to the value of the 4th field. Because that assignment is used as a pattern with no action, awk also copies the line to standard output whenever the line has four or more fields and the 4th field does not evaluate to zero.
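
To see that behaviour in isolation, here is a small sketch. The array name seen, its subscript, and the three input lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: an assignment used as a pattern with no action.
# The line is printed only when the assigned value is non-empty and non-zero.
out=$(printf '%s\n' 'a b c Jun' 'a b c 0' 'a b c' |
awk '(seen["key"] = $4)')
printf '%s\n' "$out"
```

Only the first line is printed: in the second the 4th field is zero, and in the third there is no 4th field at all.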

The second code segment, string("JanFebMarAprMayJunJulAugSepOctNovDec",1,3), calls a function named string() with three arguments. But, as awk told you, you have not defined a function named string() and the awk language doesn't provide one. Since awk doesn't know what string() is supposed to do, it can't run your script and prints a diagnostic message.
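
The built-in that awk does provide for extracting part of a string is substr(s, start, length). A short sketch of how it differs from the square-bracket array syntax, with variable names chosen only for illustration:

```shell
#!/bin/sh
# Sketch: substr() is a function call, square brackets index an array.
out=$(awk 'BEGIN {
    months = "JanFebMarAprMayJunJulAugSepOctNovDec"
    # substr(string, start, length): positions count from 1.
    first = substr(months, 1, 3)
    sixth = substr(months, (6 - 1) * 3 + 1, 3)
    # Square brackets index an array; they never call a function.
    name[6] = sixth
    print first, sixth, name[6]
}')
printf '%s\n' "$out"
```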

You have shown us code that isn't doing what you want it to do. But, you have not told us what you are trying to do, you have only shown us two lines from your four line input file, and you have not shown us what output you are trying to produce.

You have said that you want an if-else-if-ladder , but you haven't said what you want that if-else-if-ladder to do. We can't help you write code if we don't know what that code is supposed to do.

I do not understand how "Webster" or "Jun" is a length and what the maximum value of a string is in this context.

Show us ALL four lines of your input file.
Show us the output you are trying to produce.
Explain to us (in English) what your script is supposed to do to convert your sample input into that desired output.

1in10 · July 2, 2014, 8:50pm

@Don Cragun It is just about the last column of the database.txt, that is by chance row four or $4, not any of the fields ahead.
This is truly all of the input, due to the last run of the script. My database is just like this. This is the input-file, named database.txt

webster    3:48,  29.06.2014 Jun
webster    3:50,  29.06.2014 Jun
webster    23:11, 02.07.2014 Jul
webster    3:45,  02.07.2014 Jul

That means:
user, uptime, num-date, abbreviated name of the month.
Part of my confusion is due to the three languages used here: locales.gen is set up for three different users, English, Portuguese and German.
I prefer to work on the login with the German locale. And from squeeze to wheezy its location might have changed.
Reading your answer, I see that I am defining an array instead of a string, and that the element of that array is again a string. And this is due to these square brackets [ ]!? So I learn that square brackets [] create, execute and show me the func. I deliberately say func, since I have now learned that function is a reserved word in awk. (1) observation: see the example below.
To cut a long story short, this is what it should do:

create a string with the full names of the month

submonth=$(date +%B)
#re-using the locale variable submonth for full name of the month, this is literally the fourth row or column in the 
#database.txt-file or input-file
awk  -v submonth=$(date +%B) '
#creating the string (" ..") and setting him equal to row or column number four of the database.txt or the input-file, 
#here the german version

string("JanuarFebruarMärzAprilMaiJuniJuliAugustSeptemberOktoberNovemberDezember")=$4;

make submonth("Januar.....Dezember",1,9) with string,startpos,maxlen for the substring.

mysubstring=submonth($4,1,9),

set the first if-statement with the substring; that's where my embarrassment starts. So this is pseudo-code, but I'll put it as code.

if mysubstring("Januar") in submonth
{execute function1 ; function 2} and print result 1 + result 2 to stdout ;
else if mysubstring("Februar") in submonth
    {execute function1 ; function 2} and print result 1 + result 2 to stdout;
        else if mysubstring("März") in submonth
            {execute function1 ; function 2} and print result 1 + result 2 to stdout ;

and so on, until it reaches (Dez) to finish the column $4. Which by coincidence is field four of all FNR or $0.
I am aware that this is prose rather than code; that's why I ask here.
What is called func(tion) in this here, to me is any calculation, as mentioned above, but it seems the statement print is a function too.

**(1)observation**
here is an example, that even I understood. The usage of a string and square brackets for an array with a function called search.

#!/usr/bin/awk -f
BEGIN {
# this script breaks up the sentence into words, using 
# a space as the character separating the words

string="January February March April May June July August September November December ?";
    search=" ";
    n=split(string,array,search);
    for (i=1;i<=n;i++) {
        printf("Word[%d]=%s\n",i,array[i]);
    }
    exit;
}

Don_Cragun · July 2, 2014, 11:27pm

It is obvious that we have a huge language barrier here. My understanding of German is minimal. (I'm fluent in English, Standardese, and several computer languages.)

Row 4 of your database:

webster    3:48,  29.06.2014 Jun
webster    3:50,  29.06.2014 Jun
webster    23:11, 02.07.2014 Jul
webster    3:45,  02.07.2014 Jul

is the last of these four lines. The 4th column in your database is the list of abbreviated month names, that is, the 4th field on each line.

$0 expands to the contents of the line that is currently being processed by awk.

$4 expands to the contents of the 4th field on the line that is currently being processed by awk.

FNR is the line number of the line that is currently being processed within the file that is currently being read by awk.
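
A quick way to see these three values side by side is the sketch below. It assumes the four-line database from your post, inlined here so the example is self-contained.

```shell
#!/bin/sh
# Sketch: print FNR, $4 and $0 for every input line.
out=$(printf '%s\n' \
    'webster    3:48,  29.06.2014 Jun' \
    'webster    3:50,  29.06.2014 Jun' \
    'webster    23:11, 02.07.2014 Jul' \
    'webster    3:45,  02.07.2014 Jul' |
awk '{ print "FNR=" FNR, "$4=" $4, "$0=" $0 }')
printf '%s\n' "$out"
```

FNR counts the lines (rows), while $4 picks the 4th field (column) out of each one.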

The pseudo-code you have shown us is confusing us more than helping us understand what you are trying to do. With the four line database shown above, what output are you trying to produce?

Are the abbreviated month names in your database English (or C Locale) abbreviations or German abbreviations?

What awk functions are you trying to define? What arguments is each function supposed to take? What output is each function supposed to produce?

You seem to want to print "result 1" and "result 2". Are these literal strings? If not, where do they come from?

1in10 · July 3, 2014, 8:04am

Details turn out to be bigger than they were before.
Referring to my pseudo-code snippet, the if-else-if ladder shall

find mysubstring("Januar") in submonth
    {execute function 1; execute function 2} and print the result of both to stdout;
    else if mysubstring("Februar") in submonth
                   {execute function 1; execute function 2} and print the result of both to stdout;

On each machine, only the abbreviated month names appear within each user's database.txt file. So if user carla logs in, the Portuguese abbreviated name is written to her own input file, while frank uses the English version on his account, and I use the German one on mine. There is no server running here. I want to measure the total uptime and the average uptime. I am still miles away from pushing this into each of their accounts or joining the results. So, reading again about rows, FNR and NR: it is about the fourth field, $4, in all entries of the database.txt input. The results of both functions, one and two, are decimal numbers coming from a function call within those curly brackets { }.
I try to reproduce the syntax by defining a function that is declared with the keyword function but named something else:

function 
    f_summing_up($4>max,max==$4,sum=sum+$4)    

#column $4 is greater than maximum and is itself 
#the maximum, f_suming_up
#handles the parameters for the first calculation    

        {OFMT="%6f"; ORS ":"; printf max; sum=sum+$4};  
#printing the default format with variable ORS":"

    f_average(sum=sum+$4)/NR))      
           
#division of sum of column four $4 
#by the number of entries
#which may contains the error, that NR refers to all NR but 
#but not the specific ones of column $4   

        {OFMT="%-7.4f\n"; ORS ":"; print sum/NR}

Don_Cragun · July 4, 2014, 10:04pm

Most of what you have said above makes absolutely no sense to me.

I repeat: With the following four lines in your database:

webster    3:48,  29.06.2014 Jun
webster    3:50,  29.06.2014 Jun
webster    23:11, 02.07.2014 Jul
webster    3:45,  02.07.2014 Jul

exactly what output do you hope to produce?

1in10 · July 5, 2014, 7:18am

Bloke, I am at least one step ahead with the statement: no matter whether the format looks like floating point or hexadecimal, this part of the if-else-if ladder works fine. (Sorry for the term bloke.) The solution for me: regex.

awk '$4 ~ /^Juni$/  {if ($2>max) max=$2; ++n; s+=$2}  END { print "Maximum\t" (max), "Average\t" (s/n)};' datafile.txt

Don_Cragun · July 5, 2014, 2:03pm

I am very glad to hear that you have solved your problem. (Even though I don't see anything here that is at all related to trying to convert abbreviated month names to unabbreviated month names that seemed to be the focus of the first post in this thread.)

The script you have shown us as your solution, will produce absolutely no output with the sample input you provided (since Juni did not appear anywhere in your sample input).

And, the maximum and average your script is printing is not the maximum and average of the values in field 2 of the matched lines! For example, with your code:

awk '$4 ~ /^Juni$/  {if ($2>max) max=$2; ++n; s+=$2}  END { print "Maximum\t" (max), "Average\t" (s/n)};' datafile.txt

and datafile.txt containing:

webster    2:48, 29.06.2014 Juni
webster    3:50, 29.06.2014 Juni
webster   10:59, 29.06.2014 Juni

you get the output:

Maximum	3:50, Average	5

even though the maximum is 10:59 instead of 3:50, and the average of 2 hours 48 minutes, 3 hours 50 minutes, and 10 hours 59 minutes is about 5.8722 hours (5 hours, 52 minutes, and 20 seconds) instead of exactly 5 hours.

Of course, I already provided you with a function to convert strings of the form hh:mm to an integral number of minutes (see the thread "awk last n lines of file"), which you can add together and divide to compute averages, and I provided examples to print those averages as hours, minutes, and seconds or as a floating point number of minutes. If you wanted hours, you could easily divide the minutes by 60.
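
For readers without the other thread at hand, here is a minimal sketch of that approach. The function name to_minutes and the output wording are assumptions made for illustration, not the code from that thread. The script shown earlier goes wrong because "10:59," and "3:50," are compared as strings, and because s+=$2 only picks up the digits before the colon; converting to minutes first avoids both problems.

```shell
#!/bin/sh
# Sketch: maximum and average of hh:mm values in field 2 for lines
# whose 4th field is "Juni". Sample data is inlined.
out=$(printf '%s\n' \
    'webster    2:48, 29.06.2014 Juni' \
    'webster    3:50, 29.06.2014 Juni' \
    'webster   10:59, 29.06.2014 Juni' |
awk '
# Convert "hh:mm" (a trailing comma is tolerated) to whole minutes.
function to_minutes(t,    p) {
    split(t, p, ":")
    return p[1] * 60 + p[2]   # "48," converts to the number 48
}
$4 == "Juni" {
    m = to_minutes($2)
    if (m > max) max = m
    sum += m
    n++
}
END {
    if (n == 0) exit          # no matching lines, nothing to report
    printf "Maximum %d:%02d\n", int(max / 60), max % 60
    printf "Average %.2f minutes\n", sum / n
}')
printf '%s\n' "$out"
```

Dividing the average by 60 gives hours, as described above.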

I hope this helps you get what you want. Since you have refused to ever show us the output you wanted to produce in either of these threads, and since you don't seem to find my comments helpful, I won't be replying to any more of your posts.