Data processing using awk

shadyuk · January 13, 2013, 7:36pm

Hello,

I have some bitrate data in a csv which is in an odd format and is difficult to process in Excel when I have thousands of rows. Therefore, I was thinking of doing this in bash and using awk as the primary application except that due to its complication, I'm a little stuck.

Data____Output_____Calculations
968_____968________no K or M so do nothing
31K7____31700______31*K + 7*100
69K5____69500______69*K + 5*100
1M07____1070000____1*M + 5*10,000
842K____842000_____842*K
5M99____5990000____5*M + 99*10,000

Note: K = 1000, M = 1,000,000

As you can see, it's a bit of a nuisance. The file only has the Data column. What I need the script to do is the calculations and spit out the Output column. Please ignore the underscores as I couldn't get it to indent properly with the editor. Any help would be much appreciated.

Thanks

Chubler_XL · January 13, 2013, 7:55pm

How about something like this:

sed -E 's/_+/\t/g' infile > outfile

Result would be a TAB delimited file

shadyuk · January 13, 2013, 8:09pm

Sorry, maybe I didnt make it clear. The file only has the Data column. What I need the script to do is the calculations and spit out the Output column.

Thanks.

Don_Cragun · January 13, 2013, 8:39pm

This should work for input with 1, 2, or 3 digits following a K or M in your input:

awk 'BEGIN {
        f[1] = 100
        f[2] = 10
        f[3] = 1
}
$1 ~ /[KM]/ {
        split($1, a, /[KM]/)
        m = $1 ~ /M/
        $1 = a[1] * 1000 + a[2] * f[length(a[2])]
        if(m) $1 *= 1000
}
1' data

If you are using a Solaris system, use /usr/xpg4/bin/awk or nawk instead of awk.

Chubler_XL · January 13, 2013, 8:55pm

Try this:

awk '
  /[KM]$/{gsub(/$/,"0")}
  /K/{gsub(/K/,"");gsub(/$/,"00")}
  /M/{gsub(/M/,"");gsub(/$/,"0000")}
  1' infile

shadyuk · January 13, 2013, 9:07pm

Both very clever indeed. Output is correct if theres only one column. However, a little more assistance please. If the csv is as follow, how do I get it to convert the bitrate columns (data1, data2, data3) without having to parse the file first?

timestamp,data1,data2,data3
08/01/13 16:31:00,1M07,786,896K
08/01/13 16:31:10,1K23,2K44,4M34
08/01/13 16:31:20,345,2M84,437K

Don - Your method only outputs if the data is in the first column.
Chubler - Your method outputs the wrong data.

timestamp,wan1,wan2,wan3
08/01/13 16:31:00,107,786,8960000000
08/01/13 16:31:10,123,244,434000000
08/01/13 16:31:20,345,284,4370000000

Additionally, I'd like to be able to call the script and give it an input file from the shell prompt rather than explicitly calling the file in the script. For example:

$ awkscript infile

Thanks again.

Chubler_XL · January 13, 2013, 9:33pm

Try this:

#!/bin/awk -f
function expkm(a) {
  if(split(a,v,"K")>1) return v[1] (v[2]?v[2]:"0") "00"
  if(split(a,v,"M")>1) return v[1] (v[2]?v[2]:"0") "0000"
  return a
}
BEGIN {FS=OFS=","}
{for(i=2;i<=NF;i++) $i=expkm($i)}
1

Output:

$ ./awkscript infile
timestamp,wan1,wan2,wan3
08/01/13 16:31:00,1070000,786,896000
08/01/13 16:31:10,12300,24400,4340000
08/01/13 16:31:20,345,2840000,437000

shadyuk · January 13, 2013, 9:47pm

Awesome, works although I now feel like I know absolutely nothing!

Any comments to explain what you've done would be fab. Otherwise, thank you again.

Don_Cragun · January 13, 2013, 10:06pm

Sorry. I must have misunderstood what you wanted when you said:

in message #3 in this thread.

Anyway, try this:

#!/bin/ksh
if [ $# -ne 1 ] || [ ! -f "$1" ]
then    echo "Usage: $0 file" >&2
        exit 1
fi
awk 'BEGIN {
        FS = OFS = ","
        f[1] = 100
        f[2] = 10
        f[3] = 1
}
FNR>1 { for(i = 2; i <= 4; i++)
                if(split($i, a, /[KM]/) == 2) {
                        m = $i ~ /M/
                        $i = a[1] * 1000 + a[2] * f[length(a[2])]
                        if(m) $i *= 1000
                }
}
1' "$1"

This applies the requested transformations to fields 2, 3, and 4. However, I have no idea why:

timestamp,data1,data2,data3

in your input file changed to:

timestamp,wan1,wan2,wan3

I don't know whether you were saying the other transformations were the incorrect values produced by Chubler_XL's script or if those were the values you want to get from the input you gave.

Assuming the data:

timestamp,data1,data2,data3
08/01/13 16:31:00,1M07,786,896K
08/01/13 16:31:10,1K23,2K44,4M34
08/01/13 16:31:20,345,2M84,437K
13/01/13 18:51:20,1M,1M2,1M34
13/01/13 18:51:21,1M567,9K,9K8
13/01/13 18:51:22,9K76,9K543,1

is in a file named infile , that you save my script above in a file named awkscript , adjust the /bin/ksh in the first line of my script to be the absolute pathname of the Korn shell on your system, and that you make awkscript executable by running the command:

chmod +x awkscript

then the command:

awkscript infile

will produce the output:

timestamp,data1,data2,data3
08/01/13 16:31:00,1070000,786,896000
08/01/13 16:31:10,1230,2440,4340000
08/01/13 16:31:20,345,2840000,437000
13/01/13 18:51:20,1000000,1200000,1340000
13/01/13 18:51:21,1567000,9000,9800
13/01/13 18:51:22,9760,9543,1

Is this what you wanted?

PS Note that for the last three lines of this input file, this script and the script provided by Chubler_XL in message #7 in this thread produce different results. If I understand you input formats, I think this script does what you want.

Chubler_XL · January 13, 2013, 11:16pm

The argument could also be made that "1M567" should produce an output of "6670000" (1M + 567 * 10,000). If it's even a valid input.

Don_Cragun · January 13, 2013, 11:51pm

I will agree that none of the sample input used xMxxx or xKxxx, but we also provide different output for the input 2K44 . My script produces 2440 , your script produces 24400 . Both of our scripts produce 4340000 for the input 4M34 and I can't believe the intent was for xMxx and xKxx to differ by a factor of 100 instead of 1000 in the way they handle two digits after the K and M multiplier codes. Both of these appear in the lines provided in the sample input (not just in my extended test cases).

We'll have to let shadyuk tell us which one of us made the right assumption for the desired behavior given that the specification didn't cover any of these cases explicitly.

shadyuk · January 14, 2013, 6:47am

Good spot Don, the output of Chublers script is not right. Was very late last night so I didn't pick this up. The output of your script is correct.

I now have a new problem. The bit rates are combined as follow:

timestamp,data1,data2,data3
08/01/13 16:31:00,1M07/54K3,786/2K1,896K/1M54
08/01/13 16:31:10,1K23/432,2K44/76K,4M34/29K1

I need to split the output and then 'translate' each individually.

timestamp,data1_dl,data1_ul,data2_dl,data2_ul,data3_dl,data3_ul
08/01/13 16:31:00,1070000,5430,786,2100,896000,1540000
08/01/13 16:31:10,1230,432,2440,7600,4340000,2910

Thank you!!!

RudiC · January 14, 2013, 6:53am

Why not do this in the spreadsheet itself? Using gnumeric, I came up with this (I know, Mantissa is not the correct term...):

Data    MantissaFactor  Result  Result
968     968     1       968     968
31K7    31.7    1000    31700   31700
69K5    69.5    1000    69500   69500
1M07    1.07    1000000 1070000 1070000
842K    842.    1000    842000  842000
5M99    5.99    1000000 5990000 5990000
                                ^--- =if(iserr(find("K",A8)),if(iserr(find("M",A8)),1,1000000),1000)*substitute(substitute(A8,"K","."),"M",".")
                        ^--- =if(iserr(find("K",A8)),if(iserr(find("M",A8)),1,1000000),1000)*B8                                
                ^--- =if(iserr(find("K",A8)),if(iserr(find("M",A8)),1,1000000),1000)   
        ^--- =substitute(substitute(A8,"K","."),"M",".")

The functions should be EXCEL- compatible, so give it a shot...

shadyuk · January 14, 2013, 7:47am

I was originally using excel to do this but it continuously freezes when my worksheet contains tens of thousands of rows and becomes impossible to work with. Awk is far more efficient. I can then graph up what I need when I have the script output.

Thanks for the advice though

Scrutinizer · January 14, 2013, 8:24am

Try:

awk '
  NR>1{
    for(i=2; i<=NF; i++) sub("/",",",$i)
    $0=$0
    for(i=2; i<=NF; i++){
      p=0
      if($i~/G/) p=9
      if($i~/M/) p=6
      if($i~/K/) p=3
      sub(/[GMK]/,".",$i)
      $i*=10^p
    }
  }
  1
' FS=, OFS=, infile

shadyuk · January 14, 2013, 9:01am

Thank you. How do I make it so that I can call the script from shell and give it a file as per Don's original code?

./awkscript infile

Don_Cragun · January 14, 2013, 9:32am

Do the same thing I did in my original code. I also added a line to change the headers for you:

#!/bin/ksh
if [ $# -ne 1 ] || [ ! -f "$1" ]
then    echo "Usage: $0 file" >&2
        exit 1
fi
awk '
  NR==1{
    $0="timestamp,data1_dl,data1_ul,data2_dl,data2_ul,data3_dl,data3_ul"
  }
  NR>1{
    for(i=2; i<=NF; i++) sub("/",",",$i)
    $0=$0
    for(i=2; i<=NF; i++){
      p=0
      if($i~/G/) p=9
      if($i~/M/) p=6
      if($i~/K/) p=3
      sub(/[GMK]/,".",$i)
      $i*=10^p
    }
  }
  1
' FS=, OFS=, "$1"

shadyuk · January 14, 2013, 10:34am

Hello again and thanks.

Works just as i'd like it to. I amended one part so that I can define which columns it should translate as my csv may have additional ones before and after the bit rate data. Please let me know if there's a more efficient way of doing this.

#!/bin/ksh

if [ $# -ne 1 ] || [ ! -f "$1" ]
then    echo "Usage: $0 file" >&2
        exit 1
fi
awk '
  NR==1{
    $0="timestamp,data1_dl,data1_ul,data2_dl,data2_ul,data3_dl,data3_ul"
  }
  NR>1{
    for(i = 5; i <= 10; i++) sub("/",",",$i)
    $0=$0
    for(i = 5; i <= 10; i++){
      p=0
      if($i~/G/) p=9
      if($i~/M/) p=6
      if($i~/K/) p=3
      sub(/[GMK]/,".",$i)
      $i*=10^p
    }
  }
  1
' FS=, OFS=, "$1"

Thanks again everyone.

---------- Post updated at 10:34 AM ---------- Previous update was at 09:48 AM ----------

One more thing...:o

What if the bit rate data is not in sequential columns? How can I amend the script so that I tell it which columns to translate? For example:

timestamp,test1,data1,test2,data2,test3,data3
08/01/13 16:31:00,test,1M07/54K3,test,786/2K1,test,896K/1M54
08/01/13 16:31:10,test,1K23/432,test,2K44/76K,test,4M34/29K1

Note that the position of the bit rate columns doesn't always follow a pattern such as in the following example:

timestamp,test1,data1,data2,test3,data3
08/01/13 16:31:00,test,1M07/54K3,786/2K1,test,896K/1M54
08/01/13 16:31:10,test,1K23/432,2K44/76K,test,4M34/29K1

I understand that I'll have to modify the script before use but it will at least be flexible.

Thanks.

Don_Cragun · January 14, 2013, 10:37am

shadyuk:

Hello again and thanks.

Works just as i'd like it to. I amended one part so that I can define which columns it should translate as my csv may have additional ones before and after the bit rate data. Please let me know if there's a more efficient way of doing this.
#!/bin/ksh

if [ $# -ne 1 ] || [ ! -f "$1" ]
then    echo "Usage: $0 file" >&2
   exit 1
fi
awk '
  NR==1{
   $0="timestamp,data1_dl,data1_ul,data2_dl,data2_ul,data3_dl,data3_ul"
  }
  NR>1{
   for(i = 5; i <= 10; i++) sub("/",",",$i)
   $0=$0
   for(i = 5; i <= 10; i++){
   p=0
   if($i~/G/) p=9
   if($i~/M/) p=6
   if($i~/K/) p=3
   sub(/[GMK]/,".",$i)
   $i*=10^p
   }
  }
  1
' FS=, OFS=, "$1"
Thanks again everyone.

Obviously, the way I set up your output heading is incorrect since the sample input you gave us didn't match your input.

Assuming that instead of fields 2, 3, and 4 containing slash separated values that you want to convert, you now have fields 5, 6, 7, 8, 9, and 10 containing slash separated values you want to convert and that each of these fields contains a single slash character, then the 2nd for loop needs to be:

    for(i = 5; i <= 16; i++){

instead of:

    for(i = 5; i <= 10; i++){

since you're adding 6 new fields to the line in the first loop. If some of these fields don't have slashes or some of these fields contain more than one slash, you need to add the number of slashes in fields 5 through 10 to 10 for the end point for the loop. If there are a variable number of slashes in fields 5 through 10, additional logic is needed to determine the end point.

If you could set up your input so the fields that need to be modified are all at the end of the line, going back to using NF as the end point will be easier.

This would have been easier on all of us if you had given us a representative sample of what you wanted done originally instead of changing requirements every time we give you something that does what you requested!

shadyuk · January 14, 2013, 10:46am

Thanks Don and noted for next time. It was an evolving thing and I only noticed the limitations in my requests as we trudged along.

Apologies for wasting anyone's time. :o