I want to write a program that reads this database and tells me when the 3rd item of a group is smaller than the 2nd item of the same group, or when the 1st item is greater than the 2nd item of that group. Basically, item1 < item2 < item3 is the standard within each group. If this is not met within a group, the script should print all the groups that deviate. For example, the output would be:
However, it won't do it. With the commas in the format, your script does not work; with the equal sign, it works. Any idea what I am doing wrong?
---------- Post updated at 11:24 AM ---------- Previous update was at 09:46 AM ----------
Disregard my previous posting. I mistakenly entered a comma instead of a semicolon. That's why it was not working.
Thanks!
---------- Post updated at 11:31 AM ---------- Previous update was at 11:24 AM ----------
Now, if you could explain the script to me, that would help. I need to modify it somehow. Basically, what I am trying to do is the following:
The output data for all the groups has 6 entries. But while most of the groups have only three entries filled, some groups have all 6 entries filled. Whenever the script reads the file and sees a group with only 3 entries filled, it needs to print the three rows that have data but strip the other three empty rows; for the groups with 6 entries filled, it just needs to print them out. It would be fantastic if you or someone else could help.
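For what it's worth, one possible starting point is sketched below. This assumes groups are separated by blank lines, as in the original sample, and that an empty entry is a line with nothing after the "=":

BEGIN { RS = ""; FS = "\n" }           # paragraph mode: one record per group
{
    for (i = 1; i <= NF; i++)
        if ($i !~ /=[[:space:]]*$/)    # skip entries with no value after "="
            print $i
    print ""                           # keep a blank line between groups
}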
Thanks!
---------- Post updated at 07:56 PM ---------- Previous update was at 11:31 AM ----------
Hey rdcwayx,
I tried your script but got this error message.
awk: record `1.3=12
1.2=348
1.1=180
...' too long
For your info, my input file format was as follows:
Firstly, the Perl one-liner posted earlier will *not* work for groups of 6 lines. It works *only* for groups of 3 lines.
Secondly, it would help if you could post some test data for groups of 3 lines as well as 6 lines, along with the expected output that shows exactly what is to be done for each group.
The input you've posted above is difficult to comprehend because it does not have the blank line that separates one group from the other.
Okay, I have 6 groups. Now I want the script to go through these groups and look at the structure of each one: if .1 (within a group) is greater than .2 or .3, or if .2 is greater than .3 within a group, then output that group. In our case the output would be the groups where this happens.
BEGIN {
    RS = ""        # paragraph mode: records are blank-line-separated groups
    FS = "\n"      # each line of a group becomes one field
    ORS = "\n\n"   # separate printed groups with a blank line
}
{
    n = split($NF, t, "=");      one = t[n]    # .1 value (last line of the group)
    n = split($(NF-1), t, "=");  two = t[n]    # .2 value
    n = split($(NF-2), t, "=");  three = t[n]  # .3 value
    # print the whole group when the item1 < item2 < item3 ordering is violated
    if (one > two || one > three || two > three) print
}
The block of code above is the content of the file 'ernst.awk'.
Your input to the script is assumed to be in file 'inputFile'.
The calling sequence is:
nawk -f ernst.awk inputFile
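For example, assuming 'inputFile' holds the two-group sample shown later in this thread (group 1 satisfies 12 < 13 < 14, while group 230 has 148 > 147), the run would look like:

$ nawk -f ernst.awk inputFile
230.3=146
230.2=147
230.1=148

Group 1 is skipped because its values are in increasing order; group 230 is printed because .1=148 is greater than both .2=147 and .3=146.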
Interestingly enough, your script works fine with the sample that I posted (the short file) but does not work for a larger file, such as my database. When I use your script on my database, which contains hundreds of groupings, I get the following error message:
awk: record `1 .6=
1 .5=
1 ...' too long
---------- Post updated at 08:35 AM ---------- Previous update was at 08:23 AM ----------
vgersh99:
I do not get an error message when I run your script with my database, but the output file is empty. I do not get any data, which is wrong, because I know for sure that some of the groupings meet the conditions I listed.
Durden:
You are right: your scripts work fine for the small sample I provided you with. However, they do not work for my huge file.
With my huge file, I do not get an output file; whenever I cat the output, I do not get any data.
I did not get an error message either.
Your real data file is different from the sample data:
No blank line between groups.
The first column has a space, but in your sample data there is no space!
1 .6=
1 .5=
1 .4=
1 .3=12
1 .2=348
1 .1=180
Each group has 6 lines; some groups have no data in lines 4, 5, or 6.
That's why our scripts can't work on your real data.
---------- Post updated at 01:18 PM ---------- Previous update was at 12:36 PM ----------
Based on your real data, I updated the script:
awk -F "[.=]" '{a[$1]; b[$1,$2]=$3}
END {for (i in a) {if (b[i,1]>b[i,2]||b[i,2]>b[i,3]||b[i,1]>b[i,3])
for (j=6;j>=1;j--) printf "%3s.%s=%s\n", i,j,b[i,j]
}
} ' urfile
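To see what the field separator does with the new format (a quick check, not part of the post above): -F "[.=]" splits each line on both "." and "=", so the trailing space stays attached to the first field:

$ echo '1 .3=12' | awk -F "[.=]" '{ printf "f1=[%s] f2=[%s] f3=[%s]\n", $1, $2, $3 }'
f1=[1 ] f2=[3] f3=[12]

That space stays in $1 and is carried through to the printf output.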
Well, how huge is your huge file? 1,000 lines? 10,000 lines? 100,000 lines? 1 million lines?
Also, my suggested command does not create an output file. So if you executed the Perl one-liner exactly as I had posted it, you wouldn't see any output file either.
The Perl one-liner processes your input file ("input.dat" in my post) and spews the output on stdout - which is your Terminal screen by default.
If you mean displaying the output file with the "cat" command, then did you redirect the output to a file first?
If yes, then can you post exactly what you typed on your Terminal screen? (i.e., can you copy/paste the session from your Terminal screen?)
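For reference, here is the difference, using a trivial stand-in one-liner just for illustration (this is not the actual one-liner I posted earlier):

$ # output goes to stdout, i.e. the Terminal:
$ perl -lne 'print if /\.1=/' input.dat
$ # output redirected to a file first, then displayed:
$ perl -lne 'print if /\.1=/' input.dat > output.dat
$ cat output.dat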
As posted by others, your input files do not show consistent data. This is what you've posted earlier:
As you can see, the differences are listed below:
Difference #1: Your old input file did not have a space between "1" and ".", whereas your new file has the space.
# First line of old input file
1.6=176
# First line of new input file
1 .6=
Difference #2: Your old input file has a number to the right of every single "=" character. Your new input file does not have a number to the right of every single "=" character.
# First 5 lines of old input file
1.6=176
1.5=172
1.4=168
1.3=14
1.2=13
# First 5 lines of new input file
1 .6=
1 .5=
1 .4=
1 .3=12
1 .2=348
Difference #3: Your old input file has blank lines at the end of each "group". Your new input file does not have even a single blank line.
# First 10 lines of old input file; it has two "groups" with a blank line to separate them
1.6=176
1.5=172
1.4=168
1.3=14
1.2=13
1.1=12

230.3=146
230.2=147
230.1=148
# First 10 lines of new input file; it has no blank lines anywhere in the file
1 .6=
1 .5=
1 .4=
1 .3=12
1 .2=348
1 .1=180
10 .6=
10 .5=
10 .4=
10 .3=360
Needless to say, you shouldn't expect consistent solutions to inconsistent problems!
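That said, one hypothetical way to bring the new format closer to the old one is to delete the stray space before the "." (this sed line is only an illustration, not something posted in this thread, and it does not re-insert the missing blank lines between groups):

$ sed 's/ *\././' newfile > normalized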
Sure thing. Since you did not mention how huge your input file is, I'll assume it has 2 million lines.
Here's what I did: I took the input file "input.dat" and kept appending its content over and over to another file called "input.txt".
The final line count of "input.txt" is roughly 2 million lines.
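In case you want to reproduce the test, the repeated appending can be done with a simple loop like this (a sketch; the exact commands were not part of my post, and 250000 is just an arbitrary repetition count):

$ : > input.txt                      # start with an empty file
$ i=0
$ while [ $i -lt 250000 ]; do
    cat input.dat >> input.txt       # append the sample once more
    i=$((i + 1))
  done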
Here's some information about "input.txt".
$
$ # the line, word and character counts of "input.txt"; note that it has 2,062,500 lines
$ wc input.txt
2062500 1687500 14625000 input.txt
$
$ # the first 10 lines of "input.txt"
$ head input.txt
1.6=176
1.5=172
1.4=168
1.3=14
1.2=13
1.1=12

230.3=146
230.2=147
230.1=148
$
$ # the last 10 lines of "input.txt"
$ tail input.txt
100.5=172
100.4=168
100.3=20
100.2=12
100.1=16

117.3=24
117.2=82
117.1=79
$
And now, I run the Perl one-liner on the file "input.txt" and redirect the output to file "output.txt".
I also feed the entire one-liner to the "time" command.
$
$
$ time perl -lne 'chomp;
if (/^\s*$/) {
if ($x>$y or $x>$z or $y>$z) {print foreach (@a); print}
@a=(); $x=$y=$z="";
} else {
push @a,$_;
if (/^\d+\.1=(.*)$/) {$x = $1}
elsif (/^\d+\.2=(.*)$/) {$y = $1}
elsif (/^\d+\.3=(.*)$/) {$z = $1}
}
END {if ($x>$y or $x>$z or $y>$z) {print foreach (@a); print}}
' input.txt >output.txt
real 0m15.125s
user 0m0.015s
sys 0m0.031s
$
$
$ wc output.txt
937500 750000 8250000 output.txt
$
$ head output.txt
230.3=146
230.2=147
230.1=148

100.6=176
100.5=172
100.4=168
100.3=20
100.2=12
100.1=16
$
$ tail output.txt
100.5=172
100.4=168
100.3=20
100.2=12
100.1=16

117.3=24
117.2=82
117.1=79
$
$
And that's 15.125 seconds to process 2 million lines.