Sorting data - awk or sort or both

data:

C812F5C9B   0818053014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818054514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818060014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818061514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818063014 P S hdisk45             SOFTWARE PROGRAM ABNORMALLY TERMINATED

I have a huge data file whose contents look like the above.

I'm using the following command to try to get rid of any duplicates:

sort -k 1,1 -k 2,2 -k3,3 -u

Not enough duplicates are being eliminated, so I was wondering whether you guys have a better approach.

I would like to sort and de-duplicate by the first 6 characters of the 2nd field.

the numbers in the second field mean:

let's use 0818063014 as an example:

08 = month
18 = day
06 = hour
30 = minute
14 = year

Is what I'm trying to do possible?

Try this (untested):

awk '!dupe[substr($2,1,6)]++' file

Sorry, I missed the obvious bit about sorting! This only removes duplicates; its output can then be piped to sort.
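
A minimal sketch of that two-step pipeline, assuming the data sits in a file (sample.txt is a hypothetical name, filled here with the lines from the question). awk keeps the first line it sees for each month/day/hour prefix of field 2, and sort then orders what is left by that field:

```shell
# Hypothetical sample file built from the data in the question.
cat > sample.txt <<'EOF'
C812F5C9B   0818053014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818054514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818060014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818061514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818063014 P S hdisk45             SOFTWARE PROGRAM ABNORMALLY TERMINATED
EOF

# Step 1: keep only the first line for each 6-character prefix of field 2.
# Step 2: sort the surviving lines by field 2.
deduped=$(awk '!dupe[substr($2,1,6)]++' sample.txt | LC_ALL=C sort -k2,2)
printf '%s\n' "$deduped"
```

Because awk keeps the first occurrence in input order, sorting before awk instead of after changes which of the duplicate lines survives.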

This actually may work.

I ran it, and it appears to be getting rid of more than I want. But I feel that with a little more massaging, I can get it to do what I want.

Here's what I'm running:

sort -k 1,1 -k 2,2 -k 3,3 -u file | awk '!dupe[substr($2,1,6)]++' 
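
If the prefix alone removes too much, one possible adjustment is to widen the awk key. The sketch below assumes (this is a guess at the intent, not something stated above) that lines for different resources, field 5, should survive even when they fall in the same hour. sample.txt is a hypothetical file holding the lines from the question:

```shell
# Hypothetical sample file built from the data in the question.
cat > sample.txt <<'EOF'
C812F5C9B   0818053014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818054514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818060014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818061514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818063014 P S hdisk45             SOFTWARE PROGRAM ABNORMALLY TERMINATED
EOF

# Sort by field 2 first so the earliest line of each group is the one kept.
# The awk key is the month/day/hour prefix plus the resource name in field 5;
# the comma in the subscript joins the two parts with awk's SUBSEP.
kept=$(LC_ALL=C sort -k2,2 sample.txt | awk '!seen[substr($2,1,6), $5]++')
printf '%s\n' "$kept"
```

With the sample data this keeps the hdisk45 line in addition to one SYSPROC line per hour. Swapping $5 for $1 would key on the error ID instead.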

Try

sort -u -k2.1,2.9 file
C812F5C9B   0818053014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818060014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED

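2.9 is the key's stop position: unless -b is given, the blanks in front of a field count as part of that field, so characters 1 to 9 of field 2 are the three leading blanks plus the first six digits (see man sort).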
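
The stop position 9 only works while exactly three blanks precede the second field. A possible variant, sketched here on a hypothetical sample.txt holding the lines from the question, adds the b modifier to both ends of the key so that positions are counted from the first non-blank character and the stop position becomes 6:

```shell
# Hypothetical sample file built from the data in the question.
cat > sample.txt <<'EOF'
C812F5C9B   0818053014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818054514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818060014 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818061514 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
C812F5C9B   0818063014 P S hdisk45             SOFTWARE PROGRAM ABNORMALLY TERMINATED
EOF

# Without b: the key is 3 leading blanks + 6 digits, hence stop position 9.
with_blanks=$(LC_ALL=C sort -u -k2.1,2.9 sample.txt)

# With b on both ends: leading blanks are skipped, so the key is digits 1 to 6.
no_blanks=$(LC_ALL=C sort -u -k2.1b,2.6b sample.txt)

printf '%s\n' "$no_blanks"
```

Both forms should leave one line per month/day/hour prefix. Which of the duplicate lines sort -u keeps is best not relied on; if that matters, the awk approach keeps the first one seen.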