Alternative to sort -ur +1 required

Mike_Smith · October 3, 2012, 3:49pm

I've got scripts trawling the network and dumping parsed text into files with an Epoch timestamp in column 1. I append the old data to the new data then just want to keep the top entry if there is an identical duplicate below (column 1 needs to be ignored).

sort -ur +1 works a treat on a Solaris 8 box but on Solaris 10 the 'r' seems to break!

Can some kind soul offer a fix / workaround?

If you ask me a question please keep it dummy level as I'm not super Unix literate.

Corona688 · October 3, 2012, 4:25pm

What is that + syntax supposed to do? It's not even part of the manual page for my version of sort.

If you just want sorting on the first column, sort -ur -k 1,1 I think.

DGPickett · October 3, 2012, 4:29pm

Solaris had a sort bug once, as I recall, but it was more subtle. Try using the -k method of specifying fields, sort direction and such. The +1 -2 notation is obsolescent and 0-based; LINUX does not have it any more. The -k notation is 1-based, not zero-based, which might be more human friendly. BTW, +1 says sort on column 2 and following. I suppose column 1 is the file name?

Some sort of persistent JAVA container could do the testing and storing without a sort, perhaps in a tree. You can put the data into a structure mapped to a flat file, for instance. One possible advantage is that you can prune the set on the fly, if you are not interested in the full set. Also, you can do controlled thread parallelism. It is a lot faster than sort or an SQL RDBMS ETL approach.

Sort can also be sped up with parallelism, using sort -m (merge) and pipes. In bash, and in ksh on nicer systems that have /dev/fd/[0-9]*, you can use process substitution; elsewhere you can use named pipes:

sort -m YOUR_ARGS <(
  sort YOUR_ARGS FILE_LIST_1
 ) <(
  sort YOUR_ARGS FILE_LIST_2
   .
   .
   .
 ) <( 
  sort YOUR_ARGS FILE_LIST_N
 )
 
That is nicer than the named-pipe version (create each pipe first with /sbin/mknod NAMED_PIPE_N p):
 
(
sort YOUR_ARGS -oNAMED_PIPE_1 FILE_LIST_1 &
sort YOUR_ARGS -oNAMED_PIPE_2 FILE_LIST_2 &
.
.
.
sort YOUR_ARGS -oNAMED_PIPE_N FILE_LIST_N &
sort -m YOUR_ARGS NAMED_PIPE_1 NAMED_PIPE_2 . . . NAMED_PIPE_N
)
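A concrete, minimal sketch of the named-pipe variant above. The file names, the sample numbers, the use of mkfifo instead of mknod, and the -n and LC_ALL=C settings are all assumptions made for illustration; the original post only gives the placeholder form.

```shell
# Hypothetical scratch directory with two small unsorted inputs.
workdir=$(mktemp -d)
printf '%s\n' 3 1 5 > "$workdir/part1"
printf '%s\n' 4 2 6 > "$workdir/part2"
mkfifo "$workdir/pipe1" "$workdir/pipe2"

# Sort each part in the background, writing into its own named pipe.
LC_ALL=C sort -n -o "$workdir/pipe1" "$workdir/part1" &
LC_ALL=C sort -n -o "$workdir/pipe2" "$workdir/part2" &

# Merge the two already-sorted streams; -m merges without re-sorting.
merged=$(LC_ALL=C sort -m -n "$workdir/pipe1" "$workdir/pipe2")
wait
rm -rf "$workdir"
printf '%s\n' "$merged"
```

The merge step has to use the same key options as the per-part sorts, otherwise the inputs are not sorted in the order that sort -m expects.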

drl · October 3, 2012, 5:03pm

Hi.

Minor quibble:

sort (GNU coreutils) 8.13
OS, ker|rel, machine: Linux, 3.0.0-1-amd64, x86_64
Distribution        : Debian GNU/Linux wheezy/sid

allows old form:

   On older systems, `sort' supports an obsolete origin-zero syntax
`+POS1 [-POS2]' for specifying sort keys.  The obsolete sequence `sort
+A.X -B.Y' is equivalent to `sort -k A+1.X+1,B' if Y is `0' or absent,
otherwise it is equivalent to `sort -k A+1.X+1,B+1.Y'.

   This obsolete behavior can be enabled or disabled with the
`_POSIX2_VERSION' environment variable (*note Standards conformance::);
it can also be enabled when `POSIXLY_CORRECT' is not set by using the
obsolete syntax with `-POS2' present.

excerpt from info sort

Best wishes ... cheers, drl

Mike_Smith · October 3, 2012, 6:02pm

Gosh! I'll have a play with the -k option, I read the man page and didn't understand the k bit at all.

I need column 1 (Epoch time stamp) ignoring and the rest of the line taken into account for comparison. It's free text from interfaces so could be any number of words and characters.

DGPickett · October 4, 2012, 2:35pm

With -k, some options now ride inside the -k, like reverse and numeric, so they can vary key by key without ambiguity. Unique -u is global to all keys.
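A minimal sketch of modifiers riding inside -k, so that each key gets its own direction and type. The two-column sample data and the LC_ALL=C setting are assumptions for illustration, not from the thread.

```shell
# Hypothetical sample data: epoch timestamp, then one word of text.
data='1349246545 beta
1349455502 alpha
1349455502 beta
1349246545 alpha'

# Key 1: field 2 only, reversed (the r belongs to this key alone).
# Key 2: field 1 only, numeric and ascending (the n belongs to this key alone).
per_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2r -k1,1n)
printf '%s\n' "$per_key"
```

With this data the text column comes out in reverse order (beta before alpha) while the timestamps within each group still run oldest to newest.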

Do you want the whole list, or just the last day's hits or the like? You can write a low latency unique filter that does not sort, using a filtering collection. I posted one I wrote in C using a simple bisection search of an array of pointers: Group By in Unix

Mike_Smith · October 8, 2012, 10:39am

Column 1 needs to be kept but ignored by sort.

Basically there will often be two entries just with the column 1 timestamp being different, I need to keep the top entry.

-k sounds like it could be what I need but the manual is gibberish to me.

Don_Cragun · October 8, 2012, 1:16pm

A straight translation of

sort -ur +1

from the +w and -x options for specifying sort keys to the -k y,z options is

sort -ur -k2

When your input lines are a numeric string representing a timestamp followed by a combination of one or more space and tab characters followed by "other text" ending with the line's terminating newline, this command will sort "other text" in reverse order and discard all but one line with identical values of "other text". If two or more lines have the same "other text" but different timestamps, which timestamp will be kept is unspecified.
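A small sketch of what that command does, assuming three illustrative input lines and the C locale (both assumptions). Because the surviving timestamp is unspecified, the sketch prints only the part of each line that sort actually compared.

```shell
# Sample lines are illustrative; two of them differ only in the timestamp.
data='1349455502 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349246545 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349246545 ygtr-1b:3/1/19 Old customer name typed in here'

# -k2 with no end field: the key runs from field 2 to the end of the line.
# -u keeps one line per distinct key; which timestamp survives is unspecified.
deduped=$(printf '%s\n' "$data" | LC_ALL=C sort -ur -k2)

# Show only the compared part (everything after the timestamp).
compared=$(printf '%s\n' "$deduped" | cut -d' ' -f2-)
printf '%s\n' "$compared"
```

Two lines come out: one for each distinct "other text", in reverse order.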

Mike_Smith · October 9, 2012, 6:19am

Tried -k2,2 and it binned loads of old entries which I wanted!

Say I have this data, the highest number (newest) one will always be at the top and I only want to keep the top one as everything right of, and including, column 2 is the same (nothing has changed since last sweep).
1349455502 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349246545 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP

However if we get the following then I want both entries retaining
1349455502 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349246545 ygtr-1b:3/1/19 Old customer name typed in here

-k2,2 seems to ONLY consider column 2 which is no good based on my first post's criteria.

This is tricky!
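The difference between -k2,2 and -k2 can be seen in a short sketch. It assumes the second pair of sample lines above as input and the C locale; the variable names are made up for illustration.

```shell
data='1349455502 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349246545 ygtr-1b:3/1/19 Old customer name typed in here'

# -k2,2: the key stops at the end of field 2, so both lines have the same key
# and -u keeps only one of them.
field2_only=$(printf '%s\n' "$data" | LC_ALL=C sort -ur -k2,2 | wc -l)

# -k2: the key runs to the end of the line, so the two lines stay distinct.
to_eol=$(printf '%s\n' "$data" | LC_ALL=C sort -ur -k2 | wc -l)

echo "-k2,2 keeps $((field2_only)) line(s); -k2 keeps $((to_eol)) line(s)"
```

So -k2,2 is the form that bins entries whose text differs after column 2, while -k2 compares everything from column 2 onward.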

Don_Cragun · October 9, 2012, 8:38am

It isn't tricky at all. You can't say that the first field doesn't matter when throwing away duplicates and at the same time say that you want the highest value for the first field to be the one that is kept when throwing away duplicates. If you care which duplicate is kept, you don't think they are duplicates. If you want both, you have to do it in two separate steps:

1. sort in reverse order with the data from the start of field 2 to the end of the line as your primary sort key and the numeric value in field 1 as your secondary sort key, and then
2. on a second pass, discard or ignore the second and subsequent lines that match (including field separators other than the separator between the 1st and 2nd fields) from the start of the 2nd field to the end of the line.

So why does it now matter which timestamp is kept when throwing away duplicates if it didn't matter when you were running this application on Solaris 8?

DGPickett · October 9, 2012, 4:18pm

To do it in one pass, sort by user and then by time, and only print the first every time the user changes. To do it in one pass without sorting, store the time using a string addressable vector keyed to the user and overwrite it for later times.
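A rough sketch of the no-sort idea, using an awk associative array. It assumes the key is everything after the timestamp (the thread's data has no separate user column), and the function name and sample lines are hypothetical.

```shell
# Keep, for each distinct text after the timestamp, the line with the
# largest timestamp. Output is in first-seen order of the text.
newest_per_text() {
  awk '
  {
    text = $0
    sub(/^[ \t]*[^ \t]+[ \t]+/, "", text)   # strip timestamp and separator
    if (!(text in newest)) {
      order[++n] = text                     # remember first-seen order
      newest[text] = $1; line[text] = $0
    } else if ($1 + 0 > newest[text] + 0) {
      newest[text] = $1; line[text] = $0    # later timestamp overwrites
    }
  }
  END { for (i = 1; i <= n; i++) print line[order[i]] }'
}

result=$(printf '%s\n' \
  '1349246545 portA old name' \
  '1349455502 portA new name' \
  '1349455502 portB x' \
  '1349246545 portB x' \
  '1349246000 portA old name' | newest_per_text)
printf '%s\n' "$result"
```

This holds one entry per distinct text in memory, so it suits cases where the number of distinct lines is modest compared with the total input.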

Mike_Smith · October 22, 2012, 4:14pm

@Don

Since this works perfectly on a different box and occasionally on this one I was hoping that I'd not have to reinvent this thing to fix my problem.

However I think what you're describing is basic enough for me to manage.

sort -k2 -k1,1 newest-data one-month-data archive-data | sort -ur > output

Sound about right?

And yes if there are two identical lines with just the timestamp being different then I'd like to keep the biggest number (newest)

Don_Cragun · October 22, 2012, 6:21pm

No! Absolutely not! Never! If you are feeding the data through sort twice, the first sort has absolutely no effect (unless you use a -u option in the 1st sort to discard some data and you have already learned that you can't use -u in the 1st sort). The command line you're suggesting:

sort -k2 -k1,1 newest-data one-month-data archive-data | sort -ur > output

is functionally equivalent to:

sort -ur newest-data one-month-data archive-data > output

Both sort commands sort the entire set of input lines according to the sort key specified by that sort command.
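A quick check of that equivalence, assuming a few made-up sample lines and the C locale:

```shell
# Illustrative data, including one exact duplicate line.
data='1349246545 portB x
1349455502 portA y
1349246545 portB x
1349455502 portB x'

# Two sorts in a pipeline: the second one re-sorts everything by its own key.
two_pass=$(printf '%s\n' "$data" | LC_ALL=C sort -k2 -k1,1 | LC_ALL=C sort -ur)

# The second sort on its own.
one_pass=$(printf '%s\n' "$data" | LC_ALL=C sort -ur)

[ "$two_pass" = "$one_pass" ] && echo "identical output"
```

With no -k on the final sort, -u only removes lines that are identical in full, so the timestamp column is not ignored at all.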

You still haven't explained why it matters what the timestamp is on lines that are otherwise identical. The commands that you had on Solaris 8 randomly kept one of the lines that matched from the start of field 2 to the end of the line. As stated before the command:

sort -ur -k2  newest-data one-month-data archive-data

will do what would have happened on Solaris 8 with your current data. If that isn't sufficient, the first step I stated for you in message #10 in this thread can be implemented using:

sort -k2r -k1nr,1 newest-data one-month-data archive-data

but you will need to write another program that reads the data written by the above sort and throws away all but the 1st line of each set of lines that are identical from the start of field 2 to the end of the line. The program that will do this is NOT sort. It is probably an awk script that compares the substring starting at the first character of column 2 and continuing to the end of the line for adjacent lines and prints $0 for the 1st line in each matching set. (Note that this is not the same as comparing fields $2 to $NF because differences in field separators matter in the first case, but are ignored in the second case.)
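One possible shape for that two-step approach, as a sketch rather than a tested production script. It assumes every line has a timestamp followed by a single space and some text, and that plain awk is available (nawk on Solaris); the function name and sample lines are hypothetical.

```shell
# Step 1: sort by the text (field 2 to end of line, reversed), then by
#         timestamp, newest first.
# Step 2: print only the first line of each run whose text after the
#         timestamp is identical, separators included.
keep_newest() {
  LC_ALL=C sort -k2r -k1nr,1 "$@" |
  awk '{
    rest = $0
    sub(/^[ \t]*[^ \t]+[ \t]+/, "", rest)   # drop field 1 and its separator
    if (NR == 1 || rest != prev) print
    prev = rest
  }'
}

kept=$(printf '%s\n' \
  '1349246545 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP' \
  '1349455502 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP' \
  '1349246545 ygtr-1b:3/1/19 Old customer name typed in here' | keep_newest)
printf '%s\n' "$kept"
```

Called with file names, for example keep_newest newest-data one-month-data archive-data, it reads those files; with no arguments it reads standard input.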

DGPickett · October 23, 2012, 9:54am

I believe that 'sort -u' saves just the first occurrence of the unique key, so you sort first non-unique to get the right first record saved.

However, I agree it is a bit of a shame to use so much storage and processing when you could just tuck them in an associative array and overwrite any old values, especially in cases where the data starts out sorted in some relevant way. The only drawback is that the speed of shell operations might be a drag on big volume. You can scale up sort in parallel using pipes and sort -m, but for the unsorted lookup solution at machine speed, C++ or at least JAVA can work a hash table faster, and you can pre-size the hash table big enough to get good use of RAM and VM in even a 32 bit app. I like big powers of 2, since a modulus of the hash becomes a lower bit mask. Empty hash table entries are just pointers in an array, 4 or 8 bytes cost each, which is pretty cheap, and does no harm for smaller data sets! Hash beats tree for query and churn speed, but tree does provide sorted output and scales more automatically. Linear hash (tables in power of two sizes that can double the hash table size for congested buckets) has a better dynamic scaling, but slower query and churn than straight hash. I have not found a lot of hash implementations that reveal they are linear.

binlib · October 23, 2012, 8:10pm

Have you tried the -f or -s option of uniq?

DGPickett · October 24, 2012, 12:59pm

Command 'uniq' only compares adjacent records, so since you have to sort anyway, you might as well ask for uniqueness in the sort itself, with -u.
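For comparison, a sketch of the uniq -f idea: uniq keeps the first line of each adjacent group, so sorting newest-first within each text and then skipping field 1 in the comparison keeps the newest entry. The sample lines and the C locale are assumptions for illustration.

```shell
data='1349246545 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349455502 ygtr-1b:3/1/19 10/100/Gig Ethernet SFP
1349246545 ygtr-1b:3/1/19 Old customer name typed in here'

# Sort by text (reversed), then newest timestamp first, so that lines with
# the same text are adjacent with the newest on top.
# uniq -f 1 ignores the first field when comparing and keeps the first line
# of each adjacent group.
via_uniq=$(printf '%s\n' "$data" | LC_ALL=C sort -k2r -k1nr,1 | LC_ALL=C uniq -f 1)
printf '%s\n' "$via_uniq"
```

Unlike sort -u, which line survives here is determined by the sort order that feeds uniq.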

Mike_Smith · December 3, 2012, 9:25am

Splitting the work seems to have cracked it so I'll post the code to help the next poor sod.

Many thanks for all your help...

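# Current epoch time, using the Solaris nawk srand() idiom.
NOWEPOCH=`/usr/bin/nawk 'BEGIN{print srand()}'`
# Cutoff: 32 days before now.
v1=$(($NOWEPOCH - (32 * 86400)))
# Pass 1: combine all three files, oldest timestamp first.
/usr/bin/sort -k1n,1n $DATABASE $DATABASENEWER $DATABASEOLDER > $HOME/logs/junk$NOWEPOCH
# Pass 2: drop lines that duplicate from field 2 onward, then split at the cutoff
# (a line stamped exactly v1 is written to both files).
/usr/bin/sort -ur -k2 $HOME/logs/junk$NOWEPOCH |\
 /usr/bin/nawk -v v1=$v1 -v t1=$TEMPFILE1 -v t2=$TEMPFILE2 ' $1 <= v1 { print >> t1 } ; $1 >= v1 { print >> t2 }'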