Sorting problem: Multiple delimiters, multiple keys

Hello

If you wanted to sort a .csv file that was filled with lines like this:

<Ticker>,<Date as YYYYMMDD>,<Time as H:M:S>,<Volume>,<Corr>

(H is in [1, 23]; M and S are in [0, 59].)

by date and time, does anybody know of a better solution than turning the two colons of every line into commas, sorting on four keys, and then turning those commas (by that point the 3rd and 4th commas of each line) back into colons? It seems very inefficient to me. (I would just do it and not bother asking if these files weren't 50+ GB.)

---------- Post updated at 09:43 PM ---------- Previous update was at 09:27 PM ----------

Meh, I'll let it run overnight.

sed 's/:/,/g' big_file.csv | sort -k 2,2 -k 3,3n -k 4,4 -k 5,5 -t',' | sed 's/,/:/3' | sed 's/,/:/3' > big_file.sorted.csv

how about:

sort -t , -k2n 

Try this one

sed 's/:/,/g' test_sort | sort -t ',' -k 2,7 | sed 's/,/:/3' | sed 's/,/:/3' > test_org_final && mv test_org_final test_sort

Sort broke @ 2am:

read failed: /tmp/sortOgLpWg: Input/output error

This is a real problem now. Does anybody know of a way to sort a humongous file that won't (likely) break?

Are hours, minutes and seconds all zero padded? For example, 01:02:03 instead of 1:2:3 or 1:02:03? If so, you do not need to modify anything. You can use the default lexicographical sort with the date and time fields as the keys.
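
For instance, something like this would suffice (a minimal sketch; the file name is illustrative):

$ sort -t, -k2,2 -k3,3 trades.csv > trades_sorted.csv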

Also, you mentioned that hours range between 1 and 23. In case it's relevant, that's only a 23-hour day.

If the source file is 50+ GB, you are going to need a lot of RAM. You'll probably need to split the file into smaller chunks, sort them individually, and then merge them with sort -m.
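
A minimal sketch of that approach (chunk size and file names are illustrative; the keys assume the zero-padded case above):

$ split -l 50000000 trades.csv chunk_                                  # split on line boundaries
$ for f in chunk_*; do sort -t, -k2,2 -k3,3 "$f" > "$f.sorted"; done   # sort each chunk
$ sort -m -t, -k2,2 -k3,3 chunk_*.sorted > trades_sorted.csv           # merge the sorted chunks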

Regards,
Alister

Oddly, the hours aren't zero padded but the minutes and seconds are. (I think it's like ([1-9]|1[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] in regex-speak.)

I'm going to try to figure out how to split it up and then attempt sorting again -- thanks.

By the way, the consecutive seds in the pipeline can be simplified: sed 's/,/:/3 ; s/,/:/3' .

That'll save some time in context switches and copying data in and out of kernel/userland buffers.
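
Applied to the pipeline above, the whole job becomes (a sketch):

sed 's/:/,/g' big_file.csv | sort -t',' -k 2,2 -k 3,3n -k 4,4 -k 5,5 | sed 's/,/:/3 ; s/,/:/3' > big_file.sorted.csv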

Regards,
Alister

---------- Post updated at 02:55 PM ---------- Previous update was at 02:53 PM ----------

Also, it seems GNU sort can handle this situation by automatically creating temp files during the sorting process, so I'm assuming you're not on Linux. If you are, and you are using GNU sort, you should paste the exact error message.
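
If temporary space is the concern, sort's -T option can point the temp files at a filesystem with more room, e.g. (the path is illustrative):

$ sort -T /mnt/bigdisk/tmp -t, -k2,2 -k3,3 trades.csv > trades_sorted.csv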

Post a few lines of your input file so it is apparent what you are talking about... also, try doing it all in a single command, be it sort, awk, or perl, in order to minimize the inefficiency due to process forking.

And for grins how about...

sort -t, -k2,2n -k3,3 file

After reading your post again I realise it won't work as the hours field isn't zero padded, but this should...

sort -t, -k2,2n -k3,3n file

This is what happened when I tried to use split:

split: output file suffixes exhausted

Going to try again with bigger splits.

Current file (example):

>> cat 2009_Trades.csv | head -3

SYMBOL,DATE,TIME,PRICE,SIZE,CORR
AAPL,20090102,7:30:01,84.00,230,0
AAPL,20090102,7:30:02,84.01,270,0

(I changed the prices since I have practically no rights to this data.)

There are more symbols but you don't see any others until X MB into the file because it is currently sorted by symbol, then date, then time.

Goal file (example):

>> cat NEW_2009_Trades.csv | head -6

Id,Symbol,TradeTime,Price,Shares,Corr
1,CSCO,2009-01-02T07:30:00,16.96,580,0
2,GOOG,2009-01-02T07:30:00,321.23,200,0
3,AAPL,2009-01-02T07:30:01,90.75,720,0
4,IBM,2009-01-02T07:30:01,87.37,200,0
5,AMZN,2009-01-02T07:30:02,54.01,90,0

Process forking isn't an issue. Only a few processes are created by the pipeline. Perhaps you meant the back-and-forth context switching between the few processes which constitute the pipeline.

Neither of your suggestions is appropriate, though. The numeric sort will only look at a leading numeric string. This means that sort will never look beyond the first colon in the time string.

Regards,
Alister

---------- Post updated at 04:02 PM ---------- Previous update was at 03:57 PM ----------

Make sure you adjust the suffix length using -a so that the number of suffix permutations can accommodate the number of expected files.
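
For example (a sketch; the line count is illustrative):

$ split -l 50000000 -a 3 trades.csv chunk_

With -a 3 the suffixes run from chunk_aaa through chunk_zzz, allowing 26^3 = 17576 output files instead of the default 26^2 = 676.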

Regards,
Alister

Wait.

Wow.

How am I supposed to "sort individually"?

If I split the file up, every time two "sorted" files are combined I still need to sort the merged file, and therefore I run into the same problem.

No you don't. During the sorting step, the entire file's contents are in use. During the merging step, only one line per file being merged needs to be in memory.

Think about it. If you know that two files are already sorted, you only need to compare two lines at a time, make a decision which comes first, print the correct line, read the line that follows that which was printed, rinse and repeat.

Whereas when a file is not sorted, you do not know where a line goes until you've read the entire file at least once.
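
That merge is exactly what sort -m does, across all of the chunks at once (a sketch, assuming every chunk was sorted with the same keys):

$ sort -m -t, -k2,2 -k3,3 chunk_*.sorted > merged.csv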

Regards,
Alister


I totally misunderstood you the first time.

So, basically try to split the file by each ticker, and then write some simple code to do my own sorting, correct?

Edit: I guess that all could have been summed up in two words: "Insertion sort"

No. I was not describing a specific sorting algorithm (insertion, quicksort, etc...) but an approach for handling more data than memory alone can hold: External sorting - Wikipedia, the free encyclopedia

As I said earlier, GNU sort should do this external sort for you (you have yet to make it clear which platform you're working with). It checks the size of the file, checks how much memory the system has available, sees it's much too big, and decides to use temp files to store sorted chunks for subsequent merging.

Whatever sort utility you're using, I'm assuming it's doing this since your error message mentions a temp file.

Perhaps someone familiar with your operating system can give more specific advice about that read I/O error.

Is it possible that your /tmp ran out of space during the sort? That something cleared /tmp while the sort was running? That the hardware is having issues?

It would also be helpful to know the specs of your hardware (ram, available space on relevant filesystems, and such).
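
For example (df is standard; free is Linux-specific):

$ df -h /tmp    # free space on the filesystem holding /tmp
$ free -m       # available RAM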

Alister

I'm using Ubuntu Jaunty.

But unless somebody can come up with a solution within the next hour I'm just going to write something to manually split and sort it.

I was just looking at a few sort implementations; they all already use $TMPDIR to store sorted chunks when dealing with very large input. You can try doing it manually, but your sort tool is probably already doing that for you.
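
For example (the path is illustrative):

$ TMPDIR=/mnt/bigdisk/tmp sort -t, -k2,2 -k3,3 trades.csv > trades_sorted.csv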

Good luck.

Context switching isn't limited just to the pipeline processes... the fewer the processes on the system, the less the context switching... hope you catch my drift.

The sort on my AIX and HP-UX boxes is able to sort the entire H:M:S field, not just up to the first colon... perhaps your sort has a limitation mine doesn't.

You misunderstand how sort works.

The moment that sort sees that colon in the numeric key, it stops evaluating the key. It will not go any further. All times with the same hour will compare equal because the minutes and seconds are not looked at. At this point, to decide how to order all of those equal records, sort will then look at the rest of the line ... starting with the first character. Also, it will not do so numerically, but lexicographically.

The relevant behavior is specified in the POSIX sort standard: Man Page for sort (POSIX Section 1) - The UNIX and Linux Forums

Unless the colon is a radix or thousands separator character in the current locale (which is not the case in any locale that I'm aware of), it's not a valid part of the numeric string. The numeric key comparison will end right there.

Looking again at what you proposed (sort -t, -k2,2n -k3,3n file):

We can see that the third-field numeric key will not be inspected beyond the hours portion. All times with the same hour will compare equal regardless of the value of minutes and seconds. To then decide how to order the lines which have compared equal, the first field will be considered, and then everything from the first colon in the third field (inclusive) to the end of the line. This is obviously wrong.

If your sort(1) does not behave as described above, it's broken (with regard to POSIX compliance).

Here's a sample to illustrate the point:

$ cat data
b,11:00:59
a,11:45:00
c,11:15:00
d,11:01:00
e,11:00:00

The following does not give the desired result because the numerical comparison ends at the first colon of the second field. Therefore, every single line compares as equal (key evaluates to 11). At that point, sort then looks at the entire line. Note how in this example the output order is essentially a sort on the first field.

$ sort -t, -k2,2n data
a,11:45:00
b,11:00:59
c,11:15:00
d,11:01:00
e,11:00:00

To get a sort on the time using numeric comparison (which is not really necessary in this case, since each time component is two digits), you need to work around the colon with finer-grained keys. The following almost gets us there, but note that in this case the seconds are not considered and records with the same hour and minute are sorted according to the first field (the first two lines are incorrect):

$ sort -t, -k2.1,2.2n -k2.4,2.5n data
b,11:00:59
e,11:00:00
d,11:01:00
c,11:15:00
a,11:45:00

The correct solutions for this example:

$ sort -t, -k2.1,2.2n -k2.4,2.5n -k2.7,2.8n data
e,11:00:00
b,11:00:59
d,11:01:00
c,11:15:00
a,11:45:00
$ sort -t, -k2,2 data
e,11:00:00
b,11:00:59
d,11:01:00
c,11:15:00
a,11:45:00

Although neither of these solutions is valid for this thread's problem since they both depend on the time having a fixed format.
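
For the unpadded hours in this thread's actual data, one workaround would be to zero-pad the hour first, sort lexicographically, and then strip the padding (a sketch, assuming the hour is the only unpadded component and only the time field contains colons):

$ sed 's/,\([0-9]\):/,0\1:/' trades.csv | sort -t, -k2,2 -k3,3 | sed 's/,0\([0-9]\):/,\1:/' > trades_sorted.csv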

Regards,
Alister

Thanks for all your help and insight, alister.

I just thought I would come on here and explain what I ended up doing.

Problem: Server has too little hard-drive space to perform sort, but plenty of RAM, whereas local machine has plenty of hard-drive space to perform sort, but not enough RAM.

Solution:

-- Remote Server --

$ cat data/run.sh

# Location with enough space to hold one copy of file
tmp_dir="/devlinux_work3/<name>/tmp" 

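# The hour field (sort key 3) is not zero padded, so it is compared numerically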
sed 's/:/,/g' "$1" | sort -k 2,2 -k 3,3n -k 4,4 -k 5,5 -t',' -T "${tmp_dir}" | sed 's/,/:/3 ; s/,/:/3'

-- Local Machine --

plink <name>@<machine>.<company> "bash /devlinux_work4/<name>/data/run.sh /devlinux_work4/<name>/data/trades.csv" > trades_sorted.csv

You're quite welcome. And thank you for reporting back.

Regards,
Alister