If you wanted to sort a .csv file that was filled with lines like this:
<Ticker>,<Date as YYYYMMDD>,<Time as H:M:S>,<Volume>,<Corr>
(H: [1, 23]; M, S: [0, 59])
by date, does anybody know of a better solution than turning the two colons in every line into commas (making them the 3rd and 4th commas), sorting on four keys, and then turning those two commas in every line back into colons? It seems very inefficient to me. (I would just do it and not bother asking if these files weren't 50+ GB.)
---------- Post updated at 09:43 PM ---------- Previous update was at 09:27 PM ----------
Meh, I'll let it run overnight.
sed 's/:/,/g' big_file.csv | sort -k 2,2 -k 3,3 -k 4,4 -k 5,5 -t',' | sed 's/,/:/3' | sed 's/,/:/3' > big_file.sorted.csv
Are hours, minutes and seconds all zero padded? For example, 01:02:03 instead of 1:2:3 or 1:02:03? If so, you do not need to modify anything. You can use the default lexicographical sort with the date and time fields as the keys.
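If everything is zero padded, the whole job could reduce to something like this (field numbers follow the layout quoted in the first post; file names are illustrative):

```shell
# With zero-padded YYYYMMDD dates and HH:MM:SS times, lexicographic
# order is chronological order, so no colon juggling is needed.
sort -t, -k2,2 -k3,3 big_file.csv > big_file.sorted.csv
```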
Also, you mentioned that hours range between 1-23. In case it's relevant, that's only a 23-hour day.
If the source file is 50+ GB, you are going to need a lot of RAM. You'll probably need to split the file into smaller chunks, sort them individually, and then merge them with sort -m.
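A manual external sort along those lines might look like this (chunk size and file names are illustrative; adjust -l to fit your RAM):

```shell
# Manual external sort: split, sort each chunk, merge the sorted chunks.
split -l 10000000 big_file.csv chunk.
for f in chunk.??; do
    sort -t, -k2,2 -k3,3 "$f" > "$f.sorted" && rm "$f"
done
sort -m -t, -k2,2 -k3,3 chunk.*.sorted > big_file.sorted.csv
rm chunk.*.sorted
```

The merge step (sort -m) only ever needs one line per chunk in memory, which is what makes this work on a machine that cannot hold the whole file.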
By the way, the consecutive seds in the pipeline can be simplified: sed 's/,/:/3 ; s/,/:/3' .
That'll save some time in context switches and copying data in and out of kernel/userland buffers.
Regards,
Alister
---------- Post updated at 02:55 PM ---------- Previous update was at 02:53 PM ----------
Also, it seems GNU sort can handle this situation by automatically creating tmp files during the sorting process. I'm assuming you're not on Linux. If you are, and you are using GNU sort, you should paste the exact error message.
Post a few lines of your input file so it is apparent what you are talking about... Also, try doing it all in a single command, be it sort, awk, or perl, in order to minimize the inefficiency due to process forking.
And for grins how about...
sort -t, -k2,2n -k3,3 file
After reading your post again I realise it won't work, as the hours field isn't zero padded, but this should...
Process forking isn't an issue. Only a few processes are created by the pipeline. Perhaps you meant the back-and-forth context switching between the few processes which constitute the pipeline.
Neither of your suggestions is appropriate, though. The numeric sort will only look at a leading numeric string. This means that sort will never look beyond the first colon in the time string.
Regards,
Alister
---------- Post updated at 04:02 PM ---------- Previous update was at 03:57 PM ----------
Make sure you adjust the suffix length using -a so that the number of permutations can accommodate the number of expected files.
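For instance, at roughly 40 bytes per line a 50 GB file is over a billion lines; splitting that into 1-million-line chunks needs more than the ~676 names a default two-character suffix provides (the numbers here are illustrative):

```shell
# Three-character suffixes allow 26^3 = 17576 chunk names.
split -a 3 -l 1000000 big_file.csv chunk.
```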
No, you don't. During the sorting step, the entire file's contents are in use. During the merging step, only one line per file being merged needs to be in memory.
Think about it. If you know that two files are already sorted, you only need to compare two lines at a time, make a decision which comes first, print the correct line, read the line that follows that which was printed, rinse and repeat.
Whereas when a file is not sorted, you do not know where a line goes until you've read the entire file at least once.
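The two-file case can be sketched directly in the shell, holding only one line from each file in memory at a time (purely illustrative; sort -m does this job, and far more efficiently — the function name is hypothetical and the script assumes bash):

```shell
#!/bin/bash
# Merge two already-sorted files, reading one line from each at a time.
merge_two() {
    exec 3<"$1" 4<"$2"
    IFS= read -r a <&3; ok_a=$?
    IFS= read -r b <&4; ok_b=$?
    while [ "$ok_a" -eq 0 ] && [ "$ok_b" -eq 0 ]; do
        if [[ $a < $b || $a == "$b" ]]; then
            printf '%s\n' "$a"           # "a" comes first: print it...
            IFS= read -r a <&3; ok_a=$?  # ...and read its successor.
        else
            printf '%s\n' "$b"
            IFS= read -r b <&4; ok_b=$?
        fi
    done
    # Drain whichever file still has lines left.
    while [ "$ok_a" -eq 0 ]; do printf '%s\n' "$a"; IFS= read -r a <&3; ok_a=$?; done
    while [ "$ok_b" -eq 0 ]; do printf '%s\n' "$b"; IFS= read -r b <&4; ok_b=$?; done
    exec 3<&- 4<&-
}
```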
No. I was not describing a specific sorting algorithm (insertion, quicksort, etc.) but an approach which allows one to deal with more data than memory alone allows: External sorting - Wikipedia, the free encyclopedia
As I said earlier, GNU sort should do this external sort for you (you have yet to make it clear which platform you're working with). It checks the size of the file, checks how much memory the system has available, sees it's much too big, and decides to use temp files to store sorted chunks for subsequent merging.
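With GNU sort, both knobs can be set explicitly: -T points the temp files at a filesystem with room, and -S (a GNU/BSD extension, not POSIX) caps the in-memory buffer. Paths and sizes here are illustrative:

```shell
# Sort with a 4 GiB memory buffer, spilling temp files to /big/tmp.
sort -S 4G -T /big/tmp -t, -k2,2 -k3,3 big_file.csv > big_file.sorted.csv
```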
Whatever sort utility you're using, I'm assuming it's doing this since your error message mentions a temp file.
Perhaps someone familiar with your operating system can give more specific advice with that read i/o error.
Is it possible that your /tmp ran out of space during the sort? That something cleared /tmp while the sort was running? That the hardware is having issues?
It would also be helpful to know the specs of your hardware (ram, available space on relevant filesystems, and such).
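A quick sanity check along those lines, assuming the default temp location:

```shell
# How much free space does the temp filesystem actually have?
df -h /tmp
```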
I was just looking at a few sort implementations; they all already use TMPDIR to store sorted chunks when dealing with very large input. You can try doing it manually, but your sort tool is probably already doing that for you.
Context switching isn't limited just to the pipeline processes... The fewer the processes on the system, the less the context switching... Hope you catch my drift.
The sort on my AIX and HP-UX boxes is able to sort the entire H:M:S field, not just up to the first colon... Perhaps your sort has a limitation mine doesn't.
The moment that sort sees that colon in the numeric key, it stops evaluating the key. It will not go any further. All times with the same hour will compare equal because the minutes and seconds are not looked at. At this point, to decide how to order all of those equal records, sort will then look at the rest of the line ... starting with the first character. Also, it will not do so numerically, but lexicographically.
Unless the colon is a radix or thousands separator character in the current locale (which is not the case in any locale that I'm aware of), it's not a valid part of the numeric string. The numeric key comparison will end right there.
Looking again at what you proposed:
We can see that the third field numeric key will not be inspected beyond the hours portion. All times with the same hour will compare equal regardless of the value of minutes and seconds. To then decide how to order the lines which have compared equal, the first field will be considered and then everything from the first colon in the third field inclusive till the end of the line. This is obviously wrong.
If your sort(1) does not behave as described above, it's broken (with regard to POSIX compliance).
Here's a sample to illustrate the point:
$ cat data
b,11:00:59
a,11:45:00
c,11:15:00
d,11:01:00
e,11:00:00
The following does not give the desired result because the numerical comparison ends at the first colon of the second field. Therefore, every single line compares as equal (key evaluates to 11). At that point, sort then looks at the entire line. Note how in this example the output order is essentially a sort on the first field.
$ sort -t, -k2,2n data
a,11:45:00
b,11:00:59
c,11:15:00
d,11:01:00
e,11:00:00
To get a sort on the time using numeric comparison (which is not really necessary in this case since each time component is two digits), you need to work around the colon with finer grained keys. The following almost gets us there, but note that in this case, the seconds are not considered and records with the same hour and minute are sorted according to the first field (the first two lines are incorrect):
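For completeness, one way to key on all three time components is POSIX field.character offsets within the second field; a sketch against the sample data above:

```shell
# Recreate the sample, then sort on hour/minute/second sub-keys:
# hours are characters 1-2 of field 2, minutes 4-5, seconds 7-8.
printf '%s\n' 'b,11:00:59' 'a,11:45:00' 'c,11:15:00' 'd,11:01:00' 'e,11:00:00' > data
sort -t, -k2.1,2.2n -k2.4,2.5n -k2.7,2.8n data
```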
I just thought I would come on here and explain what I ended up doing.
Problem: Server has too little hard-drive space to perform sort, but plenty of RAM, whereas local machine has plenty of hard-drive space to perform sort, but not enough RAM.
Solution:
-- Remote Server --
$ cat data/run.sh
# Location with enough space to hold one copy of file
tmp_dir="/devlinux_work3/<name>/tmp"
sed 's/:/,/g' "$1" | sort -k 2,2 -k 3,3 -k 4,4 -k 5,5 -t',' -T ${tmp_dir} | sed 's/,/:/3 ; s/,/:/3'
-- Local Machine --
plink <name>@<machine>.<company> "bash /devlinux_work4/<name>/data/run.sh /devlinux_work4/<name>/data/trades.csv" > trades_sorted.csv