I have need of a sorting script that will sort a file of tab delimited data on multiple columns. I often do this kind of thing in excel, but I have need of something more automated. I guess the syntax would be something like, sort by columnName, then by otherColumnName, etc. There are allot of options for this sort of thing, such as the sort order (a-z, z-a), data types, etc, so I'm not sure where to begin. I know that the higher level interpreters like python have nice sort functions, but I don't know python really at all. I guess that for now I would hard code the sort criteria, but it also might be nice to eventually be able to pass that in, assuming that a reasonable syntax could be developed.
It looks like I could use the sort command, but I would prefer to not use column numbers, but header names. Since this will be automated to some extent, it will not be readily apparent if the wrong col number was used. I supposed I could search for the header name and determine what the correct col number is, but I'm not sure how to do that either. I'm sure there is an awk command for that, but I don't remember.
Unless you will know from the header name how a field should be sorted (forward or reverse, numeric or alphanumeric, etc.), you will probably need some manual intervention. If you just want to use header names instead of field numbers but can otherwise use the sort utility's flags to control the type of sorting to be done, we can probably help you come up with a way to do that if you give us some concrete examples of what your data looks like (including the headers). Please use code tags when you post this information; knowing whether fields are separated by <space>s, <tab>s, <colon>s, <comma>s or something else will be crucial to coming up with a working solution.
You are correct in noting that some versions of awk provide built-in sorting functions, but they are not portable and (as far as I know) limited to increasing order alphanumeric sorting.
Below is a sample of a file I just sorted in excel. It's tab delimited text and has unix EOL if that matters.
This was sorted as follows,
Cpyridne_N
Cphenol_O
fusering
nasC
all sorted in ascending order and all of these are ints. I think that I could use the sort arguments if I had the column numbers, but it would be nice to investigate what the possibilities are. Most of the scripts I start out with require some manual editing until I figure out exactly how I want to use them and get them refined. I am doing this in windows under cygwin in case that makes a difference.
You should simply insert the data into a db with a loader, then produce output. If you have Excel you have MS office and probably a db like Access. UNIX has free db's: mysql, Berekely db, etc.
So, Why does your idea look like reinventing the wheel?
The reason is that Don Cragun is right. You have to have complete metadata for each and every column to be able to sort based on arbitrary column selections. And there is too much metadata to cobble together anything useful in shell. Metadata can change. db's are meant for that contingency.
The way it is now, you have to know what you are sorting by looking at it - in effect gathering metadata. You have to use human intelligence to make decisions. Then use unix sort/Excel sort. Databases are meant to do this. You just tell db's to sort on column names, optionally: ascending or descending for each column. You can write database scripts (SQL language for example) to do stats or a lot of what you already can do in Excel.
So, stay in Excel or move to a db. I vote for a db.
Yes, a database would be nice. I am in the process of working on a database system for some of this data, but there are non-trivial issues with other parts of it not related to this, so I am finishing what I am working on using some scripting tools that I already have. The sorting is the only part of the process that is not already scripted, so I am having to open each file in excel, set up a sort, save the file, and then do some shell to get the txt file back to linux land. All of these sorts are on ints and in ascending order. I have a few hundred to do, so after the first several, I started thinking that there had to be a better way.
I am in one of those situations where the choice is between slogging through an inefficient process, or taking allot of time to set up the correct process. I don't know SQL at all, so I am reluctant to spend days getting all of that set up when the sort function is the only thing that I can't automate. Of course, after taking the time to learn and set it up, I would have learned more tools, and that is not a small thing in any way. Eventually I will do all of the steps I am doing now with SQL queries out of SQLite, using ruby scripts to populate the database (and eventually a browser interface). For now, someone is waiting on the results for this and I need to get it done as quickly as possible. Even if I have to hard code modify a separate script for each sorting criteria, that will still be much faster than all the excel, especially since I have to modify the resulting excel file. I will have to do this process again with other data, so once the scripts are set up, I should be able to automate the entire process.
I was thinking that if I was going to set up a sort script, I might as well try to make is a general a tool as possible so that it could be useful for other things. It does appear that there is allot of possible variation in how the script would need to operate in different cases, so perhaps my thinking was not realistic. Never the less, it would be a significant help with my current project and would not need to be generalized for that purpose.
Appreciating and supporting what Don and Jim say, but, on the other hand, understanding that and why you are looking for a quick fix, I brought up the following that might do the task. I'm not sure it will run correctly under cygwin - no chance to test. Runs fine, at least with your sample data, on my linux system.
Put the sort order into a file
Cpyridne_N
Cphenol_O
fusering
nasC
, making sure the items in there match your heading items (no error checking done in here). It needs to open infile twice, but reads only the header line in the first case.
$(awk 'NR==FNR{Ar[++n]=$1;next}
FNR=1 {exit}
END {printf "sort";
for (i=1;i<=n;i++)
for (j=1;j<=NF;j++)
if (Ar==$j) {printf " -k%d,%d", j, j; break};
printf "\n"}
' sortorder infile ) infile
Execute as is, i.e. the entire cmd in $(...). It will sort the headers below everything else; if you can't live with that, use head and tail commands to reverse.
I had started working on a solution similar to RudiC's suggestion last night, but fell asleep before finishing it. RudiC left out one important element; for the fields you're sorting you need to specify a numeric sort.
The following works on OS X, but I don't have a Cygwin system to test. This keeps the header line at the start of the file. The way it is written, it produces debugging information in a file named debug.out that shows the fields selected by the given sort keys, shows the sort command that is used to perform the sort, and lists the fields from each record that will be sorted by that command. (Actually, the entire record will be used as a final sort key if all of the selected keys match in some records, but since the first field in each line in your input files is a sequence number, that field will always be enough to disambiguate any records that match up to that point.)
#!/bin/ksh
awk -v dbg=1 '
BEGIN{ FS = OFS = "\t"}
FNR==NR{# We are in the 1st file. Each line is the name of a field to be used
# as a sort key, with the 1st line being the primary sort key.
key[++nk] = $1
next
}
FNR==1{ # We are on the 1st line of the 2nd file. Determine the sort command
# to use to implement the desired sort order. All keys are to be
# treated as ascending order numeric fields.
sortcmd = "sort -t \"" FS "\" -n"
for(i = 1; i <= nk; i++) {
# For each key...
for(j = 1; j <= NF; j++) {
if($j == key) {
# We have a match...
if(dbg)printf("key[%d](%s) is field %d\n",
i, key, j) > "debug.out"
if(dbg)keyf = j
sortcmd = sortcmd " -k" j "," j
break
}
}
if(j > NF) {
# This key does not have a matchine field heading.
printf("sorter: No heading matches key[%d] (%s)\n",
i, key)
ec = 1
}
}
if(ec) exit ec
if(dbg)printf("sortcmd is \"%s\"\n", sortcmd) > "debug.out"
print
next
}
{ # We have a data line. Feed it to sort.
if(dbg) {
printf("line %d key info: %s", FNR, $keyf[1]) > "debug.out"
for(i = 2; i <= nk; i++) printf("\t%s", $keyf) > "debug.out"
printf("\t%s\n", $1) > "debug.out"
}
print | sortcmd
}
END{ close(sortcmd)
}' keys data
If you don't want the debugging information, you can disable it by changing:
awk -v dbg=1 '
early in the script to:
awk -v dbg=0 '
or
awk '
or by removing all of the statements that start with if(dbg) .
Wow, thanks allot for working this out. This will really save me allot of time. It looks like it would be reasonable to make simple changes, like to alphanumeric sorting, or to change the sort order.
After a few changes to make this into a callable script run in bash, this is what I ended up with.
#!/usr/bin/bash
# call with $1 list of column headers to be sorted on, one header per line
# call with $2 name of file to be sorted
# will be prefixed to name of data file to create output file
OUTPUPREFIX="_makesdf"
# parse arguments
KEYFILE=$1
DATAFILE=$2
# make sure input is has unix EOL
dos2unix -q $KEYFILE
dos2unix -q $DATAFILE
# change to dbg=1 for debug output to logfile
#awk -v dbg=1 '
awk -v dbg=0 '
BEGIN{ FS = OFS = "\t"}
FNR==NR{# We are in the 1st file. Each line is the name of a field to be used
# as a sort key, with the 1st line being the primary sort key.
key[++nk] = $1
next
}
FNR==1{ # We are on the 1st line of the 2nd file. Determine the sort command
# to use to implement the desired sort order. All keys are to be
# treated as ascending order numeric fields.
sortcmd = "sort -t \"" FS "\" -n"
for(i = 1; i <= nk; i++) {
# For each key...
for(j = 1; j <= NF; j++) {
if($j == key) {
# We have a match...
if(dbg)printf("key[%d](%s) is field %d\n",
i, key, j) > "debug.out"
if(dbg)keyf = j
sortcmd = sortcmd " -k" j "," j
break
}
}
if(j > NF) {
# This key does not have a matching field heading.
printf("sorter: No heading matches key[%d] (%s)\n",
i, key)
ec = 1
}
}
if(ec) exit ec
if(dbg)printf("sortcmd is \"%s\"\n", sortcmd) > "debug.out"
print
next
}
{ # We have a data line. Feed it to sort.
if(dbg) {
printf("line %d key info: %s", FNR, $keyf[1]) > "debug.out"
for(i = 2; i <= nk; i++) printf("\t%s", $keyf) > "debug.out"
printf("\t%s\n", $1) > "debug.out"
}
print | sortcmd
}
END{ close(sortcmd)
}' $KEYFILE $DATAFILE > $OUTPUPREFIX"_"$DATAFILE
I have a local sort file with the list of headers to sort on, and this scripts lives with the rest of my path tools (/usr/local/bin/) so I can call it from the shell or another script.
@Don Cragun: impressive suggestion, esp. the debug stuff. Seen it before, admired it before, inclined to adopt it.
I had thought about the numeric sort, but as there are non numeric fields in the file as well, I disregarded it for the first attempt. By a slight enhancement we can make my suggestion accept "per field sort options", and this should be doable for Don's code as well:
$(awk 'NR==FNR{Ar[++n]=$1; SO[n]=$2; next}
FNR=1 {exit}
END {printf "sort";
for (i=1;i<=n;i++)
for (j=1;j<=NF;j++)
if (Ar==$j) {printf " -k%d,%d%s", j, j, SO; break};
printf "\n"}
' sortorder infile ) infile
will evaluate the options as given in the sortorder file, e.g. numeric reverse: