Sort based on certain value in a column

redse171 · May 2, 2014, 4:06pm

Hi,

i need to sort content of files based on a specific value. An example as below.

Input1.txt

Col_1
SW_MH2_ST
ST_F72_9S
SW_MH3_S6
Col_2
SW_MH3_AS7
ST_S15_9CH
SW_MH3_AS8
SW_MH3_ST
Col_3
ST_M93_SZ
ST_C16_TC

the output that i want is to sort based on "Col_X" (X is the number) as bold.

output.txt

Col_1
ST_F72_9S
SW_MH2_ST
SW_MH3_S6
Col_2
ST_S15_9CH
SW_MH3_AS7
SW_MH3_AS8
SW_MH3_ST
Col_3
ST_C16_TC
ST_M93_SZ

Tried different kind of sort command but the results is not what i expected. I have many files that i need to sort and they contains different number of "Col_X" (eg: Col_20). Could u guys pls help me on this. Thanks

jim_mcnamara · May 2, 2014, 4:23pm

Could you please post example input and expected output? The way the question is worded right now it is hard to see what you mean.

redse171 · May 2, 2014, 4:29pm

Hi jim mcnamara,

I just edited the filename in my question above as it seems confusing. Ok, the sample input is "input1.txt" and the sample output is "output.txt". Thanks

rdrtx1 · May 2, 2014, 5:21pm

try:

awk '
/^Col/ {col[++c]=$0; next;}
{data[c,NR]=$0;}
END {
   for (i=1; i<=c ; i++) {
      print col;
      for (j=1; j<=NR; j++) if (data[i,j]) ar[data[i,j]]=data[i,j];
      n=asort(ar);
      for (l=1; l<=n; l++) print ar[l];
      delete ar;
   }
} ' Input.txt

redse171 · May 2, 2014, 5:43pm

Hi rdrtx1,

i tried but got this error: "function asort never defined". It seems that i could not use "asort".

rdrtx1 · May 2, 2014, 6:07pm

try:

awk '
/^Col/ {col[++c]=$0; next;}
{data[c,NR]=$0;}
END {
   for (i=1; i<=c ; i++) {
      print col;
      for (j=1; j<=NR; j++) if (data[i,j]) ar=ar "\n" data[i,j];
      sub("^\n","", ar);
      sort="sort";
      print ar | sort;
      close(sort);
      ar="";
   }
} ' Input.txt

redse171 · May 2, 2014, 6:14pm

Hi rdrtx1,

It works perfectly! Thanks a ton. If u dont mind, can u pls explain the codes? Thanks.

Don_Cragun · May 2, 2014, 7:09pm

Maybe I didn't understand what you're trying to do. I thought the Col_n was supposed to cause the following lines up to the next Col_n line to be sorted in increasing alphanumeric order based on the nth character in the line.

Unfortunately, with the sample data given, there is no way to tell if you're trying to sort on the whole line or on the nth character since the results would be the same. If you are trying to sort on the nth character, rdrtx1's script won't do that. Assuming that there aren't any spaces or tabs in your input file (at least not before the character position that is to be sorted), the following might work:

awk '
function finish() {
	if(sc != "") close(sc)
}

/^Col_/ {
	finish()
	print
	col = substr($0, 5)
	sc = sprintf("sort -k1.%d,1.%d", col, col)
	next
}
{	print | sc
}
END {	finish()
}' File1.txt

If your input file contains:

Col_1
SW_MH2_ST
ST_F72_9S
SW_MH3_S6
Col_10
SW_MH3_AS7
ST_S15_9CH
SW_MH3_AS8
SW_MH3_ST
Col_5
ST_M93_SZ
ST_C16_TC
Col_4
Abc4123
Cde3234
Bcd2345
Def1234

rdrtx1's script will produce:

Col_1
ST_F72_9S
SW_MH2_ST
SW_MH3_S6
Col_10
ST_S15_9CH
SW_MH3_AS7
SW_MH3_AS8
SW_MH3_ST
Col_5
ST_C16_TC
ST_M93_SZ
Col_4
Abc4123
Bcd2345
Cde3234
Def123

while the script above will produce:

Col_1
ST_F72_9S
SW_MH2_ST
SW_MH3_S6
Col_10
SW_MH3_ST
SW_MH3_AS7
SW_MH3_AS8
ST_S15_9CH
Col_5
ST_C16_TC
ST_M93_SZ
Col_4
Def1234
Bcd2345
Cde3234
Abc4123

If there are spaces in your input file, you need to specify a field separator in the sort command naming a character that can never appear in your input file. If there are tabs in the input file and you want to sort based on output line positions (rather than input character counts), you would have to expand input tabs to a variable number of spaces depending on where in the input line the tab(s) appear. And if you want to sort on output print positions and there are backspace characters in the input, you will need to give a much clearer explanation of what is supposed to happen.

If you want to try the above awk script on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

redse171 · May 2, 2014, 7:32pm

Hi Don Cragun,

Thanks so much for kind explanation. Really appreciate that.
For my files right now, the Col_n is already in alphanumeric order and i just need to sort ascending the members of Col_n only, which makes rdrtx1's script work perfectly. As for your codes, i tried to understand it as it would be a big help for me if I need to sort the nth too in the future. I tried your code, but, i am wondering why "Col_5" comes before "Col_4" and why col_4 members are sorted descending? Hope u can help to explain them. Thanks.

Don_Cragun · May 2, 2014, 8:00pm

Your description of what to do was vague. You said:

I thought that meant you were trying to sort lines following lines of the form Col_n in increasing alphanumeric order based on the nth column (i.e., input character position) in those lines. So, the output produced by my script is sorted on the characters marked in red:

Col_1   # following lines are sorted on character 1
ST_F72_9S
SW_MH2_ST
SW_MH3_S6
Col_10  # following lines are sorted on character 10
SW_MH3_ST
SW_MH3_AS7
SW_MH3_AS8
ST_S15_9CH
Col_5   # following lines are sorted on character 5
ST_C16_TC
ST_M93_SZ
Col_4   # following lines are sorted on character 4
Def1234
Bcd2345
Cde3234
Abc4123

redse171 · May 2, 2014, 8:35pm

Hi,

Ok, now i get what u meant. Sorry for the confusion. Actually "Col_X" here just to represent groups. I just want to sort ascending the members of each group. thanks

Don_Cragun · May 2, 2014, 9:32pm

OK. You asked how rdrtx1's script works. Basically, it reads the data from your input file into two arrays. One containing the separator lines, and the other containing the data for each group. After it has read all of the data, it prints the group name line and uses the sort utility to sort the data it accumulated for each group. A slightly simpler awk script with comments is:

awk '
/^Col_/ {
	# Group separator found.
	# Finish sorting the previous group, if there was a previous group.
	if(NR != 1) close("sort")
	# Print separator for next group.
	print
	# Skip to next line of input.
	next
}
{	# Send data for current group to sort command...
	print | "sort"
}
END {	# Finish sorting hte last group.
	close("sort")
}' File1.txt

This will not work if the 1st line in your input file does not start with Col_ and will probably produce an error message if your input file is an empty file, but I assume neither of these is a problem. In practice, the lines shown in orange could be left off, but it is good practice to explicitly close() any pipelines you open.

redse171 · May 2, 2014, 9:39pm

Hi Don Cragun,

Many thanks!!! Really appreciate that.