Sorting blocks by a section of the identifier

I have a file that should be sorted by a string in the identifier, specifically the 7th colon-separated field of the identifier line (shown in red below). The record separator (RS) is ^@M0 , and the file looks something like this:

 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:10817:7690 1:N:0:86
 ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
 @M04961:22:000000000-B5VGJ:1:1101:10091:7763 1:N:0:86
 GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
 +
 :=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
 @M04961:22:000000000-B5VGJ:1:1101:14783:7784 1:N:0:86
 TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
 

Then the desired output is the following:

 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
 @M04961:22:000000000-B5VGJ:1:1101:10817:7690 1:N:0:86
 ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
 @M04961:22:000000000-B5VGJ:1:1101:10091:7763 1:N:0:86
 GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
 +
 :=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
 @M04961:22:000000000-B5VGJ:1:1101:14783:7784 1:N:0:86
 TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
  
 

I have tried several combinations of awk | sort , but haven't been able to get the desired output.
Any help will be greatly appreciated!

Does

sort -t: -n -k7

work? Does your red number always appear in the 7th field when using a colon as a delimiter? Is the field always four digits? Finally, does the "b1" after your sort key ever change? ("b" represents a space.)

wbport
I am not quite sure how your code will produce the desired result. About your questions: everything is constant except the length of the string/field.

What is the typical size of such a file? The background of the question is: would it be feasible to sort it in memory, or will we need to work with temporary files to conserve memory? Furthermore, a rough estimate of the typical number of records would help; orders of magnitude would be sufficient.
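
For reference, both numbers are quick to get from the shell, e.g. (test.txt is just a placeholder for your file, and this assumes exactly 4 lines per record):

 # rough file size and record count
 ls -lh test.txt
 echo "$(( $(wc -l < test.txt) / 4 )) records"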

I suppose the easiest way would be to first transform the "records", which now span multiple lines, into single lines, then use a simple sort like the suggested one to sort these lines, and finally transform the lines back into the original structure. This will work if the resulting line length is sufficiently short and the available memory is big enough to do the sorting.
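
A minimal sketch of that idea with paste and tr could look like this (untested; it assumes every record is exactly four lines and that the data contains no tab characters):

 paste - - - - < test.txt |     # join every 4 lines into one tab-separated line
   sort -t: -k7,7 |             # sort on the 7th colon-separated field of the header
   tr '\t' '\n' > sorted.txt    # split each line back into 4 lines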

I hope this helps.

bakunin

The last item on the sort command line, unless sort is reading data piped in from something else (like sed), is the name of the file to be sorted, optionally followed by a redirect (>) or a pipe (|) to another program. "-t:" says to use a colon as the field delimiter. "-n" says the key is numeric, but if that field is always four digits, the option can be left out.

If you need a numeric sort but want to ignore the space and the digit that follows it, you can use a sed command to insert a colon before the space. After running sort, another sed can take the colon back out.
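
On records that have already been joined into single lines (as suggested above), that idea might look like the following sketch (untested; it assumes the first space on each line is the one right after the sort key, as in the sample data):

 paste - - - - < test.txt |
   sed 's/ /: /' |              # ...:7106 1:N:0:86  ->  ...:7106: 1:N:0:86
   sort -t: -k7,7n |            # field 7 is now just the number, so -n sorts cleanly
   sed 's/: / /' |              # take the helper colon back out
   tr '\t' '\n' > sorted.txt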

Thanks guys! Sort of got it:

 awk -vRS="@M0" 'BEGIN { FS="\n"; OFS="\t" }{ print RS$1, $2, $3, $4 }' test.txt | sed '1d' | sort -t: -k 7 | tr "\t" "\n"
 

These sorts of files contain between 15 and 25 million entries, so I would like to have a "lighter" script handling this thingy.
Thanks in advance!

You can get rid of the sed in your pipeline with a small change to the awk script:

 awk -vRS="@M0" 'BEGIN{FS="\n"; OFS="\t"}NR>1{print RS$1, $2, $3, $4}' test.txt | sort -t: -k 7 | tr "\t" "\n"

I would also be tempted to change the:

sort -t: -k 7

to:

sort -t: -k 7,7

but as long as the field 7 values in your millions of input lines are unique (including the numbers after the space), it won't make any difference.
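
If GNU sort is available, its --debug option (GNU-only) underlines the bytes each key actually covers, which makes the difference easy to see on a single header line:

 printf '%s\n' '@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86' | sort -t: --debug -k 7
 printf '%s\n' '@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86' | sort -t: --debug -k 7,7

With -k 7 the underlined key runs from field 7 to the end of the line; with -k 7,7 it stops at the end of field 7.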

Don
Thanks! Is there any way to sort the data directly using awk instead of "handing" the task to sort ?

Not with standard awk . GNU awk (sometimes invoked as gawk ) includes two built-in functions ( asort() and asorti() ) that can be used to sort arrays. I'm guessing that asorti() could be used to sort an array whose indices are your records' 7th colon-separated field values, but these functions are not available in the awk on my system (so I have no way of testing it out). And, of course, you could write your own function in awk to sort an array. Any of these array-sorting approaches would require loading your entire input file (other than skipping record #1) into an array in memory, sorting the array, and then writing out the sorted array. Your current awk script never keeps more than one input record in memory. (Note that the sort utility uses temporary files when necessary if the file(s) being sorted are too large to fit in memory. You haven't said how large your input file is in bytes, although you have said it can have 15 to 25 million records, so I don't know whether this would be a concern when trying to use awk instead of sort to sort your data on your system.)
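
Just to illustrate the "write your own function" option, a hand-rolled sort over an array's indices might look like the sketch below (illustrative and untested; an insertion sort like this is O(n^2), far too slow for millions of records, which is part of why asorti() or the external sort utility is preferable):

 # Copy the indices of arr into idx[1..n], insertion-sort them as strings,
 # and return n; the caller then walks idx[1], idx[2], ... in order.
 function sort_indices(arr, idx,    i, j, n, key) {
 	n = 0
 	for (i in arr)
 		idx[++n] = i
 	for (i = 2; i <= n; i++) {
 		key = idx[i]
 		for (j = i - 1; j >= 1 && idx[j] > key; j--)
 			idx[j + 1] = idx[j]
 		idx[j + 1] = key
 	}
 	return n
 }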

Don
My files are ~8,000,000kb. Would that limit my ability to sort the records using awk instead of sort ?

Hi Xterra,
Do you have enough memory on your system for awk to build an array larger than 8 terabytes?

I will give it a try on my cluster

It is, ahem, highly unlikely anybody can peruse something with 8TB of RAM.

If your input file is relatively static (i.e., lines only get appended and rarely, if ever, deleted), you might try to create a smaller file with just your sort key and a line number. This file should be considerably smaller (maybe several hundred MB or a few GB) and might be possible to sort in memory.

You would then go through this file and either rewrite your big file or filter out the subset you are interested in using the line numbers. That may again take a long time, but if the file does not change that often (see above) you will only have to redo parts of it, so this might help anyway.
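
A rough sketch of that key-file idea, reusing the GNU-awk record splitting from the pipeline above (untested; keys.sorted is just a placeholder name):

 # write "sort-key <TAB> record-number" for every record; only this much
 # smaller file is actually sorted
 awk -vRS="@M0" 'NR > 1 { split($1, f, ":"); print f[7] "\t" NR - 1 }' test.txt |
   sort -k1,1n > keys.sorted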

Along the same lines: wouldn't a database with an indexed table be what you want? Databases have methods to deal with files that are bigger than the available main memory, so what you are doing here is perhaps old news for DB software.
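
Purely as an illustration of that suggestion (untested; the file, table and column names are made up), the sqlite3 command-line tool could be used along these lines:

 sqlite3 reads.db "CREATE TABLE reads (key INTEGER, rec TEXT);"
 # ... load one row per record here, e.g. the linearized records via .import ...
 sqlite3 reads.db "CREATE INDEX reads_key ON reads (key);"
 sqlite3 reads.db "SELECT rec FROM reads ORDER BY key;" | tr '\t' '\n' > sorted.txt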

I hope this helps.

bakunin

Wouldn't 8.000.000 kB (8 * 10^6 * 10^3 bytes) be 8 GB? And thus (sort of) manageable? How come we're talking terabytes?

Now that you mention it, I think you are right. I just read Don's "8TB" and didn't recalculate myself. My bad.

I just counted one record of the posted sample to have 260 characters. As 15-25 million records were mentioned: 15 * 10^6 * 260 ~ 4 GB and 25 * 10^6 * 260 ~ 6.5 GB. This should indeed be feasible to sort in memory.

bakunin

I'm sorry for all of the confusion. I had originally intended to type 8GB, but hit the T instead of the G key. :frowning: Then while I was reviewing it, I decided to spell it out and converted the 8TB to 8 terabytes compounding, instead of correcting, the error. :mad:

With the BSD-based awk on macOS, I don't have the asorti() function, and only the 1st character of values assigned to RS matters. So the following is completely untested, but if I understand the GNU awk man page correctly, I think the pipeline:

awk -vRS="@M0" 'BEGIN{FS="\n"; OFS="\t"}NR>1{print RS$1, $2, $3, $4}' test.txt | sort -t: -k 7 | tr "\t" "\n"

should be replaceable by the following single invocation of awk :

awk '
BEGIN {	FS = OFS = "\n"			# each line of a record is a field
	RS = "@M0"			# GNU awk: records are separated by the string @M0
}
NR > 1 {split($1, f, /:/)		# NR == 1 is the empty record before the first @M0
	out[f[7]] = RS $0		# keep the whole record, re-attaching the @M0 that RS consumed
	order[f[7]] = ""		# remember the 7th colon-separated field as a sort key
}
END {	n = asorti(order)		# order[1..n] now holds the keys in sorted order
	for(i = 1; i <= n; i++)
		printf("%s", out[order[i]])
}' test.txt

as long as there are no duplicates in the 7th colon separated field in any of the records in your input file. (If there are duplicates, I think all but the last record in each set of duplicates will be missing in the output produced by the above script.)

I would appreciate it if someone with access to GNU awk could try this out with the sample data in post #1 in this thread and let me know if I came close to getting it right. :smiley:

Sorry for the long delay! I will give it a try