Sort and Remove duplicates

ysvsr1 · February 16, 2015, 10:02pm

Here is my task :

I need to sort two input files and remove duplicates in the output files :

Sort ascending by the 13 characters starting at position 97
Sort ascending by the 1 character at position 96

If duplicates are found retain the first value in the file

the input files are variable length, convert them 250 characters fixed width files with padding spaces.

Mainframe equivalent code :

http://www.unix.com/attachments/shell-programming-and-scripting/6162d1424141975-sort-remove-duplicates-snipit-jpg

Here is the code I developed:

sort -nuc -k1.97,1.109 --key=1.96,1.96 file1 file2  | awk '{ printf ("%-250s\n",$0) }' > out.txt

Can any experts validate and correct me if something is wrong?

Don_Cragun · February 17, 2015, 1:38am

What OS and shell are you using?
What does your data look like?

Are there any spaces or tabs in the first 110 characters of any of your input lines?
What is the maximum line length of a line in your input files?
How big are your input files?

By definition, any lines that compare equal based on the sort key you provide are the same. When using the -u option, the sort utility makes no statement about which line from a set of lines having identical sort keys in the input files will be copied to the output.

RudiC · February 17, 2015, 5:09am

A few comments on your statement:

the -c option does not sort; it only checks whether the input is already sorted
as Don Cragun surmises, any white space before char 96 increases the field count and breaks your key definitions. Set the field separator to an exotic char with -t
you can use the short form -k more than once in a statement
if lines longer than 250 chars can occur (again DC's suspicion), your printf format will expand the line; use the precision field as well: "%-250.250s\n"
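
Taken together, these points suggest a command along the following lines. This is a minimal sketch, not a validated replacement for the original job: the helper name sort_fixed is hypothetical, the SOH byte (\001) is assumed never to occur in the data, and LC_ALL=C is assumed to give the byte-wise ordering that is wanted. It leaves duplicate handling aside.

```shell
# Hypothetical helper: sort the named files on chars 97-109, then char 96,
# and emit every line as exactly 250 characters.
sort_fixed() {
    # Assumption: the SOH byte never appears in the input, so each whole
    # line is field 1 and the character offsets count from column 1.
    sep=$(printf '\001')
    LC_ALL=C sort -t "$sep" -k1.97,1.109 -k1.96,1.96 -- "$@" |
        awk '{ printf("%-250.250s\n", $0) }'   # pad to 250, truncate beyond 250
}
```

It would be called as: sort_fixed file1 file2 > out.txt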

ysvsr1 · February 17, 2015, 10:00am

OS:
Linux x86_64 x86_64 x86_64 GNU/Linux

Sample Data :

YSVSR1 Kiladi    12198ASDA 21329180928AFJASDKDKDK ED AEFKF;p FK ADS 2132309183298209381 akfjalksdfjkdajfk j231 1239128390218309

Data contains spaces, tabs, and alphanumeric values

Sort columns can also contain Alphanumeric Values

Are there any spaces or tabs in the first 110 characters of any of your input lines? Yes.

What is the maximum line length of a line in your input files? 211 Characters

How big are your input files? about 25000 lines in each file

Don_Cragun · February 19, 2015, 9:08pm

In your sample input, the first field was marked in red in the original post; with sort's default field separators it ends at the first blank. Since it contains fewer than 96 characters, every line in your input files will have the same, empty sort keys.

If you are trying to use characters #97 through #109 on the line (marked in orange in the original post) as your primary sort key and character #96 on the line (marked in green) as your secondary sort key, your sort keys would still all be identical. The -n option tells sort to perform a numeric comparison and to stop gathering characters for a key at the first character that is not part of a numeric value. Since characters #96 and #97 on your sample input line are both alphabetic, your sort keys will all be 0 unless you remove the -n option, even if you change the field delimiter to something that does not appear anywhere in your input files.
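
The effect of -n on alphabetic keys can be checked in isolation. The following is a small sketch using made-up two-letter keys, not the poster's data; the variable names are only for illustration.

```shell
# Three distinct alphabetic "keys" (invented sample values).
# With -n each one evaluates to 0, so -u treats them all as duplicates.
numeric_count=$(printf 'sd\nab\nzz\n' | sort -n -u | wc -l)

# Without -n they are compared as text and all three survive.
text_count=$(printf 'sd\nab\nzz\n' | sort -u | wc -l)

echo "lines kept with -n: $((numeric_count)); without -n: $((text_count))"
```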

And, as has been said before, you can't rely on sort -u to get the unique keys if you require that the 1st line be selected out of each set of lines with identical keys. (On some systems, that might happen to be what you get sometimes, but there is no guarantee that that is the line you'll always get.)
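
One way to avoid depending on sort -u is to drop later duplicates in input order first and sort afterwards. The sketch below assumes that "first" means first in file1-then-file2 order, that the SOH byte (\001) never occurs in the data, and that byte-wise (LC_ALL=C) ordering is acceptable; the helper name dedupe_then_sort is hypothetical, and whether this matches the mainframe job's behaviour has not been confirmed in this thread.

```shell
# Hypothetical helper: keep the first line seen for each key
# (chars 97-109 followed by char 96), then sort the survivors.
dedupe_then_sort() {
    sep=$(printf '\001')   # assumed absent from the data
    cat -- "$@" |
        # seen[key]++ is 0 (false) only the first time a key appears,
        # so only the first line for each key is printed.
        awk '!seen[substr($0, 97, 13) substr($0, 96, 1)]++' |
        LC_ALL=C sort -t "$sep" -k1.97,1.109 -k1.96,1.96
}
```

It would be called as: dedupe_then_sort file1 file2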

So, instead of showing us a sort command line that you know is not giving you what you want, please explain in English exactly what you are trying to do. And, explain what you want when you say "the input files are variable length, convert them 250 characters fixed width files with padding spaces." Do you want leading spaces to be added, or do you want trailing spaces to be added?
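
The two padding choices differ only in the printf flag. A minimal sketch, with hypothetical helper names, that reads lines from standard input or from named files:

```shell
# Trailing spaces: left-justify in a 250-character field.
pad_right() { awk '{ printf("%-250.250s\n", $0) }' "$@"; }

# Leading spaces: right-justify in a 250-character field.
pad_left() { awk '{ printf("%250.250s\n", $0) }' "$@"; }
```

In both cases the .250 precision truncates anything longer than 250 characters, so every output line has the same width.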