Sort and remove duplicates in directory based on first 5 columns:

I have a /tmp directory with filenames such as:

010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001_S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212105.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker

I want to sort these files based on the first 5 columns and then remove the duplicates based on those same 5 columns.

I tried the code below:

ls | sort -k1,2,3,4,5

Later I realized there is no need to sort the files. I only need the unique names and the order doesn't matter, so removing the duplicates is enough. I tried this:

ls | awk -F[_-] '!seen[$1,$2,$3,$4,$5]++'

I got:

010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker

If you look closely, one file is missing:

010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker

Please note the field separators within the first 5 columns (_ versus -).

So my desired output is:

010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker

Please help me out with this. I also want to run a for loop over the resulting set of files. Should I delete the duplicate files, or store the unique ones in another directory and run the loop there? Any advice is welcome.

TIA

Sometimes it pays off to follow older threads to their end. Try

ls *.marker | awk  -F'[_-]' '{T = $0; sub (FS $6 ".*$", "", T)} !seen[T]++'
010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker
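
For what it's worth, here is a small sketch of why the first attempt dropped a name. With -F'[_-]' both separators split the line, but they are not part of $1..$5, so two names that differ only in a separator produce the same key. Cutting the original line before field 6 keeps the separators in the key. The variable names (names, fields_only, with_seps) are made up for this illustration; the two sample names come from the question.

```shell
# Two names from the question that differ only in the first separator.
names='010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker'

# Key built from the fields alone: separators are lost, so the keys collide
# and only the first name survives.
fields_only=$(printf '%s\n' "$names" |
    awk -F'[_-]' '!seen[$1,$2,$3,$4,$5]++')

# Key built by trimming the whole line from field 6 onward: the separators
# stay in the key, so both names survive.
with_seps=$(printf '%s\n' "$names" |
    awk -F'[_-]' '{T = $0; sub(FS $6 ".*$", "", T)} !seen[T]++')

echo "fields only:";     printf '%s\n' "$fields_only"
echo "with separators:"; printf '%s\n' "$with_seps"
```

The first command prints one name and the second prints both, which matches the difference between the output in the question and the desired output.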

Thanks, RudiC.

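# Print the first file of each run of names that share the same prefix.
# Relies on the glob listing duplicates next to each other.
for file in *.marker
do
   # Strip the trailing _DATE_SERIAL.extension part to get the comparison key.
   base_name="${file//_[0-9][0-9]*_[0-9][0-9]*[.]*/}"
   [[ "$last_base_name" = "$base_name" ]] || echo "$file"
   last_base_name="$base_name"
done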
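
On the question of running a for loop over the result: one possible approach, sketched here, is to neither delete nor move anything and just collect the unique names in an array first. The loop above compares each name only with the previous one, so it depends on the glob order putting duplicates side by side. The variant below remembers every key it has seen and so does not depend on the order. It assumes bash 4 or later (associative arrays); the scratch directory, the array names and the "processing" line are placeholders for illustration only.

```shell
#!/usr/bin/env bash
# Sketch only: needs bash 4+ for associative arrays.
workdir=$(mktemp -d) || exit 1      # scratch directory standing in for /tmp
cd "$workdir" || exit 1
touch 010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker \
      010020001_S-FOR-Sort-SYEXC_20160229_2212102.marker \
      010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker \
      010020001-S-XOR-Sort-SYEXC_20160229_2212105.marker \
      010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker \
      010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker

declare -A seen                     # keys already encountered
unique=()                           # first file found for each key

for file in *.marker; do
    key=${file%_*_*.marker}         # drop the trailing _DATE_SERIAL.marker
    [[ -n ${seen[$key]} ]] && continue
    seen[$key]=1
    unique+=("$file")
done

# The real work would go here; the files themselves are left untouched.
for file in "${unique[@]}"; do
    printf 'processing %s\n' "$file"
done

cd / && rm -rf -- "$workdir"        # remove the scratch directory
```

Which file of each duplicate group ends up in the array depends on the glob's sort order, which is fine if, as stated in the question, only uniqueness matters.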

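Use sed's extended regex option and compare only the part of each name before the date and serial number. This keeps the last name of each run of adjacent duplicates:

ls | sed -E '$!N; /^(.*)_[0-9]+_[0-9]+\.marker\n\1_[0-9]+_[0-9]+\.marker$/!P; D'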