Sort and remove duplicates in directory based on first 5 columns:

gnnsprapa · January 23, 2018, 3:31am

I have a /tmp directory containing files named like this:

010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001_S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212105.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker

I want to sort these files based on the first 5 columns and then remove the duplicates based on those same 5 columns.

I tried the code below:

ls | sort -k1,2,3,4,5

Later on I realized there is no need to sort the files. I only need the unique names and the order doesn't matter, so I just want to remove the duplicates. I tried this:

ls | awk -F[_-] '!seen[$1,$2,$3,$4,$5]++'

I got:

010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker

If you look closely, I am missing one file, i.e.:

010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker

Please note the field separators within the first 5 columns: names that differ only in _ versus - should count as distinct.

So my desired output is:

010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker

Please help me out with this. I also want to run a for loop over the desired result set, so should I delete the duplicate files, or store the unique filenames in some other directory and then run the for loop? I need some advice.

TIA

RudiC · January 23, 2018, 5:08am

Sometimes it pays off to follow older threads to their end. Try

ls *.marker | awk  -F'[_-]' '{T = $0; sub (FS $6 ".*$", "", T)} !seen[T]++'
010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker
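
To see why this works where the first attempt did not: splitting on [_-] and using $1..$5 as the key throws the separators away, so 010020001_S-FOR and 010020001-S-FOR end up with the same key. The command above instead cuts the whole line at the separator in front of field 6 and uses what is left, separators included. A small sketch to make the key visible (the function names show_keys and unique_names are just for illustration, and printf stands in for ls so it can be run anywhere):

```shell
# Sample names taken from the question; printf stands in for `ls *.marker`.
names='010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker
010020001_S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker
010020001-S-XOR-Sort-SYEXC_20160229_2212105.marker
010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker'

# Print the dedup key next to each name: everything before the separator
# that precedes field 6, with the original _ and - characters left intact.
show_keys() {
  printf '%s\n' "$names" |
    awk -F'[_-]' '{ T = $0; sub(FS $6 ".*$", "", T); print T " <- " $0 }'
}

# Keep only the first name seen for each key.
unique_names() {
  printf '%s\n' "$names" |
    awk -F'[_-]' '{ T = $0; sub(FS $6 ".*$", "", T) } !seen[T]++'
}

show_keys
unique_names
```

With the six sample names, unique_names leaves four lines, one per distinct prefix. Note that $6 is used as part of a regex here, which is fine as long as that field contains only digits, as it does in the sample.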

gnnsprapa · January 23, 2018, 9:00am

Thanks, RudiC.

rdrtx1 · January 23, 2018, 2:35pm

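# Requires bash ([[ ]] and ${var//pattern/}).
for file in *.marker
do
   # Strip the _<digits>_<digits>.<extension> tail to get the part to compare on
   base_name="${file//_[0-9][0-9]*_[0-9][0-9]*[.]*/}"
   # Print a file only when its base name differs from the previous one;
   # this relies on the glob listing equal base names next to each other
   [[ "$last_base_name" = "$base_name" ]] || echo "$file"
   last_base_name="$base_name"
done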
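
On the question of deleting or copying before running the loop: neither should be necessary, since the list of unique names can be fed straight into a loop. A rough sketch, assuming the file names contain no newlines (list_unique_markers and process_unique_markers are made-up helper names, and the printf is only a placeholder for the real work):

```shell
# list_unique_markers DIR: print one .marker name per distinct prefix.
# The awk array does not need the input sorted, so the order in which
# ls lists the files does not affect which prefixes are found.
list_unique_markers() {
  ( cd "$1" && ls |
      awk -F'[_-]' '/\.marker$/ { T = $0; sub(FS $6 ".*$", "", T); if (!seen[T]++) print }' )
}

# process_unique_markers DIR: run a loop body once per unique file.
process_unique_markers() {
  list_unique_markers "$1" | while IFS= read -r file
  do
    printf 'processing %s\n' "$1/$file"   # placeholder: replace with the real work
  done
}

process_unique_markers "${1:-/tmp}"
```

Nothing is deleted or moved here; the duplicates are simply skipped. The while loop runs in a subshell because of the pipe, so variables set inside it are not visible afterwards.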

abdulbadii · February 9, 2018, 5:50pm

Use the extended regex option:

ls | sed -E '$!N; /^(.*\.marker)\n\1$/!P; D'
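
One caveat: the pattern above compares whole lines, so it only drops a line that is identical to its neighbour, and ls never lists the same name twice. To compare on the prefix instead, something along these lines might do. This is only a sketch: it assumes GNU sed (backreferences with -E), names ending in _<digits>_<digits>.marker, and input sorted so that equal prefixes are adjacent; dedupe_adjacent_by_prefix is a made-up name.

```shell
# Read sorted names on stdin and keep the first name of each run of
# adjacent lines that share the part before _<digits>_<digits>.marker.
dedupe_adjacent_by_prefix() {
  sed -E \
    -e ':a' \
    -e '$!N' \
    -e 's/^((.*)_[0-9]+_[0-9]+\.marker)\n\2_[0-9]+_[0-9]+\.marker$/\1/' \
    -e 'ta' \
    -e 'P' \
    -e 'D'
}

# Sample input in C-locale order, where - sorts before _.
printf '%s\n' \
  010020001-S-FOR-Sort-SYEXC_20160229_2212102.marker \
  010020001-S-XOR-Sort-SYEXC_20160229_2212104.marker \
  010020001-S-XOR-Sort-SYEXC_20160229_2212105.marker \
  010020001_S-FOR-Sort-SYEXC_20160229_2212101.marker \
  010020001_S-FOR-Sort-SYEXC_20160229_2212102.marker \
  010020001_S-ZOR-Sort-SYEXC_20160229_2212106.marker |
  dedupe_adjacent_by_prefix
```

On a real directory this would be used as LC_ALL=C ls | dedupe_adjacent_by_prefix. The LC_ALL=C matters because some locales ignore punctuation when sorting, which can place a - name between two _ names with the same prefix and break the adjacency the sed script depends on.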