Sample records from file:
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000127,A00127,A000127,143245730649,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000127,A00127,A000127,143245730649,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480042302,A42302,A000127,143245800913,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480041999,A41999,A000127,143245801337,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000163,A00163,A000163,143245730774,A00163,4133403
14480042302,A42302,A000127,143245800913,A00127,
Desired output:
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000127,A00127,A000127,143245730649,A00127,
14480000163,A00163,A000163,143245730774,A00163,4133403
14480041999,A41999,A000127,143245801337,A00127,
14480042302,A42302,A000127,143245800913,A00127,
I should also mention that roughly 40-50% of this file (20-25 GB) consists of duplicate records.
Unfortunately, all columns need to be considered as part of the key when determining duplicates.
The order of the data (sorted/unsorted) in the resultant file doesn't matter. Only the removal of duplicates is essential.
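
To make the requirement concrete, here is a minimal sketch of one approach that fits these constraints: hash-partition the file into temporary buckets on disk, then remove duplicates within each bucket using an in-memory set. Identical lines always hash to the same bucket, so no duplicate can survive across buckets, and output order is not preserved (which is acceptable here). The function name `dedupe_large_file`, the default `num_buckets=64`, and the choice to skip blank lines are assumptions made for illustration, not something given in the question. The bucket count would need tuning so that the distinct lines of one bucket fit in RAM and the number of open files stays below the OS limit.

```python
import os
import tempfile
import zlib


def dedupe_large_file(src_path, dst_path, num_buckets=64):
    """Write each distinct line of src_path to dst_path exactly once.

    The whole line is the key, so all columns take part in the comparison.
    num_buckets is a hypothetical tuning knob (assumption): more buckets
    means less memory per bucket but more simultaneously open files.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        bucket_paths = [
            os.path.join(tmp_dir, f"bucket_{i}.txt") for i in range(num_buckets)
        ]
        buckets = [open(path, "wb") for path in bucket_paths]
        try:
            # Pass 1: route every record to a bucket chosen by its hash.
            with open(src_path, "rb") as src:
                for line in src:
                    # Normalise line endings so a final line without a
                    # newline still matches its earlier copies.
                    record = line.rstrip(b"\r\n")
                    if not record:
                        continue  # assumption: blank lines are not records
                    idx = zlib.crc32(record) % num_buckets
                    buckets[idx].write(record + b"\n")
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: each bucket is small enough to deduplicate with a set.
        with open(dst_path, "wb") as dst:
            for path in bucket_paths:
                seen = set()
                with open(path, "rb") as bucket:
                    for line in bucket:
                        if line not in seen:
                            seen.add(line)
                            dst.write(line)
```

A call would look like `dedupe_large_file("input.csv", "output.csv")`, where both file names are placeholders. The temporary buckets need about as much free disk space as the input file. If a Unix shell is available, an external sort such as `sort -u` is a possible alternative, since sorted order is not a problem here.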