Sort based on Multiple Columns in UNIX

Hi,

I would like to sort a list in different ways:

1> Unique based on Field 1, keeping the line with the highest Field 4

For Instance Input:

1678923450;11112222333344;11-1x;2_File.xml
1678923450;11112222333344;11-1x;5_File.xml
1234567890;11113333222244;11-1x;3_File.xml

Output:

1234567890;11113333222244;11-1x;3_File.xml
1678923450;11112222333344;11-1x;5_File.xml

2> Unique based on Field 2 and Field 1 together, keeping the line with the highest Field 4

For Instance Input:

1234567890;11113333222244;11-1x;3_File.xml
1234567890;11112222333344;11-1x;1_File.xml
1678923450;11112222333344;11-1x;2_File.xml

Output:

1234567890;11112222333344;11-1x;1_File.xml
1678923450;11112222333344;11-1x;2_File.xml
1234567890;11113333222244;11-1x;3_File.xml

1>

sort -t\; -k1,1 -k4,4d infile | awk -F\; '!($1 in A) { print $0; A[$1] }'

2>

sort -t\; -k1,1 -k2,2 -k4,4d infile | awk -F\; '!($1$2 in A) { print $0; A[$1$2] }'

Thanks for the reply.

1> Sorry, but I got this output from the code:

1234567890;11113333222244;11-1x;3_File.xml
1678923450;11112222333344;11-1x;2_File.xml

2> The code output is as expected.

Can you please explain the awk logic you've used so I can modify it further for my reports?

Sorry, I should have checked my sort parameters: the 4th field was supposed to be sorted in descending order, which needs the r flag (not d). So the correct answers are:

sort -t\; -k1,1 -k4,4r infile | awk -F\; '!($1 in A) { print $0; A[$1] }'
sort -t\; -k1,1 -k2,2 -k4,4r infile | awk -F\; '!($1$2 in A) { print $0; A[$1$2] }'
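
For reference, with the r flag the first pipeline now gives the expected output on the sample input from your first post:

1234567890;11113333222244;11-1x;3_File.xml
1678923450;11112222333344;11-1x;5_File.xml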

For the awk logic:

-F\; uses the semicolon as the field separator; this lets awk split the line into the fields $1, $2 ... $NF

!($1$2 in A) means only process lines where the value of field1 concatenated with field2 (your unique key) is not already in the array A

print $0 prints the entire line

A[$1$2] adds the key (field1 concatenated with field2) to the array A

So it prints the first line on which each unique key value occurs.
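
Spread out with comments, the second one-liner is doing this (same command, just reformatted for readability):

sort -t\; -k1,1 -k2,2 -k4,4r infile | awk -F\; '
!($1$2 in A) {    # key (field1 concatenated with field2) not seen yet?
    print $0      # yes: print the whole line
    A[$1$2]       # referencing the element creates it, marking the key as seen
}'

One caveat: $1$2 is plain string concatenation, so keys like 12/345 and 123/45 would collide; with fixed-width fields like yours that isn't a problem.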


Thanks.

Is there a way I can add a counter field at the end of each line? This counter should keep track of each occurrence of the 1st field.

For instance:

1234567890;11113333222244;11-1x;3_File.xml;Try-1
1234567890;11112222333344;11-1x;1_File.xml;Try-2
1678923450;11112222333344;11-1x;2_File.xml;Try-1

Use something like this, with an array named C[] to keep track of your counter.

sort -t\; -k1,1 -k2,2r -k4,4r infile | awk -F\; '!($1$2 in A) { print $0 ";Try-" ++C[$1]; A[$1$2] }'
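
The only new part is ++C[$1], which increments field1's counter before appending it to the line:

sort -t\; -k1,1 -k2,2r -k4,4r infile | awk -F\; '
!($1$2 in A) {                  # same uniqueness test as before
    print $0 ";Try-" ++C[$1]    # bump the counter for this field1 value, then append it
    A[$1$2]                     # mark the key as seen
}'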

That worked perfectly. Thanks a lot!

Hey,

Quick query.

If I have 2 files - File1 (2M lines) and File2 (5K lines) - and I want to find and replace the 5K lines from File2 in File1, what's the fastest way?

I'm currently doing it with sed, which takes ages.

File1 (2M Lines)

A1;B1;C1;D1;E1
A2;B2;C2;D2;E2
A3;B3;C3;D3;E3

File 2 (5K Lines)

A2;B2;F2;G2
A4;B4;F4;G4

Pick each line from File2, find the matching line in File1, then delete that line from File1 and add the corresponding line from File2 in its place.

Merged & Replaced Output (2M Lines)

A1;B1;C1;D1;E1
A2;B2;F2;G2
A3;B3;C3;D3;E3


I couldn't go with join since it matches on only one column; I need to match on the first two fields from File2 in File1.
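
A single awk pass over both files should be far faster than running sed once per File2 line. Here is a minimal sketch of that approach - it assumes the first two fields uniquely identify a line, and the output file name merged is just a placeholder:

awk -F\; '
NR == FNR {                 # first file (File2): remember each replacement line
    repl[$1 FS $2] = $0     # keyed on field1;field2
    next
}
($1 FS $2) in repl {        # second file (File1): key matches a File2 line?
    print repl[$1 FS $2]    # print the replacement instead of the original
    next
}
{ print }                   # no match: keep the File1 line unchanged
' File2 File1 > merged

Note that this only replaces lines whose key already exists in File1 - the A4;B4;F4;G4 line from File2 is not appended, which matches your sample output.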