UNIX compare, sort lines and append difference

To make it easier, i gave following example. It is not homework or classwork. Instead, i have a huge csv file dump from tsql with 15 columns and around 300 rows. I was able to extract content that needs to be really converted. Here is the extract:

ES FP,B1ES FP,70000,I,SL22,SL22 (70000)
ES FP,B2ES FP,80000,I,XX22,XX22 (80000)
VIL SJ,B1VIL SJ,20000,I,SL22,SL22 (20000)
VIL SJ,B2VIL SJ,20000,I,XX22,XX22 (20000)
VIL SJ,S3VIL SJ,25040222,N,SL22,SL22 (25040222)
VIL SJ,S4VIL SJ,25040222,N,XX22,XX22 (25040222)
MING NO,B1MING NO,240000,I,SL22,SL22 (240000)
MING NO,B2MING NO,240000,I,XX22,XX22 (240000)
BCHN SE,B1BCHN SE,20000,I,SL22,SL22 (20000)
BCHN SE,B2BCHN SE,20000,I,XX22,XX22 (20000)

desire result should be:

ES FP,I,SL22 (70000)|XX22 (80000)
VIL SJ,I,SL22 (20000)|,XX22 (20000)
VIL SJ,N,SL22 (25040222)|XX22 (25040222)
MING NO,I,SL22 (240000)|XX22 (240000)
BCHN SE,I,SL22 (20000)|XX22 (20000)

based on sorting of field f4.

to make it simple, I created two files using grep for I and N but don't know how to append the third column if f1,f2 are same using comma as delim.

=========================

UNIX compare, sort lines and append difference
Hi,

I have a file that needs to be converted:

content is:

a, b, 4
a ,b, 5
x, y, 1
a, b, 1
x, y, 3

how can i get:

a, b, 1|4|5

x,y 1|3

---------- Post updated at 10:37 AM ---------- Previous update was at 10:27 AM ----------

ES FP,B1ES FP,70000,I,SL22,SL22 (70000)
ES FP,B2ES FP,80000,I,XX22,XX22 (80000)
VIL SJ,B1VIL SJ,20000,I,SL22,SL22 (20000)
VIL SJ,B2VIL SJ,20000,I,XX22,XX22 (20000)
VIL SJ,S3VIL SJ,25040222,N,SL22,SL22 (25040222)
VIL SJ,S4VIL SJ,25040222,N,XX22,XX22 (25040222)
MING NO,B1MING NO,240000,I,SL22,SL22 (240000)
MING NO,B2MING NO,240000,I,XX22,XX22 (240000)
BCHN SE,B1BCHN SE,20000,I,SL22,SL22 (20000)
BCHN SE,B2BCHN SE,20000,I,XX22,XX22 (20000)
cat file

ES FP,B1ES FP,70000,I,SL22,SL22 (70000)
ES FP,B2ES FP,80000,I,XX22,XX22 (80000)
VIL SJ,B1VIL SJ,20000,I,SL22,SL22 (20000)
VIL SJ,B2VIL SJ,20000,I,XX22,XX22 (20000)
VIL SJ,S3VIL SJ,25040222,N,SL22,SL22 (25040222)
VIL SJ,S4VIL SJ,25040222,N,XX22,XX22 (25040222)
MING NO,B1MING NO,240000,I,SL22,SL22 (240000)
MING NO,B2MING NO,240000,I,XX22,XX22 (240000)
BCHN SE,B1BCHN SE,20000,I,SL22,SL22 (20000)
BCHN SE,B2BCHN SE,20000,I,XX22,XX22 (20000)

awk -F, '{$1==$1 && $2==$2;a=$1;b=$4;c=$6;getline;d=$NF;print a","b","c"|"d}' file

ES FP,I,SL22 (70000)|XX22 (80000)
VIL SJ,I,SL22 (20000)|XX22 (20000)
VIL SJ,N,SL22 (25040222)|XX22 (25040222)
MING NO,I,SL22 (240000)|XX22 (240000)
BCHN SE,I,SL22 (20000)|XX22 (20000)
1 Like

That a lot, how can i even sort on f1 again and append f2.

i.e

from

ES FP,I,SL22 (70000)|XX22 (80000)
VIL SJ,I,SL22 (20000)|XX22 (20000)
VIL SJ,N,SL22 (25040222)|XX22 (25040222)
MING NO,I,SL22 (240000)|XX22 (240000)
BCHN SE,I,SL22 (20000)|XX22 (20000)

to

ES FP,I,SL22 (70000)|XX22 (80000)
VIL SJ,I,SL22 (20000)|XX22 (20000),N,SL22 (25040222)|XX22 (25040222)
MING NO,I,SL22 (240000)|XX22 (240000)
BCHN SE,I,SL22 (20000)|XX22 (20000)

Is it always two consecutive lines, or, like in your a, b, 1|4|5 example, can the lines be spread over the file?

yes, it should be fine. Basically, i am looking for three columns that can be feed back to process the file.

so from following:

ES FP,SL22 (70000)|XX22 (80000)
VIL SJ,SL22 (20000)|XX22 (20000)
VIL SJ,SL22 (25040222)|XX22 (25040222)
MING NO,SL22 (240000)|XX22 (240000)
BCHN SE,SL22 (20000)|XX22 (20000)
BCHN SE,SL22 (100000)|XX22 (100000)
DLAR LN,SL22 (50000)|XX22 (50000)

it needs to be converted to:

ES FP,SL22 (70000)|XX22 (80000)
VIL SJ,SL22 (20000)|XX22 (20000),SL22 (25040222)|XX22 (25040222)
MING NO,SL22 (240000)|XX22 (240000)
BCHN SE,SL22 (20000)|XX22 (20000),SL22 (100000)|XX22 (100000)
DLAR LN,SL22 (50000)|XX22 (50000)

---------- Post updated at 02:32 PM ---------- Previous update was at 12:50 PM ----------

lines can be anywhere in the file and it can be more than more two lines.

---------- Post updated at 02:36 PM ---------- Previous update was at 02:32 PM ----------

also i sorted so it is always consecutive.

---------- Post updated at 04:08 PM ---------- Previous update was at 02:36 PM ----------

Any thoughts , thanks in advance.

$2==$2 means to compare the second column to next second column right? if so why you have used

$2==$2

as they are not equal when $1==$1 right?

1 Like

Actually I do not understand why $1==$1 && $2==$2; is there. IMO it has no function and could just be left out.

--
This should be equivalent to the suggestion in post #2:

awk '{p=$NF; getline; print $1, $4, p "|" $NF}' FS=, OFS=, file
2 Likes

Thanks for explanation, was looking around as well why $2=$2. Also is it possible to sort further and have uniq $1 and append $4 based on different value of $2. i,e in second/third line, $2 changes from B to S so append $3 of line 3 to $2 with comma.

so from following :

ES FP,B,SL22 (70000)|XX22 (80000)
VIL SJ,B,SL22 (20000)|XX22 (20000)
VIL SJ,S,SL22 (25040222)|XX22 (25040222)
MING NO,B,SL22 (240000)|XX22 (240000)
BCHN SE,B,SL22 (20000)|XX22 (20000)
BCHN SE,S,SL22 (100000)|XX22 (100000)
DLAR LN,S,SL22 (50000)|XX22 (50000)

become:

ES FP,SL22 (70000)|XX22 (80000)
VIL SJ,SL22 (20000)|XX22 (20000),SL22 (25040222)|XX22 (25040222)
MING NO,SL22 (240000)|XX22 (240000)
BCHN SE,SL22 (20000)|XX22 (20000),SL22 (100000)|XX22 (100000)
DLAR LN,SL22 (50000)|XX22 (50000)

any thoughts ?