bash - joining lines in a file

Cultcha · March 1, 2011, 4:48am

I'm writing a bash shell script and I want to join lines together where two variables on each line are the same, i.e.:

12345variablestuff43212morevariablestuff
12345variablestuff43212morevariablestuff
34657variablestuff78945morevariablestuff
34657variablestuff78945morevariablestuff
98789variablestuff48975morevariablestuff

I want the file to end up looking like this:

12345variablestuff43212morevariablestuff12345variablestuff43212morevariablestuff
34657variablestuff78945morevariablestuff34657variablestuff78945morevariablestuff
98789variablestuff48975morevariablestuff

There can be any number of matching rows (16 at most at the moment), as there seems to be a lot of duplication, but I only need to join the first 4 rows as they appear in the file. (For now I have deleted the duplicates, but it seems that the duplication is actually necessary in rare cases.)
Thanks so much in advance for any help with this!

michaelrozar17 · March 1, 2011, 6:15am

With sed..

sed '/^[0-9]\+/{N;s/\(^[0-9]\+.*\)\n\(\1.*\)/\1\2/}' inputfile > outfile

Cultcha · March 1, 2011, 6:56am

Thanks for your response.

This hasn't done the trick for me though. The file that I'm working on is fixed length and contains both text and numerics throughout. The variables that I am matching on have 7 digits followed by 1 or 2 characters; if there is no second character, a space takes its place.

Many thanks again in advance!

michaelrozar17 · March 1, 2011, 7:07am

I was guessing that the file you posted is not the real file you are working on. I have modified the command to match only the digits at the beginning of each line, comparing lines 1 and 2, 3 and 4, and so on. If this does not work, post sample data from the real file.

sed '/^[0-9]\+/{N;s/\(^[0-9]\+\)\(.*\)\n\(\1.*\)/\1\2\3/}'  inputfile > outfile

Cultcha · March 1, 2011, 8:43am

That's done the trick!!

Thanks a million!!

---------- Post updated at 08:43 AM ---------- Previous update was at 07:16 AM ----------

Sorry to come back to you on this again.

It seems that not all rows have joined appropriately. Here's the input:

 
1234567DR$MYSELF              $ANDI                $9876543H $          12346.87$     497.52$     123.56$XY$ $12
1234567DR$MYSELF              $ANDI                $9876543H $              0.00$       0.00$       0.00$TR$ $10

LinuxLearner · March 1, 2011, 9:35am

awk '{printf "%s", $0}' file_name > out_file

Cultcha · March 1, 2011, 9:43am

Hi, thanks for the response, but there are two variables that should match before joining the lines.

The first and the fourth columns have to match.

The matching values change after one, two, three or more lines throughout the file.

michaelrozar17 · March 2, 2011, 12:33am

cultcha:

Sorry to come back to you on this again.

It seems that not all rows have joined appropriately. Here's the input:
 
1234567DR$MYSELF              $ANDI                $9876543H $          12346.87$     497.52$     123.56$XY$ $12
1234567DR$MYSELF              $ANDI                $9876543H $              0.00$       0.00$       0.00$TR$ $10

Did you check the line count after running the sed command? I guess that because the 2 lines posted above are so long, it only looks as if they were not joined. Please check again.

Cultcha · March 2, 2011, 5:31am

From my test file, I should have 19 lines after the merge. I have 23. Maybe it would be best to use awk to replace the newlines based on the two conditions?

michaelrozar17 · March 2, 2011, 6:15am

If the conditions are complex then awk would do the job better. Post a sample of your real data and state your requirements precisely.
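
A possible reason for the missing joins (an assumption, not confirmed in this thread): N consumes lines in fixed pairs (1+2, 3+4, ...), so a record whose rows straddle a pair boundary is never compared. A minimal sketch, assuming GNU sed (for `\+`); the name pair_join and the four-line sample are made up for illustration:

```shell
# pair_join: hypothetical wrapper around the second sed command from this thread.
pair_join() {
  sed '/^[0-9]\+/{N;s/\(^[0-9]\+\)\(.*\)\n\(\1.*\)/\1\2\3/}' "$1"
}

# Made-up sample: the two 222 rows are lines 2 and 3, i.e. in different pairs.
demo=$(mktemp)
printf '%s\n' 111a 222b 222c 333d > "$demo"

# sed compares 111a with 222b, then 222c with 333d; 222b and 222c never meet.
pair_join "$demo"
rm -f "$demo"
```

With this sample the output still has four lines, although 222b and 222c share the same leading digits.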

Cultcha · March 2, 2011, 9:27am

Hi again, thanks for your help with this!

This file contains records matched on column 1 and column 4.

There can be up to four rows for each record, and these rows appear sequentially in the file. However, duplicate rows can also appear in the file, so I would like to join only up to four rows and then delete duplicates after the join (occasionally 2 or 3 of the four rows are valid duplicates of each other).

Where column 1 and column 4 are the same, I want to join the lines together (really I just want to add the data that differs, i.e. columns 8 and 10, to the end of the first row).

 
1234567D $J                  $P                  $1234567N $          26575.00$   1670.89$   4527.81$XY$ $ 1$
1234567D $J                  $P                  $1234567N $              0.00$      0.00$      0.00$SY$ $51$
2456789B $P                  $T                  $8888888U $           5577.00$    157.00$    756.00$XY$ $13$
2456789B $P                  $T                  $9999999B $          30938.00$   1916.36$   5223.85$SY$ $25$
3333333I $L                  $G                  $1111111H $          39068.00$   2609.11$   6808.98$SY$ $52$
4444444F $GE                 $WI                 $6656656G $          21850.00$    683.13$   3032.06$XY$ $50$
4444444F $GE                 $WI                 $6656656G $              0.00$      0.00$      0.00$SY$ $ 2$
5555555H $J                  $BU                 $4545454D $          46698.00$   3159.91$   8179.99$SY$ $52$
6666666J $FR                 $CU                 $6372232D $          13448.00$     25.44$   1180.20$GP$ $51$
6666666J $FR                 $CU                 $6372232D $              0.00$      0.00$      0.00$SY$ $ 1$
7777777U $CH                 $TO                 $4444444P $           8667.00$    624.42$   1556.19$SY$ $ 9$
7777777U $CH                 $TO                 $4444444P $              0.00$      0.00$      0.00$XY$ $ 1$
1234112P $JI                 $TI                 $4582809N $          23117.00$      0.00$      0.00$R $ $52$
1234112P $JI                 $TI                 $9508160H $          14243.00$    768.00$   2299.00$SY$ $17$
4445566P $RI                 $B                  $4521440T $           7788.00$      0.00$      0.00$X1$ $52$
4445566P $RI                 $B                  $4567892P $          17234.00$    497.00$   2279.00$XY$ $26$
4445566P $RI                 $B                  $4567892P $              0.00$      0.00$      0.00$GP$ $17$
4445566P $RI                 $B                  $4567892P $              0.00$      0.00$      0.00$HU$ $ 5$
4445566P $RI                 $B                  $4567892P $              0.00$      0.00$      0.00$SY$ $ 4$
3334455E $G                  $GR                 $1239875F $          22149.00$    680.78$   3061.65$SY$ $ 2$
3334455E $G                  $GR                 $1239875F $              0.00$      0.00$      0.00$XY$ $50$
0000001V $RI                 $RA                 $1212121R $           1400.00$      0.00$      0.00$R $ $52$
0000001V $RI                 $RA                 $4556455F $          15384.00$    734.61$   2257.14$SY$ $16$
0000001V $RI                 $RA                 $4556455F $              0.00$      0.00$      0.00$R $ $32$
0000001V $RI                 $RA                 $4556455F $              0.00$      0.00$      0.00$XY$ $ 4$
0000001V $RI                 $RA                 $4556455F $              0.00$      0.00$      0.00$X1$ $ 2$
2222222N $K                  $G                  $7878787R $          35340.00$   2327.12$   6126.20$SY$ $52$
2222222N $K                  $G                  $6205125L $          37110.00$   1237.00$   1237.00$X1$ $52$
6565656S $PR                 $GU                 $5645564D $          60000.00$   4135.32$  10585.32$SY$ $52$
6565656S $PR                 $GU                 $9595959F $           2848.00$      0.00$      0.00$R $ $45$

This is a test file that I'm using. Let me know if you need any other info!

pravin27 · March 3, 2011, 1:24am

Is this what you are looking for?

awk -F"\$" '{if(!a[$1$4]){a[$1$4]=$0;printf "%s%s",(NR==1?"":"\n"),$0}else{printf "%s%s%s",$8,FS,$10}}' infile
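
For readability, here is the same idea spread over several lines. It is only a sketch under assumptions: join_rows is a made-up name, the separator is written as `[$]` to make the literal dollar sign explicit, and a final newline is added at the end of the output. Like the one-liner, it appends to whichever line is currently open, so it relies on matching rows being adjacent in the file.

```shell
# join_rows: hypothetical helper; reads "$"-delimited rows from the file in $1.
join_rows() {
  awk -F'[$]' '
    {
      key = $1 "$" $4                  # match on columns 1 and 4
      if (key in seen) {
        # Later row of a known record: append columns 8 and 10 to the open line.
        printf "%s$%s", $8, $10
      } else {
        # First row of a record: close the previous line, then print the row whole.
        seen[key] = 1
        printf "%s%s", (NR == 1 ? "" : "\n"), $0
      }
    }
    END { if (NR) print "" }           # terminate the last output line
  ' "$1"
}
```

It would be called as: join_rows infile > outfile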

Cultcha · March 3, 2011, 6:44am

Thanks so much! This was exactly what I was looking for.

Just in case this is useful to anyone else at any stage, I then replaced the ^M (carriage return) characters in the output with $ characters using:

sed 's/^M/$/g' file1 > file2

^M is typed as Ctrl+V followed by Enter.
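
If typing the control character is awkward (in a script, say), the carriage return can be generated with printf instead. A small sketch; cr_to_dollar is a made-up name, and file1/file2 are the same placeholder names as above:

```shell
# cr_to_dollar: hypothetical helper; replaces every carriage return with "$".
cr_to_dollar() {
  cr=$(printf '\r')            # a literal CR, no Ctrl+V needed
  sed "s/${cr}/\$/g" "$1"      # "\$" keeps the shell from expanding the dollar sign
}
```

Usage would be: cr_to_dollar file1 > file2. Another route that avoids sed altogether is tr '\r' '$' < file1 > file2.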

thanks again!!