concatenate lines using shell scripting

dtdt · September 6, 2009, 1:47pm

i have a mega file in this format:

a,
b,
c,
d,
=
a2,
b2,
c2,
d2,
=
a3,
b3

i want to combine the lines until a "=" line is met. the result should be:
a,b,c,d,
a2,b2,c2,d2
a3,b3

need your help. thanks

bakunin · September 6, 2009, 1:56pm

You can of course do this via shell scripting, but using sed would be a faster and by far clearer way to do it.

Why do you want to do it with a shell script?

bakunin

dtdt · September 6, 2009, 2:11pm

sorry, i didn't mean that. sed or awk are all good.

---------- Post updated at 01:11 PM ---------- Previous update was at 12:58 PM ----------

would you please show me how to do it in sed or awk? thx

danmero · September 6, 2009, 2:27pm

awk '/=/{$0="\n"}1' ORS="" file
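In case the one-liner is hard to read, here is a rough sketch of the same idea written out. The function name join_on_equals and the sample input are made up for illustration. Note that /=/ matches any line containing "=", not only lines consisting of "=" alone.

```shell
# join_on_equals: hypothetical helper; reads stdin, writes the joined records.
join_on_equals() {
    awk 'BEGIN { ORS = "" }      # print appends nothing after each line
         /=/   { $0 = "\n" }     # a separator line is replaced by a bare newline
               { print }'        # print every line (replaced or not)
}

# Sample input in the format from the first post.
printf 'a,\nb,\n=\na2,\nb2,\n=\na3,\nb3\n' | join_on_equals
echo    # the last record is not followed by a newline, so add one for the terminal
```

Because ORS is empty, the only newlines in the output are the ones that replace the "=" lines.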

protocomm · September 6, 2009, 2:28pm

perhaps.....but certainly better....

awk '{if($0 !~ /=/) ORS=""}{if($0 ~/=/) ORS="\n"} {print}' file | sed -e 's/=//g' -e 's/,$//g'

bakunin · September 6, 2009, 2:44pm

in sed it is easy: we have two types of lines, the ones reading "=" and the others. When we encounter a "="-line, we want to print out what we have so far, minus the newlines. If we encounter one of the other lines we want to store its contents until we encounter a "="-line.

sed has a so-called "hold space", think of it as a variable, where you can store things until you need them. We append everything to this hold space until we encounter a "="-line, then we recall the hold space, filter out all embedded newlines and print it, then start over.

In the following script I have put in comments for your understanding. Remove them if your sed rejects them: GNU sed accepts a comment after a command, but some sed implementations do not. Furthermore, you can put the whole script on one line, replacing linefeeds with semicolons:

sed -n '/^=/ {                   # if a line starts with "="
          s/.*//                 # delete the content of this line
          x                      # exchange the pattern space (empty) and the hold space
          s/\n//g                # delete newlines
          p                      # then print what you have
          d                      # end this cycle, so the H below does not store the printed text again
     }
     /^=/ ! {                    # if a line does not start with "="
          H                      # append it to the hold space
     }' /your/file > newfile

for short:

sed -n '/^=/{s/.*//;x;s/\n//gp;d};/^=/!{H}' /your/file > newfile

I hope this helps.

bakunin
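For comparison, a minimal awk sketch of the same store-and-flush idea. It is not bakunin's script: the function name join_records and the variable names are invented here, and it additionally prints a last group that has no closing "=" line, which a script that only prints on "=" lines would leave out.

```shell
# join_records: hypothetical helper; reads stdin, writes one line per record.
join_records() {
    awk '/^=/ { emit(); next }            # separator: print what was collected
              { buf = buf $0; have = 1 }  # other lines: append to the buffer
         END  { emit() }                  # print a final group without closing "="
         function emit() {
             if (have) print buf
             buf = ""; have = 0
         }'
}

printf 'a,\nb,\n=\na2,\nb2,\n=\na3,\nb3\n' | join_records
```

The buffer plays the role of the hold space: it collects lines until a separator (or the end of the input) flushes it.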

danmero · September 6, 2009, 2:45pm

Did you check the OP's requirements? Certainly not.

dtdt · September 6, 2009, 2:51pm

geeez, why it doesn't work on my file.

awk '/=/{$0="\n"}1' ORS="" $DSTDIR/ttt > $DSTDIR/ttt2

both ttt, ttt2 are still the same. what did i do wrong?

---------- Post updated at 01:51 PM ---------- Previous update was at 01:46 PM ----------

bakunin's sed script doesn't do the trick for me either. it repeats paragraph after paragraph instead of concatenating the lines.


bakunin · September 6, 2009, 2:52pm

I just noticed that in your example text the trailing commas were also stripped off. To achieve this I would like to add a line to my script; its purpose should be evident:

sed -n '/^=/ {                   # if a line starts with "="
          s/.*//                 # delete the content of this line
          x                      # exchange the pattern space (empty) and the hold space
          s/\n//g                # delete newlines
          s/[,;]$//              # strip one trailing comma or semicolon
          p                      # then print what you have
          d                      # end this cycle, so the H below does not store the printed text again
     }
     /^=/ ! {                    # if a line does not start with "="
          H                      # append it to the hold space
     }' /your/file > newfile

I hope this helps.

bakunin

dtdt · September 6, 2009, 10:32pm

i guess i have some problem with file on unix/windows. i have to go out now. will check later when i come back. so far, no good.

thanks for all ur help

---------- Post updated at 10:32 PM ---------- Previous update was at 02:57 PM ----------

it's so freaking me out right now.

only protocomm's solution partially works. that means i can see it worked in windows word, but not in notepad. when i open the output file on the linux server, it's not working either, same as in notepad.

i have no idea what's happening. any suggestions?

thanks all

bakunin · September 6, 2009, 11:11pm

Since you don't show us any real example of what your file looks like we are left guessing. Therefore the following is, again, a guess (btw.: my code worked perfectly on AIX, Solaris and Ubuntu Linux - more Unix/Linux dialects I don't have at hand, but probably nothing will change the result).

Text files are different in Windows/DOS and UNIX. The reason is that line ends are encoded differently. In UNIX dialects a line end is encoded in a single character, a LF (line feed, control-J, displayed as "^J"). In DOS/Windows a line end is encoded in two characters, a CR (carriage return, control-M, displayed as "^M") followed by a LF.

Create a file under UNIX (issue "cat > unixfile", then start typing, press "control-D" to end the input) and get a file created under Windows. Now issue "od -ax dosfile | more" and "od -ax unixfile | more" in two windows and observe the difference.
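If no Windows file is at hand, the difference can be reproduced with printf. This is only a sketch: the sample text is made up and count_cr_lines is a hypothetical helper name, not a standard command.

```shell
# Dump a Unix-style and a DOS-style sample character by character.
printf 'a,\n=\n'     | od -c    # each line ends in \n only
printf 'a,\r\n=\r\n' | od -c    # each line ends in \r \n

# count_cr_lines: hypothetical helper; counts lines ending in a carriage return.
# Reads the named files, or stdin when called without arguments.
count_cr_lines() {
    grep -c "$(printf '\r')\$" "$@"
}

# A non-zero count suggests DOS/Windows line ends.
printf 'a,\r\n=\r\n' | count_cr_lines
```

A line that looks like "=" but really is "=" plus CR no longer matches a pattern such as /^=$/, and the leftover CR characters make joined lines display strangely.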

If this is the case with your file, try "ASCII mode" when you transfer it via ftp from DOS to UNIX and vice versa. ASCII mode (as opposed to "binary" mode) converts the line ends during the transfer.

I hope this helps.

bakunin

dtdt · September 6, 2009, 11:26pm

would you please PM your email? I'd like to send you the file as an attachment. this is the first time i have met this and i am completely lost.

durden_tyler · September 7, 2009, 12:00am

One way to do it in Perl:

$ 
$ cat f2
a,
b,
c,
d,
=
a2,
b2,
c2,
d2,
=
a3,
b3
=
$ 
$ perl -ne 'chomp; $x.=$_; END{$x=~s/[,]*=/\n/g; print $x}' f2
a,b,c,d
a2,b2,c2,d2
a3,b3
$ 
$

If you use vi or vim to open your file in Unix/Linux and see "^M" characters at the end of each line, then that could be one of your problems.

tyler_durden

dtdt · September 7, 2009, 12:14am

thanks all. the f* problem was the format of the file. it's really strange: i got the file by parsing a web page. i have used the program hundreds of times, and this is the first time i have had this issue. after i ran a dos2unix conversion, everything works! thanks a lot.
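Where dos2unix is not installed, stripping the carriage returns with tr before joining should have a similar effect. A minimal sketch, with made-up function names and sample data. The joining step is danmero's awk one-liner, with ORS passed via -v:

```shell
# strip_cr: hypothetical stand-in for dos2unix; deletes every carriage return.
strip_cr() { tr -d '\r'; }

# join_on_equals: the awk one-liner from earlier in the thread.
join_on_equals() { awk -v ORS="" '/=/{$0="\n"}1'; }

# DOS-style sample input: normalise the line ends first, then join.
printf 'a,\r\nb,\r\n=\r\na2,\r\nb2\r\n' | strip_cr | join_on_equals
echo    # final newline for the terminal
```

Unlike dos2unix, tr -d '\r' removes every CR in the file, not only those at the end of a line; for plain text like this that makes no difference.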
