sed: How to modify files in a complex way

pinkypunky · December 4, 2009, 8:58am

Hello,

I am new to sed and hope that someone can help me with the following task.

I need to modify a text file that has a format like this:

xy=CreateDB|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut
abc|source|divine|line4|5|true

into something like:

head.queue=abc
head.source=source
head.definition=divine
rtf.edit=line4
rtf.task=5
rft.cut=true

Can someone help me with it?

quirkasaurus · December 4, 2009, 9:55am

hope you don't mind a ksh solution:

num=0
sed -e 's/|/ /g' << EOF |
xy=CreateDB|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut
abc|source|divine|line4|5|true
EOF
while read crud ; do

  (( num +=1 ))

  if [[ $num -eq 1 ]]; then
    set -A col `echo $crud`
    continue
  fi

  ### set -A data `echo $crud` ### this second array not necessary.

  a=1

  ### for datum in `echo ${data[*]}` ; do
  for datum in $crud ; do

    echo ${col[$a]}=$datum
    (( a += 1 ))

  done

done 2>&1 |
  tee $0.log

works for a dynamic number of columns.
Did you notice that your first column is extra?

The only problem I see with this is if you have embedded spaces elsewhere in your data.
In that case, you may have to toggle your IFS variable.

IFS="|"

IFS=" "

depending on where you're at in the script.
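A minimal sketch of the IFS idea, assuming a POSIX-style shell (sh, ksh or bash). The function name split_pairs is made up for illustration. IFS is set to "|" only inside a subshell, so values containing spaces survive and the rest of the script keeps its normal IFS:

```shell
# split_pairs is a hypothetical helper, not part of the script above.
split_pairs() {
  IFS= read -r header                # first line: field names, kept verbatim
  while IFS= read -r line; do
    (
      set -f                         # no filename globbing while splitting
      IFS='|'                        # split on pipes only, so spaces survive
      set -- $header                 # $1 is the extra xy=CreateDB field
      shift                          # drop it
      for value in $line; do
        printf '%s=%s\n' "$1" "$value"
        shift                        # move on to the next field name
      done
    )
  done
}

printf '%s\n' 'xy=CreateDB|head.queue|head.source' 'abc|two words' | split_pairs
```

One caveat with this approach: a trailing empty field (a line ending in "|") is dropped by the shell's field splitting, so it suits data whose last column is always filled.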

pinkypunky · December 4, 2009, 10:00am

Hello quirkasaurus,

Thank you for the code, but what I would really like to know is how to rearrange the content with a sed or awk solution. Is that possible? (I really want to learn it.)

quirkasaurus · December 4, 2009, 10:15am

oy...

yes it is... with awk, not sed.

Even easier, sorry, i default to ksh:

cat << EOF |
xy=CreateDB|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut
abc|source|divine|line4|5|true
EOF

nawk -F\| '{
  if ( NR == 1 ){
    split( $0, headers );
    next;
    }

  for ( x = 1; x <= NF; x++ ){
    print headers[x] "=" $x;
    }
  }'

hmmm... still a problem with that extra first field.
Anyways, here's the fix for that:

cat << EOF |
xy=CreateDB|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut
abc|source|divine|line4|5|true
EOF

nawk -F\| '{
  if ( NR == 1 ){
    for ( x = 2; x <= NF; x++ ){
      headers[x-1]=$x;
      }
    next;
    }

  for ( x = 1; x <= NF; x++ ){
    print headers[x] "=" $x;
    }
  }'

pinkypunky · December 4, 2009, 10:30am

Hello quirkasaurus,

Thank you for your help. I have a general question: why is awk more suitable than sed in this case? Is it because sed is more for modifying data, whereas in my case the data is being "rearranged" rather than "modified", so awk fits better?

quirkasaurus · December 4, 2009, 2:07pm

sed is a "stream-editor" and afaik it does not have the ability to remember variables and print them later.

awk -- is a fully functional programming language, containing complete flow control, arrays, hash arrays, regular expressions, system calls, multiple file input and output, etc...

sed is fine for relatively straightforward inline modifications.

awk is more suitable for problems that require logic and flow control and greater complexity.

binlib · December 4, 2009, 8:56pm

Sed is Turing-complete. You can do everything in sed.

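sed '
# drop the extra first header field, then append the data line
s/^[^|]*|//
N
:o
# pair the first remaining name with the first remaining value
s/\([^|]*\)|\(.*\n\)\([^|]*\)|/\1=\3\n\2/
t p
# no "|" left: join the last name and value, then finish
s/\n/=/
b
:p
P
s/^[^\n]*\n//
t o
'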
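On the question of whether sed can remember anything: besides the pattern space it has a second buffer, the hold space, filled with h and read back with G. A small sketch, assuming the first input line should be remembered and attached to every later line (the function name remember_header and the " <- " separator are made up for illustration):

```shell
# remember_header is a hypothetical name used only for this sketch.
remember_header() {
  sed -n '
    1{
      h
      d
    }
    G
    s/\n/ <- /
    p
  '
}

printf '%s\n' 'names' 'abc' 'def' | remember_header
```

Line 1 is copied into the hold space and discarded; every later line gets the held text appended after a newline, which the s command then turns into a visible separator.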

cfajohnson · December 4, 2009, 11:49pm

That example demonstrates why awk is more suitable for anything more than simple selection and replacement.

pinkypunky · December 7, 2009, 10:42am

Hello,

Thank you for the code. Now I need a bit more (sorry, I am really new to awk).

I need two different scripts. The first should rearrange data like this:

xy=CreateDB|head.queue|head.source|head.definition|rtf.edit
abc|source|divine|line4
def|source||line2

into:

head.queue=abc
head.source=source
head.definition=divine
rtf.edit=line4

head.queue=def
head.source=source
head.definition=
rtf.edit=line2

quirkasaurus has helped me with this, but how can I add a blank line between the different sets of data values?

I also need the reverse of this, from:

head.queue=abc
head.source=source
head.definition=divine
rtf.edit=line4

head.queue=def
head.source=source
head.definition=
rtf.edit=line2

to:

xy=CreateDB|head.queue|head.source|head.definition|rtf.edit
abc|source|divine|line4
def|source||line2

Thank you for your help!

ahmad.diab · December 7, 2009, 11:23am

To add the extra blank line, just add the print "" line (after the for loop) to the same nawk script:

nawk -F\| '{
  if ( NR == 1 ){
    for ( x = 2; x <= NF; x++ ){
      headers[x-1]=$x;
      }
    next;
    }

  for ( x = 1; x <= NF; x++ ){
    print headers[x] "=" $x;
    }
  print ""    # added: blank line after each record
  }'

To reverse the action, use the nawk below:

nawk -F"\=" '
/head.queue/{c=5 ; b="xy=CreateDB""|"$1 ; s=$2 ; next}
c--{b=b"|"$1 ; s=s"|"$2}
!c--{print b,"\n",s ; b="" ;s=""}
' file.txt | nawk '!a[$0]++'

:D:D:D

pinkypunky · December 7, 2009, 1:32pm

Hello Ahmad,

Thanks for the help, but the second script is not working properly.

When I ran it on this input file:

head.queue=abc
head.source=source
head.definition=divine
rtf.edit=line4
rtf.task=5
rft.cut=true
rft.abc=
htc.lom=

head.queue=def
head.source=source
head.definition=divine
rtf.edit=line3
rtf.task=
rft.cut=true
rft.abc=
htc.lom=

Not all of the data was reversed, and there is a space at the front of the data row.

Can you help out?

Thank you in advance

rdcwayx · December 7, 2009, 6:44pm

Others have given the solution. My question is,

Why put the file name xy=CreateDB in the first line? Can it be separated onto a line of its own? With that, the code would be simpler and more efficient.

pinkypunky · December 8, 2009, 4:23am

Hello rdcwayx

I do need the same format as I posted. The xy=CreateDB has to be in the same line with the |head.queue|head.source|head.definition|....

Can anyone help out?

ahmad.diab · December 8, 2009, 6:02am

In your first post there were only 4 rows per paragraph, but now you have 8.
Kindly modify my code as below:

nawk -F"\=" '
/head.queue/{c=8 ; b="xy=CreateDB""|"$1 ; s=$2 ; next}
--c{b=b"|"$1 ; s=s"|"$2}
!c{print b,"\n",s ; b="" ;s=""}
' file.txt | nawk '!a[$0]++'

where c = the number of rows in each paragraph

pinkypunky · December 8, 2009, 7:10am

Hello Ahmad,

That is the point: I need more generic code, because the number of rows will vary and so will the field names like |head.queue|... They are not fixed; they can be anything.

ahmad.diab · December 8, 2009, 7:33am

The code below is for the general case:

nawk -F"\=" '
/^[^ *$]/{b=b"|"$1 ; s=s"|"$2}                        
/^[ ]*$/{print b,"\n",s ; b="" ;s=""}
' file.txt

:D:D:D:D

pinkypunky · December 8, 2009, 7:46am

Hello Ahmad,

Thank you so much for the help.

Can you help me out with this, now I got the result as:

|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut|rft.abc|htc.lom
|abc|source|divine|line4|5|true||
|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut|rft.abc|htc.lom
|def|source|divine|line3||true||

Can I have something like :
|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut|rft.abc|htc.lom
abc|source|divine|line4|5|true||

Thank you very much

ahmad.diab · December 8, 2009, 7:59am

You just said that the field names will differ, as above, so why do you want to eliminate them?
You also said that the paragraphs are dynamic!

If you can guarantee that the row names (head.queue and so on) are the same in every paragraph, you can get your desired output by adding the below to the code provided to you:

nawk -F"\=" '
/^[^ *$]/{b=b"|"$1 ; s=s"|"$2}
/^[ ]*$/{print b,"\n",s ; b="" ;s=""}
' file.txt | nawk -F"|" '!a[$1$2]++'

!a[$1$2]++ is just an example; you can use whichever field numbers identify the repeated line in each paragraph.

:D:D:b:
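A short sketch of how the !a[...]++ idiom works, using plain awk and a made-up wrapper name (dedupe). The array counts how often each key has been seen; the post-increment returns the old count, so the pattern is true (and the line is printed) only the first time a key appears:

```shell
# dedupe is a hypothetical wrapper; the key here is the whole line ($0).
dedupe() {
  awk '!seen[$0]++'
}

printf '%s\n' '|head.queue|head.source' '|abc|source' '|head.queue|head.source' '|def|source' | dedupe
```

With -F"|" and lines that start with "|", $1 is the empty string, so a key of $1$2 is effectively just the second field. That is enough to filter the repeated header line as long as no data row starts with the same value.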

pinkypunky · December 8, 2009, 9:27am

Oops... so sorry

What I mean is, instead of this:

|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut|rft.abc|htc.lom
|abc|source|divine|line4|5|true||
|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut|rft.abc|htc.lom
|def|source|divine|line3||true||

Can I have something like :
|head.queue|head.source|head.definition|rtf.edit|rtf.task|rft.cut|rft.abc|htc.lom
abc|source|divine|line4|5|true||
def|source|divine|line3||true||

That is, without the |head.queue|... line being repeated for each data row.

Sorry that I didn't make it clear.

ahmad.diab · December 8, 2009, 9:32am

I already gave you the code in the post above :D:D:D

nawk -F"\=" '
/^[^ *$]/{b=b"|"$1 ; s=s"|"$2}
/^[ ]*$/{print b,"\n",s ; b="" ;s=""}
' file.txt | nawk -F"|" '!a[$1$2]++'
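For comparison, a single-pass sketch of the reverse conversion that prints the header once, keeps the xy=CreateDB prefix and avoids the leading "|" on data rows. It uses plain awk rather than nawk, and the names to_table, prefix and emit are made up for illustration. It assumes every paragraph lists the same field names in the same order, since the header is taken from the first paragraph only:

```shell
# to_table is a hypothetical helper; prefix is passed in as an assumption.
to_table() {
  awk -v prefix='xy=CreateDB' '
    # Print the collected paragraph; the header line goes out only once.
    function emit() {
      if (n == 0) return
      if (!header_done) { print prefix names; header_done = 1 }
      print substr(values, 2)          # drop the leading "|"
      names = values = ""; n = 0
    }
    /^[ \t]*$/ { emit(); next }        # a blank line ends a paragraph
    {
      eq = index($0, "=")              # split on the first "=" only
      names  = names  "|" substr($0, 1, eq - 1)
      values = values "|" substr($0, eq + 1)
      n++
    }
    END { emit() }                     # last paragraph may lack a blank line
  '
}
```

Usage would be along the lines of to_table < file.txt. Because the END block flushes the final paragraph, the input does not need a trailing blank line.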