sed or awk editing help

mychmose · October 29, 2010, 3:12pm

Hi all

I use AIX (sadly).

I've got a file consisting of fields separated by commas. I need a sed or awk command that will delete all spaces between two commas, as long as there are only spaces between the commas.

eg

,abc,    ,sd , ,dr at

would become

,abc,,sd ,,dr at

I have tried sed -e 's/, .*,/,,/g' but it does not work.

any ideas?

thanks in anticipation

DGPickett · October 29, 2010, 3:29pm

Leading and trailing space: fair game or verboten? This does just all-spaces:

 sed '
  s/,  *,/,,/g
  s/,  *,/,,/g
 '

Narrative: Substitute for a comma, then a space, then any number of spaces, then a comma: a comma comma. Repeat for adjacent fields of spaces (can be in a second piped sed for speed).
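
A minimal sketch of why the second substitution is needed, assuming a POSIX sed. The function names one_pass and two_pass and the sample line are made up for illustration:

```shell
# Hypothetical helpers wrapping the substitution from the post above.
one_pass() { sed 's/,  *,/,,/g'; }
two_pass() { sed -e 's/,  *,/,,/g' -e 's/,  *,/,,/g'; }

# Made-up sample with two adjacent space-only fields.
line='a,  ,   ,b'

printf '%s\n' "$line" | one_pass   # prints: a,,   ,b
printf '%s\n' "$line" | two_pass   # prints: a,,,b
```

The first pass consumes the comma that the neighbouring field would need as its opening comma, so that field is only caught on the second pass.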

This does leading and trailing, too:

sed '
  s/,  */,/g
  s/  *,/,/g
 '

Narrative: Substitute for a comma, then a space, then any number of spaces: a comma (trim leading spaces). Substitute for a space, then any number of spaces, then a comma: a comma (trim trailing spaces).
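
To see how this second variant goes beyond the original request, here it is applied to the sample line from the question. This is only a sketch assuming a POSIX sed; trim_around_commas is a hypothetical name:

```shell
# Hypothetical wrapper around the two trimming substitutions above.
trim_around_commas() { sed -e 's/,  */,/g' -e 's/  *,/,/g'; }

printf '%s\n' ',abc,    ,sd , ,dr at' | trim_around_commas   # prints: ,abc,,sd,,dr at
```

Note that the trailing space in the 'sd ' field is removed as well, whereas the output requested in the question keeps it.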

danmero · October 29, 2010, 3:41pm

# echo ',abc, ,sd,                 ,dr at' | sed 's/, */,/g'
,abc,,sd,,dr at

DGPickett · October 29, 2010, 3:44pm

You get leading spaces of fields with data, which is sometimes nice but outside this requirement.

durden_tyler · October 29, 2010, 11:38pm

If Perl is an option then -

$ echo "   ,abc,  def,ghi,   ,     ,  jk lm,   " | perl -plne 's/(^| *), */,/g'
,abc,def,ghi,,,jk lm,

tyler_durden

Scrutinizer · October 30, 2010, 1:58am

This takes out space-only fields at the beginning and end as well, while leaving all other spaces intact.

sed 's/\(^\|,\) *\(,\|$\)/\1\2/g'

GNU sed -r :

sed -r 's/(^|,) *(,|$)/\1\2/g'

ygemici · October 30, 2010, 4:10am

Additionally, if you have tabs as well as spaces, you can use

# sed 's/,[[:space:]]*,/,,/g' infile
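
A quick check of the character-class version with a tab inside a field. This is a sketch: blank_ws_fields is a hypothetical name, and printf is used only to produce a literal tab:

```shell
# Hypothetical wrapper around the [[:space:]] substitution above.
blank_ws_fields() { sed 's/,[[:space:]]*,/,,/g'; }

# The second field holds a tab followed by a space.
printf 'a,\t ,b c,d\n' | blank_ws_fields   # prints: a,,b c,d
```

As with any pattern that matches both commas, a single pass does not catch the second of two adjacent whitespace-only fields.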

ctsgnb · October 30, 2010, 3:01pm

sed 's: *, *:,:g' infile

DGPickett · October 30, 2010, 5:34pm

This misses alternate instances when space-only fields are adjacent, since the trailing comma of one match is also the opening comma of the next. You must do it twice.

---------- Post updated at 05:28 PM ---------- Previous update was at 05:27 PM ----------

Also kills leading and trailing spaces, more than was requested originally, but gets past the double run need of patterns with two commas.

---------- Post updated at 05:34 PM ---------- Previous update was at 05:28 PM ----------

scrutinizer:

This takes out space-only fields at the beginning and end as well, while leaving all other spaces intact.
sed 's/\(^\|,\) *\(,\|$\)/\1\2/g'
GNU sed -r :
sed -r 's/(^|,) *(,|$)/\1\2/g'

Does that work? Usually putting ^ or $ inside \(\|\) is no good.

You have to do it twice for ", , ,"!

None of us took care of the first or last field except your sed -r (but needs to run twice)! I would put '^' second in the (|) list, as less often true, for speed.

ygemici · October 30, 2010, 6:13pm

I don't think that is essential in this example,
and neither you nor I can know whether it is necessary.

Scrutinizer · October 31, 2010, 2:00am

Hi, I found out this works by trying it out. Good point about the ^ in the second part, and especially about the ', , ,' (or longer) situation; I hadn't thought of that. I also realized that looking for ' *' (zero or more spaces) is less efficient than '  *' (one or more spaces). So I think the best option is to run the substitution twice, not by piping but by using a loop, and leaving the g flag in (leaving it out is less efficient):

sed ':a;s/\(,\|^\)  *\(,\|$\)/\1\2/g;ta'

or GNU sed.

sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta'

or some older seds:

sed -e ':a' -e 's/\(,\|^\)  *\(,\|$\)/\1\2/g;ta'

of course every space can be replaced by [ \t] if the need arises.
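
For a sed that lacks the \| alternation in basic regular expressions (it is a GNU extension), the same loop idea can be sketched with POSIX constructs only, by handling the first and last field in separate substitutions. This is a variant under stated assumptions, not the command from the post: blank_fields is a hypothetical name, and unlike the (,|^)  *(,|$) pattern it leaves a line that contains no comma at all untouched:

```shell
# Hypothetical helper: empty the space-only first field, the space-only
# last field, then loop over the fields between commas until nothing changes.
blank_fields() {
  sed -e 's/^  *,/,/' \
      -e 's/,  *$/,/' \
      -e ':a' \
      -e 's/,  *,/,,/g' \
      -e 'ta'
}

printf '%s\n' '  ,  abc,   ,sd ,   ,  ,   ' | blank_fields   # prints: ,  abc,,sd ,,,
```

Fields that contain other characters, such as '  abc' and 'sd ', keep their spaces.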

ygemici · October 31, 2010, 7:06am

scrutinizer:

Hi, I found out this works by trying it out. Good point about the ^ in the second part, and especially about the ', , ,' (or longer) situation; I hadn't thought of that. I also realized that looking for ' *' (zero or more spaces) is less efficient than '  *' (one or more spaces). So I think the best option is to run the substitution twice, not by piping but by using a loop, and leaving the g flag in (leaving it out is less efficient):
sed ':a;s/\(,\|^\)  *\(,\|$\)/\1\2/g;ta'
or GNU sed.
sed -r ':a;s/(,|^)  *(,|$)/\1\2/;ta'
or some older seds:
sed -e ':a' -e 's/\(,\|^\)  *\(,\|$\)/\1\2/g;ta'
of course every space can be replaced by [ \t] if the need arises.

I am sorry, but I don't agree with some of the ideas from you and DGPickett.
What is happening there?

I say that if there are any other characters (anything except the empty string) between the matched patterns, then running the command twice makes no difference in this example,
because we are searching for spaces or tabs between two commas.

[root@rhnserver ~]# cat file1
 
 ,1234,
,123344,
1
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

[root@rhnserver ~]# sed 's/,[0-9]*,/CHANGE/g' file1
 
CHANGE
CHANGE
1
, ,
, ,
,BBB,
,BBB ,
, BBB ,
, BBB ,

[root@rhnserver ~]# sed 's/,[0-9][0-9]*,/CHANGE/g' file1
 
CHANGE
CHANGE
1
, ,
, ,
,BBB,
,BBB ,
, BBB ,
, BBB ,

Well, when we search for a single pattern, or if unwanted strings can also match (for example, fields containing just numbers), then we get wrong results:

[root@rhnserver ~]# cat file1
 
 
,1234,
,123344,
1
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

[root@rhnserver ~]# sed 's/[0-9][0-9]*/CHANGE/g' file1
 
 ,CHANGE,
,CHANGE,
CHANGE
,,
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

[root@rhnserver ~]# sed 's/[0-9]*/CHANGE/g' file1  --> wrong we must use twice for correct results
CHANGE
CHANGE CHANGE,CHANGE,CHANGE CHANGE
CHANGE,CHANGE,CHANGE
CHANGE
CHANGE,CHANGE,CHANGE
CHANGE,CHANGE CHANGE,CHANGE
CHANGE,CHANGE CHANGE CHANGE,CHANGE
CHANGE,CHANGEBCHANGEBCHANGEBCHANGE,CHANGE
CHANGE,CHANGEBCHANGEBCHANGEBCHANGE CHANGE,CHANGE
CHANGE,CHANGE CHANGEBCHANGEBCHANGEBCHANGE CHANGE,CHANGE
CHANGE,CHANGE CHANGE CHANGEBCHANGEBCHANGEBCHANGE CHANGE CHANGE,CHANGE
CHANGE

or

[root@rhnserver ~]# sed 's/,[0-9][0-9]*/CHANGE/g' file1
 
 CHANGE,
CHANGE,
1
,,
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

[root@rhnserver ~]# sed 's/,[0-9]*/CHANGE/g' file1
 CHANGECHANGE
CHANGECHANGE
1
CHANGECHANGE
CHANGE CHANGE
CHANGE  CHANGE
CHANGEBBBCHANGE
CHANGEBBB CHANGE
CHANGE BBB CHANGE
CHANGE  BBB  CHANGE

or

There is one case that behaves differently, because the pattern matches string + empty string + string,
and we don't want this:

[root@rhnserver ~]# cat file1
 
 ,1234,
,123344,
1
,,   ----> we don't want this
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

[root@rhnserver ~]# sed 's/,[0-9]*,/CHANGE/g' file1
 
 CHANGE
CHANGE
1
CHANGE  --> change
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

[root@rhnserver ~]# sed 's/,[0-9][0-9]*,/CHANGE/g' file1
 
 CHANGE
CHANGE
1
,,  --> no change
, ,
,  ,
,BBB,
,BBB ,
, BBB ,
,  BBB  ,

regards
ygemici

Scrutinizer · October 31, 2010, 7:14am

@ygemici, your input samples do not reflect what DGPickett and I were talking about. Compare this:

$ echo '  ,          abc,             ,sd ,      ,   ,     ' | sed 's/,[0-9]*,/CHANGE/g'
  ,          abc,             ,sd ,      ,   ,
$ echo '  ,          abc,             ,sd ,      ,   ,     ' | sed 's/,[0-9]*,/CHANGE/g'
  ,          abc,             ,sd ,      ,   ,

with this:

$  echo '  ,          abc,             ,sd ,      ,   ,     ' |sed -r ':a;s/(,|^)  *(,|$)/\1\2/;ta'
,          abc,,sd ,,,

Without the loop the solution would be incomplete and we would end up with this:

$ echo '  ,          abc,             ,sd ,      ,   ,     ' |sed -r 's/(,|^)  *(,|$)/\1\2/'
,          abc,             ,sd ,      ,   ,

Which is what DGPickett was talking about.

ygemici · October 31, 2010, 7:58am

I have already shown, with examples, when running it twice matters for the question asked here,
and I think the loop is unnecessary in this case, but of course do as you wish.

Scrutinizer · October 31, 2010, 8:27am

I do not understand what you are referring to? A loop or running sed twice is necessary if you want to do a replace in two adjacent fields. This is because sed starts to do a replace at the character after where it stopped the last time. If you include the second comma in the pattern then the field that follows is not matched because the first comma was matched the previous time.

A couple of suggestions leave out the second comma, but then all whitespace gets deleted even in fields where there are other characters, which is not what the OP was asking.

I added the possibility of doing the replace not only in fields between commas, but also for the two fields that have only one comma, namely the first and the last. This is not precisely what the OP asked, but likely what he requires, since he speaks of fields separated by commas.
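
Since the data is described as comma-separated fields, an awk sketch that tests each field on its own avoids the shared-comma problem altogether and also covers the first and last fields. This is not a command from the thread: blank_fields_awk is a hypothetical name, and a POSIX awk is assumed:

```shell
# Hypothetical helper: split on commas, empty every field that consists
# only of spaces, and rejoin the fields with commas.
blank_fields_awk() {
  awk 'BEGIN { FS = OFS = "," }
       {
         # A field made up of one or more spaces and nothing else is emptied.
         for (i = 1; i <= NF; i++) if ($i ~ /^ +$/) $i = ""
         print
       }'
}

printf '%s\n' '  ,  abc,   ,  sd   ,   ,  ,   ' | blank_fields_awk   # prints: ,  abc,,  sd   ,,,
```

Because each field is examined separately, no second pass or loop is needed, and the '  sd   ' field stays intact.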

S.

ygemici · October 31, 2010, 9:35am

I don't understand you either.
I have already said when running it twice is necessary and when it is not,
and I explained with examples after DGPickett said that running twice is mandatory in this example. My answer was addressed to him on that point. I hope this is enough.

Now let's come to your examples.
I said the loop is not necessary.

with global flag

echo '  ,          abc,             ,sd ,      ,   ,     ' |sed -r 's/(,|^)  *(,|$)/\1\2/g'
,          abc,,sd ,,   ,

with loop

echo '  ,          abc,             ,sd ,      ,   ,     ' |sed -r ':a;s/(,|^)  *(,|$)/\1\2/;ta'
,          abc,,sd ,,,

and solution non-loop

echo '  ,          abc,             ,sd ,      ,   ,     ' |sed -r 's/(,|^)   *|  (,|$)/\1\2/g'
,          abc,,sd ,,,

Regards
ygemici-sed lover team

Scrutinizer · October 31, 2010, 9:54am

Your suggestion, take a look at the sd field, it does not stay intact.

echo '  ,          abc,             ,  sd   ,      ,   ,     ' | sed -r 's/(,|^)   *|  (,|$)/\1\2/g'
,          abc,,sd ,    , ,

With the loop:

$  echo '  ,          abc,             ,  sd   ,      ,   ,     ' | sed -r ':a;s/(,|^)  *(,|$)/\1\2/;ta'
,          abc,,  sd   ,,,

ygemici · October 31, 2010, 3:06pm

A non-loop solution is always the best way for me, but of course do as you wish.

with loop

 time sed -r ':a;s/(,|^)  *(,|$)/\1\2/;ta' testfile

real    0m4.116s
user    0m0.596s
sys     0m0.166s

with non-loop

 time sed -r 's/(  *,|,|^)  *(,|$)/\1\2/g;s/,  *,/,,/g' testfile

real    0m1.773s
user    0m0.915s
sys     0m0.110s

Scrutinizer · October 31, 2010, 7:13pm

It was not a matter of which method is faster, but of which method actually works. Have a look at the 4th field, which contains spaces and the letters sd, and at what comes after it in post #17. That field should remain unchanged, but it does not, so IMO your original "non-loop" suggestion is not working properly. This needs to be done either by running sed twice or through a loop like I suggested.

This time you are leaving your single step approach and you are presenting yet another alternative, with two search and replace statements (which you said earlier would not be necessary), which is like a loop with the use of the g flag.

One thing I noticed just now is that in post #11 the g flag was used in the first and last commands, but it was accidentally missing from the sed -r command in the middle. It is a little more efficient to leave it in, as I noted in that same post. When we add the g flag as was intended:

sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta'

There is no significant difference on my system.

---------- Post updated 01-11-10 at 00:13 ---------- Previous update was 31-10-10 at 23:00 ----------

I tested your suggestion, but it does not work properly for the last field:

$ echo '     ,          abc,             ,  sd   ,      ,   ,     ,      ' | sed -r ':a;s/(,|^)  *(,|$)/\1\2/g;ta' | od -c
0000000   ,                                           a   b   c   ,   ,
0000020           s   d               ,   ,   ,   ,  \n
0000034

cuts the spaces in the last field like it should, whereas

$ echo '     ,          abc,             ,  sd   ,      ,   ,     ,      ' | sed -r 's/(  *,|,|^)  *(,|$)/\1\2/g;s/,  *,/,,/g' | od -c
0000000   ,                                           a   b   c   ,   ,
0000020           s   d               ,   ,   ,   ,
0000040      \n
0000042

leaves them in.

DGPickett · November 1, 2010, 12:33pm

(There are spaces in the third and final field of the second line):

$ sed '
  s/, *,/,,/g
 ' <<! |cat -e
aaa,   bbb,ccc   ,    ,    ,dddd
    ,eee,   
!
aaa,   bbb,ccc   ,,    ,dddd$
    ,eee,   $
$

See, one pass misses the adjacent field of spaces, and without the (), ^ and $ extended regex, the first and last fields are not done. You can work around that with two passes, space-space-asterisk to ignore the empty fields possibly from the first pass, and with extra commas:

$ sed '
  s/.*/,&,/
  s/,  *,/,,/g
  s/,  *,/,,/g
  s/,\(.*\),/\1/
 ' <<! |cat -e
aaa,   bbb,ccc   ,    ,    ,dddd
    ,eee,   
!
aaa,   bbb,ccc   ,,,dddd$
,eee,$
$