sed and cut behaving differently

amicon007 · May 1, 2010, 9:56am

I have attached a file with a few records. The first 2 characters of each record are binary characters. I can remove them with

and it works fine. But

behaves differently and removes more characters than expected. Can someone help me accomplish this with sed? Thanks in advance.

Reboot · May 1, 2010, 10:06am

I am unable to open the attached file due to its extension...

amicon007 · May 1, 2010, 10:24am

Open it in 'vi'.

---------- Post updated at 06:24 AM ---------- Previous update was at 06:10 AM ----------

Maybe sed treats them as one binary char. Checking...

vidyadhar85 · May 1, 2010, 10:25am

Why do you want to do it with sed?
And how many characters are getting removed?

ahmad.diab · May 1, 2010, 10:31am

sed 's/^.{2}//' sample

On Solaris, use the following:

sed 's/^.\{2\}//' sample
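A side note on why the two commands above differ: in basic regular expressions (sed's default), the interval operator is written `\{2\}`, while the unescaped `{2}` form belongs to extended regular expressions. The sketch below uses a made-up plain-text line, with no binary bytes, and assumes a sed that accepts `-E` (GNU and BSD sed do; the stock Solaris sed may not). With GNU sed, the last command should leave the line untouched because the braces are taken literally; other implementations may behave differently.

```shell
# Hypothetical one-line input; plain text only.
line='abcdef'

# BRE (default mode): \{2\} is the interval operator, so two characters go.
bre=$(printf '%s\n' "$line" | sed 's/^.\{2\}//')

# ERE (-E): the unescaped {2} is the interval operator.
ere=$(printf '%s\n' "$line" | sed -E 's/^.{2}//')

# BRE with unescaped braces: { and } are ordinary characters here,
# so the pattern looks for one character followed by the text "{2}".
literal=$(printf '%s\n' "$line" | sed 's/^.{2}//')

printf 'BRE interval:   %s\n' "$bre"
printf 'ERE interval:   %s\n' "$ere"
printf 'literal braces: %s\n' "$literal"
```

If the third form seems to work on a given system, it is worth checking whether that sed has been built or aliased to use extended syntax.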

amicon007 · May 1, 2010, 10:33am

Well, the file is several GB in size, and sed works 5-6 times faster than cut. Strangely, I just tried stripping 1 char:

and it worked the same as stripping 2 chars using cut.

Any logic in this?

ahmad.diab · May 1, 2010, 10:40am

sed 's/^.//' sample  ->-> will strip the first char only

I have tried it.

but

sed 's/^.{2}//' sample ->-> will strip the first 2 char .{2} means 2 char

will delete the first 2 char..and I have tried it too.

also

sed 's/^..//' sample will delete the first 2 char >> I have tried it on Solaris 10.

I don't know where your problem is. What is your OS?
BR

amicon007 · May 1, 2010, 10:44am

Just try stripping off first 2 characters using sed and using cut. Check the difference in the results.

pseudocoder · May 1, 2010, 10:48am

I can not confirm that, both behave like they should:

$ cut -c3- sample.conf > sample-cut.conf
$ sed 's/^..//' sample.conf > sample-sed.conf
$ diff sample-cut.conf sample-sed.conf
$

amicon007 · May 1, 2010, 10:51am

I am having the difference:

$ diff sample-cut.conf sample-sed.conf
1,10c1,10
< 147405037|44846|44846|8705|20100401000000|20100516000000|20100408220743|20100523235959|20100408220743|||20100326014658|S|15092360154
< 26537555|44849|44849|8705|20100401000000|20100516000000||||||20100326014658|S|15077793658
< 230042230|44857|44857|8705|20100401000000|20100516000000||||||20100326014658|S|15098928810
< 43398728|44848|44848|8705|20100401000000|20100516000000|20100401092126|20100516235959|20100401092126|||20100326014658|S|15080179924
< 236218845|44848|44848|8705|20100401000000|20100516000000||||||20100326014658|S|15100098523
< 22029612|44859|44859|8705|20100401000000|20100516000000|20100402165043|20100517235959|20100402165043|||20100326014658|S|15077092386
< 242395460|44846|44846|8705|20100401000000|20100516000000||||||20100326014658|S|15100863598
< 121527978|44846|44846|8705|20100401000000|20100516000000||||||20100326014658|S|15088997374
< 254748690|44846|44846|8705|20100401000000|20100516000000||||||20100326014658|S|15103592530
< 146234438|44846|44846|8705|20100401000000|20100516000000|20100415163904|20100530235959|20100415163904|||20100326014658|S|15092152331
---
> 47405037|44846|44846|8705|20100401000000|20100516000000|20100408220743|20100523235959|20100408220743|||20100326014658|S|15092360154
> 6537555|44849|44849|8705|20100401000000|20100516000000||||||20100326014658|S|15077793658
> 30042230|44857|44857|8705|20100401000000|20100516000000||||||20100326014658|S|15098928810
> 3398728|44848|44848|8705|20100401000000|20100516000000|20100401092126|20100516235959|20100401092126|||20100326014658|S|15080179924
> 36218845|44848|44848|8705|20100401000000|20100516000000||||||20100326014658|S|15100098523
> 2029612|44859|44859|8705|20100401000000|20100516000000|20100402165043|20100517235959|20100402165043|||20100326014658|S|15077092386
> 42395460|44846|44846|8705|20100401000000|20100516000000||||||20100326014658|S|15100863598
> 21527978|44846|44846|8705|20100401000000|20100516000000||||||20100326014658|S|15088997374
> 54748690|44846|44846|8705|20100401000000|20100516000000||||||20100326014658|S|15103592530
> 46234438|44846|44846|8705|20100401000000|20100516000000|20100415163904|20100530235959|20100415163904|||20100326014658|S|15092152331

I have SunOS

pseudocoder · May 1, 2010, 11:06am

Obviously your sed doesn't like the leading characters of the input lines, which are "^@".
Just try

sed 's/^.//' sample

If I were you, I'd create a fresh file with a couple of test lines like the following:

^@abcd
^@1234
^@9999
^@wxyz

and run sed again and see what happens.
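One way to build such a test file without typing control characters by hand is printf with octal escapes. This is only a sketch: the file names are made up, and it assumes the `^@` shown above stands for a single NUL byte (which is how vi displays byte 000). It deliberately prints no expected verdict, since the result depends on the sed in use.

```shell
# Hypothetical scratch file; every line starts with one NUL byte (^@ in vi).
testfile=nul-test.txt
printf '\000abcd\n\0001234\n\0009999\n\000wxyz\n' > "$testfile"

# Show the raw bytes so the leading \0 on each line is visible.
od -c "$testfile"

# Strip the first character of each line with both tools and compare.
cut -c2- "$testfile" > nul-cut.txt
sed 's/^.//' "$testfile" > nul-sed.txt

if cmp -s nul-cut.txt nul-sed.txt; then
    echo "cut and sed agree on this system"
else
    echo "cut and sed disagree on this system"
fi
```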

amicon007 · May 1, 2010, 11:34am

It does recognize:

There seems to be some difference in how my sed and cut handle binary characters.

drl · May 1, 2010, 11:51am

Hi.

I ended up with a long script that compared the results of cut, 3 variations of sed, and the binary editor bbe. I did these on Linux and Solaris 10 (but bbe omitted on Solaris). I used cmp as the first test, and then diff for the detailed comparison. On Linux, GNU diff bails out quickly, simply saying that the binary files differ.

The sed variations that I used were:

sed 's/^..//' ...
sed 's/^.\{2\}//' ...
sed 's/^.{2}//' ...

They failed on both Linux and Solaris:

OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0

OS, ker|rel, machine: SunOS, 5.10, i86pc

I used the cut output as the standard. The bbe output compared successfully.
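The full script is not reproduced here. What follows is a much shorter sketch of the same idea, not the script itself: take the cut output as the reference and use cmp to check each sed variant against it. The function name, the temporary-file handling via mktemp (assumed to be available) and the plain-text sample file are all made up for illustration, and bbe is left out.

```shell
# compare_sed_to_cut FILE
# Uses the output of cut -c3- as the reference and reports, for each sed
# variant, whether its output is byte-identical (checked with cmp -s).
compare_sed_to_cut() {
    input=$1
    ref=$(mktemp) || return 1
    out=$(mktemp) || return 1
    cut -c3- "$input" > "$ref"
    for script in 's/^..//' 's/^.\{2\}//' 's/^.{2}//'; do
        sed "$script" "$input" > "$out"
        if cmp -s "$ref" "$out"; then
            printf 'match   %s\n' "$script"
        else
            printf 'differ  %s\n' "$script"
        fi
    done
    rm -f "$ref" "$out"
}

# Hypothetical plain-text sample to exercise the function; substitute the
# real file (for example sample.conf) to repeat the experiment.
printf 'XYabc|1\nXYdef|2\n' > plain-sample.txt
compare_sed_to_cut plain-sample.txt
```

Running it first on a plain-text file and then on the file with binary prefixes separates regex-syntax differences from anything caused by the binary bytes.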

Observations:

1) mixed-mode files are not best-practice

2) cut knows about bytes, and appears to use its byte "knowledge" as its character knowledge

3) sed is advertised as:

sed - stream editor for filtering and transforming text

not necessarily collections of arbitrary bytes in mixed-mode files

4) At the center where I worked, we said that we could make processes (almost) as fast as you desired as long as you didn't care about the results. If you get good results in a reasonable time from a particular process, then use it. You may end up wasting more time trying to find the fastest method than if you just let the original process run. This is a kind of case of premature optimization, along with the notion that people time is the most expensive (in most cases).

5) I did not time bbe, but it might be as fast as sed -- it is:

bbe is a sed-like editor for binary files. It performs binary transformations on the blocks of input stream.

I can post the script and results, however, as I said, they are lengthy. As usual, it is possible that I have incorporated an error of some kind, but in general I agree with the OP ... cheers, drl

alister · May 1, 2010, 7:47pm

A quick peek at sample.conf shows that the first byte in the lines is a null byte. It's a safe bet that what you are seeing is a side effect of C string functions interpreting a null byte as end of string.

For experimentation's sake, does the discrepancy persist if you substitute a 001 byte for the 000 bytes?

tr \\000 \\001 < withnull > withoutnull

If it does not, mystery solved. If it does, weird. Could that sed implementation be filtering out control characters?
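The experiment can be run end to end roughly as follows. This is a sketch under assumptions: the `withnull` file is built here as a two-line stand-in whose second prefix byte (002) is invented, since only the first byte of the real records is known to be null, and the `-cut` / `-sed` output names are made up.

```shell
# Stand-in for the real input: each record starts with a NUL byte followed
# by a second binary byte (002 is a guess), then ordinary text.
printf '\000\002147405037|44846\n\000\00226537555|44849\n' > withnull

# Replace every 000 byte with 001, as suggested above.
tr '\000' '\001' < withnull > withoutnull

# Strip the two leading bytes with both tools, then compare the results.
cut -c3- withoutnull > withoutnull-cut
sed 's/^..//' withoutnull > withoutnull-sed

if cmp -s withoutnull-cut withoutnull-sed; then
    echo "no discrepancy once the null bytes are gone"
else
    echo "discrepancy persists without null bytes"
fi
```

Repeating the cut/sed/cmp steps on `withnull` itself shows whether the null byte alone accounts for the difference on a given system.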

Regards,
Alister