Retaining spaces between words

RcR · September 10, 2007, 1:45am

Retaining Spaces within a word

--------------------------------------------------------------------------------

Hi Experts,

I have a 2 GB flat file with a Unicode field that is 4000 characters wide and is blank in some records. The existing system uses a sed command to remove the spaces, and because of this one field the file processing takes almost three days. I removed sed and used the tr command instead, and it finished in less than a minute. The challenge now is that the character fields contain runs of more than one space between words. I am using tr -s ' ' '' to remove the spaces, but it also squeezes those runs of spaces between the words.

My sample record is this:

Earlier, the following command was used to remove spaces:

sed 's/[[:space:]]|/|/g; s/[ \t]$//g' < File1 > File2

Output was:
262774372|58959454|Rajiv Rajiv|tuerueeu||erueirei
647585858|784783434|Ramesha Ramesha|tyuu5u4o||ruieieiei

Time taken to process the file was 3.5 days.

Later I added a tr command before the sed to get rid of the spaces faster:

tr -s ' ' '' < File1 > File2
sed 's/[[:space:]]|/|/g; s/[ \t]$//g;s/^[ \t]*//g;' < File2 > File3

Output was:
262774372|58959454|Rajiv Rajiv|tuerueeu||erueirei
647585858|784783434| Ramesha Ramesha|tyuu5u4o||ruieieiei

Time taken to process the file was less than a minute, since tr handles the long runs of spaces much faster.

I am not able to keep the spaces between the words as they are, because tr -s squeezes every run of spaces down to one space.

The value | Rajiv Rajiv | was changed to |Rajiv Rajiv|

I have to retain the spaces between the words, i.e., |Rajiv Rajiv|
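
To make the problem concrete, here is a minimal sketch on a made-up sample line (the function name squeeze_spaces is hypothetical, it only wraps the command for the demo). It shows that tr -s ' ' collapses the runs inside a field as well, and still leaves one space next to the delimiter:

```shell
# Squeeze every run of spaces down to a single space, as tr -s ' ' does.
squeeze_spaces() {
    tr -s ' '
}

# Made-up sample: padding around the pipes plus triple spaces between words.
printf '%s\n' 'a |   b   c   |d' | squeeze_spaces
# prints: a | b c |d
```

So tr alone cannot tell padding next to a delimiter from spacing inside a field; whatever trims the fields has to look at the delimiter.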

Please let me know if you have any workaround...

Thanks,
Rajiv

denn · September 11, 2007, 2:58pm

The following should work for you.

tr -d "[= =]" < infile > outfile

Additionally, the [:space:] character class, as in your sed statement, is also supported by tr.

RcR · September 11, 2007, 10:51pm

Denn,

That eliminates all the spaces that exist between the words.

For example, if I have data like this:

"Rajiv |   Rajiv   Rajiv    Rajiv   |Rajiv                 Rajiv"

the command you suggested produces this output:

"Rajiv|RajivRajivRajiv|RajivRajiv"

I need the output in the following format:

"Rajiv|Rajiv   Rajiv    Rajiv|Rajiv                 Rajiv"

Thanks,
Rajiv

deepakpv · July 29, 2008, 10:03pm

Hi Rajiv,

Did you find a solution to the above problem?
Please help me, I am facing a similar problem.

Thanks,
Deepak

RcR · July 29, 2008, 10:09pm

Yes, I was able to achieve it. Here is the command:

cat filename | awk 'BEGIN{FS=OFS="|"} {for(i=1;i<=NF;i++)gsub("(^[[:space:]]*)|([[:space:]]*$)","",$i)};1' | awk 'NF > 0' > Output_Filename.txt

Thanks,
Rajiv

RcR · July 29, 2008, 10:13pm

Hi Deepak,

You will have to change the delimiter. In my case the delimiter was the pipe '|', so replace the "|" in FS=OFS="|" with whatever delimiter you have.
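
For anyone adapting this, a minimal sketch of the same idea with the delimiter passed in as a parameter. The function name trim_fields and the awk variable d are hypothetical names for illustration. It assumes a single-character delimiter and trims only spaces and tabs (written as [ \t] rather than [[:space:]] so that older awk versions accept it):

```shell
# trim_fields DELIM
# Strip leading and trailing spaces/tabs from every DELIM-separated field
# read from stdin. Spaces inside a field are left untouched.
trim_fields() {
    awk -v d="$1" 'BEGIN { FS = OFS = d }
    {
        for (i = 1; i <= NF; i++)
            gsub(/^[ \t]+|[ \t]+$/, "", $i)
        print
    }'
}

# Made-up sample lines, one pipe-delimited and one semicolon-delimited.
printf '%s\n' ' Rajiv |   Rajiv   Rajiv  | x ' | trim_fields '|'
printf '%s\n' ' a  b ; c ;d ' | trim_fields ';'
```

On a real file this would be called as trim_fields '|' < filename > Output_Filename.txt. Unlike the original pipeline, this sketch does not drop blank lines.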

Thanks,
Rajiv

deepakpv · July 30, 2008, 2:01am

Hi Rajiv,

Thanks for your reply.
I am using this command:

sed 's/ | /|/g' temp.dat > temp1

After reading your post I am scared to use the sed command.

Thanks,
Deep

era · July 30, 2008, 2:29am

There are many different sed implementations. Try it and measure it. Then if it's slow, feel free to be scared.

RcR's original approach is what I would recommend, with a couple of minor tweaks.

sed 's/^[[:space:]]*//;s/[[:space:]]*|[[:space:]]*/|/g;s/[[:space:]]*$//' file
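
As a quick check, here is that command run on a made-up sample line (the wrapper function trim_around_pipes is a hypothetical name, used only for the demo). The padding around each pipe and at both ends of the line goes away, while the run of spaces between the words is kept:

```shell
# Wrap the sed command so it can be fed from a pipe.
trim_around_pipes() {
    sed 's/^[[:space:]]*//;s/[[:space:]]*|[[:space:]]*/|/g;s/[[:space:]]*$//'
}

# Made-up sample: padded fields with three spaces between the two words.
printf '%s\n' ' Rajiv |   Rajiv   Rajiv  | x ' | trim_around_pipes
# prints: Rajiv|Rajiv   Rajiv|x
```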

If your sed is too slow, try the same with perl -p.