Retaining spaces between words

Retaining Spaces within a word

--------------------------------------------------------------------------------

Hi Experts,

I have a 2 GB flat file with a Unicode field that is 4000 characters wide; in some records it is blank. The existing system uses a sed command to remove the spaces, and because of this one field it takes almost three days to process the file. I removed sed and used the tr command instead, and it finished in less than a minute. The challenging part is that the character fields contain runs of more than one space. I am using tr -s ' ' '' to remove the spaces, but it also squeezes the runs of spaces between the words inside a field.

My sample record is this:

262774372|58959454 | Rajiv Rajiv | tuerueeu | | erueirei
647585858|784783434 | Ramesha Ramesha| tyuu5u4o| | ruieieiei

Earlier, the following command was used to remove the spaces:

sed 's/[[:space:]]|/|/g; s/[ \t]$//g' < File1 > File2

Output was:
262774372|58959454|Rajiv Rajiv|tuerueeu||erueirei
647585858|784783434|Ramesha Ramesha|tyuu5u4o||ruieieiei

Time taken to process file was 3.5 days

Later I added a tr command before the sed to remove the spaces faster:

tr -s ' ' '' < File1 > File2
sed 's/[[:space:]]|/|/g; s/[ \t]$//g;s/^[ \t]*//g;' < File2 > File3

Output was:
262774372|58959454|Rajiv Rajiv|tuerueeu||erueirei
647585858|784783434| Ramesha Ramesha|tyuu5u4o||ruieieiei

The file was processed in less than a minute, since tr handles the long runs of spaces much faster.

I am not able to retain the spaces between the characters as is, since tr -s will squeeze the space to one space.

The value | Rajiv Rajiv | -> changed to |Rajiv Rajiv|

I have to retain the spaces, i.e., |Rajiv Rajiv|

Please let me know if you have any workaround...

Thanks,
Rajiv

The following should work for you.

tr -d "[= =]" < infile > outfile

Additionally, [:space:], as in your sed statement, is also supported by tr.
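
One caveat when choosing between the two classes in tr: [:space:] includes the newline character, while [:blank:] covers only space and tab. The following is a minimal sketch with made-up sample data (the variable name sample is only for illustration). Note that both variants still delete the spaces inside a field as well.

```shell
# Two padded sample records (made-up data in the style of the thread).
sample='Rajiv |  Rajiv   Rajiv
Deepak |  Deepak   Deepak'

# [:blank:] matches only space and tab, so each record stays on its own line.
printf '%s\n' "$sample" | tr -d '[:blank:]'

# [:space:] also matches newline, so both records end up on one line.
printf '%s\n' "$sample" | tr -d '[:space:]'
```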

Denn,

It is eliminating all the spaces that exist between the words.

E.g., if I have data like this:

"Rajiv |   Rajiv   Rajiv    Rajiv   |Rajiv                 Rajiv"

the command you suggested produces this output:

"Rajiv|RajivRajivRajiv|RajivRajiv"

I need the output in the following format:

"Rajiv|Rajiv   Rajiv    Rajiv|Rajiv                 Rajiv"

Thanks,
Rajiv

Hi Rajiv,

Did you find a solution to the above problem?
Please help me; I am facing a similar problem.

Thanks,
Deepak

Yes, I was able to achieve it....

here is the command....

cat filename | awk 'BEGIN{FS=OFS="|"} {for(i=1;i<=NF;i++)gsub("(^[[:space:]]*)|([[:space:]]*$)","",$i)};1' | awk 'NF > 0' > Output_Filename.txt

Thanks,
Rajiv

Hi Deepak,

You will have to change the delimiter. In my case the delimiter was a pipe '|', so you should replace the "|" in FS=OFS="|" with whatever delimiter you have.

Thanks,
Rajiv
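
For anyone adapting this to another delimiter, here is a minimal sketch of the same awk idea with the delimiter passed in as an argument. The function name trim_fields is hypothetical, and it assumes an awk that understands POSIX character classes such as [[:space:]] and a single-character delimiter (longer delimiters are treated by awk as a regular expression).

```shell
# trim_fields DELIM: strip leading and trailing whitespace from every
# DELIM-separated field, leaving spaces inside a field untouched.
# (Hypothetical helper name; reads stdin, writes stdout.)
trim_fields() {
    awk -v d="$1" '
        BEGIN { FS = OFS = d }          # same delimiter for input and output
        {
            for (i = 1; i <= NF; i++)
                gsub(/^[[:space:]]+|[[:space:]]+$/, "", $i)
        }
        $0 != ""                        # print only records that are not empty
    '
}

# Usage with the sample line from earlier in the thread:
printf '%s\n' 'Rajiv |   Rajiv   Rajiv    Rajiv   |Rajiv                 Rajiv' | trim_fields '|'
```

This also avoids the extra cat and the second awk process, which may help a little on a 2 GB file.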

Hi Rajiv,

Thanks for your reply.
I am using this command: sed 's/ | /|/g' temp.dat>temp1
After reading your post I am scared to use the sed command.

Thanks,
Deep

There are many different sed implementations. Try it and measure it. Then if it's slow, feel free to be scared.

RcR's original approach is what I would recommend, with a couple of minor tweaks.

sed 's/^[[:space:]]*//;s/[[:space:]]*|[[:space:]]*/|/g;s/[[:space:]]*$//' file

If your sed is too slow, try the same with perl -p