File Containing Extra delimiter should be removed

The input file is this:

a|b|c|d
x|y|z|n|||||||||
p|q|r|s|||
g|h|i|
w|e|r||

Now as per requirement , each row should have only 3 delimiter.

Now the 2nd row & last row has an extra delimiter, How to remove that ? In some large file having 100K data , there can be 100 such rows having extra pipe , how to remove it ?

The output should be as per above input sample:

a|b|c|d
x|y|z|n
p|q|r|s
g|h|i|
w|e|r|

Please, as we do not know your environment as you do, tell us about it:

your shell?
your OS?
the version of your OS?

Furthermore: i guess that your file can contain only ONE extra delimiter per line and the delimiter will contain no data, like:

a|b|c|d|e|||

Otherwise, you will have to explain what to do with such cases.

If this is so and you have a run-of-the-mill UNIX system you can try the following:

sed 's/^\(\([^|]*|\)\{3\}[^|]*\).*/\1/' /path/to/your/file

This will display the changed file only to screen. If you are satisfied with the outcome use:

sed 's/^\(\([^|]*|\)\{3\}[^|]*\).*/\1/' /path/to/your/file > /path/to/newfile

to save these results.

Explanation of the regexp:

[^|]*| matches a single cell, an arbitrary number of non-delimiters followed by a delimiter. This regexp is repeated three times: \([^|]*|\)\{3\} , then followed by an optional field content of non-delimiters: \([^|]*|\)\{3\}[^|]* .

All this is surrounded by brackets to use it as a back-reference. Any further content of the line is then included only to replace everything by the back-reference above so that effectively the rest of the line is deleted.

I hope this helps.

bakunin

Thank you very much for your reply. I will check the env. details and will post.

linux/bash

regarding the scenario, I should have clearified it in first instance.

Use case is - Each row only have 4 Pipe delimiter not more than that. If there are 2/3/4 extra delimiter, it will need to be removed. Last field can/can't contain data as that is a nullable field.

Another approach using awk:-

awk -F\| '{NF=4}1' OFS=\| file
1 Like

Hi.

I liked Yoda's solution.

I modified it to look at the first line, and use that as a model -- if that line is correct, then the following lines will be modified to conform to that.

For example, for the original data on z1:

a|b|c|d
x|y|z|n|||||||||
p|q|r|s|||
g|h|i|
w|e|r||

This

awk -F\| 'NR==1{n=NF}{NF=n}1' OFS=\| z1

produces:

a|b|c|d
x|y|z|n
p|q|r|s
g|h|i|
w|e|r|

Whereas for data like this on z2:

a|b|c|d|e
x|y|z|n|m||||||||
p|q|r|s|t||
g|h|i|j|
w|e|r|s||

the same code

awk -F\| 'NR==1{n=NF}{NF=n}1' OFS=\| z2

produces

a|b|c|d|e
x|y|z|n|m
p|q|r|s|t
g|h|i|j|
w|e|r|s|

Best wishes ... cheers, drl