Questions on removing unexpected line breaks

Nekki_Basara · September 4, 2012, 11:50pm

I am a newbie to Linux and I am having trouble with some data I have on hand.
The source data looks like this:

a|b|c|d
e|f|g
|h
i|j|k|l
m|n|o
|p
1|2|3|4
5|6|7|
8
a|b|c|d
e|f|g|h

For each line, there should be 4 fields separated by "|", but unfortunately there are unexpected line breaks that make it a mess. :wall:
How can I clean up the mess by reformatting the lines so that they look like this?

a|b|c|d
e|f|g|h
i|j|k|l
m|n|o|p
1|2|3|4
5|6|7|8
a|b|c|d
e|f|g|h

I guess it should be something like checking the number of "|" in each line, and if the number of "|" in a line is <3, then the line break of that line has to be removed.
But I have no idea what should be used for this. sed, perhaps?

Has anyone encountered this issue before?
If so, could someone share how it could be tackled?

Thanks a lot.

Scrutinizer · September 5, 2012, 12:28am

To remove the line break in awk if the number of fields (NF) is less than 4, you could for example do this:

awk -F\| 'NF<4{getline p; $0=$0 FS p}1' file

pamu · September 5, 2012, 1:02am

Try this...

tr '\n' " " < file  | sed -e 's/ //g' -e 's/.\{7\}/&\n/g'

Nekki_Basara · September 5, 2012, 2:16am

I tried the awk command from Scrutinizer, but the result is not as expected:


a|b|c|d
e|f|g||h
i|j|k|l
m|n|o||p
1|2|3|4
5|6|7|
8|a|b|c|d
e|f|g|h

some lines get more than 4 fields!

---------- Post updated at 02:16 PM ---------- Previous update was at 02:14 PM ----------

The tr/sed command from pamu works well on the sample!

Could you please explain what the code is doing?
My Linux knowledge is very limited.

pamu · September 5, 2012, 2:34am

tr '\n' " " < file  | sed -e 's/ //g' -e 's/.\{7\}/&\n/g'

tr '\n' " " < file # Here I replace each newline "\n" with a space " ".
# As a result, all the lines end up on one single line.

sed -e 's/ //g' # Here I replace each space " " with "", which removes the spaces from the string.

-e 's/.\{7\}/&\n/g' # Here I add a newline after every 7 characters of the string.
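The three stages described above can be run one at a time to see what each contributes. This is only a sketch on part of the sample from the first post; the variable names (input, stage1, stage2, result) are made up for illustration, and the "\n" in the sed replacement is assumed to behave as in GNU sed.

```shell
# Part of the sample input, built with printf so the sketch is self-contained.
input=$(printf 'a|b|c|d\ne|f|g\n|h\n5|6|7|\n8\n')

# Stage 1: turn every newline into a space, so everything sits on one line.
stage1=$(printf '%s\n' "$input" | tr '\n' ' ')

# Stage 2: delete the spaces, leaving one unbroken run of fields and pipes.
stage2=$(printf '%s\n' "$stage1" | sed -e 's/ //g')

# Stage 3: insert a newline after every 7 characters
# (4 one-character fields + 3 pipes = 7). Assumes GNU sed for "\n".
result=$(printf '%s\n' "$stage2" | sed -e 's/.\{7\}/&\n/g')

printf '%s\n' "$result"
```

The 7 comes from four one-character fields plus three pipes, so the approach depends on every record having exactly the same length.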

Hope this helps you..

pamu:)

elixir_sinari · September 5, 2012, 2:42am

That sed solution will not work if your real data has longer strings between the pipes, since it assumes every record is exactly 7 characters long. Try:

awk '{while(gsub(/[|]/,"&")!=3 || $0 ~ /[|]$/){getline p;$0=$0 p}}1' file

Blank lines in input will be removed by this. If you want to retain them, use:

awk -F\|  'NF{while(gsub(FS,"&")!=3 || $0 ~ /[|]$/){getline p;$0=$0 p}}1' file

Also, I hope that the last field value is not null.
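Both one-liners rest on two tests: gsub(/[|]/,"&") replaces every "|" with itself and returns how many it replaced, so it works as a delimiter counter, and $0 ~ /[|]$/ checks whether the line ends in "|". A small sketch that only prints those two values per line (the variable names pipes and trailing are made up for illustration):

```shell
result=$(printf 'a|b|c|d\ne|f|g\n|h\n5|6|7|\n' | awk '
{
    pipes = gsub(/[|]/, "&")                 # count of "|" on this line
    trailing = ($0 ~ /[|]$/) ? "yes" : "no"  # does the line end with "|"?
    print pipes, trailing, $0
}')
printf '%s\n' "$result"
```

A line is treated as complete only when the first value is 3 and the second is "no"; in every other case the loop appends the next line and tests again.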

pamu · September 5, 2012, 3:13am

Hi elixir_sinari,

Yes, you are right.

Thanks for giving a more robust solution.

elixir_sinari · September 5, 2012, 3:15am

Actually, it's not very robust. The command will most likely hang (loop infinitely) if the last (fourth) field has a null value. Checking the return value of getline avoids that:

awk -F\|  'NF{while(gsub(FS,"&")!=3 || $0 ~ /[|]$/){if(getline p) $0=$0 p;else break}}1' file

This will at least break out of the loop but will not give the desired result, in such a case.
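A short sketch of that failure case, on a made-up two-line input in which the fourth field of the first record is really empty. The delimiter is counted with /[|]/ here rather than FS, which is an assumption made for portability and not part of the command above.

```shell
# Hypothetical input: the first record has an empty fourth field.
result=$(printf 'a|b|c|\ne|f|g|h\n' | awk '
{
    while (gsub(/[|]/, "&") != 3 || $0 ~ /[|]$/) {
        # getline returns 1 on success, 0 at end of input, -1 on error.
        if ((getline p) > 0) $0 = $0 p
        else break
    }
    print
}')
printf '%s\n' "$result"   # one fused line: a|b|c|e|f|g|h
```

The trailing "|" makes the loop pull in the next, already complete, record; the count then never equals 3 again, and only the end of input stops the loop.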

pamu · September 5, 2012, 3:34am

Yes, a little bit.

I know my solution below also looks a little bit lengthy, but I think it might resolve this issue:

sed -e 's/^|//g' -e 's/|$//g' file | awk -F\| 'NF<4 || $4 == ""{getline p; $0=$0 FS p}1'

Scrutinizer · September 5, 2012, 7:58am

Ah yes.. This should work better

awk -F\| '$4==x{getline p; $0=$0 p}1' infile
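For readers new to awk, the one-liner can be spelled out roughly as follows. This is a sketch, not the poster's exact code: the comparison is written against "" explicitly (the one-liner uses the never-assigned variable x for the same purpose), and a check on getline's return value is added.

```shell
result=$(printf 'a|b|c|d\ne|f|g\n|h\n5|6|7|\n8\n' | awk -F'|' '
# $4 is empty both when the line has fewer than 4 fields and when
# the fourth field exists but holds nothing.
$4 == "" {
    if ((getline p) > 0) $0 = $0 p   # append the next line, no separator added
}
{ print }   # same effect as the bare pattern 1 in the one-liner
')
printf '%s\n' "$result"
```

Because only one extra line is ever appended, a record broken into three or more pieces would still come out incomplete.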

Nekki_Basara · September 6, 2012, 9:50pm

awk '{while(gsub(/[|]/,"&")!=3 || $0 ~ /[|]$/){getline p;$0=$0 p}}1' file

This code seems to work great on the sample!
Could someone kindly explain what it is doing? The regular expression here is kind of difficult for me... :wall:

awk -F\|  'NF{while(gsub(FS,"&")!=3 || $0 ~ /[|]$/){getline p;$0=$0 p}}1' file

Besides, I get an error when I run this code.
I ran it on the sample file called "test1.txt" and got the following error.

awk: run time error: regular expression compile failed (missing operand)
|
FILENAME="test1.txt" FNR=1 NR=1

Have I missed something? I have put in the filename already.
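The error text is consistent with gsub(FS,"&") handing the string "|" to the regular expression compiler, where a bare "|" is an alternation with nothing on either side. That reading is an assumption based on the message; the thread does not confirm it. A sketch of a workaround wraps FS in a bracket expression so the pipe is taken literally (the variable name delim is made up for illustration, and a getline check is added):

```shell
result=$(printf 'a|b|c|d\n\ne|f|g\n|h\n' | awk -F'|' '
BEGIN { delim = "[" FS "]" }   # "[|]": inside brackets the pipe is literal
# NF is 0 on a blank line, so blank lines skip the loop and print as-is.
NF {
    while (gsub(delim, "&") != 3 || $0 ~ (delim "$")) {
        if ((getline p) > 0) $0 = $0 p
        else break
    }
}
{ print }
')
printf '%s\n' "$result"
```

Writing the regular expression as the constant /[|]/, as in the first command, should work equally well.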

---------- Post updated at 09:15 AM ---------- Previous update was at 08:58 AM ----------

This one also works!
Could you please kindly explain what's the meaning of the code?

---------- Post updated at 09:29 AM ---------- Previous update was at 09:15 AM ----------

---------- Post updated at 09:50 AM ---------- Previous update was at 09:29 AM ----------

Sorry to all... I am confused...
There are so many solutions!
But each solution yields different results!

Here is the real data format for my case:

12345|123456|999|D|1|123|1.2345|12.345|23.4567|||||||
987654|123456|999|O|12|99|2.3456|123.4567|345.6789|||||||Y
987654|123456|999|O|12|99|3.4567|123.4567|345.6789|||||||Y
987654|123456|999|O|12|99|4.5678|12.345|23.4567|||||||Y
987654|123456|999|O|12|99|5.6789|123.4567|345.6789|||||||Y
987654|123456|999|O|12|99|6.7890|123.4567|345.6789|||||||Y
987654|123456|999|H|1|1|34.5678|56.7890|67.8901||
|||||Y
987654|123456|999|E|1|1|2.3456|2.3456|2.34|||||||Y
.
.
.

In total, I have 614293 lines in my data.

I tried the following code, which yielded 595647 lines in the result. (This one gives me the error "new-line character seen in unquoted field" when I process the result further with another script... :wall:)

sed -e 's/^|//g' -e 's/|$//g' source_data.csv | awk -F\| 'NF<16 || $16 == ""{getline p; $0=$0 FS p}1'>test1.csv

Then I tried the following code, which yielded 595433 lines in the result:

awk '{while(gsub(/[|]/,"&")!=15 || $0 ~ /[|]$/){getline p;$0=$0 p}}1' source_data.csv>test2.csv

Then I tried the following, which yielded 595647 lines. (This one also gives me the error "new-line character seen in unquoted field" when I process the result further with another script... :wall:)

awk -F\| '$16==x{getline p; $0=$0 p}1' A27.csv >test3.csv

I am totally confused about what causes the differences... and which code I should use... :wall:

pamu · September 7, 2012, 12:58am

That's why it is advisable to provide real data....

Please provide input and desired output.

Scrutinizer · September 7, 2012, 3:23am

I think the problem lies in the last field. In your sample, an empty last field or fewer than X fields always means that the lines should be merged. In your actual data that is not always the case; sometimes lines do not need to be merged even if the last field contains no value.

Perhaps merging needs to occur only if a line has fewer than X fields OR (it has X fields AND the last field is empty AND the next line has fewer than X fields)?
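A minimal awk sketch of that rule, assuming X=4 and a made-up input in which the first record has a truly empty last field. It holds one record back so it can look at the next line before deciding; the helper fields() and the variables rec and short_next are hypothetical names, not from the thread.

```shell
result=$(printf 'a|b|c|\ne|f|g|h\ni|j|k\n|l\n5|6|7|\n8\n' | awk -v X=4 '
# Field count = pipe count + 1; s is passed by value, so the caller
# string is not modified.
function fields(s) { return gsub(/[|]/, "&", s) + 1 }

NR == 1 { rec = $0; next }
{
    n = fields(rec)
    short_next = (fields($0) < X)
    if (n < X || (n == X && rec ~ /[|]$/ && short_next))
        rec = rec $0          # this line continues the held record
    else {
        print rec             # the held record is complete
        rec = $0
    }
}
END { if (NR > 0) print rec }
')
printf '%s\n' "$result"
```

On this input the record a|b|c| is left alone because the line after it is complete, while 5|6|7| is joined with the 8 that follows. A short line that is followed by too many fields would still produce an overlong record, so the sketch does not validate the data.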

bakunin · September 7, 2012, 9:29am

If I have counted correctly, you have 16 fields in each line. This means there have to be 15 delimiters; if there are fewer, merge the next line into this one, otherwise leave it alone.

The following should do what you want:

sed -n ':start
        /[^|]*\(|[^|]*\)\{15\}/ !{
              N
              s/\n//
              b start
        }
        p' /path/to/infile > /path/to/outfile

This will even connect lines broken into several pieces, but consecutive lines have to add up to complete ones, otherwise the script will fail to produce correct results.

That is, if a line with 14 delimiters is followed by a line with 16 delimiters, it will produce one line with 30 delimiters, not two lines with 15 delimiters each.

If I have miscounted the fields or your file format changes, you can correct this in the counter "\{n\}", which repeats the previous expression "\(|[^|]*\)" (a delimiter followed by optional non-delimiter characters) n times.
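One way to check the counter before running the script is to tally how many delimiters each line has. The sketch below does this for four rows copied from the data posted earlier in the thread; on the real file the printf would be replaced by the file itself. The array name tally is made up for illustration.

```shell
result=$(printf '%s\n' \
    '12345|123456|999|D|1|123|1.2345|12.345|23.4567|||||||' \
    '987654|123456|999|O|12|99|2.3456|123.4567|345.6789|||||||Y' \
    '987654|123456|999|H|1|1|34.5678|56.7890|67.8901||' \
    '|||||Y' |
awk '
{ tally[gsub(/[|]/, "&")]++ }                # delimiters found on this line
END { for (n in tally) print n, tally[n] }  # output: delimiters, line count
' | sort -n)
printf '%s\n' "$result"
```

Each output row gives a delimiter count followed by the number of lines that have it. In these four rows the complete lines have 15 delimiters (16 fields), and the broken record splits into pieces with 10 and 5, so any count other than the most common one points at lines that need merging.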

I hope this helps.

bakunin