Remove line break at specific position

qranumo · August 13, 2018, 8:02am

Hi,

I need to remove line breaks from a file, but only the ones at a specific position.

Input file:

this is ok
this line is divided at posit
ion 30. The same as this one,
also position 30
the rest of lines are ok
with different lengths
The longest ones are always s
plitted at same position

Expected output:

this is ok
this line is divided at position 30. The same as this one,also position 30
the rest of lines are ok
with different lengths
The longest ones are always splitted at same position

I have tried awk and sed, but I didn't manage to get exactly this behaviour. Could you please help?

joeyg · August 13, 2018, 8:08am

Is this a homework assignment?

What are the rules for this?
(Logical intent to remove the characters)

And what have you tried?

wisecracker · August 13, 2018, 8:17am

Following on from joeyg's comments...

Which OS, which shell, preferred tools and...

What if the line actually contains those 30 characters and is complete, do we look for a '.'?

qranumo · August 13, 2018, 8:23am

Hi Joeyg,

It's not homework. I have a file (more than 450000 lines) in which all the longest lines were cropped at position 171. I created a short example file with lines split in two when longer than 30 characters. I need all those lines to be concatenated, and the only way I can imagine is to remove the line breaks that are at a specific position (position 30 in the example), as the lines have no string in common.

------ Post updated at 01:23 PM ------

Hi wisecracker,

Let's suppose that all lines with 30 characters are not complete.

I'm using a unix emulator in windows (cygwin) but I could run it in Solaris if needed.

joeyg · August 13, 2018, 8:42am

So every extra CR/LF is at position 30?
Are you looking for some kind of process to strip out only those at position 30?

qranumo · August 13, 2018, 8:51am

That's right. I want to remove the CR/LF sequences that are at position 30.

joeyg · August 13, 2018, 9:33am

1) What did you try?
2) I can imagine a solution that simply counts the number of characters in the line and then, if the 30th character is a CR, removes it. However, how do you plan to handle lines that SHOULD have the CR at position 30? Some lines may.

qranumo · August 13, 2018, 9:38am

1) This is what I've tried so far:

sed ':a;N;$!ba;s/^\(.\{30\}\)\n/./g'

sed 's/^\(.\{30\}\)\n/./g'

awk '{ gsub(/\n/,"",$30); print $0}'

2) I don't expect any line to legitimately have a CR at position 30

joeyg · August 13, 2018, 10:45am

I am looking at your third option (awk) but do not see anything obviously wrong. What is happening to your data with this approach?
One thing in the back of my mind is that some files terminate lines with <CR> while others use <CR><LF>. The unix2dos and dos2unix commands are relevant here as well.
Have you done a hexdump/octaldump of the file to see the actual characters?

There are a few threads on this here --

Don_Cragun · August 13, 2018, 11:17am

Assuming that there are no tab characters on any line (or that you count a tab character as always occupying one position) and that you want to keep CR/LF character pairs as line terminators, the following should work on Cygwin for your sample:

awk '
substr($0, 30, 1) == "\r" {
        printf("%29.29s", $0)
        next
}
1' file

On Solaris systems, you'll need to use /usr/xpg4/bin/awk or nawk instead of awk.

Note that the above code might not work on any system if your files have CR/LF line separators instead of line terminators AND it will not work on UNIX text files that have LF line terminators.
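For input that may arrive with either LF or CR/LF terminators, one possible variation is sketched below. It is an illustration, not Don's code: the helper name join_split_lines is made up, the width 29 is taken from the sample, and it assumes an awk that understands \r in a regular expression.

```shell
# Sketch: join lines broken after 29 characters, accepting LF or CR/LF input.
# join_split_lines is a hypothetical helper name, not something from the thread.
join_split_lines() {
    awk '
    {
        cr = sub(/\r$/, "")     # strip a trailing CR and remember whether there was one
    }
    length($0) == 29 {          # full-width line: treat its line break as an inserted one
        printf "%s", $0
        next
    }
    {
        printf "%s%s\n", $0, (cr ? "\r" : "")   # put back the terminator style we found
    }' "$@"
}

# Usage with CR/LF input; od -c makes the surviving terminators visible
printf 'this is ok\r\nthis line is divided at posit\r\nion 30.\r\n' | join_split_lines | od -c
```

As with the original, a complete line that happens to be exactly 29 characters long would be joined to the next one.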

qranumo · August 13, 2018, 1:41pm

I'm afraid they're not working:

##File with LF only or CR/LF at end of line:

$ awk '
substr($0, 30, 1) == "\n" {
        printf("%29.29s", $0)
        next
}
1' shorttest1.txt
this is ok
this line is divided at posit
ion 30. The same as this one,
also position 30
the rest of lines are ok
with different lengths
The longest ones are always s
plitted at same position

##->Output is same as file

##File with CR only at the end of each line

$ awk '
substr($0, 30, 1) == "\r" {
        printf("%29.29s", $0)
        next
}
1' shorttest1.txt
plitted at same positionays s

##-> Output is a mix of two last lines

------ Post updated at 07:41 PM ------

It works if we look for character at position 29 (the one before the end of line):

$ awk '
substr($0, 29, 1) == "t" {
        printf("%29.29s", $0)
        next
}
1' shorttest1.txt
this is ok
this line is divided at position 30. The same as this one,
also position 30
the rest of lines are ok
with different lengths
The longest ones are always s
plitted at same position

But it's not a solution as I don't have any string that matches all lines
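A likely reason the "\n" test leaves the file unchanged: awk removes the record separator before a rule sees $0, so no position in $0 ever holds the newline. With CR/LF input the CR does stay in $0, which is why the "\r" test only applies to that format. A minimal sketch to check this, using made-up three-character input:

```shell
# LF input: the newline is stripped, so $0 is 3 characters long and contains no "\n"
printf 'abc\n' | awk '{ print length($0), index($0, "\n") }'
# prints: 3 0

# CR/LF input: only the LF is stripped, the CR is character 4 of $0
printf 'abc\r\n' | awk '{ print length($0), (substr($0, 4, 1) == "\r") }'
# prints: 4 1
```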

wisecracker · August 13, 2018, 3:01pm

In actual fact, under the condition you set (the <CR><LF> pair), Don's version works as predicted. You mentioned Cygwin, so the odds are that your text files use that pair as line terminators.
Don will be using a 2007 version of bash on OSX 10.13.6 and I am using the current Linux Mint version.

bazza@amiga-MacBookPro:~$ bash --version
GNU bash, version 4.4.19(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
bazza@amiga-MacBookPro:~$ cd ~/Desktop/Code/Shell
bazza@amiga-MacBookPro:~/Desktop/Code/Shell$ ./split_lines.sh
this is ok
this line is divided at position 30. The same as this one,also position 30
the rest of lines are ok
with different lengths
The longest ones are always splitted at same position
bazza@amiga-MacBookPro:~/Desktop/Code/Shell$ _

If you are now suggesting that there may be any one of the four line termination methods:
1) <CR>
2) <LF>
3) <CR><LF>
4) <LF><CR>
...then this is a completely different requirement.
EDIT:
What does the "t" denote?
Are we expecting "\t" tabs as well?

------ Post updated at 08:01 PM ------

Not sure if this adds to the end of my other post but here goes:
Using a modified version of Don's code for UNIX style newlines only:

awk '
substr($0, 29, 1) != "" {
        printf("%29.29s", $0)
        next
}
1' < /tmp/text

Results Linux Mint 19:

bazza@amiga-MacBookPro:~$ cd ~/Desktop/Code/Shell
bazza@amiga-MacBookPro:~/Desktop/Code/Shell$ ./split_lines.sh
this is ok
this line is divided at position 30. The same as this one,also position 30
the rest of lines are ok
with different lengths
The longest ones are always splitted at same position
bazza@amiga-MacBookPro:~/Desktop/Code/Shell$ _

RudiC · August 13, 2018, 5:22pm

Let's stop poking in the dark - data needed!

Off the top of my head - post the result of od -ctx1 yourinputfile so we can see what we're dealing with.
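For reference, a small sketch of what such a dump shows for the two common terminator styles. The file names under /tmp are made up for the illustration; od -c prints a CR as \r and an LF as \n.

```shell
# Create two tiny sample files (hypothetical names), one per terminator style
printf 'ok\r\n' > /tmp/crlf_sample.txt
printf 'ok\n'   > /tmp/lf_sample.txt

# -c shows each byte as a character or escape; add -t x1 for hex as well
od -c /tmp/crlf_sample.txt    # the data line shows: o k \r \n
od -c /tmp/lf_sample.txt      # the data line shows: o k \n

# Count the lines that end in CR: a quick CR/LF check for a large file
awk '/\r$/ { n++ } END { print n + 0 }' /tmp/crlf_sample.txt
```

On a 450000-line file, piping the dump through head keeps the output manageable.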

qranumo · August 14, 2018, 2:28am

Hi all,

First of all, sorry if I didn't explain myself properly. My original file has <CR><LF> as the line termination pattern for every single line. But since Don said it would not work with <LF>, I removed them using tr:

cat shorttest1.txt | tr -d "\n" > nolf.txt

And after using the dos2unix command, the resulting file had only <LF> as line terminators. That's why I tried three different formats.

My bash version:

$ bash --version
GNU bash, version 4.4.12(3)-release (i686-pc-cygwin)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

And finally, I tried wisecracker's modification and it works:

$ awk '
substr($0, 29, 1) != "" {
        printf("%29.29s", $0)
        next
}
1' shorttest1.txt
this is ok
this line is divided at position 30. The same as this one,also position 30
the rest of lines are ok
with different lengths
The longest ones are always splitted at same position

Thanks a lot to everyone!!

Don_Cragun · August 14, 2018, 4:17am

With UNIX text files as input, your code can be simplified to just:

awk '
length() == 29 {
	printf("%s", $0)
	next
}
1' file

on Cygwin. You'll still need to use /usr/xpg4/bin/awk or nawk instead of awk if you want to run this on a Solaris system.

RudiC · August 14, 2018, 8:38am

Or, (untested!):

awk '{ORS = length() == 29?"":RS} 1' file
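Since the one-liner above is marked untested, here is a sketch that runs the same ORS idea against the sample from the first post. The function name join_at and the width variable are illustrative additions, and the ternary is parenthesized for readability. The value 29 fits the sample; the value for the real file would have to be confirmed from its own line lengths.

```shell
# join_at WIDTH [FILE...]: suppress the newline after every line of exactly WIDTH characters
# (join_at is a hypothetical helper name; input is assumed to have LF terminators)
join_at() {
    width=$1
    shift
    awk -v width="$width" '{ ORS = (length($0) == width ? "" : RS) } 1' "$@"
}

# Sample input from the first post
printf '%s\n' \
    'this is ok' \
    'this line is divided at posit' \
    'ion 30. The same as this one,' \
    'also position 30' \
    'the rest of lines are ok' \
    'with different lengths' \
    'The longest ones are always s' \
    'plitted at same position' | join_at 29
```

For this sample the output has five lines, matching the expected output in the first post.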