Breaking long lines into (characters, newline, space) groups

rowie718 · May 14, 2009, 10:58am

Hello,

I am currently trying to edit an ldif file. The ldif specification states that a newline followed by a space indicates that the following line is a continuation of the current line. So, in order to search and replace properly and edit the file, I open the file in TextWrangler, search for "\r " and remove it, thus turning all continued lines into single lines. That's the first step. I make my changes to the ldif file at that point.
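For reference, the unfolding step (joining each continuation line onto the line before it) can also be done outside the editor. A minimal sketch in awk, assuming continuation lines start with exactly one space as described above; the function name unfold_ldif is made up for illustration:

```shell
# unfold_ldif: hypothetical helper, not part of any LDIF tool.
# Joins every line that starts with a single space onto the previous line.
unfold_ldif() {
    awk '
        /^ / { buf = buf substr($0, 2); next }  # continuation: append without the leading space
        NR > 1 { print buf }                    # a new logical line starts: flush the previous one
        { buf = $0 }
        END { if (NR > 0) print buf }           # flush the last logical line
    ' "$@"
}
```

Something like unfold_ldif myfile > unfolded.ldif would then produce the single-line form. Blank lines are passed through unchanged.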

Now, after editing, I want to break any lines with more than 79 characters, (some of which are hundreds of characters long) into this: 79 characters, newline, space, next 79 characters, newline, space, next 79 characters, newline, space, etc.

using this simple sed command:

sed 's/./\
 /80' myfile > newfile

works for the first 79 characters of line x, breaks it properly, but then moves on to the next line in the ldif, leaving line x broken into: 79 characters, newline, space, remaining chunk of line x which is hundreds of characters, next line in ldif. Only partial success!

So here's the question. Is there a way to use sed to run this command at every 79th character until the end of the line? If not, should I instead use a loop in the script with some sort of conditional statement, like: if there are lines longer than 79 characters, rerun the sed command (so that it breaks the remaining hundreds of characters that were not broken in the original sed run), and continue looping until all lines are broken into (79 characters, newline, space) chunks? How could I set up that condition? I don't know how to search for lines longer than x characters.

Thanks a lot for any help on this!
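On the last point, finding lines longer than a given length is short in awk. A small sketch; the function name long_lines and the output format are just for illustration, and note that some awk implementations count bytes rather than characters:

```shell
# long_lines: hypothetical helper; reads stdin and prints "line-number:length"
# for every line longer than the limit given as $1 (default 79).
long_lines() {
    awk -v max="${1:-79}" 'length($0) > max { printf "%d:%d\n", FNR, length($0) }'
}
```

For example, long_lines 79 < myfile lists the offending lines; no output means every line already fits.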

rowie718 · May 14, 2009, 3:02pm

I came up with this script. It seems hackish and very inefficient, but it works. I would love for someone to help me come up with a better way since this script takes almost 10 full minutes to parse a text file into less than 7000 lines.

#!/bin/ksh

echo "where is the ldif file located that you would like to parse?"
read response
ldiffile=$response

while read line
do

x=`echo $line | wc -c`

while [ $x -gt 79 ]
do

sed 's/./\
 /79' $ldiffile > /test.ldif
mv /test.ldif $ldiffile
x=$x-79

done

done < $ldiffile

I just realized this script is substituting the 79th character with the newline and space. From what I've been reading, I can add an ampersand before the newline escape in the sed replacement pattern. However, when I put an ampersand there, it ruins the ldif file, cutting lines and inserting groups of blank lines. I've searched through countless forums, most of which suggest using escaped parentheses to remember a pattern and then \1 to recall it, with the newline after that. It doesn't work for me. Whichever way I try to recall the 79th character in the replacement string and add to it, I get this crazy blank line effect on my file. I am on OS X 10.4.11 Server. Frustrating! How do I make it so the newline comes after the 79th character instead of replacing it?
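For what it is worth, & in a sed replacement stands for the text that was matched, so placing it before the escaped newline keeps the matched character instead of dropping it. A sketch of a single pass, best tried with a small width such as 3 so the effect is easy to see. The name break_after is made up, and it still makes only one break per line, which is the limitation described in the first post:

```shell
# break_after N: hypothetical helper that inserts newline + space after the
# Nth character of each line read from stdin. "&" re-emits the matched
# character, so nothing is lost. Only one break is made per line.
break_after() {
    sed 's/./&\
 /'"$1"
}
```

With the input abcdefgh and N=3, the line comes out as abc followed by a second line holding a space and defgh.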

Thanks again for any help you can offer!

cfajohnson · May 14, 2009, 8:21pm

For a file that size, you should use awk.

Instead of reading into response and then copying it to ldiffile, why not simply:

read ldiffile

You don't need an external command to get the length of a variable's contents:

x=${#line}
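To illustrate the difference: ${#line} is expanded by the shell itself, while the echo | wc -c pipeline starts extra processes and also counts the newline that echo appends. A small sketch with a made-up sample value:

```shell
line='dn: cn=example'

# Length computed by the shell, no external command involved.
shell_len=${#line}

# Length computed with wc; echo adds a trailing newline, so this is one more.
# Some wc implementations pad the number with spaces, hence the arithmetic
# expansion to normalise it.
wc_len=$(( $(echo "$line" | wc -c) ))

echo "shell: $shell_len, wc: $wc_len"
```

The off-by-one from the trailing newline is worth keeping in mind when comparing against a line-length limit.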

awk 'length > 79 { while ( length($0) > 79 ) {
    printf "%s\n ", substr($0,1,79)
    $0 = substr($0,80)
  }
  if (length) print
  next
}
{print}' "$FILE"

rowie718 · May 15, 2009, 12:31am

Thank you so much for your help cfajohnson!

I put together your suggestions and tested them. It's almost there, but there were 2 problems. The first I fixed fairly easily. The 79th character is the newline in the original ldif, so I should've expressed it as wanting 78 characters. I deducted 1 from anywhere I saw 79 or 80 in your awk command and that seemed to do the trick. The second problem is trickier. Take a 240 character line as an example. When the awk command breaks it and adds the space at the start of the second chunk, it does not take into account that the last character of that second chunk should be at the same ending position as the first chunk. As it is currently written, all chunks after the first break extend 1 character further to the right because of the space.

Example:

123456789012345678901234567890.....(240 character long string repeating)

currently breaks into :
123456789012345678901234567890123456789012345678901234567890123456789012345678
 901234567890123456789012345678901234567890123456789012345678901234567890123456
 789012345678901234567890123456789012345678901234567890123456789012345678901234
 567890

but should actually end up more like this, so that every line has 78 characters, 
plus newline (including the space we've added):

123456789012345678901234567890123456789012345678901234567890123456789012345678
 90123456789012345678901234567890123456789012345678901234567890123456789012345
 67890123456789012345678901234567890123456789012345678901234567890123456789012
 34567890

The script currently looks like this:

#!/bin/ksh

echo "where is the ldif file located that you would like to parse?"
read ldiffile

awk 'length > 78 { while ( length($0) > 78 ) {
    printf "%s\n ", substr($0,1,78)
    $0 = substr($0,79)
  }
  if (length) print
  next
}
{print}' $ldiffile > /out.txt

Thanks again for your help, I really appreciate it.

ghostdog74 · May 15, 2009, 1:08am

if you have Python, here's an alternative solution

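import textwrap

# width=78 is the total line width; continuation lines get a one-space indent
t = textwrap.TextWrapper(subsequent_indent=" ", width=78)
with open("file") as f:
    for line in f:
        for chunk in t.wrap(line):
            print(chunk)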

output

# more file
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

# ./test.py
123456789012345678901234567890123456789012345678901234567890123456789012345678
 90123456789012345678901234567890123456789012345678901234567890123456789012345
 67890123456789012345678901234567890123456789012345678901234567890123456789012
 34567890

rowie718 · May 15, 2009, 10:24am

Thank you ghostdog,

First let me state that I am totally unfamiliar with python. However, if it solves this problem for me, I would be glad to learn a bit and use it. There are a few issues I noticed upon trying the code you provided, ranked in order of importance:

1) The code seems to eliminate blank lines from the source text. I need it to not do that. Example:

1111

2222

becomes

1111
2222

2) I don't know how to output to a file rather than the standard output. I apologize for the rookie question here.

3) Ideally I would like for there to be a way to interactively input the location of the file so it doesn't need to be hardcoded. If this is too much to ask though, I can live without it.

Generally I would prefer to use sed/awk since I have some familiarity with them and bash scripting, however I will use whatever solutions are presented that fully solve this problem. I really appreciate the assistance.

Cheers.

cfajohnson · May 15, 2009, 12:03pm

awk 'length > 79 {
    n=1
    while ( length($0) > 78 + n ) {
    printf "%s\n ", substr($0,1,78 + n)
    $0 = substr($0,79 + n)
    n=0
  }
  if (length) print
  next
}
{print}' "$FILE"

rowie718 · May 15, 2009, 12:48pm

Awesome! That works perfectly cfajohnson...thanks a bunch for your help with this. Awk is clearly the way to go; it parses the file in a split second.

For anyone who comes across this, and happens to be dealing with ldif files, here is the final script for parsing the file:

#!/bin/ksh

echo "Where is the ldif file located that you would like to parse?"
read source_ldif
echo "Where would you like to output the parsed version to?"
read out_ldif

awk 'length > 78 {
    n=1
    while ( length($0) > 77 + n ) {
    printf "%s\n ", substr($0,1,77 + n)
    $0 = substr($0,78 + n)
    n=0
  }
  if (length) print
  next
}
{print}' "$source_ldif" > "$out_ldif"
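A quick way to check that the folding loses nothing is to unfold the result again and compare it with the input. A rough sketch, assuming the same width of 78 as the script above and an input file that is not already folded. The names fold_ldif, unfold_ldif and roundtrip_ok are made up, and the unfolding rule (drop a newline followed by one space) is the one described at the top of the thread:

```shell
# fold_ldif: same chunking as the script above, reading stdin.
# First chunk is 78 characters, later chunks are 77 plus the leading space.
fold_ldif() {
    awk 'length($0) > 78 {
        n = 1
        while (length($0) > 77 + n) {
            printf "%s\n ", substr($0, 1, 77 + n)
            $0 = substr($0, 78 + n)
            n = 0
        }
    }
    { print }'
}

# unfold_ldif: join lines starting with one space onto the previous line.
unfold_ldif() {
    awk '/^ / { buf = buf substr($0, 2); next }
         NR > 1 { print buf }
         { buf = $0 }
         END { if (NR > 0) print buf }'
}

# roundtrip_ok FILE: succeeds if folding then unfolding reproduces FILE.
roundtrip_ok() {
    fold_ldif < "$1" | unfold_ldif | cmp -s - "$1"
}
```

Usage would be along the lines of roundtrip_ok "$source_ldif" && echo "round trip OK".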

cfajohnson · May 15, 2009, 6:54pm

This version accepts an arbitrary line length:

## adjust length to taste
## or prompt for value
## or get value from environment
## or on the command line
## or wherever
length=66

awk -v x=${length:-79} 'length > x {
    n=1
    while ( length($0) > x - 1 + n ) {
    printf "%s\n ", substr($0,1,x - 1 + n)
    $0 = substr($0,x + n)
    n=0
  }
  if (length($0)) print
  next
}
{print}'

rowie718 · May 16, 2009, 8:54pm

I studied up a bit on awk, and generally I understand the script. There is one part I don't get though, maybe you could explain:

if (length ($0)) print
next

What does this do and why is it necessary? I get what the words mean, but I don't understand its purpose in the script. I tested various source text files with various text content, with lines of all lengths, but no matter what, if I remove this code, the script still seems to work perfectly.

In the variable length script, I didn't understand this either:

x=${length:-79}

What does the colon minus 79 mean? If you want to allow the user to set the length, could you "read x" and then set the variable as x=x-2? If you're setting the length to something else, why does the number 79 enter this version of the script?

Thanks for any explanation, I'd like to understand awk better in general. I will be studying up on it.

cfajohnson · May 16, 2009, 9:11pm

rowie718:

I studied up a bit on awk, and generally I understand the script. There is one part I don't get though, maybe you could explain:
if (length ($0)) print
next
What does this do and why is it necessary? I get what the words mean, but I don't understand its purpose in the script. I tested various source text files with various text content, with lines of all lengths, but no matter what, if I remove this code, the script still seems to work perfectly.

It doesn't work perfectly if the line is missing. Given this text file:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXY
abcdefghijklmnopqrstuvwxyz

If you call the script with a length of 13, this is the result:

abcdefghijklm
 nopqrstuvwxy
 z
ABCDEFGHIJKLM
 NOPQRSTUVWXY
abcdefghijklm
 nopqrstuvwxy
 z

If you remove that line, the result is:

abcdefghijklm
 nopqrstuvwxy
 ABCDEFGHIJKLM
 abcdefghijklm
 nopqrstuvwxy

It prints whatever is left of the line after all the full-length chunks have been removed.

As for x=${length:-79}: that's shell parameter expansion, not part of awk. If length is unset or empty, it substitutes 79.

It's a default value.
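A short sketch of how that expansion behaves in a POSIX-style shell (ksh, bash, sh). The colon matters: with ${length:-79} the default is used when the variable is unset or empty, which is the case after a read where the user just presses return:

```shell
unset length
a=${length:-79}     # unset: default is used

length=
b=${length:-79}     # set but empty: default is still used because of the colon
c=${length-79}      # without the colon, an empty value is kept as is

length=66
d=${length:-79}     # set and non-empty: the value itself is used

echo "a=$a b=$b c=$c d=$d"
```

The variable itself is not modified by this form of expansion; only the expanded value changes.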

rowie718 · May 17, 2009, 3:29am

Thanks for the explanation. I'm learning, but strangely, it seems that our systems are treating this code differently.

This script (without the if and next statements towards the end):

#!/bin/ksh

echo "How many characters per line?"
read length
echo "Where is the ldif file located that you would like to parse?"
read source_ldif
echo "Where would you like to output the parsed version to?"
read out_ldif

awk -v x=${length:-79} 'length > x {
    n=1
    while ( length($0) > x - 1 + n ) {
    printf "%s\n ", substr($0,1,x - 1 + n)
    $0 = substr($0,x + n)
    n=0

  }

}
{print}' "$source_ldif" > "$out_ldif"

applied to your source text:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXY
abcdefghijklmnopqrstuvwxyz

with a length of 13, results in properly parsed text:

abcdefghijklm
 nopqrstuvwxy
 z
ABCDEFGHIJKLM
 NOPQRSTUVWXY
abcdefghijklm
 nopqrstuvwxy
 z

I tried to figure out which version of awk I have, but apparently it is not easy to do so. I am running os x client 10.4.11. On Apple's opensource distribution page, they list "awk-7" as an available download. I wonder how I can find out the version I am using, and if it makes sense that it's a different version of awk that accounts for the difference in output.
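One possible explanation that has nothing to do with the awk version: the result depends on whether next is removed together with the if line. With next still in place, the final {print} rule is never reached for long lines, so the remainder is lost; with both removed, the shortened $0 falls through to {print}. A sketch comparing the two, with made-up function names and x passed as the first argument:

```shell
# Variant 1: "if (length($0)) print" removed, "next" kept.
# The leftover chunk is never printed.
fold_next_only() {
    awk -v x="$1" 'length($0) > x {
        n = 1
        while (length($0) > x - 1 + n) {
            printf "%s\n ", substr($0, 1, x - 1 + n)
            $0 = substr($0, x + n)
            n = 0
        }
        next
    }
    { print }'
}

# Variant 2: both lines removed.
# The leftover chunk is still in $0 and is printed by the last rule.
fold_fallthrough() {
    awk -v x="$1" 'length($0) > x {
        n = 1
        while (length($0) > x - 1 + n) {
            printf "%s\n ", substr($0, 1, x - 1 + n)
            $0 = substr($0, x + n)
            n = 0
        }
    }
    { print }'
}
```

Run with a length of 13 on the three-line sample, the first variant drops the trailing z chunks and glues the next line onto the dangling indent, while the second gives the correctly folded text.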

cfajohnson · May 17, 2009, 9:01am

You're right; it is unnecessary; this works as is:

awk -v x=${1:-79} 'length > x {
    n=1
    while ( length($0) > x - 1 + n ) {
    printf "%s\n ", substr($0,1,x - 1 + n)
    $0 = substr($0,x + n)
    n=0
  }
}
{print}' "$file"

The next exercise is to modify it so that the amount of indent can be specified.

cfajohnson · May 17, 2009, 10:07am

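if [ $# -eq 0 ]
then
  echo "USAGE: ${0##*/} FILE [width [ indent ]]"
  exit 1
fi

file=$1

awk -v width="${2:-79}" -v indent="${3:-1}" '
length > width {
    n = width
    while ( length($0) > n ) {
        ## print n characters, then a newline and the indent
        printf "%s\n%" indent "s", substr($0, 1, n), " "
        ## continue with the character after the ones just printed
        $0 = substr($0, n + 1)
        ## continuation lines are shorter by the width of the indent
        n = width - indent
    }
}
{print}' "$file"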
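The printf line builds its format string by concatenation: with indent=3 the format becomes "%s\n%3s", and a single space printed through %3s is padded to three characters. A tiny sketch of just that part (pad is a made-up name, and the brackets are only there to make the spaces visible):

```shell
# pad N: hypothetical helper; prints N spaces between brackets by padding
# a single space to width N with a format string built at run time.
pad() {
    awk -v indent="$1" 'BEGIN { printf "[%" indent "s]\n", " " }'
}
```

Calling pad 3 prints three spaces between the brackets, and pad 1 prints one.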

rowie718 · May 18, 2009, 8:32am

Awesome. The extra steps you took prompted me to study up on shell parameter expansion and printf formatting, which I now understand much better. Passing the arguments to the script when it is first executed, without using read, is also something I wouldn't have thought to do. I really learned a lot here.....thanks for taking the extra steps cfajohnson, I really appreciate it!