Removing inserted newlines from a field of a fixed-width file

enigma_1 · August 18, 2009, 5:23pm

Hi champs!

I have a fixed-width file in which the records look like this:

11111 <fixed spaces such as 6> description for 11111 <fixed spaces such as 6> some more field to the record of 11111
22222 <fixed spaces such as 6> description for 22222 <fixed spaces such as 6> some more field to the record of 22222
33333 <fixed spaces such as 6> description 
for 33333 <fixed spaces such as 6> some more field to the record of 33333
44444 <fixed spaces such as 6> description for 44444 <fixed spaces such as 6> some more field to the record of 44444

As you can see, the record for 33333 is split across two lines because a newline was inserted in its description field. I want these extraneous newlines removed from the description field wherever they appear in the file.
One clue: check positions 11-32 of each record and, if a newline is present there, strip it.
Any other solution is welcome too.
I want the output to be:

11111 <fixed spaces such as 6> description for 11111 <fixed spaces such as 6> some more field to the record of 11111
22222 <fixed spaces such as 6> description for 22222 <fixed spaces such as 6> some more field to the record of 22222
33333 <fixed spaces such as 6> description for 33333 <fixed spaces such as 6> some more field to the record of 33333
44444 <fixed spaces such as 6> description for 44444 <fixed spaces such as 6> some more field to the record of 44444

The line break will not necessarily appear right after 'description'; it can appear anywhere in the second field. If it appears at all, though, it will be in the second field only.
This is just a sample record for illustration, so the code should not depend on its content. The code can depend on positions if required.
It is a fixed-width file, which means each field is identified by its length within the record.

Please let me know if you need more clarification.

vgersh99 · August 18, 2009, 5:54pm

To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type the code tags [code] and [/code] by hand.)

Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut-and-paste, edit any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums

---------- Post updated at 05:54 PM ---------- Previous update was at 05:38 PM ----------

something to start with...

'len' is a known/expected length of ALL the records (assuming they are of the same length) - defaulted to '73'.

Assumption: there's only ONE extra new-line per 'broken' record.
nawk -f enigma.awk myFile
OR
nawk -v len=63 -f enigma.awk myFile

enigma.awk:

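BEGIN {
  # expected record length; override with -v len=N
  len=(!len)?73:len
}
# a line shorter than len is a fragment of a broken record:
# save the first fragment in s, then print it joined to the second
length < len {
   if (length(s)) { print s OFS $0;s=""}
   else s=$0
   next
}
# full-length lines are printed unchanged
1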
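
Before running the repair, it may help to see which lines fall short of the expected length. A minimal sketch, assuming plain awk behaves like the nawk used above; the sample data and the 10-character record length are made up for illustration:

```shell
# Hypothetical sample: two complete 10-character records and one split in two.
sample=$(mktemp)
printf '%s\n' 'AAAAAAAAAA' 'BBBB' 'BBBBBB' 'CCCCCCCCCC' > "$sample"

# Print the line number and length of every line shorter than len.
short_lines=$(awk -v len=10 'length($0) < len { print NR ": " length($0) }' "$sample")
printf '%s\n' "$short_lines"
rm -f "$sample"
```

Runs of consecutive short lines in that listing show how many pieces each broken record was split into.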

enigma_1 · August 19, 2009, 2:27pm

Thanks vgersh !!

The code you provided worked for the records that are broken into two.
But I have some more problems and hope you can help.
As you mentioned in your assumption, the code expects a record to be split into two lines only.
Unfortunately, my file has one record which is split into three lines.

Sample:


33333 <fixed spaces such as 6> description 
for 
33333 <fixed spaces such as 6> some more field to the record of 33333

which needs to be :

33333 <fixed spaces such as 6> description for 33333 <fixed spaces such as 6> some more field to the record of 33333

Could enigma.awk be modified to handle a record broken into three lines? If I may ask for more, could the code handle a record broken into any number of lines?
I guess you need some way to identify the start of each record.

In my file, each new record starts at column (length) 16. If a line starts before column 16, it is a continuation of the previous record.

Thank you once again!

vgersh99 · August 19, 2009, 3:03pm

enigma.awk:

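BEGIN {
  # expected record length; override with -v len=N
  len=(!len)?73:len
}
# a short line is a fragment: keep appending fragments to s
# and print s once it has reached the expected length
length < len {
   if (length(s)) { s=s OFS $0}
   else s=$0
   if (length(s) == len) { print s; s=""}
   next
}
# full-length lines are printed unchanged
1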
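
As an alternative to counting lengths, the rule from the question (a new record starts at column 16, and anything starting earlier continues the previous record) could be applied directly. This is only a sketch, assuming that "starts at column 16" means the first 15 characters of a record's first line are blanks. The sample data and the names blank, rec and have are made up for illustration. Pieces are joined with no separator, on the assumption that the stray newline was inserted rather than replacing a character.

```shell
sample=$(mktemp)
# Hypothetical data: each record starts with 15 blanks;
# the second record is split into three lines.
printf '%s\n' \
  '               11111 first' \
  '               22222 desc' \
  'ription ' \
  'tail' \
  '               33333 third' > "$sample"

joined=$(awk '
  BEGIN { blank = sprintf("%15s", "") }   # a string of 15 spaces
  # A line whose first 15 characters are all blanks starts a new record.
  substr($0, 1, 15) == blank {
      if (have) print rec                 # flush the previous record
      rec = $0; have = 1
      next
  }
  # Anything else continues the current record.
  { rec = rec $0; have = 1 }
  END { if (have) print rec }             # flush the last record
' "$sample")
printf '%s\n' "$joined"
rm -f "$sample"
```

One caveat: a continuation line that itself begins with 15 or more blanks would be mistaken for a new record, so this only fits if the break never falls inside a long run of padding.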

Franklin52 · August 19, 2009, 3:06pm

What is the length of a record?

Regards

bakunin · August 19, 2009, 3:59pm

I might be wrong, but isn't this the very type of problems the "fmt" simple optimal formatter tool was created for?

"fmt -w <your desired line length here>" should do the trick.

I hope this helps.

bakunin

vgersh99 · August 19, 2009, 4:10pm

good tip - forgot about fmt - thanks.

enigma_1 · August 19, 2009, 4:57pm

The length of each valid record in the above example and in vgersh's solution is 73.
In my original file it is 1700 or 1699. Any record of length 1700 or 1699 is valid and need not be processed.

---------- Post updated at 04:53 PM ---------- Previous update was at 04:49 PM ----------

vgersh,

The output from the modified script is 0 records; it skips every row in the file.
Also, as I mentioned, a valid record has a length of 1700 or 1699 and starts from column 16 onwards.
Any length shorter than this is a problem, i.e., the record has been split.

Any record starting before column 16 is a continuation of the previous one.

The first code worked for records broken into two, but there are also records broken into three.
It would be great if you could tweak the code again.

vgersh99 · August 19, 2009, 6:18pm

there's no 'tweaking' needed:

nawk -v len=1700 -f enigma.awk myFile

enigma.awk:

BEGIN {
  len=(!len)?73:len
}
length < len {
   if (length(s)) { s=s OFS $0}
   else s=$0
   if (length(s) == len) { print s; s=""}
   next
}
1
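
Two details of the file described above may still trip this script up: valid records can be 1699 or 1700 characters long, and joining with OFS adds a space per join, so a rebuilt record may never be exactly len characters. A possible variation, offered only as a sketch: treat anything at least min characters long as complete and join the pieces with no separator. The variable names min and buf and the 9/10-character sample data are made up for illustration; for the real file the call would presumably use -v min=1699.

```shell
sample=$(mktemp)
# Hypothetical data: valid records are 9 or 10 characters long;
# the second record is split in three, the fourth in two.
printf '%s\n' 'AAAAAAAAAA' 'BBB' 'BBB' 'BBB' 'CCCCCCCCCC' 'DDDD' 'DDDDDD' > "$sample"

repaired=$(awk -v min=9 '
  {
      buf = buf $0                    # join pieces with no separator
      if (length(buf) >= min) {       # long enough to be a full record
          print buf
          buf = ""
      }
  }
  END { if (length(buf)) print buf }  # leftover fragment, if any
' "$sample")
printf '%s\n' "$repaired"
rm -f "$sample"
```

This cannot tell a complete 1699-character record from a 1700-character record whose final character landed on the next line, so a one-character leftover in the output would be a sign of that case.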