Delimited records split across multiple lines

ginrkf · May 5, 2016, 1:51am

Hi

I am using a delimited sequence file, and the delimiter is a pipe. For some records, the value of one column is split across several lines, as shown below:

"113"|"0155"|"2016-04-27 07:59:04"|"1930"|"TEST@TEST"|"2016-04-27 11:04:04.357000000"|"BO"|"Hard BO"|"10"|"5.1.0 e This is a permanent error. Please verify the address(es) and try again.
<TEST>:
123.123.12 does not like recipient.
Remote host said: 123 Address rejected test@test
Giving up on 1200.
--- "|"I"|"191"|"212"|"DAM"|"PIl"

But I expect each record to be on a single line before the file is processed:

"113"|"0155"|"2016-04-27 07:59:04"|"1930"|"TEST@TEST"|"2016-04-27 11:04:04.357000000"|"BO"|"Hard BO"|"10"|"5.1.0 e This is a permanent error. Please verify the address(es) and try again.<TEST>:123.123.12 does not like recipient.Remote host said: 123 Address rejected test@testGiving up on 1200.--- "|"I"|"191"|"212"|"DAM"|"PIl"

Please help me with this.

RudiC · May 5, 2016, 3:26am

How are those files produced?

ginrkf · May 5, 2016, 3:48am

It's a file that we receive from a different source. I am not sure how it is produced. Usually the error occurs in the data of the 10th column.

ginrkf · May 7, 2016, 8:53pm

Could someone please help with this?

jim_mcnamara · May 7, 2016, 9:16pm

Not much to go on. You have spurious line feeds in the middle of the line. I am guessing.
Assumption: if FS = "|" and the first column is a quoted number, that marks the correct start of a line. Anything else is bad.

Is this correct? And is the file from a Windows application, or has it been changed by anything like Windows FTP? I ask because the carriage control may be messed up as well.
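Under that assumption, one possible sketch is to start a new output line only when an input line begins with a quoted number followed by a pipe, and to glue every other line onto the record in progress. The three-field sample data and the temporary file are made up for illustration, and a continuation line that happens to start with a quoted number would be misread as a new record.

```shell
# Hypothetical sample: record 1 is split inside its second field.
tmp=$(mktemp)
printf '%s\n' '"1"|"a' 'b"|"c"' '"2"|"d"|"e"' > "$tmp"

# A line starting with "digits"| opens a new record; any other line is
# appended to the record in progress.
joined=$(awk '
  /^"[0-9]+"\|/ { if (NR > 1) printf "\n" }
  { printf "%s", $0 }
  END { if (NR > 0) printf "\n" }
' "$tmp")

printf '%s\n' "$joined"
rm -f "$tmp"
```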

Don_Cragun · May 7, 2016, 9:24pm

If you provide a clear description of your problem (instead of just one sample of a problem with no description), you will be more likely to get a response.

What are the line delimiters in your input file? UNIX single newline character delimiters or DOS carriage-return/newline character pair delimiters? Are the delimiters on split lines the same as in a correctly formatted line? Or, is the delimiter on split lines different from the delimiter on correctly formatted lines? What delimiters do you need in your output file?
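One way to answer the line-delimiter question is to count the lines that end in a carriage return. This is only a sketch: the two-line sample file is made up, and the real input file would take its place.

```shell
# Hypothetical sample: the first line ends in CR LF, the second in LF only.
tmp=$(mktemp)
printf '"1"|"a"\r\n"2"|"b"\n' > "$tmp"

# Count lines whose last character before the newline is a carriage return.
crlf_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$tmp")

echo "lines ending in CR: $crlf_lines"
rm -f "$tmp"
```

A count of zero suggests plain UNIX newlines; a count equal to the number of lines suggests the file still has DOS line endings.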

How are we supposed to know when a line is complete?

Is your verification program supposed to know that there should be a specific number of fields in each input line? Is that number of fields the same for every file your verification program will process?

Is an invalid input line ALWAYS split between a pair of double quotes? Can we assume that a line needs to be combined with the next line from the input file if and only if an input line does not end with a double quote character?

Are invalid records always split by converting a space to a newline? How do we know whether or not a character needs to be added when lines are joined? Is a space character ALWAYS supposed to be added when lines are joined?

What operating system and shell are you using?

What have you tried to solve this problem on your own?

Scrutinizer · May 8, 2016, 5:11am

The standard way of tackling this would be something like this:

awk -F\| 'NF<15{while(getline s>0) $0=$0 s}1'  file

This should work with your sample.

However, as others have pointed out, unless you provide more information it will be difficult to tell whether this solves your problem.

---
Note that this will not work if the newlines appear in the very last field.

Also note that the CSV format allows newlines within quoted fields, so the sample you posted seems to be within specification. By joining the lines you are effectively changing the content by removing the newlines.

If you want to process the file, you do not need to remove the newlines first. For example, to print field 10 (without the enclosing double quotes), you could do something like this:

$> awk -F\| 'NF<15{while(getline s>0) $0=$0 RS s}{gsub(/^"|"$/,x,$10); print $10}' file
5.1.0 e This is a permanent error. Please verify the address(es) and try again.
<TEST>:
123.123.12 does not like recipient.
Remote host said: 123 Address rejected test@test
Giving up on 1200.
--- 
$>

RudiC · May 8, 2016, 6:39am

The above works well for the one-record sample in post #1, as indicated. If you want to apply it to multiple records, try a small adaptation (still far from bulletproof):

awk -F\| '{while (NF < 15 && getline s>0) $0=$0 RS s}1'  file

And, as stated before in this thread, a more precise and detailed specification would help to tailor a solution for you.

Scrutinizer · May 8, 2016, 7:28am

Yes indeed, that is just silly. Thanks, RudiC. That is of course how it should be (but then RS should not be there):

awk -F\| '{while (NF < 15 && getline s>0) $0=$0 s}1' file

or the 2nd example with RS maintained:

awk -F\| '{while(NF<15 && getline s>0) $0=$0 RS s}{gsub(/^"|"$/,x,$10); print $10}' file
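To see the corrected loop handle more than one record, here is a small self-contained sketch. It assumes 3 fields per record instead of 15 to keep the sample short; the variable n, the sample data and the temporary file are illustrative only.

```shell
# Hypothetical sample: two 3-field records, the first split inside field 2.
tmp=$(mktemp)
printf '%s\n' '"1"|"split' 'field"|"x"' '"2"|"ok"|"y"' > "$tmp"

# Keep appending the next line until the record has n fields.
# Assigning to $0 makes awk re-split the record, so NF is updated each time.
joined=$(awk -F'|' -v n=3 '{ while (NF < n && (getline s) > 0) $0 = $0 s } 1' "$tmp")

printf '%s\n' "$joined"
rm -f "$tmp"
```

As noted earlier in the thread, this still cannot detect a newline inside the very last field, because the record already has all its pipes at that point.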

bakunin · May 8, 2016, 5:54pm

If the "records" are defined not only by a fixed number of fields but also by a fixed length (each record has the same number of characters) and, of course, if the line breaks are real UNIX linefeeds rather than DOS CR/LFs or whatever, then this is the standard textbook application for the fmt utility, no?

fmt -<number of characters the record is supposed to have> /path/to/file

I hope this helps.

bakunin

ginrkf · May 8, 2016, 7:58pm

Hi All

Sorry for not providing the correct information. Please find the details of my input file below. I receive an encrypted file, which I decrypt and then convert with DOS2UNIX. Once that is done, I load it using my job.

File Details
-------------

The line delimiter is \n.
The field delimiter is |.
The line delimiter on the split lines is the same as on a correctly formatted line.
In my output file I need the same line delimiter as in the input file, which is \n.
The number of fields is the same for all records (15), and all column values are enclosed in double quotes.
Q: Is an invalid input line ALWAYS split between a pair of double quotes? Can we assume that a line needs to be combined with the next line from the input file if and only if an input line does not end with a double quote character?
A: We can't do that, as all the records come with double quotes.
Q: Are invalid records always split by converting a space to a newline? How do we know whether or not a character needs to be added when lines are joined? Is a space character ALWAYS supposed to be added when lines are joined?
A: Spaces do not matter; I just want the split lines appended into a single line.

I hope this information helps. I am pretty new to this process, so please let me know if you need more details.

Don_Cragun · May 8, 2016, 8:09pm

If the broken records always break inside a double quoted string as in your sample input, an easy fix is just to use:

awk '{printf("%s%s", $0, (substr($0, length, 1) == "\"") ? "\n" : "")}' file

If the broken records sometimes break immediately before or after a pipe symbol (as long as it isn't after the 14th pipe symbol), you can use most of the other suggestions in this thread.

You didn't answer my question about operating system and shell. So, if you're using a Solaris/SunOS operating system, you'll need to change the above suggestion to use /usr/xpg4/bin/awk or nawk instead of awk.
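As a rough check of the quote-based join, the sketch below runs it on a made-up sample and then counts the output lines that do not have the expected number of fields. It assumes 3 fields per record rather than 15 to keep the data short; the variable n, the sample and the temporary file are illustrative.

```shell
# Hypothetical sample: record 1 is broken twice inside a quoted field.
tmp=$(mktemp)
printf '%s\n' '"1"|"5.1.0 permanent error.' '<TEST>:' 'Giving up."|"x"' '"2"|"ok"|"y"' > "$tmp"

# Emit a newline only after lines that end in a double quote.
joined=$(awk '{ printf("%s%s", $0, (substr($0, length, 1) == "\"") ? "\n" : "") }' "$tmp")
printf '%s\n' "$joined"

# Sanity check: count joined records whose field count differs from n.
bad=$(printf '%s\n' "$joined" | awk -F'|' -v n=3 'NF != n { c++ } END { print c + 0 }')
echo "records with an unexpected field count: $bad"
rm -f "$tmp"
```

A non-zero count would suggest that some break did not fall inside a quoted string, in which case one of the field-counting suggestions may be a better fit.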

Aia · May 8, 2016, 10:26pm

Save as cleaner.pl
Run as perl cleaner.pl ginrkf.file > ginrkf.filtered

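#!/usr/bin/env perl

use strict;
use warnings;

my $len = 15;    # number of fields per record
{
    local $/ = '|';    # read pipe-terminated chunks instead of lines
    my $field = 0;
    while (<>) {
        # Count every chunk; the last chunk of the file has no trailing pipe.
        ++$field;
        # Chunk number $len holds the record's last field, its newline and,
        # except at end of file, the first field of the next record.
        my $boundary = $field == $len;
        s/\n//g unless $boundary;
        print;
        # The next record's first field has already been consumed.
        $field = 1 if $boundary;
    }
}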

ginrkf · May 9, 2016, 11:12pm

Thanks a lot for all the help. The code below works for my issue:

awk '{printf("%s%s", $0, (substr($0, length, 1) == "\"") ? "\n" : "")}' file