Replace CRLF between pipe (|) delimiter with LF

Hi Folks!

Need a solution for the following :-

Source data
-------------

123|123|<CRLF><CRLF><CRLF>|321<CRLF>

Required output
------------------

123|123|<LF><LF><LF>|321<CRLF>

<CRLF> represents carriage return
<LF> represents line feed

Been hunting high and low for a proper awk or sed statement to get the ball rolling, but have not yet turned up anything suitable.

Appreciate your expertise!

Zz

Does <CRLF> represent a single carriage return (a binary control character) or a <CR><LF> combination? Which should persist at the line end? Are you aware that both the original data and the result will be difficult to handle with the usual *nix text tools?
Are there more lines like above in your file? How are those separated?

Hi Rudi,

<CRLF> represents the binary control character for carriage return. ^M character from a vi perspective.

The record delimiter is <CRLF> and should remain as it is. It is the <CRLF> within a field (between two pipes) that should be converted to <LF>.

And yes, unfortunately there are many more lines in the file with similar issues in the data.

I do understand it's going to be tricky to handle with the usual Unix tools, but I am looking for a possibility, if any, just to try it out :)

Zz

Sorry to keep nagging. How are the lines separated, and how does that differ from the in-field control characters? Are you sure there's NO <LF> char?
Please post the output of

od -tx1c file
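To illustrate what such a dump reveals, here is a small sketch on made-up sample bytes (not the poster's file). In the hex row, 0d is <CR> and 0a is <LF>, so a <CR><LF> pair shows up as \r followed by \n in the character row.

```shell
# Hypothetical sample: one record with a CR LF inside the third field
# and a CR LF record terminator. -An drops the offset column.
printf '123|123|\r\n|321\r\n' | od -An -tx1c
```

If the dump showed 0d without a following 0a, the file would contain bare carriage returns instead, and the fix would have to be different.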

No worries Rudi! I am the needy one here hehe..

0000000  32  30  31  36  2d  31  31  2d  33  30  7c  32  30  31  36  2d
          2   0   1   6   -   1   1   -   3   0   |   2   0   1   6   -
0000020  32  30  31  37  7c  32  30  31  36  2d  31  31  2d  33  30  7c
          2   0   1   7   |   2   0   1   6   -   1   1   -   3   0   |
0000040  31  32  33  34  7c  73  6f  6d  65  66  69  6c  65  2e  74  78
          1   2   3   4   |   s   o   m   e   f   i   l   e   .   t   x
0000060  74  7c  50  72  6f  64  75  63  74  69  6f  6e  7c  4e  6f  7c
          t   |   P   r   o   d   u   c   t   i   o   n   |   N   o   |
0000100  7c  7c  4c  4f  7c  7c  43  65  6e  74  65  72  7c  7c  4e  6f
          |   |   L   O   |   |   C   e   n   t   e   r   |   |   N   o
0000120  7c  7c  7c  31  32  33  34  7c  49  6d  70  6f  72  74  61  6e
          |   |   |   1   2   3   4   |   I   m   p   o   r   t   a   n
0000140  74  7c  3c  20  24  32  30  20  4d  69  6c  6c  69  6f  6e  7c
          t   |   <       $   2   0       M   i   l   l   i   o   n   |
0000160  51  75  61  72  74  65  72  6c  79  7c  7c  7c  7c  32  30  31
          Q   u   a   r   t   e   r   l   y   |   |   |   |   2   0   1
0000200  31  2d  30  32  2d  32  34  7c  7c  7c  53  6f  6d  65  20  64
          1   -   0   2   -   2   4   |   |   |   S   o   m   e       d
0000220  65  73  63  72  69  70  74  69  6f  6e  20  68  65  72  65  7c
          e   s   c   r   i   p   t   i   o   n       h   e   r   e   |
0000240  0d  0a  0d  0a  0d  0a  0d  0a  0d  0a  0d  0a  55  70  64  61
         \r  \n  \r  \n  \r  \n  \r  \n  \r  \n  \r  \n   U   p   d   a
0000260  74  65  20  73  6f  6d  65  74  68  69  6e  67  7c  74  65  73
          t   e       s   o   m   e   t   h   i   n   g   |   t   e   s
0000300  74  66  69  6c  65  2e  74  78  74  7c  48  69  73  68  61  6d
          t   f   i   l   e   .   t   x   t   |   H   i   s   h   a   m
0000320  0d  0a
         \r  \n
0000322

This is a sample record as it appears in the file. The CRLF can appear in any one of the columns prior to the last.

Zz

That's one single line, obviously. And, obviously, as anticipated, we're talking about <CR><LF> combinations. How do you tell one line from another? Do they all have the same field count? Do they all have the same <CR> count?

Yes Rudi. It is a single sample record.

The field count should be consistent. As in, if there are 7 fields there ought to be 6 pipes in the data and that is how you group a record as one.

For example :-

2016-11-30|2016-2017|2016-11-30|123|123.xlsm|Production|No|||AHB||Center||No|||2222|Unit Important|< $20 Million|Quarterly||||2011-02-24|||Some descripto|





Mandatory Fiel|xlsm|Hisham
2016-11-30|2016-2017|2016-11-30|3123|123.xlsm|Production|No|||AHB||Center||No|||2222|Unit Important|< $20 Million|Quarterly||||2011-02-24|||Some descripto|





Mandatory Fiel|xlsm|Hisham

Each record has 30 fields (29 pipes). So an entire record should contain 29 pipes for it to be considered one single record.
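The pipe-count rule can be checked line by line before attempting any fix. Below is a minimal sketch, assuming a made-up sample with 4 fields (3 pipes) per record instead of 30; the function name count_pipes is hypothetical. Physical lines with fewer pipes than expected are fragments of a record that was split by an in-field CRLF.

```shell
# count_pipes: print the number of pipe characters on each physical line.
# gsub() returns the number of replacements, so replacing "|" with "|"
# counts the pipes without changing the line.
count_pipes() {
  awk '{ n = gsub(/\|/, "|"); printf "line %d: %d pipes\n", NR, n }'
}

# Hypothetical sample: record 1 is intact, record 2 is split in two.
printf 'a|b|c|d\r\ne|f\r\n|g|h\r\n' | count_pipes
# line 1: 3 pipes
# line 2: 1 pipes
# line 3: 2 pipes
```

Here lines 2 and 3 together add up to the 3 pipes of one full record.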

Which is where the question comes in: how can we convert only the <CRLF> that sits between the pipes?

Zz

Try

awk -F\| '{while (NF<30 && (getline X) > 0) $0 = $0 X; gsub ("\r", "\n"); sub ("\n$", "\r")} 1' file
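For readers who want to see how this one-liner works, here is a commented sketch of the same idea. Assumptions: the function name join_records and the variable names are hypothetical, the field count is passed as an argument, and the sample uses 4 fields instead of 30. The sketch also stops reading when getline hits end of file, so an incomplete last record cannot loop forever.

```shell
# join_records N: read CRLF-terminated, pipe-delimited text on stdin and
# print one record per N fields, with in-field CR LF turned into LF.
join_records() {
  awk -F'|' -v want="$1" '
  {
    # awk splits its input on LF, so every physical line still ends in CR.
    # Append further physical lines until the record has enough fields.
    while (NF < want && (getline nxt) > 0)
      $0 = $0 nxt          # assigning $0 re-splits it and updates NF
    gsub("\r", "\n")       # each remaining CR marks a former CR LF: make it LF
    sub("\n$", "\r")       # the last one is the record terminator: restore CR
    print                  # print appends LF, so the record ends in CR LF
  }'
}

# Hypothetical sample: the first record has two CR LF pairs in field 3.
printf 'a|b|\r\n\r\n|d\r\ne|f|g|h\r\n' | join_records 4 | od -An -c
```

Note the limitation that follows from counting fields: a CRLF inside the last field cannot be told apart from a record terminator, so this only works when the breaks occur before the last column.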

Did not work unfortunately Rudi.

I have attached a sample file to make life easier.

Zz

WHAT "did not work"?
It did exactly what was requested when I tested it - inside a line just <LF>, at the end <CR><LF>.

EDIT: And it does for your test2.txt file, too.
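One way to sanity-check a result like this is to count the control characters. This is a sketch on made-up sample bytes, and the helper names count_cr and count_lf are hypothetical. After the fix, every <CR> should be a record terminator, so the <CR> count should equal the number of records, while the <LF> count is higher by the number of in-field line breaks.

```shell
# Count CR and LF bytes on stdin. tr -cd deletes everything except the
# named character; the final tr strips the padding some wc versions add.
count_cr() { tr -cd '\r' | wc -c | tr -d ' '; }
count_lf() { tr -cd '\n' | wc -c | tr -d ' '; }

# Hypothetical already-fixed sample: 2 records, 2 in-field line feeds.
printf 'a|b|\n\n|d\r\ne|f|g|h\r\n' | count_cr   # prints 2
printf 'a|b|\n\n|d\r\ne|f|g|h\r\n' | count_lf   # prints 4
```

Running the same two counts on the untouched input gives equal numbers, since every line break there is a <CR><LF> pair.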

You are right Rudi! It works!!

My bad. I was looking at the same file after running the command. I was mixed up with some other thoughts at the moment.

Will run it through the entire file and see how it goes.

Thanks loads for your time and expertise Rudi!

Zz