Replace \n char in Data

rajeshkumare · November 20, 2018, 11:05am

The file is pipe-delimited with 17 fields. Any field may contain \n characters in its data (one or more, in one field or in several fields).
I need to replace each \n inside the data with a space, but not the true \n that acts as the line separator.

I tried the awk command below, but it did not work as expected.

awk '(NR-1)%2{$1=$1}1' RS=\| ORS=\| TestInput |egrep -v "^\|$">TestOUT

Input:

455000|TTTT|97233|UUUUUU REP||Juli||EEEE||P.O. Box 550 MMMMMMMMMM JJ 55555|||||||
333333|DDD|97233|UUUUUU REP||AMAR||AJAY||P.O. Box 69 MMMMMMMMMM JJ 6666 
JJJ BBBB P.O. Box 4 MMMMMMMMMM JJ 44444
delmer Speidel P.O. Box 242 MMMMMMMMMM JJ 99456, See File For More.....|||||||
888888|Director|97382|UUUUUU REP||ANTHONY|K|JOSHI||1144 JNM ROAD LLLLLLLLLLLLL JJ 82513, Laurie Ideker, Leon Sanderson
coralie Emmons P.O. Box 34 LLLLLLLLLLLLL JJ 82513
wanda Knowles P.O. Box 958 LLLLLLLLLLLLL JJ 82513, See File For More...|||||||
999999|President|97692|UUUUUU REP||See||File|||||||||

Expected Output:

455000|TTTT|97233|UUUUUU REP||Juli||Surwald||P.O. Box 550 MMMMMMMMMM JJ 55555|||||||
333333|DDD|97233|UUUUUU REP||AMAR||AJAY||P.O. Box 69 MMMMMMMMMM JJ 6666 JJJ BBBB P.O. Box 4 MMMMMMMMMM JJ 99456 delmer Speidel P.O. Box 242 MMMMMMMMMM JJ 99456, See File For More.....|||||||
888888|Director|97382|UUUUUU REP||ANTHONY|K|JOSHI||1144 JNM ROAD LLLLLLLLLLLLL JJ 82513, Laurie Ideker, Leon Sanderson coralie Emmons P.O. Box 34 LLLLLLLLLLLLL JJ 82513 wanda Knowles P.O. Box 958 LLLLLLLLLLLLL JJ 82513, See File For More...|||||||
999999|President|97692|UUUUUU REP||See||File|||||||||

vgersh99 · November 20, 2018, 11:27am

awk -F'|' 'NF==17 {print; next}          # complete record: print as is
           {s = (s) ? s " " $0 : $0      # join partial lines with a space
            if (split(s, a) == 17) {print s; s = ""}
           }
           END {if (s) print s}' myFile

RudiC · November 20, 2018, 11:44am

Hmmm - by which algorithm or rule has the EEEE in the first input line turned into Surwald in the expected output? And how did the third line's 44444 turn into 99456?

Try also

awk -F'|' '
        {while (NF<17 && (getline X) > 0) {     # also stop at end of input
                         $0 = $0 " " X
                        }
        }
1
' file

rajeshkumare · November 21, 2018, 6:24am

Thanks Rudi C,

It's my bad.

Expected output:

455000|TTTT|97233|UUUUUU REP||Juli||EEEE||P.O. Box 550 MMMMMMMMMM JJ 55555|||||||
333333|DDD|97233|UUUUUU REP||AMAR||AJAY||P.O. Box 69 MMMMMMMMMM JJ 6666 JJJ BBBB P.O. Box 4 MMMMMMMMMM JJ 44444 delmer Speidel P.O. Box 242 MMMMMMMMMM JJ 99456, See File For More.....|||||||
888888|Director|97382|UUUUUU REP||ANTHONY|K|JOSHI||1144 JNM ROAD LLLLLLLLLLLLL JJ 82513, Laurie Ideker, Leon Sanderson coralie Emmons P.O. Box 34 LLLLLLLLLLLLL JJ 82513 wanda Knowles P.O. Box 958 LLLLLLLLLLLLL JJ 82513, See File For More...|||||||
999999|President|97692|UUUUUU REP||See||File|||||||||

------ Post updated 11-21-18 at 11:24 AM ------

Thanks a lot Rudi C,

1) One more thing: my file may also contain extra pipes (|) in the data, in addition to the | field delimiters. How can that be handled?
2) Can you please explain your command?

RudiC · November 21, 2018, 5:05pm

1) If you tell us how to tell separator pipes from in-field-pipes, then someone could come up with some smart algorithm to handle that.
2) That little command keeps reading and appending new lines until the field count reaches 17; it then prints the record, because the pattern "1" is always TRUE and its default action is print.
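
The loop described above can be sketched with comments as follows. The variable n (the expected field count), the function name join_records, and the three-field sample data are assumptions for illustration only; the original command hard-codes 17. The (getline X) > 0 test is an addition so that the loop also ends at end of file.

```shell
# Join physical lines until each record has at least n fields.
# n and the sample input are illustrative; the file in this thread has 17 fields.
join_records() {
    awk -F'|' -v n="$1" '
        {
            # Keep appending the next physical line while the record is short.
            # Assigning to $0 re-splits the record, so NF is updated each time.
            while (NF < n && (getline X) > 0)
                $0 = $0 " " X
        }
        1   # always-true pattern with the default action: print the record
    '
}

# "a|b" and "c|d" are two halves of one broken three-field record.
joined=$(printf '%s\n' 'a|b' 'c|d' 'e|f|g' | join_records 3)
printf '%s\n' "$joined"
```

With this sample, the first two physical lines come out as the single record a|b c|d, and e|f|g passes through unchanged.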

rajeshkumare · November 22, 2018, 1:38am

1) I'm not sure how to identify data pipes, but suppose the | appears only in 2 or 3 specific fields, such as Address.
Can that be handled?

Don_Cragun · November 22, 2018, 6:37am

If your field delimiter is sometimes a field delimiter and sometimes data, you need to be able to very clearly identify each occurrence of that character as either data or delimiter. If you can't specify a clear rule that unambiguously determines whether a given character is a delimiter or data, there is no way to identify field boundaries.

And when you have field delimiters that are sometimes data AND record delimiters that are sometimes data, you have a real mess.

Your best choice would be to choose a different field delimiter that cannot ever appear as data.
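
As a rough sketch of how a candidate replacement delimiter might be checked before switching: count the lines in which it already occurs as data. The tab candidate, the function name count_lines_with, and the two sample lines are assumptions for illustration, not part of the file discussed here.

```shell
# Count input lines that contain a candidate delimiter character.
# A result of 0 suggests the candidate does not occur in this data set.
count_lines_with() {
    # $1 = candidate delimiter, data on stdin.
    # Note: awk -v interprets backslash escapes, so a backslash
    # candidate would need different handling.
    awk -v d="$1" 'index($0, d) { n++ } END { print n + 0 }'
}

tab=$(printf '\t')
# Second sample line contains a tab inside a field.
hits=$(printf 'a|b|c\nd|e\tf|g\n' | count_lines_with "$tab")
printf '%s\n' "$hits"
```

A nonzero count means the candidate is already used as data somewhere and is not a safe choice for that file.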

Corona688 · November 22, 2018, 10:35am

This works for your input data:

awk -F'|' '$1 ~ /^[0-9]+$/ { if(T) print T; T=$0; next; }
{ T = T " " $0; }
END { if(T) print T; }' allnum.txt

...but cannot be 100% reliable as Don Cragun says. It relies on the first field being all numbers, and if the broken line ever manages to imitate that, it will be fooled. And if | ever appears in a record nothing good will happen.
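
A small illustration of that weakness, using made-up data rather than the file from this thread: the continuation line 44444 consists only of digits, so the rule mistakes it for the start of a new record. The function name join_on_numeric_key is hypothetical.

```shell
# Same rule as above: a line whose first field is all digits starts a new record.
join_on_numeric_key() {
    awk -F'|' '
        $1 ~ /^[0-9]+$/ { if (T) print T; T = $0; next }
                        { T = T " " $0 }
        END             { if (T) print T }
    '
}

# The first record is broken across three physical lines; the middle one is "44444".
fooled=$(printf '%s\n' '111|P.O. Box 4' '44444' 'City|x' '222|ok|y' | join_on_numeric_key)
printf '%s\n' "$fooled"
```

The two intended records come out as three lines, because 44444 is split off from the record starting with 111 and then absorbs the following line.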

rajeshkumare · November 25, 2018, 12:25pm

Thanks Rudi C,
My file has 17 fields in total, so 16 pipes are expected.
Your command also works fine when there are extra pipes, i.e. more than 16 pipes. Can you please explain how it works when there are extra pipes in the data?
Please find below the input and the output after applying your command.
I will be very thankful to you!

The rows below have more pipes than expected:
1st row (19 pipes), 2nd row (18 pipes, \n in data), 3rd row (19 pipes)

The rows below have no extra pipes, i.e. 16 pipes as expected:
4th row (has \n in data), 5th row (no extra pipes, i.e. 16 pipes)

Input:

3071454|Organizer|324888|Filing|Bailey | Stock | Harmon | Cottam P.C.||||||||||||
333333|DDD|97233|UUUUUU REP||AMAR||AJAY||P.O. Box 69 MMMMMMMMMM JJ 6666||
JJJ BBBB P.O. Box 4 MMMMMMMMMM JJ 44444
delmer Speidel P.O. Box 242 MMMMMMMMMM JJ 99456, See File For More.....|||||||
3182134|Organizer|339933|Filing|BAILEY | STOCK | HARMON | COTTAM P.C.||||Registered Agent|221 E 21st St, Cheyenne, Laramie County, WY  82001|||||||
888888|Director|97382|UUUUUU REP||ANTHONY|K|JOSHI||1144 JNM ROAD LLLLLLLLLLLLL JJ 82513, Laurie Ideker, Leon Sanderson
coralie Emmons P.O. Box 34 LLLLLLLLLLLLL JJ 82513
wanda Knowles P.O. Box 958 LLLLLLLLLLLLL JJ 82513, See File For More...|||||||
999999|President|97692|UUUUUU REP||See||File|||||||||

Output:

3071454|Organizer|324888|Filing|Bailey | Stock | Harmon | Cottam P.C.||||||||||||
333333|DDD|97233|UUUUUU REP||AMAR||AJAY||P.O. Box 69 MMMMMMMMMM JJ 6666|| JJJ BBBB P.O. Box 4 MMMMMMMMMM JJ 44444 delmer Speidel P.O. Box 242 MMMMMMMMMM JJ 99456, See File For More.....|||||||
3182134|Organizer|339933|Filing|BAILEY | STOCK | HARMON | COTTAM P.C.||||Registered Agent|221 E 21st St, Cheyenne, Laramie County, WY  82001|||||||
888888|Director|97382|UUUUUU REP||ANTHONY|K|JOSHI||1144 JNM ROAD LLLLLLLLLLLLL JJ 82513, Laurie Ideker, Leon Sanderson coralie Emmons P.O. Box 34 LLLLLLLLLLLLL JJ 82513 wanda Knowles P.O. Box 958 LLLLLLLLLLLLL JJ 82513, See File For More...|||||||
999999|President|97692|UUUUUU REP||See||File|||||||||

RudiC · November 25, 2018, 6:41pm

Not sure I understand your question. Additional lines will be read and appended to $0 until there are at least 17 fields in $0. No distinction is made between pipe field separators and "extra pipes". Should your input have many "extra pipes" in early fields, that method may fail and still leave you with truncated lines.
Should that become a problem, see posts #5 and #7.
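
To illustrate that failure mode with made-up data (a field count of 4 stands in for 17, and the variable n is an assumption for this sketch): when in-field pipes push the first physical line to the expected field count, the loop stops before the rest of the record has been read.

```shell
# Intended records: "id1|A | B|street city|end" and "id2|x|y|z".
# The in-field pipe in "A | B" makes the first physical line look complete.
broken=$(printf '%s\n' 'id1|A | B|street' 'city|end' 'id2|x|y|z' |
    awk -F'|' -v n=4 '{ while (NF < n && (getline X) > 0) $0 = $0 " " X } 1')
printf '%s\n' "$broken"
```

Here the first line is printed truncated, and the leftover city|end is glued onto the following record instead.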