Remove newline character from column spread over multiple lines in a file

Prathmesh · September 18, 2019, 1:16pm

Hi,

I came across one issue recently where output from one of the columns of the table from where i am creating input file has newline characters hence, record in the file is spread over multiple lines. Fields in the file are separated by pipe (|) delimiter. As header will never have newline character, I am trying to compare if other rows have same number of fields as that of header and if number of fields in particular row is less than number of fields in header line then I am removing newline character at the end of the line. I was able to do this for row spread over two lines but, I am not getting correct output for lines spread over multiple lines.

Below is test input file and expected output file -

Input file -

$ cat input
id|country|desscription|Language
1|UNITED STATES|WASHINGTON, D.C.|English
2|UNITED KINGDOM|Capital of UK is LONDON|English
3|NEPAL|Capital of NEPAL is
KATHMANDU|Nepali
4|QATAR|DOHA
is capital of
QATAR|Urdu
5|INDIA|capital
of
INDIA
is DELHI|Hindi
$

Expected output file -

id|country|desscription|Language
1|UNITED STATES|WASHINGTON, D.C.|English
2|UNITED KINGDOM|Capital of UK is LONDON|English
3|NEPAL|Capital of NEPAL is KATHMANDU|Nepali
4|QATAR|DOHA is capital of QATAR|Urdu
5|INDIA|capital of INDIA is DELHI|Hindi

Below code worked for row spread over two lines -

$ awk -F"|" '{if(NR==1){COL=NF}}{if(NF < COL){ sub(/\n/, ""); T=$0; getline; print T $0; next}}1' input
id|country|desscription|Language
1|UNITED STATES|WASHINGTON, D.C.|English
2|UNITED KINGDOM|Capital of UK is LONDON|English
3|NEPAL|Capital of NEPAL is KATHMANDU|Nepali
4|QATAR|DOHA is capital of
QATAR|Urdu5|INDIA|capital
of INDIA
is DELHI|Hindiis DELHI|Hindi
$

I also tried below code but it is not giving expected output -

 $ awk -F"|" '{if(NR==1){COL=NF}}{
> L_NF=NF
> C_NR=NR
> NL=$0
> CNT=0
> while(L_NF != COL)
> {
> C_NF=NF
> sub(/\n/, "");
> getline;
> NL=NL" "$0;
> CNT=+1
> L_NF=C_NF+NF
> }
> print NL
> }
> {
> for(i=0;i<=CNT;i++)
> {
> next
> }
> {
> print $0
> }}' input
id|country|desscription|Language
1|UNITED STATES|WASHINGTON, D.C.|English
2|UNITED KINGDOM|Capital of UK is LONDON|English
3|NEPAL|Capital of NEPAL is  KATHMANDU|Nepali 4|QATAR|DOHA  is capital of
QATAR|Urdu 5|INDIA|capital  of
INDIA  is DELHI|Hindi is DELHI|Hindi
$

Can someone please help me in this?

RudiC · September 18, 2019, 1:24pm

Try

awk -F"|" '
NR == 1         {COLMX = NF
                }
                {while (NF < COLMX)     {getline T
                                         $0 = $0 " " T
                                        }
                }
1
 ' file

Prathmesh · September 18, 2019, 2:00pm

Thanks RudiC. This worked perfectly well.

Yoda · September 19, 2019, 9:09am

Another approach using awk :-

awk '{t+=gsub(/\|/,"&")}t!=3{ORS=FS}t==3{ORS=RS;t=0}1' file

Prathmesh · September 26, 2019, 12:04am

Thanks for your reply.
Can you please explain how this approach works for me? I am having difficulty in understanding it.

system · May 20, 2020, 2:14pm

Moderator comments were removed during original forum migration.

RavinderSingh13 · May 20, 2020, 2:46pm

Adding one more solution which may help future users too(hail auto bumping )

awk '{printf("%s%s",$1!~/^[0-9]/?FNR==1?"":OFS:ORS,$0)} END{print ""}' Input_file

Thanks,
R. Singh