How to append a line to the previous line in | delimited text?

Narasimhasss · July 7, 2016, 4:13am

Hi All,

I am new to Unix and have a problem; the details are below.
I have a pipe-delimited text file in which some records span multiple lines instead of a single line.
Sample data:
Every record in the file should look like this:

41|216|398555|77|provided complete NP outcome data           constituted

But a few records look like this:

72|192|402632|580|Completed OHTS Phase 2
, through 
March

I need to keep each record on a single line, as above, and remove the extra spaces.
The third column of the data file holds a fixed-length number.
Could you please help me fix this?

Thanks,
Narasimha

Scrutinizer · July 7, 2016, 4:55am

Hi, please post in the proper forum.

Try:

awk '/[|]/{if(NR>1)print p; $1=$1; p=$0; next}{$1=$1; p=p FS $0} END{if(NR)print p}' file

$1=$1 is used to remove excess spaces
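
To see what the one-liner does, here is a minimal sketch that runs it on the sample from the first post. The input is piped in with printf instead of being read from a file, and the variable name joined is only for illustration.

```shell
# Sample records from the first post; the second one is split over three lines
joined=$(printf '%s\n' \
  '41|216|398555|77|provided complete NP outcome data           constituted' \
  '72|192|402632|580|Completed OHTS Phase 2' \
  ', through ' \
  'March' |
awk '
  /[|]/ { if (NR > 1) print p; $1 = $1; p = $0; next }
        { $1 = $1; p = p FS $0 }
  END   { if (NR) print p }
')
printf '%s\n' "$joined"
```

Continuation lines are glued on with FS (a single space), so the second record comes out as "Phase 2 , through March". Note that awk only writes to standard output; the input file itself is left untouched unless the output is redirected to a new file.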

Narasimhasss · July 7, 2016, 5:58am

Thanks for your help Scrutinizer,

This command does not change the file, and I still see the issue on a few lines.

Thanks,
Narasimha

---------- Post updated at 03:28 PM ---------- Previous update was at 03:08 PM ----------

Hi Scrutinizer,

OMG....

What a miracle! It works fine when I redirect the command's output to another file, as below.

awk '/[|]/{if(NR>1)print p;
 $1=$1; p=$0; next}{$1=$1; p=p FS $0}
 END{if(NR)print p}' test.txt > test_new.txt

I will use this logic for all my files.

Thanks a lot.

Thanks,
Narasimha

Narasimhasss · July 7, 2016, 12:48pm

Hi Scrutinizer,

I have used this command on my files, and it works for almost all of them, except for files in which lines end with a | (pipe). A screenshot is attached for your reference.
Could you please help me with this?

Thanks,
Narasimha

Narasimhasss · July 8, 2016, 6:49am

Hi All,

Can we do it like this?

First count the pipes in each line; if the count is not what it should be, append the line to the previous line.

I used the command below to count the pipes in each line:

awk -F "|" ' { print NF-1 } ' test.txt

But I could not work out the command that appends a line to the previous one when it has fewer pipes than expected.

Could you please help me?

Thanks,
Narasimha

RudiC · July 8, 2016, 7:39am

Try

awk '{printf "%s%s", NF==5?RS:"", $0} END {printf RS}' FS="|" file
41|216|398555|77|provided complete NP outcome data           constituted
72|192|402632|580|Completed OHTS Phase 2, through March

Narasimhasss · July 8, 2016, 12:43pm

Hi RudiC,

Thanks for your update. I tried your command, but it merges all lines into a single line.

My requirement is as follows.

I have a pipe-delimited text file in which some records span multiple lines.

Sample data:
Every row of the file should look like this:

Code:

41|216|398555|77|provided complete NP outcome data           constituted

But a few records look like this:

Code:

72|192|402632|580|Completed OHTS Phase 2
,through 
March

Thanks,
Narasimha

RudiC · July 8, 2016, 1:09pm

Proposal applied to your sample in post#7 (which, btw, doesn't differ from the ones in post#1):

awk '{printf "%s%s", NF==5?RS:"", $0} END {printf RS}' FS="|" file
41|216|398555|77|provided complete NP outcome data constituted
72|192|402632|580|Completed OHTS Phase 2,through March

I can't see that (and why) it should concatenate everything into one single line...

To remove the surplus spaces as requested in post#1, I added a gsub call:

awk '{gsub (/  */, " "); printf "%s%s", NF==5?RS:"", $0} END {printf RS}' FS="|" file

MadeInGermany · July 9, 2016, 5:18am

The point is to replace the condition /[|]/ (which amounts to NF>=2 when FS="|") with the required one, NF==5 or NF>=5.
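
A sketch of how that substitution might look, keeping the buffering structure of the earlier one-liner but counting fields with FS="|". The field count want=5 is an assumption taken from the sample, and gsub stands in for the $1=$1 trick, because with FS="|" reassigning $1 would rebuild the record with OFS instead of squeezing blanks.

```shell
out=$(printf '%s\n' \
  '41|216|398555|77|provided complete NP outcome data           constituted' \
  '72|192|402632|580|Completed OHTS Phase 2' \
  ', through ' \
  'March' |
awk -F'|' -v want=5 '
  { gsub(/ +/, " ") }                               # squeeze runs of blanks
  NF >= want { if (NR > 1) print p; p = $0; next }  # a full record starts here
  { p = p $0 }                                      # carry-over text: append
  END { if (NR) print p }
')
printf '%s\n' "$out"
```

This only helps when the break falls inside the last field. A record whose first physical line has fewer than want fields, or carry-over text that itself contains four or more pipes, would still be joined wrongly.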

Narasimhasss · July 11, 2016, 7:22am

Sorry to all for not using code tags. I have used them now. Please let me know if I still need to change anything when posting in the forum.

Thanks, RudiC, for your command.

I used your commands and they work fine, but after running them a few lines of my original file are deleted. I copied the deleted lines from the original file into a test file and ran the command on it, but nothing changes. I am attaching the data for your reference.

I tried the commands below on the test file, and the data does not change.

awk '{printf "%s%s", NF==8?RS:"", $0} END {printf RS}' FS="|" test_11.txt > test_12.txt

awk '{gsub (/  */, " "); printf "%s%s", NF==8?RS:"", $0} END {printf RS}' FS="|" test_11.txt > test_12.txt

Thanks,
Narasimha

RudiC · July 11, 2016, 8:25am

That's a different story - why didn't you post representative samples in the first place? Try

awk '{while (NF < 8) {getline X; $0 = $0 " " X}}1' FS="|" /tmp/test_11.txt

.
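
One caveat with a getline loop like this: if the input ends in the middle of a record, getline stops delivering lines and the loop may never finish. A minimal sketch with a guard on getline's return value follows; the sample records, the field count want=5 and the variable name cont are made up for illustration.

```shell
# Hypothetical 5-field records; the first one is broken inside field 3
# and the last one is cut off by the end of the input.
out=$(printf '%s\n' \
  '1|alpha|first part' \
  'second part|x|y' \
  '2|beta|whole|x|y' \
  '3|gamma' |
awk -F'|' -v want=5 '
  {
    # Pull in further lines until the record is complete or input runs out.
    while (NF < want && (getline cont) > 0)
      $0 = $0 " " cont
    print
  }
')
printf '%s\n' "$out"
```

The truncated last record is printed as it is instead of hanging the script. This variant handles breaks that occur before the last field; a break inside the last field leaves NF already complete, so the carry-over text is not pulled in.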

BTW, your file contains DOS line terminators (<CR>, \r, 0x0D) which you might want to get rid of before text processing in *nix.

Narasimhasss · July 11, 2016, 9:04am

Thanks RudiC,

As per your update, I want to deal with the DOS line terminators before using the commands you posted. Could you please let me know how to handle these DOS line terminators in Unix?

Thanks,
Narasimha

bakunin · July 11, 2016, 9:14am

A word in general: we are a self-help forum! That means: we help you to help yourself, we are not doing your work for you. If you want that: hire someone.

Here is how you do it in sed (note that "^M" is a single character! You enter it e.g. in vi by pressing <CTRL>-<V> and then the <ENTER> key):

sed 's/^M$//' /path/to/file
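
If typing the literal control character is awkward, the carriage return can also be produced with printf. The following is only a sketch on made-up two-line input; it compares byte counts before relying on either variant.

```shell
cr=$(printf '\r')    # a real carriage return character

# Strip a CR only when it is the last character on a line
sed_bytes=$(printf 'a|b\r\nc|d\r\n' | sed "s/${cr}\$//" | wc -c | tr -d ' ')

# Delete every CR anywhere in the input
tr_bytes=$(printf 'a|b\r\nc|d\r\n' | tr -d '\r' | wc -c | tr -d ' ')

echo "sed: $sed_bytes bytes, tr: $tr_bytes bytes"
```

The 10 input bytes shrink to 8 in both cases. Both commands write to standard output, so the result has to be redirected to a new file; the original file is not modified.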

I hope this helps.

bakunin

Narasimhasss · July 13, 2016, 10:02am

Hi bakunin,

Thanks for your commands.

Instead of using Unix commands, I opened the pipe-delimited file in Notepad++.
Most lines end with CRLF, but some end with only LF.
Wherever there is a single LF we face the issue. To avoid it, I replaced each lone LF with CR and then exported the file to the Unix server. After that it works fine.

But I have one file that is 850 MB (89,18,51,027 bytes) in size, and I cannot open it in Notepad++. If a Unix command can apply the same fix to this file, the issue will be resolved.

I used the command you suggested, and other commands too, but they do not work as expected.

sed 's/^M$//' /path/to/file

Could you please let me know whether there is any way to do this?

Thanks,
Narasimha

Narasimhasss · July 13, 2016, 11:33am

Hi bakunin,

I used the command

cat -e file.txt

to find out the EOL character.

For the affected lines, Notepad++ shows "LF" as the EOL (see the attachment Delimeter_Issue_notepad++1) and Unix shows "$" (see the attachment Delimeter_Issue_Unix1).

After replacing the "LF" with "CR" in Notepad++ we see "CR" (see the attachment Delimeter_Issue_notepad++2), and in Unix we see "^M" (see the attachment Delimeter_Issue_Unix1).

I think that if we replace the "$" with "^M", the issue will be resolved.
Could you please let me know whether there is any way to do this?
Or could you please let me know the best approach to fix this issue?

Thanks,
Narasimha

RudiC · July 13, 2016, 2:51pm

DOS line terminators (<CR>, \r, ^M, 0x0D) on *nix systems are definitely in the wrong spot. Don't use them, and, least of all, ADD them! DON'T use notepad to create files to be used/analysed on *nix systems.

The LF char is used in EXCEL to mark a line break within a cell. Does that file come from EXCEL?
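
If the file does follow that convention, i.e. every complete record ends in CRLF and a lone LF is a break inside a field, the record ends could be told apart by the CR alone. A minimal sketch under that assumption (unverified for the file in question, with made-up sample input and a hypothetical buffer name buf):

```shell
out=$(printf '72|192|402632|580|Completed OHTS Phase 2\n, through\nMarch\r\n41|216|398555|77|ok\r\n' |
awk '
  {
    if (sub(/\r$/, "")) {   # line ended with CR: the record is complete
      print buf $0
      buf = ""
    } else {                # lone LF: keep collecting
      buf = buf $0 " "
    }
  }
  END { if (buf != "") print buf }
')
printf '%s\n' "$out"
```

This joins the pieces with a single space and drops the CR in the same pass, so no field counting is needed.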

Narasimhasss · July 14, 2016, 3:20am

Hi RudiC,

No, this file does not come from EXCEL. We downloaded it from a website, and the manual suggests a command to fix this kind of issue.
We tried that command, but to no avail.

Below is the description from the manual file they provided.

Several files contain records that span multiple lines. This often causes problems when importing into relational databases. Users may wish to remove such features from a file before attempting to import its contents. For example, the following awk command can be used (on Linux or MacOS platforms) to address some of these situations.

awk 'BEGIN {FS="|";} {if ((length($2)==11) && index($2,"NCT") !=0) printf "\n%s",$0; else printf "%s",$0;}' arm_groups.txt | sed -e 's/[[:space:]]\+/ /g' > arm_groups.out

This command looks at each line in the arm_groups.txt file and determines if the 2nd field is the NCT_ID (length is 11 and first 3 chars are 'NCT') which suggests it represents an actual record (as opposed to 'carry-over text'). If so, it prints the record on a new line. Subsequent lines that do not have an NCT_ID in the second field are assumed to be carry-over text and are appended to this record. The 'sed' clause near the end of the command

(sed -e "s/[[:space:]]\+/ /g")

simply compresses contiguous spaces into a single space.

Thanks,
Narasimha

RudiC · July 14, 2016, 6:12am

There's no NCT_ID in either of your samples. Why do you have us mess around with incorrect sample data and irrelevant approaches when there's a proven solution that might fail in your special case?

bakunin · July 14, 2016, 10:12am

Actually: no. The "$" is just signifying the line end.

The problem you are obviously encountering is the old DOS<->UNIX problem:

in DOS lines are separated by the <CR><LF>-character sequence. That is, if you see a file (in DOS/Windows) like:

AB
CD

This file in fact has 6 bytes: "<A><B><CR><LF><C><D>". CR (Carriage Return) and LF (Line Feed) were originally printer control characters, and this way DOS circumvented the need to implement a printing program which (in professional OSes) inserted these control sequences. Instead, in DOS "printing" meant just dumping the file as it was to the printer device.
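
The byte count is easy to check on a Unix system. A small sketch that recreates the two-line DOS file with printf (the variable name dos_bytes is only for illustration):

```shell
# "AB<CR><LF>CD" exactly as a DOS editor would store it
dos_bytes=$(printf 'AB\r\nCD' | wc -c | tr -d ' ')
echo "$dos_bytes"            # prints 6

# od -c lists every byte, showing the CR as \r and the LF as \n
printf 'AB\r\nCD' | od -c
```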

In Unix the situation was different: it did have such a printing system. Therefore it was not necessary to have a two-character sequence to separate lines, and hence UNIX systems use only a single character, "NL" (new line), to separate lines. Incidentally it is the same character as the "LF" in DOS, which is why you see the additional "^M" character at the end of the line: it is simply the first of the CR-LF pair. The file above in UNIX would also consist of 6 characters, but only because proper UNIX files have an <EOF> (End Of File) character at their end: "<A><B><NL><C><D><EOF>".

Your problem comes most probably from transferring files back and forth between DOS- (or Windows-) systems and UNIX-systems without properly translating between them. ftp , for instance, has two modes: A(scii) and BI(nary): binary means no such translation takes place. ASCII means the ftp client becomes aware on which system it runs and to which system it transfers files and translates these line endings to what is proper on the target system. Alas, some email clients base their automatic detection on file-names (like "*.txt", etc.) and many users (you, obviously, included) don't know how and/or when to set the correct mode. This is why these ill-formed files happen.

You can either remove the superfluous line-ending characters in UNIX via the given sed script (you have to do that PRIOR to all the other scripts), or you can use the dos2unix and unix2dos utilities (which do the same, just in a "prepackaged" way), or you can use (on some systems) the recode command, which also does the same.
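
To illustrate the ordering, here is a sketch of a pipeline that strips the CRs first and only then joins the broken records. The sample input and the field count 5 are assumptions; the joining step follows the NF test suggested earlier in the thread, with a small NR > 1 check so the output does not start with an empty line.

```shell
out=$(printf '41|216|398555|77|first\r\n72|192|402632|580|Completed\r\n, through \r\nMarch\r\n' |
  tr -d '\r' |
  awk -F'|' '
    { printf "%s%s", (NF == 5 && NR > 1 ? RS : ""), $0 }   # new record: start a new line
    END { if (NR) printf "%s", RS }
  ')
printf '%s\n' "$out"
```

Without the tr step the carriage returns would end up in the middle of the joined record, where a terminal makes the text appear to overwrite itself.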

I hope this helps.

bakunin

RudiC · July 14, 2016, 11:34am

Isn't the <EOF> char actually just another <LF>?