I am new to Unix and have run into a problem; the details are below.
I have a pipe-delimited text file in which some records span multiple lines instead of a single line.
Sample data.
Every record in the file should look like this:
41|216|398555|77|provided complete NP outcome data constituted
But a few records look like this:
72|192|402632|580|Completed OHTS Phase 2
, through
March
I need each record on a single line, as in the first example, with the extra whitespace removed.
The third column of the data file always holds a fixed-length number.
Could you please help me fix this?
I have used this command on my files, and it works for almost all of them except the files that end with | (pipe). A screenshot is attached for your reference.
Could you please help me with this?
Sorry, everyone, for not using code tags. I have used them now. Please let me know if I still need to change anything when posting in the forum.
Thanks, RudiC, for your command.
I used your commands and they work fine, but after running them a few lines are deleted from my original file. I copied the deleted lines from the original file into a test file and ran the command on it, but nothing changed. I am attaching the data for your reference.
I tried the commands below on the test file, and the data does not change.
As per your update, I want to deal with the DOS line terminators before running the commands you posted. Could you please tell me how to handle DOS line terminators in Unix?
A word in general: we are a self-help forum! That means: we help you to help yourself, we are not doing your work for you. If you want that: hire someone.
Here is how you do it in sed (note that "^M" is a single character! You enter it, e.g. in vi, by pressing <CTRL>-<V> and then the <ENTER> key):
sed 's/^M$//' /path/to/file
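A minimal sketch of the same substitution that does not need the control character typed by hand: the carriage return is produced with printf and kept in a shell variable. The names cr and strip_cr are made up for this illustration, and the demo input is invented.

```shell
# Hold a literal carriage return in a variable so that no control
# character has to be typed into the script itself.
cr=$(printf '\r')

# strip_cr (hypothetical name) deletes one CR at the end of each line.
strip_cr() {
    sed "s/${cr}\$//"
}

# Demo on two CRLF-terminated lines; od -c makes any remaining \r visible.
printf 'AB\r\nCD\r\n' | strip_cr | od -c
```

To apply it to a real file, something like strip_cr < /path/to/file > /path/to/file.unix should do (the output name is only an example); writing to a new file keeps the original intact for comparison.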
Instead of using Unix commands, I opened the pipe-delimited file in Notepad++
and found that most lines end with CRLF but some end with only LF.
Wherever there is a single LF we get the issue. To avoid it, I replaced each lone LF with CR and then copied the file to the Unix server. After that it works fine.
But I have one file that is 850 MB (89,18,51,027 bytes) in size, and I cannot open it in Notepad++. If the same fix can be done with a Unix command, the issue will be resolved for this file too.
I used the command you suggested, and other commands too, but they do not work as expected.
sed 's/^M$//' /path/to/file
Could you please let me know whether there is any way to do this?
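One possible way to handle the "CRLF ends a record, a lone LF is a break inside a record" situation on the Unix side is sketched below. It assumes that every complete record really ends in CRLF, and it joins the pieces with a single space, which is a choice made for this sketch. The function name join_bare_lf and the sample rows are invented for illustration.

```shell
# join_bare_lf (hypothetical name) reads LF-separated input and treats a
# line as the end of a record only when it ends in CR, i.e. in CRLF.
join_bare_lf() {
    awk '
        {
            if (sub(/\r$/, "")) {    # line ended with CR: record is complete
                print buf $0
                buf = ""
            } else {                 # bare LF: keep the text and read on
                buf = buf $0 " "
            }
        }
        END { if (buf != "") print buf }   # flush a trailing fragment, if any
    '
}

# Demo: one clean record and one record broken by two lone LFs.
printf '41|216|398555|77|ok\r\n72|192|402632|580|Completed OHTS Phase 2\n, through\nMarch\r\n' |
    join_bare_lf
```

awk reads the input line by line, so the size of the file should not matter much. Note that under this assumption the CR is the only marker of a real record end, so this step would have to run before any step that strips the CRs; the output already has plain LF line ends.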
For the problem lines, Notepad++ shows "LF" as the EOL (see the attachment Delimeter_Issue_notepad++1) and Unix shows "$" (see the attachment Delimeter_Issue_Unix1).
After replacing the "LF" with "CR" in Notepad++ we see "CR" (see the attachment Delimeter_Issue_notepad++2) and in Unix we see "^M" (see the attachment Delimeter_Issue_Unix1).
I think the issue will be resolved if we replace the "$" with "^M".
Could you please let me know whether there is any way to do this, or what the best approach to fix this issue would be?
DOS line terminators (<CR>, \r, ^M, 0x0D) in *nix systems are definitely in the wrong spot. Don't use them and, least of all, ADD them! DON'T use notepad to create files to be used/analysed on *nix systems.
The LF char is used in EXCEL to mark a line break within a cell. Does that file come from EXCEL?
No, this file does not come from EXCEL. We downloaded the file from a website, and its manual suggests a command to fix this kind of issue.
We tried that command, but it did not help.
Below is the description from the manual they provide.
Several files contain records that span multiple lines. This often causes problems when importing into relational databases. Users may wish to remove such features from a file before attempting to import its contents. For example, the following awk command can be used (on Linux or MacOS platforms) to address some of these situations.
This command looks at each line in the arm_groups.txt file and determines if the 2nd field is the NCT_ID (length is 11 and first 3 chars are 'NCT') which suggests it represents an actual record (as opposed to 'carry-over text'). If so, it prints the record on a new line. Subsequent lines that do not have an NCT_ID in the second field are assumed to be carry-over text and are appended to this record. The 'sed' clause near the end of the command
(sed -e "s/[[:space:]]\+/ /g")
simply compresses contiguous spaces into a single space.
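The manual's own command is not reproduced in this thread, so the following is only a sketch written from the description above, not the original. It assumes pipe-delimited input with the NCT_ID in the second field; the function name join_by_nct_id and the sample rows are invented. The sed clause is spelled [[:space:]][[:space:]]* instead of the quoted [[:space:]]\+ because \+ is not understood by every sed.

```shell
# join_by_nct_id (hypothetical name): start a new output line whenever the
# 2nd field looks like an NCT_ID, otherwise append the line as carry-over text.
join_by_nct_id() {
    awk -F'|' '
        length($2) == 11 && substr($2, 1, 3) == "NCT" {
            if (NR > 1) printf "\n"    # close the previous record
            printf "%s", $0
            next
        }
        { printf " %s", $0 }           # carry-over text: append to the record
        END { if (NR > 0) printf "\n" }
    ' | sed -e 's/[[:space:]][[:space:]]*/ /g'   # squeeze runs of whitespace
}

# Demo with made-up rows: the second input line is carry-over text.
printf '1|NCT00000001|Arm A\ncontinued   text\n2|NCT00000002|Arm B\n' |
    join_by_nct_id
```

A file whose second field is not an NCT_ID, like the samples posted earlier in this thread, would need a different test for "this line starts a record".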
There's no NCT_ID in either of your samples. Why do you have us mess around with incorrect sample data and irrelevant approaches when there's a proven solution that might fail in your special case?
Actually: no. The "$" just signifies the line end.
The problem you are obviously encountering is the old DOS<->UNIX problem:
In DOS, lines are separated by the <CR><LF> character sequence. That is, if you see a file (in DOS/Windows) like:
AB
CD
This file has in fact 6 bytes: "<A><B><CR><LF><C><D>". CR (Carriage Return) and LF (Line Feed) were originally printer control characters, and this way DOS circumvented the necessity to implement a printing program which (in professional OSes) inserted these control sequences. Instead, in DOS "printing" meant just dumping the file as it was to the printer device.
In Unix the situation was different, and indeed it had such a printing system. Therefore it was not necessary to have a two-character sequence to separate lines, and hence UNIX systems have only a single character "NL" (new line) to separate lines. Incidentally it is the same character as the "LF" in DOS, which is why you see the additional "^M" character at the end of the line: it is simply the leftover first half (the CR) of the CR-LF pair. The file above in UNIX would also consist of 6 characters, but only because proper UNIX text files end their last line with a newline as well: "<A><B><NL><C><D><NL>".
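The byte counts can be checked directly on any *nix box. A small sketch using the AB/CD example, with printf generating the bytes so that no editor gets a chance to change them; count_bytes is a made-up helper name.

```shell
# count_bytes (hypothetical name); tr strips the padding some wc versions add.
count_bytes() { wc -c | tr -d ' '; }

printf 'AB\r\nCD' | count_bytes    # DOS-style, nothing after the last line: 6
printf 'AB\nCD\n' | count_bytes    # Unix-style, newline after every line: 6
printf 'AB\r\nCD\r\n' | od -c      # a CRLF file as Unix tools see it
```

In the od -c output every \n is preceded by a \r; that \r is the character vi and cat -v display as ^M.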
Your problem most probably comes from transferring files back and forth between DOS (or Windows) systems and UNIX systems without properly translating between them. ftp, for instance, has two modes: A(scii) and BI(nary). Binary means no such translation takes place. ASCII means the ftp client becomes aware on which system it runs and to which system it transfers files, and translates these line endings to what is proper on the target system. Alas, some ftp clients base their automatic detection on file names (like "*.txt", etc.) and many users (you, obviously, included) don't know how and/or when to set the correct mode. This is why these ill-formed files happen.
You can either remove the superfluous line-ending characters in UNIX via the given sed script (you have to do that PRIOR to all the other scripts), or you can use the dos2unix and unix2dos utilities (which do the same, just in a "prepackaged" way), or you can use (on some systems) the recode command, which also does the same.
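Where none of these tools is at hand, plain tr is another option. Unlike the sed script it deletes every CR in the file, not only those at the end of a line, so this sketch assumes no CR is wanted anywhere. The helper name strip_all_cr and the temporary file names are invented for the example.

```shell
# strip_all_cr (hypothetical name): read a DOS-format file ($1) and write a
# copy without any CR characters ($2). The input file is left untouched.
strip_all_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Demo in a throwaway directory.
tmp=$(mktemp -d)
printf 'AB\r\nCD\r\n' > "$tmp/dos.txt"
strip_all_cr "$tmp/dos.txt" "$tmp/unix.txt"
od -c "$tmp/unix.txt"      # no \r should be left in the output
rm -r "$tmp"
```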