How to cut a pipe-delimited file and paste it with another file to form a comma-separated output file

etldev · October 4, 2014, 2:41am

Hello ppl

I have a requirement to split (cut in unix) a file (A.txt) which is a pipe delimited file into A1.txt and A2.txt

Now I have to join (paste in unix) this A2.txt with an external file A3.txt to form the
output file A4.txt, which should be a CSV (comma separated) file so that a third party can open it using Microsoft Excel.

I can split the files however I want, but my input A.txt is a pipe delimited file and my output should be A4.txt.

The reason my input A.txt is pipe delimited is that the fields inside might contain commas.

Any quick help is appreciated..

MadeInGermany · October 4, 2014, 3:52am

Replace all pipes by commas

tr '|' ',' <input >output

In case the fields have commas try

sed 's/|/","/g; s/^/"/; s/$/"/'
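To see the difference between the two commands, here is a small sketch run on a made-up record whose second field contains a comma (the sample data is an assumption, not the poster's real file):

```shell
# Hypothetical sample record: the second field contains a comma.
line='1001|Smith, John|42.50'

# Plain translation: the embedded comma becomes indistinguishable
# from a field separator, so the record now appears to have 4 fields.
plain=$(printf '%s\n' "$line" | tr '|' ',')

# Quoted conversion: every field is wrapped in double quotes, so the
# embedded comma stays inside its field.
quoted=$(printf '%s\n' "$line" | sed 's/|/","/g; s/^/"/; s/$/"/')

printf '%s\n' "$plain"    # 1001,Smith, John,42.50
printf '%s\n' "$quoted"   # "1001","Smith, John","42.50"
```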

blastit.fr · October 4, 2014, 3:54am

Please give sample files.

You wrote that the fields might contain commas, so the output file can't use a comma as the separator in that case, unless you surround the data with double quotes ( " ), for instance.

Corona688 · October 4, 2014, 1:34pm

That's what the sed solution does: it protectively quotes all fields.

blastit.fr · October 4, 2014, 5:13pm

You can try to add this filter that converts | to ","

... some code ...|awk -F\| -vOFS='","' '{print "\"" $1,$2,$3,$4 "\""}' > A4.txt

Jean-Paul

Don_Cragun · October 4, 2014, 5:46pm

But, of course, the above script only works correctly if the input fed into it always has exactly four fields (and I didn't see anything in this thread so far that places any limits on the number of fields).

The following should work (even keeping empty lines in the input empty in the output):

... some code ... | awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' > A4.txt

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .
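As a quick check of that claim, the sketch below feeds the same awk command lines with one, three, and five fields plus an empty line. The input is made up for illustration.

```shell
# Hypothetical input: lines with 1, 3, and 5 fields plus an empty line.
input=$(printf '%s\n' 'a' 'b|c|d' '' 'e|f|g|h|i')

# NF is 0 for the empty line, so the action is skipped and the line
# is printed unchanged by the trailing 1 (an always-true pattern).
output=$(printf '%s\n' "$input" |
  awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1')

printf '%s\n' "$output"
```

Each non-empty line comes out with every field quoted, whatever the field count, and the empty line stays empty.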

etldev · October 5, 2014, 2:04am

Another thing I forgot to mention is that I don't know beforehand how many fields exist in the file, meaning A2.txt can have any number of fields, from 0 to n.

Don_Cragun · October 5, 2014, 3:07am

You also have forgotten to tell us what OS you're using. The awk script I suggested will properly handle any number of fields on a line as long as the length of the longest line in the output file is no longer than LINE_MAX (typically 2048) bytes. Most of the sed scripts you have been shown in this thread will properly handle files with one or more (but not zero) fields with the same line length limits. Some OSs will be OK with longer lines; others will generate a diagnostic message if a long line is encountered; and, unfortunately, a few others will silently corrupt your data.

If the line length is a problem for your data on your system, we can play games with the awk record separator (instead of field separator) to get around the issue.

So, what is the maximum line length in your input data and how many fields can you have in your output files? (Output line length will be (input line length) + 2 * (number of input fields).)
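The length formula can be checked on a single record. This is only a sketch with a made-up four-field line; it prints the input length, the field count, the measured output length, and the value predicted by the formula.

```shell
# Hypothetical four-field record.
line='DEP|2|08/19/2014|xy'

report=$(printf '%s\n' "$line" |
  awk -F'|' -v OFS='","' '{
    in_len = length($0); n = NF      # measure before rewriting
    $1 = $1; $0 = "\"" $0 "\""       # the quoting conversion
    # printf is used so that OFS does not leak into the report
    printf "%d %d %d %d\n", in_len, n, length($0), in_len + 2 * n
  }')

printf '%s\n' "$report"   # 19 4 27 27
```

Each of the n-1 separators grows from one byte to three, and the two outer quotes add two more, which gives the 2 * n extra bytes.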

What is the value of LINE_MAX on your system (output from getconf LINE_MAX )?

What OS are you using (output from uname -a )?

etldev · October 5, 2014, 4:51am

Hey Don

Much thanks for looking into this. I am using AIX 7.2. I don't think the line length is a problem; it is well within limits. I tried the AWk you suggested and I got an error.

A.txt is my pipe delim input file

cut -d "|" -f 1,2,3,4,5 A.txt > A1.txt 
cut -d "|" -f 6- A.txt > A2.txt

A3.txt is an external pipe delim file

paste -d "|" A2.txt A3.txt>>A4.txt

A5.txt would be my comma delim output

awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1'A4.txt >> A5.txt
 syntax error The source line is 1.
 The error context is
                NF{$1 = $1; $0 = "\"" $0 >>>  "\""}1A4. <<< txt
 awk: Quitting
 The source line is 1.

---------- Post updated at 03:51 AM ---------- Previous update was at 03:42 AM ----------

Here is the URL for a sample of what my input delimited file A.txt looks like:

http://s24.postimg.org/d8afdzcsl/example.jpg

RudiC · October 5, 2014, 4:56am

What if any one of the fields already is double quoted?

cat file
A|B|C|D
A|B|C|D
A|"B"|C|"D"
A|B|C|D
awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' file
"A","B","C","D"
"A","B","C","D"
"A",""B"","C",""D""
"A","B","C","D"

Don_Cragun · October 5, 2014, 6:45am

etldev,
There has to be a space between the ' terminating the awk script and the filename being given to awk as an input file:

awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' A4.txt >> A5.txt

RudiC,
The sample input shown in the link provided in post #9 in this thread contains no quotes of any kind (and no commas, either). With the sample data provided (which did not include any empty lines either), the following would be sufficient:

awk -F'|' -v OFS=',' '{$1 = $1}1' A4.txt >> A5.txt

But I assume that other input files will contain commas. If other input contains any double quote characters, adjustments will be needed and we'll need to know if the double quotes that are present are intended to be literal characters or are quoting other characters in the input. If any of them are intended to be literal double quote characters, we'll also need to know the quoting conventions used by the application that is going to read the CSV file this script is producing.
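As an illustration only: if the double quotes turn out to be literal characters, and if the reading application follows the common CSV convention of doubling embedded quotes, a sketch along the following lines might do. Both conditions are assumptions that nobody in this thread has confirmed, and to_csv is a hypothetical helper name. It quotes only the fields that contain a comma or a double quote.

```shell
# Assumptions (not confirmed in this thread): double quotes in the
# input are literal, and the CSV reader expects embedded quotes to be
# doubled inside a quoted field.
to_csv() {
  awk -F'|' -v OFS=',' '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /[",]/) {           # field holds a comma or a quote
        gsub(/"/, "\"\"", $i)      # double each embedded quote
        $i = "\"" $i "\""          # then wrap the field in quotes
      }
    }
    $1 = $1                        # force $0 to be rebuilt with OFS
    print
  }' "$@"
}

printf '%s\n' 'A|"B"|C, Inc.|D' | to_csv   # A,"""B""","C, Inc.",D
```

Fields without commas or quotes are left bare, so plain data opens in a spreadsheet without visible quoting.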

etldev,
In your sample code you are appending to files A4.txt and A5.txt rather than replacing whatever may have been in those files before. Is that intentional, or did you mean to use > instead of >> in both of those places?
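For reference, the whole cut/paste/convert sequence discussed so far can be sketched as follows, using > so that rerunning the script replaces the output files instead of growing them. The file contents and the scratch directory are made up for illustration, and mktemp -d is assumed to be available.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical stand-ins for the poster's files.
printf '%s\n' 'k1|k2|k3|k4|k5|x1|x2' 'k1|k2|k3|k4|k5|y1|y2' > A.txt
printf '%s\n' 'ext1|ext2' 'ext3|ext4' > A3.txt

cut -d '|' -f 1-5 A.txt > A1.txt          # first five fields
cut -d '|' -f 6-  A.txt > A2.txt          # everything from field 6 on
paste -d '|' A2.txt A3.txt > A4.txt       # > replaces, >> would append

# Quote every field and join the fields with commas.
awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' A4.txt > A5.txt
cat A5.txt
```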

etldev · October 5, 2014, 2:51pm

Neither of the above AWK commands gives me the desired output; I still see PIPES as the field delimiters instead of commas...

However, the awk code which takes care of quotes and commas is just putting commas at the beginning and at the end of the line.

Don

To your point, there are no commas within the data in the input files now, but there may be situations where there are, and in those situations the pipes and the commas in the data have to be selectively converted so that when the output is opened in a spreadsheet all the data looks intact, without any quotes around them (preferably).

Don_Cragun · October 5, 2014, 3:21pm

What OS are you using?

I repeat: Why are you using >> A5.txt instead of > A5.txt ???

Please put the data you showed us in the JPEG file you posted in a file named A4.txt and run the command:

awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' A4.txt > A5.txt

or, as long as there are no commas in A4.txt, the command:

awk -F'|' -v OFS=',' '{$1 = $1}1' A4.txt > A5.txt

and then show us the contents of the file A5.txt and any diagnostics produced by the awk command.

Posting a JPEG file doesn't really help with this problem. We need text that we can feed into awk , not pixels that are unintelligible to the UNIX and Linux system text processing utilities.

etldev · October 5, 2014, 3:41pm

I have attached the input A4.txt.

When I run the big awk command (the one which takes care of quotes), my output is A5.txt, which is attached, but when I run the other awk my output is the same as A4.txt.

Don_Cragun · October 5, 2014, 6:05pm

1st: This is a UNIX and Linux forum. Unless explicitly told otherwise, we do not expect input files using DOS <carriagereturn><newline> line terminators; we expect UNIX and Linux <newline> line terminators. So all of the code we suggested that adds quotes ends up with a final field that just contains a quoted <carriagereturn> character.

2nd: Running the following script with the A4.txt that you provided:

#!/bin/ksh
sed 's/|/","/g; s/^/"/; s/$/"/' A4.txt > A5sed.txt
awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' A4.txt > A5q.txt
awk -F'|' -v OFS=',' '{$1 = $1}1' A4.txt > A5nq.txt

produces the following in the three output files (shown using cat -v to make the carriage returns visible):

$ cat -v A5sed.txt
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      250000.00","      121232.87","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.47.32.050493              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      500000.00","      242465.75","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.47.32.050627              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      315285.83","      152882.45","           0.00","B64C8100                                ","01/01/0001                              ","2014-08-19-00.45.47.744917              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      520376.42","      250916.73","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.47.20.793454              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","     1000131.51","      482246.54","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.47.20.793644              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      150037.33","       72344.30","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.46.47.306701              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      646358.39","      311668.70","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.47.08.815658              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      110000.00","       53041.08","           0.00","B64C8100                                ","08/04/2014                              ","2014-08-19-00.46.50.346962              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      158213.08","      160383.33","         750.00","B64C8100                                ","08/23/2012                              ","2014-08-19-00.45.33.451061              ","^M"
"DEP","2","08/19/2014","SECOND TEST FILE DESCRIPTION                      ","      140383.43","      132266.13","        1400.00","B64C8100                                ","09/06/2012                              ","2014-08-19-00.45.33.451359              ","^M"
$

The contents of A5q.txt are identical to the contents of A5sed.txt .

$ cat -v A5nq.txt
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      250000.00,      121232.87,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.47.32.050493              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      500000.00,      242465.75,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.47.32.050627              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      315285.83,      152882.45,           0.00,B64C8100                                ,01/01/0001                              ,2014-08-19-00.45.47.744917              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      520376.42,      250916.73,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.47.20.793454              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,     1000131.51,      482246.54,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.47.20.793644              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      150037.33,       72344.30,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.46.47.306701              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      646358.39,      311668.70,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.47.08.815658              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      110000.00,       53041.08,           0.00,B64C8100                                ,08/04/2014                              ,2014-08-19-00.46.50.346962              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      158213.08,      160383.33,         750.00,B64C8100                                ,08/23/2012                              ,2014-08-19-00.45.33.451061              ,^M
DEP,2,08/19/2014,SECOND TEST FILE DESCRIPTION                      ,      140383.43,      132266.13,        1400.00,B64C8100                                ,09/06/2012                              ,2014-08-19-00.45.33.451359              ,^M
$

All of these are exactly what we would expect for the input file you provided!

To get the output you showed us, you had to use different commands than those we suggested you use. (Most likely you are using the wrong quotes in -F'|' or are using a similar-looking non-ASCII character instead of | .)

To correctly process your DOS files on UNIX systems, change the DOS line terminators in your input file to UNIX line terminators using:

dos2unix input output

where input is a DOS file and output is the name of the file you want to create with corrected line terminators.
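Where dos2unix is not installed, the carriage returns can also be deleted as part of the conversion itself. The following is only a sketch under that assumption; it uses tr together with the awk command shown earlier in this thread, and the sample record is made up.

```shell
# Hypothetical DOS-style record ending in <carriagereturn><newline>.
dos_line='DEP|2|08/19/2014\r\n'

# Unfixed: the carriage return lands inside the last quoted field;
# cat -v displays it as ^M.
dirty=$(printf "$dos_line" |
  awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' | cat -v)

# Fixed: delete carriage returns before awk sees the data.
clean=$(printf "$dos_line" | tr -d '\r' |
  awk -F'|' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1')

printf '%s\n' "$dirty" "$clean"
```

Note that tr -d '\r' removes every carriage return in the data, not just those at line ends, which is fine only if the fields themselves never contain one.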

Don_Cragun · October 5, 2014, 9:19pm

If none of the above suggestions work and you have exactly copied the commands suggested, try changing:

awk -F'|' ...

to:

awk -F'[|]' ...

and see if that makes any difference with the awk on AIX systems.

etldev · October 6, 2014, 2:37am

2 things Don

1) I still didn't get the AWk working, for both q and nq. I also tried

awk -F'[|]' -v OFS='","' 'NF{$1 = $1; $0 = "\"" $0 "\""}1' A4.txt > A5q.csv

I don't know if it has something to do with AIX.

But the SED works well. I am wondering if it would take care of both q and nq; if not, can you please help me with that sed code for q?

2) Second, I don't know about the DOS vs UNIX line terminators. However, I didn't get the necessity of using dos2unix input output, as the sed code gives me a perfect CSV output which, when transferred back to Windows using WinSCP in text mode, opens perfectly aligned in Microsoft Excel.
Please tell me where I should use dos2unix.

Don_Cragun · October 6, 2014, 3:19am

I know that awk (not AWk and not AWK) works OK on AIX.  Something else is going on here.

I assume that you have the three commands I suggested in a file that you executed to get the results you got.  Show us the output from the command:
od -bc file

where file is the name of the file containing those commands.
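To illustrate what to look for in that output, here is a sketch comparing an ASCII pipe with one possible look-alike. The broken bar is purely a hypothetical example of such a character, not a diagnosis of your file.

```shell
# An ASCII pipe is the single byte with octal value 174.
pipe_bytes=$(printf '|' | od -b)

# A hypothetical look-alike: the UTF-8 broken bar, written here by its
# two octal byte values so the example itself stays pure ASCII.
lookalike_bytes=$(printf '\302\246' | od -b)

printf '%s\n' "$pipe_bytes" "$lookalike_bytes"
```

If the separator in the script file shows up as anything other than a single 174, awk is not being given the character it needs.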

The sed command I gave you (copied from MadeInGermany's much earlier suggestion) changes all pipe symbols to "," and then adds " to the start and end of each line. You said that is what you want. I don't know what you mean by "help me with that sed code for q"???

You say that having the carriage return at the end of your input lines included between quotes in the last input field is what you want. If that is true, you don't need to worry about dos2unix . (I don't believe you, but if that is what you want, there is no reason to try to change it.)

blastit.fr · October 6, 2014, 3:55am

I'm very puzzled by Don's trick using

...{$1=$1 ; ... .

It forces a rebuild of $0 using the FS and OFS variables.

Of course it works fine on my PC's Cygwin version, as I have the most recent version of awk.
It looks rather like an undocumented feature, which could have unpredictable effects on old versions.

You can check your own version of awk using

awk -V

So far it seems the best option is still to use the original suggestion made by MadeInGermany:

sed 's/|/","/g; s/^/"/; s/$/"/'

We can translate this to awk, for instance, using the exact equivalent of sed's s command: gsub.
Let's also add Don's trick to preserve empty lines.

Try this :

$ awk -F'|' 'NF{ gsub(/\|/,"\",\"") ; $0 = "\"" $0 "\"" }1'

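Regarding line terminators: for the opposite conversion (UNIX to DOS), you can use this short sed line instead of a dedicated tool, provided your sed understands \r in the replacement (GNU sed does):

$ sed 's/$/\r/'  unixfile > dosfile

but of course WinSCP does this conversion perfectly.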

Jean-Paul

Don_Cragun · October 6, 2014, 5:35am

This is not some undocumented trick. The standards specify that setting any field other than $0 causes $0 to be re-evaluated.

Part of that re-evaluation includes using OFS as the field delimiter when that record is printed.

This might not work in /usr/bin/awk on Solaris systems (a 1975 vintage awk ), but will work on any 1988 or later version of awk , which will be installed as awk , gawk , or mawk on most systems, and as /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , and /usr/bin/nawk on Solaris systems.
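A quick way to see this behavior, assuming a POSIX-conforming awk is first on the PATH (the sample record is made up):

```shell
# Without an assignment to a field, $0 is printed exactly as read and
# OFS is never used.
untouched=$(printf 'a|b|c\n' | awk -F'|' -v OFS=',' '1')

# Assigning any field (even to itself) makes awk rebuild $0 from the
# fields, joined by OFS.
rebuilt=$(printf 'a|b|c\n' | awk -F'|' -v OFS=',' '{$1 = $1}1')

printf '%s\n' "$untouched"   # a|b|c
printf '%s\n' "$rebuilt"     # a,b,c
```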