CSV from Text file

Gents,

Attached are an input.txt file and a code.txt file.

I use that code to create a CSV file from the text file.

Is it possible to write faster code to generate the CSV file from the input file?

I have deleted a lot of the data in the input file to avoid uploading too much.

Appreciate your help.

For those of us who don't have gawk (or any awk with a switch statement), it would help if you would show us the output you hope to produce from the sample input you provided (file.txt).

Hi Don.

I have attached the file produced by my code (it is a CSV file).

I don't understand what your code was trying to do by setting elements of your C[] array to -9999. Although I have not tested it with the sample input you provided, I assume that you were setting elements of your V[] array to -9999 to give each field a default value when there is no line in an input record for a given field.

The following code gathers field headers from the 1st input record and gathers data from subsequent records, ignoring headers (field 1 values) that were not found in the 1st record. If fields are missing from other input records, the default value of -9999 will be printed as the value for those fields. The order of fields in the 1st input record determines the order of fields in all of the output records.

If this is what you were trying to do, the following code is considerably shorter than the code you were using and produces the same output that your code produced when given the DOS text format input file you uploaded as file.txt:

awk -F': ' '
BEGIN {	OFS = ","
}
# At a record separator line (starting with "___"): print the header line the
# first time through, then print the values gathered for the record just read
# and reset them to the default -9999.
/^___/ {if(!r++)
		for(i = 1; i <= n; i++)
			printf("%s%s", h[i], (i == n) ? ORS : OFS)
	for(i = 1; i <= n; i++) {
		printf("%s%s", v[i], (i == n) ? ORS : OFS)
		v[i] = -9999
	}
}
# While reading the 1st record, remember each field header and its position.
r == 0 {h[++n] = $1
	h2f[$1] = n
}
# For every line: drop the <carriage-return> and save the value for its field.
{	sub(/\r/, "")
	v[h2f[$1]] = $2
}' "${1:-file.txt}" > "${2:-file}.csv"

Note that if you would like to get rid of the trailing space in each field in the header output line, you can do that by changing the line:

awk -F': ' '

in the script above to:

awk -F' : ' '
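
As a quick illustration of the difference (the sample line below is assumed, based on the general Header : value layout of your input):

# Assumed sample line, split with the two field separators:
printf 'Swath Name : 1000\n' | awk -F': ' '{printf "[%s]\n", $1}'     # prints [Swath Name ] (trailing space kept)
printf 'Swath Name : 1000\n' | awk -F' : ' '{printf "[%s]\n", $1}'    # prints [Swath Name]  (trailing space gone)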

Although I don't have gawk, the above code should work with gawk as well as with awk. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Don,

It is amazing and much more efficient than the code I did. Many thanks for improving it. Appreciate your support.

I have another small piece of code that uses the same input file but selects only a few columns; it is like this:

awk '/File/{if(s){print s;s=$NF+0}else{s=$NF+0}}
    /Swath Name/{s=s","$NF}
    /Tape #/{s=s","$NF}
    /Tape Label /{s=s","$NF}
    /Line Name/{s=s","$NF}
    /Point Number/{s=s","$NF}
    /Live Seis/{s=s","substr($0,12)}
    /SCI TB Timestamp GPS Time/{s=s","$NF}
    END{print s}' file.txt > 2.csv

Is there any chance you could check it? I am sure you will have a better and more efficient idea for this.

Here is the output (8 columns selected by header):

5000,1000,2240,X5,40559,38741,40560:38466-39172/2(1-354) 40572:38466-39172/2(355-708) 40584:38466-39172/2(709-1062) 40596:38466-39172/2(1063-1416) 40608:38466-39172/2(1417-1770) 40620:38466-39172/2(1771-2124) 40632:38466-39172/2(2125-2478) 40644:38466-39172/2(2479-2832) 40656:38466-39172/2(2833-3186) 40668:38466-39172/2(3187-3540) 40680:38466-39172/2(3541-3894) 40692:38466-39172/2(3895-4248) 40704:38466-39172/2(4249-4602) 40716:38466-39172/2(4603-4956) 40728:38466-39172/2(4957-5310) 40740:38466-39172/2(5311-5664) 40752:38466-39172/2(5665-6018) 40764:38466-39172/2(6019-6372) 40776:38466-39172/2(6373-6726) 40788:38466-39172/2(6727-7080) 40800:38466-39172/2(7081-7434) 40812:38466-39172/2(7435-7788) 40824:38466-39172/2(7789-8142) 40836:38466-39172/2(8143-8496) 40848:38466-39172/2(8497-8850) 40860:38466-39172/2(8851-9204) 40872:38466-39172/2(9205-9558) 40884:38466-39172/2(9559-9912) 40896:38466-39172/2(9913-10266) 40908:38466-39172/2(10267-10620) 40920:38466-39172/2(10621-10974) 40932:38466-39172/2(10975-11328) 40944:38466-39172/2(11329-11682) 40956:38466-39172/2(11683-12036) 40968:38466-39172/2(12037-12390) 40980:38466-39172/2(12391-12744),1209452095780002
5001,1000,2240,X5,40549,39053,40560:38622-39484/2(1-432) 40572:38622-39484/2(433-864) 40584:38622-39484/2(865-1296) 40596:38622-39484/2(1297-1728) 40608:38622-39484/2(1729-2160) 40620:38622-39484/2(2161-2592) 40632:38622-39484/2(2593-3024) 40644:38622-39484/2(3025-3456) 40656:38622-39484/2(3457-3888) 40668:38622-39484/2(3889-4320) 40680:38622-39484/2(4321-4752) 40692:38622-39484/2(4753-5184) 40704:38622-39484/2(5185-5616) 40716:38622-39484/2(5617-6048) 40728:38622-39484/2(6049-6480) 40740:38622-39484/2(6481-6912) 40752:38622-39484/2(6913-7344) 40764:38622-39484/2(7345-7776) 40776:38622-39484/2(7777-8208) 40788:38622-39484/2(8209-8640) 40800:38622-39484/2(8641-9072) 40812:38622-39484/2(9073-9504) 40824:38622-39484/2(9505-9936) 40836:38622-39484/2(9937-10368) 40848:38622-39484/2(10369-10800) 40860:38622-39484/2(10801-11232) 40872:38622-39484/2(11233-11664) 40884:38622-39484/2(11665-12096) 40896:38622-39484/2(12097-12528) 40908:38622-39484/2(12529-12960) 40920:38622-39484/2(12961-13392) 40932:38622-39484/2(13393-13824) 40944:38622-39484/2(13825-14256) 40956:38622-39484/2(14257-14688) 40968:38622-39484/2(14689-15120) 40980:38622-39484/2(15121-15552),1209492969580002

Appreciate your help.

Hi jiam912,
The code that you have provided in post #5 doesn't even come close to providing the output you showed us in that post from the sample input file named file.txt that you uploaded in post #1.

Some of the numeric fields have had a trailing .0 removed, all fields have had the <carriage-return> removed from the end of the input lines, and a <space> has been removed from the Live Seis field output (none of which are done by the code you showed us).

If you can't give us a clear description of the output your code is supposed to produce, I'm not going to waste any more time trying to guess at which conversions are supposed to be applied to which fields to get the output you want.

I don't have any way to know if you want the output produced by your code (which is not the output you showed us) or if you want the output you showed us (which is not the output your code produces).

Hi Don,

Yes, you are right. Very sorry, I forgot to mention that I removed the <carriage-return> from the output file. I did it to get a proper CSV file, because when I open it in Excel it is not recognized as a CSV file due to the <carriage-return>. Kindly check the attached file named file2.zip.

It contains the 2 files:

file2.csv: output of the script, with <carriage-return>s
file3.csv: <carriage-return>s removed

Code:

awk '/File/{if(s){print s;s=$NF+0}else{s=$NF+0}}
    /Swath Name/{s=s","$NF}
    /Tape #/{s=s","$NF}
    /Tape Label /{s=s","$NF}
    /Line Name/{s=s","$NF}
    /Point Number/{s=s","$NF}
    /Live Seis/{s=s","substr($0,12)}
    /SCI TB Timestamp GPS Time/{s=s","$NF}
    END{print s}' file.txt > file2.csv

tr -d '\r' < file2.csv > file3.csv

As you will notice, my code also greps the line

Nb Of Live Seis : 00000

, which is not needed.
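
I think this is because /Live Seis/ is not anchored, so it also matches that line; for example (sample lines assumed):

# Assumed sample lines; both contain the text "Live Seis", so both are matched:
printf 'Live Seis  : 40560\nNb Of Live Seis : 00000\n' |
    awk '/Live Seis/{print "matched:", $0}'      # prints both lines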

Many thanks.

The changes you've listed in post #7 still don't address several discrepancies with the output you said you want in post #5. The following comes close to what you said you wanted in post #5 when working with your original file.txt file:

awk -F' : ' '
BEGIN {	# Map each wanted header to its output column number.  Note that OFS
	# is set to "," as a side effect of the split() call.
	split("File #,Swath Name,Tape #,Tape Label,Line Name,Point Number," \
	    "Live Seis,SCI TB Timestamp GPS Time", hdr, OFS = ",")
	for(i in hdr)
		f[hdr[i]] = i
}
# At a record separator line (starting with "___"), print the accumulated
# record.
/^___/ {print s
}
# For lines whose header is one of the selected fields: strip trailing
# <space>s and any <carriage-return> from the value, convert entirely numeric
# values to numbers (dropping any trailing .0), and append the value to the
# output record (starting a new record when the field is File #).
$1 in f {
	sub(/ *\r*$/, "", $2)
	if(!($2 ~ /[^-.[:digit:]]/))
		$2 += 0
	s = ((f[$1] == 1) ? "" : s OFS) $2
}' "${1:-file.txt}" > 2.csv

This does not include any output for the input lines with field 1 containing Nb Of Live Seis. For the sample data provided in post #1, it produces the output:

5000,1000,2240,X5,40559,38741,40560:38466-39172/2(1-354) 40572:38466-39172/2(355-708) 40584:38466-39172/2(709-1062) 40596:38466-39172/2(1063-1416) 40608:38466-39172/2(1417-1770) 40620:38466-39172/2(1771-2124) 40632:38466-39172/2(2125-2478) 40644:38466-39172/2(2479-2832) 40656:38466-39172/2(2833-3186) 40668:38466-39172/2(3187-3540) 40680:38466-39172/2(3541-3894) 40692:38466-39172/2(3895-4248) 40704:38466-39172/2(4249-4602) 40716:38466-39172/2(4603-4956) 40728:38466-39172/2(4957-5310) 40740:38466-39172/2(5311-5664) 40752:38466-39172/2(5665-6018) 40764:38466-39172/2(6019-6372) 40776:38466-39172/2(6373-6726) 40788:38466-39172/2(6727-7080) 40800:38466-39172/2(7081-7434) 40812:38466-39172/2(7435-7788) 40824:38466-39172/2(7789-8142) 40836:38466-39172/2(8143-8496) 40848:38466-39172/2(8497-8850) 40860:38466-39172/2(8851-9204) 40872:38466-39172/2(9205-9558) 40884:38466-39172/2(9559-9912) 40896:38466-39172/2(9913-10266) 40908:38466-39172/2(10267-10620) 40920:38466-39172/2(10621-10974) 40932:38466-39172/2(10975-11328) 40944:38466-39172/2(11329-11682) 40956:38466-39172/2(11683-12036) 40968:38466-39172/2(12037-12390) 40980:38466-39172/2(12391-12744),1209452095780002
5001,1000,2240,X5,40549,39053,40560:38622-39484/2(1-432) 40572:38622-39484/2(433-864) 40584:38622-39484/2(865-1296) 40596:38622-39484/2(1297-1728) 40608:38622-39484/2(1729-2160) 40620:38622-39484/2(2161-2592) 40632:38622-39484/2(2593-3024) 40644:38622-39484/2(3025-3456) 40656:38622-39484/2(3457-3888) 40668:38622-39484/2(3889-4320) 40680:38622-39484/2(4321-4752) 40692:38622-39484/2(4753-5184) 40704:38622-39484/2(5185-5616) 40716:38622-39484/2(5617-6048) 40728:38622-39484/2(6049-6480) 40740:38622-39484/2(6481-6912) 40752:38622-39484/2(6913-7344) 40764:38622-39484/2(7345-7776) 40776:38622-39484/2(7777-8208) 40788:38622-39484/2(8209-8640) 40800:38622-39484/2(8641-9072) 40812:38622-39484/2(9073-9504) 40824:38622-39484/2(9505-9936) 40836:38622-39484/2(9937-10368) 40848:38622-39484/2(10369-10800) 40860:38622-39484/2(10801-11232) 40872:38622-39484/2(11233-11664) 40884:38622-39484/2(11665-12096) 40896:38622-39484/2(12097-12528) 40908:38622-39484/2(12529-12960) 40920:38622-39484/2(12961-13392) 40932:38622-39484/2(13393-13824) 40944:38622-39484/2(13825-14256) 40956:38622-39484/2(14257-14688) 40968:38622-39484/2(14689-15120) 40980:38622-39484/2(15121-15552),1209492969580002

in the file named 2.csv. Note that it removes the trailing <carriage-return> from all input lines, removes the trailing <space> from the Live Seis input line, and gets rid of the trailing .0 on all selected entirely numeric input fields.
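
As a small illustration of that last point (the sample value is assumed), adding 0 forces awk to treat an entirely numeric field as a number, and integral values are then printed without a trailing .0:

# Assumed sample value for illustration:
echo '5000.0' | awk '{if(!($1 ~ /[^-.[:digit:]]/)) $1 += 0; print $1}'    # prints 5000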

Note that if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk. (I don't think nawk will correctly process this code.)

Don,
Amazing, the code works perfectly. I use gawk instead of awk to make it work on my machine. I will read the code carefully to understand all the processing.
One question, please: where in the code does it remove the <carriage-return>?
Appreciate your help.

The line of code:

	sub(/ *\r*$/, "", $2)

removes zero or more <space> characters followed by zero or more <carriage-return> characters at the end of field two. This allows the code to work correctly whether or not the data you feed to this script has trailing <carriage-return>s. In your sample data, getting rid of trailing <space>s is only needed in the Live Seis input lines, but it is easier to use the same code for every selected line in case fields selected in a later revision of this code also contain them.
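
A minimal illustration (using an assumed, shortened Live Seis line):

# Assumed (shortened) sample line for illustration:
printf 'Live Seis  : 40560:38466-39172/2(1-354) \r\n' |
    awk -F' : ' '{sub(/ *\r*$/, "", $2); printf "[%s]\n", $2}'    # prints [40560:38466-39172/2(1-354)]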

Don
Many thanks for the information.