Data extraction and conversion into a .csv file

Hi All,

I have a data file and need to extract data from it and convert it to CSV format:
1) Read sample_linebyline.txt, extract the lines ending with "----", and make a .csv file from them.

2) Read the flat file flatfile_sample.txt, which consists of similar data (but in one single line), extract the content ending with "----", and make another .csv file from it.

Can you please help me out with this?

Thanks in advance.

No attempts / ideas / thoughts from your side?

try

awk '/----\r$/ && $1=$1' OFS=, /tmp/sample_linebyline.txt 
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25-,----
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25,----

taking into account that your file has DOS (CRLF), not *nix, line terminators.
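The effect of those carriage returns can be seen in a tiny self-contained sketch; the demo file and its contents below are made up to mimic the thread's format:

```shell
# Build a small CRLF-terminated sample (hypothetical data).
printf '840-1 1 ABCD 2745.25- ----\r\n' > /tmp/crlf_demo.txt

# A plain /----$/ fails because each line really ends in "----\r";
# stripping the CRs first lets the simple pattern match again, and
# $1=$1 rebuilds the record with OFS (a comma) between fields.
tr -d '\r' < /tmp/crlf_demo.txt | awk '/----$/ && $1=$1' OFS=,
# -> 840-1,1,ABCD,2745.25-,----
```

Matching the \r explicitly in the pattern, as in the command above, avoids the extra tr step.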

EDIT: And for your other file, which has several trailing spaces in the lines, try

awk '/---- *\r$/ && $1=$1' OFS=, /tmp/flatfile_sample.txt 
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25-,----,
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25,----,

Save as convert.py
Run as python3 convert.py

# Keep only the lines ending in "----" (Python's text mode already
# translates the DOS "\r\n" terminators to "\n") and comma-join the
# whitespace-separated fields.
with open('sample_linebyline.txt') as rf, open('sample_linebyline.csv', 'w') as wf:
    for line in rf:
        if line.endswith('----\n'):
            fields = line.split()
            print(",".join(fields), file=wf)

# Same idea for the flat file, matching "----" anywhere in the line.
with open('flatfile_sample.txt') as rf, open('flatfile_sample.csv', 'w') as wf:
    for line in rf:
        if '----' in line:
            fields = line.split()
            print(",".join(fields), file=wf)

Output:

$ cat sample_linebyline.csv
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25-,----
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25,----
$ cat flatfile_sample.csv
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25-,----
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25,----

Thank you, RudiC and Aia, for your help (Python is not installed on our system, otherwise I would have tried that option too).
I tried the awk commands below, and they work for the data file containing line-by-line records.

awk 'BEGIN {OFS=","} {print $1,$2,$3,$4,$5,$6,$7,$8}' D:/tmp/file > tmp.txt
awk '/----/' D:/tmp/tmp.txt >tmp1.txt

But I am having an issue reading the flat file (a single-line file), which throws this error:
awk: line 0 (NR=0): line too long: limit 20000

Could you please help or suggest something for this?
Thanks

For the sample attached in post#1, the file command returns

file /tmp/flatfile_sample.txt 
/tmp/flatfile_sample.txt: ASCII text, with very long lines, with CRLF line terminators

, and the line length is between 1155 and 1159.

How does this differ from your real data file? What are your OS and tool versions? And why that tedious awk conversion from D:/tmp/file to an intermediate file?
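For reference, the two-step conversion could be collapsed into a single awk pass; the demo file below is hypothetical:

```shell
# Demo input (made up): one matching and one non-matching line.
printf 'a b c ----\nx y z no-match\n' > /tmp/onepass_demo.txt

# Filter on the "----" marker and reformat to CSV in one pass,
# with no intermediate file.
awk '/----$/ && $1=$1' OFS=, /tmp/onepass_demo.txt
# -> a,b,c,----
```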

Hi Rudi,
All the data in flatfile_sample.txt is in one single line, i.e. it has tens of thousands of records in a single line instead of a normal file with multiple record lines.
The issue here is reading this one line from the file. We are using the MKS toolkit on Windows for shell scripting.
I will be using the suggestion from your earlier post; I just mentioned what I had tried.

Thanks

How then do you tell one record from the other? Are they fixed length? Or is there another, non-LF record terminator?

Hi Rudi,

Each record is 2,025 characters long, after which the next record starts.

Thanks

How about

fold -w1158 /tmp/flatfile_sample.txt | awk '/---- *\r*$/ && $1=$1' OFS=,
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25-,----
840-1,1,ABCD,0010211-00,0012345678/012345678912,123456789012,2745.25,----

Adapt the record length (the -w option) to your input file.
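The fold idea scales down to a toy demo; here two fixed-width 20-character records are packed into one line (widths and contents are invented for illustration):

```shell
# Pack two 20-char fixed-width records into a single line, no newlines.
printf '%-20s%-20s' 'a b ----' 'c d ----' > /tmp/fixed_demo.txt

# fold cuts the line back into fixed-width records; awk turns each
# matching record into a CSV row (the trailing pad spaces are absorbed
# by the "---- *$" pattern and discarded by the $1=$1 rebuild).
fold -w20 /tmp/fixed_demo.txt | awk '/---- *$/ && $1=$1' OFS=,
# -> a,b,----
#    c,d,----
```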

Thanks a lot for your help, RudiC. :)