Text file to CSV with field data separated by blank lines

vestport · March 18, 2012, 12:49am

Hello,

I have some data in a text file where fields are separated by blank lines. There are only 6 fields however some fields have several lines of data as I will explain. Also data in a particular field is not consistently the same size but does end on a blank line. The first field start with the number 12 so I can use that as a start/stop point I guess but it has been a while and I forgot much of my shell scripting. Here is an example and what I need to do:

12 ABC5

01/01/2012 NameOfCity

Street Address

Person 1
Person 1 continued
Person 1 extra data

Person 2

12 ABC6

01/01/2012 NameOfCity

Street Address
Street Address continued
Street Address extra

Person 1

Person 2
Person 2 extra data

12 ABC7

01/01/2012 NameOfCity

.........and so on

What I want it to look like delimited for CSV use is:

12 ABC5;01/01/2012 NameOfCity;Street Address;Person 1,Person1 continued, Person1 extra data;Person 2
12 ABC6;01/01/2010 NameOfCity;StreetAddress;Person 1;Person 2, Person2 extra data
12 ABC7;....and so on........to the end of file.

In a perfect world I would like to delimit the date from city name like so:

12 ABC5;01/01/2012;NameOfCity;Street Address;Person 1,Person1 continued, Person1 extra data;Person 2
12 ABC6;01/01/2010;NameOfCity;StreetAddress;Person 1;Person 2, Person2 extra data
12 ABC7;....and so on........to the end of file.

So you see the problem is that Address, Person 1 and Person 2 can just be one line or may have many lines (up until a blank line) but that is not the same for every field every time. Just need to get this stuff all on single lines so I can import into excel or MYSQL etc.

Preferably I would like to do this with sed and loop in common bash script as my current script that got me this far is in bash. I'll take anything at this point though even if I have to call another program.

Thanks in advance for any replies!

Art

Franklin52 · March 18, 2012, 10:16am

Try this:

awk '/^12 /{
  if(s){print s}
  s=$0
  next
}
NF{
  if($0 ~ /..\/..\//){sub(" ",";")}
  s=s";"$0
}
END{print s}
' file

vestport · March 18, 2012, 12:24pm

Franklin52,

Thanks so much for that. What you sent is nearly perfect. The problem is that some fields like "person1" or "person2" (as well as others) could have many more than 2 or 3 lines. What you sent is awesome but some fields had a ";" in them where there should be a comma (fields with more than 2 or 3 lines initially). I am just not familiar with awk and how to adjust this.

Thanks so much for what you sent. I will try to hack at it unless you follow up.

Art

---------- Post updated at 12:24 PM ---------- Previous update was at 12:16 PM ----------

Also noticed that far down in the file I am processing that the ";" delimiter is dropped after the date. Not sure if this is due to extra lines per field in some cases.

Art

Scrutinizer · March 18, 2012, 12:35pm

Try:

awk 'END{print "\n"} $1~"/../"{{sub(" ",";")} NR>1 && NF{$1=$1;print /^12/?"\n":";"}NF' FS="\n" OFS=, RS= ORS= infile

slightly more readable form:

awk '
END{
  print "\n"
}
$1~"/../"{{
  sub(" ",";")
}
NR>1 && NF{
  $1=$1
  print /^12/?"\n":";"
}
NF
' FS="\n" OFS=, RS= ORS= infile

vestport · March 18, 2012, 12:51pm

Scrutinizer,

Thanks so much, that is it! Exactly what I am trying to to. Now I should be able to import this in mysql or excel as CSV data.

Also many thanks to Franklin52 for what you sent!

You made my day!

Art

ops_team · March 20, 2012, 5:43am

Hi All,

I have csv file which is comma seperated having data like :

CSV FILE:

121,11852,2011 Q4,0201, ,TRADITIONAL,Internal Fraud,6/14/2010,"100,000,000","Unauthorized Trading (e.g. Violating guidelines, Hiding losses, Securities Valuation)",1,

121,11853,2011 Q4,0201, ,TRADITIONAL,Internal Fraud,6/14/2010,"100,000,000",Fraud and Collusion,2,

121,11854,2011 Q4,0201, ,TRADITIONAL,Internal Fraud,6/14/2010,"100,000,000",Embezzlement,3,

In this file for few columns actual data contains comma e.g. column no 9,10 in first row. ("100,000,000","Unauthorized Trading (e.g. Violating guidelines, Hiding losses, Securities Valuation)").

For columns where actual data contains comma those columns are prefixed and suffixed by double quote in csv (" ").

I want to replace commas which comes in between double quotes with semicolon. (In short where actual column data contains comma i want it to replace with semicolon) and also those double quotes should get deleted afterwards.

Output should be like this where comma should be only file delimiter:

121,11852,2011 Q4,0201, ,TRADITIONAL,Internal Fraud,6/14/2010,100;000;000,Unauthorized Trading (e.g. Violating guidelines; Hiding losses; Securities Valuation),1,

121,11853,2011 Q4,0201, ,TRADITIONAL,Internal Fraud,6/14/2010,100;000;000,Fraud and Collusion,2,

121,11854,2011 Q4,0201, ,TRADITIONAL,Internal Fraud,6/14/2010,100;000;000",Embezzlement,3,

Please reply

Scrutinizer · March 20, 2012, 5:56am

Hi ops_team, please start a new thread for this. Don't forget to use code tags.