Convert unstructured file to unstructured comma separated file in Unix

I have another 2 scenarios of unstructured files, could you please solution for it as well.
Scenario 1:

C1 Report Type :  ABC
C2 Created By : DEF
C3 Report Count : 1
C4 Total Reports : 5
D       A      B     C
D2    : 1     :1A    : A*
D2    :4       :1Z   :  A*
D2    :3       :M     : 1

E Sub-Hdr   Report_cnt
E2  :1         :1

Required Output (3 rows):

ABC,DEF,1,5,1,1A,A*,1,1
ABC,DEF,1,5,4,1Z,A*,1,1
ABC,DEF,1,5,3,M,1,1,1

Scenario 2:

C1 Report Type :  ABC
C2 Created By : DEF
C3 Report Count : 1
C4 Total Reports : 5
D       A      B     C

E Sub-Hdr   Report_cnt
E2  :1         :1

Required Output (1 rows):

ABC,DEF,1,5,,,,1,1 

Thanks in advance.

Dear @Abidaliattar,

We are a community of people who will help you solve your own problems which you are working on.

We are not a community where you post your problem and ask for a solution to be provided to you where you do not show your own work nor provide any effort into the solution.

Please post your work and all you have done, in a step-by-step manner, to create your own solution.

Thanks!

Hi Neo,
Thanks for writing.
I have created a.awk file as Extract_Data.awk, and calling the file using awk command as below,

awk -f Extract_Data.awk unstructured_file.dat > Structured_File

#File Format:#

BEGIN {
  FS=" *: *"            # Separate fields according by zero or more spaces, a colon and zero or more spaces
  OFS=","               # The output field separator is a comma
}

/^C1 / {                # When C1 is encountered at the start of the line
  start=$NF             # The start string becomes the last value on the line
}

/^C2 | ^C3 |^C4  / {           # When C2 or C4 is encountered
  start=start OFS $NF   # The start string is appended with the last value
}

/^D2 | ^E2 / {                # For every E2 and D2
  $1=start              # The first field (D2) gets changed to the start string, and the fields are automatically separated by OFS.
  print                 # And we print the modified line
}

With this solution I am facing issues in each scenario.

Issue in scenario 1: E2 values are coming before D2.

Issue in Scenario 2: As I do not have D2 values (column A, B, and C) I am expecting comma for each column, but I am not getting commas.

Thanks again.

I modified your unstructured_file.dat file to remove extra spaces between : 1 and : A* for example. Here's the modified file:

C1 Report Type :  ABC
C2 Created By : DEF
C3 Report Count : 1
C4 Total Reports : 5
D       A      B     C
D2    :1     :1A    :A*
D2    :4       :1Z   :A*
D2    :3       :M     :1

E Sub-Hdr   Report_cnt
E2  :1         :1

If this is not a safe assumption, let us know...
You can change FS="[ :]+" to your FS=" *: *" - it produces the same desired result.

my script is a bit verbose to ease the understanding:

BEGIN {
  FS="[ :]+"
  OFS=","
  strDE=",,"
}
/Report Type/ {rptT=$NF}
/Created By/ {crtB=$NF}
/Report Count/ {rptC=$NF}
/Total Reports/ {rptsT=$NF}
/^D[0-9]+/ {
   strD[++n]
   for(i=2;i<=NF; i++)
     strD[n]=(strD[n])?strD[n] OFS $i:$i
}
/^E[0-9]+/ {
   strE=$3 OFS $NF
   if (n)
     for(i=1; i<=n;i++)
       printf("%s%s%s%s%s%s%s%s%s%s%s\n", rptT, OFS, crtB, OFS, rptC, OFS, rptsT, OFS, strD[i], OFS, strE)
   else
       printf("%s%s%s%s%s%s%s%s%s%s%s\n", rptT, OFS, crtB, OFS, rptC, OFS, rptsT, OFS, strDE, OFS, strE)
}

yields:

ABC,DEF,1,5,1,1A,A*,1,1
ABC,DEF,1,5,4,1Z,A*,1,1
ABC,DEF,1,5,3,M,1,1,1
1 Like

And here is a modified sample of the original you could try:

BEGIN {
  FS=" *: *"            # Separate fields according by zero or more spaces, a colon and zero or more spaces
  OFS=","               # The output field separator is a comma
}

/^C1 / {                # When C1 is encountered at the start of the line
  start=$NF             # The start string becomes the last value on the line
  D2_seen=0             # Reset D2 seen boolean
  output=""             # Clear output
}

/^C2 |^C3 |^C4 / {      # When C2, C3 or C4 is encountered
  start=start OFS $NF   # The start string is appended with the last value
}

/^D2 / {                # For every E2 or C6
  $1=start              # The first field (D2 or C6) gets changed to the start string, and the fields are automatically separated by OFS.
  output=output $0 ORS  # Append to current output
  D2_seen=1
}

/^E2/ {
  tail=OFS $2 OFS $3 ORS          # Create the tail
  if (D2_seen)                    # If D2 encounter
    gsub(ORS, tail, output)       # Prepend all newlines in output with tail
  else
    output=start ",,," tail       # Create output if D2 not encountered
  printf "%s", output             # And we print the modified line
}

So now you have two approaches to study :slight_smile:

2 Likes

Hi Scrutinizer,
Thanks for writing again,
I tried to the solution you have provided, not sure if I missed something.
I am not getting expected results. Below are the expected and actual results.
Scenario 1 Output:
Required Output (3 rows):

ABC,DEF,1,5,1,1A,A*,1,1
ABC,DEF,1,5,4,1Z,A*,1,1
ABC,DEF,1,5,3,M,1,1,1

Output of above code:

ABC DEF 1 5 :1 :1A :A* :1 :1
ABC DEF 1 5 :4 :1Z :A* :1 :1
ABC DEF 1 5 :3 :M :1 :1 :1

Scenario 2 Output:

Required Output (1 rows):

ABC,DEF,1,5,,,,1,1

Output of above code:

ABC DEF 1 5,,,,1 :1

Hi @Abidaliattar,

That is not the output I am getting with your input samples..
Maybe you left out the BEGIN section?

Otherwise what is your OS and version?

2 Likes

Hi Scrutinizer,
Yes, it was my bad, I had missed BEGIN section, now I am getting the expected results.
I had 3 files, for one file your very first solution is working, and for other 2 files (with E2 values)
are working with your 2nd solution. You are great . Thanks for your time and wonderful solutions.

Hi vgersh99,
Thanks for your time and solution. I am new to this community, and saw so wonderful people who comes for help with perfect solutions in no time. Thanks again.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.