Changing format of file with awk

owwow14 · September 11, 2014, 6:52am

Hi all,
I have a file that looks like this:

Closest words to: manifesto >>>>
[(0.99999999999999978, 'manifesto'), (0.72008211381623111, 'communiqu\xe9'), (0.6942217252661308, 'manifestos'), (0.68892580417319915, 'pamphlet'), (0.68146378689894338, 'communique'), (0.66477336566612566, 'newssheet'), (0.65802727088954649, 'workplan'), (0.65534176275799949, 'counter-proposal'), (0.65430633850582132, 'credo'), (0.65313506395462273, 'report*')]

Closest words to: passport >>>>
[(1.0000000000000004, 'passport'), (0.82035608388470505, 'passports'), (0.74795707589520077, 'photocard'), (0.7029703031026393, 'visa'), (0.66463194673185344, 'certificate'), (0.65157805812927172, 'railcard'), (0.64138220956663572, 'chequebook'), (0.64021573915462227, 'payslip'), (0.63595253934734819, 'cis5'), (0.63233458893012662, 'carnet')]

and I want to reformat this with awk, with the following desired result:

manifesto
0.99999999999999978, 'manifesto'
0.72008211381623111, 'communiqu\xe9'
0.6942217252661308, 'manifestos'
0.68892580417319915, 'pamphlet'
0.68146378689894338, 'communique'
0.66477336566612566, 'newssheet'
0.65802727088954649, 'workplan'
0.65534176275799949, 'counter-proposal'
0.65430633850582132, 'credo'
0.65313506395462273, 'report*'

passport
1.0000000000000004, 'passport'
0.82035608388470505, 'passports'
0.74795707589520077, 'photocard'
0.7029703031026393, 'visa'
0.66463194673185344, 'certificate'
0.65157805812927172, 'railcard'
0.64138220956663572, 'chequebook'
0.64021573915462227, 'payslip'
0.63595253934734819, 'cis5'
0.63233458893012662, 'carnet'

Can someone please let me know if this will be possible?
Thank you in advance.

rbatte1 · September 11, 2014, 7:04am

Dear owwow14,

I have a few questions to pose in response first:

What have you tried so far?
What output/errors do you get?
What OS and version are you using?
You've said awk but would you consider alternatives?
What logical process have you considered? (to help steer us to follow what you are trying to achieve)

Most importantly, What have you tried so far?

There are usually many ways to achieve a task, so giving us an idea of your style and thinking will help us guide you to the answer that suits you best, one you can adapt to your needs in future.

We're all here to learn and getting the relevant information will help us all.

Kind regards,
Robin

owwow14 · September 11, 2014, 7:14am

Hi, sorry for being cryptic. I have been struggling with this for some hours now.
To answer your questions:

I have been using Python and awk. The first approach was challenging for me, as the file is already in what would be considered a dictionary format, so I do not know how to unnest the information within the brackets. There is also a header line that labels the bracketed information, and I could not find a solution for this.
For the second approach, awk, I have been trying to isolate each individual chunk of information (i.e. between blank lines) and then delete the punctuation, replacing it with newlines to create the column format.
I am using OS X 10.7.5.
I am open to using a UNIX-type command or Python.

I hope this provides more information about my question.
Thank you again, in advance.
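
For the Python route described above, one possible sketch is shown here. It assumes each bracketed line is a valid Python literal (a list of (score, word) tuples), so ast.literal_eval can unnest it safely. The function name reformat and the sample data are hypothetical, chosen only for illustration. Because the values are parsed and printed again, Python may show fewer digits than the original text and may decode escapes such as \xe9; if the exact original text must be kept, a plain text substitution is the closer fit.

```python
import ast


def reformat(lines):
    """Yield output lines for '... word >>>>' headers and their list lines."""
    for line in lines:
        line = line.strip()
        if line.endswith(">>>>"):
            # Header: the word is the second-to-last whitespace-separated field.
            yield line.split()[-2]
        elif line.startswith("["):
            # Data: parse the literal into a list of (score, word) tuples.
            for score, word in ast.literal_eval(line):
                yield "{!r}, {!r}".format(score, word)
        else:
            # Blank (or unrecognised) lines pass through unchanged.
            yield line


# Hypothetical sample in the same shape as the input file.
sample = [
    "Closest words to: passport >>>>",
    "[(0.5, 'passport'), (0.25, 'visa')]",
    "",
]
print("\n".join(reformat(sample)))
```

To run it on a real file, the list sample could be replaced by an open file object, since iterating over a file yields its lines.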

RavinderSingh13 · September 11, 2014, 7:17am

Hello,

The following may help for the given input.

awk '/>>>>/ {print $(NF-1)} !/>>>>/ {gsub(/\), \(/,"\n",$0);gsub(/\[|\]|\(|\)/,X,$0);print $0}' filename

Output will be as follows.

manifesto
0.99999999999999978, 'manifesto'
0.72008211381623111, 'communiqu\xe9'
0.6942217252661308, 'manifestos'
0.68892580417319915, 'pamphlet'
0.68146378689894338, 'communique'
0.66477336566612566, 'newssheet'
0.65802727088954649, 'workplan'
0.65534176275799949, 'counter-proposal'
0.65430633850582132, 'credo'
0.65313506395462273, 'report*'
 
passport
1.0000000000000004, 'passport'
0.82035608388470505, 'passports'
0.74795707589520077, 'photocard'
0.7029703031026393, 'visa'
0.66463194673185344, 'certificate'
0.65157805812927172, 'railcard'
0.64138220956663572, 'chequebook'
0.64021573915462227, 'payslip'
0.63595253934734819, 'cis5'
0.63233458893012662, 'carnet'

Thanks,
R. Singh

rbatte1 · September 11, 2014, 7:24am

Hello RavinderSingh13,

This looks to be an excellent answer, but can you explain how it achieves the result so we can all learn?

Thanks, in advance,
Robin

RavinderSingh13 · September 11, 2014, 7:31am

Hello Robin,

Here is an explanation of the solution I posted for the input given by the user.

awk '
/>>>>/ {print $(NF-1)}            # Header line (contains >>>>): print its second-to-last field, the word.
!/>>>>/ {gsub(/\), \(/,"\n",$0);  # Any other line: replace each ), ( separator with a newline,
gsub(/\[|\]|\(|\)/,X,$0);         # delete the remaining [ ] ( ) characters (X is an unset variable, so it acts as the empty string),
print $0}' filename               # and print the reformatted line.
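
For comparison, the same steps can be sketched in Python, which owwow14 mentioned being open to. This is only an illustrative sketch, not a drop-in replacement: the function name reformat_line is hypothetical, and it works on the text of each line with re.sub, so the numbers and quoted words are kept exactly as they appear in the input.

```python
import re


def reformat_line(line):
    """Apply the same logic as the awk one-liner to one line of text."""
    if ">>>>" in line:
        # Header line: return the second-to-last whitespace-separated field.
        return line.split()[-2]
    # Replace each "), (" separator with a newline ...
    line = re.sub(r"\), \(", "\n", line)
    # ... then delete the remaining [ ] ( ) characters.
    return re.sub(r"[\[\]()]", "", line)


print(reformat_line("Closest words to: passport >>>>"))
print(reformat_line("[(0.7029703031026393, 'visa'), (0.63595253934734819, 'cis5')]"))
```

As with the awk version, blank lines come back unchanged, so the spacing between blocks is preserved.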

Thanks,
R. Singh