Hello folks
I have a text file of information about journal articles from different fields. I need to convert this information into a format that is easier for computers to manipulate, for some research that I'm doing on how articles are cited. The file has some header information followed by the individual records. For example:
Tue Jun 19 14:07:34 EDT 2012
CSA
Database: EconLit
Record 1 of 500
DN: Database Name
EconLit
TI: Title
Statistical Modeling of Monetary Policy and Its Effects
AU: Author
Sims, Christopher A
SO: Source
American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205
DE: Descriptors
History of Economic Thought: Macroeconomics (B220); Economic
Methodology (B410); Methodological Issues: General (C180); Business
Fluctuations, Cycles (E320); Prices, Business Fluctuations, and
Cycles: Forecasting and Simulation: Models and Applications (E370);
Monetary Policy (E520); Modeling; Monetary; Monetary Policy; Policy
PY: Publication Year
2012
Record 2 of 500
DN: Database Name
EconLit
TI: Title
Targeting the Poor: Evidence from a Field Experiment in Indonesia
AU: Author
Alatas, Vivi; Banerjee, Abhijit; Hanna, Rema; Olken, Benjamin A;
Tobias, Julia
SO: Source
American Economic Review, vol. 102, no. 4, June 2012, pp. 1206-40
DE: Descriptors
Field Experiments (C930); Measurement and Analysis of Poverty (I320);
Welfare and Poverty: Government Programs, Provision and Effects of
Welfare Programs (I380); Microeconomic Analyses of Economic
Development (O120); Economic Development: Human Resources, Human
Development, Income Distribution, Migration (O150); Economic
Development: Urban, Rural, Regional, and Transportation Analysis,
Housing, Infrastructure (O180); Urban, Rural, Regional, and
Transportation Economics: Regional Migration, Regional Labor Markets,
Population, Neighborhood Characteristics (R230); Indonesia; Asia;
Experiment; Experiments; Field Experiment; Poor; Poverty; Village
PY: Publication Year
2012
.
.
.
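The layout above (a two-letter tag line such as "TI: Title", then one or more lines of value text, with "Record N of 500" opening each record) could be read into a list of dictionaries roughly as follows. This is only a sketch in Python rather than awk or sed. The names `parse_records`, `FIELD_HEADER` and `RECORD_START` are made up for illustration, and it assumes that no line of value text itself starts with two capital letters and a colon, which holds for the sample but is not confirmed for the whole file.

```python
import re

# Matches a field header such as "TI: Title" or "DE: Descriptors".
FIELD_HEADER = re.compile(r"^([A-Z]{2}): \S")
# Matches a record separator such as "Record 2 of 500".
RECORD_START = re.compile(r"^Record \d+ of \d+$")


def parse_records(lines):
    """Return a list of dicts mapping two-letter tags to joined field text."""
    records = []
    current = None  # dict for the record being read; None while in the file header
    tag = None      # tag of the field currently being filled
    for raw in lines:
        line = raw.strip()
        if RECORD_START.match(line):
            current = {}
            records.append(current)
            tag = None
        elif current is None or not line:
            continue  # skip the file header and any blank lines
        elif FIELD_HEADER.match(line):
            tag = line[:2]
            current[tag] = ""
        elif tag is not None:
            # Value or continuation line: join wrapped text with a single space.
            current[tag] = (current[tag] + " " + line).strip()
    return records
```

Any iterable of lines works as input, so an open file object can be passed directly. Wrapped fields such as the long author list in Record 2 come back as a single string.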
My goal is to convert this information into CSV format like so:
"TITLE","AUTHOR(S)","SOURCE","DESCRIPTOR CODES ONLY",PUBLICATION YEAR
So the above should turn into
"Statistical Modeling of Monetary Policy and Its Effects","Sims, Christopher A","American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205","B220,B410,C180,E320,E370,E520",2012
"Targeting the Poor: Evidence from a Field Experiment in Indonesia","Alatas, Vivi; Banerjee, Abhijit; Hanna, Rema; Olken, Benjamin A; Tobias, Julia","American Economic Review, vol. 102, no. 4, June 2012, pp. 1206-40","C930,I320,I380,O120,O150,O180,R230",2012
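The quoting in these two lines (text fields in double quotes, the year bare) happens to match what Python's standard `csv` module produces with `csv.QUOTE_NONNUMERIC`, provided the year is passed as an integer. A small sketch, where `to_csv` and `rows` are illustrative names only and the row is copied from the first record above:

```python
import csv
import io

rows = [
    (
        "Statistical Modeling of Monetary Policy and Its Effects",
        "Sims, Christopher A",
        "American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205",
        "B220,B410,C180,E320,E370,E520",
        2012,  # an int, so QUOTE_NONNUMERIC leaves it unquoted
    ),
]


def to_csv(rows):
    """Render rows as CSV text, quoting every non-numeric field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(rows), end="")
```

Letting the module do the quoting also covers titles that contain a double quote, which it escapes by doubling.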
Note that some descriptors do not have codes (e.g. 'Modeling' near the end of the first record's descriptors). The script needs to drop those descriptors and keep only the four-character alphanumeric codes in parentheses, such as B220.
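The code-extraction step on its own might look like this in Python. It is a sketch that assumes every code is one capital letter followed by three digits inside parentheses, which is true of the sample records but not confirmed for the whole file. `CODE` and `descriptor_codes` are hypothetical names.

```python
import re

# One uppercase letter followed by three digits, wrapped in parentheses.
CODE = re.compile(r"\(([A-Z]\d{3})\)")


def descriptor_codes(descriptors):
    """Return the bracketed codes from a descriptor string, comma-separated."""
    return ",".join(CODE.findall(descriptors))


first_record = (
    "History of Economic Thought: Macroeconomics (B220); Economic "
    "Methodology (B410); Methodological Issues: General (C180); Business "
    "Fluctuations, Cycles (E320); Prices, Business Fluctuations, and "
    "Cycles: Forecasting and Simulation: Models and Applications (E370); "
    "Monetary Policy (E520); Modeling; Monetary; Monetary Policy; Policy"
)
print(descriptor_codes(first_record))
```

Descriptors without a code, such as Modeling or Policy, simply never match the pattern, so nothing extra is needed to drop them.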
I am sure this is a fairly simple task for awk or sed, but I am not as proficient in either as I should be. I'd be grateful if someone could help out with this.