Converting txt file into CSV using awk or sed

ksk · July 15, 2012, 1:54pm

Hello folks

I have a txt file of information about journal articles from different fields. I need to convert this information into a format that is easier for computers to manipulate for some research that I'm doing on how articles are cited. The file has some header information and then details of records. For example,

Tue Jun 19 14:07:34 EDT 2012
CSA
Database: EconLit

Record 1 of 500

DN: Database Name
    EconLit
TI: Title
    Statistical Modeling of Monetary Policy and Its Effects
AU: Author
    Sims, Christopher A
SO: Source
    American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205
DE: Descriptors
    History of Economic Thought: Macroeconomics (B220); Economic
    Methodology (B410); Methodological Issues: General (C180); Business
    Fluctuations, Cycles (E320); Prices, Business Fluctuations, and
    Cycles: Forecasting and Simulation: Models and Applications (E370);
    Monetary Policy (E520); Modeling; Monetary; Monetary Policy; Policy
PY: Publication Year
    2012

Record 2 of 500

DN: Database Name
    EconLit
TI: Title
    Targeting the Poor: Evidence from a Field Experiment in Indonesia
AU: Author
    Alatas, Vivi; Banerjee, Abhijit; Hanna, Rema; Olken, Benjamin A;
    Tobias, Julia
SO: Source
    American Economic Review, vol. 102, no. 4, June 2012, pp. 1206-40
DE: Descriptors
    Field Experiments (C930); Measurement and Analysis of Poverty (I320);
    Welfare and Poverty: Government Programs, Provision and Effects of
    Welfare Programs (I380); Microeconomic Analyses of Economic
    Development (O120); Economic Development: Human Resources, Human
    Development, Income Distribution, Migration (O150); Economic
    Development: Urban, Rural, Regional, and Transportation Analysis,
    Housing, Infrastructure (O180); Urban, Rural, Regional, and
    Transportation Economics: Regional Migration, Regional Labor Markets,
    Population, Neighborhood Characteristics (R230); Indonesia; Asia;
    Experiment; Experiments; Field Experiment; Poor; Poverty; Village
PY: Publication Year
    2012

.
.
.

My goal is to convert this information into CSV format like so:

"TITLE","AUTHOR(S)","SOURCE","DESCRIPTOR CODES ONLY",PUBLICATION

So the above should turn into

"Statistical Modeling of Monetary Policy and Its Effects","Sims, Christopher A","American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205","B220,B410,C180,E320,E370,E520",2012
"Targeting the Poor: Evidence from a Field Experiment in Indonesia","Alatas, Vivi; Banerjee, Abhijit; Hanna, Rema; Olken, Benjamin A; Tobias, Julia","American Economic Review, vol. 102, no. 4, June 2012, pp. 1206-40","C930,I320,I380,O120,O150,O180,R230",2012

Note that there are some descriptors that do not have codes, (eg. `Modeling' at the end of the first record). The code needs to drop those descriptors and only include the 4 character/numeric codes in brackets.

I am certain this is a fairly simple task for either awk or sed, but I don't know either with the proficiency that I should. I'd be grateful if someone can out with this.

agama · July 15, 2012, 3:28pm

One way:

awk '
    function printit( )
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            for( x in stuff )
                if( index( stuff[x], "," ) )            # must quote if it has commas
                    stuff[x] = "\"" stuff[x] "\"";
            printf( "%s,%s,%s,%s,%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }

    /^Record/ { printit(); delete stuff; n = ""; next; }
    /TI:/ { n = "title"; next; }
    /DN:/ { n = "dname"; next; }
    /AU:/ { n = "auth"; next; }
    /DE:/ { n = "des"; next; }
    /PY:/ { n = "year"; next; }
    /SO:/ { n = "source"; next; }
    /^..:/ { n = ""; next; }

    n == "" { next; }

    n == "des" {
        split( $0, a, ";" );
        for( i = 1; i <= length( a ); i++ )
        {
            if( substr( a, length(a) ) == ")" )
            {
                l = split( a, b, " " );
                stuff[n] = stuff[n] substr( b[l], 2, length( b[l] ) - 2 ) ",";
            }
        }
        next;
    }

    {
        sub( "^ +", "" );
        stuff[n] = stuff[n] $0 " "; 
        next; 
    }

    END { printit(); }

' input-file >output.csv

Assumption made: descriptors were always the last token before the semicolon, and in parens.

ksk · July 15, 2012, 3:35pm

Hi agama, I just gave your code a try. It says

awk: line 27: illegal reference to array a

.

agama · July 15, 2012, 3:45pm

What version of awk are you running? (run `awk --version` to see)

---------- Post updated at 15:45 ---------- Previous update was at 15:42 ----------

This might be a fix (assuming it was length( a ) that awk was complaining about).

awk '
    function printit( )
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            for( x in stuff )
                if( index( stuff[x], "," ) )            # must quote if it has commas
                    stuff[x] = "\"" stuff[x] "\"";
            printf( "%s,%s,%s,%s,%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }

    /^Record/ { printit(); delete stuff; n = ""; next; }
    /TI:/ { n = "title"; next; }
    /DN:/ { n = "dname"; next; }
    /AU:/ { n = "auth"; next; }
    /DE:/ { n = "des"; next; }
    /PY:/ { n = "year"; next; }
    /SO:/ { n = "source"; next; }
    /^..:/ { n = ""; next; }

    n == "" { next; }

    n == "des" {
        al = split( $0, a, ";" );
        for( i = 1; i <= al; i++ )
        {
            if( substr( a, length(a) ) == ")" )
            {
                l = split( a, b, " " );
                stuff[n] = stuff[n] substr( b[l], 2, length( b[l] ) - 2 ) ",";
            }
        }
        next;
    }

    {
        sub( "^ +", "" );
        stuff[n] = stuff[n] $0 " ";
         next; 
    }

    END { printit(); }

'

ksk · July 15, 2012, 3:58pm

'awk --version' gives 'mawk: not an option: --version'. 'man awk' gives the man page for mawk. Should I get something else?

---------- Post updated at 01:19 AM ---------- Previous update was at 01:16 AM ----------

Hi agama, the fix makes it work almost perfectly. Oddly, the CSV file has the title in quotes sometimes but not other times, and it doesn't seem to follow any pattern. Should I attach the txt file and csv for reference?

---------- Post updated at 01:28 AM ---------- Previous update was at 01:19 AM ----------

Ah, some of the title entries have quotes because they have commas in them. Can I modify the code so it puts all title entries in quotes by default? I tried removing

if( index( stuff[x], "," ) )            # must quote if it has commas

. That puts quotes on everything which is fine except then I need the quotes to disappear on the date field.

agama · July 15, 2012, 4:05pm

Very good.

Adding quotes regardless, except for the date field, with a small change to the print function:

function printit( )
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            printf( "\"%s\",\"%s\",\"%s\",\"%s\",%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }

Justs adds quotes to all things in the printf(), so it's a bit more efficient this way too.

ksk · July 15, 2012, 4:36pm

Thanks agama! I wonder if I can get your advice on wrapping this all up into a nice shell script. I have several of these text files that are generated through another script and land up in a folder. Would it be possible for me to write a .sh file that performs this awk operation on each text file in the directory that I'm in, and append everything into a large CSV file?
Once I'm done with this, there is one last step, but it might be easier to explain in a separate thread or perhaps once I am to get this master CSV file.

Thanks again for you help!

alister · July 15, 2012, 4:58pm

It's never a good idea to withhold information. Better too much than too little. I've lost count of how many times I've seen someone ask for help on some very specific task, and then, later add another requirement which makes it clear that a completely different approach would have been simpler, more efficient and more maintainable. When that happens, it's a waste of time for everyone involved, both the member seeking help and the members offering help.

Whatever it is, I'm sure agama can handle it.

Regards,
Alister

agama · July 15, 2012, 6:30pm

Here is a simple wrapper that reads all files in the directory and generates one csv file. Accepts the name of the output file on the command line. So if your script is files2csv and you wish the output to go to /tmp/listing.csv then the command would be: files2csv /tmp/listing.csv .

#/usr/bin/env ksh

# read the named file and output a csv of the desired data
# same awk programme  as in previous posts
function f2csv
{
    awk '
    function printit( )         # always add quotes except for the year field
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            printf( "\"%s\",\"%s\",\"%s\",\"%s\",%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }

    /^Record/ { printit(); delete stuff; n = ""; next; }
    /TI:/ { n = "title"; next; }
    /DN:/ { n = "dname"; next; }
    /AU:/ { n = "auth"; next; }
    /DE:/ { n = "des"; next; }
    /PY:/ { n = "year"; next; }
    /SO:/ { n = "source"; next; }
    /^..:/ { n = ""; next; }

    n == "" { next; }

    n == "des" {
        al = split( $0, a, ";" );
        for( i = 1; i <= al; i++ )
        {
            if( substr( a, length(a) ) == ")" )
            {
                l = split( a, b, " " );
                stuff[n] = stuff[n] substr( b[l], 2, length( b[l] ) - 2 ) ",";
            }
        }
        next;
    }

    { 
        sub( "^ +", "" );
        stuff[n] = stuff[n] $0 " "; next; }

    END { printit(); }
    
    ' $1
}


ls | while read file        # read all files in the directory and run them through the awk
do
    if [[ $file != $1 ]]
    then
        f2csv $file
    fi
done >>$1

Yes, all files could be given on the awk command line, but this allows some control if not all files are needed, or other checking (old/new etc.) needs to be done to refine the files actually processed.