Using awk to isolate specific rows

sidiqmk · October 7, 2011, 5:35am

Hi all!

Let's say I have obtained this dataset from the ISI Web of Knowledge

...
PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
   Zak, Ali Khorsand
TI The effect of source temperature on morphological and optical properties
   of ZnO nanowires grown using a modified thermal evaporation set-up
SO CURRENT APPLIED PHYSICS
VL 11
IS 3
BP 767
EP 770
DI 10.1016/j.cap.2010.11.061
PD MAY 2011
PY 2011
TC 2
Z9 2
SN 1567-1739
UT WOS:000288183300097
ER

PT J
AU Ooi, C. H. Raymond
TI Conversion of heat to light using Townes' maser-laser engine: Quantum
   optics and thermodynamic analysis
SO PHYSICAL REVIEW A
VL 83
IS 4
AR 043838
DI 10.1103/PhysRevA.83.043838
PD APR 29 2010
PY 2010
TC 0
Z9 0
SN 1050-2947
UT WOS:000290107500018
ER
...

This is just a snippet. I would like to place each entry (from PT J to ER is considered an entry) into a separate file according to year.. Therefore

PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
   Zak, Ali Khorsand
TI The effect of source temperature on morphological and optical properties
   of ZnO nanowires grown using a modified thermal evaporation set-up
SO CURRENT APPLIED PHYSICS
VL 11
IS 3
BP 767
EP 770
DI 10.1016/j.cap.2010.11.061
PD MAY 2011
PY 2011
TC 2
Z9 2
SN 1567-1739
UT WOS:000288183300097
ER

will be in the file named 2011 and so forth. I tried to do it with this

i=1990
while [ "$i" -lt 2011 ]
        do
                gawk '/AU/ , /'PY "$i"'/' savedrecs.txt savedrecs2.txt > "$i"
                ((i+=1))
        done

where savedrecs are the files containing the raw data from the database but it won't work because gawk will just keep dumping to $i until it meets the right PY "$i"..

Any ideas? Thanks in advance..

radoulov · October 7, 2011, 5:56am

awk '{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }
/^ER/ {
  print r >> y
  r = x; close(y)
  }' infile

Let me know if you want to discard data outside PT J and ER.
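For reference, here is a minimal sketch of a variant that discards everything outside PT J ... ER. The sample input, the temporary directory and the names inrec, dir and out are assumptions made for illustration; the FN and EF lines just stand in for whatever surrounds the records in a real export.

```shell
# Made-up sample: two records plus stray lines outside PT ... ER.
dir=$(mktemp -d)
cat > "$dir/sample.txt" <<'EOF'
FN stray header line
PT J
AU Yousefi, Ramin
PY 2011
ER

PT J
AU Ooi, C. H. Raymond
PY 2010
ER

EF
EOF

awk -v dir="$dir" '
/^PT /          { inrec = 1; r = ""; y = "" }   # record starts: reset buffer and year
inrec           { r = r $0 "\n" }               # buffer lines only inside PT ... ER
inrec && /^PY / { y = $2 }                      # remember the publication year
/^ER/ && inrec  {
  if (y != "") {
    out = dir "/" y
    printf "%s\n", r >> out                     # extra newline separates the records
    close(out)
  }
  inrec = 0
}' "$dir/sample.txt"

ls "$dir"
```

Each year file then holds only complete records, and lines seen while inrec is 0 never reach the buffer.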

sidiqmk · October 7, 2011, 6:09am

Hi thanks for replying..

I don't really understand what's going on but I'll pick it up later.. However when I ran the commands, it seems like only 1 occurrence will be placed in a file.. There should be more as there are more than 900 entries..

I'm sorry for inconveniencing you and thanks for the help again..

radoulov · October 7, 2011, 6:13am

OK,
I edited my post and modified the script in order to handle multiple entries (you should remove the previously created files before rerunning the script, though).

This version will be more efficient, but with it some awk implementations may hit the limit on the number of concurrently open files:

awk '{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }
/^ER/ {
  print r > y
  r = x
  }' infile

sidiqmk · October 7, 2011, 6:22am

Thanks a lot.. Works like a charm now.. I will now go find out what these commands mean..

radoulov · October 7, 2011, 6:44am

I'll try to explain

I'll take the first one, because it's more complicated.

This is the first action block:

{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }

While reading every[1] input record (most awk code outside of the BEGIN/END/BEGINFILE/ENDFILE special patterns is wrapped in an implicit loop):

/^PY/ && y = $2

When the current record matches the regular expression above (i.e. begins with the string PY), set the content of the variable y to the value of the second field (the year).

r = r ? r RS $0 : $0

Store all the records in the variable r. This statement uses the ternary operator,
in pseudo-code:

expression ? if_true_return_this : otherwise_return_this

It simply appends all the records to the variable r. This approach is rather fragile: in some situations you will lose data (when the first line is empty, or when it contains only the digit 0 (zero), because r then evaluates as false and gets overwritten by the next line). Let me know if you need a more robust solution.
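A possible more robust form, as a sketch only: count the buffered lines instead of testing the value of r. The counter n and the three-line input are made up for the comparison.

```shell
# Input whose first line is the single digit 0.
fragile=$(printf '0\nPY 2011\nER\n' | awk '
  { r = r ? r RS $0 : $0 }           # tests the value of r
  /^ER/ { print r; r = "" }')

robust=$(printf '0\nPY 2011\nER\n' | awk '
  { r = (n++ ? r RS : "") $0 }       # tests how many lines were buffered
  /^ER/ { print r; r = ""; n = 0 }')

printf 'fragile:\n%s\nrobust:\n%s\n' "$fragile" "$robust"
```

The fragile version drops the leading 0 line, while the counter version keeps all three lines because it never looks at the content of r.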

The next pattern/action pair:

/^ER/ {
  print r >> y
  r = x; close(y)
  }

If the current input record matches ^ER, append (>>) the content of the r variable to the file named y (the current year of the logical record, the previously saved value).
Reset r (x is an uninitialized variable; in awk those are NULL (when used as strings) or 0 (when used as numbers)).
Close the current output file, y (needed with some awk implementations on certain systems, which otherwise run into the open files limit).

[1] I say every record because in that rule the pattern part is missing, so the action is executed for every record.
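To see why the version with close() uses >> rather than >, here is a small sketch. The file names and the three-line input are made up; the output file is passed in through the variable f.

```shell
dir=$(mktemp -d)

# ">" plus close(): the file is truncated on every reopen, so only the last line survives.
printf 'a\nb\nc\n' | awk -v f="$dir/trunc.txt"  '{ print > f;  close(f) }'

# ">>" plus close(): every reopen appends.
printf 'a\nb\nc\n' | awk -v f="$dir/append.txt" '{ print >> f; close(f) }'

# ">" without close(): truncated once at the first print, then kept open.
printf 'a\nb\nc\n' | awk -v f="$dir/open.txt"   '{ print > f }'

wc -l "$dir"/*.txt
```

This is also why files left over from an earlier run have to be removed before rerunning a script that uses >>.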

Hope this helps.

sidiqmk · October 8, 2011, 1:44am

Hi thanks for the tutorial.. I really got the picture, save the r RS $0 part but it's ok I'll look it up some other time.. Now I've tried to use your script to collect articles by author so let's say I'd like to collect all of Mr Yousefi's papers into one file, I'd type

gawk '{
  /"Yousefi.*$"/ && y = yousefi
  r = r ? r RS $0 : $0
  }
/^ER/ {
  print r > y
  r = x
  }' savedrecs.txt savedrecs2.txt

I'm pretty sure I'm doing it wrong because the error I get is
gawk: cmd. line:6: (FILENAME=savedrecs.txt FNR=21) fatal: expression for `>' redirection has null string value
which means y is not initialized I think..

Please helppp.. Thanks again..

radoulov · October 8, 2011, 4:30am

You were close.

y = yousefi

Here yousefi is just another uninitialized variable though,
you'll need:

y = "yousefi"

Note the quotes. In the latter example "yousefi" is the string yousefi,
in the former - it's an identifier (a variable name).
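A quick way to see the difference (just an illustration, nothing here depends on the input files):

```shell
out=$(awk 'BEGIN {
  y = yousefi                      # yousefi is an uninitialized variable, so y is empty
  printf "unquoted: [%s]\n", y
  y = "yousefi"                    # a string constant
  printf "quoted:   [%s]\n", y
}')
printf '%s\n' "$out"
```

The empty value in the first case is what gawk complains about when it is used as the target of a redirection.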

Regarding this part:

/"Yousefi.*$"/

If your input file looks like this:

AU Yousefi, Ramin

This should be sufficient:

/Yousefi/ && y = "yousefi"

This should work:

gawk '{
  /Yousefi/ && y = "yousefi"
  r = r ? r RS $0 : $0
  }
/^ER/ && y {     # check if the previous record 
  print r > y    # contains yousefi
  r = y = x      # reset the flag: set both r and y to NULL 
  }' savedrecs.txt savedrecs2.txt

sidiqmk · October 8, 2011, 5:00am

I think it just merged both savedrecs files together into the file yousefi..

radoulov · October 8, 2011, 5:33am

Edit: See comments below.

Are you sure you've used the last version?
This one:

gawk '{
  /Yousefi/ && y = "yousefi"
  r = r ? r RS $0 : $0
  }
/^ER/ && y {   
  print r > y  
  r = y = x    
  }' savedrecs.txt savedrecs2.txt

radoulov · October 8, 2011, 5:42am

Yes,
sorry, we need to reset y at every ER

gawk '{     
  /Yousefi/ && y = "yousefi"
  r = r ? r RS $0 : $0
  }
/^ER/ {     
  if (y) print r > y    
  r = y = x      
  }' infile

sidiqmk · October 8, 2011, 6:10am

Now the first article belongs to Mr Yousefi but the sequential ones are as though it was appended from the contents of both input files..

Hey thanks again for helping me.. I added the input file just in case you needed to tinker..

radoulov · October 8, 2011, 7:28am

It seems correct to me: it prints all the blocks/logical records that contain the string Yousefi.
Anyway, given the real input file, the task is much easier

awk '/Yousefi/ { print > "Yousefi" }' RS= ORS='\n\n' savedrecs.txt
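With RS set to the empty string awk reads in paragraph mode: every block of lines separated by blank lines is one record, so $0 is a whole entry, and ORS='\n\n' puts the blank line back on output. A sketch on a made-up three-entry sample (the file names, the out variable and the counter n are assumptions for illustration):

```shell
dir=$(mktemp -d)
cat > "$dir/sample.txt" <<'EOF'
PT J
AU Yousefi, Ramin
PY 2011
ER

PT J
AU Ooi, C. H. Raymond
PY 2010
ER

PT J
AU Zak, Ali Khorsand
   Yousefi, Ramin
PY 2010
ER
EOF

summary=$(awk -v out="$dir/Yousefi" '
/Yousefi/ { print > out; n++ }                       # $0 is the whole entry here
END       { printf "%d of %d entries matched\n", n, NR }
' RS= ORS='\n\n' "$dir/sample.txt")
printf '%s\n' "$summary"
```

Note that the pattern matches anywhere in the entry, so a co-author on a continuation line counts as well.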

If you need something different, please post an example of the desired output based on the posted input.

radoulov · October 9, 2011, 3:02am

What exactly do you want to do with the second input file savedrecs2.txt?

kvmreddy · October 9, 2011, 3:59am

Thanks, nice explanation.

sidiqmk · October 9, 2011, 4:52am

Hey thanks a lot for your help.. Works like a charm now.. I'm now learning how to place in a loop and array a specific list of lecturers to be extracted from the code that you've provided me.. I'll post it here if it works right, or yelp for help if it doesn't work right..

The second input file contains data which is similar to savedrecs.txt.. It's because the ISI Web of Science only allows 500 articles per text file so it has to be broken up to two as there are almost 900 articles..

sidiqmk · October 30, 2011, 11:28pm

Hi, sorry for the late reply.. I couldn't set out to complete what I was going to do, so I took the easy way out and made a clunky script.. I'm just posting it to have somewhat completeness to the posts..

A. I create a directory called nominal, put the input file in it and get rid of records starting with AU, ED and BE. These records are inconsequential and would create problems later if included.

rm -r nominal
mkdir nominal
cd nominal
cp ../savedrecs.txt ../savedrecs2.txt .
cat savedrecs2.txt >> savedrecs.txt
rm savedrecs2.txt
sed 's/AU/ /g' savedrecs.txt > savedrecsa.txt
sed /ED/d savedrecsa.txt > savedrecsb.txt
sed /BE/d savedrecsb.txt > savedrecsc.txt
mv savedrecsc.txt savedrecs.txt
rm savedrecsa.txt savedrecsb.txt
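One thing to watch: the patterns above are not anchored, so /ED/d also deletes any line that merely contains ED, for example SO CURRENT APPLIED PHYSICS. A possible single-pass alternative, assuming the intent is to blank out the AU tag and drop only the lines tagged ED or BE (the sample lines and file names are made up):

```shell
dir=$(mktemp -d)
cat > "$dir/in.txt" <<'EOF'
PT J
AU Yousefi, Ramin
ED Hypothetical, Editor
SO CURRENT APPLIED PHYSICS
BE Hypothetical, Book Editor
ER
EOF

# ^ ties each pattern to the two-letter tag at the start of the line.
sed -e 's/^AU /   /' -e '/^ED /d' -e '/^BE /d' "$dir/in.txt" > "$dir/out.txt"
cat "$dir/out.txt"
```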

B. I make a directory for each professor and dump all articles containing their name into the file AllPubs. You have to be careful with the syntax, though.

mkdir Arof.AK
cd Arof.AK
awk '/Arof, A.*K.*$/ { print > "AllPubs" }' RS= ORS='\n\n' ../savedrecs.txt 
awk '/AROF, A.*K.*$/ { print >> "AllPubs" }' RS= ORS='\n\n' ../savedrecs.txt

C. From AllPubs, split the articles into one file per publication year.

gawk '{ 
  /^PY/ && y=$2
  r = r ? r RS $0 : $0  
 }
/^ER/ {
  print r > y
  r = x;
  }' AllPubs

D. Count the number of publications each year by the number of times the professor's name is mentioned.

i=1980
while [ "$i" -lt 2012 ]
do
    grep "Arof, A.*K.*$" "$i" > names
    grep "AROF, A.*K.*$" "$i" >> names
    gawk '{nama[$1 $2 $3]++}
    END {for (name in nama) print name, nama[name]}
    ' names > counted
    rm names

    gawk '
    BEGIN   {
        secondcol=0;
        }
        {
        secondcol+=$2;
        }
    END     {
        printf "Arof, AK %d\n",secondcol;
        }
    ' counted > "counted$i"
    rm counted

E. This appends the number of publications for each year to a file called ByYear.

    gawk '{print '"$i"', $3}' counted$i >> ByYear
    rm "counted$i"
    ((i+=1))
done

cd ..

F. And so on..

mkdir Shrivastava.KN
cd Shrivastava.KN
awk '/Shrivastava, K.*N.*$/ { print > "AllPubs" }' RS= ORS='\n\n' ../savedrecs.txt

gawk '{ 
  /^PY/ && y=$2
  r = r ? r RS $0 : $0  
 }
/^ER/ {
  print r > y
  r = x;
  }' AllPubs

i=1980
while [ "$i" -lt 2012 ]
do
    grep "Shrivastava, K.*N.*$" "$i" > names
    gawk '{nama[$1 $2 $3]++}
    END {for (name in nama) print name, nama[name]}
    ' names > counted
    rm names
    
    gawk '
    BEGIN   {
        secondcol=0;
        }
        {
        secondcol+=$2;
        }
    END     {
        printf "Shrivastava, KN %d\n",secondcol;
        }
    ' counted > "counted$i"
    rm counted
    
    gawk '{print '"$i"', $3}' counted$i >> ByYear
    rm "counted$i"
    ((i+=1))
done

cd ..


mkdir Kwek.KH
cd Kwek.KH
awk '/Kwek, K.*H.*$/ { print > "AllPubs" }' RS= ORS='\n\n' ../savedrecs.txt 
awk '/KWEK, K.*H.*$/ { print >> "AllPubs" }' RS= ORS='\n\n' ../savedrecs.txt 

gawk '{ 
  /^PY/ && y=$2
  r = r ? r RS $0 : $0  
 }
/^ER/ {
  print r > y
  r = x;
  }' AllPubs

i=1980
while [ "$i" -lt 2012 ]
do
    grep "Kwek, K.*H.*$" "$i" > names
    grep "KWEK, K.*H.*$" "$i" >> names
    gawk '{nama[$1 $2 $3]++}
    END {for (name in nama) print name, nama[name]}
    ' names > counted
    rm names

    gawk '
    BEGIN   {
        secondcol=0;
        }
        {
        secondcol+=$2;
        }
    END     {
        printf "Kwek, KH %d\n",secondcol;
        }
    ' counted > "counted$i"
    rm counted
    
    gawk '{print '"$i"', $3}' counted$i >> ByYear
    rm "counted$i"
    ((i+=1))
done

cd ..

I guess it would be much more elegant if all the professors' names were kept in an array and handled by a single while loop.. I'll try to learn how to do this later.. Thanks..
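A rough sketch of how that could look, with the names kept in a file and a single awk pass doing the counting. Everything here is an assumption for illustration: names.txt (one upper-case pattern per line, surname plus first initial), the three-entry sample, and the tab-separated output. It counts entries per professor per year rather than name occurrences.

```shell
dir=$(mktemp -d)
cat > "$dir/names.txt" <<'EOF'
AROF, A
KWEK, K
EOF
cat > "$dir/sample.txt" <<'EOF'
PT J
AU Arof, A. K.
PY 2010
ER

PT J
AU AROF, AK
   Kwek, K. H.
PY 2010
ER

PT J
AU Kwek, KH
PY 2011
ER
EOF

counts=$(awk '
NR == FNR { pat[++np] = $0; next }        # first file: one pattern per line
{
  year = ""
  n = split($0, line, "\n")               # paragraph mode: $0 is a whole entry
  for (i = 1; i <= n; i++)
    if (line[i] ~ /^PY /) year = substr(line[i], 4)
  if (year == "") next                    # skip entries without a PY line
  rec = toupper($0)                       # so Arof and AROF match alike
  for (p = 1; p <= np; p++)
    if (rec ~ pat[p]) count[pat[p] "\t" year]++
}
END { for (k in count) print k "\t" count[k] }
' "$dir/names.txt" RS= "$dir/sample.txt" | sort)
printf '%s\n' "$counts"
```

RS= is given between the two file names so that names.txt is still read line by line and only the records file is read in paragraph mode.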

ahamed101 · October 30, 2011, 11:47pm

...
/"Yousefi.*$"/ && y = "yousefi"
...

"y" is the filename here. Since you are now extracting the data by author, you can either hardcode the file name as shown above, or extract the author dynamically and use it as the filename, in which case you will get the entries in separate files named after the authors.
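A sketch of the dynamic variant, assuming the file should be named after the surname on the AU line (the first author only; co-authors on continuation lines are ignored). The sample, the dir variable and the file-name clean-up are made up for illustration.

```shell
dir=$(mktemp -d)
cat > "$dir/sample.txt" <<'EOF'
PT J
AU Yousefi, Ramin
   Zak, Ali Khorsand
PY 2011
ER

PT J
AU Ooi, C. H. Raymond
PY 2010
ER

PT J
AU Yousefi, Ramin
PY 2010
ER
EOF

awk -v dir="$dir" '
{
  n = split($0, line, "\n")                 # paragraph mode: $0 is a whole entry
  for (i = 1; i <= n; i++)
    if (line[i] ~ /^AU /) {
      name = substr(line[i], 4)
      sub(/,.*/, "", name)                  # keep the surname only
      gsub(/[^A-Za-z0-9_-]/, "_", name)     # keep the file name safe
      out = dir "/" name
      print >> out                          # append the entry to that author file
      close(out)
      break
    }
}' RS= ORS='\n\n' "$dir/sample.txt"

ls "$dir"
```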
--ahamed