How to extract data from a huge file?

srsahu75 · January 18, 2008, 1:21am

Hi,
I have a huge file of bibliographic records in a standard format. I need a script to do the following repeatable task:

Create a folder for each string in the input file that starts with "item_".
Create a file named "contents" in each folder, containing the line "license.txt<tab>bundle:LICENSE", where <tab> is a tab character (\t).
Create a file named "dublin_core.xml" in each "item_" folder, holding the text that follows that "item_" string in the input file. The text to extract starts with <dublin_core schema="dc"> and ends with </dublin_core>.

Here are some sample records from the file:

item_3908
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Fernandes, A.A.</dcvalue>
<dcvalue element="contributor" qualifier="author">Sarma, Y.V.B.</dcvalue>
<dcvalue element="title" qualifier="none">Directional spectrum of ocean waves</dcvalue>
<dcvalue element="date" qualifier="issued">2000</dcvalue>
<dcvalue element="publisher" qualifier="none">GET PUB</dcvalue>
<dcvalue element="identifier" qualifier="citation">Ocean Eng., Vol.27; 345-363p.</dcvalue>
</dublin_core>
/eprints/Ocean_Eng_27_345.pdf
item_3911
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Phatarpekar, P.V.</dcvalue>
<dcvalue element="title" qualifier="none">A comparative study on growth performance</dcvalue>
<dcvalue element="identifier" qualifier="citation">Aquaculture, Vol.181; 141-155p.</dcvalue>
<dcvalue element="type" qualifier="none">Journal Article</dcvalue>
<dcvalue element="language" qualifier="iso">en</dcvalue>
<dcvalue element="subject" qualifier="none">polyculture</dcvalue>
</dublin_core>
/eprints/Aquaculture_181_141.pdf
item_3921
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Rao, B.R.</dcvalue>
<dcvalue element="contributor" qualifier="author">Veerayya, M.</dcvalue>
<dcvalue element="title" qualifier="none">Influence of marginal highs on the accumulation</dcvalue>
<dcvalue element="description" qualifier="abstract">Twenty five surficial sediment samples were</dcvalue>
</dublin_core>
/eprints/Deep-Sea_Res_(II)_47_303.pdf

Thanks & Regards

matrixmadhan · January 18, 2008, 3:36am

When you say bibliographic records, what format are they encoded in? UNIMARC, MARC, or something like that?

Is the sample you posted an extract of the bib records?

Do you need to extract the information between the main tags (inclusive of the tags), starting from

<dublin_core schema="dc">
and ending at
</dublin_core>?

dennis.jacob · January 18, 2008, 3:43am

Try this:

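# "test" is the input file. Create one folder per item_ line, then fill it.
grep '^item_' test | xargs mkdir -p
for each in $(grep '^item_' test)
do
  # Find the exact item line, then print only its <dublin_core> block (tags included).
  awk -v id="$each" '
    $0 == id                  { found = 1; next }
    found && /<dublin_core/   { copy = 1 }
    copy                      { print }
    copy && /<\/dublin_core>/ { exit }
  ' test > "./${each}/dublin_core.xml"
  printf 'license.txt\tbundle:LICENSE\n' > "./${each}/contents"   # \t becomes a real tab
done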
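
For a really huge file, note that the loop above reads the whole input once for every item. A single-pass variant could look like the sketch below. It assumes that every item line has the form item_<digits> and is followed by one <dublin_core> block, which matches the sample records but has not been checked against the full file. The scratch directory and the shortened sample input are only there so the sketch can be run on its own.

```shell
# Sketch: split all records in one pass over the input file.
workdir=$(mktemp -d)          # scratch directory for the demonstration
cd "$workdir" || exit 1

# Shortened sample input, based on the records in the question.
cat > test <<'EOF'
item_3908
<dublin_core schema="dc">
<dcvalue element="title" qualifier="none">Directional spectrum of ocean waves</dcvalue>
</dublin_core>
/eprints/Ocean_Eng_27_345.pdf
item_3911
<dublin_core schema="dc">
<dcvalue element="language" qualifier="iso">en</dcvalue>
</dublin_core>
/eprints/Aquaculture_181_141.pdf
EOF

awk '
/^item_[0-9]+$/ {                        # a new record starts here
    if (out != "") close(out)            # keep the number of open files small
    dir = $0
    system("mkdir -p " dir)
    out = dir "/dublin_core.xml"
    contents = dir "/contents"
    printf "license.txt\tbundle:LICENSE\n" > contents
    close(contents)
    next
}
/<dublin_core/        { inside = 1 }     # opening tag: start copying
inside && out != ""   { print > out }    # copy lines, tags included
/<\/dublin_core>/     { inside = 0 }     # closing tag: stop copying
' test
```

Closing each output file before opening the next one matters when there are thousands of items, since some awk implementations limit how many files can be open at once.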

matrixmadhan · January 18, 2008, 3:45am

The following code snippet logically splits the data file based on the main tag:

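#! /opt/third-party/bin/perl
use strict;
use warnings;

# "a" is the name of the input file in this example.
open(my $fh, "<", "a") or die "Cannot open a: $!";

my $pr = 0;    # 1 while we are inside a <dublin_core> block

while (<$fh>) {
  chomp;
  $pr = 1 if /<dublin_core/;     # opening tag: start printing
  print "$_\n" if $pr;
  if ( /<\/dublin_core>/ ) {     # closing tag: stop printing
    print "\n\n\n";              # blank lines separate the records
    $pr = 0;
  }
}

close($fh);

exit 0;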
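
The same tag-to-tag extraction can also be sketched with a sed address range, which may be handy for a quick check from the command line. The file name records.txt is hypothetical, and unlike the Perl script this prints no blank lines between records.

```shell
# Sketch: print every <dublin_core> ... </dublin_core> block, tags included.
workdir=$(mktemp -d)                     # scratch directory for the demonstration

# records.txt is a hypothetical name; the sample is one record from the question.
cat > "$workdir/records.txt" <<'EOF'
item_3921
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Rao, B.R.</dcvalue>
</dublin_core>
/eprints/Deep-Sea_Res_(II)_47_303.pdf
EOF

# -n suppresses the default output; p prints only lines inside the range.
extracted=$(sed -n '/<dublin_core/,/<\/dublin_core>/p' "$workdir/records.txt")
printf '%s\n' "$extracted"
```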

srsahu75 · January 18, 2008, 3:53am

Yes, I need to extract the information between the main tags (inclusive of the tags), starting from
<dublin_core schema="dc">
to
</dublin_core>

Save the extract as dublin_core.xml in the respective item_* folder, which is named after the string (item_*) that precedes <dublin_core schema="dc">.

And save another file, 'contents', in each folder, with the content license.txt(tab \t)bundle:LICENSE
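
Since the separator in 'contents' has to be a real tab character, here is a small sketch of one way to write that line and check it. The folder name item_3908 is just the first item from the sample records, and the scratch directory exists only for the demonstration.

```shell
# Sketch: write the "contents" line with a real tab and verify the result.
workdir=$(mktemp -d)
mkdir -p "$workdir/item_3908"            # example folder from the sample records

# printf always expands \t to a tab; echo only does so in some shells.
printf 'license.txt\tbundle:LICENSE\n' > "$workdir/item_3908/contents"

# Split the line on tabs to confirm that it holds exactly two fields.
fields=$(awk -F '\t' '{ print NF }' "$workdir/item_3908/contents")
echo "fields: $fields"
```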
