Importing data into a database as snippets

Hi all,

I don't know if this is the right place to post my question, but I'd like some ideas on how to get this done.

I have text files extracted from OCR that need to be split into snippets and imported into a database table with the columns "snippet, date, title". I don't know if shell scripts can do this with the power of grep and regex on Linux, or whether there is any open source or commercial tool that can handle the task.

Thank you

Hi,
Could you provide a sample input file and the expected result?

Hi, here's the sample txt file, shared publicly on Google Drive: 1181.txt

date: September 2015
title: THE MACROECONOMIC PERSPECTIVE: ENSURING THE BUDGET AS AN EFFECTIVE HANDLE AMID PRESSING CHALLENGES
snippet: The proposed 2016 national budget is being considered in the legislative mill amid daunting global challenges, the vagaries of a harsh El Nino phenomenon, and the heightened political uncertainty owing to the looming transition in the reins of government.  Further, China's continued unraveling in recent weeks has spooked global investors, affirmed the persistent global economic slowdown, and exposed the vulnerabilities of emerging markets.

Alternatively, the snippet would be one paragraph of at least 400 to 500 characters ending with a period, or the paragraph cut at a maximum of 500 characters and ended with three dots.

title: at least 50 to 100 characters ending with a period, or the paragraph cut at a maximum of 100 characters and ended with three dots.
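As a taste of the grep-and-regex approach, a GNU grep one-liner like the following could already pull out a "Month Year" date line (a sketch only, assuming the date starts a line in the raw file):

$ grep -Eo -m1 '^[A-Za-z]+ [0-9]{4}' 1181.txt
September 2015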
Thanks for the response

Hi lxdorney,
The script below should fit your request.

#
# data2imp.sh
# we first need to reformat the input file: join the wrapped lines of each
# paragraph into a single line (paragraphs are separated by blank lines)
awk 'NF{
     gsub(/  */," ")
     printf "%s ", $0; istext++; next}   # the blank after %s keeps words from fusing across line breaks
     istext{print ""; istext=0}
     END{if (istext) print ""}' 1181.txt > 1181.res
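To illustrate what this reformatting step does, here is a quick check on a hypothetical two-paragraph input (the trailing space on each output line is harmless):

$ printf 'hello   world\ncontinued line\n\nnext  para\n' |
  awk 'NF{gsub(/  */," "); printf "%s ",$0; istext++; next}
       istext{print ""; istext=0}
       END{if (istext) print ""}'
hello world continued line 
next para 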
# Creation of a sample .csv file, ready for database input
# Take care that there is no semicolon inside the original text. In such a case the field separator should be changed
awk -v OFS=";" '
NR==1 {DATE=$1 " " $2}
NR==2 { if (length($0) <= 100) TITLE = $0
        else {
           # look for a dot between positions 51 and 100 only
           dotposition=index(substr($0,51),".")
           if (dotposition == 0 || dotposition > 50) {
             TITLE = substr($0,1,100) "..."}
           else {
             TITLE = substr($0,1,50 + dotposition)
           }
        }
       }
NR==4 { if (length($0) <= 500) SNIPPET = $0
        else {
           # look for a dot between positions 401 and 500 only
           dotposition=index(substr($0,401),".")
           if (dotposition == 0 || dotposition > 100) {
             SNIPPET = substr($0,1,500) "..."}
           else {
             SNIPPET = substr($0,1,400 + dotposition)
           }
        }
print DATE,TITLE,SNIPPET
}' 1181.res > 1181.csv
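From here, the csv file can be bulk-loaded into the database; a minimal sketch, assuming SQLite as the target (database, table and column names are illustrative):

sqlite3 snippets.db <<'EOF'
CREATE TABLE IF NOT EXISTS snippet (date TEXT, title TEXT, snippet TEXT);
.mode csv
.separator ";"
.import 1181.csv snippet
EOF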

Hi blastit.fr,

Thanks for your time and effort in creating and sharing the script. I see your script focuses on the 1181.txt file; can it run as a bulk job? We are talking about 2000 to 4000 text files, accumulating the results into one csv file, including the filename of each text file.

Thanks again

Hi,
See my attached file: this script will process all files matching *.txt in the current directory.
The result file contains one line for each file, with the first field holding the original file name.

I will try this and thanks again

---------- Post updated 10-11-16 at 10:46 AM ---------- Previous update was 10-10-16 at 05:53 PM ----------

Hi,

Here's the result after executing the script.

  1. field ITEM - results look good.

  2. field DATE - Testing 100 text files, only 2 out of 100 came out right, maybe because not all files have "Month Year" in the first row.
    I tried to replace

NR==1 {DATE=$1 " " $2}

with this

NR==1 {DATE=system("egrep -R "^[a-zA-Z]{3,9} [0-9]{4} $" -m 1")}

to match the date pattern, but no luck getting it to work.
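One thing to note: awk's system() returns the command's exit status, not its output, so DATE ends up holding a number. Matching inside awk itself might work better; a sketch (interval expressions like {3,9} need gawk, or --re-interval on older versions):

DATE == "" && match($0, /^[A-Za-z]{3,9} [0-9]{4}/) { DATE = substr($0, RSTART, RLENGTH) }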

  3. field TITLE - The length of the title is sometimes <= 10 characters. Maybe we could add a condition, for example: the first match for the title must be at least 30 but
    not more than 100 characters; skip lines that don't match and stop at the first match.

  4. field SNIPPET - The length of the snippet is sometimes <= 70 characters. Maybe we could add a condition, for example: the first match for the snippet must be at least 400
    but not more than 500 characters; skip lines that don't match and stop at the first match (see the sketch just below).
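Something like this hypothetical pre-check might express those windows (a sketch only, keeping the existing field variables; nothing is assigned when no line qualifies):

NR==2 && length($0) >= 30  && length($0) <= 100 { TITLE = $0 }
NR==4 && length($0) >= 400 && length($0) <= 500 { SNIPPET = $0 }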

Also, could you explain the flow of the script in more detail, not only for me but for the benefit of other users?

Thank you for reading, and for your effort and patience.

Hi,

Below is an update of the script data2imp.sh, with many comments to help.

Regarding any enhancement of the algorithm, I cannot guess what needs to be done without seeing the files.

Could you, for instance, provide the headers of the files once formatted by this script (the .lst files)?

The commands below will create an output of the top 10 records of each file.

head -10 *.lst > headers.txt
zip headers.zip headers.txt

Then attach the zipped file to your reply, if the content is not confidential.
(The zipped file size shouldn't exceed 100 kbytes for 1000 .lst files.)

# data2imp.sh
# This script processes all files matching *.txt in the current directory.
# The result file contains one line per file, with the first field holding the original file name.
#
# We first need to reformat the input file
function format_item {

awk '
# Filter non-blank lines. NF = Number of Fields (NF == 0 on an empty line)
# replace all consecutive blanks with one single blank space
NF {gsub(/  */," ")
# print the line without a Line Feed; the blank after %s keeps words from fusing across line breaks
 printf "%s ", $0
 istext++
 next
}
# Print a Line Feed when text is followed by an empty line (NF==0)
istext {print ""; istext=0}
# flush the last paragraph when the file does not end with an empty line
END {if (istext) print ""}' "$1"
}
# Creation of a sample csv record
# Take care that there is no semicolon inside the original text. In such a case the field separator should be changed
function insert_item {
awk -v OFS=";" -v ITEM="$itemname" '
# Record 1 is the item date
NR==1 {DATE=$1 " " $2}
# Record 2 is the item title
# if length <= 100 : TITLE is the full record
# else seek a dot between positions 51 and 100 of the record and cut the record at that position
#    if there is no dot between positions 51 and 100 : cut to the first 100 characters
NR==2 { if (length($0) <= 100) TITLE = $0
        else {
           dotposition=index(substr($0,51),".")
           if (dotposition == 0 || dotposition > 50) {
             TITLE = substr($0,1,100) "..."}
           else {
             TITLE = substr($0,1,50 + dotposition)
           }
        }
       }
# same method as for TITLE, with a window of 401 to 500
NR==4 { if (length($0) <= 500) SNIPPET = $0
        else {
           dotposition=index(substr($0,401),".")
           if (dotposition == 0 || dotposition > 100) {
             SNIPPET = substr($0,1,500) "..."}
           else {
             SNIPPET = substr($0,1,400 + dotposition)
           }
        }
print ITEM,DATE,TITLE,SNIPPET
}' "$1"
}
#--- main -----------------------------------
# Output initialisation
#
>items.csv
#
for i in *.txt
do
itemname=$(basename "$i" .txt)
echo "Processing item $itemname"
format_item "$i" > "$itemname.lst"
insert_item "$itemname.lst" >> items.csv
done
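A run over the sample file should then look roughly like this (illustrative, based on the 1181.txt sample shown earlier):

$ chmod +x data2imp.sh
$ ./data2imp.sh
Processing item 1181
$ head -1 items.csv
1181;September 2015;THE MACROECONOMIC PERSPECTIVE: ENSURING THE BUDGET AS AN EFFECTIVE HANDLE AMID PRESSING CHALLENGES;The proposed 2016 national budget is being considered ...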

Hi,

See the attached script: a big enhancement of my previous one.
Now the fields are selected on different criteria, independent of the record number.
The script creates temporary files for debugging purposes.
These files hold the real input, but cleaned of what we can call "noise", i.e. useless keywords or whole lines.

You can check for other extra noise using this command:

$ ./data2imp.sh
(output removed)
# check for useless keywords:
$ sort *.tmp | uniq -c | sort -rn | head -20 | cut -c-60
     56
     40 (FOR FY
     18 2017)
     18 2016)
     17 AUGUST 2012 1
     16 2015)
     13 DEPARTMENT OF
     13 AUGUST 2011 1
     10 (FY
      9 AUGUST 2015 TABLE OF CONTENTS
      9 1. MANDATE
      8 1
      7 DEPARTMENT OF ENERGY
      6 AUGUST 2016 1
      6 AUGUST 2013 2
      6 ((FFYY 22001133))
      5 SITUATIONER
      5 SEPTEMBER 2013 1
      5 DEPARTMENT OF TRADE AND INDUSTRY
      5 DEPARTMENT OF FINANCE

You can then enhance the code by adding new exclusions, for instance /FFYY/.
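In awk terms an exclusion is just one more pattern-action pair that skips the matching line, e.g.:

/FFYY/ { next }    # drop doubled-character OCR noise like ((FFYY 22001133))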

Regards

Hi,

Thanks for the update.
Running the script, I got error messages when this part is left commented out (as the comment suggests if you don't need to format the files):

# remove comment if required to format files (txt to lst)
# for i in *.txt
# do
# itemname=$(basename $i .txt)
# echo Formatting item $i
# format_item $i > $itemname.lst
# done
rm: cannot remove `*.tmp': No such file or directory
Processing item *.lst
awk: cmd. line:40: fatal: cannot open file `*.lst' for reading (No such file or directory)
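Both messages suggest globs that matched nothing: rm fails on the literal *.tmp, and the loop runs once on the literal *.lst. A guess at guarding those two spots, assuming bash:

rm -f *.tmp            # -f: no error when nothing matches
shopt -s nullglob      # an unmatched *.lst now expands to an empty list
for i in *.lst
do
    itemname=$(basename "$i" .lst)
    echo "Processing item $itemname"
done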

Thanks

Hi blastit.fr,

how can I make this one line in awk or in an sh file:

NR==2 { if (length($0) <= 100) TITLE = $0
        else {
           dotposition=index(substr($0,51),".")
           if (dotposition == 0 || dotposition > 50) {
             TITLE = substr($0,1,100) "..."}
           else {
             TITLE = substr($0,1,50 + dotposition)
           }
        }
       }
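One possible one-line form uses awk's conditional (ternary) operator; an untested sketch (it calls index() twice, and the substr() length of 50 confines the dot search to positions 51 through 100):

NR==2 { TITLE = length($0) <= 100 ? $0 : index(substr($0,51,50),".") ? substr($0,1,50+index(substr($0,51,50),".")) : substr($0,1,100) "..." }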

Thanks

Hi,
I have updated my script: see the attached file.
This greatly improves the results, as the rules have changed.

1) The field recognition is no longer based on the line number.
2) The rule to filter the TITLE: find a line with no lowercase letters, the longest possible.

Now you should see the output file items.csv filled with relevant results.
Take care that many files, like the 3*.txt ones, won't fit this filter.
It fits only files like 1181.txt.

Regards