Help to make an awk script more efficient for large files

Hello,

Error

awk: Internal software error in the tostring function on TS1101?05044400?.0085498227?0?.0011041461?.0034752266?.00397045?0?0?0?0?0?0?11/02/10?09/23/10???10?no??0??no?sct_det3_10_20110516_143936.txt

What it is
It is a unix shell script that contains an awk program as well as some unix commands.
The heart of the script is the awk program.
It works, you can run it on your machine.

What it does
It takes an input file and removes duplicate records.
It creates an output file containing a sorted version of the input file.
It creates an output file containing a sorted archive of the duplicate records.
It creates an output file containing a sorted version of only the wanted records with the duplicates removed.
It tells you by standard output the number of duplicate records found, 0 or more.
It creates a flag file if duplicate records were found.
It sets the locale environment variables to the defaults of the server I run it on (in my case HP-UX server defaults) to override any changes that may have been made to those variables.

What it needs to do
It needs to be made more efficient at processing large input files. When I run it with a large input file, say 351 MB, I get this error:

awk: Internal software error in the tostring function on TS1101?05044400?.0085498227?0?.0011041461?.0034752266?.00397045?0?0?0?0?0?0?11/02/10?09/23/10???10?no??0??no?sct_det3_10_20110516_143936.txt

How to execute the script
The script is run from the UNIX command line, passing parameters to the script.

Note*
You should give the script the required permissions with chmod (for example, chmod 777 rem_dups.sh).
If necessary, convert the script to unix format with the dos2unix command (for example, dos2unix rem_dups.sh).

You can run the script by pasting this command on the command line.

./rem_dups.sh '1,2,3' . in.txt . out.txt dups_archive 009 4

Description of parameters
There are 8 parameters passed to the script from the UNIX command line.

Unix parameter $1: Value: '1,2,3'  Description: This is a comma-separated list of column positions in the input file that together make up the key used
to define a unique record
Unix parameter $2: Value: . (for same directory, it could be any directory) Description: This is the path to the input file.
Unix parameter $3: Value: in.txt (you can give it whatever name)  Description: This is the filename of the input file.
Unix parameter $4: Value: . (for same directory, it could be any directory) Description: This is the path to the output file with duplicates removed.
Unix parameter $5: Value: out.txt (you can give it whatever name)  Description: This is the filename of the output file with duplicates removed.
Unix parameter $6: Value: dups_archive (you can give it whatever name)  Description: This is the filename of the archive file containing the duplicate records.
Unix parameter $7: Value: 009 (the ASCII code for the tab character) Description: This is the ASCII code for the delimiter in the input file.
Unix parameter $8: Value: 4 Description: This is the column position in the input file of the site-datetime column, used to determine which record to keep
and which record to move to the duplicates archive file. It is also the column the site is taken from and appended to the key to make the key complete, in order to define a unique record.

Content of input file

31erescca    010240    10    sct_det3_10_20110516_143947.txt
11erescca    010240    10    sct_det3_10_20110516_143936.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
21erescca    010240    10    sct_det3_10_20110516_143937.txt
31erescca    010240    10    sct_det3_10_20110516_233947.txt
11erescca    010240    10    sct_det3_10_20110516_143936.txt

Description of the input file
Note* that there must always be 1 and only 1 blank line at the end of the input file.

The input file contains 4 columns per record separated by the tab character. (in ASCII this is 009)
The last column is the column used to determine which record to keep. The value with the greatest datetime is kept and the rest are moved
to the duplicates archive file.

For example, in sct_det3_10_20110516_143947.txt, 20110516_143947 is the date and time.
In the program, 20110516143947 is the value used to determine which record to keep.

The key is made up, in this case, of columns 1, 2 and 3 (you could specify whatever column(s) you want as the key).
In the program the key columns are joined with a hyphen to make, for example, 31erescca-010240-10- .

Then the site is taken from the 4th column and added to the key.
The site in sct_det3_10_20110516_143936.txt is 10. (The 10 that you see after the det3)
So the final key value becomes the key plus the site, to make 31erescca-010240-10-10
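
To make that concrete, here is a small standalone awk sketch (separate from the script below) showing how the key, site and datetime come out of the first sample record; the tab delimiter and the column numbers are just the example values above.

printf '31erescca\t010240\t10\tsct_det3_10_20110516_143947.txt\n' |
awk -F '\t' '{
    n = split($4, z, "_")                  # sct det3 10 20110516 143947.txt
    datetime = z[n-1] substr(z[n], 1, 6)   # 20110516 + 143947 -> 20110516143947
    site = z[n-2]                          # the 10 after det3
    key = $1 "-" $2 "-" $3 "-" site        # 31erescca-010240-10-10
    print key, datetime
}'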

Descriptions and content of output files

temp_sort_inputfile_out.txt
This is the sorted version of the input file, used to compare with the sorted output file (the one that doesn't contain duplicates).

11erescca    010240    10    sct_det3_10_20110516_143936.txt
11erescca    010240    10    sct_det3_10_20110516_143936.txt
21erescca    010240    10    sct_det3_10_20110516_143937.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
31erescca    010240    10    sct_det3_10_20110516_233947.txt

duplicates_flagfile_out.txt
This is just an empty file indicating duplicates were found. It's not used for anything else in this process.

dups_archive
This file contains only the duplicate records that were found in the input file. It is sorted.

11erescca    010240    10    sct_det3_10_20110516_143936.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt

out.txt
This is the final output file containing only the unique records. All duplicates have been removed. It is also sorted.

11erescca    010240    10    sct_det3_10_20110516_143936.txt
21erescca    010240    10    sct_det3_10_20110516_143937.txt
31erescca    010240    10    sct_det3_10_20110516_233947.txt

The code of the script
This is the code of the script. You can change sh to ksh or the Bourne shell, whichever you are using.
Feel free to remove the locale environment variables if you wish; some characters cause problems with the sort command,
which is why I added them. The locale info goes to standard output.

#!/usr/bin/sh

LANG=""
LC_CTYPE="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=""

export LANG
export LC_CTYPE
export LC_COLLATE
export LC_MONETARY
export LC_NUMERIC
export LC_TIME
export LC_MESSAGES
export LC_ALL

locale

pos="$1"
infile="$2/$3"
outfile="$4/$5"
temp_dups_file="$2/$6"
temp_sort_file="$2/temp_sort_inputfile_$5"
flagfile="$2/duplicates_flagfile_$5"
delimiter="$7"
site_datetime="$8"

awk -v key_cols="$pos" -v delim="$delimiter" -v site_dt="$site_datetime" '

BEGIN {
    FS = sprintf("%c", delim);          # turn the ASCII code (e.g. 009) into the actual delimiter
    numkeys = split(key_cols, k, ",");  # positions of the key columns, e.g. 1,2,3
}

{
    # Pull the datetime and the site out of the filename column, e.g.
    # sct_det3_10_20110516_143936.txt -> datetime 20110516143936, site 10
    n = split($(site_dt), z, "_");
    datetime = z[n-1] substr(z[n], 1, 6);
    site = z[n-2];

    # Build the key from the requested columns, then append the site
    for (x = 1; x <= numkeys; x++) {
        key = key $(k[x]) "-";
    }
    keysite = key site;
    key = "";

    # Keep only the record with the greatest datetime for each key
    if (datetime > m[keysite]) {
        m[keysite] = datetime;
        out[keysite] = $0;
    }
    next
}

END { for (keysite in out) print out[keysite] }' < "$infile" | sort > "$outfile"

sort "$infile" > "$temp_sort_file"

# The duplicates are whatever is in the sorted input but not in the deduplicated output
diff "$outfile" "$temp_sort_file" | awk '/^>/ {print $0}' | sed 's/^> //' > "$temp_dups_file"

LINES=$(wc -l < "$temp_dups_file")

if [ "$LINES" -gt 0 ]
then
    echo "There were $LINES duplicate records."
    touch "$flagfile"
else
    echo "There were $LINES duplicate records."
fi

Use GNU Awk? "sort ... | uniq -d | wc -l | read dup_ct"
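
In other words, something along these lines (a rough sketch, assuming the input file is in.txt; note the trailing read only works in shells like ksh, where the last stage of a pipeline runs in the current shell):

sort in.txt | uniq -d | wc -l | read dup_ct   # ksh: dup_ct = number of distinct duplicated lines
echo "$dup_ct duplicated lines"

# portable alternative using command substitution
dup_ct=$(sort in.txt | uniq -d | wc -l)
echo "$dup_ct duplicated lines"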


Hello, thank you very much for your post.

Do you mean using sort instead of awk? Do you think I could do it using only sort?
If so, could you explain the different piped sections?

Let me see if I understand:

sort |
this pipes the sorted input file to the uniq command (the -d option keeps only one of each set of duplicate lines), then pipes to wc -l to count the number of duplicates(?), then pipes to the read..

I'm not sure what the read command does in this case.

I'm also thinking about it myself and will surely post the final code on this thread.

I need to sort the input file based on the key columns specified for a particular file.
Say columns 1, 2 and 3 to keep it simple, and use, say, column 4 (which is a datetime column) to determine which record to keep.

What if I used sort to put the greatest column 4 value on top within each key, then used awk to remove all the duplicates except that first occurrence?

Perhaps using code from this post where the same error was encountered:

awk -F "," ' NR == FNR {   cnt[$1] ++ } NR != FNR {   if (cnt[$1] == 1)     print $0 }' your-file your-file

I'm thinking about mixing the sort with that type of awk code to make it work for large files.
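
Something like this rough sketch is what I have in mind (assuming a tab delimiter, a key of columns 1-3, and that a reverse sort on column 4 puts the newest filename, and therefore the newest datetime, first within each key; the site part of the key is left out to keep it short):

TAB=$(printf '\t')

# newest record first within each key, then keep only the first record seen per key
sort -t "$TAB" -k1,3 -k4,4r in.txt |
awk -F '\t' '
{
    key = $1 "-" $2 "-" $3
    if (key != prev)
        print > "out.txt"          # first (newest) record for this key
    else
        print > "dups_archive"     # every later record is a duplicate
    prev = key
}'

# only the previous key is held in memory, so no big awk arrays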

If you have any more ideas, suggestion or code sample please let me know.

Once sort can ensure the newest record for each key comes first, a following sort -u on the key will keep that first one.

True, my snippet counts full duplicates uniquely. If you want the count or the difference, you can compare the input of 'sort -u' to its output, perhaps using 'comm'. Since 'comm' expects unique records, if there are full duplicates I run the sorted lists through 'uniq -c' first, to collapse each set of full duplicates to a count and one record.

Awk is limited by its heap memory, but sort/comm/uniq/wc and pipes are robust with huge sets.
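
A rough sketch of that comparison, assuming the raw input is in.txt:

sort in.txt    > all_sorted                        # full sorted input, duplicates included
sort -u in.txt > unique_sorted                     # one copy of each distinct line
comm -23 all_sorted unique_sorted > dups_archive   # the extra copies only
wc -l < dups_archive                               # how many records were dropped

# or collapse full duplicates to a count and one record first
sort in.txt | uniq -c | awk '$1 > 1'               # repeated lines with their counts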

lol, yeah, you were reinventing the wheel. Oh well.

sort -s -k 3,6

for instance, sorts on the 3rd through 6th fields of the input. The -s makes the sort "stable", so you can do another sort later and the ordering will stay consistent.

You can use uniq -c to count the number of times each line appears in a sorted list. Other options give you the ability to count only repeated lines, only count unique lines, skip the first N fields, etc.

comm is pretty useful too, reporting on lines common to both files (or only in the first but not the second, or vice-versa).
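
For instance (in.txt is your input file; sorted_a and sorted_b stand for any two already-sorted files):

sort -s -k 3,6 in.txt            # stable sort on fields 3 through 6
sort in.txt | uniq -c            # each distinct line with how many times it appears
sort in.txt | uniq -d            # one copy of every line that appears more than once
sort in.txt | uniq -u            # only the lines that appear exactly once
comm -12 sorted_a sorted_b       # lines common to both files
comm -23 sorted_a sorted_b       # lines only in the first file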