Help with awk performance

Hello Unix experts.

I'm looking for some input as to how I can improve the performance of the following awk command. It works well on a smaller file but the performance decreases greatly when dealing with a larger file (>1GB in size).

The awk command is to search a compressed file for a particular segment type (based on the input parameters) and generate separate compressed segment files. In the example below, the script will generate 2 separate compressed files (AM04.dat.gz, AM0G.dat.gz).

Snippet of the script "test.sh":

Usage: test.sh B8957ETD AM04,AM0G 13 5

# Accept input parms
source_filename="$1"
search_string_list="$2"
search_col_pos="$3"
search_str_len="$4"

#
# Define search_string_list_variable to be comma separated
IFS=","
search_col_pos=`expr "$search_col_pos" + 0`
search_str_len=`expr "$search_str_len" + 0`
search_elements=''

zcat $source_filename | awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v search_elements="$search_elements" -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" 'BEGIN{RS="\r?\n"} {src_substring=substr($0, search_col_pos, search_str_len)} {cmd="gzip > " tgt_path src_substring ".dat.gz"}  index(search_elements, src_substring)>0 { print $0 | cmd }'


It could help if we had:

  1. a small, but representative input file covering general cases and some edge cases.
  2. a set of cli options covering different invocation scenarios.
  3. a desired output based on the sample input from above.

One improvement would be: not to spawn "gzip" for EACH LINE/RECORD in the file, but to construct a "string" with "gzip ... > whatever.dat.gz" (or whatever you're trying to do) and pipe it to "sh" to do the actual "deed".
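A minimal sketch of that idea (untested; the 13 and 5 are taken from the usage example above, filtering on the requested segment types is left out, and the worked-out version is pchang.sh further down):

zcat "$source_filename" |
  awk -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" '
    {
      key = substr($0, 13, 5)          # segment type; 13 and 5 from the usage example
      f = (tgt_path key ".dat")
      print > f                        # plain, appendable per-segment files
      files[f] = 1
    }
    END {
      for (f in files) {
        close(f)                       # flush before handing the file to gzip
        print "gzip -f " f             # emit the command ...
      }
    }' |
  sh                                   # ... and let sh do the actual deed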
Once again, we'd need to see more info to understand what you're after.


Thank you vgersh99 for getting back to me.

I have uploaded the source file (Test_File.revised.dat.gz) and the 2 output files generated (AM0G.dat.gz and AM04.dat.gz) as you requested. All files are fairly small in size.

AM0G.dat.gz (321 Bytes)
AM04.dat.gz (507 Bytes)
Test_File.revised.dat.gz (3.0 KB)

The key question here is the distribution of the values of src_substring within the input file: how many unique values are there?

It is not true that gzip is spawned for every line of the input file. The cmd string is essentially a unique tag for the pipe, something like:

"gzip > /home/CreditCardDir/thisSrcString.dat.gz"

If the value of that whole cmd string has already been piped to once, the | cmd for it will just be sent to the same gzip process. (If this were not so, each new gzip would empty the previous .dat.gz file.) A new gzip only gets initiated for each new value of src_substring.
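A quick way to see this (illustrative only): both prints below go down the same pipe into a single "cat -n" process, so the lines come out numbered 1 and 2. If a new cat were spawned for each print, both lines would come out numbered 1.

$ awk 'BEGIN { cmd = "cat -n"; print "first" | cmd; print "second" | cmd; close(cmd) }'
     1  first
     2  second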

Notably, you never close(cmd), so all existing pipes stay open. So if your 1GB file has 5,000 distinct values of src_substring, you are running 5,000 pipes to 5,000 gzip processes.

awk is subject to the per-process ulimit on open files (typically 1024). If you try to use more than that, it will close one of the cmd files (on a Least Recently Used basis), and if you later use the identical cmd string it will re-open the same file (magically, in append mode). That can be a serious performance hit -- hundreds of times slower if it keeps re-opening files to write a couple of lines each time.

It cannot do that for pipes to commands, because each gzip holds its compression state in memory, which would be lost every time, and the redirection inside the cmd string is not under awk's control (it would be if the redirect used awk's own > "Filename" syntax).
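To check the per-process limit mentioned above (1024 is typical, not guaranteed), and raise it for the current shell if the hard limit allows:

$ ulimit -n           # soft limit on open files (often 1024)
$ ulimit -Hn          # hard limit it can be raised up to
$ ulimit -n 4096      # raise the soft limit for this shell before running the script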

You have two possible methods to fix this:

(1) Sort the input file using the search_col_pos and search_str_len columns as the key, and gzip each group of consecutive lines, closing the command as you go (a sketch follows this list).

(2) Write the unsorted groups of lines to output files that can be appended to, and then gzip them all at the end.
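A minimal sketch of method (1), assuming bash (for the $'\t') and GNU sort, and again using 13 and 5 from the usage example; filtering on the wanted segment types is left out for brevity:

zcat "$source_filename" |
  awk -v p=13 -v l=5 '{ print substr($0, p, l) "\t" $0 }' |     # prepend the key
  sort -t$'\t' -s -k1,1 |                                       # -s keeps original order within each group
  awk -F'\t' -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" '
    {
      key = $1
      sub(/^[^\t]*\t/, "")                   # strip the prepended key again
      if (key != prev) {
        if (cmd != "") close(cmd)            # finish the previous group's gzip
        cmd = ("gzip > " tgt_path key ".dat.gz")
        prev = key
      }
      print | cmd                            # only one gzip is ever open at a time
    }
    END { if (cmd != "") close(cmd) }'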

I overcame the ulimit performance hit by reading ten million lines at a time into an array (indexed by e.g. X[src_substring,serial]) and then outputting the groups; that way I only needed a single output file at a time, and the original line order within groups was retained. I used X[src_substring,"P"] and X[src_substring,"E"] to keep the first and last serial numbers in each group, so I had something to iterate over for output. But to use a pipe, you would need to hold the whole >1GB of data at once.
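Roughly like this (a sketch only, not my actual script; names such as first, last and done are illustrative, filtering on the wanted segment types is omitted, and the final gzip assumes the .dat files did not already exist):

zcat "$source_filename" |
  awk -v p=13 -v l=5 -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" '
    {
      key = substr($0, p, l)
      if (!(key in first)) first[key] = NR     # X[key,"P"] in the description above
      last[key] = NR                           # X[key,"E"]
      X[key, NR] = $0
      if (NR % 10000000 == 0) dump()           # flush every ten million lines
    }
    END {
      dump()
      for (k in done)                          # compress each plain file once, at the end
        system("gzip -f " tgt_path k ".dat")
    }
    function dump(   k, n, f) {
      for (k in first) {
        f = (tgt_path k ".dat")
        for (n = first[k]; n <= last[k]; n++)
          if ((k, n) in X) { print X[k, n] >> f; delete X[k, n] }
        close(f)                               # only one output file open at a time
        done[k] = 1
      }
      split("", first); split("", last)        # reset for the next batch
    }'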


Something along these lines - making some assumptions about the matching strings based on your sample data. You can definitely gold-plate this:

$ ./pchang.sh Test_File.revised.dat.gz 'AM04|AM0G' 

to do the actual gzip-ing:

$ ./pchang.sh Test_File.revised.dat.gz 'AM04|AM0G'  | bash

where pchang.sh is:

#!/bin/bash

FILE_src="${1}"
LIST_str="${2}"
DIR_target="${CREDIT_CARD_SOURCE_DIR:-./dirTarget}"

zcat "${FILE_src}" | awk -v LIST_str="${LIST_str}" -v DIR_target="${DIR_target}" '
  BEGIN {
    split(LIST_str, strA, "|")
    for (i in strA)
      target[i] = DIR_target "/" strA[i] ".dat"
    gzip = "gzip -vf "
  }
  {
    for (i in strA)
      if ($1 ~ (strA[i] "$")) {
        print $0 > target[i]
        next
      }
  }
  END {
    for (i in target) {
      close(target[i])          # flush the data file before emitting its gzip command
      print gzip target[i]
    }
  }'

If CREDIT_CARD_SOURCE_DIR is not defined in the environment, it defaults to the sub-directory dirTarget of the current directory. All target directories are assumed to exist - you can improve that as well.

Obviously you can add other bells and whistles.

I'm not sure how a half-MB test file says anything about a performance issue with a 1GB file.

Is the performance much worse than linear (O(n) for n rows), and what percentage of CPU do the decompression and the several gzips use relative to the awk part? Is the average line length really 4KB?

It seems difficult to comment on the efficiency of the awk script in the original post, as it does not produce any output. It clearly does not do what your description claims.

The argument AM04,AM0G is assigned to search_string_list and is never referenced again.

The IFS is set to comma, but is never used by any command.

search_elements is always empty, and is passed into the awk.

The index function therefore searches an empty string, so it can never match src_substring, and nothing can ever be sent to | cmd.

Searching through a string with index() to match against a list is rather expensive, especially for long lists. Placing the list names in an array and testing with if (src_substring in List) ... is way faster.
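For example, a minimal sketch (assuming the list arrives pipe-separated, and the 13 and 5 from the usage example):

awk -v list="AM04|AM0G" '
  BEGIN {
    n = split(list, tmp, "|")
    for (i = 1; i <= n; i++) List[tmp[i]] = 1    # build the lookup table once
  }
  { src_substring = substr($0, 13, 5) }
  (src_substring in List) { print }              # hash lookup instead of a string search
' file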

You might upload the full script that you are using, instead of this "Snippet".

Thank you Paul_Pedant for your response.

Currently we have 20 unique segment types, but that may increase in the future.

Thank you vgersh99 for your proposed solution - I will definitely take a look.

Here's the full script as you requested.

Thank you.
edw_credit_card_file_splitter_compressed.sh (7.0 KB)

The script should have a shebang like #!/bin/sh
Of course it is treated as just a comment if you run the script with an explicit interpreter, like
/bin/bash scriptname ...

Usually the first thing I do is to reformat the script into indented multi-line form.
Here are the last lines with the embedded awk script:

echo "${search_elements}"
zcat "${CREDIT_CARD_SOURCE_DIR}/${org_src_filename}" |
  awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v search_elements="$search_elements" -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" '
    BEGIN { RS="\r?\n" }
    { src_substring=substr($0, search_col_pos, search_str_len) }
    { cmd=("gzip > " tgt_path src_substring ".dat.gz") }
    index(search_elements, src_substring)>0 { print $0 | cmd }
  '

awk has no explicit operator for string concatenation (adjacent expressions are simply joined); I put long concatenations in ( ).
Now it is more readable - at least for me :grinning:


Awk would have no problems running at least 50 simultaneous piped output commands. Of course, the OS would be scheduling those 50 gzips against each other, but that is a different issue.

I started to write performance comparisons, and found some issues. I made a 1GB file of plain text (8 copies of 130MB).

Firstly, is your >1GB file that size when compressed, or uncompressed? Mine compresses down by about 60%. Your Test_File.revised.dat.gz compresses down by 99.3%. That's because it is 95% spaces, and that part compresses much better than real data would.

Secondly, when you run your tests, you probably do not clear cache memory every time, so all your test data will be available without even reading the actual disk (either because you just created it, or because you ran the timings more than once in a row). That won't be true for your full-size test, because you won't have just created the large file.
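For what it's worth, on Linux the page cache can be dropped between timing runs (needs root; this is about the test setup, not part of any of the scripts here):

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches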

My .gz takes 4m20.316s to compress, and 0m22.538s to decompress.

That 0m22.538s is from time zcat uData.txt.gz > /dev/null i.e. dumping to nowhere.

If I run a pipe into an awk and discard the output, this takes 0m25.769s

time zcat uData.txt.gz | awk '/independent/' > /dev/null

So the zcat takes 87% of the time, and the additional awk takes only 13%.

I believe we are seeking to optimise the wrong process here. Even if awk was instantaneous, it would make the overall process only one-sixth faster. And I have not even considered the multiple gzips working on the output streams. Compression seems to be ten times slower than decompression.


50 pipes to gzip are not always slow but certainly cause system load.
The following collects all output in a buffer array, and at the END prints to one gzip at a time.

echo "${search_elements}"
zcat ${CREDIT_CARD_SOURCE_DIR}/${org_src_filename} |
  awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v search_elements="$search_elements" -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" '
    BEGIN { RS="\r?\n" }
    { src_substring=substr($0, search_col_pos, search_str_len) }
    (index(search_elements, src_substring) > 0) { buffer[src_substring]=(buffer[src_substring] $0 "\n") }
    END {
      for (b in buffer) {
        cmd=("gzip > " tgt_path b ".dat.gz")
        printf "%s", buffer[b] | cmd
        close(cmd)
      }
    }
  '

The close() should end each running gzip, so system load should stay low.
It still uses the index() string search, because in my test (with GNU awk and 500 entries) I found an array lookup was not significantly faster.

Thank you Paul for spending time testing the different scenarios and giving a very detailed explanation of your test results.

Thank you MadeInGermany for a proposed solution - I will take a look at that also.

Thank you rtwolfe94022 for another proposed solution.

All of you have been a great help and very much appreciated.

Hmm, according to my reading of the POSIX awk man page, and confirmed by @Paul_Pedant, awk keeps an implicit hash of output expressions anyway. (It does not run the same gzip again; that would even overwrite its output file.) So there is no need for an explicit file_handles hash.
Further, search_elements is a string; you cannot use the array lookup src_substring in search_elements.

Hello MadeInGermany.

I tried your proposed code on a large compressed file and it ran for over an hour (it is still running) with no output.

Using the same data file but uncompressed, it runs for one hour.

I guess anything that runs longer than an hour would not be a viable solution to replace the existing code.

Thank you.

Hello rtwolfe94022

I tried your proposed solution but am getting the following error:

"awk: cmd. line:4: (FILENAME=- FNR=1) fatal: attempt to use scalar `search_elements' as an array". Is there a typo in the code?

Thank you.

No, it's broken. I said it in my last post:

Regarding the run time: uhh.

Here comes the hash version; it was (only) 4 times faster in my test.

echo "${search_elements}"
zcat ${CREDIT_CARD_SOURCE_DIR}/${org_src_filename} |
  awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v search_elements="$search_elements" -v tgt_path="${CREDIT_CARD_SOURCE_DIR}/" '
    BEGIN {
      RS="\r?\n"
      split(search_elements, search_arr, /\|/)
      search_elements=""                                      # no longer needed as a string
      for (a in search_arr) { search_hash[search_arr[a]] }    # build the lookup hash
      split("", search_arr)                                   # clear the temporary array
    }
    { src_substring=substr($0, search_col_pos, search_str_len) }
    (src_substring in search_hash) { buffer[src_substring]=(buffer[src_substring] $0 "\n") }
    END {
      for (b in buffer) {
        cmd=("gzip > " tgt_path b ".dat.gz")
        printf "%s", buffer[b] | cmd
        close(cmd)
      }
    }
  '