Complex text parsing with speed/performance problem (awk solution?)

I have 1.6 GB (and growing) of comma-delimited files; the data I need is in the second column of lines 11 through 34 (inclusive) of each file. There is also a lot of stray white space that needs to be trimmed, and the files have DOS-style line endings.

I need to transpose lines 11 through 34 of col2 from each data file and append the result as a new row to an existing file. I also need to add several variables to the front and back of each output line, which will be parsed/calculated from the data file names and file metadata.

Input:
...,...
xxx, 9
xxx, 10
xxx, 11 <-- need 11th through 34th rows of col2.
...,...
xxx, 34
xxx, 35
xxx, 36
...,...

Output:
var1,var2,var3,var4,var5,var6,11,12,13,...,32,33,34,/original/directory/path/of/data/file/,original_data_file_name

Then the entire file, including rows previously in it, needs to be sorted by several of the columns, and duplicate lines removed (excluding some columns from the duplicate determination).

My dos2unix|head|foot|cut|tr (remove whitespace)|tr (change EOL to comma)|echo (vars, stdin, vars) pipeline works but is way too slow!
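Roughly, per file, that pipeline looks like this (a reconstruction; the exact flags, reading "foot" as tail, and the variable names defined later in the thread are assumptions):

row=$(dos2unix < "$FilePathName" |
      head -n 34 |          # keep lines 1-34...
      tail -n 24 |          # ...then the last 24 of those: lines 11-34
      cut -d, -f2 |         # second comma-delimited column
      tr -d ' \t' |         # trim the stray whitespace
      tr '\n' ',')          # newlines to commas: one row, trailing comma kept
echo "$Var1,$Var2,$Var3,$Var4,$Var5,$Var6,$row$FilePath,$FileName"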

I'm thinking there is a way to do the selecting, the whitespace removal, and the transpose with variables padded on both ends of the output line in one awk command, which should speed things up a whole lot, but I am not that good at awk.

Mike

As much as I would like to help, you lost me. I can't possibly imagine what you want to achieve, coming from where.
Please post the directory structure (as you want that looong path included), the algorithms to calculate the vars, an input file sample, and, from that concoction, the desired output.

EDIT: This may fulfill part of your requirements:

$ awk -F, 'NR > 10 {printf "%s,", $2} NR == 34 {printf "%s\n", FILENAME; exit}' /some/path/file

EDIT 2: put sub (/[^\/]*$/, ",&", FILENAME); in front of the printf of FILENAME to split the path and the file name with a comma.
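That is, the sub inserts a comma in front of the basename, turning the path into two output fields. A quick check on a throwaway string:

$ echo "/some/path/to/file" | awk '{sub (/[^\/]*$/, ",&"); print}'
/some/path/to/,file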

RudiC, not to overcomplicate things: assume there is a loop through all the data files, and the variables FilePathName, FilePath, FileName, Var1, ..., Var6 are defined within the loop before the code is called.

The transpose-with-padding code will be called for one file at a time within the loop. If awk is used, the variables can be passed using the -v option as needed. Var1-6 are not needed inside the awk, so they can be passed as one concatenated string.

After the loop, the sort is done and duplicate lines will be deleted. If you don't want to keep it general, assume it is the last two columns, FilePath and FileName, that are not part of the duplicate-line determination.
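If the combined file ends up with 32 comma-separated fields (6 vars, 24 data values, FilePath, FileName; the exact count is my assumption), GNU sort can do the sorting and the deduplication in one pass, because with -u only the key fields decide what counts as a duplicate:

# Sort on fields 1-30; -u drops lines whose first 30 fields repeat,
# ignoring the trailing FilePath and FileName:
sort -t, -k1,30 -u combined.csv > sorted_deduped.csv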

Mike

Certainly putting the tr that deletes white space first may speed things up. Doing it all in C/C++/Java/Perl/Python/awk might help. Move dos2unix after the various reducers. What is a 'foot'? tail?

I'm not sure dos2unix is needed at all, as awk doesn't encounter the <CR> char when printing only $2; unless there are more non-ASCII chars to translate, that is.
If you supply var1-6 as one long string to awk, I think the attempt above is pretty close. printf var1-6 in a BEGIN section, or using NR == 1.

Thanks for giving me some code to work with. Here is an actual data file (with IP obfuscation):

XXXXXXX,,,,XXXXXXX
XXXXXXX,XXXXXXX,,,
XXXXXXX,XXXXXXX,,,
XXXXXXX,,,,
,,,,
XXXXXXX,,,,
XXXXXXX,XXXXXXX,,,
,,,,
XXXXXXX,,,,
XXXXXXX,XXXXXXX,,,
XXXXXXX,411.7075872,,,
XXXXXXX,388.756628,,,
XXXXXXX,384.2531634,,,
XXXXXXX,188.317418,,,
XXXXXXX,495.1306749,,,
XXXXXXX,495.7313397,,,
XXXXXXX,364.8057139,,,
XXXXXXX,128.1652694,,,
XXXXXXX,78.1880777,,,
XXXXXXX,47.85832595,,,
XXXXXXX,397.106979,,,
XXXXXXX,171.5723148,,,
XXXXXXX,452.5367818,,,
XXXXXXX,334.4613963,,,
XXXXXXX,245.0863368,,,
XXXXXXX,182.0549603,,,
XXXXXXX,495.5126526,,,
XXXXXXX,30.64512099,,,
XXXXXXX,291.9205658,,,
XXXXXXX,221.6485369,,,
XXXXXXX,24.33776897,,,
XXXXXXX,270.5466812,,,
XXXXXXX,32.99794073,,,
XXXXXXX,183.2580134,,,

Mike

Try this:

awk -F, 'NR == 1        {printf "%s,", var16}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' var16="v1,v2,v3,v4,v5,v6" /some/path/to/file
v1,v2,v3,v4,v5,v6,411.7075872,388.756628,384.2531634,188.317418,495.1306749,495.7313397,364.8057139,128.1652694,78.1880777,47.85832595,397.106979,171.5723148,452.5367818,334.4613963,245.0863368,182.0549603,495.5126526,30.64512099,291.9205658,221.6485369,24.33776897,270.5466812,32.99794073,183.2580134,/some/path/to/,file

I gotta take a sed swipe at this:

echo "var1,var2,var3,var4,var5,var6,$(
sed '
   1,10d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   34q
   b l
  ' original_data_file_name | dos2unix
 ),/original/directory/path/of/data/file/,original_data_file_name"

Narrative for the sed script:

- delete the first 10 lines;
- turn line 11 into CSV column two without leading spaces, stopping before any trailing spaces (assumes the field is never blank; otherwise we can either trim trailing spaces in one more 's' line or use 't' to detect no substitution and sub in a bare comma), discarding the rest of that line;
- set a loop branch target 'l';
- get the next line with 'N';
- turn the linefeed and the second line into a comma plus column 2, as before, without white space;
- quit if at line 34; branch back to 'l' otherwise.

We can work on the raw file since space and comma are coded no differently in DOS text, and maybe we do not need dos2unix unless there are other codes that need fixing; any carriage returns got tossed by the '.*'.

If the files are long, make your first command a head, so the rest of the pipeline never sees the tail of the file?

RudiC, Thanks a bunch.

I threw in some random whitespace, which sometimes happens in the data, to test.

awk -F, 'NR == 1        {printf "%s,", HeaderRows}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' HeaderRows="v1,v2,v3,v4,v5,v6" ~/example.csv

Output:

v1,v2,v3,v4,v5,v6,411.7075872,388.756628,384.2531634,188.317418,495.1306749,495.7313397,364.8057139,128.1652694,78.1880777,47.85832595,397.106979,171.5723148,452.5367818,334.4613963,245.0863368,182.0549603,495.5126526,30.64512099,291.9205658,221.6485369,24.33776897,270.5466812,32.99794073,183.2580134,/home/mestora/,example.csv
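For the loop integration described above, it drops in roughly like this (a sketch; the glob and the Var1-6 derivations are placeholders):

for FilePathName in /data/*.csv; do   # hypothetical location
    # Var1..Var6 computed from the file name and metadata (not shown)
    HeaderRows="$Var1,$Var2,$Var3,$Var4,$Var5,$Var6"
    awk -F, -v HeaderRows="$HeaderRows" '
        NR == 1  {printf "%s,", HeaderRows}
        NR > 10  {gsub (/ /, "", $2); printf "%s,", $2}
        NR == 34 {sub (/[^\/]*$/, ",&", FILENAME)
                  printf "%s\n", FILENAME; exit}
    ' "$FilePathName" >> combined.csv
done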

DGPickett, yes I meant "tail". Wrong animal :)
I will performance test your solution vs. the awk one. Unfortunately this will be exported to a Cygwin-over-Win7 environment in the final implementation (part of my performance issues) :(

Mike

So you've got 4 spaces in the output? When I seed several spaces into my test file, they get gsubbed. Please post the input file...


Are embedded spaces within field values a problem? Usually, they are called 'data'. :D


The QUOTE block created the spaces in my reply; I switched to CODE and they are gone.

I am time testing both solutions right now.

Mike


I gave you both thanks.

awk solution 10000 times:

date
for i in {1..10000}
do
awk -F, 'NR == 1        {printf "%s,", HeaderRows}
         NR > 10        {gsub (/ /,"",$2); printf "%s,", $2}
         NR == 34       {sub (/[^\/]*$/, ",&",FILENAME); printf "%s\n", FILENAME; exit}
        ' HeaderRows="v1,v2,v3,v4,v5,v6" ~/example.csv > /dev/null
done
date

Running time: 4:26 (~27 ms per file)

sed solution 10000 times:

date
for i in {1..10000}
do
echo "var1,var2,var3,var4,var5,var6,$(
sed '
   1,10d
   s/^[^,]*, *\([^,]*[^ ,]\).*/\1/
   :l
   N
   s/\n[^,]*, *\([^,]*[^ ,]\).*/,\1/
   34q
   b l
  ' example.csv | dos2unix
 ),/home/mestora/,example.csv" > /dev/null
done
date

Running time: 9:56 (~60 ms per file)

Mike

Pull off the dos2unix; it might be exec time! As awk is more field-oriented, it has an advantage on this.
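One more exec-time saver worth trying (a sketch, untested; gawk's nextfile and a single HeaderRows for all files are assumptions): run one awk over many files, keyed on FNR instead of NR, rather than forking awk once per file:

find /data -name '*.csv' -print0 |
xargs -0 awk -F, -v HeaderRows="v1,v2,v3,v4,v5,v6" '
    FNR == 1  {printf "%s,", HeaderRows}    # NB: same vars for every file here
    FNR > 10  {gsub (/ /, "", $2); printf "%s,", $2}
    FNR == 34 {f = FILENAME; sub (/[^\/]*$/, ",&", f)
               printf "%s\n", f; nextfile}
' > combined.csv

Per-file Var1-6 would then have to be derived inside awk or joined on afterwards; that is the trade-off for saving ~10000 fork/execs.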

Never mind, figured it out. I thought the ^ there was a front (start-of-line) anchor, but inside the bracket expression it is the complement operator instead.