Parsing a file using AWK

Mannu25251 · June 19, 2020, 9:59am

Hi,

I am parsing a file and need to extract file names and sizes from it based on a pattern.

example.txt

f44942 f
f44942/Add.cpp f
s2018 s
f44942/Add2Numerics.cpp f
s3645 s
f44942/Add_scalar.cpp f
s2018 s
f44942/check_count_transform.cpp f
s2217 s
f44942/column_name.cpp f
s2402 s
f44942/column_name.so f
s426312 s
f44942/count_analytic1.cpp f
s2801 s
f44942/getperIns.sql f
s9273 s
fALTER_PARTITION f
fALTER_PARTITION/add_projection_refresh_test.sql f
s2747 s
fALTER_PARTITION/alter_partition_neg_test_1.sql f
s1965 s
fALTER_PARTITION/alter_partition_neg_test_2.sql f
s2494 s
fALTER_PARTITION/alter_partition_positive_test_1.sql f
s2496 s

I need the data in this format

450686,44942,directory
2018,44942/Add.cpp,file
3645,44942/Add2Numerics.cpp,file
2018,44942/Add_scalar.cpp,file
2217,44942/check_count_transform.cpp,file
2402,44942/column_name.cpp,file
426312,44942/column_name.so,file
2801,44942/count_analytic1.cpp,file
9273,44942/getperIns.sql,file
7206,ALTER_PARTITION,directory
2747,ALTER_PARTITION/add_projection_refresh_test.sql,file
1965,ALTER_PARTITION/alter_partition_neg_test_1.sql,file
2494,ALTER_PARTITION/alter_partition_neg_test_2.sql,file

Each directory line should carry the sum of the sizes of all the files under that directory.

So far I have trimmed it down to the following, but I am stuck on how to proceed further. Please help.

cat example.txt | awk '{if(substr($1, 1, 1) == "s") {print$0","p} else {print $0}}{p=$0}' | grep -vE '^f.*/' |sed 's/ s,/,/;s/ f$//;s/^f//;s/,f/,/;s/^s//g'
44942
2018,44942/Add.cpp
3645,44942/Add2Numerics.cpp
2018,44942/Add_scalar.cpp
2217,44942/check_count_transform.cpp
2402,44942/column_name.cpp
426312,44942/column_name.so
2801,44942/count_analytic1.cpp
9273,44942/getperIns.sql
ALTER_PARTITION
2747,ALTER_PARTITION/add_projection_refresh_test.sql
1965,ALTER_PARTITION/alter_partition_neg_test_1.sql
2494,ALTER_PARTITION/alter_partition_neg_test_2.sql

vgersh99 · June 19, 2020, 11:26am

Something along these lines. It's a bit verbose, but that should make it easier to follow.
It assumes gawk 4.0 or later, since it uses arrays of arrays and PROCINFO["sorted_in"].

gawk -f mannu2.awk mannu2.txt

where mannu2.awk is:

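BEGIN {
  OFS=","
}
# File line ("f<dir>/<name> f"): remember the directory and the full path.
/\// && $2=="f" {
    d=substr($1,2,index($1,"/")-2)
    f=substr($1,2)
    fA[d][f]
}
# Size line ("s<bytes> s"): it belongs to the file line just before it.
$2=="s" {
    s=substr($1,2)
    fA[d][f]=s     # size per file
    dAs[d]+=s      # running total per directory
}
END {
   # gawk only: walk array indices in ascending string order.
   PROCINFO["sorted_in"]="@ind_str_asc"
   for(d in dAs) {
     print dAs[d], d, "directory"
     for (f in fA[d])
       printf("%s%s%s%sfile\n", fA[d][f], OFS, f, OFS)
   }
}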

produces:

450686,44942,directory
2018,44942/Add.cpp,file
3645,44942/Add2Numerics.cpp,file
2018,44942/Add_scalar.cpp,file
2217,44942/check_count_transform.cpp,file
2402,44942/column_name.cpp,file
426312,44942/column_name.so,file
2801,44942/count_analytic1.cpp,file
9273,44942/getperIns.sql,file
9702,ALTER_PARTITION,directory
2747,ALTER_PARTITION/add_projection_refresh_test.sql,file
1965,ALTER_PARTITION/alter_partition_neg_test_1.sql,file
2494,ALTER_PARTITION/alter_partition_neg_test_2.sql,file
2496,ALTER_PARTITION/alter_partition_positive_test_1.sql,file

The ALTER_PARTITION total (9702) differs from the 7206 in your desired sample, but I believe it's correct for the sample input you provided.
See how far it gets you...
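
The difference can be checked by hand: the desired sample lists only three ALTER_PARTITION files, while the input has a fourth one, alter_partition_positive_test_1.sql, with size 2496. A quick arithmetic check, using only the sizes from the sample input:

```shell
# Sum of the three files listed in the desired sample, then of all four in the input.
awk 'BEGIN {
    print 2747 + 1965 + 2494          # 7206
    print 2747 + 1965 + 2494 + 2496   # 9702
}'
```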

Scrutinizer · June 19, 2020, 12:18pm

Another approach you could try:

awk '/^f[^\/]*$/ {print x}sub(/./,x)' example.txt | 
awk '
  {
    tot=0; rec=""
    for(i=3; i<=NF; i+=4) {
      rec=rec ORS $(i+2) OFS $i OFS "file"
      tot+=$(i+2)
    } 
    print tot, $1, "directory" rec
  }
' RS= OFS=,
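
To see how the two stages fit together, it can help to run the first awk on its own. A minimal sketch, assuming a shortened copy of the input saved as sample.txt (a hypothetical file name used only for this illustration):

```shell
# sample.txt is a hypothetical, shortened copy of example.txt.
cat > sample.txt <<'EOF'
f44942 f
f44942/Add.cpp f
s2018 s
fALTER_PART f
fALTER_PART/alter_partition_neg_test_1.sql f
s1965 s
EOF

# Stage 1 only: print an empty line before each line that has no "/"
# (the directory lines), then strip the leading f/s marker from every line.
awk '/^f[^\/]*$/ {print x} sub(/./,x)' sample.txt
```

This prints an empty line, then "44942 f", "44942/Add.cpp f" and "2018 s", then another empty line followed by the ALTER_PART block. With RS= the second awk reads each blank-line-separated block as a single record, so $1 is the directory name, the file names sit in fields 3, 7, 11 and so on, and each size is two fields after its file name.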

Mannu25251 · June 19, 2020, 12:32pm

A few of the entries do not have a directory; they are simply files.

f44942 f
f44942/Add.cpp f
s2018 s
f44942/Add2Numerics.cpp f
s3645 s
f44942/Add_scalar.cpp f
s2018 s
f44942/check_count_transform.cpp f
s2217 s
f44942/column_name.cpp f
s2402 s
f44942/column_name.so f
s426312 s
f44942/count_analytic1.cpp f
s2801 s
f44942/getperIns.sql f
s9273 s
fALTER_PARTITION f
fALTER_PARTITION/add_projection_refresh_test.sql f
s2747 s
fALTER_PARTITION/alter_partition_neg_test_1.sql f
s1965 s
fALTER_PARTITION/alter_partition_neg_test_2.sql f
s2494 s
fALTER_PARTITION/alter_partition_positive_test_1.sql f
s2496 s
ftest.csv f
s123 s
fALTER_PART f
fALTER_PART/add_projection_refresh_test.sql f
s2747 s
fALTER_PART/alter_partition_neg_test_1.sql f
s1965 s

I mean lines like these:

ftest.csv f
s123 s

Can this be handled in the awk code above? I tried with my own method and got part of the way:

cat example.txt | awk '{if(substr($1, 1, 1) == "s") {print$0","p} else {print $0}}{p=$0}' | grep -vE '^f.*/' |sed 's/ s,/,/;s/ f$//;s/^f//;s/,f/,/;s/^s//g' > example2.txt

input_file=$1

while IFS= read -r line
do
  case $line in
    *,*)  # "size,path" line: a file
      echo "$line,file" ;;
    *)    # bare name: a directory; sum only the "size,path" lines that mention it
      grep -wF -- "$line" "$input_file" |
      awk -F ',' -v casenum="$line" 'NF > 1 {sum += $1} END {print sum "," casenum ",directory"}' ;;
  esac
done < "$input_file"

vgersh99 · June 19, 2020, 12:50pm

For these types of records, what should the output be?

Mannu25251 · June 19, 2020, 1:03pm

Sorry, this would be the output

450686,44942,directory
2018,44942/Add.cpp,file
3645,44942/Add2Numerics.cpp,file
2018,44942/Add_scalar.cpp,file
2217,44942/check_count_transform.cpp,file
2402,44942/column_name.cpp,file
426312,44942/column_name.so,file
2801,44942/count_analytic1.cpp,file
9273,44942/getperIns.sql,file
9702,ALTER_PARTITION,directory
2747,ALTER_PARTITION/add_projection_refresh_test.sql,file
1965,ALTER_PARTITION/alter_partition_neg_test_1.sql,file
2494,ALTER_PARTITION/alter_partition_neg_test_2.sql,file
2496,ALTER_PARTITION/alter_partition_positive_test_1.sql,file
123,test.csv,directory
123,test.csv,file
4712,ALTER_PART,directory
2747,ALTER_PART/add_projection_refresh_test.sql,file
1965,ALTER_PART/alter_partition_neg_test_1.sql,file

vgersh99 · June 19, 2020, 1:10pm

ftest.csv f
fALTER_PART/add_projection_refresh_test.sql f

Hmmm... trying to see how to distinguish the 2 above...
Is it safe to assume that some files (without dirs) have a .csv suffix?
Too specific?
How would you distinguish them?

Mannu25251 · June 19, 2020, 1:11pm

The suffix can be anything. I can treat all of them as a single value rather than distinguishing them on the basis of suffix.

vgersh99 · June 19, 2020, 1:12pm

So how would you distinguish this case (plain files) from the others?

Mannu25251 · June 19, 2020, 1:15pm

cat example.txt | awk '{if(substr($1, 1, 1) == "s") {print$0","p} else {print $0}}{p=$0}' | grep -vE '^f.*/' |sed 's/ s,/,/;s/ f$//;s/^f//;s/,f/,/;s/^s//g' > example2.txt

$ cat parser.sh
input_file=$1

while IFS= read -r line
do
  case $line in
    *,*)  # "size,path" line: a file
      echo "$line,file" ;;
    *)    # bare name: a directory; sum only the "size,path" lines that mention it
      grep -wF -- "$line" "$input_file" |
      awk -F ',' -v casenum="$line" 'NF > 1 {sum += $1} END {print sum "," casenum ",directory"}' ;;
  esac
done < "$input_file"

$ ./parser.sh example2.txt

This code more or less serves the purpose.

vgersh99 · June 19, 2020, 1:25pm

Try this. It treats any name that has a dot but no slash as a plain file:

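BEGIN {
  OFS=","
}
# File line: either a path with a "/" or a slash-less name containing a dot.
# Note: a directory whose name contains a dot would be misread as a plain file.
(/\// || /[.]/) && $2=="f" {
    # A plain file acts as its own directory.
    d=(/\//)?substr($1,2,index($1,"/")-2) : substr($1,2)
    f=substr($1,2)
    fA[d][f]
}
# Size line: it belongs to the file line just before it.
$2=="s" {
    s=substr($1,2)
    fA[d][f]=s     # size per file
    dAs[d]+=s      # running total per directory
}
END {
   # gawk only: walk array indices in ascending string order.
   PROCINFO["sorted_in"]="@ind_str_asc"
   for(d in dAs) {
     print dAs[d], d, "directory"
     for (f in fA[d])
       printf("%s%s%s%sfile\n", fA[d][f], OFS, f, OFS)
   }
}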

results in:

450686,44942,directory
2018,44942/Add.cpp,file
3645,44942/Add2Numerics.cpp,file
2018,44942/Add_scalar.cpp,file
2217,44942/check_count_transform.cpp,file
2402,44942/column_name.cpp,file
426312,44942/column_name.so,file
2801,44942/count_analytic1.cpp,file
9273,44942/getperIns.sql,file
4712,ALTER_PART,directory
2747,ALTER_PART/add_projection_refresh_test.sql,file
1965,ALTER_PART/alter_partition_neg_test_1.sql,file
9702,ALTER_PARTITION,directory
2747,ALTER_PARTITION/add_projection_refresh_test.sql,file
1965,ALTER_PARTITION/alter_partition_neg_test_1.sql,file
2494,ALTER_PARTITION/alter_partition_neg_test_2.sql,file
2496,ALTER_PARTITION/alter_partition_positive_test_1.sql,file
123,test.csv,directory
123,test.csv,file
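
If relying on a dot in the name is too fragile, another way to tell plain files from directories is by position: in the sample input only file lines are followed by a size line, so any name that is followed by an "s" line can be treated as a file. The following is only a sketch under that assumption, written in plain awk (no gawk extensions) and keeping the input order instead of sorting. The names sample.txt and dirsum.awk are hypothetical, and the sketch assumes a single directory level (everything before the first "/"); a directory with no files produces no output.

```shell
# sample.txt is a hypothetical, shortened copy of the input from this thread.
cat > sample.txt <<'EOF'
fALTER_PARTITION f
fALTER_PARTITION/alter_partition_neg_test_2.sql f
s2494 s
fALTER_PARTITION/alter_partition_positive_test_1.sql f
s2496 s
ftest.csv f
s123 s
fALTER_PART f
fALTER_PART/alter_partition_neg_test_1.sql f
s1965 s
EOF

cat > dirsum.awk <<'EOF'
BEGIN { OFS = "," }
# Any "f" line: remember the name; a following "s" line decides what it is.
$2 == "f" { f = substr($1, 2); next }
# An "s" line means the remembered name is a file (directories have no size line).
$2 == "s" && f != "" {
    slash = index(f, "/")
    d = slash ? substr(f, 1, slash - 1) : f   # a plain file acts as its own directory
    if (!(d in total)) order[++n] = d         # remember first-seen order
    total[d] += substr($1, 2)
    lines[d] = lines[d] substr($1, 2) OFS f OFS "file" ORS
    f = ""
}
END {
    for (i = 1; i <= n; i++) {
        d = order[i]
        print total[d], d, "directory"
        printf "%s", lines[d]
    }
}
EOF

awk -f dirsum.awk sample.txt
```

On this sample the first line printed is 4990,ALTER_PARTITION,directory, and test.csv appears once as a directory line and once as a file line, both with size 123, before the ALTER_PART block.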

Mannu25251 · June 20, 2020, 11:15am

Awesome! It works. Thank you! You are a champ
