Command to read between patterns in a file

venkidhadha · January 31, 2017, 9:02am

I am currently working on a requirement where I have to extract the characters between two specific fields/patterns in a file and count the total number of characters between them.

REQUIREMENT:

The below content is in a file

I have to get the number of characters between each instance starting with <test> and ending with </test1> throughout the file.
With the counts obtained, find:
Number of tags which have more than 30 characters between them
Number of tags which have fewer than 30 characters between them

Ex:

<test>12312 njh</test1>
<test>abcdedfg
hijklmno
abched</test1>

I tried to perform this with sed command but I am unable to get an output.

RudiC · January 31, 2017, 9:09am

Please show your attempts with sed so we can give you a hand.

venkidhadha · January 31, 2017, 9:33am

Hi RudiC,
Actually the input file [input.txt] has tags other than <test></test1>

Ex:

<test>12312 njh</test1>
<tag1>apple</tag1>
<tag2>orange</tag2>
<test>abcdedfg
hijklmno
abched</test1>
<test>apple ball
cat orange
pineapple
mango</test1>

I used

sed -n '/<test>/,/<\/test1>/{s/<tag1>.*//;s/<tag2>.*//;p;}' input.txt

So I got only the required tags in the output, but I am not able to count the characters between the tags, because some tags have newlines in between and some don't. I am a newbie to shell. Kindly help.

Expected output is :

0-30 char : ?
>30 characters :?

RavinderSingh13 · January 31, 2017, 10:23am

Hello venkidhadha,

Based on the Input_file samples shown, could you please try the following and let me know if it helps.

awk -v ST=": TAG has characters replacement count is: " '($0 ~ /<test>/ && $0 ~ /<\/test1>/){gsub(/<test>|<\/test1>/,"");N=gsub(/[a-zA-Z]/,"");print ++instance ST N;if(N>30){MAX++} else {MIN++};next} ($0 ~ /<test>/){A=1;Q=0;sub(/<test>/,"")} ($0 ~ /<\/test1>/ && A){A="";sub(/<\/test1>/,"");Q+=gsub(/[a-zA-Z]/,"");print ++instance ST Q;if(Q>30){MAX++} else {MIN++};next} A{Q+=gsub(/[a-zA-Z]/,"")} END{printf("%s%01d\n%s%01d\n","Number of tags having more than 30 replacement of characters are: ",MAX,"Number of tags having less than 30 replacement of characters are: ",MIN)}'    Input_file

Output will be as follows.

1: TAG has characters replacement count is: 3
2: TAG has characters replacement count is: 22
Number of tags having more than 30 replacement of characters are: 0
Number of tags having less than 30 replacement of characters are: 2

EDIT: Adding a non-one-liner form of the solution too.

awk -v ST=": TAG has characters replacement count is: " '
# Opening and closing tag on the same line: count the letters right away.
($0 ~ /<test>/ && $0 ~ /<\/test1>/) {
    gsub(/<test>|<\/test1>/, "")
    N = gsub(/[a-zA-Z]/, "")
    print ++instance ST N
    if (N > 30) { MAX++ } else { MIN++ }
    next
}
# Opening tag only: start a new running total.
($0 ~ /<test>/) {
    A = 1
    Q = 0
    sub(/<test>/, "")
}
# Closing tag of a multi-line block: add the last line and report.
($0 ~ /<\/test1>/ && A) {
    A = ""
    sub(/<\/test1>/, "")
    Q += gsub(/[a-zA-Z]/, "")
    print ++instance ST Q
    if (Q > 30) { MAX++ } else { MIN++ }
    next
}
# Any other line inside a block (including the rest of the opening line).
A { Q += gsub(/[a-zA-Z]/, "") }
END {
    printf("%s%01d\n%s%01d\n", "Number of tags having more than 30 replacement of characters are: ", MAX, "Number of tags having less than 30 replacement of characters are: ", MIN)
}' Input_file

Thanks,
R. Singh

rdrtx1 · January 31, 2017, 10:32am

awk '
/<test>/ {f=1; s=""}                                # start of a block
f {t=$0; gsub(/<[^>]*>/, "", t); s=s t}             # strip tags, append the text
f && /<\/test1>/ {if (length(s) <= 30) {l30++} else {g30++}; f=0}
END {print "0-30 char : " l30+0;
     print ">30 characters : " g30+0;
}' input.txt
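
To check which bucket each block lands in, it can help to print the length of every block before counting. The following is only a sketch under a few assumptions: the helper name block_lengths and its three arguments (file, opening tag name, closing tag name) are made up for illustration, newlines inside a block are not counted, and no text of interest sits outside the tag pair on the opening or closing line.

```shell
# Hypothetical helper: block_lengths FILE OPEN CLOSE
# Prints the character count of every block that starts with the OPEN tag
# and ends with the closing CLOSE tag, one count per line.
block_lengths() {
    awk -v otag="<$2>" -v ctag="</$3>" '
    index($0, otag) { inside = 1; text = "" }   # a new block starts here
    inside {
        line = $0
        gsub(/<[^>]*>/, "", line)               # drop the tags themselves
        text = text line                        # join lines, newline not counted
    }
    inside && index($0, ctag) { print length(text); inside = 0 }
    ' "$1"
}
```

Called as `block_lengths input.txt test test1` on the second sample posted above, this should print 9, 22 and 34, which corresponds to two blocks in the 0-30 bucket and one above 30.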

RudiC · January 31, 2017, 10:47am

I'm afraid sed (alone) can't do that, as it can neither calculate nor count. On top of that, your request is not quite clear: does the term "char" as you use it include digits, punctuation, etc., or not? Please specify. If all of that is included, try a combination like

sed ':L; $ {s/\n//g;
            s/<\/test1>/\n/g
            s/<test>\|<tag.>.*<\/tag.>//g
            s/\n$//}
      N; bL
' file |
{ while IFS= read -r LN
     do [ ${#LN} -gt 30 ] && A=$((A+1)) || U=$((U+1))
     done
     echo "0 - 30 char: " $U
     echo "  > 30 char: " $A
}
0 - 30 char:  2
  > 30 char:  1
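
If it turns out that only some characters should count (say, letters only), the length test can be applied after filtering each line. This is a sketch, not a definitive answer: it assumes input with one block per line, as produced by the sed stage above, and the function name bucket_lines and its optional tr character-class argument are invented here for illustration.

```shell
# Hypothetical helper: bucket_lines [CLASS]
# Reads one block per line on stdin and counts how many are at most 30
# characters long and how many are longer. If CLASS is given (a tr set
# such as '[:alpha:]'), only characters in that set are counted.
bucket_lines() {
    keep=${1:-}
    le=0
    gt=0
    while IFS= read -r line; do
        if [ -n "$keep" ]; then
            # delete every character that is NOT in the requested set
            line=$(printf '%s' "$line" | tr -cd "$keep")
        fi
        if [ "${#line}" -gt 30 ]; then
            gt=$((gt + 1))
        else
            le=$((le + 1))
        fi
    done
    echo "0 - 30 char: $le"
    echo "  > 30 char: $gt"
}

# Count every character of three sample blocks: two are short, one is long.
printf '%s\n' '12312 njh' 'abcdedfghijklmnoabched' 'apple ballcat orangepineapplemango' |
    bucket_lines
# Count letters only.
printf '%s\n' '12312 njh' 'abcdedfghijklmnoabched' 'apple ballcat orangepineapplemango' |
    bucket_lines '[:alpha:]'
```

Whether spaces, digits and punctuation belong in the count is exactly the open question, so the class is left as a parameter rather than fixed.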

anbu23 · February 1, 2017, 7:15am

Just tried for fun

$ sed ':L; $ {s/\n//g;
>             s/<\/test1>/\n/g
>             s/<test>\|<tag.>.*<\/tag.>//g
>             s/\n$//}
>       N; bL
> ' f | sed -n "s/^.\{1,30\}$/lt/p" | sed -n "/lt/{$ =;}"
2
$ sed ':L; $ {s/\n//g;
>             s/<\/test1>/\n/g
>             s/<test>\|<tag.>.*<\/tag.>//g
>             s/\n$//}
>       N; bL
> ' f | sed -n "s/^.\{31,\}$/gt/p" | sed -n "/gt/{$ =;}"
1