Command to read between patterns in a file

I am currently working on a requirement where I have to extract the characters between two specific patterns in a file and count the total number of characters between them.

REQUIREMENT:

The below content is in a file

  1. I have to get the number of characters between each instance starting with <test> and ending with </test1> throughout the file.
  2. With the number of characters obtained,
    Find
    Number of tags which have more than 30 characters between them
    Number of tags which have less than 30 characters between them.

Ex:

<test>12312 njh</test1>
<test>abcdedfg
hijklmno
abched</test1>

I tried to do this with the sed command, but I am unable to get any output.

Please show your attempts with sed so we can give you a hand.

Hi Rudic,
Actually, the input file [input.txt] has tags other than <test></test1>

ex :

<test>12312 njh</test1>
<tag1>apple</tag1>
<tag2>orange</tag2>
<test>abcdedfg
hijklmno
abched</test1>
<test>apple ball
cat orange
pineapple
mango</test1>

I used:

sed -n '/<test>/,/<\/test1>/{s/<tag1>.*//;s/<tag2>.*//;p;}' input.txt

With this I get only the required tags in the output, but I am not able to count the characters between the tags, because some tags have newlines between them and some don't. I am a newbie to shell. Kindly help.

Expected output is :

0-30 char : ?
>30 characters :?

Hello venkidhadha,

Based on the Input_file/samples you have shown, could you please try the following and let me know if it helps.

awk -v ST=": TAG has characters replacement count is: " '($0 ~ /<test>/ && $0 ~ /<\/test1>/){gsub(/<test>|<\/test1>/,"");N=gsub(/[a-zA-Z]/,"");print ++instance ST N;if(N>30){MAX++} else {MIN++};next} ($0 ~ /<test>/){A=1;sub(/<test>/,"");Q=gsub(/[a-zA-Z]/,"");next} ($0 ~ /<\/test1>/ && A){A="";sub(/<\/test1>/,"");Q+=gsub(/[a-zA-Z]/,"");print ++instance ST Q;if(Q>30){MAX++} else {MIN++};Q=0;next} A{Q+=gsub(/[a-zA-Z]/,"")} END{printf("%s%d\n%s%d\n","Number of tags having more than 30 replacement of characters are: ",MAX,"Number of tags having less than 30 replacement of characters are: ",MIN)}'    Input_file

Output will be as follows.

1: TAG has characters replacement count is: 3
2: TAG has characters replacement count is: 22
Number of tags having more than 30 replacement of characters are: 0
Number of tags having less than 30 replacement of characters are: 2

EDIT: Adding a multi-line form of the solution too.

awk -v ST=": TAG has characters replacement count is: " '
# Opening and closing tag on the same line: count the letters right away.
($0 ~ /<test>/ && $0 ~ /<\/test1>/) {
    gsub(/<test>|<\/test1>/, "")
    N = gsub(/[a-zA-Z]/, "")
    print ++instance ST N
    if (N > 30) { MAX++ } else { MIN++ }
    next
}
# Opening tag only: start a new running total Q.
($0 ~ /<test>/) {
    A = 1
    sub(/<test>/, "")
    Q = gsub(/[a-zA-Z]/, "")
    next
}
# Closing tag of a multi-line block: add the last line, report, reset.
($0 ~ /<\/test1>/ && A) {
    A = ""
    sub(/<\/test1>/, "")
    Q += gsub(/[a-zA-Z]/, "")
    print ++instance ST Q
    if (Q > 30) { MAX++ } else { MIN++ }
    Q = 0
    next
}
# Lines in the middle of a block.
A {
    Q += gsub(/[a-zA-Z]/, "")
}
END {
    printf("%s%d\n%s%d\n", "Number of tags having more than 30 replacement of characters are: ", MAX, "Number of tags having less than 30 replacement of characters are: ", MIN)
}
' Input_file
 

Thanks,
R. Singh

awk '
# A line with a closing tag ends the current block: strip tags, count, reset, skip the next rule.
/<\// {sub("</[^>]*>", ""); sub("<[^>]*>", ""); s=s $0; (length(s) <= 30) ? l30++ : g30++; s=""; next}
{sub("<[^>]*>", ""); s=s $0}
END {print "0-30 char : " (l30+0);
     print ">30 characters : " (g30+0);
}' input.txt
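
The script above ends a block at any closing tag, so with the second sample the <tag1> and <tag2> lines would be counted as well. Here is a minimal sketch that only looks at <test>...</test1> blocks, assuming that each <test> starts on its own line, that blocks are not nested, and that every character except the newline counts. The function name count_test_blocks and the temporary sample file are made up for illustration:

```shell
# Hypothetical helper: count_test_blocks FILE
# Buckets <test>...</test1> blocks by content length (newlines excluded).
count_test_blocks() {
    awk '
    /<test>/ {                       # opening tag: start a fresh buffer
        inblock = 1
        buf = ""
        sub(/.*<test>/, "")          # keep only what follows the tag
    }
    inblock {
        if (/<\/test1>/) {           # closing tag: finish this block
            sub(/<\/test1>.*/, "")   # keep only what precedes the tag
            buf = buf $0
            if (length(buf) > 30) g30++; else l30++
            inblock = 0
        } else {
            buf = buf $0             # middle line: append without newline
        }
    }
    END {
        print "0-30 char : " (l30 + 0)
        print ">30 characters : " (g30 + 0)
    }' "$1"
}

# Usage with the second sample from this thread, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<test>12312 njh</test1>
<tag1>apple</tag1>
<tag2>orange</tag2>
<test>abcdedfg
hijklmno
abched</test1>
<test>apple ball
cat orange
pineapple
mango</test1>
EOF
count_test_blocks "$sample"
```

For that sample the three blocks hold 9, 22 and 34 characters, so the sketch reports 2 blocks in the first bucket and 1 in the second.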

I'm afraid sed (alone) can't do that, as it can neither calculate nor count. On top of that, your request is not quite clear: does the term "char" as you use it include digits, punctuation, etc., or not? Please specify. If all of that is included, try a combination like

sed ':L; $ {s/\n//g;
            s/<\/test1>/\n/g
            s/<test>\|<tag.>.*<\/tag.>//g
            s/\n$//}
      N; bL
' file |
{ A=0; U=0
  while IFS= read -r LN
     do [ "${#LN}" -gt 30 ] && A=$((A+1)) || U=$((U+1))
     done
     echo "0 - 30 char: " $U
     echo "  > 30 char: " $A
}
0 - 30 char:  2
  > 30 char:  1
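
The answers in this thread count different things: the first awk answer counts letters only (3 for the first sample block), while the ${#LN} test here counts every character, including digits and spaces (9 for the same block). A small sketch to compare the two conventions on one block's content; count_both is a made-up name, and [A-Za-z] is assumed to be what "letters" means:

```shell
# Hypothetical helper: count_both STRING
# Prints "<all characters> <letters only>" for one block's content.
count_both() {
    printf '%s\n' "$1" | awk '{
        all = length($0)                  # every character counts
        letters = gsub(/[A-Za-z]/, "&")   # "&" puts each match back, so $0 stays unchanged
        print all, letters
    }'
}

count_both '12312 njh'    # prints: 9 3
```

Which of the two numbers goes into the 30-character test decides the bucket once a block contains enough digits, spaces or punctuation.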

Just tried for fun :)

$ sed ':L; $ {s/\n//g;
>             s/<\/test1>/\n/g
>             s/<test>\|<tag.>.*<\/tag.>//g
>             s/\n$//}
>       N; bL
> ' f | sed -n "s/^.\{1,30\}$/lt/p" | sed -n "/lt/{$ =;}"
2
$ sed ':L; $ {s/\n//g;
>             s/<\/test1>/\n/g
>             s/<test>\|<tag.>.*<\/tag.>//g
>             s/\n$//}
>       N; bL
> ' f | sed -n "s/^.\{31,\}$/gt/p" | sed -n "/gt/{$ =;}"
1
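
If the final counting step does not have to be sed, grep -c can replace the last two sed calls. This is a rough sketch, assuming the first sed stage has already produced one block per line; bucket_counts is a made-up name, and the three input lines are the joined blocks from the sample above:

```shell
# Hypothetical helper: bucket_counts reads one block per line on stdin.
bucket_counts() {
    input=$(cat)                                   # keep stdin so it can be scanned twice
    short=$(printf '%s\n' "$input" | grep -c '^.\{1,30\}$')
    long=$(printf '%s\n' "$input" | grep -c '^.\{31,\}$')
    echo "0 - 30 char: $short"
    echo "  > 30 char: $long"
}

printf '%s\n' '12312 njh' 'abcdedfghijklmnoabched' 'apple ballcat orangepineapplemango' | bucket_counts
```

Like the sed version, the first pattern skips empty lines, and grep -c prints 0 (with a non-zero exit status) when nothing matches.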