Multi line extraction based on condition

reldb · July 21, 2014, 3:05am

Hi

I have some data in a file as below

******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

******************************
Class 2A
Students absent are :

******************************
Class 3A
Students absent are :

******************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ

From this file I just need to extract the sections that actually list one or more absent students.
The class name is dynamic, and so is the number of absent students.
E.g. the output should look like

******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

******************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ

Please help: how could I do this with a simple command or a script?

Thanks in advance
rel

bakunin · July 21, 2014, 11:08am

I suggest you read the man page of "grep" and find out what this utility can do for you. The man page is - like any other man page - accessible via

man grep

If you need to count lines you might want to give the "-c" option some special attention.
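
As a rough illustration of that hint (not necessarily what was meant), the sketch below counts the numbered student lines with "grep -c". The file name absent.txt and the pattern are assumptions made for this example, and the data is a shortened copy of the sample from the question. Note that this yields one total for the whole file, not a count per class, so on its own it does not produce the requested output.

```shell
# Shortened copy of the sample input; the file name absent.txt is made up for this example.
cat > absent.txt <<'EOF'
******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

******************************
Class 2A
Students absent are :

******************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ

EOF

# -c prints the number of matching lines instead of the lines themselves.
# The pattern (an assumption) matches lines starting with digits followed by a dot.
absent_total=$(grep -c '^[0-9][0-9]*\.' absent.txt)
echo "$absent_total"    # prints 5 for the data above
```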

I hope this helps.

bakunin

DHeisenberg · July 21, 2014, 12:34pm

Hi reldb,
I can quickly give you an algorithm for this. Just convert it into Unix shell code and use the grep command for the pattern matching.
Create two temporary files, file1.txt and file2.txt.

scount=0
while read k
do

    if [ line starts with Class ]
    then
        put the line into a file1.txt
    
    elif [ line starts with Students ]
    then
        append the line into file1.txt
    
    elif [ line starts with a number ]
    then
        scount+=1
        append the line into file1.txt
    elif [ line is empty and scount >=1 ]
    then
        insert an empty line into file1.txt
        insert ****** into file1.txt
        append file1.txt data to file2.txt
        empty file1.txt
        scount=0
    elif [ line is empty and scount=0 ]
    then
        empty the file1.txt
    fi
done <"Sourcefile.txt"

The above could be used assuming the structure of your source file remains the same as you have provided.
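
For anyone who wants to try the idea directly, here is one possible runnable sketch in plain sh. It is not a literal translation of the outline above: a shell variable (buf) stands in for file1.txt, and a section is flushed when the next ****** line arrives (and once more after the loop) instead of on the empty line, so the stars stay at the top of each section as in the requested output. The file names and the flush helper are made up for the example, and the data is a shortened copy of the sample.

```shell
# Shortened copy of the sample input; Sourcefile.txt and file2.txt are example names.
cat > Sourcefile.txt <<'EOF'
******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

******************************
Class 2A
Students absent are :

******************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ

EOF

scount=0        # numbered (absent student) lines seen in the current section
buf=''          # lines of the current section; stands in for file1.txt
: > file2.txt   # result file, emptied before the run

# Hypothetical helper: keep the buffered section only if it listed a student.
flush() {
    if [ "$scount" -ge 1 ]; then
        printf '%s' "$buf" >> file2.txt
    fi
    buf=''
    scount=0
}

while IFS= read -r k
do
    case $k in
        '*'*)   flush ;;                    # separator line closes the previous section
        [0-9]*) scount=$((scount + 1)) ;;   # lines such as "1. ABC"
    esac
    # Every line, including the separator, goes into the buffer.
    buf="$buf$k
"
done < Sourcefile.txt
flush           # the last section has no separator after it

cat file2.txt
```

With the data above, file2.txt should end up holding the Class 1A and Class 17ACF sections only.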

RudiC · July 21, 2014, 1:55pm

You should have shown us what your attempts were. Anyhow, try

awk     '/\*\*\*/               {if (CNT>4) for (i=1;i<=CNT;i++) print T[i]; CNT=0}
                                {T[++CNT]=$0}
         END                    {if (CNT>4) for (i=1;i<=CNT;i++) print T[i]}
        '  file
******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

******************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ

EDIT: This was nice but it didn't quite satisfy your spec:

awk '(A=gsub (/\n/, "&"))>4||A==0' RS="*" ORS="*" file
******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

****************************************************************************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ
*

reldb · July 26, 2014, 5:36am

DHeisenberg - Thanks for the suggestion. I wrote a program along a similar pattern in Java and it is working perfectly fine.

RudiC - Thanks for your suggestion. The one below worked fine (I got an error from the 2nd suggestion with live data).

awk     '/\*\*\*/               {if (CNT>4) for (i=1;i<=CNT;i++) print T[i]; CNT=0}
                                {T[++CNT]=$0}
         END                    {if (CNT>4) for (i=1;i<=CNT;i++) print T[i]}
        '  file

I have a couple of questions, to understand it better and to reuse it for future requirements.

As I understand it, /\*\*\*/ extracts each paragraph based on the *** pattern, and a paragraph is then printed depending on its number of lines.
Instead of counting lines, how can I check whether any line in the paragraph starts with a number followed by a dot, and print the paragraph only then (a kind of true/false logic)?
I couldn't understand why the print logic appears twice (once before END and once in END, with the same condition) even though each paragraph is output only once.

Thanks

MadeInGermany · July 26, 2014, 11:17am

You need another condition plus another variable.
This one uses a string to store the line (simpler than an array).

awk '
$1~/\*\*\*/ {if (c>0) print buf; c=0; buf=$0; next}
{buf=buf RS $0}
$1~/[0-9]+\./ {c++}
END {if (c>0) print buf}
' file

Because it prints only when it reaches a *** line, and your example does not end with one, you need another print at the end; otherwise your last section is never printed.
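
A small experiment may make that point about the extra print easier to see. The sketch below runs the script once without and once with the END rule on a shortened copy of the sample data and counts the Class lines that come out. The file name absent.txt and the variable names are made up for the example.

```shell
# Shortened copy of the sample input; absent.txt is an example name.
cat > absent.txt <<'EOF'
******************************
Class 1A
Students absent are :
1. ABC
2. CDE
3. CPE

******************************
Class 2A
Students absent are :

******************************
Class 17ACF
Students absent are :
1. ABCD
2. XYZ

EOF

# Without the END rule: a section is printed only when the NEXT *** line is read,
# so the final section (Class 17ACF) stays in buf and is lost.
without_end=$(awk '
$1~/\*\*\*/ {if (c>0) print buf; c=0; buf=$0; next}
{buf=buf RS $0}
$1~/[0-9]+\./ {c++}
' absent.txt)

# With the END rule: whatever is still buffered at end of input gets its chance too.
with_end=$(awk '
$1~/\*\*\*/ {if (c>0) print buf; c=0; buf=$0; next}
{buf=buf RS $0}
$1~/[0-9]+\./ {c++}
END {if (c>0) print buf}
' absent.txt)

printf '%s\n' "$without_end" | grep -c '^Class'   # prints 1 (only Class 1A)
printf '%s\n' "$with_end"    | grep -c '^Class'   # prints 2 (Class 1A and Class 17ACF)
```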