Remove sections based on duplicate first line

ahmedwaseem2000 · January 16, 2015, 4:03am

Hi,

I have a file with many sections in it. Each section is separated by a blank line.
The first line of each section determines whether the section is a duplicate.
If a section is a duplicate, the entire section should be removed from the file.

Below is an example of the input and output. The lines starting with *& are the first lines of the sections, and there are 2 sections with the same first line. I need to delete one of them.

Input:
*& abc def
1
2
3
4
5

*& cde efg
1
2
3

*& abc def
1
2
3
4
5

Output:
*& cde efg
1
2
3

*& abc def
1
2
3
4
5

Thanks for your help!!

disedorgue · January 16, 2015, 4:29am

Hello,
If the output order of the sections is not important, with (GNU) awk:

awk 'BEGIN{RS="\n\n"};{A[$0]=1};END{for (h in A) print h,"\n"}' file

Regards.

RudiC · January 16, 2015, 5:50am

That works if the DOS <CR> line terminators are removed from the input file. Also try:

awk '/^\*\&/ {STOP=($0 in T); T[$0]} /^ *$/ {STOP=0} !STOP' file
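If the input still has DOS line endings, the carriage returns can be stripped on the fly before the filter runs. The following is only a sketch: sample_dos.txt is a made-up file created for the demonstration, and the filter is the one-liner above with an extra sub() in front.

```shell
# Build a small sample with DOS (CRLF) line endings; the file name is hypothetical.
printf '*& abc def\r\n1\r\n\r\n*& cde efg\r\n2\r\n\r\n*& abc def\r\n1\r\n' > sample_dos.txt

# Strip a trailing CR from every line first, then apply the same filter.
result=$(awk '{sub(/\r$/, "")}                   # drop the DOS carriage return, if any
     /^\*&/  {STOP = ($0 in T); T[$0]}           # known header: stop printing
     /^ *$/  {STOP = 0}                          # blank line: re-enable printing
     !STOP' sample_dos.txt)
printf '%s\n' "$result"
```

Running tr -d '\r' over the file beforehand would probably do the same job, if a separate cleanup step is acceptable.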

ahmedwaseem2000 · January 16, 2015, 3:14pm

Thanks for your help. Your code worked fine. I had already tried similar code, but the difference was that I didn't set RS, and instead of A[$0]=1 I assigned A[$0]=$0, and the array was getting jumbled up. Do you know the reason?

RudiC - I don't quite understand this code. Can you please help me understand it?

awk '/^\*\&/ {STOP=($0 in T); T[$0]} /^ *$/ {STOP=0} !STOP' file4

Thank you both for your help!!

RudiC · January 16, 2015, 3:20pm

awk '/^\*\&/ {STOP=($0 in T)            # header (identified by *&) already seen: stop printing
              T[$0]                     # remember the header line for next time
             }
     /^ *$/  {STOP=0}                   # empty line: re-enable printing
     !STOP                              # default action: print the line unless STOPped
    ' file
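Note that the script above keeps the first copy of a duplicated section, while the sample output in the first post keeps the last one. If the last copy is the one wanted, a two-pass variant is one way to get it. This is a sketch rather than a tested solution for the real data: sample.txt is a hypothetical file rebuilt from the posted input, and the array name last is made up.

```shell
# Recreate the sample input from the first post (hypothetical file name).
printf '%s\n' '*& abc def' 1 2 3 4 5 '' '*& cde efg' 1 2 3 '' '*& abc def' 1 2 3 4 5 > sample.txt

# The file is read twice: pass 1 records where each header occurs last,
# pass 2 prints only the section that starts at that line.
result=$(awk 'NR == FNR { if (/^\*&/) last[$0] = FNR; next }   # pass 1: line number of the last occurrence
     /^\*&/    { STOP = (last[$0] != FNR) }                    # pass 2: stop unless this is the last copy
     !STOP                                                     # print unless STOPped
     /^ *$/    { STOP = 0 }                                    # blank line: re-enable printing afterwards
    ' sample.txt sample.txt)
printf '%s\n' "$result"
```

The blank-line rule is placed after !STOP on purpose, so that the blank line trailing a removed section is dropped together with it.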

disedorgue · January 16, 2015, 3:50pm

By default, the record separator (RS) is a single '\n', which represents the end of a line. If RS is set to '\n\n', awk treats everything up to the next blank line as one record.
This way, one record is one whole section.

Regards.
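Two side notes on the RS approach. The one-liner with A[$0] compares whole sections, not just their first lines. Its output order is also not guaranteed, because awk does not define the order in which for (h in A) visits the keys, which is the likely reason the array looked jumbled. The sketch below assumes that sections should be compared by first line only and that the first copy is the one to keep. It uses awk's paragraph mode (RS set to the empty string), with FS="\n" so that $1 is the first line of the section. sample.txt and the array name seen are made up for the example.

```shell
# Recreate the sample input from the first post (hypothetical file name).
printf '%s\n' '*& abc def' 1 2 3 4 5 '' '*& cde efg' 1 2 3 '' '*& abc def' 1 2 3 4 5 > sample.txt

result=$(awk 'BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }   # one record per section, one field per line
     !seen[$1]++                                           # print a section only the first time its first line appears
    ' sample.txt)
printf '%s\n' "$result"
```

Since records are printed as they are read, the sections come out in input order.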