How to extract entire stanza using awk?

prvnrk · September 10, 2018, 12:34pm

Hello friends,

I have a text file with many stanzas, each starting with "[Event ", and I need to extract the stanzas that contain the string O-O-O.

Sample file :-

[Event "EU/C2016/ct02"]
[Site "ICCF"]
[Date "2016.03.15"]
[Round "?"]
[White "Tinture, Laurent"]
[Black "Sommerbauer, Dr. Norbert"]
[Result "1/2-1/2"]
[WhiteElo "2452"]
[BlackElo "2420"]
[PlyCount "54"]
[EventDate "2016.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. cxd5 Nxd5 5. e4 Nxc3 6. bxc3 Bg7 7. Nf3 c5 8.
Rb1 O-O 9. Be2 cxd4 10. cxd4 Qa5+ 11. Bd2 Qxa2 12. O-O Bg4 13. Bg5 h6 14. Be3
Nc6 15. d5 Bxf3 16. gxf3 Nd4 17. Bd3 a5 18. f4 b5 19. Bxd4 Bxd4 20. Bxb5 Bc5
21. Qd3 Bb4 22. Rfd1 Rfc8 23. Bc6 Rab8 24. Rbc1 Qa3 25. Qg3 Qb2 26. Qd3 Qa3 27.
Qg3 Qb2 1/2-1/2

[Event "UKR/C28/final (UKR)"]
[Site "ICCF"]
[Date "2017.03.10"]
[Round "?"]
[White "Rudenko, Vitaly"]
[Black "Begliy, Mikhail"]
[Result "1/2-1/2"]
[WhiteElo "2398"]
[BlackElo "2427"]
[PlyCount "37"]
[EventDate "2017.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. cxd5 Nxd5 5. e4 Nxc3 6. bxc3 Bg7 7. Bc4 c5 8.
Ne2 Nc6 9. Be3 O-O 10. O-O b6 11. Rc1 Bb7 12. Qd2 Rc8 13. Rfd1 e6 14. f3 Na5
15. Bb5 cxd4 16. cxd4 Rxc1 17. Rxc1 a6 18. Bd3 Nc6 19. Bc4 1/2-1/2

[Event "LIPEAD40/f (PER)"]
[Site "ICCF"]
[Date "2016.09.30"]
[Round "?"]
[White "Rost, Detlef"]
[Black "Rawlings, Alan J. C"]
[Result "1/2-1/2"]
[WhiteElo "2451"]
[BlackElo "2368"]
[PlyCount "39"]
[EventDate "2016.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. cxd5 Nxd5 5. Bd2 Bg7 6. e4 Nxc3 7. Bxc3 O-O 8.
Qd2 Nc6 9. Nf3 Bg4 10. d5 Bxf3 11. gxf3 Ne5 12. O-O-O c6 13. Qd4 Qd6 14. Kb1
Qf6 15. dxc6 Nxc6 16. Qxf6 Bxf6 17. Bxf6 exf6 18. Bb5 Rfd8 19. Bxc6 bxc6 20.
Kc2 1/2-1/2

[Event "GER/CM/04-A (GER)"]
[Site "ICCF"]
[Date "2017.06.26"]
[Round "?"]
[White "Felkel, Siegfried"]
[Black "Schulz, Günter"]
[Result "1/2-1/2"]
[WhiteElo "2394"]
[BlackElo "2403"]
[PlyCount "51"]
[EventDate "2017.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. cxd5 Nxd5 5. e4 Nxc3 6. bxc3 Bg7 7. Nf3 c5 8.
Be3 Qa5 9. Qd2 Nc6 10. Rb1 a6 11. Rc1 cxd4 12. cxd4 Qxd2+ 13. Kxd2 e6 14. Bd3
O-O 15. Rc4 Bd7 16. Rhc1 Rfd8 17. Ke2 h6 18. Bf4 Rac8 19. h4 b5 20. R4c2 Nxd4+
21. Nxd4 Bxd4 22. Bxh6 Rxc2+ 23. Rxc2 Rc8 24. Rxc8+ Bxc8 25. Be3 Bxe3 26. Kxe3
1/2-1/2

[Event "CT20/pr41"]
[Site "ICCF"]
[Date "2013.11.30"]
[Round "?"]
[White "Pachnicke, Harald"]
[Black "Oppermann, Peter"]
[Result "0-1"]
[WhiteElo "2076"]
[BlackElo "2277"]
[PlyCount "82"]
[EventDate "2013.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. Nf3 Bg7 5. Bg5 Ne4 6. cxd5 Nxg5 7. Nxg5 e6 8.
Qa4+ c6 9. dxc6 Nxc6 10. Nf3 Bd7 11. O-O-O O-O 12. Qa3 b5 13. Nxb5 Rb8 14. e4
Qb6 15. Kb1 Na5 16. Nd6 Ba4 17. Rd2 Bh6 18. Re2 Rfc8 19. Nxc8 Rxc8 20. Re3 Bc2+
21. Ka1 Bf8 22. Rc3 Rd8 23. Qxf8+ Kxf8 24. Rxc2 Nc6 25. Be2 Nxd4 26. Nxd4 Rxd4
27. Bf3 Kg7 28. g3 Qd8 29. Rf1 Rd3 30. Be2 Rd2 31. Rxd2 Qxd2 32. Bf3 Qd3 33.
Bg2 Qe2 34. Kb1 e5 35. a4 a5 36. Ka2 f5 37. exf5 gxf5 38. h4 Qc2 39. Ka3 e4 40.
b3 h5 41. Bh1 Kf6 0-1

Expected output:-

[Event "LIPEAD40/f (PER)"]
[Site "ICCF"]
[Date "2016.09.30"]
[Round "?"]
[White "Rost, Detlef"]
[Black "Rawlings, Alan J. C"]
[Result "1/2-1/2"]
[WhiteElo "2451"]
[BlackElo "2368"]
[PlyCount "39"]
[EventDate "2016.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. cxd5 Nxd5 5. Bd2 Bg7 6. e4 Nxc3 7. Bxc3 O-O 8.
Qd2 Nc6 9. Nf3 Bg4 10. d5 Bxf3 11. gxf3 Ne5 12. O-O-O c6 13. Qd4 Qd6 14. Kb1
Qf6 15. dxc6 Nxc6 16. Qxf6 Bxf6 17. Bxf6 exf6 18. Bb5 Rfd8 19. Bxc6 bxc6 20.
Kc2 1/2-1/2

[Event "CT20/pr41"]
[Site "ICCF"]
[Date "2013.11.30"]
[Round "?"]
[White "Pachnicke, Harald"]
[Black "Oppermann, Peter"]
[Result "0-1"]
[WhiteElo "2076"]
[BlackElo "2277"]
[PlyCount "82"]
[EventDate "2013.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. Nf3 Bg7 5. Bg5 Ne4 6. cxd5 Nxg5 7. Nxg5 e6 8.
Qa4+ c6 9. dxc6 Nxc6 10. Nf3 Bd7 11. O-O-O O-O 12. Qa3 b5 13. Nxb5 Rb8 14. e4
Qb6 15. Kb1 Na5 16. Nd6 Ba4 17. Rd2 Bh6 18. Re2 Rfc8 19. Nxc8 Rxc8 20. Re3 Bc2+
21. Ka1 Bf8 22. Rc3 Rd8 23. Qxf8+ Kxf8 24. Rxc2 Nc6 25. Be2 Nxd4 26. Nxd4 Rxd4
27. Bf3 Kg7 28. g3 Qd8 29. Rf1 Rd3 30. Be2 Rd2 31. Rxd2 Qxd2 32. Bf3 Qd3 33.
Bg2 Qe2 34. Kb1 e5 35. a4 a5 36. Ka2 f5 37. exf5 gxf5 38. h4 Qc2 39. Ka3 e4 40.
b3 h5 41. Bh1 Kf6 0-1

What I tried:-

awk '/^\[Event/{flag=1;if(flag && non_flag){print val};val=flag=non_flag=""} /O-O-O/{non_flag=1} {val=val?val ORS $0:$0}'  test_file

The above command prints the following. It shows only the first occurrence of the search pattern rather than all of them, and even that stanza is missing a few lines:

[EventDate "2016.??.??"]
[Source "ICCF"]

1. d4 Nf6 2. c4 g6 3. Nc3 d5 4. cxd5 Nxd5 5. Bd2 Bg7 6. e4 Nxc3 7. Bxc3 O-O 8.
Qd2 Nc6 9. Nf3 Bg4 10. d5 Bxf3 11. gxf3 Ne5 12. O-O-O c6 13. Qd4 Qd6 14. Kb1
Qf6 15. dxc6 Nxc6 16. Qxf6 Bxf6 17. Bxf6 exf6 18. Bb5 Rfd8 19. Bxc6 bxc6 20.
Kc2 1/2-1/2

Please advise, thanks!

vgersh99 · September 10, 2018, 12:56pm

how about:

awk '/^[[]Event/ {e=$0;next} /O-O-O/ {print e ORS $0}' RS= ORS='\n\n' myFile
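
For readers new to paragraph mode, here is a commented sketch of the same idea. It assumes Unix line endings and that each game is a tag block and a move block separated by one blank line. The two shortened games and the variable names are made up for illustration.

```shell
# Build a tiny two-game sample (hypothetical data, shortened from the thread).
pgn=$(mktemp)
cat > "$pgn" <<'EOF'
[Event "A"]
[Site "ICCF"]

1. d4 Nf6 2. c4 g6 12. O-O c6 1/2-1/2

[Event "B"]
[Site "ICCF"]

1. d4 Nf6 11. O-O-O O-O 0-1
EOF

# An empty RS switches awk to paragraph mode: records are separated by
# blank lines, so each tag block and each move block arrives as one record.
out=$(awk '
  /^[[]Event/ { tags = $0; next }      # remember the tag block, print nothing yet
  /O-O-O/     { print tags ORS $0 }    # move block with queenside castling: print both
' RS= ORS='\n\n' "$pgn")

printf '%s\n' "$out"
rm -f "$pgn"
```

Only game "B" should be printed, because the plain O-O in game "A" does not match O-O-O.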

prvnrk · September 10, 2018, 1:11pm

Many thanks vgersh99.

I'm unable to add "solved" tag, could any admin/Mod please do that for me?

I find awk extremely hard to learn and highly confusing. Could anyone please suggest a book or link that explains awk in the easiest way? Thanks!

vgersh99 · September 10, 2018, 5:28pm

There are many awk resources out there, including manuals and tutorials.
This is one I bookmarked (among others) a while back, but I cannot recall whether it was good or not.
See if it helps.

prvnrk · September 23, 2018, 1:11pm

Apologies for bumping this thread, but I have problems with large input files (~15 MB and 400K lines). The solution offered by vgersh99 worked for the sample provided and also for a few small input files, but it outputs the entire input for large files. I'm not sure what's wrong with it.

Please advise, thanks!

RudiC · September 23, 2018, 1:49pm

It should not differentiate between small and large files. Do you have structural differences in the large files? Mayhap DOS line terminators (^M = <CR> = \r = 0x0D)? How are those files created?

prvnrk · September 23, 2018, 2:49pm

Thanks RudiC for correctly pointing out the DOS format. I had to run dos2unix on the files, which solved the issue.
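
For anyone who hits the same symptom: with DOS line endings the separator lines contain a carriage return and are no longer empty, so paragraph mode cannot split the games as intended. Below is a rough sketch of checking for carriage returns and stripping them on the fly with tr when dos2unix is not at hand. The one-game sample and the variable names are made up for illustration.

```shell
# Hypothetical one-game sample written with DOS (CRLF) line endings.
dos=$(mktemp)
printf '[Event "B"]\r\n[Site "ICCF"]\r\n\r\n1. d4 Nf6 11. O-O-O O-O 0-1\r\n' > "$dos"

# Count lines that end in a carriage return; anything above 0 suggests DOS format.
cr_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$dos")
echo "lines ending in CR: $cr_lines"

# Strip the carriage returns, then run the paragraph-mode command on stdin.
# -v is used here so RS and ORS are set before any input is read.
out=$(tr -d '\r' < "$dos" |
  awk -v RS= -v ORS='\n\n' '/^[[]Event/ { e = $0; next } /O-O-O/ { print e ORS $0 }')

printf '%s\n' "$out"
rm -f "$dos"
```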

bakunin · September 23, 2018, 3:02pm

These files denote chess games and are called PGN (portable game notation).

awk is a complete programming language, but its main design goal was to work on table-based data. An awk program can have up to three kinds of sections:

one section (BEGIN) is executed before any data are read - think about printing a header
one is executed for each line of the input data and consists of rules. A "rule" basically is a list of commands and a description of a text pattern: if the line matches the text pattern the commands are executed, if not then not
one section (END) is executed after all data in the input file/data stream is processed - think of writing a footer and possibly totals, etc.
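
A small sketch of the three sections in one program, using made-up sample data: BEGIN prints a header, two rules count matching lines, and END prints the totals. The variable names are only for illustration.

```shell
# Hypothetical two-game sample, shortened to the lines that matter here.
pgn=$(mktemp)
cat > "$pgn" <<'EOF'
[Event "A"]

1. d4 Nf6 12. O-O c6 1/2-1/2

[Event "B"]

1. d4 Nf6 11. O-O-O O-O 0-1
EOF

report=$(awk '
  BEGIN       { print "PGN summary" }   # runs once, before any input is read
  /^\[Event / { games++ }               # rule: runs for each line starting with [Event
  /O-O-O/     { castled++ }             # rule: runs for each line containing O-O-O
  END         { printf "games: %d, lines with O-O-O: %d\n", games, castled }   # runs once, after all input
' "$pgn")

printf '%s\n' "$report"
rm -f "$pgn"
```

Each input line is tested against every rule in order, so a single line can trigger more than one action.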

In my opinion the best book about awk (and sed - they have a lot in common) is "awk & sed" by Dale Dougherty, published by O'Reilly.

I hope this helps.

bakunin

wbport · September 24, 2018, 10:18am

I have that book; its name is "sed & awk", and I also found it to be an excellent primer.

The symbol he is looking for represents Queenside Castling.

If you need to create a diagram from a point in a PGN file, this tool can help.