sed or awk to parse this text

bulgin · August 30, 2010, 10:25pm

I am just beginning with sed and awk, and I understand that they process input line by line. That is, they operate on each line individually and produce output based on rules.

But I have multi-line text blocks that look as follows, and I wish to extract ONLY the text after the first hyphen (-) through the end of that entry, even though it may continue onto the next line and run to several sentences. Note that these text blocks sit among many others with similar features; their distinguishing feature is the pattern *[digits]Some text with a hyphen - this is what I want to extract. Maybe even another sentence, too, on another line.

* [42]Things to do - Wash clothes, clean house, write letters, take dog for walk, watch tv, eat dinner.
* [43]Business items - Provide instructions to clients on property locations, write listing reports, copy contracts to computer disk, call state agencies.

My preferred end result using the above sample is:

Wash clothes, clean house, write letters, take dog for walk, watch tv, eat dinner. Provide instructions to clients on property locations, write listing reports, copy contracts to computer disk, call state agencies.

I could really use some help on this.

Thanks.

agama · August 30, 2010, 10:42pm

This is a quick example. There are probably other ways to do it, but this is straightforward:

awk '
        /^[*]/ {                            # new section 
                if( snarf )
                        printf( "\n" );         # terminate the last section
                snarf = 1;                      # open the section
                n = index( $0, "-" );           # find first -
                printf( "%s ", substr( $0, n+1 ) );     # print everything after the first dash
                next;
        }

        NF == 0 {                           # terminate section on blank line
                if( snarf )
                {
                        snarf = 0;
                        printf( "\n" );
                }
                next;
        }

        snarf > 0 {                         # if in section print this line. 
                printf( "%s ", $0 );
                next;
        }

        END {
                if( snarf )                    # need to finish the last section with newline
                        printf( "\n" );
        }
'

It does assume that each section starts with an asterisk (*) and that if it is continued onto multiple lines the section ends with the next asterisk or a blank line. The output from each section is put on one line (no intermediate newlines) even if it was on multiple lines in the input. Each section is placed on a separate line.
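For illustration, the rules described above can be exercised on a few lines of input. The block below is a condensed sketch of the same logic, not a replacement for the script; the function name join_sections is made up for the example, and the asterisk is assumed to be the first character on the line.

```shell
#!/bin/sh
# Condensed sketch of the rules described above; join_sections is a made-up name.
join_sections() {
    awk '
        /^[*]/ {                                # a new section starts here
            if( snarf ) printf( "\n" );         # close the previous section, if any
            snarf = 1;
            printf( "%s ", substr( $0, index( $0, "-" ) + 1 ) );
            next;
        }
        NF == 0 {                               # a blank line ends the section
            if( snarf ) printf( "\n" );
            snarf = 0;
            next;
        }
        snarf { printf( "%s ", $0 ); }          # continuation line
        END   { if( snarf ) printf( "\n" ); }
    ' "$@"
}

# Sample from the question, split over lines, with unrelated text in between.
printf '%s\n' \
    '* [42]Things to do - Wash clothes, clean house,' \
    'write letters.' \
    '' \
    'unrelated text' \
    '* [43]Business items - Call state agencies.' |
join_sections
```

Each section should come out as a single line (with a leading and a trailing space left over from the joins), and the unrelated line is skipped because no section is open when it is read.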

Hope this helps.

rdcwayx · August 30, 2010, 10:48pm

cut -d- -f2- < infile | tr "\n" " "

bulgin · August 30, 2010, 11:05pm

Agama, thank you for your reply. Are you suggesting I run this like:

awk -f awk.script testfile.txt

because that is producing errors.

agama · August 30, 2010, 11:17pm

I generally put it into a ksh script, but you can run it that way provided that you put neither the awk command nor the opening/closing single quotes into awk.script.
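To make the awk -f route concrete, here is a minimal sketch. The file name awk.script comes from the question; the one-rule program, the scratch directory and the sample input are stand-ins chosen only to keep the example short. The point is that the file holds just the awk program text, with no awk command and no surrounding quotes.

```shell
#!/bin/sh
# Use a scratch directory so nothing existing is overwritten.
dir=$(mktemp -d) || exit 1

# awk.script contains ONLY the awk program: no "awk" word, no single quotes.
cat > "$dir/awk.script" <<'EOF'
/^[*]/ { print substr( $0, index( $0, "-" ) + 1 ) }   # text after the first dash
EOF

# Stand-in input; the real file would be the Lynx dump.
printf '%s\n' '* [42]Things to do - Wash clothes.' > "$dir/testfile.txt"

# The shell never sees the program text, so no quoting is needed here.
result=$(awk -f "$dir/awk.script" "$dir/testfile.txt")
printf '%s\n' "$result"
```

If the awk line and the quotes are left in the file, awk tries to parse them as part of the program, which is the likely source of the errors.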

I assumed you were probably adding it to an existing ksh/bash script, something like this (replace ksh with bash if you prefer bash):

#!/usr/bin/env ksh
awk '
        /^[*]/ {                            # new section 
                if( snarf )
                        printf( "\n" );         # terminate the last section
                snarf = 1;                      # open the section
                n = index( $0, "-" );           # find first -
                printf( "%s ", substr( $0, n+1 ) );     # print everything after the first dash
                next;
        }

        NF == 0 {                           # terminate section on blank line
                if( snarf )
                {
                        snarf = 0;
                        printf( "\n" );
                }
                next;
        }

        snarf > 0 {                         # if in section print this line. 
                printf( "%s ", $0 );
                next;
        }

        END {
                if( snarf )                    # need to finish the last section with newline
                        printf( "\n" );
        }'   <test_file.txt

I hope that is a bit clearer.

If you are still having issues with it, please post the error messages.

bulgin · August 30, 2010, 11:25pm

Thank you, Agama, for clearing that up. When I run it as a bash script using test_file.txt, it runs but produces no output. Nor does it produce errors.

Here are a couple of lines of the test_file.txt:

* [44]Amateur Astronomy and Space Website - Images, CCD information,
  buy and sell, and links.
* [45]Ask an Astronomer - Questions answered by graduate students,
  including a question and answer archive, information on the solar
  system, the universe, SETI, observational astronomy, careers and
  history.
* [46]Astro World - Provides studies, images, movies, and equations.
* [47]Astronomical Observatory - Visual and CCD photomety, including
  classic and digital astrophotography. Presents equipment and
  resources. Located near Plomin in eastern Istria, Croatia.
* [48]Astronomy Awards - Supporting the hobby related web sites
  throughout world.
* [49]Astronomy Boy - Getting started, CG-5 mount, SAA 100 list,
  constellation portraits, barn door tracker, comet Hale Bopp,
  homemade eyepieces, EQ mount tutorial, millennium rant, and
  biography, home.

agama · August 30, 2010, 11:52pm

That's odd. I cut the sample to make sure I hadn't introduced a bug transferring it into the edit window, and was able to process the little bit of data that you posted.

The only thing that I can think of that might be causing issues, and I might not see it without your putting the data in code tags, is the position of the leading asterisk. Is it the very first character on the line? If not, that would prevent the script from seeing it as a section marker and thus it wouldn't print anything.

A small change to the first line would handle the case where it was indented by spaces or tabs:

        /^[ \t]*[*]/ {                            # new section
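A small sketch of why the anchor matters, using two made-up lines: one with the asterisk flush left and one indented, as the real data is suspected to be. The variable names are hypothetical.

```shell
#!/bin/sh
# One line flush left, one indented by three spaces.
sample=$(printf '%s\n' '* [44]flush left' '   * [45]indented')

# Anchored at column one: only the flush-left line can match.
strict=$(printf '%s\n' "$sample" | awk '/^[*]/ { n++ } END { print n + 0 }')

# Optional leading blanks or tabs allowed: both lines can match.
relaxed=$(printf '%s\n' "$sample" | awk '/^[ \t]*[*]/ { n++ } END { print n + 0 }')

echo "strict=$strict relaxed=$relaxed"
```

With the stricter pattern the indented line is never treated as a section start, so nothing after it is collected either.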

If the asterisk is the very first character, then it's possible that awk isn't being executed at all. You can add this line before the 'new section' rule in the script to print all input lines to the standard error device as they are read. This will verify that the script is being invoked and that the file you think it is parsing is indeed being parsed.

(The debugging rule is the new line; the first few lines after it are included to show placement, not the whole script.)

awk '
        { print > "/dev/stderr"; }     # debugging -- print everything to standard error

        /^[*]/ {                            # new section 
                if( snarf )
                        printf( "\n" );         # terminate the last section
                snarf = 1;                      # open the section
                n = index( $0, "-" );           # find first -
                printf( "%s ", substr( $0, n+1 ) );     # print everything after the first dash
                next;
        }
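As a sketch of how a trace on standard error stays separate from the real output, the example below sends the two streams to different files. It assumes an awk that accepts the special file name "/dev/stderr" (gawk, mawk and the one-true-awk do); the scratch directory, out.txt and debug.log are hypothetical names, and the one-line extraction rule is a stand-in for the full script.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1

printf '%s\n' '* [44]Astro - Images.' 'not a section' |
awk '
    { print "debug: " $0 > "/dev/stderr" }                 # trace every input line
    /^[*]/ { print substr( $0, index( $0, "-" ) + 1 ) }    # normal output
' > "$dir/out.txt" 2> "$dir/debug.log"

echo "stdout lines: $(wc -l < "$dir/out.txt")"
echo "stderr lines: $(wc -l < "$dir/debug.log")"
```

If debug.log stays empty, awk is not reading the file you expect; if it fills up but out.txt is empty, the section pattern is not matching.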

Have a go with those ideas. Not sure what it could be otherwise.

bulgin · August 30, 2010, 11:58pm

Your new section fix, above, worked! Thank you, thank you! I now see the data. The only problem is that some of the output seems to have large gaps in it, but I think I can live with that. Here's a sample of the output, including the long spaces (tabs?):

Project to provide an astronomy podcast        every day of the year, written, recorded and produced by people        around the world.
 Provides        information for people using their naked eyes, binoculars or small        telescopes. Includes articles, links, downloads and shopping.
 Images, CCD information,        buy and sell, and links.
 Questions answered by graduate students,        including a question and answer archive, information on the solar        system, the universe, SETI, observational astronomy, careers and        history.
 Provides studies, images, movies, and equations.
 Visual and CCD photomety, including        classic and digital astrophotography. Presents equipment and        resources. Located near Plomin in eastern Istria, Croatia.
 Supporting the hobby related web sites        throughout world.
 Getting started, CG-5 mount, SAA 100 list,        constellation portraits, barn door tracker, comet Hale Bopp,        homemade eyepieces, EQ mount tutorial, millennium rant, and        biography, home.
 Weekly podcast providing discussions on        astronomical topics ranging from planets to cosmology.
 Monthly podcast discussing what can be seen        in the night sky.
 Provides news, articles and resources        updated daily.
 Includes galleries, equipment reviews,        articles, observation planning, and links.
 Contains sections for equipment, the        beginner, books, the solar system and deep sky, web log, and links.
 Myths and misconceptions. Includes an        introduction, brief biography, and discussion board.
 An online astronomy journal by Math        Heijen, backyard astronomer from the Netherlands. Observing logs        from Sun, Moon and Deepsky, digital lunar and solar images,        equipment reviews, links, books etc. Articles about the Sun, Moon        and Deepsky.

agama · August 31, 2010, 12:07am

Great -- glad that worked.

Could be tabs at the beginning of each input line or something.

A simple fix would be to ditch all of the whitespace at the start of the line:

 { gsub( "^[ \t]*", "", $0 ); }

Add this before the test for a new section to delete leading space/tabs.

Could be spaces/tabs at the end of the line; I doubt it, but if it is:

gsub( "[ \t]*$", "", $0 );

added inside the braces of the code block above, should work.
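Both trims can be tried in isolation. This is only a sketch: trim_lines is a hypothetical helper name, and it uses sub() with a + quantifier instead of gsub() with *, which behaves the same here because an anchored pattern can match at most once per line.

```shell
#!/bin/sh
# Hypothetical helper: strip blanks and tabs from both ends of every line.
trim_lines() {
    awk '{
        sub( /^[ \t]+/, "" )     # leading whitespace
        sub( /[ \t]+$/, "" )     # trailing whitespace
        print
    }' "$@"
}

# A continuation line indented the way the gaps in the output suggest.
printf '%s\n' '        buy and sell, and links.   ' | trim_lines
```

With the indentation gone, joining lines with a single space no longer leaves wide gaps in the middle of an entry.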

Glad this worked for you.

bulgin · August 31, 2010, 12:15am

The first code works fine. I really appreciate it! The end result, which I can work with, also includes the following text:

[83]Swedish (26)
[84]Thai (4)
[85]Turkish (44)
[86]Ukrainian (9)
[87]
A Review of the Universe: Structures, Evolutions, Observations, and Theories - A retired physicist surveys the entire extent of the universe touching upon phenomenon from the largest to the smallest size and covering the entire cosmic interval from past to present.
Facts and statistical information about planets, moons, constellations, stars, galaxies, and Messier objects.
Contains 3D maps of the universe zooming out from the nearest stars to the scale of the galaxy and out to the surrounding superclusters and finally to the scale of the known universe.
[120]A9 - [121]AOL - [122]Ask - [123]Clusty - [124]Gigablast - [125]Google - [126]Lycos - [127]MSN - [128]Yahoo [129]Google Web Directory

I'm wondering if I can further pipe this through sed or awk to remove all lines with brackets. FYI, the data is from a Lynx dump which removes tags from a website I am documenting and leaves this behind.

Thanks again!

agama · August 31, 2010, 9:07pm

Yes, you could pipe the output through sed to eliminate those lines, but I think it is better to fix the awk programme so that it doesn't emit them in the first place.
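If a sed filter is wanted as a stopgap, one possible form is sketched below. The helper name drop_link_lines is made up, and the pattern assumes that the unwanted lines are exactly those containing a bracketed number such as [83]; any wanted line that happens to contain one would be dropped as well.

```shell
#!/bin/sh
# Hypothetical helper: delete every line containing a bracketed number such as [83].
# The closing bracket needs no backslash outside a bracket expression.
drop_link_lines() {
    sed '/\[[0-9][0-9]*]/d' "$@"
}

printf '%s\n' \
    '[83]Swedish (26)' \
    '[87]' \
    'Provides studies, images, movies, and equations.' \
    '[120]A9 - [121]AOL' |
drop_link_lines
```

Of the four sample lines, only the one without a bracketed number survives the filter.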

I tried a couple of different combinations, and nothing I do ends up with output like you've indicated. Can you post the lines from the input file around the area where the garbage output is coming from?

bulgin · August 31, 2010, 9:56pm

I've attached the complete dump file produced by lynx. As you can see, your script produces almost everything I need perfectly except it also brings in other entries before and after the intended parsing. If you run your script against the attached file you will see what I mean. My hope is to completely eliminate all but the concatenated sentences after the hyphen.

Thank you for your help and patience.

agama · September 1, 2010, 12:11am

Having the complete file was the trick; thank you. A few things that weren't obvious from your small sample (like the fact that the dash could be on the following line) caused me to approach it a bit differently. This works for me:

#!/usr/bin/env ksh
awk '
        function print_buf( )
        {
                if( !buffer )
                        return;

                if( (n = index( buffer, "-" )) > 0 )    # only want lines with a dash
                        printf( "%s\n\n", substr( buffer, n+2 ) );      # print everything after the first dash
                buffer = ""             # start fresh
        }

        { sub( /^[ \t]+/, "" ); }       # ditch leading whitespace

        /^[ \t]*[*][ \t]*[[]/ {         # assume: <whitespace>*<whitespace>[
                print_buf( );           # print previous section if there and has a -
                snarf = 1;              # signal collection of buffer is ok
                buffer = $0;           # initialise buffer with current line
                next;
        }

        /^[ \t]*[*]/ {                # not a section, and ends it if snarfing
                print_buf();
                snarf = 0;
                next;
        }
        NF <= 0 {                       # empty line terminates current section
                print_buf();            # print buffer if it exists
                snarf = 0;              # turn collection off
                next;
        }

        snarf > 0 {                     # snarfing, collect into the buffer until next section or end 
                buffer = buffer " " $0;  
                next;
        }

        END {
                print_buf( );
        }
'  astro.txt

There are two newlines in the printf(), which made the output easier for me to read. Take one out if you don't want the blank line between entries. I also chop the space that trails the first dash; change 'n+2' to 'n+1' in the substr() call if you want to keep that space.
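The effect of the offset, and of searching for the dash only after the whole entry has been collected, can be seen in a stripped-down sketch. The helper after_dash, its skip argument and the two-line entry are invented for the example; the square brackets in the output are there only to make a leading space visible.

```shell
#!/bin/sh
# Hypothetical helper: join stdin into one buffer, then cut at the first dash.
# $1 is the offset added to the position of the dash (1 keeps the space, 2 drops it).
after_dash() {
    awk -v skip="$1" '
        { sub( /^[ \t]+/, "" ); buffer = buffer ( NR > 1 ? " " : "" ) $0 }
        END {
            n = index( buffer, "-" )
            if( n > 0 )
                printf( "[%s]\n", substr( buffer, n + skip ) )   # brackets show the edge
        }
    '
}

# The dash sits on the second line of the entry.
entry=$(printf '%s\n' '* [87]A Review of the Universe' '  - A retired physicist surveys it.')

printf '%s\n' "$entry" | after_dash 1
printf '%s\n' "$entry" | after_dash 2
```

Because the lines are joined first, the dash is found even though it is not on the line that carries the asterisk; a per-line search would have missed it.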

Hope this does the trick!!

bulgin · September 1, 2010, 12:15am

This does do the trick. You are a genius AND a gentleman.

Thank you for all your help.