regular expression grouping across multiple lines

chirish · September 26, 2012, 8:03pm

cat book.txt

book1 price 23
      sku   1234
      auth  Bill
book2 sku   1233
      price 22
      auth  John
book3 auth  Frank
      price 24
book4 price 25
      sku   129
      auth  Tod

import re
f = open('book.txt', 'r')
text = f.read()
f.close()
m = re.findall(r'(\w{5})\sprice\s(\d+)', text)
m


[('book1', '23'), ('book4', '25')]

desired output:
[('book1', '23'), ('book2', '22'), ('book3', '24'), ('book4', '25')]

I just started learning regular expressions (RE). Is RE the proper tool for this type of extraction?

Each index (book name) has a fixed length, so there is no need to worry about names like book1022.
Each book can have one or more attributes.

Thanks!!

---------- Post updated at 07:01 PM ---------- Previous update was at 06:47 PM ----------

Getting closer

>>> m = re.findall(r'(book\d)\sprice\s(\d+)|(book\d).+\n+.+price\s(\d+)', text)
>>> m
[('book1', '23', '', ''), ('', '', 'book2', '22'), ('', '', 'book3', '24'), ('book4', '25', '', '')]
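
A note on the empty strings: re.findall returns one slot per capture group in the whole pattern, so the groups belonging to the alternative that did not match come back as ''. Below is a minimal sketch of one way to collapse the 4-tuples into pairs. The sample text is inlined, and the names `pattern`, `raw`, and `pairs` are only for illustration.

```python
import re

# Same data as book.txt above, inlined so the snippet is self-contained.
text = """book1 price 23
      sku   1234
      auth  Bill
book2 sku   1233
      price 22
      auth  John
book3 auth  Frank
      price 24
book4 price 25
      sku   129
      auth  Tod
"""

pattern = r'(book\d)\sprice\s(\d+)|(book\d).+\n+.+price\s(\d+)'
raw = re.findall(pattern, text)

# Drop the empty slots left by the alternative that did not take part.
pairs = [tuple(g for g in groups if g) for groups in raw]
print(pairs)
```

This only tidies the result; it does not change which books the pattern finds.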

---------- Post updated at 07:50 PM ---------- Previous update was at 07:01 PM ----------

Never mind...
It only works when the price is on the first or second line of a record.

It fails in this case:

cat book.txt

book1 price 23
      sku   1234
      auth  Bill
book2 sku   1233
      price 22
      auth  John
book3 auth  Frank
      price 24
book4 price 25
      sku   129
      auth  Tod
book5 auth Joe
      sku   129
      price 13

---------- Post updated at 07:52 PM ---------- Previous update was at 07:50 PM ----------


book5 is missing:

[('book1', '23', '', ''), ('', '', 'book2', '22'), ('', '', 'book3', '24'), ('book4', '25', '', '')]

---------- Post updated at 08:03 PM ---------- Previous update was at 07:52 PM ----------

Yea!! DOTALL and non-greedy matching seem to be working OK.

>>> m = re.findall(r'(book\d)\sprice\s(\d+)|(book\d).+?price\s(\d+)', text, re.DOTALL)
>>> m
[('book1', '23', '', ''), ('', '', 'book2', '22'), ('', '', 'book3', '24'), ('book4', '25', '', ''), ('', '', 'book5', '13')]
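
To see what each piece contributes, here is a small sketch comparing three variants on a two-book sample. Without re.DOTALL the dot cannot cross a newline. With DOTALL but a greedy `.+`, the match runs on to the last price in the text. The sample string and variable names are made up for the illustration.

```python
import re

# Two records whose price sits on the second line.
text = (
    "book2 sku   1233\n"
    "      price 22\n"
    "book3 auth  Frank\n"
    "      price 24\n"
)

lazy_pattern = r'(book\d).+?price\s(\d+)'
greedy_pattern = r'(book\d).+price\s(\d+)'

# '.' stops at the newline, so neither record matches.
no_dotall = re.findall(lazy_pattern, text)

# '.' crosses newlines, but greedy '.+' swallows everything up to the LAST price.
greedy = re.findall(greedy_pattern, text, re.DOTALL)

# Non-greedy '.+?' stops at the first price after each book name.
lazy = re.findall(lazy_pattern, text, re.DOTALL)

print(no_dotall)
print(greedy)
print(lazy)
```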

spacebar · September 26, 2012, 8:08pm

Here is a simple script that I believe will do what you want. Try it out:

$ cat book.sh
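#!/bin/bash
# Build a Python-style list of (book, price) tuples from input_file.
line_out="["
while read -r p1 p2 p3
do
  if [[ $p1 == book* && $p2 == price ]]; then
    # Book name and price are on the same line.
    line_out="$line_out('$p1', '$p3'), "
  elif [[ $p1 == book* ]]; then
    # Book name only; its price is expected on a later line.
    line_out="$line_out('$p1', "
  elif [[ $p1 == price ]]; then
    line_out="$line_out'$p2'), "
  fi
done <input_file
# Strip the trailing ", " and close the list.
line_out="${line_out%, }]"
echo "$line_out"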

$ cat input_file
book1 price 23
      sku   1234
      auth  Bill
book2 sku   1233
      price 22
      auth  John
book3 auth  Frank
      price 24
book4 price 25
      sku   129
      auth  Tod

$ book.sh
[('book1', '23'), ('book2', '22'), ('book3', '24'), ('book4', '25')]

chirish · September 26, 2012, 8:20pm

duh..

>>> f = open('book.txt', 'r')
>>> text = f.read()
>>> f.close()
>>> m = re.findall(r'(book\d).+?price\s(\d+)', text, re.DOTALL)
>>> m
[('book1', '23'), ('book2', '22'), ('book3', '24'), ('book4', '25'), ('book5', '13')]
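
One caveat, since a book may have only one attribute: if a book has no price line at all, `.+?` with re.DOTALL keeps going and picks up the next book's price. A line-by-line pass avoids that. The sketch below is only one way to do it. The helper name `book_prices` and the sample data are hypothetical, and it assumes every record starts with a name beginning with "book".

```python
import re

def book_prices(text):
    """Return (book, price) pairs, skipping books that have no price line."""
    pairs = []
    current = None  # name of the book whose record is being read
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith('book'):
            current = fields[0]
            fields = fields[1:]  # what remains is an attribute name and value
        if current and len(fields) >= 2 and fields[0] == 'price':
            pairs.append((current, fields[1]))
    return pairs

# book2 has no price in this made-up sample.
sample = """book1 price 23
      sku   1234
book2 auth  John
book3 sku   77
      price 24
"""

regex_pairs = re.findall(r'(book\d).+?price\s(\d+)', sample, re.DOTALL)
line_pairs = book_prices(sample)

print(regex_pairs)  # book2 gets paired with book3's price
print(line_pairs)   # book2 is skipped, book3 keeps its own price
```

If every book is guaranteed to have a price, the single regex above is fine as it is.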