cat book.txt
book1 price 23
sku 1234
auth Bill
book2 sku 1233
price 22
auth John
book3 auth Frank
price 24
book4 price 25
sku 129
auth Tod
import re
with open('book.txt') as f:
    text = f.read()
m = re.findall(r'(\w{5})\sprice\s(\d+)', text)
m
[('book1', '23'), ('book4', '25')]
Desired output:
[('book1', '23'), ('book2', '22'), ('book3', '24'), ('book4', '25')]
I've just started learning regular expressions. Is a regex the right tool for this kind of extraction?
- Each index (book) has a fixed length, so there is no need to worry about names like book1022.
- Each book can have one or more attributes.
Thanks!!
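For comparison, the same extraction can be done without a regex over the whole file. The following is only a minimal sketch, assuming every line is a run of whitespace-separated tokens, that a `bookN` token at the start of a line opens a new record, and that the remaining tokens come in key/value pairs. `parse_books` is a hypothetical helper name, not something from the post.

```python
import re

# Sample data copied from book.txt above.
text = """book1 price 23
sku 1234
auth Bill
book2 sku 1233
price 22
auth John
book3 auth Frank
price 24
book4 price 25
sku 129
auth Tod
"""

def parse_books(text):
    """Return {book_name: {attribute: value}} (hypothetical helper)."""
    books = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        # A leading token such as "book1" starts a new record.
        if tokens and re.fullmatch(r'book\d', tokens[0]):
            current = tokens.pop(0)
            books[current] = {}
        if current is None:
            continue  # ignore anything before the first book
        # Assumption: the remaining tokens alternate key, value, key, value...
        for key, value in zip(tokens[0::2], tokens[1::2]):
            books[current][key] = value
    return books

books = parse_books(text)
# Keep only books that actually have a price attribute.
prices = [(name, attrs['price']) for name, attrs in books.items() if 'price' in attrs]
print(prices)
```

Because each record is closed as soon as the next book name appears, a book without a price is simply skipped instead of picking up a neighbour's price.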
---------- Post updated at 07:01 PM ---------- Previous update was at 06:47 PM ----------
Getting closer:
>>> m = re.findall(r'(book\d)\sprice\s(\d+)|(book\d).+\n+.+price\s(\d+)', text)
>>> m
[('book1', '23', '', ''), ('', '', 'book2', '22'), ('', '', 'book3', '24'), ('book4', '25', '', '')]
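The empty strings show up because `re.findall` returns every capture group for each match, and only one side of the `|` takes part in any given match. One way to collapse the result into plain pairs, sketched under the assumption that a real name or price is never an empty string:

```python
# Output copied from the findall call above.
m = [('book1', '23', '', ''), ('', '', 'book2', '22'),
     ('', '', 'book3', '24'), ('book4', '25', '', '')]

# Drop the groups that did not take part in the match (they are '').
pairs = [tuple(field for field in groups if field) for groups in m]
print(pairs)
```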
---------- Post updated at 07:50 PM ---------- Previous update was at 07:01 PM ----------
Never mind...
It only semi-worked: it finds the price only when it is on the first or second line of a book.
It fails in this case:
cat book.txt
book1 price 23
sku 1234
auth Bill
book2 sku 1233
price 22
auth John
book3 auth Frank
price 24
book4 price 25
sku 129
auth Tod
book5 auth Joe
sku 129
price 13
---------- Post updated at 07:52 PM ---------- Previous update was at 07:50 PM ----------
book5 is missing:
[('book1', '23', '', ''), ('', '', 'book2', '22'), ('', '', 'book3', '24'), ('book4', '25', '', '')]
---------- Post updated at 08:03 PM ---------- Previous update was at 07:52 PM ----------
Yea!! re.DOTALL plus a non-greedy quantifier seems to be working OK:
>>> m = re.findall(r'(book\d)\sprice\s(\d+)|(book\d).+?price\s(\d+)', text, re.DOTALL)
>>> m
[('book1', '23', '', ''), ('', '', 'book2', '22'), ('', '', 'book3', '24'), ('book4', '25', '', ''), ('', '', 'book5', '13')]
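With `re.DOTALL` and a lazy quantifier the alternation may no longer be needed, since `.+?` can also match the single space between the book name and `price`. The sketch below tries that on the second book.txt. It also shows one caveat: if a book has no price at all, the lazy dot runs on into the next book. The `no_price` sample and the `guarded` pattern are assumptions added for illustration, not part of the original data.

```python
import re

# Sample data copied from the second book.txt above (includes book5).
text = """book1 price 23
sku 1234
auth Bill
book2 sku 1233
price 22
auth John
book3 auth Frank
price 24
book4 price 25
sku 129
auth Tod
book5 auth Joe
sku 129
price 13
"""

# One pattern, two groups: findall returns plain (name, price) pairs.
simple = r'(book\d).+?price\s(\d+)'
pairs = re.findall(simple, text, re.DOTALL)
print(pairs)

# Hypothetical input where book3 has no price line.
no_price = "book3 auth Frank\nbook4 price 25\n"
leaky = re.findall(simple, no_price, re.DOTALL)
print(leaky)    # book3 gets paired with book4's price

# Guarded variant: the dot may not step onto the start of another book name.
guarded = r'(book\d)(?:(?!book\d).)*?price\s(\d+)'
safe = re.findall(guarded, no_price, re.DOTALL)
print(safe)     # only book4 is reported
```

The negative lookahead `(?!book\d)` is checked before every character the dot consumes, so a match can never span two records.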