Python with Regex and Excel

Hello

I have a big Excel file for ticket data analysis. The idea is to extract meaningful insight from the Resolution field. Because people write whatever they feel like when resolving a ticket, this is quite a task.

They may or may not tag their notes within the Resolution field with something like the following:

Problem:
Analysis:
Resolution:

So I am supposed to pick the write-up after the Resolution: tag and put it in a new Excel file with the ticket number.

The problem starts because people use their own tagging conventions. Sometimes they skip Problem and Analysis and just write Steps, and so on. It is very random, with no common factor.

After going through a lot of data, we identified the tags listed below as the most common.

So the logic is:

if we find any tag from the first list:
    pick the text after it
elif we find any tag from the second list:
    pick the text after it
else:
    pick the whole field

A field can sometimes satisfy both the if and the elif conditions. In that case the if branch takes precedence.

In the if branch, these are the common tags:

regularexp="r'Action Performed\:(.*)|" \
           "ACTION PERFORMED\:(.*)|" \
           "Steps taken to resolve the issue\:(.*)|" \
           "steps taken to resolve issue\:(.*)|" \
           "Steps Taken to Resolved the issue\:(.*)|" \
           "Steps taken to resolve the issue \:(.*)|" \
           "Steps taken to resolve\:(.*)|" \
           "Steps\-(.*)|" \
           "steps \-(.*)|" \
           "steps\:(.*)|" \
           "Steps taken\:(.*)" \
           "Action taken\:(.*)|" \
           "Actions Taken\:(.*)|" \
           "Action\:(.*)|" \
           "Action \-(.*)|" \
           "Action taken \-(.*)|" \
           "Actions Taken\;(.*)|" \
           "Action Taken \:(.*)|" \
           "Action \:(.*)|" \
           "Action taken to resolve\:(.*)|" \
           "Resolution\-(.*)|" \
           "Resolution\:(.*)|" \
           "Resolution \-(.*)|" \
           "Action taken for resolution\:(.*)|" \
           "Solution \:(.*)|" \
           "analysis\:(.*)|" \
           "analysis \:(.*)|" \
           "Investigation\:(.*)" \
           "observed\/investigated \:(.*)'"

ELIF:
If none of the above is found, we need to check for the following. The tags above take precedence if both kinds are found.

"Update\:(.*)|" \
           "Update \:(.*)|" \
           "UPDATE\-(.*)|" \
           "updates\:(.*)'"

Else:

Just write out the whole field as found.
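
The if / elif / else logic described above could be sketched roughly as follows. This is only a minimal sketch: the tag lists are small illustrative subsets of the ones in this post, the names build_pattern and extract_resolution are made up for the example, and re.DOTALL is an assumption that the text after a tag may run over several lines.

```python
import re

# Illustrative subsets only; the full tag lists are the ones given in the post.
PRIMARY_TAGS = ["Action Performed", "Steps taken to resolve the issue",
                "Resolution", "Analysis"]
SECONDARY_TAGS = ["Updates", "Update"]


def build_pattern(tags):
    # Try longer tags first so "Updates" is tested before "Update".
    ordered = sorted(tags, key=len, reverse=True)
    alternatives = "|".join(re.escape(tag) for tag in ordered)
    # Tag, optional spaces, one separator, then capture the rest of the field.
    return re.compile(r"(?:%s)\s*[:;-]\s*(.*)" % alternatives,
                      re.IGNORECASE | re.DOTALL)


PRIMARY = build_pattern(PRIMARY_TAGS)
SECONDARY = build_pattern(SECONDARY_TAGS)


def extract_resolution(text):
    # The primary pattern is tried first, so it wins when both kinds of tag occur.
    for pattern in (PRIMARY, SECONDARY):
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    # No known tag at all: keep the whole field.
    return text
```

Trying the primary pattern first is what gives the if branch its precedence, even when an Update tag appears earlier in the same field.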

I have written the following code, without the elif block for now:

# -*- coding: utf-8 -*-
"""
Created on Mon May 29 19:34:54 2017

@author: anirbaba
"""

from openpyxl import Workbook, load_workbook
import re
import xlsxwriter
workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write('A1', 'IncidentID')
worksheet.write('B1', 'Resolution')


# Raw string so the backslashes in the Windows path are never read as escapes.
wb=load_workbook(r"D:\Backup\Drive_D\W0rk\Script\Python\RegularX\Output_file_high_effort_no_pks.xlsx", read_only=True)
sheet_ranges=wb['High_effort_without_burst']
regularexp="r'Action Performed\:(.*)|" \
           "ACTION PERFORMED\:(.*)|" \
           "Steps taken to resolve the issue\:(.*)|" \
           "steps taken to resolve issue\:(.*)|" \
           "Steps Taken to Resolved the issue\:(.*)|" \
           "Steps taken to resolve the issue \:(.*)|" \
           "Steps taken to resolve\:(.*)|" \
           "Steps\-(.*)|" \
           "steps \-(.*)|" \
           "steps\:(.*)|" \
           "Steps taken\:(.*)" \
           "Action taken\:(.*)|" \
           "Actions Taken\:(.*)|" \
           "Action\:(.*)|" \
           "Action \-(.*)|" \
           "Action taken \-(.*)|" \
           "Actions Taken\;(.*)|" \
           "Action Taken \:(.*)|" \
           "Action \:(.*)|" \
           "Action taken to resolve\:(.*)|" \
           "Resolution\-(.*)|" \
           "Resolution\:(.*)|" \
           "Resolution \-(.*)|" \
           "Action taken for resolution\:(.*)|" \
           "Solution \:(.*)|" \
           "analysis\:(.*)|" \
           "analysis \:(.*)|" \
           "Investigation\:(.*)" \
           "observed\/investigated \:(.*)'"
#           "Update\:(.*)|" \
#           "Update \:(.*)|" \
#           "UPDATE\-(.*)|" \
#           "updates\:(.*)'"
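# Compile the pattern once, before the loop, instead of once per row.
act_resolution=re.compile(regularexp, re.IGNORECASE)
i=0
# min_row=2 skips the header row (row_offset is not accepted by newer openpyxl releases).
for row in sheet_ranges.iter_rows(min_row=2):
    # row[12] is written under IncidentID, row[16] is the resolution text.
    act_resolutiongroup=act_resolution.search(str(row[16].value))
    if act_resolutiongroup is not None:
        # A tag was found: write the matched text.
        print(row[12].value,act_resolutiongroup.group())
        worksheet.write(i+1,0,row[12].value)
        worksheet.write(i+1,1,act_resolutiongroup.group())
        i+=1
    else:
        # No tag found: write the whole field.
        print(row[12].value,row[16].value)
        worksheet.write(i+1,0,row[12].value)
        worksheet.write(i+1,1,row[16].value)
        i+=1
workbook.close()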
#    if act_resolutiongroup is None:
#        print(row[12].value)
  1. I need help shortening the regular expression held in the variable regularexp.
  2. I have seen cases where the keyword is present, but the code still goes into the else branch and writes the whole field instead of picking out the text after the tag.
  3. Once the above is done, I need to run a 2-gram and 3-gram TF-IDF analysis (non-English, non-numeric). This is far-fetched and not my immediate requirement.
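
On point 2, one likely cause (an assumption from reading the pattern, not something checked against the real spreadsheet) is the way the string is built. The r' and the closing ' sit inside the double quotes, so they become literal characters that the text must contain, and the pieces ending in "Steps taken\:(.*)" and "Investigation\:(.*)" have no trailing |, so each is fused with the alternative that follows it. A cut-down sketch of the effect, with the unnecessary backslashes left out and a hypothetical helper first_group:

```python
import re

# Cut-down copy of the original construction: r' is inside the quotes and
# there is no "|" after the "Steps taken" alternative.
broken = "r'Action Performed:(.*)|" \
         "Steps taken:(.*)" \
         "Action taken:(.*)'"

# Same three alternatives with the r prefix outside the quotes and every "|" present.
fixed = r"Action Performed:(.*)|" \
        r"Steps taken:(.*)|" \
        r"Action taken:(.*)"

samples = ["Action Performed: restarted service",
           "Steps taken: cleared cache",
           "Action taken: reset password"]


def first_group(match):
    # Each alternative has its own capture group; return the one that matched.
    return next(g for g in match.groups() if g is not None).strip()


broken_hits = [bool(re.search(broken, s, re.IGNORECASE)) for s in samples]
fixed_hits = [first_group(re.search(fixed, s, re.IGNORECASE)) for s in samples]
```

Here broken_hits is False for every sample, while fixed_hits holds only the text after each tag. Note also that match.group() with no argument returns the tag together with the text, which is why the sketch reads the capture group instead.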

---------- Post updated 05-31-17 at 10:29 PM ---------- Previous update was 05-30-17 at 10:54 PM ----------

Hello

Can someone shorten the regular expression assigned to regularexp in my post above?

I do not have the openpyxl module on my system, so I tried your regular expression on a simple text file with some dummy data that matches your regex patterns.

Note a couple of things:
1) Characters like ";" and ":" have no special meaning in a regex, so they need not be escaped with a backslash "\".
2) The hyphen "-" has special meaning only inside brackets, where it can denote a range. Elsewhere it need not be escaped.
3) You can shorten the regular expression, but it becomes unreadable very quickly. That is true of any regular expression, so you have to find a balance between readability and succinctness.
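
A small sketch to illustrate notes 1 and 2, using made-up sample strings:

```python
import re

text = "Actions Taken; rebooted the server"

# ";" and ":" are ordinary characters, so the escaped and unescaped
# patterns capture exactly the same text.
escaped = re.search(r"Actions Taken\;(.*)", text)
plain = re.search(r"Actions Taken;(.*)", text)
same_result = escaped.group(1) == plain.group(1)

# Outside brackets a hyphen is a literal character.
literal_hyphen = re.search(r"Steps -(.*)", "Steps - cleared cache").group(1)

# Inside brackets a hyphen between two characters forms a range, so it has to
# be escaped or placed last to be taken literally.
as_range = re.findall(r"[a-c]", "a-c")          # range a..c, the "-" is not matched
at_end = re.findall(r"[ac-]", "a-c")            # hyphen placed last: literal
escaped_hyphen = re.findall(r"[a\-c]", "a-c")   # escaped hyphen: literal
```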

C:\data\>
C:\data\>type testdata.log
Line  1 : Action Performed:
Line  2 : ACTION PERFORMED:
Line  3 : Steps taken to resolve the issue:
Line  4 : steps taken to resolve issue:
Line  5 : Steps Taken to Resolved the issue:
Line  6 : Steps taken to resolve the issue :
Line  7 : Steps taken to resolve:
Line  8 : Steps-
Line  9 : steps -
Line 10 : steps:
Line 11 : Steps taken:
Line 12 : Action taken:
Line 13 : Actions Taken:
Line 14 : Action:
Line 15 : Action -
Line 16 : Action taken -
Line 17 : Actions Taken;
Line 18 : Action Taken :
Line 19 : Action :
Line 20 : Action taken to resolve:
Line 21 : Resolution-
Line 22 : Resolution:
Line 23 : Resolution -
Line 24 : Action taken for resolution:
Line 25 : Solution :
Line 26 : analysis:
Line 27 : analysis :
Line 28 : Investigation:
Line 29 : observed/investigated :

C:\data\>
C:\data\>type processdata.py
import sys
import re

# Raw strings keep sequences such as \s intact for the regex engine.
regularexp = r"Steps Taken to Resolve(d)*\s*(the)*\s*issue\s*:(.*)|" \
             r"Steps taken to resolve:(.*)|" \
             r"Steps taken:(.*)|" \
             r"steps\s*[:-](.*)|" \
             r"Action taken (for|to) resol(ve|ution):(.*)|" \
             r"Action(s)* (Performed|Taken)\s*[;:-](.*)|" \
             r"Action\s*[:-](.*)|" \
             r"(Resolution|Solution|analysis|investigation|observed/investigated)\s*[:-](.*)"
act_resolution = re.compile(regularexp, re.IGNORECASE)
datafile = sys.argv[1]
# The with block closes the file even if an error occurs.
with open(datafile, 'r') as fh:
    for line in fh:
        line = line.rstrip("\n")
        # Print matching lines as they are and flag the rest.
        if act_resolution.search(line) is not None:
            print(line)
        else:
            print("[UNMATCHED]>> ", line)
 
C:\data\>
C:\data\>python processdata.py testdata.log
Line  1 : Action Performed:
Line  2 : ACTION PERFORMED:
Line  3 : Steps taken to resolve the issue:
Line  4 : steps taken to resolve issue:
Line  5 : Steps Taken to Resolved the issue:
Line  6 : Steps taken to resolve the issue :
Line  7 : Steps taken to resolve:
Line  8 : Steps-
Line  9 : steps -
Line 10 : steps:
Line 11 : Steps taken:
Line 12 : Action taken:
Line 13 : Actions Taken:
Line 14 : Action:
Line 15 : Action -
Line 16 : Action taken -
Line 17 : Actions Taken;
Line 18 : Action Taken :
Line 19 : Action :
Line 20 : Action taken to resolve:
Line 21 : Resolution-
Line 22 : Resolution:
Line 23 : Resolution -
Line 24 : Action taken for resolution:
Line 25 : Solution :
Line 26 : analysis:
Line 27 : analysis :
Line 28 : Investigation:
Line 29 : observed/investigated :
  
C:\data\>
C:\data\>

If you try hard enough, you can reduce that regex to a single string.
But the maintainer of your script may not be very happy about it.
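
One way to soften that trade-off is re.VERBOSE, which lets a single pattern be spread over several commented lines. The following is only a sketch: TAG_PATTERN and text_after_tag are names made up for the example, and the pattern is somewhat looser than the original list (it also accepts combinations such as "Actions performed -" that were never listed).

```python
import re

# In VERBOSE mode unescaped whitespace is ignored, so literal spaces are
# written as "\ " and the pattern can carry comments.
TAG_PATTERN = re.compile(r"""
    (?:
        # steps / steps taken / steps taken to resolve(d) [the] issue
        steps(?:\ taken(?:\ to\ resolved?(?:\ (?:the\ )?issue)?)?)?
        # action(s) / action(s) performed / action(s) taken [to resolve | for resolution]
      | actions?(?:\ (?:performed|taken(?:\ (?:for\ resolution|to\ resolve))?))?
      | resolution | solution | analysis | investigation
      | observed/investigated
    )
    \s*[:;-]\s*     # optional spaces around a single separator
    (.*)            # the text after the tag, up to the end of the line
""", re.IGNORECASE | re.VERBOSE)


def text_after_tag(field):
    # Return the text after the first recognised tag, or the whole field if none is found.
    match = TAG_PATTERN.search(field)
    return match.group(1) if match else field
```

Because there is a single capture group, match.group(1) is always the text after the tag, whichever alternative matched.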
