Python script for keyword extraction and stemming

Hello All,

I have a Python script that pulls keywords out of a data set. The data set contains 3 columns:

  1. SysID 2. ID 3. Comment Section.

The script currently extracts keywords, to some extent, from the Comment section and displays only the keywords, without the other columns.

Can someone help me alter the script so that it trims the Comment column of each row down to its key words, while leaving the other columns intact?


#!/usr/bin/env python2.7
import numpy as np
from collections import Counter
import csv

class Preprocess_data():


        def __init__(self, data, k_number_of_features=5000):
                self.k = k_number_of_features
                self.words = zip(*data)[2]


        def get_word(self, data):
                punc1 = ("~`!@#$%^&*()_-+=[]{}\|;:',<.>/?")
                punc2 = ('"')
                wordsbag = []
                words = zip(*data)[2]
                words = [item.lower().translate(None, punc1).translate(None, punc2) for item in words]
                self.words = [item.split() for item in words]
                for line in self.words:
                        wordsbag.extend(set(line))
                return wordsbag


        def count_attr(self,data):
                c = Counter(self.get_word(data))
                feature = c.most_common(100+self.k)[100:100+self.k]
                return feature


        def summarize_feature(self, data):
                # Build a rows-by-features 0/1 matrix: entry [i][j] is 1 when
                # feature word j occurs in the comment of row i.
                feature = self.count_attr(data)
                words = self.words  # tokenized comments, set by get_word()
                feature_value = np.zeros((len(data), len(feature)))
                for i in range(len(words)):
                        for j in range(len(feature)):
                                if feature[j][0] in words[i]:
                                        feature_value[i][j] = 1
                return feature_value



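if __name__=='__main__':
        # 'testfile' is tab-separated: SysID, ID, Comment
        with open('testfile', 'rU') as f:
                data = list(csv.reader(f, delimiter='\t'))
        # k_number_of_features must be an integer (it is added to 100 in count_attr)
        preprocessed = Preprocess_data(data, k_number_of_features=5000)
        wordsbag = preprocessed.get_word(data)
        feature = preprocessed.count_attr(data)
        feature_value = preprocessed.summarize_feature(data)
        #-------print the selected feature words---------#
        # count_attr skips the 100 most common words, so these start at rank 101
        for i, (word, count) in enumerate(feature):
                print 'WORD' + str(i+1), word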

Sample Dataset


4819	810	The locker doors "Inside" were marked and not polished properly.
4885	1313	The seal around / on top of the flush panel is damaged.
4932	825	The clock facing the bag drop drive way is not set correctly / displays incorrect time.
5067	744	Gaps are visible between the interlock flooring tiles.
5027	737	The menu is damaged.
5067	748	The wall is seen blistered.
4845	825	The left side of the panel is fused.
4952	810	The terrace tiles are damaged.
5496	1044	tetst
5022	732	The service door is left open and construction equipment is left unattended.
5496	1044	test
5496	2009	test
4952	810	The terrace tiles are cracked /damaged.
5058	1110	The
5067	2022	The umbrella's  bases  of the restaurant are seen dusty and dirty.
5058	1110	The Interlock flooring is seen damaged and stained.
5058	1110	Gaps are visible between Interlock flooring.
5058	1110	Several toilet cubicles doors are seen chipped.
5489	824	tttt
5058	1110	The prayer timings electrical board has been removed during painting and never returned back and a mark is visible on the wall.
4771	693	The toilet cubicle skirtings are scratched.
5026	52	The terrace is damaged.
5027	737	The menu is damaged.
5026	743	The terrace is damaged.
4906	24	fgfgf
5059	829	The wall around the A/C grill is stained.
5059	829	The door stopper is missing and tile is damaged by door handle.
5059	829	The soap holder is missing.
5059	829	The douche tap fitting is loose.
5059	829	The corner of the wall is damaged and moldy.
5059	829	The ping pong table is damaged.
5059	829	The sign at the gate to pool area is faded.
5059	829	The protective net is not properly installed. The fitting is untidy.
5059	829	The pool loungers are stained.
5059	829	The corner of the wall is damaged and moldy.
5059	829	The corner of the wall is damaged and moldy.
5059	829	The corner of the wall is damaged and moldy.
5058	1117	The empty unit is seen not hoarded; window is dirty and dust is visible from the window.
5058	1110	The flooring arrows are faded and worn.
5490	1957	test
5022	732	There appears to be water damage on the dipped ceiling.
4825	833	The
5022	727	The information about where the stairs lead to is missing.
5022	732	The stairs walls are all blank. Information about what is at the top of the stairs needs to be added to those walls.
5022	732	The yellow exit sign painted on the wall is damaged above it and the paint is uneven and untidy.
5022	732	The yellow car park sign hanging from the ceiling is chipped at the lower left ledge.
5056	833	Ceiling access panels are still found missing.
5056	833	Main door is damaged on lower edge.
5022	732	There is yellow tape in a square shape left above the Tche Tche Cafe sign on the wall.
5056	833	Tiles panels are damaged.

The current output from the script is below:


WORD1 working
WORD2 correctly
WORD3 cover
WORD4 ac
WORD5 doors
WORD6 it
WORD7 full
WORD8 display
WORD9 parking
WORD10 heavily
WORD11 wooden
WORD12 for
WORD13 edges
WORD14 humidity
WORD15 cubicles
WORD16 fitted
WORD17 out
WORD18 room
WORD19 tree
WORD20 behind
WORD21 fence
WORD22 ok
WORD23 dusty
WORD24 cabinet
WORD25 along
WORD26 rusty
WORD27 overgrown
WORD28 as
WORD29 signs
WORD30 protruding
WORD31 painted
WORD32 fountain
WORD33 covered
WORD34 does
WORD35 dry
WORD36 availability
WORD37 lift
WORD38 operational
WORD39 severally
WORD40 poor
WORD41 found
WORD42 litter
WORD43 blistered


The expected result should be:


SysID    ID      Keywords

5067	2022	  umbrella's , dusty, dirty.
5058	1110	  Interlock, damaged, stained.
5058	1110	  Gaps, flooring.
5058	1110	  toilet, doors,  chipped.

Thanks in advance. I hope someone can help.

I foresee problems with the approach of excluding common words. "Damaged" is an important word, but also common in your data. "Not" is also common and kind of vital. And when your data changes, so will whatever words you exclude.
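To make that concrete, here is a small sketch that counts how many comments each word appears in, using six comments copied from the sample data. It is written for Python 3 rather than the Python 2.7 of the posted script, and the `tokenize` helper is my own illustration, not part of the original.

```python
from collections import Counter
import string

# A few comments copied from the sample dataset above.
comments = [
    "The menu is damaged.",
    "The terrace tiles are damaged.",
    "The wall is seen blistered.",
    "The terrace is damaged.",
    "Main door is damaged on lower edge.",
    "The soap holder is missing.",
]

def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

# Count each word once per comment, as the posted script does
# by calling set(line) before extending the word bag.
doc_freq = Counter()
for comment in comments:
    doc_freq.update(set(tokenize(comment)))

top_three = [word for word, _ in doc_freq.most_common(3)]
print(doc_freq.most_common(3))
```

On these six rows, "damaged" lands in the top three next to "the" and "is", so any rule that simply skips the most frequent words would throw it away too.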

And how important a word is depends on context. Nothing is lost by deleting "left" from "door left open", but something is lost by deleting it from "left door open".
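A tiny sketch of that effect, in Python 3. The `drop_words` function and the one-word exclusion set are hypothetical, only there to show that a word-by-word filter cannot tell the two phrases apart.

```python
def drop_words(text, excluded):
    """Remove every token that appears in `excluded` (case-insensitive)."""
    return [w for w in text.lower().split() if w not in excluded]

excluded = {"left"}  # hypothetical exclusion list

a = drop_words("door left open", excluded)   # "left" describes the state
b = drop_words("left door open", excluded)   # "left" identifies which door
print(a, b)
```

Both phrases come out as the same two words, so whichever door it was is no longer recoverable from the filtered text.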

You can build lists of exclusions and special words until the cows come home, and then one funny case will come along which blows it all out of the water. Add one more special case for that word, then special cases of that special case for any odd but valid ways the word might be used. Rinse and repeat until you lose your mind or your code gains sentience.

I'm not sure true English language processing can be implemented in a tinkertoy.

Deleting common words like "the" and "is", that's certainly doable.
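Here is a minimal sketch of that doable part, assuming Python 3 and a tab-separated file with exactly three columns per row. The `STOP_WORDS` set is hand-picked by me for illustration and is nowhere near complete; `keywords` and `reduce_rows` are hypothetical names. The SysID and ID columns pass through untouched and only the comment is reduced.

```python
import csv
import io
import string

# Hypothetical, hand-picked stop-word list; extend it for real data.
STOP_WORDS = {"the", "is", "are", "of", "and", "a", "an", "on", "in",
              "to", "seen", "between", "several"}

def keywords(comment):
    """Return the comment's words, in order, minus stop words."""
    kept = []
    for token in comment.split():
        word = token.strip(string.punctuation)  # trims only leading/trailing punctuation
        if word and word.lower() not in STOP_WORDS:
            kept.append(word)
    return kept

def reduce_rows(rows):
    """Keep SysID and ID as they are; replace the comment with its keywords."""
    return [(sys_id, row_id, ", ".join(keywords(comment)))
            for sys_id, row_id, comment in rows]

# Two tab-separated rows copied from the sample dataset.
sample = ("5058\t1110\tGaps are visible between Interlock flooring.\n"
          "5027\t737\tThe menu is damaged.\n")
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
result = reduce_rows(rows)
for row in result:
    print("\t".join(row))
```

For a real file, replace the `io.StringIO(sample)` part with an opened file handle. Note that this keeps "visible" and "Interlock" in the first row, which is more than the expected result in the question shows; deciding which of the remaining words count as "key" is exactly the part that does not reduce to a fixed list.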