script

rintingtong · March 28, 2004, 4:38pm

I have about a million records in a text file.

I want to identify only those records which have an employee number like '2342-3456-9870-9999', '1232-7594-8888-0984', etc.

These numbers are unique.

Can anyone suggest a simple Unix script for this?

I can use grep command to search for a particular record using a particular employee number.

But I don't know which options I should use to extract all the records of the type '9999-9999-9999-9999'.
There are some records which have headers and blank spaces in the same field, which I want to eliminate.

mbb · March 29, 2004, 8:29am

egrep allows you to use regular expressions (or pattern matching).

In your specific example, egrep for

"[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
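As a rough sketch of how that pattern could be applied: the file name records.txt and its contents below are made up for illustration, since the real file layout isn't given in this thread. Header lines and lines with a blank employee-number field simply don't match, so they drop out of the output.

```shell
# Hypothetical sample data (the real record layout is an assumption).
tmpdir=$(mktemp -d)
cat > "$tmpdir/records.txt" <<'EOF'
EMPLOYEE_NO NAME
2342-3456-9870-9999 Alice
                    Bob
1232-7594-8888-0984 Carol
EOF

# Four groups of four digits, separated by hyphens.
pattern='[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'

# grep -E is the POSIX spelling of egrep; only lines containing the pattern are kept.
matches=$(grep -E "$pattern" "$tmpdir/records.txt")
printf '%s\n' "$matches"
# prints:
# 2342-3456-9870-9999 Alice
# 1232-7594-8888-0984 Carol

rm -rf "$tmpdir"
```

Redirecting the output (for example with `> good_records.txt`, another made-up name) would save the filtered records to a new file.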

Regular expressions can be powerful, yet complicated to use. You should read more about them. Either search on the net or get hold of a good book on the subject.

You can also incorporate regular expression processing in any C program. In HP-UX do 'man regcomp' to get more info.

I would imagine that there are similar functions in other Unix variants and Windows. But the functions may not have the same name or calling convention. The regular expressions may also be evaluated differently (especially so for the Windows versions of expression parsing).

zazzybob · March 29, 2004, 9:35am

It might be more concise to use a repeat count:

egrep "[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}" the_file

I have tested this with egrep under HP-UX 10.20
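For anyone who wants to check that the {4} form selects the same lines as the long-hand pattern, here is a small sketch. The sample contents written to the_file are invented for the test, and grep -E is used as the POSIX equivalent of egrep.

```shell
# Hypothetical sample data: a header, two valid numbers, an empty line,
# and one malformed number with a three-digit group.
tmpdir=$(mktemp -d)
printf '%s\n' 'EMP_NO' '2342-3456-9870-9999' '' \
    '1232-7594-8888-0984' '1234-567-8901-2345' > "$tmpdir/the_file"

long='[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
short='[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}'

long_out=$(grep -E "$long" "$tmpdir/the_file")
short_out=$(grep -E "$short" "$tmpdir/the_file")

# Both patterns should select exactly the same lines.
if [ "$long_out" = "$short_out" ]; then
    echo "same matches"
fi

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -E -c "$short" "$tmpdir/the_file")
echo "$count matching records"

rm -rf "$tmpdir"
```

On a million-record file the -c count is a cheap sanity check before writing the matches out.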

Peace, ZB

Optimus_P · March 29, 2004, 9:38am

If using regular expressions, try using quantifiers to cut down on the amount of repetitive typing.

In Perl I would use something like

/\d+\-\d+\-\d+\-\d+/

This would match '2342-3456-9870-9999' and anything else that has four groups of one or more digits separated by hyphens.
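To see how loose the + quantifier is compared with an exact count, here is a shell sketch rather than Perl. It assumes [0-9]+ under grep -E behaves like Perl's \d+ for plain ASCII digits; the sample lines and the file name sample.txt are invented.

```shell
# Hypothetical sample lines: one well-formed number, two oddly grouped ones,
# and one line with no digits at all.
tmpdir=$(mktemp -d)
printf '%s\n' '2342-3456-9870-9999' '1-2-3-4' '12-345-6-78901' \
    'no digits here' > "$tmpdir/sample.txt"

# Loose: each group is one or more digits (stand-in for \d+).
loose=$(grep -E -c '[0-9]+-[0-9]+-[0-9]+-[0-9]+' "$tmpdir/sample.txt")

# Strict: each group is exactly four digits.
strict=$(grep -E -c '[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}' "$tmpdir/sample.txt")

echo "loose: $loose, strict: $strict"
# prints: loose: 3, strict: 1

rm -rf "$tmpdir"
```

If the data can contain other hyphenated numbers, the exact-count form is the safer choice.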