Read text between regexps and write into files based on a field in the text

Hi,

I have a huge file that has data something like shown below:

huge_file.txt

start regexp
Name=Name1
Title=Analyst
Address=Address1
Department=Finance
end regexp
some text 
some text
start regexp
Name=Name2
Title=Controller
Address=Address2
Department=Finance
end regexp
some text 
some text
start regexp
Name=Name3
Title=Associate
Address=Address3
Department=Marketing
end regexp
some text
some text

I can extract the records between the start and end regular expressions using either awk or sed. It would really help me if I could read these records and write them into multiple files named after the Department, so they can be reviewed quickly.

The output I am expecting is:

Finance.txt

start regexp
Name=Name1
Title=Analyst
Address=Address1
Department=Finance
end regexp

start regexp
Name=Name2
Title=Controller
Address=Address2
Department=Finance
end regexp

Marketing.txt

start regexp
Name=Name3
Title=Associate
Address=Address3
Department=Marketing
end regexp

I thought using a single command line as follows would work, but clearly I am missing something.

cat huge_file.txt | sed -n '/start regexp/,/end regexp/p' | tee record_buffer | grep Department | awk -F\= '{print $2}' | xargs cat record_buffer > {}

I think the problem is that the output from sed is not a single record but rather all of the records between the regexps. Any suggestions?

Here is a solution in awk:

awk 'F{F=F "\n" $0}                                          # while a record is open, append the current line to buffer F
/start regexp/ {F=$0}                                        # start a new record buffer
/end regexp/ {
  $0=F "\n"                                                  # the complete record plus a trailing blank line
  if(gsub(/.*Department=/, "", F)) gsub(/\n.*/, ".txt", F)   # reduce F to "<Department>.txt"
  else F="Unknown.txt"                                       # no Department= line found in this record
  print > F                                                  # write the record to its department file
  F=x                                                        # x is never set, so this clears the buffer
}' huge_file.txt
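
With the sample data above, this should leave Finance.txt holding two records and Marketing.txt holding one. A quick sanity check (assuming the output files are created in the current directory):

grep -c '^start regexp$' Finance.txt Marketing.txt

should report a count of 2 for Finance.txt and 1 for Marketing.txt.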

Hi,

Could you please try the following code:

awk '/start regexp/,/end regexp/' file_name | awk '/start regexp/ {print$NR}1'  | sed 's/^start$//g'

The output will be as follows.

start regexp
Name=Name1
Title=Analyst
Address=Address1
Department=Finance
end regexp
 
start regexp
Name=Name2
Title=Controller
Address=Address2
Department=Finance
end regexp
 
start regexp
Name=Name3
Title=Associate
Address=Address3
Department=Marketing
end regexp

Hope this will help.

Thanks,
R. Singh

@RavinderSingh13 The user's requirement is different; please read the thread before posting an answer.

The awk script provided by Chubler_XL should work well as long as huge_file.txt contains a small number of different departments; the number of files that can be open at once varies between awk implementations. If you have more than nine different departments, you might want to consider a slightly more complex awk script, such as:

awk -F'=' '
$0 == "start regexp" && !inre {         # start of a record (unless already inside one)
        inre = 1
        out = $0
        next
}
$0 == "end regexp" && inre {            # end of the current record
        out = out "\n" $0 "\n"
        if(dept == "") dept = "Unknown"
        if(!(dept in depts)) {          # first record seen for this department:
                depts[dept] = dept ".txt"
                print "" > depts[dept]  # create/truncate the file (writes one blank line)
        }
        print out >> depts[dept]        # append the record, then close the file so that
        close(depts[dept])              # only one output file is ever open at a time
        out = dept = ""
        inre = 0
        next
}
inre {  out = out "\n" $0               # accumulate lines inside the record
        if($1 == "Department") dept = $2
}' huge_file.txt

As always, if you want to use awk on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of just awk.
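
If you prefer to keep the program in a file, the awk program above (the text between the single quotes) can be saved as, say, split_by_dept.awk (the file name is just an example) and run with:

nawk -F'=' -f split_by_dept.awk huge_file.txt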


@Chubler_XL, @Don Cragun, thank you very much for your help. Both of these scripts worked on the sample I posted. When I tried them on the actual text file I have (about 600K lines, around 300-400 lines between the start and end regexps), the scripts take a lot of time. Do you have any suggestions for reducing the processing time?

Limits are most definitely implementation-specific; GNU awk, for example, seems to have no limit on open files apart from those enforced by the OS.

GNU awk 3.1.7 on RHEL 6.4 allowed 24768 open files, and with GNU awk 4.1.0 on Cygwin 6.1 I could get 1834.
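
Whatever awk you end up using, the ceiling the OS itself enforces is easy to check from the shell; this prints the per-process limit on open file descriptors for the current session:

ulimit -n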


I'm surprised this solution is running slowly for you; awk is fairly efficient, and even if it were re-coded in C I wouldn't expect much improvement.

Are you writing the output department files to the same disk that huge_file.txt is stored on? Contention between reads and writes may be slowing it down. If you have more than one drive in the system, try writing the output elsewhere, e.g. make the current directory /disk1/tmp and read the input file from /disk2/data.
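
Something along these lines (the directory paths and the script file name are only placeholders) keeps the reads and writes on separate drives, and time gives you a before/after comparison:

cd /disk1/tmp                                    # the department files get created here
time awk -F'=' -f split_by_dept.awk /disk2/data/huge_file.txt

The -F'=' matters for Don Cragun's version and is harmless for mine.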

What OS are you using? (I.e., what is the output from uname -a?)

How many different departments are in huge_file.txt?

Is there any chance that start_regexp and end_regexp occur in unmatched pairs? (My code will copy a start_regexp line found between a start_regexp and the next end_regexp without restarting the copy, and will ignore an end_regexp if there was no start_regexp since the last end_regexp seen.)

Will there ever be a sequence of lines between the start and end lines that does not contain a Department=value line?

Answers to the above questions could be used to improve speed, at the cost of an increased chance of things going wrong if the input data is malformed for some reason.
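
If you are not sure about the department count, something like this should answer it quickly (a sketch that assumes every record uses the Department=value format shown in the first post):

awk -F'=' '$1 == "Department" && !seen[$2]++ {n++} END {print n+0}' huge_file.txt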

Expanding on what Chubler_XL said: putting your input and output files on drives attached to different disk controllers might improve performance. But if your input file is on one filesystem and your output files are on a different filesystem on the same drive, that will be worse than having the input and output files on the same filesystem.
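
To double-check which filesystem (and therefore which drive) each path actually lives on, df accepts a file operand and reports the filesystem containing it (the paths below are only placeholders):

df /disk2/data/huge_file.txt     # filesystem holding the input
df .                             # filesystem where the output files will be created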