Extraction of various lines from a huge file

Dear Members,
I have a huge file generated by the command 'whois' for hundreds of IPs. Each section in the file starts with [Querying whois

From each section, I want to extract the lines that start with any of these words: [Querying whois, OrgName, NetRange, inetnum, descr, owner, Country.

Input:

[Querying whois.XJHIOUIIOOPIOP]

OrgName: University of C
OrgID: U1
Address: OIT
Address: NH
City: BC
StateProv: XY
PostalCode: 000000
Country: MN

NetRange: XXX.YYY.M.N - XXX.YYY.M.Q
CIDR: LMANERIE
NetName: UC

[Querying whois.ABCE.TSD]

% Rights restricted by copyright.
% See

% Note: This output has been filtered.
% To receive output for a database update, use the "-B" flag

inetnum: XXX.YYY.M.N - XXX.YYY.M.Q
netname: NET-C
descr: HB
descr: The University
country: PQ
admin-c: TYE
tech-c: SDF
status: FGRG
mnt-by: FSDGFG
source: FGDFSG

role: OPRROKROTR
address: The University
address: DJFIEJRE
address: DIJAIRJEJ
address: EIREROERE

Required output:

[Querying whois.BUHIOUJIOU]
OrgName: HHHHHHHHHH (may or may not be present)
NetRange: TTTTTTTTT (may or may not be present)
inetnum: FTYFYYYUII (may or may not be present)
descr: HIJKJKLLKL (preferably only the first occurrence)
owner: JHKJOJOIPI (may or may not be present)
Country: OIOPOPOP (first occurrence)

Thanking you
With regards

Different registrars use different output formats. So unless you are querying a very restricted set of domains, for example domains all registered by one person, or for other reasons all registered with the same registrar or only a small set of registrars, this may turn out to be more complex than you thought.

Perhaps it would be useful as a first step to separate the entries to different files depending on the [Querying ... line? Try the csplit command for that. Then you can create a parser for each of the formats you find in there.
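As a rough sketch of that first step, something like the following could split the dump into one file per section. It assumes GNU csplit (the -z option and the "{*}" repeat count are GNU extensions); the function name split_whois and the "section." file prefix are made up for illustration.

```shell
# Split a combined whois dump into one file per "[Querying ..." section.
# Assumes GNU csplit: -z drops empty pieces, "{*}" repeats the pattern
# as many times as it matches. split_whois is a hypothetical helper name.
split_whois() {
  input=$1    # path to the combined whois output
  outdir=$2   # directory that receives the pieces
  mkdir -p "$outdir" &&
  csplit -s -z -f "$outdir/section." "$input" '/^\[Querying/' '{*}'
}
```

Calling split_whois file pieces should leave one file per section in the pieces directory, each starting with its own [Querying line, ready to be grouped by whois server.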

How do you know when to stop? Often a record will include hierarchical information (especially for the ARIN information, which is what your ABCE.TSD example looks like) in which the later lines are more specific than the earlier ones. Then you often want the later lines, not the earlier ones. (But this depends on what you need this for, of course.)

Anyway, here's an attempt at implementing your current spec. This simply prints the first occurrence of each wanted field after the Querying line:

perl -ne 'if (/^\[Querying/) {
  print; @wanted = qw(OrgName NetRange inetnum descr owner Country);
  $wanted = &wanted(@wanted);
}
sub wanted {
  return "^(" . join ("|", map { quotemeta $_ } @_) . "):";
}
if ($wanted && $_  =~ m/$wanted/i) {
  print;
  my $seen = lc $1;  # the match ignores case, so compare in lower case
  @wanted = grep { lc $_ ne $seen } @wanted;
  $wanted = @wanted ? &wanted(@wanted) : "";
}' file

This came out a little more monstrous than I'd like it to be, but maybe you can use it as a starting point.

(In retrospect, maybe it would have been better to use a hash to keep track of which values are already captured, and not capture if the hash says we already have the one we are looking at. Push the captured ones to an array if preserving order is important.)

Hi,
Thank you very much for the help. The script covers about 70% of what I need. I will try to work out the remaining 30% myself.

Thanking you
With regards
Satya

Dear Era,
I would like the script to take the input file, as well as the output file, as a variable. I have two text files: (1) a list of folders in which the script should work, and (2) a list of input files on which the script should work.
Because of my lack of Perl knowledge, my attempts were unsuccessful. In a shell script I use:

for i in `(cat countries.txt)`
do
  for j in `(cat year.txt)`
  do
    for k in `(cat countries/$i/$j)`
    do

In the same way, I want the Perl script to take the input file as a variable.

Thanks

As a matter of shell coding style, the parentheses are completely unnecessary, and stuff in backticks works badly if there's a file name with spaces in it.
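For illustration, the same nested loop can be written with while read and quoted variables, which copes with spaces in names. This is only a sketch: the function name run_all and its argument order are made up, and it assumes the layout dir/country/year from the loop above.

```shell
# Run a command on dir/<country>/<year> for every pair in the two lists.
# run_all is a hypothetical helper; everything after the third argument
# is the command to run, and the file name is appended to it.
run_all() {
  dir=$1 countries=$2 years=$3
  shift 3
  while IFS= read -r country; do
    while IFS= read -r year; do
      file=$dir/$country/$year
      # Skip combinations that do not exist; keep the command off the
      # loop's stdin so it cannot swallow the rest of the list.
      if [ -f "$file" ]; then
        "$@" "$file" < /dev/null
      fi
    done < "$years"
  done < "$countries"
}
```

For example, run_all countries countries.txt year.txt wc -l would count the lines of each file; any command that accepts a file name as its last argument can be used in place of wc -l.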

I don't see why you couldn't use that shell script to wrap the Perl code; there's nothing much there which Perl does better than the shell, other than not having to read the country file over and over again (but you could optimize that in the shell script, too). But anyway, here goes. I'm afraid this is completely untested.

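#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 dir yearfile countryfile\n" unless (@ARGV == 3);
my ($dir, $yearfile, $countryfile) = @ARGV;

# Read both lists once, stripping the trailing newlines.
open (my $y, "<", $yearfile) || die "$0: Could not open $yearfile: $!\n";
chomp (my @years = <$y>);
close $y;
open (my $c, "<", $countryfile) || die "$0: Could not open $countryfile: $!\n";
chomp (my @countries = <$c>);
close $c;

# Same layout as in your shell loop: dir/country/year
for my $country (@countries) {
  for my $year (@years) {
    handle ("$dir/$country/$year");
  }
}

sub handle {
  my ($file) = @_;
  my @wanted;
  my $wanted = "";
  open (my $f, "<", $file) || die "$0: Could not open $file: $!\n";
  while (<$f>) {
    if (/^\[Querying/) {
      print;
      @wanted = qw(OrgName NetRange inetnum descr owner Country);
      $wanted = wanted(@wanted);
    }
    if ($wanted && $_ =~ m/$wanted/i) {
      print;
      my $seen = lc $1;  # the match ignores case, so compare in lower case
      @wanted = grep { lc $_ ne $seen } @wanted;
      $wanted = @wanted ? wanted(@wanted) : "";
    }
  }
  close $f;  # close only after the whole file has been read
}

# Build an anchored regex that matches any of the remaining field names.
sub wanted {
  return "^(" . join ("|", map { quotemeta $_ } @_) . "):";
}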

Thank you very much for the code

Regards