Shell script to extract data in a file

gillesi · February 15, 2017, 12:17pm

I have a 5GB file, and I want to extract particular patterns from it.

This is my script:
//

count=`grep -wc "MSISDN" file_name`
k=1
>OUTPUT
>OUTPUT_Final
while [ $k -le $count ]
do
cat file_name | awk -F":" -v var="$k" '$1=="MSISDN" {m++}m==var{print; exit}' >> OUTPUT
cat file_name |awk -F":" -v var="$k" '$1=="IMSI" {m++}m==var{print; exit}' >> OUTPUT
cat file_name |awk -F":" -v var="$k" '$1=="NAM"  {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="TS11" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="TS21" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="TS22" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="TS62" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="BAIC" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="BAOC" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="APNID1" {m++}m==var{print; exit}' >> OUTPUT
cat file_name | awk -F":" -v var="$k" '$1=="APNID2" {m++}m==var{print; exit}' >> OUTPUT
echo " " >> OUTPUT
k=`expr $k + 1`
done
paste -d"," - - - - - - - - - - - - <OUTPUT > OUTPUT_Final

//

So from my file, called file_name, I want to extract only the values of MSISDN, NAM, OBO, TS, etc., append the results to the OUTPUT file, and then use the paste command to put each record's values on one line.
The script works fine on a smaller file, but with a 5GB file it has been running for 2 days.

Please, I need help!

Corona688 · February 15, 2017, 12:45pm

Please use code tags for code.

I'm not surprised it takes days to run: you are running shell externals 22 times per loop iteration, and processing the entire file each time when you probably only meant to process one record. So you are processing the file 22*n times more than you needed to, with n being the number of records in the file.

How about:

awk '{ A[$1]++ }
END {
        for(X in A) { printf("%s%s", P, X); P="\t" }
        printf("\n");
        P="" ;
        for(X in A) { printf("%s%s", P, A[X]); P="\t" }
        printf("\n"); }' < inputfile > outputfile

If that doesn't work, please show the input you have and the output you want.

RudiC · February 15, 2017, 12:58pm

Without reasonable, representative input and output samples it's difficult to see WHAT you really want done. To me it seems you have count records in the file, and want to print the records' corresponding lines, blockwise. (Which is not what you specify verbally: "... extract only the values MSISDN,NAM,OBO,TS,etc...", except $1 is the only field in a line)

Are a single record's lines contiguous? Or do records overlap?
Are the input records in the order that you want them printed?
Are there more fields in a line or just $1? Should those be printed?
Are there more lines that you want suppressed?

gillesi · February 15, 2017, 1:57pm

Let's say this is the input: 2 blocks of lines (there are actually 4 million).
A dn: serv=... line marks the beginning of a block, and a blank line marks the end.
In each block, I want to extract only MSISDN, IMSI, NAM, TS11, TS21, TS22, BAIC, APNID, OBO and OBI; the extracted values for each block should then be on one line, separated by semicolons or colons.

dn:serv=CSPS,mscId=00015640ccf345a7914718d3d3eff6f6,ou=multiSCs,dc=mtncg
structuralObjectClass: CP1
objectClass: CP1
objectClass: CUDBServiceAuxiliary
objectClass: CP2
objectClass: CP3
objectClass: CP4
objectClass: CP5
objectClass: CP6
objectClass: CP7
objectClass: CP8
objectClass: CP9
objectClass: CPA
objectClass: CPB
objectClass: CPC
objectClass: CPD
objectClass: CPE
objectClass: CPF
objectClass: CPG
objectClass: CPH
objectClass: CPI
objectClass: CPJ
objectClass: CPK
objectClass: CPL
objectClass: CPM
objectClass: CPM1
objectClass: CPM2
objectClass: CPM3
objectClass: CPM4
objectClass: CPZ
objectClass: CP04s
objectClass: CP0A
objectClass: CP11
entryDS: 1
nodeId: 1
createTimestamp: 20170104222519Z
modifyTimestamp: 20170113111014Z
MSISDN: 242064493944
IMSI: 629100113334650
NAM: 1
CDC: 6
CSP: 6
SUBSCSPVERS: 3
PDPCP: 12
SUBSPDPCPVERS: 1
RSA: 3
SUBSRSAVERS: 10
APNID1: 2
APNVERS1: 1
APNID2: 1
APNVERS2: 1
APNID3: 0
APNVERS3: 1
EQOSIDV1:: AAAC
EQOSIDV2:: AAAC
EQOSIDV3:: AAAC
serv: CSPS
CSLOC: 2
RVLRI: 0
RSGSNI: 0
GSMUEFEAT: 0
OBO: 1
OBI: 1
MCA: 1
CAT: 10
DBSG: 1
OFA: 0
SOCB: 1
PWD: 0000
PWDC: 0
SOCFB: 0
SOCFNRC: 0
SOCFNRY: 0
SOCFU: 0
SODCF: 0
SOSDCF: 7
SOCLIP: 0
SOCLIR: 0
SOCOLP: 0
BS26: 1
BS3G: 1
TS11: 1
TS21: 1
TS22: 1
TS62: 1
CAW: 1
HOLD: 1
MPTY: 1
OICK: 10
BAIC: 1
BAOC: 1
BICRO: 1
BOIC: 1
BOIEXH: 1
CFB: 1
CFNRC: 1
CFNRY: 1
CFU: 1
CLIP: 1
CAWTS10ST: 8
CFBTS10ST: 8
CFUTS10ST: 8
CFNRCTS10ST: 8
CFNRYTS10ST: 8
BAICTS10ST: 8
BAOCTS10ST: 8
BICROTS10ST: 8
BOICTS10ST: 8
BOIEXHTS10ST: 8
BAICTS20ST: 8
BAOCTS20ST: 8
BICROTS20ST: 8
BOICTS20ST: 8
BOIEXHTS20ST: 8
CAWTS60ST: 8
CFBTS60ST: 8
CFUTS60ST: 8
CFNRCTS60ST: 8
CFNRYTS60ST: 8
BAICTS60ST: 8
BAOCTS60ST: 8
BICROTS60ST: 8
BOICTS60ST: 8
BOIEXHTS60ST: 8
CAWBS30ST: 8
CFBBS30ST: 8
CFUBS30ST: 8
CFNRCBS30ST: 8
CFNRYBS30ST: 8
BAICBS30ST: 8
BAOCBS30ST: 8
BICROBS30ST: 8
BOICBS30ST: 8
BOIEXHBS30ST: 8
CAWBS20ST: 8
CFBBS20ST: 8
CFUBS20ST: 8
CFNRCBS20ST: 8
CFNRYBS20ST: 8
BAICBS20ST: 8
BAOCBS20ST: 8
BICROBS20ST: 8
BOICBS20ST: 8
BOIEXHBS20ST: 8
EQOSID1: 0
PDPTYPE1: 0
VPAA1: 0
EQOSID2: 0
PDPTYPE2: 0
VPAA2: 0
EQOSID3: 0
PDPTYPE3: 0
VPAA3: 0

dn: serv=CSPS,mscId=0001b7b4ad4d44bbb73484e858270eb3,ou=multiSCs,dc=mtncg
structuralObjectClass: CP1
objectClass: CP1
objectClass: CUDBServiceAuxiliary
objectClass: CP2
objectClass: CP3
objectClass: CP4
objectClass: CP5
objectClass: CP6
objectClass: CP7
objectClass: CP8
objectClass: CP9
objectClass: CPA
objectClass: CPB
objectClass: CPC
objectClass: CPD
objectClass: CPE
objectClass: CPF
objectClass: CPG
objectClass: CPH
objectClass: CPI
objectClass: CPJ
objectClass: CPK
objectClass: CPL
objectClass: CPM
objectClass: CPM1
objectClass: CPM2
objectClass: CPM3
objectClass: CPM4
objectClass: CPZ
objectClass: CP04s
objectClass: CP0A
objectClass: CP11
entryDS: 1
nodeId: 1
createTimestamp: 20170119174955Z
modifyTimestamp: 20170119174956Z
MSISDN: 242068626345
IMSI: 629100114187228
NAM: 0
CDC: 3
CSP: 6
SUBSCSPVERS: 3
PDPCP: 12
SUBSPDPCPVERS: 1
RSA: 3
SUBSRSAVERS: 10
APNID1: 2
APNVERS1: 1
APNID2: 1
APNVERS2: 1
APNID3: 0
APNVERS3: 1
EQOSIDV1:: AAAC
EQOSIDV2:: AAAC
EQOSIDV3:: AAAC
serv: CSPS
CSLOC: 2
PSLOC: 2
RVLRI: 0
RSGSNI: 0
GSMUEFEAT: 0
MCA: 1
CAT: 10
DBSG: 1
OFA: 0
SOCB: 1
PWD: 0000
PWDC: 0
SOCFB: 0
SOCFNRC: 0
SOCFNRY: 0
SOCFU: 0
SODCF: 0
SOSDCF: 7
SOCLIP: 0
SOCLIR: 0
SOCOLP: 0
BS26: 1
BS3G: 1
TS11: 1
TS21: 1
TS22: 1
TS62: 1
CAW: 1
HOLD: 1
MPTY: 1
OICK: 10
BAIC: 1
BAOC: 1
BICRO: 1
BOIC: 1
BOIEXH: 1
CFB: 1
CFNRC: 1
CFNRY: 1
CFU: 1
CLIP: 1
CAWTS10ST: 8
CFBTS10ST: 8
CFUTS10ST: 8
CFNRCTS10ST: 8
CFNRYTS10ST: 8
BAICTS10ST: 8
BAOCTS10ST: 8
BICROTS10ST: 8
BOICTS10ST: 8
BOIEXHTS10ST: 8
BAICTS20ST: 8
BAOCTS20ST: 8
BICROTS20ST: 8
BOICTS20ST: 8
BOIEXHTS20ST: 8
CAWTS60ST: 8
CFBTS60ST: 8
CFUTS60ST: 8
CFNRCTS60ST: 8
CFNRYTS60ST: 8
BAICTS60ST: 8
BAOCTS60ST: 8
BICROTS60ST: 8
BOICTS60ST: 8
BOIEXHTS60ST: 8
CAWBS30ST: 8
CFBBS30ST: 8
CFUBS30ST: 8
CFNRCBS30ST: 8
CFNRYBS30ST: 8
BAICBS30ST: 8
BAOCBS30ST: 8
BICROBS30ST: 8
BOICBS30ST: 8
BOIEXHBS30ST: 8
CAWBS20ST: 8
CFBBS20ST: 8
CFUBS20ST: 8
CFNRCBS20ST: 8
CFNRYBS20ST: 8
BAICBS20ST: 8
BAOCBS20ST: 8
BICROBS20ST: 8
BOICBS20ST: 8
BOIEXHBS20ST: 8
EQOSID1: 0
PDPTYPE1: 0
VPAA1: 0
EQOSID2: 0
PDPTYPE2: 0
VPAA2: 0
EQOSID3: 0
PDPTYPE3: 0
VPAA3: 0
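
The layout described above (a block starts at a dn: line and ends at a blank line) fits what awk calls paragraph mode: with RS set to the empty string, each blank-line-separated block becomes one record, and with FS set to a newline each line of the block becomes one field. What follows is only a minimal sketch under that assumption; the function name extract_blocks and the shortened tag list are made up for illustration and are not from any of the replies.

```shell
# Hypothetical helper: print one comma-separated line per block.
# Assumes blocks are separated by blank lines and lines look like "TAG: value".
extract_blocks() {
    awk -v tags="MSISDN,IMSI,NAM,OBO,OBI" '
    BEGIN { RS = ""; FS = "\n"; n = split(tags, want, ",") }
    {
        split("", val)                    # clear values from the previous block
        for (i = 1; i <= NF; i++) {
            p = index($i, ":")            # position of the first colon
            if (p == 0) continue
            v = substr($i, p + 1)
            sub(/^[ \t]+/, "", v)         # drop blanks after the colon
            val[substr($i, 1, p - 1)] = v
        }
        line = ""
        for (i = 1; i <= n; i++)
            line = line (i > 1 ? "," : "") val[want[i]]
        print line
    }' "$@"
}

# Tiny two-block sample, shaped like the input above
printf 'dn: serv=A\nMSISDN: 111\nIMSI: 222\nNAM: 1\nOBO: 1\nOBI: 1\n\ndn: serv=B\nMSISDN: 333\nIMSI: 444\nNAM: 0\n' |
    extract_blocks
```

The file is read once, and a tag that is missing from a block (OBO and OBI in the second sample block) simply leaves an empty column. On the real data it would be called along the lines of extract_blocks file_name > OUTPUT_Final.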

vgersh99 · February 15, 2017, 2:52pm

something along these lines...
awk -f gil.awk myInputFile where gil.awk is:

BEGIN {
  FS=": *"
  OFS=","

  tags="MSISDN,IMSI,NAM,TS11,TS21,TS22,TS62,BAIC,BAOC,APNID1,APNID2,OBO,OBI"
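  ntagsA=split(tags, tA, OFS)
  # map each tag name to its output column number
  for(i=1; i<=ntagsA;i++)
    tagsA[tA[i]]=i

  split("", outA)
}

# print one record: the collected values in tag order, comma-separated
function outRec(a,   i)
{
  for(i=1; i<=ntagsA;i++)
    printf("%s%s", a[i], (i==ntagsA)?ORS:OFS)
}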
FNR==1 { print tags}
$1=="dn" {
   if (1 in outA) outRec(outA)
   split("", outA)
}
$1 in tagsA {
  outA[tagsA[$1]]=$2
}
END {
  if (1 in outA) outRec(outA)
}

RudiC · February 15, 2017, 2:55pm

Similar problems have been solved in these forums umpteen times. Would this adaption of one of those come close to what you need?

awk -F: '
BEGIN                   {HD="MSISDN,IMSI,NAM,TS11,TS21,TS22,TS62,BAIC,BAOC,APNID1,APNID2"
#                        print HD
                         HDCnt  = split(HD, HDArr, ",")
                         NXTREC = "dn"
                         HDCM   = ","HD","
                        }

#                       {gsub (/[\t ]*|\*/, "", $1)}

$1 == NXTREC && PR      {for (i=1; i<=HDCnt; i++) printf "%s,", RES[HDArr[i]]
                         printf RS
                         delete RES
                        }

$1 == NXTREC            {PR=1}

HDCM ~ "," $1 ","       {RES[$1]=$0
                        }

END                     {for (i=1; i<=HDCnt; i++) printf "%s,", RES[HDArr[i]]
                         printf RS
                        }
' FS=":" OFS="," file
MSISDN: 242064493944,IMSI: 629100113334650,NAM: 1,TS11: 1,TS21: 1,TS22: 1,TS62: 1,BAIC: 1,BAOC: 1,APNID1: 2,APNID2: 1,
MSISDN: 242068626345,IMSI: 629100114187228,NAM: 0,TS11: 1,TS21: 1,TS22: 1,TS62: 1,BAIC: 1,BAOC: 1,APNID1: 2,APNID2: 1,

gillesi · February 15, 2017, 3:16pm

Thanks man, but I still have some questions. Where does the program actually start? And is gil.awk a file that I should save with the .awk extension?

---------- Post updated at 03:16 PM ---------- Previous update was at 03:11 PM ----------

Yes, I just hope it will give me the results fast. Thanks man, I'll get back to you once I've run it.

vgersh99 · February 15, 2017, 3:18pm

gil.awk is a file you save the quoted code in

Corona688 · February 15, 2017, 3:19pm

$ cat mline.awk

BEGIN {
        OFS=","   # comma-separated output, matching the result shown below
        L=split("MSISDN IMSI NAM TS11 TS21 TS22 BAIC APNID1 APNID2 OBO OBI", C);
        for(X in C) COL[ C[X] ":"]= 1
}

function data() {
        S=P=""
        for(N=1; N<=L; N++) { S=S P D[C[N]":"]; delete D[C[N]":"]; P=OFS; }
        print S;
}

NR==1 {
        S=P=""
        for(N=1; N<=L; N++) { S=S P C[N]; P=OFS;        }
        print S;
}

$1 && $1 in COL { D[$1]=$2 ; next }

/^[ \t]*$/ { data() }
END { data() }


$ awk -f mline.awk mline.dat

MSISDN,IMSI,NAM,TS11,TS21,TS22,BAIC,APNID1,APNID2,OBO,OBI
242064493944,629100113334650,1,1,1,1,1,2,1,1,1
242068626345,629100114187228,0,1,1,1,1,2,1,,

$

gillesi · February 16, 2017, 3:09am

@RudiC, which are the input and output files in your code, please?

@vgersh99, I get a syntax error near line 13.

@Corona688, I get a syntax error near line 1.

RudiC · February 16, 2017, 3:48am

Replace "file" with your input file name; output is to stdout.

And,

gillesi · February 16, 2017, 4:31am

@RudiC, my file has about 4 million lines. The output goes to stdout; does that mean it is printed in the shell?

RudiC · February 16, 2017, 5:17am

stdout in an interactive session normally is the terminal, but you can redirect it into a file, or even do both (file / terminal) using the tee command. I'd recommend you try the script first with a (smaller) subset of your input data.
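
As a rough illustration of the options mentioned here (terminal, file, or both via tee) and of cutting a small test subset first, the sketch below uses made-up file names (big_input, sample_input, result.csv, result_copy.csv) and a trivial stand-in awk program. None of these come from the thread; they are only there so the commands can be tried safely in a scratch directory.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Stand-in for the real input: a few "TAG: value" lines
printf 'MSISDN: 111\nIMSI: 222\nMSISDN: 333\nIMSI: 444\n' > big_input

# 1. Take a small subset first (here: the first 2 lines) and test on that
head -n 2 big_input > sample_input

# 2. By default stdout goes to the terminal
awk -F': *' '$1 == "MSISDN" { print $2 }' sample_input

# 3. Redirect stdout into a file (">" overwrites, ">>" appends)
awk -F': *' '$1 == "MSISDN" { print $2 }' big_input > result.csv

# 4. Or do both at once with tee: write the file and still see the output
awk -F': *' '$1 == "MSISDN" { print $2 }' big_input | tee result_copy.csv
```

For the real file a larger subset makes more sense, for example the first few thousand lines, so that it contains several complete blocks.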

gillesi · February 16, 2017, 6:15am

@RudiC, please, I need the full output in a file so that I can then process it for my needs. Can you give me a hint on how to do that with your code?

RudiC · February 16, 2017, 6:17am

Use a redirection like awk '...' file > outputfile

gillesi · February 16, 2017, 6:54am

@RudiC,

Below is my screen from when I tried to run it. I created a file called output.awk in vi, put your code in it, gave the file full permissions (chmod 777), then executed it with sh output.awk and got an error: awk not found.

"output.awk" [New file] 24 lines, 815 characters
[gil@cx01:~]$
>chmod 777 output.awk
[gil@cx01:~]$
>ls -la
total 9570034
drwxr-xr-x   2 bscscx   bscs           4 Feb 16 12:25 .
drwxr-xr-x   8 bscscx   bscs          19 Feb 16 08:12 ..
-rw-r--r--   1 bscscx   bscs     4895889520 Feb 16 08:37 input
-rwxrwxrwx   1 bscscx   bscs         815 Feb 16 12:25 output.awk
[gil@cx01:~]$
>sh output
output.awk: wk: not found

RudiC · February 16, 2017, 7:25am

As I can't see that file's contents, I can't comment. There's no wk command in my script. Trying to replicate what you describe (a file with the script in post#6 as is), I find it works perfectly.
Did you consider the hint in post#11?

wisecracker · February 16, 2017, 7:26am

Did you NOT read RudiC's post #11?

Why not try to find it using the which command?

And why are you using '777' for your 'output.awk' file?
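
One possible way to act on that suggestion is sketched below. It uses the POSIX command -v instead of which, and the list of names it tries (awk, nawk, gawk, mawk) is just a set of common awk variants, not something taken from this thread; find_awks is a made-up function name.

```shell
# Report which awk variants are on the PATH
# (the names tried here are common ones, chosen for illustration)
find_awks() {
    for name in awk nawk gawk mawk; do
        if path=$(command -v "$name" 2>/dev/null); then
            printf '%s -> %s\n' "$name" "$path"
        else
            printf '%s -> not found\n' "$name"
        fi
    done
}

find_awks
```

If awk itself shows up as not found, the problem is with the PATH or the installation rather than with the script.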

vgersh99 · February 16, 2017, 9:40am

As I already noted here:
Save the code from post 5 in a file called gil.awk and execute it as awk -f gil.awk input > output
I'm not sure how much simpler the explanation can be.
Where exactly are you stuck?
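
To make the difference concrete: a file that holds only awk program text is run with awk -f, while a file that holds a complete shell command line (awk '...' file) is run by the shell. The sketch below shows both routes side by side; the file names prog.awk, wrapper.sh and data.txt and the one-line program are invented for the example.

```shell
# Work in a scratch directory with a tiny sample input
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'MSISDN: 111\nNAM: 1\n' > data.txt

# Case 1: the file contains only awk code, so awk must read it with -f
cat > prog.awk <<'EOF'
BEGIN { FS = ": *" }
$1 == "MSISDN" { print $2 }
EOF
awk -f prog.awk data.txt > out1.txt

# Case 2: the file contains a complete shell command, so the shell runs it
cat > wrapper.sh <<'EOF'
awk -F': *' '$1 == "MSISDN" { print $2 }' data.txt
EOF
sh wrapper.sh > out2.txt

# Both routes produce the same result
cat out1.txt out2.txt
```

Running sh on a file like prog.awk would fail, because the shell would try to interpret the awk code as shell commands.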

gillesi · February 16, 2017, 9:47am

@RudiC, please, I'm very new to this. Below is your code, exactly as I wrote it in my file. I created a file called gil in vi, put your code in it, and executed it as a shell script with sh gil. My input file is DS2_export100217.ldif, and I redirect the result to an output file called output. Please tell me if I did something wrong.

awk -F: '
BEGIN                   {HD="MSISDN,IMSI,NAM,TS11,TS21,TS22,TS62,BAIC,BAOC,APNID1,APNID2"
#                        print HD
                         HDCnt  = split(HD, HDArr, ",")
                         NXTREC = "dn"
                         HDCM   = ","HD","
                        }

#                       {gsub (/[\t ]*|\*/, "", $1)}

$1 == NXTREC && PR      {for (i=1; i<=HDCnt; i++) printf "%s,", RES[HDArr[i]]
                         printf RS
                         delete RES
                        }

$1 == NXTREC            {PR=1}

HDCM ~ "," $1 ","       {RES[$1]=$0
                        }

END                     {for (i=1; i<=HDCnt; i++) printf "%s,", RES[HDArr[i]]
                         printf RS
                        }
' FS=":" OFS="," DS2_export100217.ldif >> output