Script for extraction of pattern

gillesi · February 19, 2017, 7:57am

Can anyone help here with a script to extract the highlighted details from these two blocks? Actually there are millions of blocks; this is just a sample.

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170119123400Z
modifyTimestamp: 20170119123400Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100165619,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170211115449Z
modifyTimestamp: 20170211115449Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

drysdalk · February 19, 2017, 8:10am

Hi,

Sure, no problem. Here's one way to do it:

$ grep ^dn sample | awk -F, '{print $3}'
mscId=aaaaaa001aaaaaaaa629100100138702
mscId=aaaaaa001aaaaaaaa629100100165619
$

So the idea is we're using 'grep' to look for lines that start with 'dn' (that's the meaning of the caret symbol in this context), and then using 'awk' to print the third field, with the field separator being specified as a comma via the -F flag.
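
As a side note, awk can do the line matching itself, so the grep is not strictly needed. A minimal sketch, assuming the data is in a file called 'sample' (the cut-down two-block file written here is only a stand-in for the real data):

```shell
# Build a small stand-in input file (blocks shortened from the first post).
cat > sample <<'EOF'
dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
objectClass: EpsStaticInf
EpsProfileId: 29

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100165619,ou=multiSCs,dc=mtncg
objectClass: EpsStaticInf
EpsProfileId: 29
EOF

# /^dn/ selects lines starting with "dn"; -F, splits them on commas,
# so $3 is the mscId=... field.
awk -F, '/^dn/ {print $3}' sample
```

This prints the same two mscId= lines as the grep and awk pipeline, using one process instead of two.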

Hope this helps.

gillesi · February 19, 2017, 8:17am

@Thank you very much, but I need the output in a file and I also need the line EpsProfileId: 29,
so below is what I want in a file:

mscId=aaaaaa001aaaaaaaa629100100138702, EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619, EpsProfileId: 29

Because if I have 2 or 3 million blocks, they should all be in the same file. Thanks

drysdalk · February 19, 2017, 8:20am

Hi,

(Have just edited the lines that print the output so you get everything in the format you want)

Sorry, my solution won't quite do - just noticed you need two lines from the blocks, not just the first one (my apologies).

Something like this should do the trick:

$ cat script.sh
#!/bin/bash
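cat sample.txt | while read -r line
do
        # Quote "$line" so the shell does not split or glob its contents.
        if echo "$line" | grep '^dn' >/dev/null 2>/dev/null
        then
                # Pass the field as data ("%s"), not as the printf format.
                echo "$line" | awk -F, '{printf "%s,", $3}'
        fi

        if echo "$line" | grep '^EpsProfileId:' >/dev/null 2>/dev/null
        then
                echo "$line"
        fi
done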
$ ./script.sh
mscId=aaaaaa001aaaaaaaa629100100138702,EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619,EpsProfileId: 29
$

gillesi · February 19, 2017, 8:45am

But I tried it and I don't have the output in a file. Can we find a way to put the output in a file, please?

RudiC · February 19, 2017, 8:47am

Imagine creating 52 million (2 mio blocks * 13 lines * 2 tests) processes to run a grep in each - might take some time. Try

awk 'match ($0, /mscId=[^,]*/) {printf "%s, ", substr ($0, RSTART, RLENGTH)}; /EpsProfileId/' file
mscId=aaaaaa001aaaaaaaa629100100138702, EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619, EpsProfileId: 29
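
For anyone puzzling over the one-liner, here is the same program spread over several lines with comments. It is only a sketch for illustration; the file name sample.txt and the shortened blocks are made up for the demonstration:

```shell
# Shortened stand-in for the real input file.
cat > sample.txt <<'EOF'
dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
EpsStaInfId: EpsStaInf
EpsProfileId: 29

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100165619,ou=multiSCs,dc=mtncg
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EOF

awk '
# match() sets RSTART and RLENGTH when the regex is found in the line.
match($0, /mscId=[^,]*/) {
    # Print just the matched text followed by ", " and no newline.
    printf "%s, ", substr($0, RSTART, RLENGTH)
}
# A pattern with no action prints the whole line, which also ends
# the output line started by the printf above.
/EpsProfileId/
' sample.txt
```

The whole file is read once by a single awk process, which is why this scales to millions of blocks where a grep per line does not.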

And, although it was raised in your other thread, the question about your own attempts is still valid. PLEASE answer it in your future requests!

I'd propose that you READ, UNDERSTAND, and HEED people's answers ...

drysdalk · February 19, 2017, 8:57am

Hi,

Putting the output in a file is fairly straightforward - in general, the standard output of any command can be re-directed to a file by means of the > re-director.

So for example:

$ ./script.sh
mscId=aaaaaa001aaaaaaaa629100100138702,EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619,EpsProfileId: 29
$ ./script.sh > output.txt
$ cat output.txt
mscId=aaaaaa001aaaaaaaa629100100138702,EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619,EpsProfileId: 29
$

As you can see, we got no output on the terminal from 'script.sh' when we ran it the second time with output re-direction. Instead, the output was re-directed to the file 'output.txt'. So doing something like this should be sufficient, and will work in general for most things that write to standard output on a UNIX-style system.
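
One thing worth knowing is the difference between > (which overwrites the file each time) and >> (which appends to it). A small sketch, using printf in place of 'script.sh' and a made-up file name:

```shell
# ">" creates the file, or truncates it if it already exists.
printf 'first run\n' > redirect_demo.txt
printf 'second run\n' > redirect_demo.txt
cat redirect_demo.txt      # only "second run" is left

# ">>" appends to whatever is already in the file.
printf 'third run\n' >> redirect_demo.txt
cat redirect_demo.txt      # "second run" followed by "third run"
```

So if you run the script several times and want one combined file, use >> for the later runs; if you want a fresh file each time, use >.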

gillesi · February 19, 2017, 9:35am

@RudiC PLEASE, I don't really get your point. When I said that I need the output in a file I was referring to post #4, which has been addressed in post #7.
As for my own attempts, I don't have any. If I had any attempt I wouldn't hesitate to put it here. Some of these scripts are needed ASAP and I'm a learner in scripting, so it's not easy for me. I hope you understand.

---------- Post updated at 09:35 AM ---------- Previous update was at 09:23 AM ----------

Please, this is another sample. Actually we have blocks with mscId =aaa8c4c6b30a4.....(emcrypted) in the same file, but I don't need those.
@RudiC your code gave me output, but it gave me the encrypted ones instead of the ones I want.

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaa8c4c6b30a4b1abcfce3990027d6a4,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170202223552Z
modifyTimestamp: 20170202223552Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaa9526a822d46979331a964be201cba,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170119122933Z
modifyTimestamp: 20170119122933Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170119123400Z
modifyTimestamp: 20170119123400Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100165619,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170211115449Z
modifyTimestamp: 20170211115449Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

And about my own attempt: please, I'm new to scripting and I'm learning, so for now I don't have any attempt. When I'm good I'll definitely post my attempts.

Don_Cragun · February 19, 2017, 12:45pm

Hi gillesi,
Note that the way to learn is to do. If you "don't have any attempt", you don't have any way to learn. If you expect this site to act as your unpaid programming staff instead of as a tool to help you learn how to improve your attempts to solve your own problems, I'm afraid we will all be disappointed.

gillesi · February 19, 2017, 2:44pm

Wow... OK. I'm really a beginner here. But saying that you act as "unpaid programming staff" is very frustrating to me. I feel like I'm begging for your help, or like you are asking me out of this site. For my 1st thread, below was my attempt; I said it was very slow, and I spent days waiting for the output.

count=`grep -wc "MSISDN" DS1_HLR_export170217.ldif` >> OUTPUT

for k in `seq $count`
do

cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="MSISDN" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="IMSI" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="NAM"  {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="TS11" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="TS21" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="TS22" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="TS62" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="BAIC" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="BAOC" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="APNID1" {m++}m==var{print; exit}' >> OUTPUT
cat DS1_HLR_export170217.ldif| awk -F":" -v var="$k" '$1=="APNID2" {m++}m==var{print; exit}' >> OUTPUT
echo " " >> OUTPUT
done

Don_Cragun · February 20, 2017, 1:02am

We are not asking you out of this site; we are asking you to pay attention to the suggestions that have been made, we are asking you to clearly describe the format of the input you are processing, we are asking you to clearly describe the output you are trying to produce, we are asking you to answer the questions that we ask, and we are asking you to show us the output you get when you try to run the code we suggest. We are asking you to actively participate in the production of the code needed to produce the output you want! (Note that saying: "It doesn't work." without showing us the EXACT diagnostic messages produced DOES NOT HELP US HELP YOU.)

Above you have shown us a script that you say produces the output you want, but runs too slowly. If your input records are in the same order as the output you want, that script could be replaced by a single grep command. But the multi-line output this code produces is nothing at all like the single output line per input record CSV format output that you said you wanted in your other thread. And, in this thread, you have said that you want only two output fields, but it isn't clear whether the first field is the entire line starting with dn: , whether you just want the mscId=value , or whether you want anything at all from that line other than to note that it is the 1st line of each input file record.
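
A rough sketch of the single-grep idea, assuming each attribute name sits at the start of its line and is followed by a colon. The file hlr_demo.ldif and every value in it are invented for the demonstration, since the real file layout has not been shown; with a grep that lacks -E, egrep would be the equivalent:

```shell
# Made-up stand-in for DS1_HLR_export170217.ldif; all values are invented.
cat > hlr_demo.ldif <<'EOF'
dn: example=1
MSISDN: 111
IMSI: 222
OTHER: x
NAM: 0

dn: example=2
MSISDN: 333
IMSI: 444
OTHER: y
NAM: 1
EOF

# One pass over the file: keep every line whose attribute name is in the list.
grep -E '^(MSISDN|IMSI|NAM|TS11|TS21|TS22|TS62|BAIC|BAOC|APNID1|APNID2):' hlr_demo.ldif
```

Unlike the loop, this reads the file once instead of eleven times per record, and it prints the selected lines in input order without the blank separator lines.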

For us to be able to help you in either of these threads, you need to understand what output you are trying to produce, explain clearly to us what that output format is, and show us exactly what output you want produced from a representative sample input file.

You have said that there are some records with emcrypted (sic) values that are to be ignored, but you haven't given us any indication of how to determine which records are to be included and which records are to be ignored. Please clarify your requirements in this regard.

gillesi · February 20, 2017, 2:43am

INPUT FILE:
I don't need the mscId in the first two blocks (highlighted in red, encrypted).
In my input file there are thousands of blocks like these, followed by blocks like the last two, in which I need the mscId highlighted in green. So if there are 500 blocks with the encrypted mscId, there are also 500 blocks with the mscId highlighted in green. Therefore I need the mscId and EpsProfileId highlighted in green.

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaa8c4c6b30a4b1abcfce3990027d6a4,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170202223552Z
modifyTimestamp: 20170202223552Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaa9526a822d46979331a964be201cba,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170119122933Z
modifyTimestamp: 20170119122933Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170119123400Z
modifyTimestamp: 20170119123400Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100165619,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170211115449Z
modifyTimestamp: 20170211115449Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

wanted output:

mscId=aaaaaa001aaaaaaaa629100100138702,EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619,EpsProfileId: 29

Don_Cragun · February 20, 2017, 3:24am

It is very nice to know that you don't need the encrypted records and that you need the unencrypted records, but I repeat:

Given that the string values for the mscId fields that you have highlighted in red as being encrypted are:

aaa8c4c6b30a4b1abcfce3990027d6a4
aaa9526a822d46979331a964be201cba

and the string values for the mscId fields that you have highlighted in green as being unencrypted are:

aaaaaa001aaaaaaaa629100100138702
aaaaaa001aaaaaaaa629100100165619

both sets of which are 32-character lower-case alphanumeric strings and the remainder of those lines are identical in the encrypted and unencrypted cases, how did you determine which of those strings are encrypted and which are unencrypted?

gillesi · February 20, 2017, 5:37am

In the unencrypted format, the value 629100 should distinguish the unencrypted mscId I want to extract, as in aaaaaa001aaaaaaaa629100100165619

RudiC · February 20, 2017, 1:33pm

Is that always 629100 or can it be another string? Is it always in that exact position within the field, or any place?

gillesi · February 20, 2017, 4:10pm

For each block, it's always in the same position and in the same format and the epsProfileId also in the same position

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
structuralObjectClass: EpsStaticInf
objectClass: EpsStaticInf
entryDS: 1
nodeId: 1
createTimestamp: 20170119123400Z
modifyTimestamp: 20170119123400Z
EpsStaInfId: EpsStaInf
EpsProfileId: 29
EpsOdb: 0
EpsRoamAllow: TRUE
CDC: 1
EpsIndSubChargChars: 123

Aia · February 20, 2017, 6:05pm

Would that work?

perl -00 -ne '($m,$e)=/(mscId=[^,]*).*(EpsProfileId:\s+\d+)/s; $m=~/629100/ and print "$m,$e\n"' gillesi.input

or

perl -00 -ne '($m,$e)=/(mscId=[^,]*).*(EpsProfileId:\s+\d+)/s;$m=~/\d{15}$/ and print "$m,$e\n"' gillesi.input

mscId=aaaaaa001aaaaaaaa629100100138702,EpsProfileId: 29
mscId=aaaaaa001aaaaaaaa629100100165619,EpsProfileId: 29
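
The -00 switch makes perl read the file one blank-line-separated block at a time. For comparison, here is a sketch of the same paragraph-at-a-time idea in awk, where setting RS to the empty string does the equivalent job. The file name para_demo.txt and the shortened blocks are made up for the demonstration:

```shell
# Shortened stand-in input: one encrypted block, one unencrypted block.
cat > para_demo.txt <<'EOF'
dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaa8c4c6b30a4b1abcfce3990027d6a4,ou=multiSCs,dc=mtncg
EpsProfileId: 29
EpsOdb: 0

dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=aaaaaa001aaaaaaaa629100100138702,ou=multiSCs,dc=mtncg
EpsProfileId: 29
EpsOdb: 0
EOF

# RS="" makes each blank-line-separated block one record (like perl -00);
# FS="\n" makes each line of the block one field.
awk 'BEGIN { RS = ""; FS = "\n" }
{
    m = e = ""
    for (i = 1; i <= NF; i++) {
        # Pull the mscId=... part out of the dn: line.
        if (match($i, /mscId=[^,]*/)) m = substr($i, RSTART, RLENGTH)
        # Remember the whole EpsProfileId line.
        if ($i ~ /^EpsProfileId:/) e = $i
    }
    # Keep only blocks whose mscId contains 629100.
    if (m ~ /629100/ && e != "") print m "," e
}' para_demo.txt
```

On this stand-in file only the second block is printed, because the first mscId does not contain 629100.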

Don_Cragun · February 20, 2017, 6:54pm

If you don't like perl , or if you want a commented script to help you understand what is going on, you could also consider the following bash and /usr/xpg4/bin/awk scripts that should work fine on a Solaris 10 system:

#!/bin/bash
# Extract script name from final component of path used to invoke this script.
IAm=${0##*/}

# Verify that we have been called with two operands...
if [ $# -ne 2 ]
then	# We were not called with two operands, print usage diagnostic and exit.
	printf 'Usage: %s input_filename output_filename\n' "$IAm" >&2
	exit 1
fi

# The 1st operand (input_filename) is assumed to contain multi-line records
# with each record containing a line of the form (without the leading "# "):
# dn: EpsStaInfId=EpsStaInf,serv=EPS,mscId=string1,ou=multiSCs,dc=mtncg
# followed by a later line in the record of the form (again without the
# leading "# "):
# EpsProfileId: string2
# where "string1" in the 1st line is a 32 character fixed-length alphanumeric
# string.  If the 6 characters starting at position 18 in this string are
# "629100" as in "aaaaaa001aaaaaaaa629100100138702", we need to extract and
# print the 3rd comma separated field from the 1st line and the entire 2nd line
# and print them as comma separated fields in a single output line in the file
# named by the 2nd operand (output_filename).  If those 6 characters are any
# other string, nothing from that record from input_filename will be included
# in output_filename.

# Use awk with input fields separated by colons and commas and the output field
# separator set to comma to perform the above specified actions.
/usr/xpg4/bin/awk -F '[:,]' -v OFS=, '
# If the 1st field on the line is the string "dn", determine whether or not
# this is an unencrypted record.
$1 == "dn" {
	# Uncomment the next line to see the fields we are examining.
	#printf("*$1=\"%s\" $4=\"%s\" substr($4,24,6)=\"%s\"\n", $1, $4, substr($4, 24, 6))

	# If the string "629100" appears starting at position 24 (note that we
	# have "mscId=string1" here, not just "string1"):
	if(substr($4, 24, 6) == "629100") {
		# it is, so we have an unencrypted record.
		unencrypted = 1
		# save the mscId field from this line to be printed later.
		mscId = $4
	} else {# it is not, so we have an encrypted record.
		unencrypted = 0
	}

	# We are done with this record, so move on to the next input line.
	next
}
# If we are processing an unencrypted record, look for the "EpsProfileId" data.
unencrypted && ($1 == "EpsProfileId") {
	# We found it.  Print the saved mscId from the previous "dn:" line and
	# the "EpsProfileId" data from this line.
	print mscId, $0
	# and clear the unencrypted flag to speed up processing until we find
	# the next "dn:" line.
	unencrypted = 0
}' "$1" > "$2"	# End the awk script naming the file to be processed and
		# redirecting output to the specified output file.

or an alternative awk approach (without the comments):

#!/bin/bash
# Extract script name from final component of path used to invoke this script.
IAm=${0##*/}

# Verify that we have been called with two operands...
if [ $# -ne 2 ]
then	# We were not called with two operands, print usage diagnostic and exit.
	printf 'Usage: %s input_filename output_filename\n' "$IAm" >&2
	exit 1
fi
/usr/xpg4/bin/awk -F '[:,]' -v OFS=, '
$1 == "dn" { m = (substr($4, 24, 6) == "629100")? $4 : 0; next }
m && ($1 == "EpsProfileId") { print m, $0; m = 0 }' "$1" > "$2"

both of which produce the output you said you wanted in post #12.

In both of these nawk should work as well as /usr/xpg4/bin/awk .

If you uncomment the debugging printf command in the 1st script, you'll get the output:

*$1="dn" $4="mscId=aaa8c4c6b30a4b1abcfce3990027d6a4" substr($4,24,6)="cfce39"
*$1="dn" $4="mscId=aaa9526a822d46979331a964be201cba" substr($4,24,6)="331a96"
*$1="dn" $4="mscId=aaaaaa001aaaaaaaa629100100138702" substr($4,24,6)="629100"
mscId=aaaaaa001aaaaaaaa629100100138702,EpsProfileId: 29
*$1="dn" $4="mscId=aaaaaa001aaaaaaaa629100100165619" substr($4,24,6)="629100"
mscId=aaaaaa001aaaaaaaa629100100165619,EpsProfileId: 29

showing how awk split the fields in the dn: lines.
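
If you want to double-check the position used in substr() for yourself, awk's index() function reports where a substring starts. A tiny sketch using one of the unencrypted values from your sample:

```shell
# index() returns the 1-based position of the first occurrence of a substring.
awk 'BEGIN {
    f = "mscId=aaaaaa001aaaaaaaa629100100138702"
    print index(f, "629100")      # position within the whole field
    print substr(f, 24, 6)        # the 6 characters the script compares
}'
```

This prints 24 and then 629100: position 18 within the 32-character string plus the 6 characters of "mscId=" in front of it.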

gillesi · February 22, 2017, 2:18am

The first one is working fine. It gave me the output just how I wanted. THANK YOU. Now I need a tutorial on how to use perl and awk, please. A very simple, straightforward tutorial.

---------- Post updated at 02:18 AM ---------- Previous update was at 02:10 AM ----------

Don Cragun thank you very much. This was a tutorial to me. The awk is working fine.

Corona688 · February 22, 2017, 9:53am

I was wondering if you were using disks over nfs or something. That would explain the poor performance.