Extracting fixed length number from a text file

Hi,

I have a text file with sample records as

CASE ID: 20170218881083  
Original presentment record for ARN  [24013935350549886999873] not found
for Re-presentment

I want to extract the 23-digit number from this file. I thought of using grep, but initially couldn't extract the required number. After some googling, I found these options:

 -P, --perl-regexp: Interpret PATTERN as a Perl regular expression.

and

-o, --only-matching: Show only the part of a matching line that matches PATTERN.

I did manage to extract the 23-digit number from the sample text above using grep -Po, as suggested in a forum, but I am confused about what these options actually do. Can someone please explain them and suggest any other commands that do the same job?

$ echo 'Original 123 presentment record for ARN  [24013935350549886999873] not found'|grep  -Po "\d{23}"
$ uname -a
Linux 2.6.18-417.el5 #1 SMP Sat Nov 19 14:54:59 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
echo 'Original presentment record for ARN  [24013935350549886999873] not found' | awk -F'[][]' '/^Original presentment/ {print $(NF-1)}'

Hi, that command will extract all 23-digit numbers from the text. The -o option is a GNU and BSD grep extension.
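
To make the effect of -o visible, here is a minimal sketch run against a made-up two-line sample. The `sample` variable and the function names are only for illustration and are not from the thread, and the pattern is written as a basic regular expression (`\{23\}`). Without -o, grep prints the whole matching line; with -o it prints only the matched text, one match per line.

```shell
# Hypothetical sample: one line without a 23-digit number, one with.
sample='CASE ID: 20170218881083
Original presentment record for ARN  [24013935350549886999873] not found'

# Without -o: the whole matching line is printed.
whole_line() { printf '%s\n' "$sample" | grep '[0-9]\{23\}'; }

# With -o: only the part of the line that matched is printed.
only_match() { printf '%s\n' "$sample" | grep -o '[0-9]\{23\}'; }

whole_line
only_match
```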

You do not need the perl extension.

grep  -Eo "[0-9]{23}" file

should work as well

The problem with this command is that it will also return part of any number that is longer than 23 digits, if present.

So it should be:

grep  -Eo '\<[0-9]{23}\>' file
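
A small sketch of the difference, assuming a grep that supports -o and the \< \> operators (GNU grep does). The input line is made up: it holds one exact 23-digit number and one 29-digit number.

```shell
# Hypothetical input: one exact 23-digit number and one 29-digit number.
sample='ARN [24013935350549886999873] and 24013935362551925806644123123'

# No boundaries: also returns the first 23 digits of the longer number.
loose() { printf '%s\n' "$sample" | grep -Eo '[0-9]{23}'; }

# With \< and \>: only a 23-digit run that forms a whole word is returned.
strict() { printf '%s\n' "$sample" | grep -Eo '\<[0-9]{23}\>'; }

loose
strict
```

The loose version prints two lines (the second one is a fragment of the 29-digit number); the strict version prints only the bracketed number.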

An equivalent awk command would be:

awk '$1~/^[0-9]{23}$/{print $1}' RS=\[ FS=\] file
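
Not every awk honours interval expressions such as {23}; as far as I know, older gawk releases need --re-interval for them. The following is a sketch of the same idea that avoids the interval by checking the length separately. The function name `extract_arn` is hypothetical, and -v is used here only to set RS and FS before any input is read.

```shell
# Read records separated by "[" and fields separated by "]", then keep
# a first field that is exactly 23 characters long and all digits.
extract_arn() {
  awk -v RS='[' -v FS=']' 'length($1) == 23 && $1 ~ /^[0-9]+$/ { print $1 }'
}

printf '%s\n' \
  'CASE ID: 20170218881083' \
  'Original presentment record for ARN  [24013935350549886999873] not found' \
  'for Re-presentment' | extract_arn
```

Like the one-liner above, this only finds numbers that are enclosed in square brackets.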

---
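( grep -Eo '\<\d{23}\>' file will also work with BSD grep, but \d is not part of POSIX extended regular expressions, so it is not portable; GNU grep only understands \d with -P )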


Sed alternative

sed -n 's/.*\[\([0-9]\{23\}\)\].*/\1/p' filename

@vgersh: As far as I know, awk -F ' ' is for defining delimiters, so what does this part of your command mean?

awk -F'[][]'

I understand that the next part matches the start of the line, and I think $NF would be the number of fields, but why are you subtracting 1 from it and then printing that out?

@Scrutinizer: True, my command did print out parts of numbers that were more than 23 digits long.

grep  -Eo '\<[0-9]{23}\>' file

The above command should work perfectly. The square brackets won't always be there as delimiters, so the awk command won't work in all cases.

I am a bit confused as to why we are using single quotes around a grep expression, because I was reading the other day that single quotes remove any meaning from special characters. Shouldn't we use double quotes?

Also, you have used \< and \>. Do they work as a block so that only numbers of exactly 23 digits are extracted?

---------- Post updated at 04:33 PM ---------- Previous update was at 04:25 PM ----------

@andy391791

I tweaked your command a bit because the square brackets may or may not be present in future texts containing the 23-digit number. The tweak seems to be working fine:

$ echo 'Original presentment record for ARN  24013935350549886999873 not found'|sed -n 's/.*\([0-9]\{23\}\).*/\1/p'
24013935350549886999873
[dsiddiqui@lxserv01 scripts]$ echo 'Original presentment record for ARN  [24013935350549886999873] not found'|sed -n 's/.*\([0-9]\{23\}\).*/\1/p'
24013935350549886999873

But when I make the number longer than 23 digits and run the command, it just extracts 23 digits from the longer number and prints them, which is not what I want. I want to print the number only if it is exactly 23 digits long.

-F defines the field delimiters. The bracket expression [][] matches either ] or [, so in this particular case both square brackets act as delimiters, because the 23-digit string is surrounded by [ and ].
We subtract 1 from NF because the last field ($NF) is the text that follows the ]. The next-to-last field, $(NF-1), is your 23-digit string.
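
To see the split for yourself, here is a short sketch that prints every field of the sample line. The function names `show_fields` and `next_to_last` are made up for this illustration.

```shell
line='Original presentment record for ARN  [24013935350549886999873] not found'

# Print every field produced by splitting on either "[" or "]".
show_fields() {
  printf '%s\n' "$line" |
    awk -F'[][]' '{ for (i = 1; i <= NF; i++) printf "%d: <%s>\n", i, $i }'
}

# Print only the next-to-last field, as in the original one-liner.
next_to_last() { printf '%s\n' "$line" | awk -F'[][]' '{ print $(NF-1) }'; }

show_fields
next_to_last
```

The line splits into three fields: the text before [, the number, and the text after ]. With NF equal to 3, $(NF-1) is field 2.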


Hi dsid

Single quotes are better than double quotes at protecting the regular expression from the shell, which is why I prefer them. When you read that they remove any meaning from special characters, that meant shell special characters, not regex special characters.
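
A quick way to check what grep would actually receive is to print the pattern instead of running it. In this sketch the variable `digits` and the two function names are hypothetical and exist only to show the shell's expansion.

```shell
digits=23   # hypothetical variable, only here to show expansion

# Single quotes: the shell passes the text through untouched.
single() { printf '%s\n' '\<[0-9]{$digits}\>'; }

# Double quotes: the shell expands $digits before grep would see the pattern.
double() { printf '%s\n' "\<[0-9]{$digits}\>"; }

single   # prints \<[0-9]{$digits}\>
double   # prints \<[0-9]{23}\>
```

In both cases the regex characters ( \< [ ] { } ) reach the command unchanged; only the shell's own special characters, such as $, behave differently.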

\< and \> are word boundary operators; they match the empty string at the beginning and end of a word, respectively.

So the 23 digits will match if they are enclosed by anything other than word characters ( [0-9A-Za-z_] , or more precisely [[:alnum:]_] ). The start or end of a line also counts as a boundary.


sed command posted originally by @andy391791
Can someone please tell me why the sed command in the output below prints the last 23 digits (counting from the right) when the number has more than 23 digits?

$ echo 'Original presentment record for ARN 24013935362551925806644 not found'|sed -n 's/.*\([0-9]\{23\}\).*/\1/p'
24013935362551925806644
[dsiddiqui@lxserv01 scripts]$ echo 'Original presentment record for ARN 24013935362551925806644123123 not found'|sed -n 's/.*\([0-9]\{23\}\).*/\1/p'
35362551925806644123123
[dsiddiqui@lxserv01 scripts]$

This is because the leading .* is greedy: it "chows" as many characters as possible (including digits) while still leaving 23 digits for the capture group, so the group ends up holding the last 23 digits. The trailing .* then matches everything else up to the end of the record/line.
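
One way to watch this happen is to capture the leading .* as well. The sketch below adds a second capture group around it and prints both groups separated by a | character; the function name `show_split` is made up for the illustration.

```shell
line='Original presentment record for ARN 24013935362551925806644123123 not found'

# Capture what the leading .* consumed (\1) as well as the 23 digits (\2).
show_split() {
  printf '%s\n' "$line" | sed -n 's/\(.*\)\([0-9]\{23\}\).*/\1|\2/p'
}

show_split
# prints: Original presentment record for ARN 240139|35362551925806644123123
```

The first six digits of the 29-digit number end up on the left of the |, which is exactly what the leading .* swallowed.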

To overcome this, let's keep .* greedy but anchor it with a trailing space. Note that \{23,\} matches 23 or more digits, so this prints the whole number:

echo 'Original presentment record for ARN 11124013935362551925806644 not found'|sed -n 's/.* \([0-9]\{23,\}\).*/\1/p'

I suggest using Perl instead, which is what the -P option was named after.

perl -nle '/\[(\d+)\]/ and print $1' dsid.file
24013935350549886999873
perl         # Perl binary.
-n           # loop through the lines of the file dsid.file
-l           # strip the newline from each input line and add one back when printing.
-e           # execute what follows as Perl code.
/\[(\d+)\]/  # capture any number of digits, as long as they are inside opening and closing brackets.
and print $1 # if a capture was successful on the line, display what was captured.

To prevent the leading pattern from consuming digits, you can exclude them from it:

sed -n 's/[^0-9]*\([0-9]\{23\}\).*/\1/p'
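
The command above still returns the first 23 digits of a longer number. If only numbers of exactly 23 digits should be printed, one possible sketch (not something posted in this thread) is to require a non-digit on both sides of the number. The function name `exact23` is hypothetical.

```shell
# Wrap the line in spaces so a number at the very start or end of the line
# still has a non-digit neighbour, then require a non-digit on both sides.
exact23() {
  sed -n 's/.*/ & /; s/.*[^0-9]\([0-9]\{23\}\)[^0-9].*/\1/p'
}

printf '%s\n' \
  'ARN 24013935362551925806644 not found' \
  'ARN 24013935362551925806644123123 not found' \
  '24013935350549886999873' | exact23
```

The 29-digit line produces no output. Because the leading .* is greedy, a line with several 23-digit numbers prints only the last one.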

I tweaked your code a little, because I may get a text file in which the 23 digits are surrounded by alphanumeric characters, and I am only looking for exactly 23 digits.

$ cat ARNs.txt
AD. 16.03.

[adfasdfasdfa82401393536255192580664asdfjkadhfa]

CASE ID: 20170218881083
Original presentment record for ARN  [24013935350549886999873] not found
for Re-presentment

CASE ID: 20170218881444
Original presentment record for ARN  [24013935361551920891659] not found
for Re-presentment

CASE ID: 20170218881447
Original presentment record for ARN  [24013935356550908226927] not found
for Re-presentment

CASE ID: 20170221894303
Original presentment record for ARN  [24013936003600942122783] not found

CASE ID: 20170221894378
Original presentment record for ARN  [24013935362551925806644] not found
for Re-presentment

tweaked code

$ perl -nle '/(\d{23})/ and print $1' ARNs.txt

Can you please comment on the new code?

Also, with perl you could add boundary operators:

perl -nle '/\b(\d{23})\b/ and print $1'

Note that this code differs from the other approaches in that it only prints the first occurrence on each line.


True, got your point. I tried it on a sample file by adding a new 23-digit number next to an already present one; it does not print the new number but the earlier code does. Thanks for pointing that out.

You could try this modification:

perl -nle 'print for /\b(\d{23})\b/g ' file

by @Aia

perl -nle '/\[(\d+)\]/ and print $1' ARNs.txt

Need your advice on the questions below, please:

  1. What is the difference between "print for" and "and print"?
  2. Why did you place the print for at the start of the command?
  3. /g, I suppose, is to check for the pattern in the complete file, or globally.
  4. I modified the perl code by @Aia with your comments, but it didn't print the additional 23-digit number on the 1st line. Why?
$ perl -nle '/\b(\d{23})\b/g and print $1' ARNs.txt
  1. and 2. print for loops over all the matches returned by /\b(\d{23})\b/g

/\[(\d+)\]/ and print $1 means: if there is a match, print the first captured group.
3. Globally per line. And since this is done for every line, because of the perl -lne command line options, this works out to be all matches.
4. Because the boundary operators ( \b ) only work if there are non-word characters surrounding the digits, as explained in post #7. There is no match because the 23 digits are surrounded by word characters, so there is no word boundary.
