Extract part of a text file

I need to extract certain lines from the text below and put them in separate files.

From xxxx@gmail.com  Thu Jun 10 21:15:46 2010
Return-Path: <xxxxx@gmail.com>
X-Original-To: xxx@localhost
Delivered-To:xxxx@localhost
Received: from ubuntu (localhost [127.0.0.1])
    by ubuntu (Postfix) with ESMTP id 53FDD2575A
    for <xxxxxx@localhost>; Thu, 10 Jun 2010 12:15:46 -0700 (PDT)
MIME-Version: 1.0
Received: from gmail-pop.l.google.com [xxxxx]
    by ubuntu with POP3 (fetchmail-6.3.9-rc2)
    for <xxxxxx@localhost> (single-drop); Thu, 10 Jun 2010 21:15:46 +0200 (CEST)
Received: by xxxxxx with HTTP; Thu, 10 Jun 2010 12:13:40 -0700 (PDT)
Date: Thu, 10 Jun 2010 21:13:40 +0200
Delivered-To: xxxxxxxr@gmail.com
Message-ID: <xxxxxxxxx@mail.gmail.com>
Subject: TOPIC
From: NAME <xxxxxxxx@gmail.com>
To: xxxxxxxxxxx@gmail.com
Content-Type: multipart/alternative; boundary=001485f1ea94fa4e4d0488b1d13c
X-Antivirus: avast! (VPS 100610-0, 10/06/2010), Inbound message
X-Antivirus-Status: Clean

--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/plain; charset=ISO-8859-1

This is an exemple from text

--001485f1ea94fa4e4aaaa8b1d13c
Content-Type: text/html; charset=ISO-8859-1

This is an exemple from text

--001485f1ea94fa4e4aaaa8b1d13c--

From xxxx@gmail.com  Thu Jun 10 21:15:46 2010
Return-Path: <xxxxx@gmail.com>
X-Original-To: xxx@localhost
Delivered-To:xxxx@localhost
Received: from ubuntu (localhost [127.0.0.1])
    by ubuntu (Postfix) with ESMTP id 53FDD2575A
    for <xxxxxx@localhost>; Thu, 10 Jun 2010 12:15:46 -0700 (PDT)
MIME-Version: 1.0
Received: from gmail-pop.l.google.com [xxxxx]
    by ubuntu with POP3 (fetchmail-6.3.9-rc2)
    for <xxxxxx@localhost> (single-drop); Thu, 10 Jun 2010 21:15:46 +0200 (CEST)
Received: by xxxxxx with HTTP; Thu, 10 Jun 2010 12:13:40 -0700 (PDT)
Date: Thu, 10 Jun 2010 21:13:40 +0200
Delivered-To: xxxxxxxr@gmail.com
Message-ID: <xxxxxxxxx@mail.gmail.com>
Subject: TOPIC
From: NAME <xxxxxxxx@gmail.com>
To: xxxxxxxxxxx@gmail.com
Content-Type: multipart/alternative; boundary=001485f1ea94fa4e4d0488b1d13c
X-Antivirus: avast! (VPS 100610-0, 10/06/2010), Inbound message
X-Antivirus-Status: Clean

--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/plain; charset=ISO-8859-1

this text can be
1 or more lines
like this

--001485f1ea94fa4e4d0asdfadgad3c
Content-Type: text/html; charset=ISO-8859-1

this text can be
1 or more lines
like this

--001485f1ea94fa4e4d0asdfadgad3c--

I need an output file like this:

Subject: TOPIC
From: NAME <xxxxxxxx@gmail.com>
this text can be
1 or more lines
like this

Thank you for helping.

If there is only one file, you can use the command below:

grep -E "Subject|From|Text" infile

But your message text surely spans more than one line. Is there any specific pattern in your text that you can check for?
This is not the best solution, but it will work.
For a better one, wait for the masters of awk and sed.
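
One caveat on the grep idea: an unanchored "Subject|From" also matches the mbox separator line ("From xxxx@gmail.com  Thu Jun 10 ...") and any other line that merely contains those words. A minimal sketch of a stricter match, assuming only the two headers are wanted; the temporary file and its contents are a shortened stand-in for the real mailbox:

```shell
# Build a tiny stand-in for the mailbox (contents shortened for illustration).
sample=$(mktemp)
cat > "$sample" <<'EOF'
From xxxx@gmail.com  Thu Jun 10 21:15:46 2010
Return-Path: <xxxxx@gmail.com>
Subject: TOPIC
From: NAME <xxxxxxxx@gmail.com>
To: xxxxxxxxxxx@gmail.com
EOF

# Anchor at the start of the line and require the colon,
# so only the "Subject:" and "From:" headers match.
headers=$(grep -E '^(Subject|From):' "$sample")
printf '%s\n' "$headers"
rm -f "$sample"
```

This still does nothing for the message body, which is why the rest of the thread moves to awk.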

By the way, do you know any yaxo?


The only pattern I found is this:

search for the line "Content-Type: text", then print every line until a "--" line is reached.

Content-Type: text/plain; charset=ISO-8859-1

--

But I have no idea how to do it.

awk '/Content-Type: text\/plain; charset=ISO-8859-1/,/--/' infile
Content-Type: text/plain; charset=ISO-8859-1

This will be the starting pattern, but what will be the ending pattern?

Is this your ending pattern?

--001485f1ea94fa4e4aaaa8b1d13c--

If so, you can use sed or awk. Let me check what I can do; I am not good with either sed or awk, but I will try and give you an answer.


The end pattern is "--", because the characters that follow it change every time.


---------- Post updated at 07:34 AM ---------- Previous update was at 07:27 AM ----------

Content-Type: text/plain; charset=ISO-8859-1

this text can be
1 or more lines
like this

--001485f1ea946fb0b20488b1e401

It works! But I need only the text, not the pattern lines.

awk '/Content-Type: text\/plain; charset=ISO-8859-1/,/--/{if (!/(Content-Type: text\/plain; charset=ISO-8859-1)|(--)/){print}}' infile
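
One thing to watch with the range form: an unanchored /--/ ends the range at the first line containing two hyphens anywhere, including inside the message text. Below is a sketch of the same idea with the end pattern anchored to the start of the line and the empty lines dropped as well. The sample file is invented for illustration, and the shortened start pattern is my assumption, not something from the posts above:

```shell
# Invented sample: one message with a plain-text and an HTML part.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Subject: TOPIC
--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/plain; charset=ISO-8859-1

this text can be -- even with dashes --
1 or more lines

--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/html; charset=ISO-8859-1

<p>ignored</p>
--001485f1ea94fa4e4d0488b1d13c--
EOF

# Range from the text/plain header to the next line that STARTS with "--".
# Inside the range, skip the header, the boundary and empty lines.
body=$(awk '/^Content-Type: text\/plain/,/^--/ {
    if ($0 !~ /^(Content-Type:|--)/ && $0 != "") print
}' "$sample")
printf '%s\n' "$body"
rm -f "$sample"
```

With the unanchored end pattern, the range would have stopped right after the first body line of this sample.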

Something like this?

awk '/^Content-Type.+plain/{f=1;getline}/^--/{f=0}/^(Subject|From)/||f{print}' infile

I found another post with more information!

Thanks to everyone.

---------- Post updated at 10:41 AM ---------- Previous update was at 07:53 AM ----------

To my surprise, I discovered that Hotmail uses a different format.

From x@hotmail.com  Fri Jun 11 17:09:14 2010
Return-Path: <x@hotmail.com>
X-Original-To: x@localhost
Delivered-To: x@localhost
Received: from ubuntu (localhost [127.0.0.1])
    by ubuntu (Postfix) with ESMTP id BEB892696C
    for <x@localhost>; Fri, 11 Jun 2010 17:09:14 +0200 (CEST)
Delivered-To: x@gmail.com
Received: from gmail-pop.l.google.com [209.85.229.109]
    by ubuntu with POP3 (fetchmail-6.3.9-rc2)
    for <x@localhost> (single-drop); Fri, 11 Jun 2010 17:09:14 +0200 (CEST)
Message-ID: <BAY133-W394AF85AA21804B1A7DEDA6D90@phx.gbl>
Content-Type: multipart/alternative;
    boundary="_cc7e7bd6-5175-4c48-804d-8a25692af1e8_"
X-Originating-IP: [xxxx]
From: xx <xxx@hotmail.com>
To: <xxx@gmail.com>
Subject: TEST
Date: Fri, 11 Jun 2010 15:02:23 +0000
Importance: Normal
MIME-Version: 1.0
X-OriginalArrivalTime: 11 Jun 2010 15:02:24.0582 (UTC) FILETIME=[199F4E60:01CB0977]

--_cc7e7bd6-5175-4c48-804d-8a25692af1e8_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Here comes the message
_________________________________________________________________
=BFUn navegador seguro buscando est=E1s? =A1Protegete ya en www.ayudartepod=
ria.com!
www.ayudartepodria.com=

--_cc7e7bd6-5175-4c48-804d-8a25692af1e8_
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html>
<head>
<style><!--
.hmmessage P
{
margin:0px=3B
padding:0px
}
body.hmmessage
{
font-size: 10pt=3B
font-family:Verdana
}
--></style>
</head>
<body class=3D'hmmessage'>
nene                           <br /><hr />=BFUn navegador seguro buscando est=E1s? <a hre=
f=3D'www.ayudartepodria.com' target=3D'_new'>=A1Protegete ya en www.ayudart=
epodria.com!</a></body>
</html>=

--_cc7e7bd6-5175-4c48-804d-8a25692af1e8_--

The problems are: how do I extract the information from the new format,
and how can I determine which awk command to run?

From: xx <xxx@hotmail.com>
Subject: TEST
Here comes the message

thanks

Try this:

awk '/^Content-Type.+plain/{f=1;while($0!="")getline}/^--/||/^__/{f=0}/^(Subject:|From:)/||f{print}' infile
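
The original question also asked for each message to go into its own file, which none of the one-liners do. Here is a rough sketch of one way to get there with a single script for both the Gmail and the Hotmail layout. Everything in it is an assumption for illustration: the shortened sample mailbox, the abbreviated Hotmail boundary, the msg1.txt/msg2.txt naming, and the choice to drop empty lines. It treats each mbox "From " separator line as the start of a new message, and it guards the getline loop against end of file:

```shell
workdir=$(mktemp -d)
# Shortened stand-in mailbox: one Gmail-style and one Hotmail-style message.
cat > "$workdir/infile" <<'EOF'
From xxxx@gmail.com  Thu Jun 10 21:15:46 2010
Subject: TOPIC
From: NAME <xxxxxxxx@gmail.com>
Content-Type: multipart/alternative; boundary=001485f1ea94fa4e4d0488b1d13c

--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/plain; charset=ISO-8859-1

this text can be
1 or more lines

--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/html; charset=ISO-8859-1

<p>html copy</p>

--001485f1ea94fa4e4d0488b1d13c--

From x@hotmail.com  Fri Jun 11 17:09:14 2010
From: xx <xxx@hotmail.com>
Subject: TEST

--_cc7e_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Here comes the message
_________________________________________________________________
advert

--_cc7e_--
EOF

awk -v dir="$workdir" '
/^From / {                       # mbox separator: switch to a new output file
    if (out != "") close(out)
    out = dir "/msg" (++n) ".txt"
    f = 0
    next
}
out == "" { next }               # ignore anything before the first message
/^Content-Type: text\/plain/ {   # skip the rest of the part headers
    f = 1
    while ((getline) > 0)        # stops at end of file instead of looping
        if ($0 == "") break
    next
}
/^--/ || /^__/ { f = 0 }         # boundary or Hotmail footer ends the body
/^(Subject|From):/ || (f && $0 != "") { print > out }
' "$workdir/infile"

msg1=$(cat "$workdir/msg1.txt")
msg2=$(cat "$workdir/msg2.txt")
printf '%s\n---\n%s\n' "$msg1" "$msg2"
rm -rf "$workdir"
```

The headers come out in whatever order they appear in each message, so the Hotmail file has From: before Subject:. The quoted-printable encoding of the Hotmail body is left untouched.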

Can someone explain to me what exactly this code does?

thanks

I think it first searches for the text between "Content-Type.+plain" and "--" and prints it.
Then it searches for the lines containing Subject or From and puts them at the top. Does (f=0) mean print at line 0?

Is that right?

/^Content-Type.+plain/{f=1;getline}

When a line matching "^Content-Type.+plain" is found, "f" is set to 1 and getline reads the next line into $0.

/^--/{f=0}

When a line matches "^--", "f" is set back to 0. Together, these two commands keep "f" at 1 for all lines between "^Content-Type.+plain" and "^--" in the file. The next command:

/^(Subject|From)/||f{print}

checks whether the line begins with "Subject" or "From", or whether "f" is true (nonzero). If so, the line is printed. So the last command prints the lines matched by "^(Subject|From)" and those between "^Content-Type.+plain" and "^--", because "f" was set to 1 for them.
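
To make the three rules easier to follow, here is the same one-liner spread over several lines with comments and run against a cut-down sample. The sample file is a shortened stand-in; the awk logic itself is unchanged:

```shell
# Shortened stand-in for one message from the mailbox.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Subject: TOPIC
From: NAME <xxxxxxxx@gmail.com>
To: xxxxxxxxxxx@gmail.com

--001485f1ea94fa4e4d0488b1d13c
Content-Type: text/plain; charset=ISO-8859-1

this text can be
1 or more lines
like this

--001485f1ea94fa4e4d0488b1d13c--
EOF

result=$(awk '
# Rule 1: part header found. Raise the flag and replace $0 with the next
# line (the empty separator), so the header itself is never printed.
/^Content-Type.+plain/ { f = 1; getline }

# Rule 2: a boundary line lowers the flag again.
/^--/ { f = 0 }

# Rule 3: print header lines of interest, plus every line seen while f is 1.
/^(Subject|From)/ || f { print }
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Note that the empty lines before and after the body are printed as well, because "f" is still 1 when awk reaches them. The unanchored "^(Subject|From)" would also print an mbox "From xxxx@gmail.com ..." separator line if the input had one.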
