Perl or Awk script to copy part of a text file.

Great, thanks Frans!

---------- Post updated 11-05-09 at 04:18 PM ---------- Previous update was 11-04-09 at 07:09 PM ----------

Hi Frans,
I really appreciate the time and effort you are spending on this. I have a request: can this code be changed to read from a DB2 table and write to a DB2 table?

Thanks,

I can try if I can get a sample of the data, but I believe there are powerful (and free) tools to manage such tables. I don't know what kind of system and distribution you are using, but I'm sure you can find an appropriate one.
Did you get that XML code by exporting it from DB2?

Hi Frans,
Let me explain the scenario.
We are using SAS DI studio to do the ETL.
The source DB2 table has 6 columns, and one of the columns is a CLOB datatype containing the XML. When we imported it into SAS DI, it used a lot of memory for the CLOB datatype; since we only need part of the XML, we thought we would process the XML through shell scripting to get the string we need.
Instead of exporting the DB2 table and processing the XML, it would be easier if we could read the XML column from the DB2 table, process it, and save the result in a new column in the same DB2 source table.

Hope I'm clear.

Thanks,

Doesn't the ETL module of SAS DI provide any ability to pre-process XML data?
Don't you have any tool (SQL...) to extract the data from DB2? It could then be processed and stored in a format the ETL understands.
It would be a pity if the ETL couldn't do that with a couple of lines of configuration code.
Bash, Awk, and Sed can do a lot, and Perl might be even more appropriate; they can all process any kind of CSV with scripts.

Hi Frans,
OK, let me explain the complexity of the process.
First of all, SAS DI processes XML well. The issue here is that the XML does not share a single XSD: there are more than 100 XSDs covering these XMLs, so we would have to group the XMLs by XSD before any XML parser could read them (which is by itself a big process), and SAS doesn't recognize an XSD that is more than three levels deep, so we would need to create an equivalent .map file. Secondly, reading all the XML through SAS DI would translate into hundreds of tables.
We can reduce the number of tables by selecting only the nodes we need, but then there is the complexity of associating the records from the XML tables with their original records in the DB2 table, as there is no link inside the XML.
Next, there is not much of a timeline for this project (1 month), and finally there are memory and load-window constraints. Taking all this into consideration, we thought it would be better to process the XML as a string.

Thanks,

OK.
P.S. I'm a total beginner with Sed, but I found a one-line command that works with your sample:

sed '/\(<text>.*<\/text>\)/!d;s/<text>\(.*\)<\/text>/\1/;s/\(<.*>\)//' inputfile > outputfile
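For what it's worth, here is how the one-liner can be read, demonstrated on a small made-up sample (the `<other>` and `<b>` elements below are invented for illustration):

```shell
# The same sed command, read as its three expressions:
#   1. /<text>.*<\/text>/!d        delete every line without a complete <text>...</text>
#   2. s/<text>\(.*\)<\/text>/\1/  strip the <text> wrapper, keeping the content
#   3. s/\(<.*>\)//                remove one (greedy) run of leftover inline tags
printf '<other>skip</other>\n<text>foo <b>bar</b> baz</text>\n' |
  sed '/\(<text>.*<\/text>\)/!d;s/<text>\(.*\)<\/text>/\1/;s/\(<.*>\)//'
# prints: foo  baz
```

Note that expression 3 is greedy: it removes everything between the first `<` and the last `>` remaining on the line, so this only behaves as intended when the `<text>` element is alone on its line.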

Or, if the <text>... ... </text> may span multiple lines, proceed as follows:

tr "\n" " " < inputfile | sed '/\(<text>.*<\/text>\)/!d;s/<text>\(.*\)<\/text>/\1/;s/\(<.*>\)//' > outputfile
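A quick self-contained check of that pipeline, on an invented two-line sample:

```shell
# The tr step joins all lines into one so sed can match across the
# original line break inside <text>...</text>.
printf '<text>hello\nworld</text>\n' |
  tr "\n" " " |
  sed '/\(<text>.*<\/text>\)/!d;s/<text>\(.*\)<\/text>/\1/;s/\(<.*>\)//'
# prints "hello world" (plus a trailing space where the final newline was)
```

One caution: because the whole file becomes a single line, any tags outside <text>...</text> (for example a surrounding root element) will be swallowed by the greedy third expression, content and all.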

These commands look weird, but Sed is really powerful for processing data streams.
The output file doesn't contain any reference to a record and/or field. I think we should also extract something from the XML to serve as an index for further processing. Am I right?
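If each record carries some identifying element, the same kind of one-liner could emit an index alongside the text. This is only a sketch: the `<id>` element and the one-record-per-line layout are invented here, since I don't know the actual XML structure:

```shell
# Hypothetical layout: one record per line, with an <id> and a <text> element.
# Emits "id|text" so downstream processing can join back to the DB2 rows.
printf '<rec><id>42</id><text>hello</text></rec>\n' |
  sed -n 's/.*<id>\([^<]*\)<\/id>.*<text>\([^<]*\)<\/text>.*/\1|\2/p'
# prints: 42|hello
```

The `[^<]*` groups stop at the next tag, so this stays predictable even with other markup on the line; lines without both elements are simply not printed.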
OK, I don't know where you are, but here it's 1:00 AM, so see you tomorrow.

I'm in Canada, so my time is GMT -7:00. We decided to treat this as a separate project and assigned the task to our Perl developers; they are working on it right now.
At least for the time being, I think we can relax. I really appreciated your effort.
Thank you very much.