How to convert .xml file data to .indexfile which is a data file?

Sahithi_7 · April 3, 2023, 8:25am

I have an XML file with index values, and I will have an .ini file listing the index values that should be written to the data file. Using the .ini file as a reference, I should be able to choose which index values' related data gets written to the data file.

I'm not very familiar with shell scripting. Could someone help me with this? Any input would be really appreciated.

Thanks in advance

drysdalk · April 3, 2023, 12:06pm

Hello,

Welcome to the forum ! We hope you enjoy your time here, and that you find this to be a friendly and welcoming place.

So, am I right in saying that the situation is this:

File 1 contains a list of indexes, and their values
File 2 contains a list of indexes
For each of the indexes in File 2, you want to write out that index along with its value from File 1

Is that correct, or is the situation different from described above ?

Also, could you please provide information on your operating environment (e.g. Red Hat Enterprise Linux 7.9 on x86_64; AIX 7.3 on POWER9; etc.), the shell you are using to implement your solution (e.g. Bash, tcsh, etc.), and a sample of the data from each file, along with an example of what you would want the output to look like ?

Lastly (and arguably most importantly), could you please share your own thoughts and attempts so far, even if they have been unsuccessful ? Here we generally like to operate on the basis of you first attempting to implement a solution yourself, and then letting us know which particular problems or issues you've run into so we can help you to resolve those to get your task over the finishing line, so to speak.

So, if you can please get back to us addressing each of the above points, then we can take things from there. Thanks !

Sahithi_7 · April 3, 2023, 5:09pm

Hello,

Thanks for your reply.

The situation you mentioned is correct.
Just to add to that: File 1 is an XML file.
For File 2, we still need to decide which file type would fit this kind of situation.

I'm not sure about the environment. I will get back to you with that info by morning.

The shell we are using is bash.

I will provide clear sample data with an example by tomorrow.

Unfortunately, I have not tried anything yet because I don't know where to start or what kind of commands to use.

If you give me an initial hint, I will try it myself first and come back to you with my experience.

Thanks

drysdalk · April 3, 2023, 6:38pm

Hello,

OK, thanks - if you can get back to us with that information, that would be great.

By way of general advice however, the awk utility is quite often one of the tools that people choose when wanting to pick out a particular piece of data from a file. In the most commonly-encountered use case, awk is used to print a particular numbered field out of its given input, optionally with a specific field delimiter being specified.

So, let's imagine we have a fairly simple sample file that looks like this:

$ cat fruits.xml
<Fruits>
        <Value1>Apple</Value1>
        <Value2>Orange</Value2>
        <Value3>Banana</Value3>
</Fruits>
$

Now let's say we've been tasked with printing the contents of the Value2 field from this file. We can use awk, like so:

$ awk -F '[<>]' '/Value2/ {print $3}' fruits.xml
Orange
$

Let's break down bit-by-bit what we've seen here. Firstly, this part of the awk command specifies the field separator:

-F '[<>]'

The -F flag tells awk that we're setting our field separator. Here the separator is a regular expression: a bracket expression, which matches any single one of the characters listed inside the square brackets. So in our case, any angled bracket, opening or closing, will be regarded as the boundary between two fields.

Next, we have this part of our command:

/Value2/

This tells awk that we want it to search through its input for lines which match the pattern "Value2". We choose this because, in our example, this is the particular field we want to print the contents of.

Lastly, we have this:

{print $3}

This, quite simply, tells awk to print the third field. Now, remember that our field separator is either an opening or closing angled bracket. So in the case of our sample line, awk would therefore now read it thusly:

        <Value2>Orange</Value2>
^           ^       ^
|           |       |
|           |       3rd field
|           |
1st field   2nd field

So the leading whitespace (before the first angled bracket) is field 1; the string "Value2" (after the first angled bracket) is field 2; and the string "Orange" (after the second angled bracket) is field 3.

So the end result is that we get the output we want, namely the contents of "Value2" within our sample file.
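To see the split for yourself, here is a minimal sketch that prints every field awk finds on that one sample line. The variable names (line, fields) are just for illustration.

```shell
# One sample line from fruits.xml, including its leading whitespace.
line='        <Value2>Orange</Value2>'

# Print each field with its number, using the same field separator.
fields=$(printf '%s\n' "$line" |
    awk -F '[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The line splits into five fields in total: after "Orange" come the closing tag name "/Value2" as field 4 and an empty field 5 following the final angled bracket.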

Anyway, hope this helps !

Sahithi_7 · April 3, 2023, 8:31pm

Hello,

That was a pretty clear explanation. Thanks a lot.
I will try to implement it using the example above and see how it works for me.

I will get back to you tomorrow with details as mentioned.

Thanks a lot

Sahithi_7 · April 6, 2023, 1:23pm

@drysdalk
Sorry I was late getting the details.
The example XML file looks like this:

<?xml version="1.0"?>
<documents>
  <document>
    <indexvalue>
        <value1>xxxx</value1>
        <value2>1234</value2>
    </indexvalue>
    <file>file1.pdf</file>
  </document>
 </documents>

I need the output as a .csv file with a header, like below:

Value1,Value2,file
xxxx,1234,file1.pdf

The header names need to map to the XML index values.

Sahithi_7 · April 6, 2023, 1:36pm

Just to add to the above point:

Sometimes I have more than one XML file in a folder. In that case, an individual CSV file should be generated for each .xml file available in the folder.

vgersh99 · April 6, 2023, 2:13pm

@Sahithi_7,
please use proper markdown code tags when providing code/data samples. The markdown code tags are referenced in the forum welcome page here.
I don't think we're getting any traction withOUT any specific simple, representative and reproducible input data samples (if there're multiple) AND the corresponding desired output example.
So far we have lots of "hand waving" and sprinkles of the unformatted inputs.
I don't think we'll get anywhere with this thread unless/until you provide all the requested and properly formatted data.
Also, we'd ask you to provide your own attempt(s) at solving this issue - this will enable us to help you in any tangible way.

Let's see how far you get with the above.

MadeInGermany · April 6, 2023, 2:21pm

I just edited your post and added the markdown tags.

Sahithi_7 · April 6, 2023, 2:39pm

Thanks @MadeInGermany

I was not aware of these markdown tags before. I will make sure I use them next time I post code.

My attempts:
I tried using the command: xmllint

What I am still not sure about is how to write to a CSV file. I am able to print the values, but not to write them into a CSV file that has a header.

vgersh99 · April 6, 2023, 3:17pm

You really need to use XML-specific tools like xmlstarlet, xmllint, etc., as they're built for XML. I'd suggest diving into xmlstarlet if your XML format changes.
But for the posted input/output sample, you could try (adapted from here): awk -f sahithi.awk myFile.xml where sahithi.awk is:

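BEGIN {
    FS="[<>]"          # split each line on angle brackets
    OFS=","
    # wanted tags, in output column order
    n=split("value1 value2 file",tags,/ /)
    # print the CSV header
    for (i=1; i<=n; i++) {
        printf "%s%s", tags[i], (i<n?OFS:ORS)
    }
}
# skip the XML declaration on the first line
FNR==1 {next}
# <tag>value</tag> splits into 5 fields: remember the value by tag name
NF==5 { tag2val[$2] = $3 }
# <file> is the last wanted tag in a document: emit one CSV row
/<\/file>/ {
    for (i=1; i<=n; i++) {
        printf "%s%s", tag2val[tags[i]], (i<n?OFS:ORS)
    }
}
# reset for the next document
/<\/document>/ { delete tag2val }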

yielding:

value1,value2,file
xxxx,1234,file1.pdf
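The tag list in the script above is hard-coded. Since the original request was to pick the wanted index values from a separate file, here is a rough sketch of one way to do that. The selection file name (tags.ini) and its format (one tag name per line, no sections or key=value pairs) are assumptions for illustration, not something specified in this thread.

```shell
tmp=$(mktemp -d)   # scratch directory so nothing existing is overwritten

# Sample input from the thread (XML declaration closed with ?>).
cat > "$tmp/myFile.xml" <<'EOF'
<?xml version="1.0"?>
<documents>
  <document>
    <indexvalue>
        <value1>xxxx</value1>
        <value2>1234</value2>
    </indexvalue>
    <file>file1.pdf</file>
  </document>
</documents>
EOF

# Hypothetical selection file: one wanted tag name per line.
printf '%s\n' value2 file > "$tmp/tags.ini"

csv=$(awk '
    BEGIN { FS = "[<>]"; OFS = "," }
    # First file (tags.ini): remember the wanted tags in order.
    NR == FNR { if ($1 != "") tags[++n] = $1; next }
    # First line of the XML file: print the CSV header.
    FNR == 1 { for (i = 1; i <= n; i++) printf "%s%s", tags[i], (i < n ? OFS : ORS) }
    # <tag>value</tag> lines have 5 fields: remember the value by tag name.
    NF == 5 { tag2val[$2] = $3 }
    # After <file>, emit one row with only the selected tags.
    /<\/file>/ {
        for (i = 1; i <= n; i++) printf "%s%s", tag2val[tags[i]], (i < n ? OFS : ORS)
    }
    # Reset for the next document (portable form of "delete tag2val").
    /<\/document>/ { split("", tag2val) }
' "$tmp/tags.ini" "$tmp/myFile.xml")

printf '%s\n' "$csv"
```

With value2 and file listed in tags.ini, only those two columns are written. This still relies on each wanted tag sitting alone on its own line, as in the posted sample.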

Sahithi_7 · April 6, 2023, 4:10pm

Xmlstarlet is not supported in the environment I have.

I will first try the solution you proposed. Then I will let you know how it went.

Thanks @vgersh99

vgersh99 · April 6, 2023, 5:26pm

Then it looks like xmllint was an option. You could dive into that instead.
All it takes is just a little patience and some googling. Worth the effort in the long run.
Good luck.

MadeInGermany · April 6, 2023, 5:58pm

The following variant ensures that the wanted tags are between the document tags.

BEGIN {
    FS="[<>]"
    OFS=","
    n=split("value1 value2 file",tags,/ /)
    for (i=1; i<=n; i++) {
        printf "%s%s", tags[i], (i<n?OFS:ORS)
    }
}
/<document>/ { doc=1 }
doc==1 {
  if (NF==5) { tag2val[$2] = $3 }
  if (/<\/file>/) {
    for (i=1; i<=n; i++) {
      printf "%s%s", tag2val[tags[i]], (i<n?OFS:ORS)
    }
  }
}
/<\/document>/ { delete tag2val; doc=0 }

A Posix awk needs split("", tag2val) rather than delete tag2val
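As a quick check of that idiom, the following small sketch fills an array, empties it with split, and counts what is left. The array and variable names are arbitrary.

```shell
result=$(awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    split("", a)              # portable way to empty the whole array
    n = 0
    for (k in a) n++          # count the remaining elements
    print n
}')

printf '%s\n' "$result"
```

It should print 0, showing that the array is empty after the split call.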

I managed to extract the wanted values with xmllint --xpath but that seemed too cumbersome for the requirement.

Sahithi_7 · April 11, 2023, 10:08am

Thanks for the above solution.
How can I write those values to a .csv file? I tried writing
'</file.csv>'
at the end of the printf, but it does not seem to work.

MadeInGermany · April 11, 2023, 11:27am

How do you run the awk?
Add a redirection to an output file. For example:
awk -f script.awk inputfile.xml > outputfile.csv
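For the earlier requirement of one CSV per XML file in a folder, the same redirection can go inside a loop. This is a minimal sketch under some assumptions: the scratch folder, the a.xml and b.xml sample files, and their contents are made up for the demonstration, and script.awk is taken to hold the variant above.

```shell
dir=$(mktemp -d)   # scratch folder standing in for the real one

# script.awk: the variant above, with the POSIX-safe array reset.
cat > "$dir/script.awk" <<'EOF'
BEGIN {
    FS = "[<>]"; OFS = ","
    n = split("value1 value2 file", tags, / /)
    for (i = 1; i <= n; i++) printf "%s%s", tags[i], (i < n ? OFS : ORS)
}
/<document>/ { doc = 1 }
doc == 1 {
    if (NF == 5) tag2val[$2] = $3
    if (/<\/file>/)
        for (i = 1; i <= n; i++) printf "%s%s", tag2val[tags[i]], (i < n ? OFS : ORS)
}
/<\/document>/ { split("", tag2val); doc = 0 }
EOF

# Two made-up XML files shaped like the posted sample.
for name in a b; do
    cat > "$dir/$name.xml" <<EOF
<?xml version="1.0"?>
<documents>
  <document>
    <indexvalue>
        <value1>val_$name</value1>
        <value2>1234</value2>
    </indexvalue>
    <file>$name.pdf</file>
  </document>
</documents>
EOF
done

# One CSV per XML file: a.xml -> a.csv, b.xml -> b.csv.
for xml in "$dir"/*.xml; do
    [ -e "$xml" ] || continue      # the glob matched nothing
    # > creates the .csv, or overwrites it if it already exists
    awk -f "$dir/script.awk" "$xml" > "${xml%.xml}.csv"
done

cat "$dir/b.csv"
```

The ${xml%.xml} expansion strips the .xml suffix, so each output file sits next to its input with the same base name.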

Sahithi_7 · April 11, 2023, 12:34pm

So should I already have a CSV file created in the folder, and I just need to name it?

vgersh99 · April 11, 2023, 12:50pm

No. Just try it and experiment a little bit on your own.

Sahithi_7 · April 11, 2023, 1:03pm

Doing it now. I will be back if I have any questions.

Sahithi_7 · May 12, 2023, 4:06pm

Hello,

I am stuck and have been trying for 2 days. Could you help?
In a folder I have .pdf files and .csv files.
Each .csv file has a value that holds a .pdf file name.
I want to validate the number of .pdf files in the folder against the number of .pdf entries in the .csv file.
I used grep -c "csv filename" to find the number of .pdf entries in the CSV file.
The problem is that I may have more than one CSV file. I have to sum the counts from all the .csv files and compare that total against the number of .pdf files in the folder.
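One possible approach, as a minimal sketch: feed all the CSV files to a single grep -c so that it reports one combined total, then count the .pdf files with a loop and compare. The scratch folder and its sample files are made up for the demonstration, and the sketch assumes each CSV data line names exactly one .pdf file and that the header line does not contain ".pdf".

```shell
dir=$(mktemp -d)   # scratch folder standing in for the real one

# Made-up sample data: three PDFs and two CSV files that reference them.
touch "$dir/file1.pdf" "$dir/file2.pdf" "$dir/file3.pdf"
printf '%s\n' 'value1,value2,file' 'xxxx,1234,file1.pdf' > "$dir/a.csv"
printf '%s\n' 'value1,value2,file' 'yyyy,5678,file2.pdf' 'zzzz,9012,file3.pdf' > "$dir/b.csv"

# cat merges all CSV files into one stream, so grep -c prints a single
# total (it counts matching lines, not individual occurrences).
csv_count=$(cat "$dir"/*.csv | grep -c '\.pdf')

# Count the .pdf files in the folder itself (no subfolders).
pdf_count=0
for f in "$dir"/*.pdf; do
    if [ -e "$f" ]; then
        pdf_count=$((pdf_count + 1))
    fi
done

if [ "$csv_count" -eq "$pdf_count" ]; then
    status="match"
else
    status="mismatch"
fi
printf '%s: %s CSV entries, %s PDF files\n' "$status" "$csv_count" "$pdf_count"
```

Running grep -c directly on several files prints one count per file, which is why the files are concatenated first.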