Construct 3 column table from one column list

Hi all!

I'm trying my best to parse a public site for information (i.e. fiscal year and turnover) about corporations.

I'm doing this with a file containing each business name and registration number (the search key).
The file bolag.txt currently looks like this:

Burgundy 556732-7217
AcademicSearch 556406-9879

I loop through it, use cURL to fetch each page, and then do some rather unpretty sanitization of the output.

#!/bin/sh

awk '{print $1, $2}' bolag.txt | while read BNAME BNUMBER
do

URL="http://allabolag.se?what=$BNUMBER"
echo "$BNAME"
curl -s "$URL" | grep bokTable | sed -e 's/<[^>]*>//g' | awk 'NR==1,NR==9' | awk 'NR!=4 && NR!=5 && NR!=6'
echo "--"
done

My problem is that the output looks like this for each entry found:

Business Name
            2010-12
            2009-12
            2008-12
            10 532
            2 084
            0
--

I can't get my head around how to turn the first 3 rows into column headers and populate the second row with lines 4-6 in a new file.

I obviously need to make the cleanup better as well, but currently at least that works.

Any help is greatly appreciated!

Cheers

Martin

I'm not clear on your requirement.

In the example you posted,

2010-12            2009-12            2008-12 

need to be treated as columns. What about the next three rows of data? (5 values exist in the example you provided; how will you fit them under 3 columns?)

Hi

Sorry. Each result will give a header row with dates, which will be quite redundant. But I reckon I'll be able to deal with that mess once all the data is aligned in columns.

I'm sure there's a better way, but I can't see it :frowning:

I'm still not sure.

Could you please post the sample input data and the output you are expecting?

OK

So currently, with my two-row sample input file, I get

Burgundy
            2010-12
            2009-12
            2008-12
            10 532
            2 084
            0
--
AcademicSearch
            2010-12
            2009-12
            2008-12
            46 706
            37 984
            53 632
--

What I'd like to get would be an output like

   year1       year2        year3
firm1 turnover1 turnover2 turnover3
firm2 turnover1 turnover2 turnover3

Please post the sample "output" that you get for the input you provided.

Sorry, but I don't know where the "turnovers" here come from.

My previous post includes the output I currently get.

What I want to achieve is to get

                  2010-12       2009-12        2008-12
Burgundy 10 532 2 084 0
AcademicSearch 46 706 37 984 53 632

So. Each result/company from the script will yield 6 rows where the first 3 will be years and the last 3 will be each year's turnover.
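
The reshaping described here (per company: the name, three years, three turnovers, optionally followed by a "--" line) could be sketched in awk roughly as follows. This is only a sketch under assumptions: to_table is a hypothetical helper name, and the tab-separated output is an assumption chosen because the turnover values themselves contain spaces.

```shell
# Hypothetical helper: reshape "name + 3 years + 3 turnovers (+ --)" blocks
# into one tab-separated row per company, with a single header row.
to_table() {
    awk '
        { gsub(/^[ \t]+|[ \t\r]+$/, "") }   # trim blanks and any DOS carriage return
        $0 == "--" || $0 == "" { next }     # skip separator lines and empty lines
        { field[++n] = $0 }                 # collect the lines of the current block
        n == 7 {                            # name, 3 years and 3 turnovers collected
            if (!header_done) {
                printf "Company\t%s\t%s\t%s\n", field[2], field[3], field[4]
                header_done = 1
            }
            printf "%s\t%s\t%s\t%s\n", field[1], field[5], field[6], field[7]
            n = 0
        }
    '
}
```

It would be used by piping the output of the fetch loop into it, for example: sh fetch.sh | to_table > table.txt (both file names are placeholders). The header takes its years from the first company only, so it assumes all companies report the same three fiscal years.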

How about this:

cat input_file | paste - - - -

You might need to do a bit of formatting, and may need to "cut" the company name from the first line and prepend it to the next line.

Something like this:

cat input_file | paste - - - - | awk 'NR==1 {a=$1;$1="";print;next} {$0=a" "$0}1'

You'll need to do a bit of formatting with tabs.

That didn't quite do the trick, I'm afraid.

I also can't modify the input file. It will look like this

company_name, registration_number
company_name, registration_number
company_name, registration_number
company_name, registration_number

I'm using this to access a website with cURL to download each company's turnover. After I sanitize the output from cURL I get a list like this. The comments are for illustration only and are not in the actual output.

2010-12 # year 1
2009-12 # year 2
2008-12 # year 3
46.706 # turnover year 1
37.984  # turnover year 2
53.632 # turnover year 3

In my script I add the company name from the original input file, as I don't want to get that from the web each time. Currently I add it to the top of each batch of 6 lines, so the output looks like

Burgundy # Company name appended from input file
2010-12 # year 1
2009-12 # year 2
2008-12 # year 3
46.706 # turnover year 1
37.984  # turnover year 2
53.632 # turnover year 3

Now, I need to have all this in 4 columns like this

Burgundy year1 year2 year3
- to1 to2 to3

Your example gave an output like this

10.5322
2.0842 0

so no year information is there...

 
/home/TESTBOX>cat input_file
Burgundy
2010-12
2009-12
2008-12
46.706
37.984
53.632
/home/TESTBOX>cat input_file  | paste - - - -
Burgundy        2010-12         2009-12         2008-12
46.706  37.984  53.632
 
/home/TESTBOX>cat input_file  | paste - - - - | awk 'NR>1{$1="-\t"$1}1'
Burgundy        2010-12         2009-12         2008-12
-       46.706 37.984 53.632

Anything else?

Thanks for your help! The problem was that the output had DOS newlines and I had to take care of the trailing and leading spaces, so the paste didn't work. Once converted, it works like a charm. The full, ugly script now looks like

#!/bin/sh

INPUTFILE=$1
OUTPUTFILE=$2

# Make sure the output file exists before the loop appends to it
touch "$OUTPUTFILE"

awk '{print $1, $2}' "$INPUTFILE" | while read BNAME BNUMBER
do
URL="http://allabolag.se?what=$BNUMBER"
echo "$BNAME" >> "$OUTPUTFILE"
# Strip tags, keep years and turnovers, trim blanks, drop the DOS carriage return, 3 values per row
curl -s "$URL" | grep bokTable | sed -e 's/<[^>]*>//g' | awk 'NR==1,NR==9' | awk 'NR!=4 && NR!=5 && NR!=6' | awk '{
gsub(/^[ \t]+|[ \t]+$/,"");print}' | sed 's/\ /./g' | sed 's/.$//' | paste - - - >> "$OUTPUTFILE"
#echo "--"
done

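
In the script above, sed 's/.$//' removes the last character of every line whatever it is, which works here because that character is the DOS carriage return. A slightly more explicit variant of the cleanup step might look like the sketch below. clean_lines is a hypothetical helper name, not something from the thread, and the sketch leaves spaces inside the numbers untouched.

```shell
# Hypothetical helper: remove DOS carriage returns, then strip leading and
# trailing whitespace from every line read on standard input.
clean_lines() {
    tr -d '\r' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
```

Piping the cleaned lines into paste - - - then gives one row of years and one row of turnovers, tab-separated, for each company.
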
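#!/bin/ksh
# Reads blocks of 7 lines from standard input: company, 3 years, 3 turnovers.
start="Y"
while read comp
do
        read year1
        read year2
        read year3
        # Print the header row only once, using the years from the first block.
        if [ "$start" = "Y" ]
        then
                echo "Company $year1 $year2 $year3"
                start="N"
        fi
        # A turnover line may hold two space-separated parts, e.g. "10 532".
        read a b
        read a1 b1
        read a2 b2
        echo $comp $a $b $a1 $b1 $a2 $b2
done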

If the two -- characters between the sets are actually part of the data, either remove them first, or add another read statement before the done statement.
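
Removing the separator lines first could be done with a small filter in front of the loop. This is a minimal sketch, assuming the separators are lines containing only "--" (possibly surrounded by blanks). strip_separators is a hypothetical name.

```shell
# Hypothetical helper: drop lines that consist only of "--",
# allowing for surrounding whitespace.
strip_separators() {
    grep -v '^[[:space:]]*--[[:space:]]*$'
}
```

The loop would then read from strip_separators instead of the raw data, for example: strip_separators < data.txt | ksh table.ksh (both file names are placeholders).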

I didn't see all the solutions on page 2 until I finished my post. -:slight_smile:
