Related Posts

Archive

Automated Transformation of Microsoft Office Documents via Web-based Form

August 3rd, 2006

*gasps for breath* I really could revise this more…

The following walkthrough will describe setup of a system utilizing Microsoft Windows, Apache, Ruby, and Microsoft Excel to transform XLS format spreadsheets into other desired formats, in a fashion that can be automated.

In short, this is a “How To”. But first, “Why For?”.

The scenario I’m programing for is this. The client sends us a spreadsheet - as an Excel 2003 document. It is our job to post that spreadsheet to their website, in another pre-defined format. Which was a crazy and convoluted format. If our automated process can be the back end of a web-based form, the people upstairs can do the work themselves.

Required Software

The most dissapointing issue with setting up the system was that the newest versions of Apache and Ruby can not be used.

The second big dissapointment (On the IT departments part) was the difficulty of getting IIS to work with the same scripts. Although Ruby and IIS can get along, and Ruby and Excel can get along, IIS and Excel throw vague security error after vague security error.

Now, the software:

Possibly less restrictive is The versions of Microsoft Excel and Windows required. Our final system is running with Excel 2003 and Windows 2000.

Setup

This setup is partially based on “WARR Installation” at rawlinson.us

  1. Install Apache 2.0.x - the default directory in “C:\Program Files\” is acceptable.
  2. Install Ruby 1.8.2 - I would suggest changing the default directory to “C:\Program Files\ruby\”
  3. Install RubyForApache 1.3.1 - It shhould be able to detect the previouly selected installation directories
  4. Install Microsoft Excel and confirm it is working

Configuration

  1. Edit “C:\Program Files\Apache Group\Apache2\conf\httpd.conf” to add the following lines:

    In the “LoadModule” section, uncomment the line:

    LoadModule rewrite_module modules/mod_rewrite.so

    and add the lines:

    LoadModule fastcgi_module modules/mod_fastcgi.so
    LoadModule ruby_module modules/mod_ruby.so

    After “Bring in additional module-specific configurations”, add:


    <IfModule mod_fastcgi.c>
    AddHandler fastcgi-script .fcgi
    </IfModule>
     
    <IfModule mod_ruby.c>
    # for Apache::RubyRun
    RubyRequire apache/ruby-run
    RubySafeLevel 1
     
    # exec files under /ruby as Ruby scripts.
    <Location /ruby>
    SetHandler ruby-object
    RubyHandler Apache::RubyRun.instance
    Options +ExecCGI
    </Location>
     
    # exec *.rbx as Ruby scripts.
    <Files *.rbx>
    SetHandler ruby-object
    RubyHandler Apache::RubyRun.instance
    </Files>
    </IfModule>

  2. Edit “C:\Program Files\Apache Group\Apache2\conf\mime.types”

    Find the line
    text/html html htm
    And edit it ro read:
    text/html html htm rbx

  3. Open the Apache Monitor (a new icon should have appeared in the taskbar for it), and use it to restart Apache2. If the configuration was correctly changed, it will restart with no errors.
  4. Point your browser to either http://localhost/ , http://127.0.0.1/, or wherever your server is located. The standard Apache 2.0 “Not what you Expected” page should show up.

The Ruby Script

Now, according to the conventions set in the configuration above, “C:\Program Files\Apache Group\Apache2\htdocs\” is the directory you webpage will go in, and “C:\Program Files\Apache Group\Apache2\htdocs\ruby” (a folder that will need to be created) will house your .rbx scripts and custom .rb includes.

The Form

<form method="post" action="/ruby/parse.rbx" enctype="multipart/form-data">
<fieldset>
<legend>Upload Excel Spreadsheet</legend>
<p>Uploading The same file name again will overwrite previous results.</p>
<input type="file" name="myfile" id="myfile" /><br /><br />
<input type="submit" value="Submit File"/>
</fieldset>
</form>

The method and enctype are required for file uploads.

The Parsing Script

These are selections of the “parse.rbx” mentioned above. whatever the file name is, it should be in the ruby folder.

#!"C:\Program Files\ruby\bin\ruby.exe"
# above line changes depending on installation
 
require 'cgi' #POST
require 'stringio' #treat files like strings
require 'win32ole' #control Excel
require 'erb' #interpret with template
require 'date' #time script was run
require 'FileUtils' #local copy for command line usage
 
puts "<html><body>"
 
#####################################
# Constants
# like 'Get date when parser is run'
#####################################
 
$updated = DateTime.now

installed_dir = FileUtils.pwd.to_s.gsub('ruby','')
installed_dir.untaint
 
template = "leaseTemplate.tpl"
 
fromfile = ""
fromfilename = ""
tofile = ""
 
logfile = File.new( installed_dir + "log.txt", 'a+') # { |file| file.write($updated); file.write("\n") }
 
logfile.write($updated)
logfile.write("\n")

Here yoy will notice the first use of untaint. In a server application, any string obtained from outside sources - like from pwd - is considered tainted, and a security issue. Programmers are at least forced to see where these issues can come up.

# if being called from apache/IIS
if (ARGV.empty?)
 
# save file posted by form locally
# copy the file - 'wb' is write in binary mode (IMPORTANT FOR WINDOWS)
cgi = CGI.new()
 
# should only get files from upload form
if cgi.referer !~ /upload\.rbx/
puts '</head><body><p>'
puts 'Error: not called with leaseparser form'
puts '</p><p><a href="/">Go to full form</a></p></body></html>'
logfile.write("\tError: not called with leaseparser form")
exit
end
 
fromfile = cgi.params['myfile'].first
fromfilename = fromfile.original_filename.split(/\\/)[-1]
 
# File not specified error
# reload upload.rb with specified error showing
# log error
if fromfilename == nil
puts ‘</head><body><p>’
puts ‘Error: No file specified’
puts ‘</p><p><a href=”/ruby/upload.rbx”><- back</a></p></body></html>’
 
logfile.write(”\tError: No file specified\n”)
exit
end
 
if fromfilename !~ /\.xls$/
puts ‘</head><body><p>’
puts ‘Error: ‘ + fromfilename.inspect + ‘ is not a spreadsheet’
puts ‘</p><p><a href=”/ruby/upload.rbx”><- back</a></p></body></html>’
 
logfile.write(”\tError: ” + fromfilename.inspect + ” is not a spreadsheet\n”)
exit
end
 
tofile = installed_dir + “parsed/” + fromfilename
File.open(tofile.untaint, ‘wb’) { |file| file.write(fromfile.read) }
 
else
 
if ARGV[0] !~ /\.xls$/
puts ARGV[0].inspect + ” is not a spreadsheet”
exit
end
 
fromfile = ARGV[0].gsub(/[A-Za-z]:\\/, ‘/’).gsub(/\\/, ‘/’)
fromfilename = fromfile.split(/\//)[-1]
tofile = installed_dir + “parsed/” + fromfilename
 
FileUtils.cp(fromfile, tofile)
end
 
logfile.write(”\t” + fromfilename.inspect + ” has been uploaded\n”)
outdir = installed_dir.untaint + “parsed/” + fromfilename.gsub(/\.xls/,”).gsub(/ /, ‘_’) + ‘/’
 
FileUtils.mkdir outdir.untaint if !FileTest.exist?( outdir.untaint )

This code has an trick to it - to debug a ruby script running through apache, you have to look at the apache server logs. ARGV is the array of arguments past on the command line. So, if ruby recieves command line arguments, it will copy a local file and not use cgi. Makes for much easier debugging.

Also, “wb” as a flag for file copying. Without it, ruby will treat an excel file as a text file, and “fix” the line endings, corrupting the file.

# open up excel for processing file (no gui)
begin
excel = WIN32OLE::new('excel.Application')
rescue
print "Location: upload.rb?error=" + u( "Error: " + $!) + "\n\n" if (ARGV.empty?)
logfile.write("\tError: " + $!)
exit
end
 
workbook = excel.Workbooks.Open(tofile)
 
# three-dimentional arrays for storage, instead of text files
name = Array.new
data = Array.new
date = Array.new
 
#read info into arrays
workbook.Worksheets.count.downto(1) { |workbookNo|
worksheet = workbook.Worksheets(workbookNo)
worksheet.Select
name[workbookNo] = workbook.Worksheets(workbookNo).name
data[workbookNo] = worksheet.Range(worksheet.range(”c8″).currentregion.address.gsub(”$A$6″,”$C$8″)).value2
date[workbookNo] = worksheet.Range(”$F$1″).text
}
 
#exit excel and free up memory
excel.Quit
logfile.write(”\tData Imported from Excel\n”)

Here, a headless instance of Excel is started, and finished as quickly as possible. Since the instance is headless, an error in your code before the application is closed will leave an unusable 15MB chunk of memory on your system (at least until it is manually killed).

After the program is started and your uploaded spreadsheet imported by the excel instance, individual workbooks and worsheets can be accessed as arrays (through win32ole). More examples of this are at http://wiki.rubygarden.org/Ruby/page/show/ScriptingExcel

html_template = ""
File.open(template, 'r') { |file| html_template << file.read }
logfile.write("\tHTML Template Loaded\n")

After this, my program goes on to load a template for erb.

Our data is sorted and filtered and ends up being stored in a global variable - $leasedata. I found that only global variabes work for transfering the array’s data to erb and our template.

erb = ERB.new(html_template.untaint)
@output = erb.result()
logfile.write("\t\tTransformation run\n")
 
outfile = outdir + name[idx].downcase + “.html”
 
File.open(outfile, ‘wb’) { |file| file.write(@output.to_s) }
logfile.write(”\t\t” + outfile.to_s.split(/\//)[-1] + ” Created\n”)
 

Now, as you might have guessed by after seeing some inconsistant variables, the working code is inside a loop - one loop for evry worksheet in the loaded workbook.

The erb template also allows loops, so don’t discount using code lok $leasedata.each or $leasedata.collect { |xx| xx.variable}. It is possible to make a template NO hardcoded values starting with these functions.

Posted in Ruby |

One Response to “Automated Transformation of Microsoft Office Documents via Web-based Form”

  1. Automated Transformation of Microsoft Office Documents via Web-based Form Says:

    [...] The erb template also allows loops, so don’t discount using code lok $leasedata.each or $leasedata.collect { |xx| xx.variable}. It is possible to make a template NO hardcoded values starting with these functions. Source: Automated Transformation of Microsoft Office Documents via Web-based Form - Ruby [Feed] [...]

Previous post: In the Office Today: Flex 2

Next Post: Forget Petrol - Hydrogen as Fuel