William Denniss
tankammo.com
Current Projects
PSP Title
I am currently working on a commercial PSP title, details of which will be available on release.
Finished Projects
Developed in 2005 with Philip Worthington, "Projected Lineriders cars follow lines that people draw on the surface with pens, speeding up and slowing down according to a visual annotation language. The cars skid and crash and jump over obstacles like hands, bridging the space between the physical and virtual. The game is open ended, nurturing peoples' own creativity and imagination as they strive to create the perfect track."
A high-performance accelerated 3D graphics rendering engine and scenegraph, with some functionality compatible with the old Java3D scenegraph. I was previously an active member of the Xith3D development team.
Odejava is an API which allows Java developers to use the ODE physics engine in their Java projects in an object-oriented fashion. It is capable of working closely with Xith3D. I was previously an active member of the Odejava development team.
Digital picture manager for use with digital cameras and online photo galleries.
Internet password and username remembering program.

Who are you calling slow?
22 Oct 2003

Faster to program, faster to run.

Over the last two days, I have been graced with the fun task of taking nine thousand Word documents, each of them a letter, and extracting the letter bodies as plain text into a new file (i.e. without the names and addresses).

I decided that Java would do the best job of the extraction. Alas, Java has no reader for Word documents. Using a great little program named WordConvs, I was able to convert all of the documents into HTML format (a process which I started at 2:30pm and which was still running when I left at 5).

Next was the fun step - using the new regular expressions in Java 1.4 to extract the guts of the letters and remove the HTML formatting. This turned out to be quite simple. I'll post my code below:

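	public static String extractLetter (String fileText) {

		// The literal appears to have also held the HTML for the "Start Here"
		// bookmark (before the group) and the tags around "Sincerely"; that
		// markup is not shown here. The offsets 20 and 16 skip those markers.
		Pattern p = Pattern.compile("([\\w\\W])*Sincerely");
		Matcher m = p.matcher(fileText);
		m.find();
		return fileText.substring(m.start()+20, m.end()-16);
	}

	public static String extractHTML (String fileText) {

		// Strips opening and closing I, A and P tags, including simple attributes.
		Pattern p = Pattern.compile("(<([/]|)[IAP]( [\\w]*=\"[\\w]*\")*>)| ");
		Matcher m = p.matcher(fileText);
		return m.replaceAll("");
	}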

That's it! All letters had a bookmark "Start Here" and ended with "Sincerely".
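For anyone wanting to try the same idea today, here is a minimal sketch of the two steps in current Java. It is not my original code: the marker strings START and END are hypothetical stand-ins for whatever HTML the converter actually wrote around the bookmark and the sign-off, and the class and method names are made up for illustration. It also swaps the substring offsets for a capture group, so the marker lengths never have to be counted by hand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterExtractor {
    // Hypothetical markers; the real HTML around the bookmark and the
    // sign-off depends on the converter's output.
    static final String START = "<A NAME=\"StartHere\"></A>";
    static final String END = "<P>Sincerely</P>";

    // Pattern.quote treats the markers as literal text, DOTALL lets '.'
    // match line breaks, and the reluctant '.*?' stops at the first sign-off.
    static final Pattern LETTER = Pattern.compile(
            Pattern.quote(START) + "(.*?)" + Pattern.quote(END), Pattern.DOTALL);

    // Opening or closing I, A and P tags with optional attributes, plus &nbsp;.
    static final Pattern MARKUP = Pattern.compile(
            "</?[IAP](\\s+\\w+=\"[^\"]*\")*\\s*>|&nbsp;");

    // Returns the text between the two markers, or null if they are missing.
    public static String extractLetter(String fileText) {
        Matcher m = LETTER.matcher(fileText);
        return m.find() ? m.group(1) : null;
    }

    // Removes the markup matched by MARKUP and leaves everything else alone.
    public static String stripHtml(String fragment) {
        return MARKUP.matcher(fragment).replaceAll("");
    }
}
```

Returning null when the markers are missing is a choice made for this sketch; it makes it easy to spot a letter that did not follow the usual layout.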

Then all my program does is get a list of all files from a given directory and go through them one by one, extracting the guts, stripping the HTML and placing the resulting plain text into one big file.
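The loop over the directory can be sketched roughly like this. Again, this is an illustration rather than my original 53 lines: the class name LetterBatch, the "*.html" filter and the UTF-8 encoding are all assumptions, and the per-file work is passed in as a function so the sketch stands on its own.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

class LetterBatch {
    // Applies 'transform' to every *.html file in inputDir and writes the
    // results, one after another, to outputFile (overwriting any existing
    // file). Returns the number of files processed.
    public static int processDirectory(Path inputDir, Path outputFile,
                                       UnaryOperator<String> transform) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.html");
             BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            for (Path file : files) {
                // Assumes UTF-8 input; undecodable bytes are replaced, not rejected.
                String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                out.write(transform.apply(html));
                out.newLine();
                count++;
            }
        }
        return count;
    }
}
```

The transform argument is where the extract-then-strip step would go. Note that a DirectoryStream makes no promise about the order of the files, so the list would need sorting first if the order of the letters in the big file matters.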

Why am I telling you all this?

Well, to my amazement Java (in Linux) was able to read nine thousand files - off a mounted networked drive - process them with regex and write the output to disk in less than two minutes! And the computer is hardly a powerhouse: it runs Linux with 224MB of RAM and a 1.8GHz P4 CPU.

The entire .java file (with debugging code and blank lines removed) weighs in at 53 lines. It only took 53 lines of Java code and two minutes to dissect nine thousand HTML letters and write them to a plain text file - in my book that is pretty good.

Will.
