Faster to program, faster to run.
Over the last two days, I have been graced with the fun task of taking nine thousand Word documents - all letters - and extracting the letter bodies into a new plain-text file (i.e. without the names and addresses).
I decided that Java would do the best job of the extraction. Alas, there is no reader for Word documents. Using a great little program named WordConvs, I was able to convert all of the documents into HTML (a process which I started at 2:30pm and was still running when I left at 5).
Next was the fun step: using the new regular expressions in Java 1.4 to extract the guts of the letters and strip the HTML formatting. This turned out to be quite simple. Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String extractLetter(String fileText) {
    // Match everything up to the closing "Sincerely"; the offsets trim the
    // "Start Here" bookmark from the front and the sign-off from the back.
    Pattern p = Pattern.compile("([\\w\\W])*Sincerely");
    Matcher m = p.matcher(fileText);
    m.find();
    return fileText.substring(m.start() + 20, m.end() - 16);
}

public static String extractHTML(String fileText) {
    // Strip <I>, <A ...> and <P ...> tags (opening and closing) plus &nbsp; entities.
    Pattern p = Pattern.compile("(<([/]|)[IAP]( [\\w]*=\"[\\w]*\")*>)|&nbsp;");
    Matcher m = p.matcher(fileText);
    return m.replaceAll("");
}
That's it! All the letters had a bookmark "Start Here" and ended with "Sincerely".
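As a rough illustration, here is the tag-stripping pattern run against a made-up fragment of converter output (not one of the actual letters):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripDemo {
    public static void main(String[] args) {
        // Hypothetical fragment of the HTML the converter produces.
        String html = "<P>Dear reader,&nbsp;thanks for <I>everything</I>.</P>";
        // Same pattern as extractHTML above: <I>/<A>/<P> tags and &nbsp; entities.
        Pattern p = Pattern.compile("(<([/]|)[IAP]( [\\w]*=\"[\\w]*\")*>)|&nbsp;");
        Matcher m = p.matcher(html);
        System.out.println(m.replaceAll(""));
        // prints "Dear reader,thanks for everything."
    }
}
```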
Then all my program does is get a list of all the files in a given directory and go through them one by one, extracting the guts, stripping the HTML and placing the resulting plain text into one big file.
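That driver loop can be sketched roughly like this (the paths and class name are placeholders, and the regex steps are stubbed out with a pass-through; the real versions are the two methods shown earlier). It sticks to Java 1.4-era constructs, so no for-each loop:

```java
import java.io.*;

public class LetterBatch {

    // Stand-in for the extractLetter/extractHTML steps shown above.
    static String process(String fileText) {
        return fileText;
    }

    public static void main(String[] args) throws IOException {
        // Directory of converted HTML letters; path is made up for this sketch.
        File dir = new File(args.length > 0 ? args[0] : "letters-html");
        File[] files = dir.listFiles();
        // One big output file collecting every extracted letter body.
        PrintWriter out = new PrintWriter(new FileWriter("letters.txt"));
        for (int i = 0; i < files.length; i++) {
            // Slurp the whole file into a string, line by line.
            StringBuffer sb = new StringBuffer();
            BufferedReader in = new BufferedReader(new FileReader(files[i]));
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            out.println(process(sb.toString()));
        }
        out.close();
    }
}
```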
Why am I telling you all this?
Well, to my amazement, Java (on Linux) was able to read nine thousand files - off a mounted network drive - process them with regexes and write the output to disk in less than two minutes! And the computer is hardly a powerhouse: it runs Linux with 224 MB of RAM and a 1.8 GHz P4.
The entire .java file (with debugging code and blank lines removed) weighs in at 53 lines. It took only 53 lines of Java code and two minutes to dissect nine thousand HTML letters and write them out as plain text - in my book, that is pretty good.
Will.