Thursday, February 3, 2011

Finding "Keywords" with potentially damaged HTML Files and Counting Hits

Programmer Question

I'm trying to create a master index file for a bunch of HTML files sitting in a directory. There could be anywhere from 5 to 5000. These files aren't clean or nice, so some of the libs I looked at don't seem like they would play nice. Many of these files come from the temp directory or are carved out of the file slack (ergo incomplete files in many cases). Plus, sometimes people just write sloppy HTML.



I've basically decided to enumerate through the directory and use something like



string[] FileEntries = Directory.GetFiles(WhichDirectory);

foreach (string FileName in FileEntries)
{
    string HTMLContents;
    using (StreamReader sr = new StreamReader(FileName))
    {
        // Read the whole file into a single string
        HTMLContents = sr.ReadToEnd();
    }
}


I'm hoping that the StreamReader can read the contents of an HTML file into a string the same way it would a text file.



Anyway, given that this might not be the cleanest HTML in the world, there are a few things I'd like to parse out of the contents.




  1. Any instance of a date in ANY format (e.g. 1/1/11, January 1st, 2011, 1-1-11, Jan-1-2011, etc.), dumped into a string to be read back later. Hopefully there is a lib or something for finding "instances" of dates.


  2. Read a text file line by line with various "keywords" to look for in the mess of HTML. Things like "Bob Evans" or "Sausage Factory Ltd" etc. I then want to count the number of times each "keyword" shows up. The problem is I don't want the user to have to know regular expressions.
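For the first item, here is a minimal sketch of what a regex-based date finder might look like. The class name DateFinder, the method FindDates, and the pattern itself are assumptions made up for illustration; the pattern only covers the example formats listed above (numeric dates with / or -, and month names followed by a day and year), so it is nowhere near "ANY format" and will produce some false positives on messy input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateFinder
{
    // Hypothetical pattern: handles only the sample formats from the question.
    private static readonly Regex DatePattern = new Regex(
        @"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}" +                          // 1/1/11, 1-1-11
        @"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*" +  // Jan, January, ...
        @"[\s-]+\d{1,2}(?:st|nd|rd|th)?(?:,\s*|[\s-]+)\d{2,4})\b",       // 1st, 2011 or -1-2011
        RegexOptions.IgnoreCase);

    // Returns every substring of text that matches the pattern, in order of appearance.
    public static List<string> FindDates(string text)
    {
        var found = new List<string>();
        foreach (Match m in DatePattern.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Calling DateFinder.FindDates(HTMLContents) on each file's contents would give the list to join into the "Dates Found" string. Because it runs on the raw text, a date split across tags would be missed.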




So, the desired output would be something like this:




BobEvans9304902.html

Title: Bob Evans Secret Sausage Recipe



Dates Found: "October 2nd, 2009" , "7/22/09"



"Bob Evans Sausage" : 30 hits



"Paprika" : 2 hits



"Don't overwork it" : 5 hits




All the solutions I have seen so far seem like they only work for single characters or words (LINQ) or split a "neat" sentence into words. I'm hoping I won't have to create a new copy of the string and strip out all the HTML tags, since it's not always going to be neat and I don't want to add another step to mass file processing. If that's the only way to do it, though, so be it.
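To make the keyword part concrete, here is a rough sketch of counting multi-word phrases as plain literal text with String.IndexOf, so nobody has to write a regular expression. KeywordCounter, CountHits, and CountAll are hypothetical names, and the choice of case-insensitive, non-overlapping matching is an assumption, not a requirement from above.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordCounter
{
    // Counts non-overlapping, case-insensitive occurrences of a literal phrase.
    public static int CountHits(string text, string keyword)
    {
        if (string.IsNullOrEmpty(keyword)) return 0;

        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(keyword, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += keyword.Length; // continue searching after this hit
        }
        return count;
    }

    // Takes one keyword per line (blank lines skipped) and maps each to its hit count.
    public static Dictionary<string, int> CountAll(string text, IEnumerable<string> keywords)
    {
        var hits = new Dictionary<string, int>();
        foreach (string line in keywords)
        {
            string keyword = line.Trim();
            if (keyword.Length > 0)
            {
                hits[keyword] = CountHits(text, keyword);
            }
        }
        return hits;
    }
}
```

Something like KeywordCounter.CountAll(HTMLContents, File.ReadAllLines(keywordFile)) would then produce the per-file counts, where keywordFile is a placeholder for wherever the keyword list lives. This searches the raw HTML without stripping tags, so a phrase broken up by markup (for example "Bob <b>Evans</b>") would not be counted.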



