Tuesday, April 13, 2010

What would I use to remove escaped html from large sets of data.

Programmer Question

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting, and how much filtering was already setup (WP-O-Matic Wordpress plugin using the SimplePie library). This plugin does some basic encoding before insertion using Wordpress's built in post insert function which also does some filtering. Between the RSS feed's encoding, the plugin's encoding using PHP, Wordpress's encoding and SQL escaping, I'm not sure where to start.



The data is usually at the end of the field after the content I want to keep. It is all on one line, but separated out for readability:



<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>



<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>



<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk"



Notice how some of the images are escape and some aren't. I believe this has to do with the last part being cut off so as to be unrecognizable as an html tag, which then caused it to be html endcoded while the actual img tags were left alone.



Another record has only this in one of the fields, which means the RSS feed gave me nothing for the item (filtered out now, but I have a bunch of records like this):



<img src="http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg" alt="post_img" width="80"



All extracted samples are on one line, but broken up for readability. Otherwise, they are copied exactly from the database from the command line mysql client.



Question: What is the best way to work with the above escaped html (or portion of an html tag), so I can then remove it without affecting the content?



I can do it in Perl, PHP, SQL, Ruby, and even Python. I believe Perl to be the best at text parsing, so that's why I used the Perl tag. And PHP times out on large database operations, so that's pretty much out unless I wanted to do batch processing and what not.



[EDIT]
If it was just a matter of pulling the html I wanted out and doing a strip_tags and reinserting the data, I wouldn't be asking this question.



The portion that I have a problem with is that what used to be an img tag was html encoded and the end cut off. If it's deencoded it will not be an html tag, so I cannot parse it the usual way.



With all the <img src=" crap, I can't get my head around searching for it other than SELECT ID, post_content FROM table WHERE post_content LIKE '<img' which at least gets me those posts. But when I get the data, I need a way to find it, remove it, but keep the rest of the content.



[/EDIT]



PS One of the nice things about using Wordpress's insert post function, is that if you use php's strip_tags function to strip out all html, insert post function will insert <p> at the paragraph points.



Let me know if there's anything more that I can answer.



Some article that didn't quite answer my questions.
(http://stackoverflow.com/questions/2016751/remove-text-from-within-a-database-text-field)
(http://stackoverflow.com/questions/462831/regular-expression-to-escape-html-ampersands-while-respecting-cdata)



Find the answer here

No comments:

Post a Comment

LinkWithin

Related Posts with Thumbnails