Programmer Question
Suppose I have tokenized an HTML document. How could I transform it into a new document, or apply other transformations to it?
For example, suppose I have this HTML:
<html>
<body>
<p><a href="/foo">text</a></p>
<p>Hello <span class="green">world</span></p>
</body>
</html>
So far I have written a tokenizer that outputs a stream of tokens. For this document they would be (written in pseudocode):
TAG_OPEN[html] TAG_OPEN[body] TAG_OPEN[p] TAG_OPEN[a] TAG_ATTRIBUTE[href]
TAG_ATTRIBUTE_VALUE[/foo] TEXT[text] TAG_CLOSE[a] TAG_CLOSE[p]
TAG_OPEN[p] TEXT[Hello] TAG_OPEN[span] TAG_ATTRIBUTE[class]
TAG_ATTRIBUTE_VALUE[green] TEXT[world] TAG_CLOSE[span] TAG_CLOSE[p]
TAG_CLOSE[body] TAG_CLOSE[html]
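To make the stream concrete, here is a minimal Python sketch of one way the tokens could be represented and written back out as HTML. The `(kind, value)` tuple layout and the `serialize` name are assumptions made for illustration, not part of the tokenizer described above, and attribute values are not escaped.

```python
# Hypothetical token layout: (kind, value) tuples named after the pseudocode.
tokens = [
    ("TAG_OPEN", "a"),
    ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"),
    ("TEXT", "text"),
    ("TAG_CLOSE", "a"),
]

def serialize(tokens):
    """Turn a token stream back into an HTML string (no escaping)."""
    out = []
    in_tag = False  # True while an opening tag's ">" has not been written yet
    for kind, value in tokens:
        if kind == "TAG_ATTRIBUTE":
            out.append(f" {value}")
            continue
        if kind == "TAG_ATTRIBUTE_VALUE":
            out.append(f'="{value}"')
            continue
        if in_tag:
            # Any non-attribute token ends the pending opening tag.
            out.append(">")
            in_tag = False
        if kind == "TAG_OPEN":
            out.append(f"<{value}")
            in_tag = True
        elif kind == "TAG_CLOSE":
            out.append(f"</{value}>")
        elif kind == "TEXT":
            out.append(value)
    if in_tag:
        out.append(">")
    return "".join(out)

print(serialize(tokens))  # <a href="/foo">text</a>
```

With a serializer like this, a transformation only has to produce a new token stream; turning it into a document is a separate, final step.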
But now I have no idea how to use this stream to perform transformations.
For example, I would like to rewrite the TAG_ATTRIBUTE_VALUE[/foo] that follows TAG_OPEN[a] TAG_ATTRIBUTE[href] to something else.
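A rewrite like this might not need a tree at all. The sketch below is one possible approach, assuming tokens are `(kind, value)` tuples: a generator that remembers the most recent tag and attribute name and replaces the value only when both match. The function name `rewrite_href` and its parameters are hypothetical.

```python
def rewrite_href(tokens, new_value):
    """Yield tokens unchanged, except the href value of <a> tags."""
    tag = None   # name of the opening tag currently being read, if any
    attr = None  # name of the most recent attribute in that tag
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            tag, attr = value, None
        elif kind == "TAG_ATTRIBUTE":
            attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if tag == "a" and attr == "href":
                value = new_value
            attr = None
        else:
            # TEXT or TAG_CLOSE: we are no longer inside an opening tag.
            tag = attr = None
        yield kind, value

sample = [
    ("TAG_OPEN", "a"),
    ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"),
    ("TEXT", "text"),
    ("TAG_CLOSE", "a"),
]
rewritten = list(rewrite_href(sample, "/bar"))
print(rewritten[2])  # ('TAG_ATTRIBUTE_VALUE', '/bar')
```

Because the function consumes and yields one token at a time, several such functions can be chained without holding the whole document in memory.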
Another transformation I would like is to output the value of the TAG_ATTRIBUTE[href] of each TAG_OPEN[a] in parentheses after the closing tag. For example,
<a href="/foo">text</a>
gets rewritten into
<a href="/foo">text</a>(/foo)
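This transformation needs a little more state, because the href value is seen at the opening tag but emitted after the closing tag. One possible sketch, assuming `(kind, value)` tuples and a well-formed stream, keeps a stack with one entry per open `a` element. The name `append_href` is hypothetical.

```python
def append_href(tokens):
    """After each closing </a>, emit a TEXT token with its href in parentheses."""
    stack = []          # one entry per open <a>: its href value, or None
    tag = attr = None   # opening tag and attribute currently being read
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            tag, attr = value, None
            if value == "a":
                stack.append(None)
        elif kind == "TAG_ATTRIBUTE":
            attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if tag == "a" and attr == "href" and stack:
                stack[-1] = value
        else:
            tag = attr = None
        yield kind, value
        if kind == "TAG_CLOSE" and value == "a" and stack:
            href = stack.pop()
            if href is not None:
                yield "TEXT", f"({href})"

link = [
    ("TAG_OPEN", "a"),
    ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"),
    ("TEXT", "text"),
    ("TAG_CLOSE", "a"),
]
result = list(append_href(link))
print(result[-1])  # ('TEXT', '(/foo)')
```

An `a` element without an href leaves `None` on the stack, so nothing is appended for it.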
What is the general strategy for doing such transformations? There are many other transformations I would like to do, like stripping all tags and just leaving TEXT content, adding tags after some specific tags, etc.
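Stripping tags is the simplest case, since it needs no state at all: it is a filter over the stream. A minimal sketch, again assuming `(kind, value)` tuples (the name `strip_tags` is hypothetical):

```python
def strip_tags(tokens):
    """Keep only TEXT tokens, dropping every tag and attribute token."""
    return [tok for tok in tokens if tok[0] == "TEXT"]

paragraph = [
    ("TAG_OPEN", "p"),
    ("TEXT", "Hello"),
    ("TAG_OPEN", "span"),
    ("TAG_ATTRIBUTE", "class"),
    ("TAG_ATTRIBUTE_VALUE", "green"),
    ("TEXT", "world"),
    ("TAG_CLOSE", "span"),
    ("TAG_CLOSE", "p"),
]
text_only = strip_tags(paragraph)
# Joining with a space is an assumption: the token stream shown above
# does not record the original whitespace between words.
print(" ".join(value for _, value in text_only))  # Hello world
```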
Do I need to create a parse tree? I have never done that and don't know how to build one from a stream of tokens. Or can I do it some other way?
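For transformations that depend on nesting, a tree can be built from the stream with a stack of open elements. The sketch below is one way this could look, assuming `(kind, value)` tuples and well-formed input. The `Node` class and `build_tree` function are hypothetical names, and void elements such as `br`, which have no closing tag, are not handled.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node objects or text strings

def build_tree(tokens):
    """Build a tree from a well-formed token stream using a stack."""
    root = Node("#root")   # artificial root that holds the top-level nodes
    stack = [root]         # stack[-1] is the innermost open element
    attr = None            # attribute name waiting for its value
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
            attr = None
        elif kind == "TAG_ATTRIBUTE":
            attr = value
            stack[-1].attrs[attr] = ""   # attribute without a value so far
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if attr is not None:
                stack[-1].attrs[attr] = value
        elif kind == "TEXT":
            stack[-1].children.append(value)
        elif kind == "TAG_CLOSE":
            # Mismatched closing tags are silently ignored in this sketch.
            if len(stack) > 1 and stack[-1].tag == value:
                stack.pop()
    return root

doc = [
    ("TAG_OPEN", "p"),
    ("TAG_OPEN", "a"),
    ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"),
    ("TEXT", "text"),
    ("TAG_CLOSE", "a"),
    ("TAG_CLOSE", "p"),
]
tree = build_tree(doc)
link_node = tree.children[0].children[0]
print(link_node.tag, link_node.attrs)  # a {'href': '/foo'}
```

Once the tree exists, a transformation becomes a recursive walk that edits `attrs` or `children`, followed by a walk that writes the nodes back out.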
Any suggestions are welcome.
And one more thing - I would like to learn all this parsing myself, so I am not looking for a library!
Thanks in advance, Boda Cydo