Journal connection - obsessive mode
Jan. 14th, 2012 05:01 pm
As usual I can't specify exactly what inspired me to undertake this project, but over the last couple of days I was suddenly driven to enhance the connection between my journal and my personal site (remember when people had those?). The journal has always been on the front page of it, but all that page did was read the RSS feed and lay it out, so it could only ever display the latest few posts.
My aim was to make my site's impression of my journal much more full-featured, so that - ideally - Livejournal would become a backend and you could explore the whole thing directly through the site. Livejournal's API doesn't really provide for this, seeming more geared towards clients that will actually draw out all the journal data and keep themselves synchronized, so I wrote a parser to do it myself.
Actually, calling it a "parser" is extremely generous - it uses the normal LJ pages as web services, and these are full of JavaScript and other extras that make them fairly unparseable, meaning that most of its job is to chew through the page until it finds something that it considers interesting. It does this by looking for pairs of delimiters to find the post title, content, date and so on - if it finds the start delimiter, it returns everything from that point up to the matching end delimiter.
<title>davidn: The greatest bug report ever</title>
An easy one - just get it out of the page's title tag, and it's always prefixed with "davidn: ".
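As a rough sketch of the delimiter search described above, here it is in Python rather than the PHP of the real thing. The function names are made up for illustration, and returning None when a delimiter is missing is an assumption about how a failed match might be handled.

```python
def extract_between(page, start, end):
    """Return the text between the first `start` and the next `end`.

    Returns None if either delimiter is missing (an assumed behaviour,
    not necessarily what the original PHP does).
    """
    begin = page.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = page.find(end, begin)
    if finish == -1:
        return None
    return page[begin:finish]


def extract_title(page, prefix="davidn: "):
    """Pull the post title out of the <title> tag and drop the journal prefix."""
    title = extract_between(page, "<title>", "</title>")
    if title is not None and title.startswith(prefix):
        title = title[len(prefix):]
    return title


page = "<html><head><title>davidn: The greatest bug report ever</title></head></html>"
print(extract_title(page))  # prints: The greatest bug report ever
```

The same extract_between helper would cover the other fields too, given a suitable pair of delimiters for each.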
<div class="b-singlepost-body">
<div align="justify">Blah blah blah, I'm DavidN and I never shut up</div>
</div>
This one is a bit risky - further on than this, the tags change depending on whether the post has tags or comments or not, and I can't just check for a div closing tag because the entries themselves might use them. This only works because I don't tend to have four spaces followed by a div closing tag anywhere in my journal markup.
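To make the four-spaces trick concrete, here is a minimal Python sketch (again a stand-in for the PHP). The sample page is invented, and it assumes the closing tag of the body container really is indented by four spaces in the served HTML, as described above.

```python
BODY_START = '<div class="b-singlepost-body">'
BODY_END = "    </div>"  # four spaces, then the closing tag


def extract_body(page):
    """Return the entry markup between the body container's open and close tags."""
    begin = page.find(BODY_START)
    if begin == -1:
        return None
    begin += len(BODY_START)
    finish = page.find(BODY_END, begin)
    if finish == -1:
        return None
    return page[begin:finish].strip()


# Invented sample: the inner </div> is not preceded by four spaces,
# so the search skips it and stops at the indented outer one.
page = (
    '<div class="b-singlepost-body">\n'
    '<div align="justify">Blah blah blah</div>\n'
    "    </div>\n"
    "<p>rest of the page</p>\n"
)
print(extract_body(page))
```

An entry that happened to contain four spaces followed by a closing div would be cut short at that point, which is exactly the risk mentioned above.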
<span class="b-singlepost-author-date">
<a href="http://davidn.livejournal.com/2010/">2010</a>
-
<a href="http://davidn.livejournal.com/2010/09/">09</a>
-
<a href="http://davidn.livejournal.com/2010/09/07/">07</a>
15:04:00
</span>
The general position of the date is easy to identify but is in an awkward tangle of links - this fragment gets parsed as XML to get the numbers out.
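One way the XML step could look, sketched in Python with the standard library's ElementTree instead of whatever PHP uses. It assumes the fragment is well-formed XML with exactly three links (year, month, day) and the time as loose text after the last one, as in the example above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

FRAGMENT = (
    '<span class="b-singlepost-author-date">\n'
    '<a href="http://davidn.livejournal.com/2010/">2010</a>\n'
    "-\n"
    '<a href="http://davidn.livejournal.com/2010/09/">09</a>\n'
    "-\n"
    '<a href="http://davidn.livejournal.com/2010/09/07/">07</a>\n'
    "15:04:00\n"
    "</span>"
)


def parse_post_date(fragment):
    """Turn the date span into a datetime."""
    root = ET.fromstring(fragment)
    links = root.findall("a")
    if len(links) != 3:
        raise ValueError("expected year, month and day links")
    year, month, day = (int(link.text) for link in links)
    # The time is the text that trails the last link inside the span.
    time_text = (links[-1].tail or "").strip()
    hour, minute, second = (int(part) for part in time_text.split(":"))
    return datetime(year, month, day, hour, minute, second)


print(parse_post_date(FRAGMENT))  # prints: 2010-09-07 15:04:00
```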
Lists of posts from the months surrounding the current entry are provided through the calendar page (the HTML for which is actually much easier to parse), and navigation is also possible by grabbing the links from the forward and back arrows at the top of each post. Livejournal does this in a rather odd way, sending a "go=next/prev" parameter along with the original ID to redirect you to the new post instead of going to the 'view post' page with the new ID directly - but copying this behaviour worked without problems, as long as I remembered to get the post ID out of the HTML that came in, instead of relying on the ID passed to the page being the actual ID of the post.
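A small Python sketch of the two halves of that navigation: building the request, and then trusting the page that comes back rather than the ID that went out. The "itemid" parameter name and the idea that the returned page contains a link of the form http://davidn.livejournal.com/NUMBER.html are both assumptions for illustration, not something confirmed here.

```python
import re
from urllib.parse import urlencode

# Assumed shape of an entry permalink inside the returned HTML.
ENTRY_URL = re.compile(r"http://davidn\.livejournal\.com/(\d+)\.html")


def navigation_query(post_id, direction):
    """Build the query string asking for the post before or after post_id."""
    if direction not in ("next", "prev"):
        raise ValueError("direction must be 'next' or 'prev'")
    # "itemid" is a hypothetical parameter name; "go" comes from the post.
    return urlencode({"itemid": post_id, "go": direction})


def actual_post_id(html):
    """Read the post ID from the page that came back after the redirect.

    Returns None if no permalink is found, rather than falling back to
    the ID that was originally requested.
    """
    match = ENTRY_URL.search(html)
    return int(match.group(1)) if match else None


print(navigation_query(100, "next"))  # prints: itemid=100&go=next
```

The point of actual_post_id is that after a go=next request for post 100, the page describes some other post, so its ID has to come from the HTML itself.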
Take, for example, this post I made about Red Alert two years ago - the navigation is quite basic and there are a couple of other things I want to do (like replacing all links to my own journal with links to this parsing page as it writes them out), but it gives you pretty much all you need to flick through the journal. It means I can now link people directly to my own site when I mention something I've written in an entry.
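The link replacement mentioned above isn't done yet, but it could be sketched roughly like this in Python. The local page name "/journal.php" and its "id" parameter are hypothetical, and the pattern assumes entry links look like http://davidn.livejournal.com/NUMBER.html.

```python
import re

# Matches href attributes pointing at a plain entry permalink.
# Links with query strings or anchors after ".html" are left alone.
JOURNAL_LINK = re.compile(r'href="http://davidn\.livejournal\.com/(\d+)\.html"')


def rewrite_journal_links(html, local_page="/journal.php"):
    """Point links to journal entries at the local parsing page instead."""
    return JOURNAL_LINK.sub(
        lambda m: f'href="{local_page}?id={m.group(1)}"',
        html,
    )


sample = (
    '<a href="http://davidn.livejournal.com/12345.html">an old entry</a> '
    '<a href="http://example.com/">somewhere else</a>'
)
print(rewrite_journal_links(sample))
```

Links to anywhere other than the journal pass through untouched.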
I could give out the source if anyone would be interested in doing this themselves, though be warned it looks fairly hideous. Though having said that, it's PHP - what do you expect?