Organising the Archive
-
@lia Send me what you want for the timestamp template, and I will see what I can do. There are lots of options. The best way to do all of this is to write the program to make all the changes you want, then it can just do the work for you all at once, straight from the source files. Then, if we want to tweak anything, we just update the program and re-run it against the source files. If you find more pages to fill in gaps, you just re-run the program with the new files in place, and it recreates everything exactly how you want it.
Those icons at the top are done the same way: they are essentially glyphs from a font, linked to from the CSS file, and the forum stores that data under /assets. Here is an example; navigate to this page, then "View Source": https://owforum.co.uk/assets/fonts/glyphicons-halflings-regular.eot
-
-
@biell said in Organising the Archive:
Those icons at the top are done the same way: they are essentially glyphs from a font, linked to from the CSS file, and the forum stores that data under /assets. Here is an example; navigate to this page, then "View Source": https://owforum.co.uk/assets/fonts/glyphicons-halflings-regular.eot
Good find, I did notice some of it appears to be in the client.css file but the icons don't show regardless. I'll likely simplify it and just create the icons myself then refer to them instead.
For the timestamps I've dug some more and think I found it: there's a class "timeago" that I assume might be missing and stopping them from rendering.
Going to dig around and see if I can either find it or rebuild it.
Edit: Ah never mind, it's jQuery D:
In that case, since they all exist like below:
<span class="visible-xs-inline-block visible-sm-inline-block visible-md-inline-block visible-lg-inline-block"> <a class="permalink" href="https://community.onewheel.com/post/3836"><span class="timeago" title="2015-09-09T19:09:53.329Z"></span></a>
Can it be replaced with this format:
</small> <small class="pull-right"> <span class="visible-xs-inline-block visible-sm-inline-block visible-md-inline-block visible-lg-inline-block"> <a class="permalink" href="">2015-09-09T19:09:53.329</a>
This closes the above "pull left" and sets the timestamp to pull right, relying on the already existing closing tags. The href gets nulled and the timestamp becomes plain text, detached from the jQuery. Ends up looking a bit like this:
(I've imported the client-darkly.css for this hence the difference in color)
If it's possible to have Perl rewrite that as something more human readable like "9 September 2015, 19:09" that would be ideal :)
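A minimal sketch of the rewrite being asked for here, in Python rather than the Perl the thread plans to use. It assumes every timestamp appears exactly as the empty `timeago` span quoted above, with an ISO 8601 `title` ending in `Z`; the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the empty timeago span exactly as it appears in the saved pages.
TIMEAGO = re.compile(r'<span class="timeago" title="([^"]+)"></span>')

def humanise(match):
    # The title looks like 2015-09-09T19:09:53.329Z (UTC).
    stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    stamp = stamp.replace(tzinfo=timezone.utc)
    # stamp.day avoids the leading zero that %d would add.
    return f"{stamp.day} {stamp:%B %Y, %H:%M}"

def replace_timeago(html):
    """Swap every timeago span for plain, human-readable text."""
    return TIMEAGO.sub(humanise, html)
```

Because the span is replaced by plain text, nothing on the page depends on the jQuery plugin afterwards. Month names come from the current locale, which is English unless the program changes it.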
-
@lia said in Organising the Archive:
If it's possible to have Perl rewrite that as something more human readable like "9 September 2015, 19:09" that would be ideal :)
lol uh oh @biell ... u know where this train goes...
-
@notsure Oh no is that an issue?
-
@lia said in Organising the Archive:
@notsure Oh no is that an issue?
lol no its probably super easy.
but it always starts with innocuous little requests. next thing u know, code everywhere...
-
@notsure Aha I get that. These should really be all that's needed, and only because of the sheer volume of repetitive edits involved. I've narrowed the scope of the archive to purely the basics, stripping everything other than the content, which can be improved later if needed since I keep a copy of the raw unedited posts I gathered. Bossman wonders why I have a 17TB NAS... this is why lol.
The rest I'm happy to manually scrub to clean a handful of things that would probably be a pain to automate due to edge cases.
-
@lia Yes, writing it in any format will be easy. The HTML code also has the time as milliseconds in epoch, so I will just use that with whatever strftime string is necessary for the format you want (probably "%d %B %Y, %H:%M UTC"). Are you cool with 24-hour time, or do you want AM/PM? Personally, I prefer 24-hour time. In pass 1, I will just print it in UTC. Given that this is an archive, that should be sufficient.
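As a rough illustration of the approach described in this post (shown in Python, not the Perl script itself), an epoch value in milliseconds can be turned into the proposed format string like this. The function name and the way the value is pulled out of the HTML are assumptions; only the format string comes from the post.

```python
from datetime import datetime, timezone

def format_epoch_ms(ms, fmt="%d %B %Y, %H:%M UTC"):
    """Format a millisecond epoch timestamp as 24-hour UTC text."""
    # Epoch milliseconds -> seconds, interpreted explicitly as UTC so the
    # result does not depend on the time zone of the machine running it.
    stamp = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return stamp.strftime(fmt)

print(format_epoch_ms(1441825793329))
```

Switching to AM/PM later would only mean changing the format string, for example `%I:%M %p` in place of `%H:%M`.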
-
@lia said in Organising the Archive:
The rest I'm happy to manually scrub to clean a handful of things that would probably be a pain to automate due to edge cases.
You should at least let me know what they are, because if I can code for them, I can save a lot of time, and allow you to re-run the join later if we want to change something on lots of pages.
My rule on a computer is to only ever do something 3 times.
- Learn the process
- Code the process while performing it
- Run the code and fix up all the bugs
After that, run the automation every time.
-
@biell 24 hour is perfect :) Simpler to implement and just a better format in general. UTC is fine too, happy to keep it as a single time zone to keep things consistent. I feel the dates/times only really matter in the broader sense of gauging roughly when something was said.
Appreciate the offer to code some extra bits. I've not settled on a header replacement yet, so I'm uncertain what to suggest for those other than trimming the site buttons. I can have a crack at finishing something today, then set it up so any changes can be handled by some CSS and you don't need to rewrite anything.
Here's the client-darkly.css if you don't have it.
On my test doc I've done this if you're curious.
Purged some elements so only this exists in the pre "main" section, with the nav being short and simple. Also amended the header to explain the page a bit and link back to the archive:
<!DOCTYPE html> <title>Welcome to the Onewheel forum! | Onewheel Forum</title> <link rel="stylesheet" type="text/css" href="./2 Welcome to the Onewheel forum! _ Onewheel Forum_files/client-darkly.css"> <main id="panel" class="slideout-panel" style="padding-top: 100px;"> <nav class="navbar navbar-default navbar-fixed-top header" id="header-menu" component="navbar"> <div class="container"> <div class="navbar-header"> <a href="http://www.onewheel.com/"> <img alt="Onewheel Home Page" height="80" src="./2 Welcome to the Onewheel forum! _ Onewheel Forum_files/site-logo.png"> </a> </div> <div class="pull-right"> <a href="http://archive.owforum.co.uk"> <img alt="The Archive homepage" src="http://archive.owforum.co.uk/Images/OWForumArchive.png" height="80"> </a> </div> <p class="text-center"> <br /> This page is an archived copy of the old Onewheel Forum. <br /> To see more click <a href="http://archive.owforum.co.uk">here</a> or the archive link to the right. </p> </div> </nav>
Ends up looking like this :)
I then wipe out the below
<noscript> <div class="alert alert-danger"> <p> Your browser does not seem to support JavaScript. As a result, your viewing experience will be diminished, and you may not be able to execute some actions. </p> <p> Please download a browser that supports JavaScript, or enable it if it's disabled (i.e. NoScript). </p> </div> </noscript> <ol class="breadcrumb"> <li itemscope="itemscope" itemtype="http://data-vocabulary.org/Breadcrumb"> <a href="https://community.onewheel.com/" itemprop="url"> <span itemprop="title"> Home </span> </a> </li> <li itemscope="itemscope" itemtype="http://data-vocabulary.org/Breadcrumb"> <a href="https://community.onewheel.com/category/1/news-announcements" itemprop="url"> <span itemprop="title"> News & Announcements </span> </a> </li> <li component="breadcrumb/current" itemscope="itemscope" itemtype="http://data-vocabulary.org/Breadcrumb" class="active"> <span itemprop="title"> Welcome to the Onewheel forum! <a target="_blank" href="https://community.onewheel.com/topic/2.rss"><i class="fa fa-rss-square"></i></a> </span> </li> </ol> <div widget-area="header"> </div>
Near the bottom of the page these get removed also.
<div class="tags pull-left"> <a href="https://community.onewheel.com/tags/news"> <span class="tag-item" data-tag="news" style="">news</span> <span class="tag-topic-count human-readable-number" title="2">2</span> </a> <a href="https://community.onewheel.com/tags/official"> <span class="tag-item" data-tag="official" style="">official</span> <span class="tag-topic-count human-readable-number" title="1">1</span> </a> </div> <div component="topic/browsing-users"> </div>
And this
<div component="topic/reply/container" class="btn-group action-bar bottom-sheet hidden"> <a href="https://community.onewheel.com/compose?tid=2&title=Welcome%20to%20the%20Onewheel%20forum!" class="btn btn-primary" component="topic/reply" data-ajaxify="false" role="button"><i class="fa fa-reply visible-xs-inline"></i><span class="visible-sm-inline visible-md-inline visible-lg-inline"> Reply</span></a> <button type="button" class="btn btn-info dropdown-toggle" data-toggle="dropdown"> <span class="caret"></span> </button> <ul class="dropdown-menu pull-right" role="menu"> <li><a href="https://community.onewheel.com/topic/2/welcome-to-the-onewheel-forum/3#" component="topic/reply-as-topic">Reply as topic</a></li> </ul> </div> <a component="topic/reply/guest" href="https://community.onewheel.com/login" class="btn btn-primary">Log in to reply</a>
The complete page that I have is here, along with the raw unedited files, in case I haven't made sense and you'd rather look at the whole thing than navigate my ramblings.
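The deletions listed in this post could be automated along these lines. This is only a sketch in Python under a loud assumption: it handles the `<noscript>` and breadcrumb `<ol>` blocks, which do not contain a nested tag of the same name, so a non-greedy regular expression is enough. The `<div>`-based blocks (tags, reply container) nest other `<div>` elements and would need a real HTML parser instead.

```python
import re

# Blocks safe to cut with a non-greedy match because the same tag
# never appears nested inside them in the saved pages.
REMOVABLE = [
    re.compile(r"<noscript>.*?</noscript>", re.DOTALL),
    re.compile(r'<ol class="breadcrumb">.*?</ol>', re.DOTALL),
]

def strip_blocks(html):
    """Remove every block matched by REMOVABLE and return the rest unchanged."""
    for pattern in REMOVABLE:
        html = pattern.sub("", html)
    return html
```

Keeping the patterns in one list means a new unwanted block is a one-line addition, and the whole set can be re-run against the raw files at any time.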
-
@lia Sorry this is taking so long, the issues I was dealing with at work over the weekend have continued into the week.
That said, I had a couple of hours to sit down and code tonight and I basically have everything you asked for so far. The banner is rewritten, Google's content is removed, the HTML chunks you wanted deleted are removed, I stitched all the posts into one huge thread, the time displays accurately in UTC, and I have cleaned up some of the "Register", "Login", etc. buttons.
I can't see how to insert the time for the "Last reply" so I am removing it for now.
The one to-do item left is to move all the "*_files" content into a single, consolidated directory so you have one .html file and one corresponding _files directory.
After that, I will go through the webpage some more and see what else needs to be cleaned out from the HTML or updated to enhance rendering.
-
@biell that's okay, no rush on any of it. Thanks for the effort and commitment so far :) Hope work hasn't been stressful.
Happy to have the “last reply” element scrubbed; it's pretty redundant to be fair, and that removes the need to fix that timestamp.
Let me know if you need anything else from me in the meantime.
-
Minor edit to the top of the page and some big fixes.
-
Realised when testing on mobile the page loaded like it does on desktop making it a bit of a pain to use. Turns out a handful of the meta tags actually restructure the page so I added back only those. However it then broke the header so I removed the Onewheel logo image and changed the text to state what topic was currently loaded because on mobile I couldn't get that to render otherwise.
-
Made a resources folder that'll contain common elements so I'm not duplicating the same .css and images per topic. Should help save space on the small SSD I spec'd for this archive server.
-
Also I accidentally stumbled on a fix for the site icons >.> Just reference some public ones >:D
<!DOCTYPE html> <title>Welcome to the Onewheel forum! | Onewheel Forum</title> <link rel="stylesheet" type="text/css" href="./resources/client-darkly.css"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="content-type" content="text/html; charset=UTF-8"> <meta name="apple-mobile-web-app-capable" content="yes"> <meta name="mobile-web-app-capable" content="yes"> <meta property="og:site_name" content="Onewheel Forum"> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> <link rel="icon" href="./resources/OWForumArchiveIcon.png"> <main id="panel" class="slideout-panel" style="padding-top: 100px;"> <nav class="navbar navbar-default navbar-fixed-top header" id="header-menu" component="navbar"> <div class="container"> <div class="navbar-header"> <a href="http://archive.owforum.co.uk"> <img alt="The Archive homepage" src="./resources/OWForumArchive.png" height="60"> </a> </div> <div class="navbar-header pull-right"> <p class="text-right" > <br /> This page is an archived copy of the old Onewheel Forum. <br /> Current topic : <a style="font-size:1.2em" class="topic-title">Welcome to the Onewheel forum!</a> </p> </div> </div> </nav> <div class="container" id="content">
This can be seen live now if anyone wants to check it out (and see a happy Kyle)
http://archive.owforum.co.uk/Topics/2 Welcome to the Onewheel forum! _ Onewheel Forum.html
-
@lia ssl
-
@notsure
On the list todo :)
Done
-
@lia I have yet again proved that it takes 20% of the time to do the first 80% of the work, and 80% of the time to do the last 20% of the work.
Give this a shot, extract it to the document root:
https://drive.google.com/file/d/1we2wIcMRgP5t8K2laDzwHo-1xOHlxH0D/view
It will create /topic and /assets directory structures that mimic how a live NodeBB is supposed to look. I used an "index.html" file so the URL bar will look very close to what you would see on a live system.
If something major doesn't look right, then I probably just don't have it configured right for how your webserver is setup. If it does look good, go through it and let me know of what other things you would like to see changed.
I am assuming you aren't going to mirror people pages, right? I removed all the people page references, but left them looking like links (you just can't click them anymore).
It looks like your webserver is Ubuntu, do you have shell access? If so, it should have perl, and so I can just send you the script once you are done and you can run it against all the other pages.
-
Thanks @biell :)
It looks great and, surprisingly, doesn't take an age to load as I worried it might, since that was a 2000+ post long topic. Having the directory structure like that makes sense. I did have to move the assets into the same directory as index, then add a "." before /assets/ so index uses a relative path, which did the job.
I am assuming you aren't going to mirror people pages
Correct, I did consider it but it seemed like a lot of work for no real benefit since the user pages would literally be a picture in time with no ability to pull posts together without me actually just building a database.
Was it possible to have the script spit out a txt of missing post IDs, or was that a pain to implement?
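A report like the one asked about here could be produced roughly as follows. This Python sketch assumes the script has already collected a sequential post number for each post it found in a topic; how those numbers are extracted from the HTML is not shown in the thread, and the global IDs in permalinks such as /post/3836 may not be consecutive within a topic. The function names and the output file are hypothetical.

```python
def missing_numbers(found):
    """Return the post numbers absent from the range spanned by `found`."""
    present = set(found)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def write_report(found, path):
    """Write one missing post number per line to a text file and return them."""
    gaps = missing_numbers(found)
    with open(path, "w", encoding="utf-8") as handle:
        for number in gaps:
            handle.write(f"{number}\n")
    return gaps
```

Posts missing from the very start or end of a topic cannot be detected this way unless the expected first and last numbers are known from elsewhere, such as the topic's post count.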
It looks like your webserver is Ubuntu, do you have shell access?
I do, but I intend to repair files locally on the Windows machine then one by one add them to the archive so that I can speed up the tweaks I do in VS.
Thank you so much for doing this. I can't put into words how much this helps <3
-
@lia You shouldn't have to move anything, and using a shared /assets across all posts should ensure that images like avatars, which are used all over, only exist once. That is good for your disk space, and good for page load times, as people only have to keep the image in cache once.
Did you extract at the document root of the web server? I set things up that way, but I am now realizing it is dumb to rely on that, and I can do relative links. I will make all the links relative, and you won't have to edit or move anything. I will get you a new version tomorrow with that modified.
Are there any other changes you are trying to make? I can incorporate those also. After that, I can send you the perl script and you can run it for yourself. Or, if you prefer, we can create a pipeline whereby you send me a zip, I filter it, then send you back the updated zip.
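One way the switch to relative links could work, sketched in Python instead of the actual Perl script. It assumes links into the archive start with a single "/" and that the page's own path from the document root is known; the example paths are illustrative only.

```python
import posixpath

def relativise(link, page_path):
    """Rewrite a root-absolute link so it is relative to the page's directory."""
    # Leave relative links, full URLs and protocol-relative URLs alone.
    if not link.startswith("/") or link.startswith("//"):
        return link
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath(link, start=page_dir)

print(relativise("/assets/uploads/avatar.png", "/topic/2/index.html"))
```

With links written this way, the extracted tree works from any directory, not only from the web server's document root.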
-
@biell said in Organising the Archive:
After that, I can send you the perl script and you can run it for yourself. Or, if you prefer, we can create a pipeline whereby you send me a zip, I filter it, then send you back the updated zip.
-
@biell That's a better idea, keeping avatars in a common assets directory.
I was hesitant about keeping all post images in one place, as it looks like the early forum didn't rename uploads to prevent clashing image names. If they were all merged, a handful might get overwritten, so at least for post images, storing them in their own topic directory makes sense since there is almost no cross-pollination in assets.
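The overwrite risk described here could be checked before any merge. The Python sketch below walks a directory tree and lists file names that occur in more than one folder. It compares names only (case-insensitively, on the assumption that the files are handled on Windows), so two identical copies of the same image are also reported; the function name and directory layout are hypothetical.

```python
import os
from collections import defaultdict

def find_clashes(root):
    """Map each file name found in more than one folder under root to its paths."""
    seen = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            # Lower-case the key so "IMG.jpg" and "img.jpg" count as a clash.
            seen[name.lower()].append(os.path.join(folder, name))
    return {name: sorted(paths) for name, paths in seen.items() if len(paths) > 1}
```

An empty result would suggest the post images can be merged safely; any entries show exactly which topics would overwrite each other.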
I haven't put it on the server yet, it's only on my PC however I intend on keeping it all in the site directory so I don't have to grant permissions outside it.
I've only made a minor adjustment to the header and added back a section that helped with mobile compatibility. If you view Topic 2 on the current build you can see it behaves better on there and Mobile (before it was hard to see and the banner got in the way). The <div class="topic-header"> appears to be what holds the element that stays on the page which when removed breaks that and makes mobile a nightmare.
Below is the full HTML for the top section of the page before the post data begins, if that helps.
<!DOCTYPE html> <title>Welcome to the Onewheel forum! | Onewheel Forum</title> <link rel="stylesheet" type="text/css" href="./resources/client-darkly.css"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="content-type" content="text/html; charset=UTF-8"> <meta name="apple-mobile-web-app-capable" content="yes"> <meta name="mobile-web-app-capable" content="yes"> <meta property="og:site_name" content="Onewheel Forum"> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> <link rel="icon" href="./resources/OWForumArchiveIcon.png"> <main id="panel" class="slideout-panel" style="padding-top: 1px;"> <nav class="navbar navbar-default navbar-fixed-top header" id="header-menu" component="navbar"> <div class="container"> <div class="navbar-header"> <a href="http://archive.owforum.co.uk"> <img alt="The Archive homepage" src="./resources/OWForumArchive.png" height="60"> </a> </div> <div class="navbar-header pull-right"> <p class="text-right" style="padding-top: 10px"> This page is an archived copy of the old Onewheel Forum. </p> </div> </div> </nav> <div class="container" id="content"> <div data-widget-area="header"> </div> <div class="row"> <div class="topic col-lg-12"> <div class="topic-header"> <h1 component="post/header" class="" itemprop="name" style="padding-top: 50px;"> <span class="topic-title" component="topic/title"> <span component="topic/labels"> <i component="topic/pinned" class="fa fa-thumb-tack " title="Pinned"></i> <i component="topic/locked" class="fa fa-lock " title="Locked"></i> <i class="fa fa-arrow-circle-right hidden" title="Moved"></i> </span> Welcome to the Onewheel forum! 
</span> </h1> <div class="topic-info clearfix"> <div class="category-item inline-block"> <div role="presentation" class="icon pull-left" style="background-color: #00487F; color: #ffffff;"> <i class="fa fa-fw fa-list-alt"></i> </div> <a href="#">News & Announcements</a> </div> <div class="tags tag-list inline-block hidden-xs"> </div> <div class="inline-block hidden-xs"> <div class="stats text-muted"> <i class="fa fa-fw fa-user" title="Posters"></i> <span title="1" class="human-readable-number">1</span> </div> <div class="stats text-muted"> <i class="fa fa-fw fa-pencil" title="Posts"></i> <span component="topic/post-count" title="3" class="human-readable-number">3</span> </div> <div class="stats text-muted"> <i class="fa fa-fw fa-eye" title="Views"></i> <span class="human-readable-number" title="8665">8665</span> </div> </div> <a class="hidden-xs" target="_blank" href="https://community.onewheel.com/topic/2.rss"><i class="fa fa-rss-square"></i></a> <div component="topic/browsing-users" class="inline-block hidden-xs"> </div> <div class="topic-main-buttons pull-right inline-block"> <span class="loading-indicator btn pull-left hidden" done="0"> <span class="hidden-xs">Loading More Posts</span> <i class="fa fa-refresh fa-spin"></i> </span> </div> </div> </div>
Other than that I think it's all good. Did the script manage to spit out missing post IDs?
I'm wondering: if I make a template post, could the script insert it in place of each missing post, both to acknowledge that it's missing and to say where to submit the post if it's later found elsewhere? I can go in later, piece those few missing ones together in VS, and remove my template.
I'm happy to run the script if you prefer; whatever suits you, since it's your script. This is the volume of data I'm working with, just to give you an idea, so I'm happy to run it if that looks like a nightmare xD It certainly was to manually scrape from Google cache lmao.