Organising the Archive
-
@biell There we go, put all the files in this time (and I checked it again just in case lol).
20 FRANCE OneWheel Riders _ Onewheel Forum_files.zip
The HTML files have a corresponding folder that contains the images and other dependencies that were downloaded when I saved the pages.
For the one I repaired I just merged all the folders into 1 then made sure all the references to the files in the master HTML were pointing at the master folder and not the individual ones if that makes sense.
-
*clears throat *
Haven't given the site an SSL certificate yet buuuuut...
archive.owforum.co.uk
Totally got Will Smith to show this one off for me. Thx Will x
Seems to work fine on mobile too so far. I'm tempted to put an iframe window on the right side to preview the clicked topic but need to figure out how to do that.
For now any feedback would be great (be gentle lol).
To Do List
- Add a custom header to all archived topics to return to the main page.
- Maaaaybe figure out an okay-ish search facility
- Git gud
-
@lia -- I am in awe!!! -- of your drive, of your dedication, and of your abilities!!!
-
@s-leon Thank you <3
You all give me the strength and desire to do all of this. I can't think of anything else that would have me attempt even a fraction of this.
-
-
@lia So, this should be pretty easy. The posts are all contained within a single giant unordered list. Each post is a list item, and there is metadata with a "data-index". So, both getting the correct post order and deduplicating entries from the different files will be a breeze.
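A minimal sketch of that ordering and deduplication idea, in Python rather than Perl purely for illustration. It assumes each post is an `<li>` carrying a `data-index` attribute with no nested `<li>` inside it, which may not hold for the real pages (a proper HTML parser would be safer there). The names `POST_RE` and `merge_posts` are hypothetical.

```python
import re

# Assumption: one <li ... data-index="N" ...>...</li> per post, no nested <li>.
POST_RE = re.compile(r'<li\b[^>]*\bdata-index="(\d+)"[^>]*>.*?</li>', re.DOTALL)

def merge_posts(pages):
    """Collect posts from several saved pages, keyed by data-index."""
    posts = {}
    for html in pages:
        for match in POST_RE.finditer(html):
            index = int(match.group(1))
            # Keep the first copy seen; later duplicates are ignored.
            posts.setdefault(index, match.group(0))
    # Sorting the keys restores thread order regardless of file order.
    return [posts[i] for i in sorted(posts)]

page_a = '<ul><li data-index="0">first</li><li data-index="1">second</li></ul>'
page_b = '<ul><li data-index="1">second</li><li data-index="2">third</li></ul>'
merged = merge_posts([page_b, page_a])
```

Feeding the pages in any order should give the same merged list, since the dictionary is keyed on the index rather than on file position.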
All the media/aux files under the _files structure are identical across all posts by name (e.g. 2170-profileimg.jpg is always the same picture).
If you are still interested, it would only take an hour or two to write the program, and then it could stitch all of this together. I would write it in perl, because I'm old. So, either you could upload zip files like this or I can easily write it so the most basic perl install will work for you to run it yourself.
I can also program to rewrite the header to clean it up, for example I could remove the "Register" and "Login" links. I can also remove the text added by Google Cache, if you would like.
I can also remove the upvote/downvote links while I am at it, if you want. Or, I can completely remove that section, including the vote-count.
Let me know how you want it, I can then write version 1, send you the output, and we can tweak from there. But, this should all be very straightforward.
-
@biell music to my ears, removing those bits will be fine as I plan to go in and add a template header that’ll call to some master .css for the rest. Getting all the broken topics together would be a dream.
Would it be possible to have it spit out a txt or something per completed topic of missing post ids? I’ll go in after and add a placeholder list item and later try locating them on archive.org
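As a rough sketch of what such a report could look like, in Python for illustration: it treats the `data-index` values as a stand-in for post ids and assumes they start at 0 and run without gaps in a complete topic. Both are assumptions, and `missing_indices` and `write_report` are hypothetical names.

```python
def missing_indices(found):
    """Return the index values absent between 0 and the highest one seen."""
    if not found:
        return []
    present = set(found)
    return [i for i in range(max(present) + 1) if i not in present]

def write_report(path, found):
    """Write one missing index per line to a text file and return the gaps."""
    gaps = missing_indices(found)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(str(i) for i in gaps))
    return gaps
```

Note that posts missing from the very end of a topic cannot be detected this way, because nothing in the recovered pages says how long the thread originally was.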
I think we can keep the up/downvotes, Those might be interesting to see still and I’m working to repair the missing icons and later a way to do the timestamps.
Thanks for taking the time to have a look, will be sure to give you some credit on the archive page for helping simplify this mammoth task.
-
@lia Providing a list of missing posts should be very easy.
If you copy the "/assets" directory structure from the current forum into an "/assets" directory on the archive website, then the missing icons should be fixed. From what I can tell, that is where they were, and where I expect the pages to look for them after I de-googleify the pages. Or, if you like, I could try to point the missing assets to this forum.
-
@lia What's wrong with the timestamps? Do you just want them rendered in UTC, or do you want them rendered localtime? If you want it in UTC, I can just hard code it into the page from the metadata and put that in the correct place. If you want it rendered in localtime, I should be able to add a touch of javascript to render that, but it will take longer to get working.
-
@biell sorry I meant the icons like up arrow, down arrow, pinned, online, offline etc. You probably noticed the header nav icons are missing. They appear to currently rely on some internal site icons from FontAwesome but I plan to add custom ones and reference them instead so they render.
Time stamps don’t render currently, seems like the data is there but compared to a working page the html elements are not there. They seem to exist as a meta tag only. If I gave you the template for a working timestamp element would it be simple to insert onto the posts using the meta data stored in these tags?
I’m genuinely impressed with how capable this all seems. I was previously looking to try doing something in python but it seemed way out of my league since I’m not even an amateur D:
-
@lia Send me what you want for the timestamp template, and I will see what I can do. There are lots of options. The best way to do all of this is to write the program to make all the changes you want, then it can just do the work for you all at once, straight from the source files. Then, if we want to tweak anything, we just update the program and re-run it against the source files. If you find more pages to fill in gaps, you just re-run the program with the new files in place, and it recreates everything exactly how you want it.
Those icons at the top are done the same way, they are essentially from a font, linked to from the CSS file, and the forum stores that data under /assets. Here is an example, navigate to this page, then "View Source": https://owforum.co.uk/assets/fonts/glyphicons-halflings-regular.eot
-
-
@biell said in Organising the Archive:
Those icons at the top are done the same way, they are essentially from a font, linked to from the CSS file, and the forum stores that data under /assets. Here is an example, navigate to this page, then "View Source": https://owforum.co.uk/assets/fonts/glyphicons-halflings-regular.eot
Good find, I did notice some of it appears to be in the client.css file but the icons don't show regardless. I'll likely simplify it and just create the icons myself then refer to them instead.
For the time stamps I've dug some more and think I found it, there's a class "timeago" that I assume might be missing and stopping it render.
Going to dig around and see if I can either find it or rebuild it.
Edit: Ah never mind it's a Jquery D:
In that case since they all exist like below:
<span class="visible-xs-inline-block visible-sm-inline-block visible-md-inline-block visible-lg-inline-block"> <a class="permalink" href="https://community.onewheel.com/post/3836"><span class="timeago" title="2015-09-09T19:09:53.329Z"></span></a>
Can it be replaced with this format:
</small> <small class="pull-right"> <span class="visible-xs-inline-block visible-sm-inline-block visible-md-inline-block visible-lg-inline-block"> <a class="permalink" href="">2015-09-09T19:09:53.329</span></a>
This closes the above "pull left" and sets the timestamp to pull right, relying on the already existing closing tags. The href gets nulled and the timestamp becomes detached from the jQuery plugin as plain text. Ends up looking a bit like this:
(I've imported the client-darkly.css for this hence the difference in color)
If it's possible to have Perl re-write that as something more human readable like "9 September 2015, 19:09" that would be ideal :)
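For illustration only, here is a small Python sketch of that substitution (the thread plans to do it in Perl). It assumes every empty `timeago` span looks exactly like the sample above, with the ISO timestamp in the `title` attribute and a trailing `Z`; `TIMEAGO_RE`, `readable` and `replace_timeago` are made-up names.

```python
import re
from datetime import datetime, timezone

# Assumption: the span is always empty and carries the ISO time in "title".
TIMEAGO_RE = re.compile(r'<span class="timeago" title="([^"]+)"></span>')

def readable(iso):
    """Turn 2015-09-09T19:09:53.329Z into '9 September 2015, 19:09'."""
    dt = datetime.strptime(iso, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    # dt.day avoids the zero padding that %d would add.
    return f"{dt.day} {dt:%B %Y, %H:%M}"

def replace_timeago(html):
    """Swap each empty timeago span for plain, human-readable text."""
    return TIMEAGO_RE.sub(lambda m: readable(m.group(1)), html)
```

The month name comes from `%B`, so it depends on the locale the script runs under; with the default locale it is English.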
-
@lia said in Organising the Archive:
If it's possible to have Perl re-write that as something more human readable like "9 September 2015, 19:09" that would be ideal :)
lol uh oh @biell ... u know where this train goes...
-
@notsure Oh no is that an issue?
-
@lia said in Organising the Archive:
@notsure Oh no is that an issue?
lol no its probably super easy.
but it always starts with innocuous little requests. next thing u know, code everywhere...
-
@notsure Aha I get that, these should really be all that's needed due to the sheer volume of repetitive edits involved. I've narrowed the scope of the archive to purely the basics. Stripping everything other than just the content to later be improved if needed since I keep a copy of the raw un-edited posts I gathered. Bossman wonders why I have a 17TB Nas... this is why lol.
The rest I'm happy to manually scrub to clean a handful of things that would probably be a pain to automate due to edge cases.
-
@lia Yes, writing it in any format will be easy. The HTML code also has the time as milliseconds in epoch, so I will just use that with whatever strftime string is necessary for the format you want (probably "%d %B %Y, %H:%M UTC"). Are you cool with 24-hour time, or do you want AM/PM? Personally, I prefer 24-hour time. In pass 1, I will just print it in UTC. Given that this is an archive, that should be sufficient.
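The same idea sketched in Python, whose `strftime` accepts the same format string. The function name `format_epoch_ms` is hypothetical, and the sample value is the epoch-millisecond equivalent of the 2015-09-09T19:09:53.329Z timestamp quoted earlier in the thread.

```python
from datetime import datetime, timezone

def format_epoch_ms(ms, fmt="%d %B %Y, %H:%M UTC"):
    """Format a millisecond epoch value as a UTC date string."""
    # Drop the millisecond remainder; the display only goes down to minutes.
    seconds, _millis = divmod(int(ms), 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)

stamp = format_epoch_ms(1441825793329)
```

With that format string the day is zero padded, so the sample renders as "09 September 2015, 19:09 UTC".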
-
@lia said in Organising the Archive:
The rest I'm happy to manually scrub to clean a handful of things that would probably be a pain to automate due to edge cases.
You should at least let me know what they are, because if I can code for them, I can save a lot of time, and allow you to re-run the join later if we want to change something on lots of pages.
My rule on a computer is to only ever do something 3 times.
- Learn the process
- Code the process while performing it
- Run the code and fix up all the bugs
After that, run the automation every time.
-
@biell 24 hour is perfect :) Simpler to implement and just a better format in general. UTC is fine too, happy to keep it as a single time zone to keep things consistent. I feel the dates/times only really matter in the broader sense of gauging roughly when something was said.
Appreciate the offer to code some extra bits. I've not settled on a header replacement yet so I'm uncertain of what to suggest for those other than the trimming of the site buttons. I can have a crack at finishing something today then have it so any changes can be handled by some CSS instead so you don't need to re-write anything.
Here's the client-darkly.css if you don't have it.
On my test doc I've done this if you're curious.
Purged some elements so only this exists in the pre "main" section with the nav being short and simple. Also amended the header to explain the page a bit and link back to the archive:
<!DOCTYPE html> <title>Welcome to the Onewheel forum! | Onewheel Forum</title> <link rel="stylesheet" type="text/css" href="./2 Welcome to the Onewheel forum! _ Onewheel Forum_files/client-darkly.css"> <main id="panel" class="slideout-panel" style="padding-top: 100px;"> <nav class="navbar navbar-default navbar-fixed-top header" id="header-menu" component="navbar"> <div class="container"> <div class="navbar-header"> <a href="http://www.onewheel.com/"> <img alt="Onewheel Home Page" height="80" src="./2 Welcome to the Onewheel forum! _ Onewheel Forum_files/site-logo.png"> </a> </div> <div class="pull-right"> <a href="http://archive.owforum.co.uk"> <img alt="The Archive homepage" src="http://archive.owforum.co.uk/Images/OWForumArchive.png" height="80"> </a> </div> <p class="text-center"> <br /> This page is an archived copy of the old Onewheel Forum. <br /> To see more click <a href="http://archive.owforum.co.uk">here</a> or the archive link to the right. </p> </div> </nav>
Ends up looking like this :)
I then wipe out the below
<noscript> <div class="alert alert-danger"> <p> Your browser does not seem to support JavaScript. As a result, your viewing experience will be diminished, and you may not be able to execute some actions. </p> <p> Please download a browser that supports JavaScript, or enable it if it's disabled (i.e. NoScript). </p> </div> </noscript> <ol class="breadcrumb"> <li itemscope="itemscope" itemtype="http://data-vocabulary.org/Breadcrumb"> <a href="https://community.onewheel.com/" itemprop="url"> <span itemprop="title"> Home </span> </a> </li> <li itemscope="itemscope" itemtype="http://data-vocabulary.org/Breadcrumb"> <a href="https://community.onewheel.com/category/1/news-announcements" itemprop="url"> <span itemprop="title"> News & Announcements </span> </a> </li> <li component="breadcrumb/current" itemscope="itemscope" itemtype="http://data-vocabulary.org/Breadcrumb" class="active"> <span itemprop="title"> Welcome to the Onewheel forum! <a target="_blank" href="https://community.onewheel.com/topic/2.rss"><i class="fa fa-rss-square"></i></a> </span> </li> </ol> <div widget-area="header"> </div>
Near the bottom of the page these get removed also.
<div class="tags pull-left"> <a href="https://community.onewheel.com/tags/news"> <span class="tag-item" data-tag="news" style="">news</span> <span class="tag-topic-count human-readable-number" title="2">2</span> </a> <a href="https://community.onewheel.com/tags/official"> <span class="tag-item" data-tag="official" style="">official</span> <span class="tag-topic-count human-readable-number" title="1">1</span> </a> </div> <div component="topic/browsing-users"> </div>
And this
<div component="topic/reply/container" class="btn-group action-bar bottom-sheet hidden"> <a href="https://community.onewheel.com/compose?tid=2&title=Welcome%20to%20the%20Onewheel%20forum!" class="btn btn-primary" component="topic/reply" data-ajaxify="false" role="button"><i class="fa fa-reply visible-xs-inline"></i><span class="visible-sm-inline visible-md-inline visible-lg-inline"> Reply</span></a> <button type="button" class="btn btn-info dropdown-toggle" data-toggle="dropdown"> <span class="caret"></span> </button> <ul class="dropdown-menu pull-right" role="menu"> <li><a href="https://community.onewheel.com/topic/2/welcome-to-the-onewheel-forum/3#" component="topic/reply-as-topic">Reply as topic</a></li> </ul> </div> <a component="topic/reply/guest" href="https://community.onewheel.com/login" class="btn btn-primary">Log in to reply</a>
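A hedged Python sketch of how removals like these might be automated. The patterns below are illustrative guesses based only on the fragments quoted above, they cover just three of the blocks, and they rely on none of the targeted elements nesting a tag of the same name. `REMOVALS` and `strip_chunks` are hypothetical names.

```python
import re

# Assumption: these elements never contain a nested element of the same name,
# so a non-greedy match up to the first closing tag is enough.
REMOVALS = [
    re.compile(r'<noscript>.*?</noscript>', re.DOTALL),
    re.compile(r'<ol class="breadcrumb">.*?</ol>', re.DOTALL),
    re.compile(r'<a component="topic/reply/guest"[^>]*>.*?</a>', re.DOTALL),
]

def strip_chunks(html):
    """Delete every block matched by one of the removal patterns."""
    for pattern in REMOVALS:
        html = pattern.sub("", html)
    return html
```

Blocks built from nested `<div>` elements are harder to match safely with regular expressions, so an HTML parser would probably be the better tool for those.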
The complete page I have is here, along with the raw unedited files, in case you'd rather look at it as a whole than navigate my ramblings.
-
@lia Sorry this is taking so long, the issues I was dealing with at work over the weekend have continued into the week.
That said, I had a couple hours to sit down and code tonight and I basically have everything you asked for so far. The banner is rewritten, Google's content is removed, the HTML chunks you wanted deleted are removed, I stitched all the posts into one huge thread, the time displays accurately in UTC, and I have cleaned up some of the "Register", "Login", etc. buttons.
I can't see how to insert the time for the "Last reply" so I am removing it for now.
The one have-to-do item left is to move all the "*_files" content into a single, consolidated directory so you have one .html file and one corresponding _files directory.
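A minimal Python sketch of that consolidation step, for illustration only. It leans on the earlier observation that files with the same name are always the same picture, so a name that already exists in the master directory is simply skipped. It also assumes the pages reference their folders as "./<folder name>/", as in the header sample. `consolidate` and `repoint` are hypothetical names.

```python
import shutil
from pathlib import Path

def consolidate(source_dirs, master_dir):
    """Copy files from several *_files folders into one master folder."""
    master = Path(master_dir)
    master.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source_dirs:
        for item in sorted(Path(src).iterdir()):
            target = master / item.name
            # Assumption: same name means same content, so skip duplicates.
            if item.is_file() and not target.exists():
                shutil.copy2(item, target)
                copied.append(item.name)
    return copied

def repoint(html, old_dir_names, master_name):
    """Rewrite references from the per-page folders to the master folder."""
    for old in old_dir_names:
        html = html.replace(f"./{old}/", f"./{master_name}/")
    return html
```

Nothing is deleted from the source folders, so the step can be re-run whenever more saved pages turn up.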
After that, I will go through the webpage some more and see what else needs to be cleaned out from the HTML or updated to enhance rendering.