|
Post by captainjapan on Aug 11, 2021 8:35:43 GMT -6
I am by no means proficient with databases or network backends, but I thought I might like to generate a list of outbound links from the odd74 forum. For example, user derv writes a lot of original material, which he links directly from his forum posts. Suppose derv links out to his Dropbox account. I would want a listing of every outbound link on the odd74 forum so I could browse derv's games (I realize that derv retires his links, but still...). Or take talysman, who has an extensive house rules system on his blog that he occasionally links to. Perhaps I could generate a listing of talysman's links to get to his blog faster.
I'm not trying to single out derv and talysman, necessarily. Many posters, past and present, write their own material, and much of it stretches back almost to the beginning of the boards. With veteran members writing thousands of posts, only a relative handful of which contain hyperlinks, I think it would be useful to have a way to search.
I don't know the terminology for what I'm asking. Is it called scraping, or crawling, or parsing...? Also, is this kind of activity from members blocked by ProBoards' servers?
|
|
|
Post by derv on Aug 11, 2021 11:24:43 GMT -6
Not sure I’ll be of any help with your request. I do know you can search all of a member's posts. There are a number of filters that can help refine the results.
In my case most of the material I share is in relation to ongoing discussion on the forum at the time. As you mentioned, I retire the links after a short time unless they encourage further discussion. Once the conversation peters out I see little reason to keep the links active. But in most cases I will share the material if someone asks. There’s only a couple docs I truly retired with the intent of revisiting later.
Oh, and I don’t have a blog to promote, so all my conversation on gaming basically happens here.
|
|
|
Post by talysman on Aug 11, 2021 13:45:58 GMT -6
I assume you're asking for a feature in ProBoards that will aggregate links, but I don't know that they have such a thing.
You could try using wget on specific sections of the forum, but setting that up might be a nightmare. I haven't used wget in at least 15 years and have forgotten most of it, so I can't really advise you, but it seems to me that if you're just collecting links that point elsewhere, you'd still have to run it through some kind of script (perl, python, etc.) to process it, extracting links and saving them to a file.
Another option is to use a bookmarklet that finds all the links on the current webpage and copies them to a popup window, where you can copy them to the clipboard. This, unfortunately, is manual... you'd have to go to threads one at a time and run the bookmarklet, copy the links, save them to a file, and edit the file to get rid of links you don't want, duplicate links, and so on.
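The clean-up at the end of the bookmarklet route (dropping duplicates and links you don't want) is easy to script. A minimal sketch in Python, assuming the copied links were saved one per line; the function name clean_links and the skip_hosts default are made up for illustration:

```python
from urllib.parse import urlsplit


def clean_links(lines, skip_hosts=("proboards.com",)):
    """Return unique http(s) links in first-seen order, dropping unwanted hosts.

    skip_hosts is a hypothetical default: any link whose host is one of
    these domains (or a subdomain of one) is treated as unwanted.
    """
    seen = set()
    kept = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # blank line
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # javascript:, mailto:, and similar
        host = (parts.hostname or "").lower()
        if any(host == h or host.endswith("." + h) for h in skip_hosts):
            continue  # link points back into a skipped domain
        if url in seen:
            continue  # duplicate
        seen.add(url)
        kept.append(url)
    return kept
```

To use it on a saved file, pass the open file object (it iterates line by line) and write the returned list back out, one link per line.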
|
|
|
Post by waysoftheearth on Aug 11, 2021 16:29:59 GMT -6
You could start by going into the search feature, and searching for all posts with "http" in the content. That will give you a very long list of posts with links in them.
Then, you could start narrowing the list down by adding more criteria, such as the subboard, and the post author. So, then you'd have (for example) a list of posts by talysman in the U&WA subforum that have links in them.
From there you would still have to manually eyeball each returned post, pull out the link, and decide whether it still points to anything useful or not.
It would take a while, but it's doable if you're keen.
|
|
|
Post by Deleted on Aug 11, 2021 19:59:54 GMT -6
I might be able to be useful! I've heard scraping, crawling, mirroring, and archiving all used to describe the download step. You'll want some parsing too: parsing the downloaded HTML lets you pull the links out. You could try a simpler string match, but it'll become a headache quickly, so I would just go to a parser.
talysman's suggestion of wget is good if you can install it on your OS. wget has a random delay mode and a fixed time delay mode so that your crawl can be polite to the server. odd74.proboards.com/robots.txt is the conventional place to put crawler rules. Some webservers still don't like crawlers like wget; if you make it run slowly and don't hammer the server you won't do any harm, but you might get requests rejected.
It looks like ProBoards tells a couple of eager search engine crawlers to slow down and keeps crawlers off certain sections of the site, but not the posts, so you shouldn't be making them mad.
Are you comfortable with any scripting language? Most have HTML parsers built in or available, like Python's html.parser and beautifulsoup, or Ruby's nokogiri. There are also a lot of crawler libraries out there if you want to script that instead of using wget.
If I were trying to do this I would probably do these steps:
1. Set up wget to run slowly and mirror the HTML files off of ProBoards (probably at least overnight and maybe a couple of days)
2. Write a script to parse the downloaded HTML and grab all of the "a"/"a href" elements, more if you want the poster name too
3. Remove any link that has proboards.com in it to get external links
That way you can tweak and retry your script without losing a bunch of time. Some message boards use a link interceptor to redirect outbound links somewhere else, which could mess up internal vs. external, but I didn't see that in ProBoards' HTML.
Check out www.linuxjournal.com/content/downloading-entire-web-site-wget - you'll want --random-wait and -w <seconds> too, to slow the crawl and add a little jitter. Maybe something like this:
wget --recursive -w 10 --random-wait --html-extension --convert-links --restrict-file-names=windows --domains odd74.proboards.com https://odd74.proboards.com
(Probably not exactly what you want but hopefully enough to get you started.)
If you want to keep this going for new posts, there are RSS feeds you could use: odd74.proboards.com/rss/public
I hope this helps. Good luck!
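Steps 2 and 3 could look roughly like this in Python, using only the standard library's html.parser. This is a sketch, not a tested scraper for ProBoards: the names LinkCollector, external_links, and scan_mirror are invented here, and it assumes the mirror is a folder of .html files and that "external" simply means "host is not proboards.com or a subdomain of it".

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def external_links(html, internal_domain="proboards.com"):
    """Return hrefs in html whose host is outside internal_domain."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    found = []
    for href in parser.links:
        host = (urlsplit(href).hostname or "").lower()
        if not host:
            continue  # relative links, anchors, mailto:, etc.
        if host == internal_domain or host.endswith("." + internal_domain):
            continue  # internal forum link
        found.append(href)
    return found


def scan_mirror(root):
    """Map each mirrored .html file under root to its external links."""
    results = {}
    for path in sorted(Path(root).rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        links = external_links(text)
        if links:
            results[str(path)] = links
    return results
```

Because the script only reads the files wget already saved, you can rerun and adjust it as often as you like without touching the server again. Pulling out the poster name would mean also tracking whichever elements wrap each post, which depends on ProBoards' markup and isn't attempted here.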
|
|
|
Post by captainjapan on Aug 13, 2021 9:16:47 GMT -6
Thanks, everyone, for the suggestions. @d4caltrops, thanks especially to you for taking the time to detail wget. I only have Windows 10 machines right now. I did install a Windows port of wget, but I couldn't get it to connect. I had investigated wget in the past for a similar project, and I know it is the go-to solution for site scraping.
What ended up working for me was an old program called Xenu's Link Sleuth. It is Windows-only and has a graphical interface. You can export a tab-separated text file from Xenu to read as a spreadsheet (see a spreadsheet of derv's recent outgoing links, here). I removed the proboards.com links from the spreadsheet so you only see what derv personally linked since around 2017, where the "recent" posts begin.
I could have been kinder to the ProBoards server, though. Xenu was set to crawl with a maximum of 50 parallel threads to a depth of 999 links. I have crawled the entire site this way. When I parse the output, I might post it to Links & Resources. Xenu crawled about 200,000 links on the odd74 boards last night. Mostly internal-pointing links. Plenty of old Photobucket dead ends, too.
If I did want to extract links in the context of a discussion thread, although I could only go as fast as it takes to read page by page, I could use a bookmarklet as talysman suggested. Here is a selection of bookmarklets that do useful things with hypertext links. I installed the one that lists all outgoing links on a page and also the one that color-codes links according to where they point. Thanks for the tip! All this time, I never knew what a bookmarklet was. Or how easy it is to install! (Unless you're on Microsoft's Edge browser.)
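Removing the proboards.com rows from the link checker's tab-separated export can also be scripted instead of done by hand in the spreadsheet. A rough sketch in Python with the standard csv module; the column names "Address" and "Status" are guesses for illustration, so check the header row of the actual export and pass the right name as url_column:

```python
import csv
import io
from urllib.parse import urlsplit


def outbound_rows(tsv_text, url_column="Address", internal_domain="proboards.com"):
    """Return the rows of a tab-separated export that point outside the forum.

    url_column is a hypothetical header name; the real export may differ.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        url = (row.get(url_column) or "").strip()
        host = (urlsplit(url).hostname or "").lower()
        if not host:
            continue  # empty cell or not an absolute URL
        if host == internal_domain or host.endswith("." + internal_domain):
            continue  # internal forum link
        kept.append(row)
    return kept
```

Each kept row is a dict keyed by the export's own headers, so any status column survives and can be used afterwards to separate live links from dead ones.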
|
|
|
Post by Deleted on Aug 13, 2021 9:45:03 GMT -6
@derv: See your pm's. (I know it doesn't need the apostrophe, but it looks better that way.)
|
|
|
Post by Deleted on Aug 13, 2021 9:45:33 GMT -6
@captainjapan: Thanks for the derv link.
|
|