tech support - link crawling | Original D&D Discussion

captainjapan
Level 8 Warlock

Posts: 887

tech support - link crawling Aug 11, 2021 8:35:43 GMT -6

Post by captainjapan on Aug 11, 2021 8:35:43 GMT -6

I am by no means proficient with databases or network backends, but I thought I might like to generate a list of outbound links from the odd74 forum. Example: the user derv writes a lot of original material, which he links directly from his forum posts. Suppose derv links out to his Dropbox account. I would want a listing of every outbound link on the odd74 forum so I could browse derv's games (I realize that derv retires his links, but still...). Or talysman, who has an extensive house rules system on his blog that he occasionally links to. Perhaps I could generate a listing of talysman's links to get to his blog faster. I'm not trying to single out derv and talysman, necessarily. Many posters, past and present, write their own material. Much of it stretches back almost to the beginning of the boards. With veteran members writing thousands of posts, only a relative handful of which contain hyperlinks, I think it would be useful to have a way to search.

I don't know the terminology for what I'm asking. Is it called scraping, or crawling, or parsing...? Also, is this kind of activity from members blocked by ProBoards' servers?

derv
Regular Patron

"Not all who wander are lost."

Posts: 2,296

tech support - link crawling Aug 11, 2021 11:24:43 GMT -6

Post by derv on Aug 11, 2021 11:24:43 GMT -6

Not sure I'll be of any help with your request. I do know you can search all of a member's posts. There are a number of filters that can help refine the results.

In my case, most of the material I share relates to whatever discussion is ongoing on the forum at the time. As you mentioned, I retire the links after a short time unless they encourage further discussion. Once the conversation peters out, I see little reason to keep the links active. But in most cases I will share the material if someone asks. There are only a couple of docs I truly retired with the intent of revisiting later.

Oh, and I don’t have a blog to promote, so all my conversation on gaming basically happens here.

Last Edit: Aug 11, 2021 11:27:32 GMT -6 by derv

"War is a game which were their subjects wise, kings would not play at."
-William Cowper

"War does not determine who is right- only who is left."
-Bertrand Russell

The Primary Rule: "Nothing can be done contrary to what could or would be done in actual war."
-Fred T. Jane

"There is only one rule to our war game: simulate reality."
-Michael F. Korn

"Only the dead have seen the end of war."
-George Santayana

talysman
Level 9 Sorcerer

Posts: 1,796

tech support - link crawling Aug 11, 2021 13:45:58 GMT -6

Post by talysman on Aug 11, 2021 13:45:58 GMT -6

I assume you're asking for a feature in ProBoards that will aggregate links, but I don't know that they have such a thing.

You could try using wget on specific sections of the forum, but setting that up might be a nightmare. I haven't used wget in at least 15 years and have forgotten most of it, so I can't really advise you, but it seems to me that if you're just collecting links that point elsewhere, you'd still have to run it through some kind of script (perl, python, etc.) to process it, extracting links and saving them to a file.

Another option is to use a bookmarklet that finds all the links on the current webpage and copies them to a popup window, where you can copy them to the clipboard. This, unfortunately, is manual... you'd have to go to threads one at a time and run the bookmarklet, copy the links, save them to a file, and edit the file to get rid of links you don't want, duplicate links, and so on.

waysoftheearth
Global Moderator

Posts: 4,183

tech support - link crawling Aug 11, 2021 16:29:59 GMT -6

Post by waysoftheearth on Aug 11, 2021 16:29:59 GMT -6

You could start by going into the search feature, and searching for all posts with "http" in the content. That will give you a very long list of posts with links in them.

Then, you could start narrowing the list down by adding more criteria, such as the subboard, and the post author. So, then you'd have (for example) a list of posts by talysman in the U&WA subforum that have links in them.

From there you would still have to manually eyeball each returned post, pull out the link, and decide whether it still points to anything useful or not.

It would take a while, but it's doable if you're keen.

Last Edit: Aug 11, 2021 17:37:53 GMT -6 by waysoftheearth

Deleted
Deleted Member

Posts: 0

tech support - link crawling Aug 11, 2021 19:59:54 GMT -6

Post by Deleted on Aug 11, 2021 19:59:54 GMT -6

I might be able to be useful!

I've heard scraping, crawling, mirroring, and archiving all used to describe the download step. You'll want some parsing too, parsing the downloaded HTML so that you can pull the links out. You could try a simpler string match but it'll become a headache quickly so I would just go to a parser.
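To show why a parser tends to be less of a headache than string matching, here is a minimal sketch using only Python's standard-library html.parser. The sample markup is made up for illustration and is not copied from a real ProBoards page; the names LinkCollector and extract_links are my own.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us, so
        # <A HREF=...> and <a href=...> are handled the same way.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return every anchor href found in a chunk of HTML, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Invented sample markup (an assumption, not real forum HTML).
sample = """
<div class="message">
  See my notes <A HREF='https://example.com/house-rules'>here</A>
  and the <a class="x" href="/thread/123">earlier thread</a>.
  A bare mention of http://example.org/not-a-link is not an anchor.
</div>
"""

print(extract_links(sample))
```

A plain search for "http" would miss the relative link, pick up the bare URL in the text, and have to cope with both quote styles and the uppercase tag. The parser deals with all of that for you.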

talysman's suggestion of wget is good if you can install it on your OS. wget has a random delay mode and a fixed time delay mode so that your crawl can be polite to the server.

odd74.proboards.com/robots.txt is the conventional place for a site to publish its crawler rules. Some webservers still don't like crawlers such as wget, but if you run it slowly and don't hammer the server you won't do any harm, though you might get some requests rejected. It looks like ProBoards tells a couple of eager search engine crawlers to slow down and keeps crawlers off certain sections of the site, but not the posts, so you shouldn't be making them mad.
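If you would rather check the rules from a script than read them by eye, Python's standard library has urllib.robotparser. The sketch below parses a made-up rules file: the rules, the paths, and the "my-link-lister" user agent are all invented for illustration, and the real robots.txt will differ.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only; the real file will differ.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
Disallow: /user/
"""


def build_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS)
agent = "my-link-lister"  # hypothetical user-agent name

# A thread-style path is not covered by any Disallow line above.
print(rules.can_fetch(agent, "https://odd74.proboards.com/thread/1/example"))
# The search path is disallowed in the sample rules.
print(rules.can_fetch(agent, "https://odd74.proboards.com/search"))
# Seconds the sample rules ask crawlers to wait between requests.
print(rules.crawl_delay(agent))
```

To read the live file instead of a pasted copy, call set_url() with the robots.txt address and then read() on the parser before asking can_fetch().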

Are you comfortable with any scripting language? Most have HTML parsers built in or available, like Python's html.parser and Beautiful Soup, or Ruby's Nokogiri. There are also a lot of crawler libraries out there if you want to script that instead of using wget.

If I were trying to do this I would probably do these steps:

1. Set up wget to run slowly and mirror the HTML files off of ProBoards (probably at least overnight and maybe a couple of days)
2. Write a script to parse the downloaded HTML and grab the href attribute of every "a" element, and more if you want the poster name too
3. Remove any link that has proboards.com in it to get external links

That way you can tweak and retry your script without losing a bunch of time.
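Steps 2 and 3 could be sketched roughly like this in Python, using only the standard library. It assumes the mirror is a folder of .html/.htm files, which is what the wget options further down should give you. The function names and the demo page are invented, and I have not tried this against a real ProBoards mirror.

```python
import os
import tempfile
from html.parser import HTMLParser
from urllib.parse import urlparse

FORUM_DOMAIN = "proboards.com"


class HrefParser(HTMLParser):
    """Gather href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def is_external(href, forum_domain=FORUM_DOMAIN):
    """True for http(s) links whose host is not the forum or a subdomain of it."""
    parts = urlparse(href)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False  # relative links, mailto:, javascript:, and so on
    host = parts.hostname
    return not (host == forum_domain or host.endswith("." + forum_domain))


def external_links_in_mirror(mirror_dir):
    """Walk a mirrored site and return the sorted, de-duplicated external links."""
    found = set()
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            with open(os.path.join(root, name), encoding="utf-8", errors="replace") as fh:
                parser = HrefParser()
                parser.feed(fh.read())
            found.update(h for h in parser.hrefs if is_external(h))
    return sorted(found)


# Tiny self-contained demo using a throwaway folder and an invented page.
with tempfile.TemporaryDirectory() as mirror:
    with open(os.path.join(mirror, "thread.html"), "w", encoding="utf-8") as fh:
        fh.write('<a href="https://odd74.proboards.com/thread/2">internal</a>'
                 '<a href="/user/5">relative</a>'
                 '<a href="https://example.com/house-rules">blog</a>')
    demo_links = external_links_in_mirror(mirror)

print(demo_links)
```

One deliberate difference from step 3 as written: this compares the link's hostname rather than looking for "proboards.com" anywhere in the URL. That way an outside link that merely mentions the forum in its path or query string is still kept.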

Sometimes some message boards use a link interceptor to redirect outbound links somewhere else, which could mess up internal vs. external but I didn't see that in ProBoards' HTML.

Check out www.linuxjournal.com/content/downloading-entire-web-site-wget - you'll want --random-wait and -w <seconds> too, to slow the crawl and add a little jitter. Maybe something like this:

wget --recursive -w 10 --random-wait --html-extension --convert-links --restrict-file-names=windows --domains odd74.proboards.com  https://odd74.proboards.com

(Probably not exactly what you want but hopefully enough to get you started.)

If you want to keep this going for new posts, there are RSS feeds you could use: odd74.proboards.com/rss/public
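Reading a feed from a script might look something like the sketch below. I have not inspected the actual ProBoards feed, so the sample XML is invented and the RSS 2.0 layout (item elements with title, link, and description children) is an assumption you would need to check against the real thing.

```python
import xml.etree.ElementTree as ET

# Invented sample in plain RSS 2.0 shape; the real feed may be laid out differently.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>tech support - link crawling</title>
      <link>https://odd74.proboards.com/thread/0/example</link>
      <description>&lt;a href="https://example.com/rules"&gt;rules&lt;/a&gt;</description>
    </item>
  </channel>
</rss>
"""


def feed_items(feed_xml):
    """Return one dict per <item>, with its title, link, and description text."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in feed_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```

If the description turns out to carry the post body as HTML, it can be run through the same kind of HTML parser as the mirrored pages to pull out any outbound links.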

I hope this helps

Good luck

captainjapan
Level 8 Warlock

Posts: 887

tech support - link crawling Aug 13, 2021 9:16:47 GMT -6

Post by captainjapan on Aug 13, 2021 9:16:47 GMT -6

Thanks, everyone, for the suggestions.

@d4caltrops,

Thanks especially to you for taking the time to detail wget. I only have Windows 10 machines right now. I did install a Windows port of wget, but I couldn't get it to connect. I had investigated wget in the past for a similar project, and I know it is the go-to solution for site scraping. What ended up working for me was an old program called Xenu's Link Sleuth. It is Windows-only and has a graphical interface. You can export a tab-separated text file from Xenu to read as a spreadsheet (see a spreadsheet of derv's recent outgoing links, here). I removed the proboards.com links from the spreadsheet so you only see what derv personally linked since around 2017, where the "recent" posts begin. I could have been kinder to the ProBoards server, though. Xenu was set to crawl with a maximum of 50 parallel threads to a depth of 999 links. I have crawled the entire site this way. When I parse the output, I might post it to Links & Resources. Xenu crawled about 200,000 links on the odd74 boards last night, mostly internal-pointing links. Plenty of old Photobucket dead ends, too.
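Dropping the proboards.com rows from a tab-separated export can also be scripted instead of done by hand in the spreadsheet. In the sketch below the column name "Address" and the sample rows are assumptions made for illustration. Check the header row of the actual export and adjust url_column to match.

```python
import csv
import io
from urllib.parse import urlparse

# Invented sample export; real column names and values may differ.
SAMPLE_EXPORT = (
    "Address\tStatus\n"
    "https://odd74.proboards.com/thread/1\tok\n"
    "https://example.com/game.pdf\tok\n"
    "http://photobucket.example/img.jpg\tnot found\n"
)


def external_rows(tsv_text, url_column="Address", forum_domain="proboards.com"):
    """Keep only rows whose URL points somewhere other than the forum's domain."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        host = urlparse(row[url_column]).hostname or ""
        if not host:
            continue  # skip rows without a usable absolute URL
        if host == forum_domain or host.endswith("." + forum_domain):
            continue  # internal forum link
        kept.append(row)
    return kept


for row in external_rows(SAMPLE_EXPORT):
    print(row["Address"], row["Status"])
```

For a real export, read the file's text with open(...).read() and pass it in. The dead links stay in the result, so they can be sorted out afterward by the status column.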

If I did want to extract links in the context of a discussion thread, I could use a bookmarklet as talysman suggested, although I could only go as fast as it takes to read page by page. Here is a selection of bookmarklets that do useful things with hypertext links. I installed the one that lists all outgoing links on a page and also the one that color-codes links according to where they point. Thanks for the tip! All this time, I never knew what a bookmarklet was, or how easy it is to install one! (Unless you're on Microsoft's Edge browser.)

Deleted
Deleted Member

Posts: 0

tech support - link crawling Aug 13, 2021 9:45:03 GMT -6

Post by Deleted on Aug 13, 2021 9:45:03 GMT -6

Aug 11, 2021 11:24:43 GMT -6 derv said:

Not sure I'll be of any help with your request. I do know you can search all of a member's posts. There are a number of filters that can help refine the results.

In my case, most of the material I share relates to whatever discussion is ongoing on the forum at the time. As you mentioned, I retire the links after a short time unless they encourage further discussion. Once the conversation peters out, I see little reason to keep the links active. But in most cases I will share the material if someone asks. There are only a couple of docs I truly retired with the intent of revisiting later.

Oh, and I don’t have a blog to promote, so all my conversation on gaming basically happens here.

See your pm's. (I know it doesn't need the apostrophe, but it looks better that way.)

Deleted
Deleted Member

Posts: 0

tech support - link crawling Aug 13, 2021 9:45:33 GMT -6

Post by Deleted on Aug 13, 2021 9:45:33 GMT -6

Thanks for the derv link.