93 commits
6646159
python3 does not support string exceptions
sandsmark Jul 21, 2019
a9360af
dos2unix, missing )
sandsmark Jul 21, 2019
fcdc5bd
run 2to3 on crawl.py
sandsmark Jul 21, 2019
8dcab9e
dos2unix on the rest
sandsmark Jul 21, 2019
68e9b67
2to3 on the rest
sandsmark Jul 21, 2019
3f5fbd0
tabs to spaces
sandsmark Jul 21, 2019
3142188
don't kill stdout and stderr
sandsmark Jul 21, 2019
19c0bc8
os.errno doesn't exist anymore in python3, but exist_ok does
sandsmark Jul 21, 2019
a69ef7c
port to git, mercurial doesn't have a python3 API (at least not stabl…
sandsmark Jul 21, 2019
8d9232d
use proper path separators
sandsmark Jul 21, 2019
4990468
don't need the mercurial monkeypatching anymore
sandsmark Jul 21, 2019
110415d
'better' commit message when no message from author
sandsmark Jul 21, 2019
134fc1b
update readme explaining it now uses git
sandsmark Jul 21, 2019
f397b8c
.hgignore -> .gitignore
sandsmark Jul 21, 2019
3e6e9f1
actually commit changes
sandsmark Jul 21, 2019
61838a1
handle more than 250 pages
sandsmark Jul 21, 2019
c63bf3b
cache fetched pages
sandsmark Jul 27, 2019
92a0510
less debug spam, fix exception for python3 compatibility
sandsmark Jul 27, 2019
149136d
missing declaration
sandsmark Jul 27, 2019
dd0738a
less debug spam, fix skipping already fetched
sandsmark Jul 27, 2019
335f1c8
check for .git when checking if there's an existing repo
sandsmark Jul 27, 2019
f6bd4e7
fix renames
sandsmark Jul 27, 2019
974ddb0
fix cleanup, store fetched IDs so we don't fetch again later
sandsmark Jul 28, 2019
7bf42b3
more verbose output when returned json fails to parse
sandsmark Jul 28, 2019
04e9324
fix storing/skipping of already fetched revisions
sandsmark Jul 28, 2019
ab82f22
track renames with symlinks
sandsmark Jul 28, 2019
bf568d5
avoid making the terminal backlog useless when scraping scp
sandsmark Jul 28, 2019
8850e83
Revert "track renames with symlinks"
sandsmark Jul 28, 2019
d129d91
track redirect pages correctly
sandsmark Jul 28, 2019
23e6412
control debug spam
sandsmark Jul 28, 2019
b40b560
more cleaning of debug output
sandsmark Jul 29, 2019
5a90bb1
fix dates in commits
sandsmark Jul 29, 2019
e487cdd
let commit date be the current datetime, it makes more sense
sandsmark Jul 29, 2019
94fa6ae
python doesn't have this already?
sandsmark Aug 4, 2019
f9175f3
retry in case of gateway errors, which seem to be semi-frequent and q…
sandsmark Aug 4, 2019
7070b38
disable removing state tracking files, we want them if we continously…
sandsmark Aug 4, 2019
d103e4d
improve tracking of created files (not entirely sure why it didn't wo…
sandsmark Aug 4, 2019
1955c28
persist metadata (renames etc.) in the git repo
sandsmark Aug 4, 2019
3d08cc2
add dependencies to readme
sandsmark Jul 14, 2020
b29c0ee
python's time.clock() is gone
sandsmark Jul 14, 2020
bcf3240
bs (appropriate name) apparently has changed its API
sandsmark Jul 14, 2020
817c4d0
avoid double / in the URLs
sandsmark Jul 14, 2020
fd918ad
print URLs we fetch in debug output
sandsmark Jul 14, 2020
2fcec56
try to have more robust fetching (longer waiting on errors)
sandsmark Jul 14, 2020
c4d907e
extract list of embedded images
sandsmark Jul 14, 2020
8d0a5ee
fix image downloading (TODO: make it add them in the right commit, no…
sandsmark Jul 14, 2020
2eabf6b
add comment explaining why we can't get the images in the right commit
sandsmark Jul 14, 2020
8afada4
fuck python, this suddendly didn't work on my server
sandsmark Jul 17, 2020
74923b1
re-try in case of json errors, seems like they are spurious
sandsmark Jul 17, 2020
9ed8019
less debug spam
sandsmark Jul 18, 2020
49da2cf
don't need to wait for a download slot if we're not downloading
sandsmark Jul 18, 2020
e6da377
better use of named parameters and stuff
sandsmark Jul 18, 2020
c2479c9
add progress bars with tqdm
sandsmark Jul 18, 2020
511c6eb
remove unused
sandsmark Jul 18, 2020
9c43ea7
don't be dumb, use sets, massive speedup
sandsmark Jul 18, 2020
e2a763b
fix default argument
sandsmark Jul 30, 2020
5cc5fdb
fix status output
sandsmark Jul 30, 2020
674bdda
fix check for existing repo
sandsmark Jul 30, 2020
2f1a9f6
add todo
sandsmark Jul 30, 2020
858b7ed
404 for images is not fatal
sandsmark Jul 30, 2020
29e1791
fix
sandsmark Jul 30, 2020
d24acca
fix relative path
sandsmark Jul 30, 2020
938bb68
add timeouts
sandsmark Jul 30, 2020
e0d9e4b
python's time.sleep is in seconds, not milliseconds
sandsmark Jul 30, 2020
b01c253
support for skipping select revisions
sandsmark Jul 30, 2020
b52dc93
time how long a download takes, remove invalid images (usually 404 er…
sandsmark Jul 30, 2020
bae3660
add tags to todo
sandsmark Jul 30, 2020
4dc15fd
improve debug output
sandsmark Jul 30, 2020
251b706
skip updating parent history if not actually changed
sandsmark Jul 30, 2020
4ec2b3b
mention added images in commit message
sandsmark Jul 30, 2020
22a3f1e
added some dependencies
sandsmark Jul 30, 2020
d38d514
avoid retrying images that we know are invalid (i. e. not temporary d…
sandsmark Jul 30, 2020
1b3f608
implement tag handling, not tested
sandsmark Jul 30, 2020
26e8977
bump default delay, be nice
sandsmark Jul 30, 2020
b991fe7
doh
sandsmark Jul 30, 2020
c8c8ed8
typo
sandsmark Jul 30, 2020
44a5fc1
handle timeouts
sandsmark Jul 30, 2020
404a1e4
make some errors that should be fatal fatal
sandsmark Jul 30, 2020
e0b27c3
avoid so long delays, it usually recovers immediately
sandsmark Jul 30, 2020
cf88384
simplify
sandsmark Jul 30, 2020
c9b7f53
fix initial fetch
sandsmark Jul 30, 2020
9b526bf
don't always cleanup
sandsmark Jul 31, 2020
cdcb096
better throttling when requests fail
sandsmark Jul 31, 2020
d3eeb75
fix starting from scratch
sandsmark Aug 2, 2020
f6cfe01
fix path to revid file
sandsmark Aug 2, 2020
2d96bf8
ignore minor error
sandsmark Aug 2, 2020
b8cd79f
annoying
sandsmark Aug 2, 2020
6be1b90
move code around
sandsmark Aug 2, 2020
77490e2
start on forum scraping support
sandsmark Aug 2, 2020
b197beb
support for skipping entire pages (for pages that fail for some reason)
sandsmark Aug 22, 2020
603d164
fix support for skipping multiple revisions
sandsmark Aug 22, 2020
3524f54
fix
sandsmark Aug 22, 2020
f23b0ff
fix robustness when downloading images
sandsmark Aug 22, 2020
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.pyc
__pycache__
2 changes: 0 additions & 2 deletions .hgignore

This file was deleted.

239 changes: 121 additions & 118 deletions crawl.py
@@ -1,118 +1,121 @@
import argparse
import sys
import locale
import codecs
import os
from wikidot import Wikidot
from rmaint import RepoMaintainer

# TODO: Files.
# TODO: Forum and comment pages.
# TODO: Ability to download new transactions since last dump.
# We'll probably check the last revision time, then query all transactions and select those with greater revision time (not equal, since we would have downloaded equals at the previous dump)

rawStdout = sys.stdout
rawStderr = sys.stderr
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout, 'xmlcharrefreplace')
sys.stderr = codecs.getwriter(locale.getpreferredencoding())(sys.stderr, 'xmlcharrefreplace')

parser = argparse.ArgumentParser(description='Queries Wikidot')
parser.add_argument('site', help='URL of Wikidot site')
# Actions
parser.add_argument('--list-pages', action='store_true', help='List all pages on this site')
parser.add_argument('--source', action='store_true', help='Print page source (requires --page)')
parser.add_argument('--content', action='store_true', help='Print page content (requires --page)')
parser.add_argument('--log', action='store_true', help='Print page revision log (requires --page)')
parser.add_argument('--dump', type=str, help='Download page revisions to this directory')
# Debug actions
parser.add_argument('--list-pages-raw', action='store_true')
parser.add_argument('--log-raw', action='store_true')
# Action settings
parser.add_argument('--page', type=str, help='Query only this page')
parser.add_argument('--depth', type=int, default='10000', help='Query only last N revisions')
parser.add_argument('--revids', action='store_true', help='Store last revision ids in the repository')
# Common settings
parser.add_argument('--debug', action='store_true', help='Print debug info')
parser.add_argument('--delay', type=int, default='200', help='Delay between consequent calls to Wikidot')
args = parser.parse_args()


wd = Wikidot(args.site)
wd.debug = args.debug
wd.delay = args.delay


def force_dirs(path):
    try:
        os.makedirs(path)
    except OSError as exception:
        if exception.errno != os.errno.EEXIST:
            raise

if args.list_pages_raw:
    print wd.list_pages_raw(args.depth)

elif args.list_pages:
    for page in wd.list_pages(args.depth):
        print page

elif args.source:
    if not args.page:
        raise "Please specify --page for --source."

    page_id = wd.get_page_id(args.page)
    if not page_id:
        raise "Page not found: "+args.page

    revs = wd.get_revisions(page_id, 1) # last revision
    print wd.get_revision_source(revs[0]['id'])

elif args.content:
    if not args.page:
        raise "Please specify --page for --source."

    page_id = wd.get_page_id(args.page)
    if not page_id:
        raise "Page not found: "+args.page

    revs = wd.get_revisions(page_id, 1) # last revision
    print wd.get_revision_version(revs[0]['id'])

elif args.log_raw:
    if not args.page:
        raise "Please specify --page for --log."

    page_id = wd.get_page_id(args.page)
    if not page_id:
        raise "Page not found: "+args.page

    print wd.get_revisions_raw(page_id, args.depth)


elif args.log:
    if not args.page:
        raise "Please specify --page for --log."

    page_id = wd.get_page_id(args.page)
    if not page_id:
        raise "Page not found: "+args.page
    for rev in wd.get_revisions(page_id, args.depth):
        print unicode(rev)


elif args.dump:
    print "Downloading pages to "+args.dump
    force_dirs(args.dump)

    rm = RepoMaintainer(wd, args.dump)
    rm.debug = args.debug
    rm.storeRevIds = args.revids
    rm.buildRevisionList([args.page] if args.page else None, args.depth)
    rm.openRepo()

    print "Downloading revisions..."
    while rm.commitNext():
        pass

    rm.cleanup()
    print "Done."
import argparse
import sys
import locale
import codecs
import os
from wikidot import Wikidot
from rmaint import RepoMaintainer

# TODO: Files.
# TODO: Forum and comment pages.
# TODO: Ability to download new transactions since last dump.
# We'll probably check the last revision time, then query all transactions and select those with greater revision time (not equal, since we would have downloaded equals at the previous dump)

parser = argparse.ArgumentParser(description='Queries Wikidot')
parser.add_argument('site', help='URL of Wikidot site')
# Actions
parser.add_argument('--list-pages', action='store_true', help='List all pages on this site')
parser.add_argument('--max-page-count', type=int, default='10000', help='Only list/fetch up to this amount of pages')
parser.add_argument('--source', action='store_true', help='Print page source (requires --page)')
parser.add_argument('--content', action='store_true', help='Print page content (requires --page)')
parser.add_argument('--log', action='store_true', help='Print page revision log (requires --page)')
parser.add_argument('--dump', type=str, help='Download page revisions to this directory')
# Debug actions
parser.add_argument('--list-pages-raw', action='store_true')
parser.add_argument('--log-raw', action='store_true')
# Action settings
parser.add_argument('--page', type=str, help='Query only this page')
parser.add_argument('--depth', type=int, default='10000', help='Query only last N revisions')
parser.add_argument('--revids', action='store_true', help='Store last revision ids in the repository', default=True)
parser.add_argument('--skip', type=str, help='Skip the specified revisions (comma-separated)')
parser.add_argument('--skip-pages', type=str, help='Skip the specified pages (comma-separated)')
parser.add_argument('--cleanup', action='store_true', help='Clean up after downloading repo')
# Common settings
parser.add_argument('--debug', action='store_true', help='Print debug info')
parser.add_argument('--delay', type=int, default='200', help='Delay between consecutive calls to Wikidot')
args = parser.parse_args()


wd = Wikidot(args.site)
wd.debug = args.debug
wd.delay = args.delay


def force_dirs(path):
    os.makedirs(path, exist_ok=True)

# argparse stores --max-page-count as args.max_page_count
if args.list_pages_raw:
    print(wd.list_pages_raw(limit=args.max_page_count))

elif args.list_pages:
    for page in wd.list_pages(limit=args.max_page_count):
        print(page)

elif args.source:
    if not args.page:
        raise Exception("Please specify --page for --source.")

    page_id = wd.get_page_id(page_unix_name=args.page)
    if not page_id:
        raise Exception("Page not found: "+args.page)

    revs = wd.get_revisions(page_id, 1) # last revision
    print(wd.get_revision_source(revs[0]['id']))

elif args.content:
    if not args.page:
        raise Exception("Please specify --page for --content.")

    page_id = wd.get_page_id(page_unix_name=args.page)
    if not page_id:
        raise Exception("Page not found: "+args.page)

    revs = wd.get_revisions(page_id, 1) # last revision
    print(wd.get_revision_version(revs[0]['id']))

elif args.log_raw:
    if not args.page:
        raise Exception("Please specify --page for --log.")

    page_id = wd.get_page_id(page_unix_name=args.page)
    if not page_id:
        raise Exception("Page not found: "+args.page)

    print(wd.get_revisions_raw(page_id, args.depth))


elif args.log:
    if not args.page:
        raise Exception("Please specify --page for --log.")

    page_id = wd.get_page_id(page_unix_name=args.page)
    if not page_id:
        raise Exception("Page not found: "+args.page)
    for rev in wd.get_revisions(page_id, args.depth):
        print(str(rev))


elif args.dump:
    print("Downloading pages to "+args.dump)
    force_dirs(args.dump)

    rm = RepoMaintainer(wd, args.dump)
    rm.debug = args.debug
    rm.storeRevIds = args.revids
    rm.max_depth = args.depth
    rm.max_page_count = args.max_page_count
    rm.buildRevisionList([args.page] if args.page else None)
    rm.openRepo()

    if args.skip_pages:
        rm.pages_to_skip = args.skip_pages.split(",")
    if args.skip:
        rm.revs_to_skip = args.skip.split(",")

    print("Downloading revisions")
    rm.fetchAll()

    if args.cleanup:
        rm.cleanup()

    print("Done.")
50 changes: 0 additions & 50 deletions hgpatch.py

This file was deleted.

80 changes: 50 additions & 30 deletions readme.md
@@ -1,30 +1,50 @@
This is a Python command line client for relatively popular wiki hosting http://www.wikidot.com which lets you:

* List all pages on a site
* See all revisions of a page
* Query page source

Most interestingly, it allows you to download the whole site as a Mercurial repository, with proper commit dates and comments!

##### Examples:

crawl.py http://example.wikidot.com --dump ExampleRepo
crawl.py http://example.wikidot.com --log --page example-page

It uses internal Wikidot AJAX requests to do it's job. If you're from Wikidot, please don't break it. Thank you! We'll try to be nice and not put a load on your servers.

Downloading of large sites might take a while. If anything breaks, just restart the same command, it'll continue from where it crashed.

##### Useful links:

Wikidot code (very old) which simplifies things a bit:

* https://github.com/gabrys/wikidot/blob/master/php/modules/history/PageRevisionListModule.php

The descriptions for on-site modules are heavily correlated with AJAX ones:

* http://www.wikidot.com/doc-modules:listpages-module

Someone else did Wikidot AJAX:

* https://github.com/kerel-fs/ogn-rdb/blob/master/wikidotcrawler.py
*This is a fork to make a permanent backup of the SCP wiki.*

This is a Python command-line client for the relatively popular wiki hosting service
http://www.wikidot.com, which lets you:

* List all pages on a site
* See all revisions of a page
* Query page source

Most interestingly, it allows you to download the whole site as a Git repository, with proper commit dates, authors, and comments!
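
As a rough illustration of how a wiki revision's author and date can end up in a Git commit, here is a minimal sketch. It is not the code in `rmaint.py` (which uses GitPython); it only builds the standard `GIT_AUTHOR_*` environment variables that the `git` command line reads. The function name `commit_env` and the placeholder e-mail domain are assumptions made for this example.

```python
import os
from datetime import datetime, timezone


def commit_env(author, authored_at, base_env=None):
    """Build the environment for a `git commit` call so that the commit
    records the wiki author and the revision time as author metadata."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_AUTHOR_NAME"] = author
    # Hypothetical placeholder address: the revision data is assumed to
    # carry a user name only, and Git wants an e-mail as well.
    env["GIT_AUTHOR_EMAIL"] = author.replace(" ", "_") + "@wikidot.invalid"
    # Git's internal date format: "<unix timestamp> <utc offset>".
    timestamp = authored_at.astimezone(timezone.utc).timestamp()
    env["GIT_AUTHOR_DATE"] = "%d +0000" % timestamp
    return env
```

The result could be passed as `env=` to `subprocess.run(["git", "commit", "-m", message], env=...)`. Only the author date is set here, so the committer date stays at the time the scraper ran.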

##### Dependencies

At least:

* Python 3
* python-beautifulsoup4
* python-gitpython
* python-requests
* python-tqdm
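
A quick way to check that the dependencies are installed before starting a long dump might look like this. It is a sketch only: the import names (`bs4`, `git`, `requests`, `tqdm`) are the modules these packages normally provide, and the helper name `missing_modules` is made up for this example.

```python
import importlib.util

# Import names assumed to correspond to the packages listed above.
REQUIRED_MODULES = ("bs4", "git", "requests", "tqdm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```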

##### Examples:

crawl.py http://example.wikidot.com --dump ExampleRepo
crawl.py http://example.wikidot.com --log --page example-page

It uses internal Wikidot AJAX requests to do its job. If you're from Wikidot, please don't break it. Thank you! We'll try to be nice and not put a load on your servers.
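
Being nice to the servers roughly means pausing between calls and backing off when a request fails. The sketch below shows one way to do that. It is not the code in `wikidot.py`; the name `call_with_retry`, the linear back-off, and the choice of which errors count as transient are all assumptions.

```python
import time


def call_with_retry(request, delay_ms=200, retries=5, sleep=time.sleep):
    """Call request() politely: wait before every attempt and wait longer
    after each failure. Re-raises the last error when retries run out."""
    for attempt in range(1, retries + 1):
        # time.sleep takes seconds, so convert from milliseconds.
        sleep(delay_ms * attempt / 1000.0)
        try:
            return request()
        except (IOError, ValueError):
            # Assumption: I/O errors (which include the exceptions raised
            # by requests) and unparsable JSON (ValueError) are transient.
            if attempt == retries:
                raise
```

The `sleep` parameter exists only so the waiting can be replaced in tests. In normal use, `call_with_retry(lambda: fetch_something())` is enough, where `fetch_something` stands for whatever performs the request.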

Downloading large sites might take a while. If anything breaks, just restart the same command; it will continue from where it crashed.
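
The idea behind continuing after a crash can be sketched as follows: keep a set of revision IDs that are already fetched, save it after every revision, and skip those IDs on the next run. This is an illustration only; the JSON state file and the function names are assumptions, not the format the scraper actually writes.

```python
import json
import os


def load_fetched(path):
    """Return the set of revision IDs recorded by an earlier run."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_fetched(path, fetched):
    """Write the set atomically so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(fetched), f)
    os.replace(tmp, path)


def fetch_missing(rev_ids, path, fetch):
    """Fetch every revision not yet recorded, saving progress after each one."""
    fetched = load_fetched(path)
    for rev_id in rev_ids:
        if rev_id in fetched:  # set lookup, cheap even for large histories
            continue
        fetch(rev_id)
        fetched.add(rev_id)
        save_fetched(path, fetched)
    return fetched
```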

##### Useful links:

Wikidot code (very old) which simplifies things a bit:

* https://github.com/gabrys/wikidot/blob/master/php/modules/history/PageRevisionListModule.php

The descriptions for on-site modules are heavily correlated with AJAX ones:

* http://www.wikidot.com/doc-modules:listpages-module

Someone else did Wikidot AJAX:

* https://github.com/kerel-fs/ogn-rdb/blob/master/wikidotcrawler.py


#### TODO

- Handle deleted images. When an image is removed from one page, we probably need to check the diff and then check all other pages for remaining references.
- Handle tags (both added and removed).
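
For the tags item, one possible starting point is to compare the tag lists of two consecutive revisions as sets. This is a sketch under the assumption that tags are available as plain lists of strings; `tag_changes` is a made-up name.

```python
def tag_changes(old_tags, new_tags):
    """Return (added, removed) tags between two revisions of a page."""
    old, new = set(old_tags), set(new_tags)
    return sorted(new - old), sorted(old - new)
```

The two lists could then be mentioned in the commit message, in the same way added images are.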
