11 changes: 10 additions & 1 deletion lib/staticizer/crawler.rb
@@ -164,6 +164,16 @@ def save_page_to_aws(response, uri)
# Upload this file directly to AWS::S3
opts = {:acl => "public-read"}
opts[:content_type] = response['content-type'] rescue "text/html"

# Detect a meta-redirect and set an S3 hosting redirect metadata item
if response =~ /META http-equiv='refresh' content='0;URL="(.*)"/

Member

I think there are likely some common cases where this regex would fail. For example:

<meta http-equiv="refresh" content="0;url=http://example.com" />

or

<meta http-equiv="refresh" content="2;url='http://example.com'" />

I'll merge this request and then likely modify this to catch a wider range of meta redirects.
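
As a rough sketch of what a more tolerant pattern could look like (an illustration only, not the change that was eventually made), the following matches the tag case-insensitively, allows either quote style around the content attribute, accepts any delay, and strips optional quotes around the URL. The names `META_REFRESH` and `meta_refresh_target` are hypothetical and do not exist in staticizer.

```ruby
# Hypothetical pattern: tolerant of case, quote style, delay value and
# optional quotes around the URL inside the content attribute.
META_REFRESH = /<meta\s+http-equiv\s*=\s*["']?refresh["']?\s+content\s*=\s*(["'])\s*\d+\s*;\s*url\s*=\s*(.*?)\s*\1/i

# Hypothetical helper: returns the redirect target, or nil if the HTML
# contains no meta refresh tag that the pattern recognises.
def meta_refresh_target(html)
  match = META_REFRESH.match(html)
  return nil unless match

  # Drop a single leading and trailing quote around the URL, if present.
  match[2].gsub(/\A["']|["']\z/, "")
end
```

This still assumes `http-equiv` comes immediately before `content` in the tag; an HTML parser would be a more robust choice if the attribute order can vary.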

Member

Oh I see, my bad, this is just a little hack to use the existing process_redirect. Got it! :)

location = $1
if location =~ /^(?:[^\/]|http:\/\/|https\:\/\/).*/

Member

This prepends a slash if the location starts with http or https; is that needed for S3 redirects? Also, wouldn't this redirect to the wrong place if the location is not absolute? For example, if we are at http://www.google.com/section/page1 and that page has a meta refresh to url='page2', this would redirect to /page2 instead of /section/page2.
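
One possible way to address both points, sketched under assumptions: resolve the refresh target against the URL of the page being saved with Ruby's `URI.join`, then keep only the path when the target stays on the same host. S3 documents the website redirect location as a value beginning with `/`, `http://`, or `https://`, so an absolute URL should not need a leading slash. The helper name `resolve_redirect_location` is hypothetical and not part of staticizer.

```ruby
require "uri"

# Hypothetical helper: turn a meta refresh target into a value suitable
# for an S3 website redirect location.
def resolve_redirect_location(page_url, location)
  # URI.join resolves relative references against the current page,
  # so "page2" on /section/page1 becomes /section/page2.
  resolved = URI.join(page_url, location)
  page = URI.parse(page_url)

  if resolved.host == page.host
    # Same host: a root-relative path (plus any query string) is enough.
    resolved.request_uri
  else
    # Different host: keep the absolute URL unchanged.
    resolved.to_s
  end
end
```

With the example from the comment, `resolve_redirect_location("http://www.google.com/section/page1", "page2")` gives `/section/page2` rather than `/page2`.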

location.prepend('/')
end
opts[:website_redirect_location] = location
end

@log.info "Uploading #{key} to s3 with content type #{opts[:content_type]}"
if response.respond_to?(:read_body)
body = process_body(response.read_body, uri, opts)
@@ -197,7 +207,6 @@ def process_success(response, parsed_uri)
end

# If we hit a redirect we save the redirect as a meta refresh page
# TODO: for AWS S3 hosting we could instead create a redirect?
def process_redirect(url, destination_url)
body = "<html><head><META http-equiv='refresh' content='0;URL=\"#{destination_url}\"'></head><body>You are being redirected to <a href='#{destination_url}'>#{destination_url}</a>.</body></html>"
save_page(body, url)