Why doesn't Nokogiri load the full page?
I'm using Nokogiri to open Wikipedia pages for various countries, and then extracting the names of those countries in other languages from the interwiki links (links to foreign-language Wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links I need. How can I force it to download everything?
Here's my code:
    require 'open-uri'
    require 'nokogiri'

    url = "http://en.wikipedia.org/wiki/" + country_name
    page = nil
    begin
      page = Nokogiri::HTML(open(url))
    rescue OpenURI::HTTPError => e
      puts "No article found for " + country_name
    end
    language_part = page.css('div#p-lang')
Test:

With country_name = "france":

    => []

With country_name = "thailand":

    => (a long array I don't want to quote here, containing the right data)
Maybe the issue goes beyond Nokogiri and OpenURI; in any case, I need to find a solution.
Nokogiri does not retrieve the page; it asks OpenURI to do it with an internal read on the StringIO object that Open::URI returns.
    require 'open-uri'
    require 'zlib'

    stream = open('http://en.wikipedia.org/wiki/france')
    if (stream.content_encoding.empty?)
      body = stream.read
    else
      body = Zlib::GzipReader.new(stream).read
    end

    p body
Here's what you can key off of:
    >> require 'open-uri' #=> true
    >> open('http://en.wikipedia.org/wiki/france').content_encoding #=> ["gzip"]
    >> open('http://en.wikipedia.org/wiki/thailand').content_encoding #=> []
In this case, if it's [], a.k.a. "text/html", it reads; if it's ["gzip"], it decodes.
Doing all the stuff above, and then tossing it to:
    require 'nokogiri'

    page = Nokogiri::HTML(body)
    language_part = page.css('div#p-lang')
should get you back on track.
Do this after the above to confirm visually that you're getting something usable:
p language_part.text.gsub("\t", '')
See Casper's answer and the comments for why you saw two different results. Open-URI looked inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers, but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should fix.
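For convenience, the whole flow can be wrapped in one method. This is just a minimal sketch of the approach above, not part of the original answer; the fetch_html name is mine, and on Ruby 2.5+ you would call URI.open instead of the bare open shown here.

    require 'open-uri'
    require 'zlib'
    require 'nokogiri'

    # Hypothetical helper wrapping the steps above: fetch the page,
    # gunzip the body only when the server reports gzip encoding,
    # then hand the decoded HTML to Nokogiri.
    def fetch_html(url)
      stream = open(url)  # URI.open(url) on Ruby 2.5+
      body = if stream.content_encoding.include?('gzip')
               Zlib::GzipReader.new(stream).read
             else
               stream.read
             end
      Nokogiri::HTML(body)
    end

    page = fetch_html('http://en.wikipedia.org/wiki/France')
    p page.css('div#p-lang').text.gsub("\t", '')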