Why doesn't Nokogiri load the full page?
I'm using Nokogiri to open Wikipedia pages for various countries, and then extracting the names of those countries in other languages from the interwiki links (links to foreign-language Wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links I need. How can I force it to download everything?
Here's my code:
    require 'open-uri'
    require 'nokogiri'

    url = "http://en.wikipedia.org/wiki/" + country_name
    page = nil
    begin
      page = Nokogiri::HTML(open(url))
    rescue OpenURI::HTTPError => e
      puts "No article found for " + country_name
    end
    language_part = page.css('div#p-lang')
Test:

With country_name = "france":

    => []

With country_name = "thailand":

    => (a long array I don't want to quote here, containing the right data)
Maybe the issue goes beyond Nokogiri and OpenURI; in any case, I need to find a solution.
Nokogiri does not retrieve the page; it asks OpenURI to do it with an internal read on the StringIO object that Open::URI returns.
    require 'open-uri'
    require 'zlib'

    stream = open('http://en.wikipedia.org/wiki/france')
    if (stream.content_encoding.empty?)
      body = stream.read
    else
      body = Zlib::GzipReader.new(stream).read
    end

    p body
Here's what you can key off of:
    >> require 'open-uri' #=> true
    >> open('http://en.wikipedia.org/wiki/france').content_encoding #=> ["gzip"]
    >> open('http://en.wikipedia.org/wiki/thailand').content_encoding #=> []
In this case, if it's [], a.k.a. "text/html", it reads; if it's ["gzip"], it decodes.
Doing all the stuff above, and then tossing it to:
    require 'nokogiri'

    page = Nokogiri::HTML(body)
    language_part = page.css('div#p-lang')
should get you back on track.
Do this after the above to confirm visually that you're getting something usable:
p language_part.text.gsub("\t", '')
See Casper's answer and the comments for why you saw two different results. Open-URI looked inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers, but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should fix.
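For convenience, the whole flow can be wrapped in one method. This is just a minimal sketch of the approach above, not part of the original answer; the fetch_html name is mine, and on Ruby 2.5+ you would call URI.open instead of the bare open shown here.

    require 'open-uri'
    require 'zlib'
    require 'nokogiri'

    # Hypothetical helper wrapping the steps above: fetch the page,
    # gunzip the body only when the server reports gzip encoding,
    # then hand the decoded HTML to Nokogiri.
    def fetch_html(url)
      stream = open(url)  # URI.open(url) on Ruby 2.5+
      body = if stream.content_encoding.include?('gzip')
               Zlib::GzipReader.new(stream).read
             else
               stream.read
             end
      Nokogiri::HTML(body)
    end

    page = fetch_html('http://en.wikipedia.org/wiki/France')
    p page.css('div#p-lang').text.gsub("\t", '')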