Problem with decoding UTF-8 JSON in Perl
UTF-8 characters are destroyed when processed with the JSON library (this may be similar to "Problem with decoding unicode JSON in perl", but setting binmode there only creates another problem).
I have reduced the problem down to the following example:
    (hlovdal) localhost:/tmp/my_test>cat my_test.pl
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use JSON;
    use File::Slurp;
    use Getopt::Long;
    use Encode;

    my $set_binmode = 0;
    GetOptions("set-binmode" => \$set_binmode);

    if ($set_binmode) {
        binmode(STDIN,  ":encoding(UTF-8)");
        binmode(STDOUT, ":encoding(UTF-8)");
        binmode(STDERR, ":encoding(UTF-8)");
    }

    # Report whether Perl's internal UTF8 flag is set on the string.
    sub check {
        my $text = shift;
        return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
    }

    my $my_test = "hei på deg";
    my $json_text = read_file('my_test.json');
    my $hash_ref = JSON->new->utf8->decode($json_text);

    print check($my_test), "\$my_test = $my_test\n";
    print check($json_text), "\$json_text = $json_text";
    print check($$hash_ref{'my_test'}), "\$\$hash_ref{'my_test'} = " . $$hash_ref{'my_test'} . "\n";
    (hlovdal) localhost:/tmp/my_test>
When running the test, the text is for some reason crippled into ISO-8859-1. Setting binmode sort of solves it, but then causes double encoding of the other strings.
    (hlovdal) localhost:/tmp/my_test>cat my_test.json
    { "my_test" : "hei på deg" }
    (hlovdal) localhost:/tmp/my_test>file my_test.json
    my_test.json: UTF-8 Unicode text
    (hlovdal) localhost:/tmp/my_test>hexdump -c my_test.json
    0000000   {       "   m   y   _   t   e   s   t   "       :       "   h
    0000010   e   i       p 303 245       d   e   g   "       }  \n
    000001e
    (hlovdal) localhost:/tmp/my_test>
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl
    is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg
    is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
    is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
    is_utf8(): 0, is_utf8(1): 0. $my_test = hei pÃ¥ deg
    is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei pÃ¥ deg" }
    is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
    (hlovdal) localhost:/tmp/my_test>
What is causing this, and how can I solve it?
This is on a newly installed and up-to-date Fedora 15 system.
    (hlovdal) localhost:/tmp/my_test>perl --version | grep version
    This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux-thread-multi
    (hlovdal) localhost:/tmp/my_test>rpm -q perl-JSON
    perl-JSON-2.51-1.fc15.noarch
    (hlovdal) localhost:/tmp/my_test>locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
    (hlovdal) localhost:/tmp/my_test>
Update: Adding use utf8 does not solve it; the characters are still not processed right (although differently than before):
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl
    is_utf8(): 1, is_utf8(1): 1. $my_test = hei p� deg
    is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
    is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
    is_utf8(): 1, is_utf8(1): 1. $my_test = hei på deg
    is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei pÃ¥ deg" }
    is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
    (hlovdal) localhost:/tmp/my_test>
As noted in perlunifaq:
Can I use Unicode in my Perl sources?
Yes, you can! If your sources are UTF-8 encoded, you can indicate that with the use utf8 pragma.
    use utf8;
This doesn't do anything to your input, or to your output. It only influences the way your sources are read. You can use Unicode in string literals, in identifiers (but they still have to be "word characters" according to \w), and even in custom delimiters.
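To make the quoted point concrete, here is a minimal sketch of how use utf8 changes the way a string literal is read. It assumes the script itself is saved as UTF-8; the variable names $with and $without are only illustrative.

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;           # lexically scoped: literals in this block are read as characters
    $with = 'på';       # 2 characters: "p" and "å"
}
{
    no utf8;            # literals in this block are read as raw bytes
    $without = 'på';    # 3 bytes: "p" plus the two UTF-8 bytes of "å"
}

# Print only ASCII, so no output layer is needed for this check.
print "with use utf8: ", length($with), ", without: ", length($without), "\n";
```

Neither literal is changed on output by the pragma; encoding what you print is a separate step, done with binmode or Encode.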
You saved your program in UTF-8, but forgot to tell Perl. Add use utf8;.
Also, your program is more complicated than it needs to be. The JSON functions DWYM (do what you mean). To inspect the strings, use Devel::Peek.
    use strict;
    use warnings;
    use utf8; # for the following line
    my $my_test = 'hei på deg';

    use Devel::Peek qw(Dump);
    use File::Slurp qw(read_file);
    use JSON qw(decode_json);

    # decode_json takes the UTF-8 bytes from the file and returns character strings.
    my $hash_ref = decode_json(read_file('my_test.json'));

    Dump $hash_ref; # Perl character strings
    Dump $my_test;  # Perl character string
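Putting the pieces together, a sketch of the whole round trip might look like the following. It is not the original script: it uses the core JSON::PP module (assumed here to behave like JSON's decode_json for this input) and builds the bytes of my_test.json inline with Encode instead of reading the file, so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # the source file is saved as UTF-8
use JSON::PP qw(decode_json);   # core module, stand-in for JSON in this sketch
use Encode qw(encode);

# Encode character strings exactly once, at the output boundary.
binmode(STDOUT, ':encoding(UTF-8)');

# A character string, because of "use utf8".
my $my_test = 'hei på deg';

# Stand-in for the raw UTF-8 bytes that read_file('my_test.json') would return.
my $json_bytes = encode('UTF-8', '{ "my_test" : "hei på deg" }');

# decode_json expects UTF-8 bytes and returns Perl character strings.
my $hash_ref = decode_json($json_bytes);

print "literal: $my_test\n";
print "decoded: $hash_ref->{my_test}\n";
print "equal:   ", ($my_test eq $hash_ref->{my_test} ? "yes" : "no"), "\n";
```

The raw $json_text from the question is still bytes, which is why printing it through the :encoding(UTF-8) layer double-encodes it in the --set-binmode runs. Print decoded character strings through the layer, and leave undecoded bytes out of it.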