[closed] urllib2.urlopen + unicode

pilot

I just can't get urllib2.urlopen to open this URL:

http://he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin
# urllib2.unquote(_)
'http://he.wikibooks.org/wiki/\xd7\x94\xd7\x92\xd7\x93\xd7\xa8\xd7\xaa_ClamWin'
# unicode(_)
u'http://he.wikibooks.org/wiki/\u05d4\u05d2\u05d3\u05e8\u05ea_ClamWin'
-- and quoting that one fails with a KeyError.

Is this normal, or am I misunderstanding something?
Python 2.5
// nothing works at all :(

pilot

Which representation exactly?
Each of them was obtained from the other, like this:

>>> st
'http://he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin'
>>> urllib2.unquote(st)
'http://he.wikibooks.org/wiki/\xd7\x94\xd7\x92\xd7\x93\xd7\xa8\xd7\xaa_ClamWin'
>>> _.encode('utf-8')
'http://he.wikibooks.org/wiki/\xd7\x94\xd7\x92\xd7\x93\xd7\xa8\xd7\xaa_ClamWin'
>>> unicode(_)
u'http://he.wikibooks.org/wiki/\u05d4\u05d2\u05d3\u05e8\u05ea_ClamWin'
>>> _.encode('utf-8')
'http://he.wikibooks.org/wiki/\xd7\x94\xd7\x92\xd7\x93\xd7\xa8\xd7\xaa_ClamWin'
>>> urllib2.quote(_)
'http%3A//he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin'

>>> un
u'http://he.wikibooks.org/wiki/\u05d4\u05d2\u05d3\u05e8\u05ea_ClamWin'
>>> urllib2.urlopen(un)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 418, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 499, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

>>> st
'http://he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin'
>>> urllib2.urlopen(st)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 418, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 499, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

>>> urllib2.unquote(st)
'http://he.wikibooks.org/wiki/\xd7\x94\xd7\x92\xd7\x93\xd7\xa8\xd7\xaa_ClamWin'
>>> urllib2.urlopen(_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 418, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 499, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>

>>> un
u'http://he.wikibooks.org/wiki/\u05d4\u05d2\u05d3\u05e8\u05ea_ClamWin'
>>> urllib2.quote(un)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 1205, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\u05d4'
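
The KeyError above comes from passing a unicode string to quote(), whose lookup table in 2.5 only holds single-byte characters, as the traceback line `map(safe_map.__getitem__, s)` suggests. A minimal sketch of one way around it, assuming the title should go out as UTF-8 (which matches the percent-encoded form in `st`): encode first, and quote only the path so that the scheme does not become `http%3A`. The helper name `quote_url_path` is made up for illustration, and the import fallback is there only so the snippet also runs on Python 3.

```python
try:  # Python 2 (the thread uses 2.5)
    from urllib import quote
    from urlparse import urlsplit, urlunsplit
except ImportError:  # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode only the path of a unicode URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Encode to UTF-8 bytes before quoting: Python 2's quote() expects a
    # byte string, and Python 3's quote() accepts bytes as well.
    path = quote(parts.path.encode('utf-8'), safe='/')
    # Scheme, host, query and fragment are passed through unchanged, so
    # 'http:' is not turned into 'http%3A'.
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


un = u'http://he.wikibooks.org/wiki/\u05d4\u05d2\u05d3\u05e8\u05ea_ClamWin'
quoted = quote_url_path(un)
# quoted == 'http://he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin'
```

The result is the same string as `st`, so this only fixes the quoting step; it does not by itself explain the 403.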


nikita270601

Here is the request Python sends:
GET /wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin HTTP/1.1
Accept-Encoding: identity
Host: he.wikibooks.org
Connection: close
User-Agent: Python-urllib/2.5

And here is the one Safari sends:
GET /wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko) Version/3.0.4 Safari/523.10.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: he.wikibooks.org

Apparently the folks there are mean and don't want scripts downloading their pages. Play around with the headers. Here is what I did:
>>> st = 'http://he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin'
>>> x = urllib2.Request(st,
...     None,
...     {"User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) AppleWebKit/523.10.6 (KHTML, like Gecko) Version/3.0.4 Safari/523.10.6",
...      "Accept": "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
...      "Accept-Encoding": "gzip, deflate"})
>>> urllib2.urlopen(x)
<addinfourl at 4366824 whose fp = <socket._fileobject object at 0x428b30>>
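
One side effect of the Request above: with "Accept-Encoding: gzip, deflate" the server is allowed to send a compressed body, and urllib2 does not decompress it for you. A rough sketch of handling that by hand, assuming you read the Content-Encoding response header yourself; `decode_body` is a hypothetical helper, not part of urllib2.

```python
import gzip
import io
import zlib


def decode_body(raw, content_encoding):
    """Return the decompressed response body (hypothetical helper)."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    if encoding == 'deflate':
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate
    return raw  # no Content-Encoding header, or an encoding not handled here


# Intended use (not executed here, it needs network access):
# resp = urllib2.urlopen(x)
# body = decode_body(resp.read(), resp.info().get('Content-Encoding'))
```

Leaving Accept-Encoding out of the request altogether avoids the issue.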

SPARTAK3959

Maybe Wikipedia doesn't support uncompressed responses.

nikita270601

I double-checked.
Without the User-Agent header it returns 403. With every other header removed and only User-Agent kept, everything works.
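
Given that result, the request can probably be trimmed to the single header. A minimal sketch of two ways to set it, per request or on an opener; it reuses the Safari string from above because that is the only value the thread confirms, and whether the server accepts other values is not tested here. The import fallback is only so the snippet also runs on Python 3.

```python
try:  # Python 2 (the thread uses 2.5)
    from urllib2 import Request, build_opener
except ImportError:  # Python 3
    from urllib.request import Request, build_opener

st = 'http://he.wikibooks.org/wiki/%D7%94%D7%92%D7%93%D7%A8%D7%AA_ClamWin'

# The browser string taken from the Safari request quoted earlier.
USER_AGENT = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-us) '
              'AppleWebKit/523.10.6 (KHTML, like Gecko) '
              'Version/3.0.4 Safari/523.10.6')

# Option 1: attach the header to a single request.
req = Request(st, None, {'User-Agent': USER_AGENT})

# Option 2: an opener that adds the header to every request it sends.
opener = build_opener()
# The default entry is where 'Python-urllib/<version>' comes from.
default_agent = dict(opener.addheaders).get('User-agent')
opener.addheaders = [('User-Agent', USER_AGENT)]

# Not executed here, since both calls need network access:
# page = urllib2.urlopen(req).read()
# page = opener.open(st).read()
```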

pilot

Great, thanks!
For some reason I assumed an incorrectly re-encoded URL was being sent. Without checking.