A Thorough Guide to Encoding and Decoding in py2.x/py3.x

I. Conclusions

1. In memory, text is held as unicode.

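2. When data is saved to disk or sent over a network, it is first converted to bytes using an encoding such as utf-8.

3. The encoding declared on the first (or second) line of a Python file tells the Python interpreter how to decode that file.

4. Python has two types for storing strings, bytes and unicode. They are called str and unicode in py2.x, and bytes and str in py3.x.

5. In both py2.x and py3.x, a plain string literal has type str. In py2.x that means bytes; in py3.x it means unicode.

6. When no encoding is declared at the top of the file, py2.x decodes source files as ASCII and py3.x decodes them as UTF-8. sys.getdefaultencoding() returns the matching values ('ascii' and 'utf-8'), although strictly speaking it reports the default encoding for implicit bytes/unicode conversions rather than the source file encoding.

7. Encoding and decoding: encode() converts unicode to bytes, and decode() converts bytes back to unicode.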

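8. Before a program runs, the IDE saves the source to disk using the encoding set in the document's properties. The Python interpreter then reads that binary sequence from disk, decodes it to unicode using the encoding declared at the top of the file, and loads it into memory.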
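
The round trip described in conclusions 2 and 7 can be sketched in py3.x syntax as follows. This is a minimal illustration, not part of the original article; the variable names are chosen only for this example.

```python
import sys

text = '中国hello'                  # unicode text in memory (str in py3.x)

utf8_bytes = text.encode('utf-8')   # encode: unicode -> bytes
gbk_bytes = text.encode('gbk')      # same text, different byte sequence

# decode: bytes -> unicode, using the encoding the bytes were written with
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

# Number of characters versus number of bytes in each encoding
print(len(text), len(utf8_bytes), len(gbk_bytes))
print(sys.getdefaultencoding())
```

The same text yields byte sequences of different lengths depending on the encoding, which is why the decoder has to know which encoding was used.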

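II. Concepts

Encoding: the process of converting human text and symbols into binary sequences that a computer can store and process.

Decoding: the process of converting binary sequences stored in a computer back into text and symbols that people can read.

ASCII: one byte per character. The highest bit is unused, and the remaining 7 bits give a character set of 128 characters.

unicode: a character set that assigns a unique number (code point) to the characters of the world's writing systems.

utf-8: a variable-length encoding of unicode. ASCII characters take 1 byte, most European characters take 2 bytes, and common East Asian characters take 3 bytes (characters outside the Basic Multilingual Plane take 4).

GBK: the Chinese national standard encoding. A Chinese character takes 2 bytes.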
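
The byte widths listed above can be checked with a short py3.x sketch. The sample characters are chosen for illustration and do not come from the original article.

```python
# One sample each: ASCII, European, East Asian, and an emoji outside the BMP
samples = ['A', 'é', '中', '\U0001F600']

# Number of bytes each character occupies when encoded as utf-8
utf8_widths = {ch: len(ch.encode('utf-8')) for ch in samples}
print(utf8_widths)

# The same Chinese character occupies 2 bytes in GBK
gbk_width = len('中'.encode('gbk'))
print(gbk_width)

# Every ASCII byte fits in 7 bits, so the highest bit is always 0
assert all(byte < 128 for byte in 'hello'.encode('ascii'))
```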

III. py2.x

1. Checking the string type in py2.x.

Py2.x:
#coding:utf-8
s = '中国hello'
print s
print type(s)
print len(s)
print repr(s)
Output:
中国hello
<type 'str'>
11
'\xe4\xb8\xad\xe5\x9b\xbdhello'

The type is str, that is, bytes. len() returns the number of bytes.

Py2.x:
#coding:utf-8
s = u'中国hello'
print s
print type(s)
print len(s)
print repr(s)
Output:
中国hello
<type 'unicode'>
7
u'\u4e2d\u56fdhello'

The u prefix makes the literal a unicode string. \u4e2d is the unicode code point of the character '中'. len() returns the number of characters.

2. Converting between bytes and unicode.

Py2.x:
#coding:utf-8
s = '中国'
print type(s)
print len(s)

s2 = s.decode('utf-8')
print type(s2)
print len(s2)
Output:
<type 'str'>
6
<type 'unicode'>
2

3. Concatenating strings of different types.

Py2.x:
#coding:utf-8
print 'cisco'+u'google'
Output:
ciscogoogle
English text of the two types can be concatenated because in python2.x, as long as all the data is ASCII, the interpreter automatically converts bytes to unicode. But once a non-ASCII character sneaks into your program, the default decoding fails and raises UnicodeDecodeError. The python2.x approach makes handling ASCII simpler; the price is failure when handling non-ASCII data.
Py2.x:
#coding:utf-8
s = 'hello'+'china'
print s
print type(s)
print repr(s)
Output:
hellochina
<type 'str'>
'hellochina'
Checking the type that results from concatenating different types:
Py2.x:
#coding:utf-8
s = 'hello'+u'china'
print s
print type(s)
print repr(s)
Output:
hellochina
<type 'unicode'>
u'hellochina'
As shown, py2.x converted the bytes operand to unicode automatically.
Py2.x:
#coding:utf-8
print '中国'+'美国'
Output:
中国美国
Py2.x:
#coding:utf-8
print '中国'+u'美国'
Output:
Traceback (most recent call last):
  File "E:\python\study\test\index.py", line 14, in <module>
    print '中国'+u'美国'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
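
The implicit conversion that py2.x performs can be imitated explicitly in py3.x. The sketch below assumes the conversion is equivalent to decoding the bytes operand as ASCII; py2_style_concat is a hypothetical helper written for this illustration, not a real Python API.

```python
def py2_style_concat(byte_str, unicode_str):
    """Mimic py2.x: decode the bytes operand as ASCII, then concatenate."""
    return byte_str.decode('ascii') + unicode_str

# Pure ASCII data: the decode succeeds and the result is unicode
print(py2_style_concat(b'cisco', 'google'))

# Non-ASCII bytes: the ASCII decode fails, just like in the traceback above
try:
    py2_style_concat('中国'.encode('utf-8'), '美国')
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```

The first byte of the utf-8 encoding of '中' is 0xe4, which is outside the ASCII range, so the decode fails at position 0.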

In py3.x, the two types cannot be concatenated even when both contain only ASCII:

Py3.x:
#coding:utf-8
s1 = 'cisco'
print(type(s1))
s2 = b'google'
print(type(s2))
print(s1+s2)
Output:
<class 'str'>
<class 'bytes'>
Traceback (most recent call last):
  File "E:\python\study\test\index.py", line 6, in <module>
    print(s1+s2)
TypeError: can only concatenate str (not "bytes") to str
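
One way to avoid this TypeError in py3.x is to convert one operand explicitly so that both sides have the same type. A minimal sketch, assuming the data is ASCII as in the example above:

```python
s1 = 'cisco'
s2 = b'google'

# Option 1: decode the bytes, so both operands are str
as_text = s1 + s2.decode('ascii')

# Option 2: encode the str, so both operands are bytes
as_bytes = s1.encode('ascii') + s2

print(type(as_text), as_text)
print(type(as_bytes), as_bytes)
```

Which option fits depends on whether the surrounding code works with text or with raw bytes.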

IV. py3.x

1. Checking the string type in py3.x.

Py3.x:
#coding:utf-8
import json
s1 = '中国'
print(type(s1))
print(len(s1))
print(json.dumps(s1))
print(s1)

s2 = s1.encode('utf-8')
print(type(s2))
print(len(s2))
print(s2)
Output:
<class 'str'>
2
"\u4e2d\u56fd"
中国
<class 'bytes'>
6
b'\xe4\xb8\xad\xe5\x9b\xbd'

Py3.x:
#coding:utf-8
s = u'中'
print(s)
print(type(s))
print(len(s))
print(repr(s))
print(ord(s))
print(bin(ord(s)))
Output:
中
<class 'str'>
1
'中'
20013
0b100111000101101
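
The code point 20013 (0x4e2d) and the utf-8 bytes \xe4\xb8\xad seen earlier are related by simple bit packing. The following py3.x sketch shows the three-byte utf-8 layout by hand; it only covers code points in the range 0x0800 to 0xFFFF and is meant as an illustration, not a general encoder.

```python
code_point = ord('中')            # 20013, i.e. 0x4e2d

# Three-byte utf-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
byte1 = 0b11100000 | (code_point >> 12)              # top 4 bits
byte2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # middle 6 bits
byte3 = 0b10000000 | (code_point & 0b111111)         # low 6 bits

manual = bytes([byte1, byte2, byte3])
print(manual)

# The hand-built bytes match what encode() produces
print(manual == '中'.encode('utf-8'))
```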

2. Converting between bytes and unicode. Besides encode() and decode(), the bytes() and str() constructors can also be used:

Py3.x:
#coding:utf-8
s1 = '中国'
print(type(s1))

s2 = bytes(s1,encoding='utf-8')
print(type(s2))

s3 = str(s2,encoding='utf-8')
print(type(s3))
Output:
<class 'str'>
<class 'bytes'>
<class 'str'>
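
The two constructors give the same results as encode() and decode(), which the short py3.x sketch below checks. It also shows one pitfall worth knowing: calling str() on bytes without an encoding does not decode anything.

```python
s1 = '中国'

# bytes(text, encoding=...) is equivalent to text.encode(...)
raw = bytes(s1, encoding='utf-8')
assert raw == s1.encode('utf-8')

# str(data, encoding=...) is equivalent to data.decode(...)
assert str(raw, encoding='utf-8') == raw.decode('utf-8') == s1

# Without an encoding, str() returns the printable representation instead
as_repr = str(raw)
print(as_repr)
```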

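V. Common Errors

1. The default decoding in py2.x is ASCII and in py3.x it is utf-8. Once utf-8 is declared in py2.x, the two should behave the same. So why is the default string type bytes in py2.x but unicode in py3.x? If both are utf-8, shouldn't both be bytes? Or, since both are loaded into memory, shouldn't both be unicode?

Answer: The declared encoding only controls how the interpreter decodes the source file. The interpreter reads the file from disk, loads the whole source into memory as unicode, and then executes it statement by statement. When it encounters a string literal, the default str type is bytes in py2.x and unicode in py3.x. An explicit prefix overrides the default, as in u'china' in py2.x and b'google' in py3.x.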
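
The effect of the literal prefixes can be checked directly in py3.x. A minimal sketch (the u prefix is accepted in py3.3 and later and has no effect there):

```python
plain = 'china'        # no prefix: str (unicode) in py3.x
prefixed_u = u'china'  # u prefix: still str in py3.x
prefixed_b = b'google' # b prefix: bytes

print(type(plain), type(prefixed_u), type(prefixed_b))
```

In py2.x the same three literals would have the types str (bytes), unicode, and str (bytes) respectively.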

2. In py2.x, a source file containing Chinese characters fails if no utf-8 encoding is declared:

Py2.x:
s = '中国'
print s
Output:
File "E:\python\study\test\index.py", line 3
SyntaxError: Non-ASCII character '\xe4' in file E:\python\study\test\index.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

This happens because py2.x decodes source files as ASCII by default, while the file was saved as UTF-8. The two do not match, so the interpreter reports an error.
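
In py3.x, the standard library exposes the logic that reads the encoding declaration through tokenize.detect_encoding(). The sketch below uses it to show which encoding would be chosen for a given source; declared_encoding is a hypothetical wrapper written for this illustration.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding py3.x would use to decode this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding

# A declaration on the first line is honoured
print(declared_encoding(b"#coding:gbk\ns = 'x'\n"))

# With no declaration, py3.x falls back to utf-8
print(declared_encoding(b"s = 'x'\n"))
```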

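3. In py2.x the string s is a byte string, so why does print show readable text?

Py2.x:
>>> s = '中国'
>>> s
'\xd6\xd0\xb9\xfa'
>>> print s
中国
>>> print '\xd6\xd0\xb9\xfa'
中国

Echoing s at the prompt shows repr(s), the escaped bytes. print writes the raw bytes to standard output, and the console, which uses GBK here, decodes them into readable characters. This does not happen in py3.x, where printing a bytes object shows its repr: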

Py3.x:
#coding:utf-8
import json
s1 = '中国'
print(type(s1))
print(len(s1))
print(json.dumps(s1))
print(s1)

s2 = s1.encode('utf-8')
print(type(s2))
print(len(s2))
print(s2)
Output:
<class 'str'>
2
"\u4e2d\u56fd"
中国
<class 'bytes'>
6
b'\xe4\xb8\xad\xe5\x9b\xbd'

In the py3.x console:
>>> s = '中国'
>>> print(type(s))
<class 'str'>
>>> s
'中国'
>>> print(s)
中国
>>> s1 = bytes(s,encoding='gbk')
>>> s1
b'\xd6\xd0\xb9\xfa'
>>> print(s1)
b'\xd6\xd0\xb9\xfa'
>>>

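4. A file saved as utf-8 produces different output on Windows depending on the Python version:

#coding:utf-8
s = '中国'
print(s)

D:\Python37-32>python d:\index.py
中国
D:\Python27>python d:\index.py
涓浗

In py3.x the string is unicode, so it is converted to the console's encoding (GBK here) when written to cmd.exe and displays correctly. In py2.x the string is bytes: the two Chinese characters are 6 bytes of utf-8, and cmd.exe decodes those bytes as GBK, two bytes per character, which yields three garbled characters.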
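
The garbling can be reproduced in py3.x by decoding utf-8 bytes with the wrong codec. This is a sketch of the mechanism only; the exact replacement characters depend on how the GBK decoder handles byte pairs that have no mapping, so only the first character is relied on here.

```python
text = '中国'
utf8_bytes = text.encode('utf-8')   # the 6 bytes py2.x hands to the console

# Decode the utf-8 bytes as GBK; bytes with no GBK mapping are replaced
garbled = utf8_bytes.decode('gbk', errors='replace')
print(garbled)

# By contrast, bytes that really are GBK decode cleanly as GBK
readable = b'\xd6\xd0\xb9\xfa'.decode('gbk')
print(readable)
```

The first two utf-8 bytes, \xe4\xb8, happen to form a valid GBK character, which is why the garbled output starts with 涓.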

5. In py3.x, opening a utf-8 encoded file with open() in 'r' mode raises an error:

Py3.x:
#coding:utf-8

f = open('index.py','r')
print(f.read())
Output:
Traceback (most recent call last):
  File "E:\python\study\test\test.py", line 5, in <module>
    print(f.read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence
With 'rb' mode the output is:
b'\xe4\xb8\xad\xe5\x9b\xbd'

py2.x does not raise an error. There, read() on a file opened with open() returns str (which is bytes in py2.x), and the result must then be decoded with the correct encoding using decode().

Py2.x:
f = open('index.py','r')
s = f.read()
print type(s)
print len(s)
print s
Output:
<type 'str'>
6
中国
The text displays correctly, so the bytes were interpreted as utf-8 on output. What if GBK is declared instead?
Py2.x:
#coding:gbk
f = open('index.py','r')
s = f.read()
print s
The output is 涓浗. Changing the declaration to ASCII likewise produces an error.

You can confirm that py3.x uses GBK to decode (cp936 is the Windows code page for GBK, taken from the system locale):

Py3.x:
>>> f = open('index.py','r')
>>> f
<_io.TextIOWrapper name='index.py' mode='r' encoding='cp936'>
py2.x, however, does not show an encoding:
Py2.x:
>>> f = open(r'e:\python\study\test\index.py','r')
>>> s = f.read()
>>> f
<open file 'e:\\python\\study\\test\\index.py', mode 'r' at 0x016BD1D8>

In py3.x, pass the encoding explicitly with open('index.py','r',encoding='utf-8').
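
The three ways of reading the file can be compared side by side in py3.x. This sketch writes the same 6 bytes to a temporary file first so that it is self-contained; the file name sample.txt is an assumption made for the example.

```python
import os
import tempfile

# Write the two characters as utf-8 into a throwaway file
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write('中国'.encode('utf-8'))

# Binary mode: no decoding, read() returns bytes
with open(path, 'rb') as f:
    raw = f.read()

# Text mode with the correct encoding: read() returns str
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Text mode with the wrong encoding, as happens by default on cp936
try:
    with open(path, 'r', encoding='gbk') as f:
        f.read()
    gbk_failed = False
except UnicodeDecodeError:
    gbk_failed = True

print(raw, text, gbk_failed)
```

Passing encoding explicitly makes the result independent of the system locale, which is what differs between the Windows console in this article and most Linux or macOS setups.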
