[Python Basics] 11. Text Processing and I/O in Depth

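1. A file contains words separated by spaces, semicolons, commas, or periods. Extract all the words.

Solutions:

Match and extract words with \w, but this misjudges some cases (for example, it breaks "don't" apart at the apostrophe)

Split the string with str.split, but this takes several passes, one per separator

Split the string with re.split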

In [4]: help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

In [23]: text = "i'm xj, i love Python,,Linux;   i don't like windows."

In [24]: fs = re.split(r"(,|\.|;|\s)+\s*", text)

In [25]: fs
Out[25]: 
["i'm", ' ', 'xj', ' ', 'i', ' ', 'love', ' ', 'Python', ',', 'Linux', ' ', 'i', ' ', "don't", ' ', 'like', ' ', 'windows', '.', '']

In [26]: fs[::2]            # extract the words
Out[26]: 
["i'm", 'xj', 'i', 'love', 'Python', 'Linux', 'i', "don't", 'like', 'windows', '']

In [27]: fs[1::2]      # extract the separators (the group keeps only the last character of each run)
Out[27]: [' ', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', '.']

In [53]: fs = re.findall(r"[^,\.;\s]+", text)

In [54]: fs
Out[54]: ["i'm", 'xj', 'i', 'love', 'Python', 'Linux', 'i', "don't", 'like', 'windows']

In [55]: fh = re.findall(r'[,\.;\s]', text)

In [56]: fh
Out[56]: [' ', ',', ' ', ' ', ' ', ',', ',', ';', ' ', ' ', ' ', ' ', ' ', ' ', '.']
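
The three approaches listed for this problem can be compared side by side. The following is a minimal sketch, assuming Python 3 rather than the Python 2 session above. The `text` value is the sample from the session; the variable names are illustrative. Unlike the session, the re.split pattern here is a character class with no capturing group, so re.split returns only the words and not the separators.

```python
import re

text = "i'm xj, i love Python,,Linux;   i don't like windows."

# Approach 1: \w+ treats the apostrophe as a separator,
# so "i'm" and "don't" are broken into two pieces each.
by_word_chars = re.findall(r"\w+", text)

# Approach 2: str.split handles one separator at a time,
# so the other separators are first replaced by spaces.
normalized = text
for sep in ",;.":
    normalized = normalized.replace(sep, " ")
by_str_split = normalized.split()

# Approach 3: re.split on a character class; the trailing empty
# string produced by the final period is filtered out.
by_re_split = [word for word in re.split(r"[,.;\s]+", text) if word]

print(by_word_chars)
print(by_str_split)
print(by_re_split)
```

Approaches 2 and 3 give the same word list here; approach 3 does it in a single pass.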

2. A directory holds a number of files. Find all the C source files in it (.c and .h).

Solutions:

List the directory with os.listdir

Test each name with str.endswith

In [13]: s = "xj.c"

In [14]: s.endswith(".c")
Out[14]: True

In [15]: s.endswith(".h")
Out[15]: False

In [16]: import os

In [17]: os.listdir("/usr/include/")
Out[17]: 
['libmng.h', 'netipx', 'ft2build.h', 'FlexLexer.h', 'selinux', 'QtSql', 'resolv.h', 'gio-unix-2.0', 'wctype.h', 'python2.6', 'scsi',
 ...
 'QtOpenGL', 'mysql', 'byteswap.h', 'xj.c', 'mntent.h', 'semaphore.h', 'stdio_ext.h', 'libxml2']

In [21]: for filename in os.listdir("/usr/include"):
   ....:     if filename.endswith(".c"):
   ....:         print filename
   ....:         
xj.c

In [22]: for filename in os.listdir("/usr/include"):
   ....:     if filename.endswith((".c", ".h")):          # a tuple of suffixes means "or": any one of them may match
   ....:         print filename
   ....:         
libmng.h
ft2build.h
FlexLexer.h
nss.h
png.h
utime.h
ieee754.h
features.h
xj.c
.
.
.
verto-module.h
semaphore.h
stdio_ext.h
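
The listdir plus endswith combination can be wrapped in a small helper. This is a minimal sketch, assuming Python 3; the function name `find_c_sources` and the sample file names are made up for illustration. It also adds an os.path.isfile check, which the session above does not do, so that a directory whose name happens to end in .h is skipped.

```python
import os
import tempfile


def find_c_sources(directory):
    """Return the sorted names of regular files in `directory` ending in .c or .h."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith((".c", ".h"))
        and os.path.isfile(os.path.join(directory, name))
    )


# Use a throwaway directory so the example does not depend on /usr/include.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("xj.c", "stdio_ext.h", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "fake.h"))  # a directory, not a header file
    found = find_c_sources(tmp)

print(found)  # ['stdio_ext.h', 'xj.c']
```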

3. The fnmatch module

It supports the same wildcards as the shell.

In [24]: help(fnmatch.fnmatch)           # case sensitivity follows the operating system
Help on function fnmatch in module fnmatch:

fnmatch(name, pat)
    Test whether FILENAME matches PATTERN.

    Patterns are Unix shell style:

    *       matches everything
    ?       matches any single character
    [seq]   matches any character in seq
    [!seq]  matches any char not in seq

    An initial period in FILENAME is not special.
    Both FILENAME and PATTERN are first case-normalized
    if the operating system requires it.
    If you don't want this, use fnmatchcase(FILENAME, PATTERN).

In [47]: fnmatch.fnmatch("sba.txt", "*txt")
Out[47]: True

In [48]: fnmatch.fnmatch("sba.txt", "*t")
Out[48]: True

In [49]: fnmatch.fnmatch("sba.txt", "*b")
Out[49]: False

In [50]: fnmatch.fnmatch("sba.txt", "*b*")
Out[50]: True

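Example: you have a program that processes files. The file names come from user input, and you need to support the same wildcards as the shell.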

[root@Node3 src]# cat test1.py 
#!/usr/local/bin/python2.7
#coding: utf-8
import os
import sys
from fnmatch import fnmatch

# sys.argv[1] is the directory to list, sys.argv[2] is the shell-style pattern
ret = [name for name in os.listdir(sys.argv[1]) if fnmatch(name, sys.argv[2])]
print ret
[root@Node3 src]# python2.7 test1.py /usr/include/ '*.c'
['xj.c']
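
Two other functions in the same module are worth knowing: fnmatch.filter, which applies a pattern to a whole list, and fnmatch.fnmatchcase, which the help text above mentions as the variant that skips case normalization. A minimal sketch, assuming Python 3; the list of names is invented for illustration.

```python
import fnmatch

names = ["xj.c", "README.TXT", "sba.txt", "stdio.h"]

# fnmatch.filter applies one pattern to a whole list of names;
# the [ch] character class matches either extension.
c_files = fnmatch.filter(names, "*.[ch]")

# fnmatchcase never case-normalizes, so "README.TXT" does not match
# "*.txt" on any platform.
lower_txt = [name for name in names if fnmatch.fnmatchcase(name, "*.txt")]

print(c_files)
print(lower_txt)
```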

4. re.sub(): text replacement

In [53]: help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.

Example: a text contains dates in %m/%d/%Y format, and you need to convert all of them to %Y-%m-%d.

In [55]: text = "Today is 11/08/2016, next class time 11/15/2016"

In [56]: new_text = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text )      # groups: \1 = month, \2 = day, \3 = year

In [57]: new_text
Out[57]: 'Today is 2016-11-08, next class time 2016-11-15'
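
The help text notes that repl may also be a callable that receives the match object. A minimal sketch of that form, assuming Python 3; the names `DATE` and `to_iso` are illustrative. It uses named groups so the month/day/year order is explicit, and it zero-pads single-digit months and days, which a plain backreference string cannot do.

```python
import re

# Named groups make it obvious which number is which.
DATE = re.compile(r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})")


def to_iso(match):
    """Rebuild one matched %m/%d/%Y date as %Y-%m-%d, zero-padding month and day."""
    return "{}-{:0>2}-{:0>2}".format(
        match.group("year"), match.group("month"), match.group("day")
    )


text = "Today is 11/08/2016, next class time 11/15/2016"
new_text = DATE.sub(to_iso, text)
print(new_text)  # Today is 2016-11-08, next class time 2016-11-15
```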

5. str.format: string formatting

In [71]: help(str.format)
Help on method_descriptor:

format(...)
    S.format(*args, **kwargs) -> string

    Return a formatted version of S, using substitutions from args and kwargs.
    The substitutions are identified by braces ('{' and '}').

Example: you need to build a small template engine. It needs no control logic, but it must fill the template in from variables.

In [18]: text = "{name} has {n} messages"

In [19]: new_text = text.format(name = "xj", n = 17)

In [20]: text
Out[20]: '{name} has {n} messages'

In [22]: new_text
Out[22]: 'xj has 17 messages'

In [29]: text = "%s has %s messages"

In [30]: print text % ("xj", 17)
xj has 17 messages
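
For the template-engine use case, the variables usually arrive as a dictionary rather than as keyword arguments. A minimal sketch, assuming Python 3; the helper name `render` is hypothetical and not part of the original post.

```python
template = "{name} has {n} messages"


def render(template, variables):
    """Fill `template` from a dict of variables; a missing name raises KeyError."""
    return template.format(**variables)


rendered = render(template, {"name": "xj", "n": 17})
print(rendered)  # xj has 17 messages

# The template string itself is never modified, so it can be reused.
again = render(template, {"name": "guest", "n": 0})
print(again)  # guest has 0 messages
```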

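6. StringIO: a pseudo-file object that lets a string stand in for a file object

Example: there is a function worker designed to process file objects. You have some strings and want worker to process them.

Solutions:

Write the string to a file, then process the file with worker    # involves disk I/O, inefficient

Use the StringIO module    # wraps the string in a pseudo-file object, so no real file is needed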

In [3]: import json

In [4]: from StringIO import StringIO

In [6]: StringIO.               # offers the usual file attributes and methods
StringIO.close       StringIO.read        StringIO.truncate
StringIO.flush       StringIO.readline    StringIO.write
StringIO.getvalue    StringIO.readlines   StringIO.writelines
StringIO.isatty      StringIO.seek        
StringIO.next        StringIO.tell        

In [6]: type(StringIO)
Out[6]: classobj

In [7]: data = {'a':1, 'b':[2, 3, 4]}

In [8]: io = StringIO()

In [9]: json.d
json.decoder  json.dump     json.dumps    

In [9]: io.
io.buf         io.getvalue    io.read        io.tell
io.buflist     io.isatty      io.readline    io.truncate
io.close       io.len         io.readlines   io.write
io.closed      io.next        io.seek        io.writelines
io.flush       io.pos         io.softspace   

In [9]: type(io)
Out[9]: instance

In [10]: json.dump(data, io)

In [11]: print io.getvalue()
{"a": 1, "b": [2, 3, 4]}
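
The session above only writes into the StringIO object. The worker scenario, in which an existing string is handed to a function that expects a file, can be sketched as follows. This assumes Python 3, where the class lives in the io module instead of a separate StringIO module; `worker` is a hypothetical stand-in, since the original post does not show its implementation.

```python
import io
import json


def worker(fileobj):
    """Hypothetical function written against file objects: count non-empty lines."""
    return sum(1 for line in fileobj if line.strip())


# Reading: wrap an existing string so worker can iterate over it like a file.
line_count = worker(io.StringIO("first\n\nsecond\nthird\n"))

# Writing: json.dump writes into the in-memory buffer instead of a file on disk.
buffer = io.StringIO()
json.dump({"a": 1, "b": [2, 3, 4]}, buffer)
serialized = buffer.getvalue()

print(line_count)
print(serialized)
```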

Reposted from: https://blog.51cto.com/xiexiaojun/1870832