Python modules urllib and requests

HTTP request methods: GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, TRACE
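With urllib, a request defaults to GET (or POST when a body is supplied); any other method has to be named explicitly. A minimal sketch, built without sending anything over the network:

```python
import urllib.request

# Request objects default to GET; supplying data= implies POST;
# anything else must be set via the method argument.
get_req = urllib.request.Request('http://www.baidu.com')
post_req = urllib.request.Request('http://www.baidu.com', data=b'k=v')
head_req = urllib.request.Request('http://www.baidu.com', method='HEAD')

print(get_req.get_method(), post_req.get_method(), head_req.get_method())  # GET POST HEAD
```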

The urllib library

Installation: urllib ships with the Python standard library, so no separate installation is needed.

GET requests with urllib

Direct access, without request headers


import urllib.request
def get_page():
    url = 'http://www.baidu.com'
    result = urllib.request.urlopen(url=url).read().decode("utf-8")
    print(result)
if __name__ == '__main__':
    get_page()

Setting request headers

Change the User-Agent to mimic a real browser.

  1. The urllib.request.Request approach (highly extensible). Since urllib.request.urlopen() does not accept a headers parameter, build a urllib.request.Request object to carry the request headers.


    import urllib.request
    def get_page():
        url = 'http://www.baidu.com'
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
        }
        request = urllib.request.Request(url=url, headers=headers)
        result = urllib.request.urlopen(request).read().decode('utf-8')
        print(result)
    if __name__ == '__main__':
        get_page()
  2. The opener approach (see https://docs.python.org/zh-cn/3/library/urllib.request.html#urllib.request.build_opener). An opener is an instance of urllib.request.OpenerDirector; the urlopen() we have been calling all along is itself a special opener that the module builds for us. The basic urlopen() does not support proxies, cookies, or other advanced HTTP/HTTPS features, so we build our own opener.

    import urllib.request
    def get_page():
        url = "http://www.baidu.com"
        headers = (
            "User-Agent","Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
        ) # addheaders must be a list of (name, value) tuples; see the source: self.addheaders = [('User-agent', client_version)]
        opener = urllib.request.build_opener()
        opener.addheaders = [headers]
        result = opener.open(url).read().decode('utf-8')
        print(result)
    if __name__ == '__main__':
        get_page()

  3. Install a custom opener globally with install_opener(), replacing the default, so that plain urlopen() uses it from then on

    import urllib.request
    def get_page():
        url = 'http://www.baidu.com'
        headers = (
            'User-Agent', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        )  # a (name, value) tuple, not a dict: addheaders expects a list of tuples
        opener = urllib.request.build_opener()
        opener.addheaders = [headers]
        urllib.request.install_opener(opener)
        result = urllib.request.urlopen(url=url).read().decode('utf-8')
        print(result)

    if __name__ == '__main__':
        get_page()

Setting a proxy

  1. Add a proxy while also changing the User-Agent, using the ProxyHandler approach. class urllib.request.ProxyHandler(proxies=None) makes requests go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to proxy URLs. The default is to read the proxy list from the <protocol>_proxy environment variables. Use the opener in place of urlopen().

    import urllib.request
    def get_page(url):
        headers={
            'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        }
        request = urllib.request.Request(url=url, headers=headers)
        proxy ={
            'http':'*****:53512' # proxy address masked
        }
        proxy_handler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(proxy_handler)
        urllib.request.install_opener(opener)

        result = urllib.request.urlopen(request).read().decode('utf-8')
        print(result)
    if __name__ == '__main__':
        get_page("http://www.baidu.com")

POST requests with urllib

POST is mostly used for things like logins or dynamically loaded content; see the previous post: ==>
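Since the previous post isn't linked here, a minimal urllib POST sketch (URL and form fields are placeholders): the form dict is urlencoded into bytes, and passing data= to Request switches the method to POST.

```python
import urllib.parse
import urllib.request

# A POST body must be bytes: urlencode the form dict, then encode to UTF-8.
fields = {'user': 'test', 'pwd': '123456'}  # placeholder form data
data = urllib.parse.urlencode(fields).encode('utf-8')

# Supplying data= makes urlopen send a POST instead of a GET.
request = urllib.request.Request(url='http://www.baidu.com', data=data)
print(request.get_method())  # POST
# result = urllib.request.urlopen(request).read().decode('utf-8')  # uncomment to actually send
```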

The requests library

Installation: pip install requests

GET requests with requests

Direct access, without request headers


import requests
def get_page():
    url = 'http://www.baidu.com'
    response = requests.get(url)
    print(response.text)  
if __name__ == '__main__':
    get_page()

response.text is response.content decoded into a string using an encoding that requests guesses; the guess is sometimes wrong and produces mojibake. When that happens, decode explicitly instead: response.content.decode('utf-8')

Output:


>>> res = response.content.decode('utf-8')
>>> print(res)
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rs 
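The mojibake risk can be reproduced without any network access. The sketch below simulates what happens when UTF-8 bytes (what response.content holds) are decoded with a wrong guessed encoding such as ISO-8859-1, requests' common fallback for text responses that declare no charset:

```python
# What response.content might hold: UTF-8 encoded Chinese text.
raw = '百度一下,你就知道'.encode('utf-8')

# response.text may decode with a wrongly guessed encoding, producing mojibake.
wrong = raw.decode('iso-8859-1')
# Decoding explicitly, as recommended above, recovers the original text.
right = raw.decode('utf-8')

print(wrong)
print(right)  # 百度一下,你就知道
```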

POST requests with requests

Three ways to pass parameters in a GET request

  1. Encode a parameter dict into a query string with urllib.parse.urlencode()

   import urllib.parse
   if __name__ == '__main__':
    base_url = 'https://www.baidu.com/s?'
    params ={
        'ie':'utf-8', # 'key1':'values1'
        'wd':'hello' # 'key2':'values2'
    }
    url = base_url + urllib.parse.urlencode(params)
    print(url)

Result: https://www.baidu.com/s?ie=utf-8&wd=hello, which is exactly a Baidu search link!

  2. Pass the params argument to requests.get() (reference link)

   import requests
   if __name__ == '__main__':
    payload = {
        'ie':'utf-8',
        'wd':'hello'
    }
    base_url = 'https://www.baidu.com/s?'
    response = requests.get(url=base_url, params=payload)
    print(response.url)

Result: something went wrong; Baidu flagged a security issue and required a captcha before allowing access:

 https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3Dutf-8%26wd%3Dhello&logid=10813104645036683003&signature=1ed04493a7a14debe47c945fe7729723&timestamp=1589427108

  3. Write the URL by hand: https://www.baidu.com/s?key1=value1&key2=value2
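The POST heading above never got a requests example. A sketch using a prepared request, which shows exactly what requests.post(url, data=...) would send, without touching the network (the URL and form fields are placeholders; httpbin.org is a public echo service):

```python
import requests

# Prepare a POST without sending it; requests.post(url, data=fields)
# would transmit exactly this method and body.
fields = {'user': 'test', 'pwd': '123456'}  # placeholder form data
req = requests.Request('POST', 'http://httpbin.org/post', data=fields)
prepared = req.prepare()

print(prepared.method)  # POST
print(prepared.body)    # user=test&pwd=123456
# response = requests.Session().send(prepared)  # uncomment to actually send
```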