Fetching Reddit content with PRAW

Author: Shelton Ma

Reddit provides an official API (see the Reddit Data API Wiki), and since July 2023 it has enforced limits on requests.

This post uses PRAW to make the requests.

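Reddit app

Register an app at: https://www.reddit.com/prefs/apps

Limit: 600 requests per 10 minutes. Use the response headers x-ratelimit-used, x-ratelimit-reset and x-ratelimit-remaining to throttle your request rate. Registering multiple apps raises the overall throughput.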
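
As a minimal sketch of how those headers could be read, the helper below parses the three x-ratelimit-* values from a header mapping and works out the average spacing that would spread the remaining budget over the rest of the window. The names `RateLimitInfo`, `parse_rate_limit` and `seconds_between_requests` are hypothetical, lowercase header keys are assumed, and x-ratelimit-reset is assumed to be the number of seconds until the window resets.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    used: int             # requests already spent in this window
    remaining: float      # requests still available in this window
    reset_seconds: int    # seconds until the window resets (assumed unit)


def parse_rate_limit(headers: Mapping[str, str]) -> Optional[RateLimitInfo]:
    """Return the rate-limit state, or None if any header is missing."""
    try:
        return RateLimitInfo(
            used=int(headers["x-ratelimit-used"]),
            remaining=float(headers["x-ratelimit-remaining"]),
            reset_seconds=int(headers["x-ratelimit-reset"]),
        )
    except KeyError:
        return None


def seconds_between_requests(info: RateLimitInfo) -> float:
    """Average delay that spreads the remaining budget over the window."""
    if info.remaining <= 0:
        # Nothing left: the only safe delay is the full time to reset.
        return float(info.reset_seconds)
    return info.reset_seconds / info.remaining
```

With made-up values such as 598 requests remaining and 300 seconds to the reset, the helper suggests roughly half a second between requests.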

PRAW

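About 429 errors

At the time of writing, prawcore did not handle the boundary of the limit strictly enough, so a 429 could still be triggered. Working around this requires a small change to the source.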

# /lib/python3.9/site-packages/prawcore/rate_limit.py
# When 3 or fewer requests remain, wait for the reset to avoid triggering a 429
def update(self, response_headers):
    ...

    self.reset_timestamp = now + seconds_to_reset

    # Add the following
    if self.remaining <= 3:
        self.next_request_timestamp = self.reset_timestamp
        return
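
The decision made by that patch can be sketched in isolation, without touching prawcore. The function below is a standalone illustration only, not prawcore's code: `wait_before_next_request` and `threshold` are hypothetical names, and returning zero delay when plenty of requests remain is an assumption made to keep the example small.

```python
def wait_before_next_request(remaining, seconds_to_reset, threshold=3):
    """Return how many seconds to sleep before sending the next request.

    Once `remaining` drops to `threshold` or below, wait out the whole
    window instead of spending the last few requests, which is the
    point where a 429 becomes likely.
    """
    if remaining <= threshold:
        return float(seconds_to_reset)
    # Assumption for this sketch: no extra delay while budget remains.
    return 0.0
```

A caller could pass the result straight to `time.sleep()` before each request. Keeping a small reserve costs a few requests per window in exchange for not hitting the limit exactly.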

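About concurrency

Reddit applies the limit per app, so you can create several apps and run one process per app. Use processes rather than threads, because PRAW is not thread-safe.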
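
One way to arrange this is sketched below, assuming one set of credentials per registered app. The helper names `partition`, `fetch_titles` and `fetch_all` are hypothetical, and each credentials dict is assumed to hold the keyword arguments for `praw.Reddit` (for example `client_id`, `client_secret`, `user_agent`). Every worker process builds its own `praw.Reddit` instance, so no client is shared between processes.

```python
from multiprocessing import Pool


def partition(items, n):
    """Split items round-robin into n lists, one list per app."""
    return [items[i::n] for i in range(n)]


def fetch_titles(job):
    """Run inside a worker process with that process's own Reddit client."""
    credentials, submission_ids = job
    import praw  # imported in the worker so each process creates its own client

    reddit = praw.Reddit(**credentials)
    return [reddit.submission(sid).title for sid in submission_ids]


def fetch_all(app_credentials, submission_ids):
    """Fan submission IDs out over one process per app and collect titles.

    Titles come back grouped by app, not in the original ID order.
    """
    chunks = partition(submission_ids, len(app_credentials))
    jobs = list(zip(app_credentials, chunks))
    with Pool(processes=len(jobs)) as pool:
        results = pool.map(fetch_titles, jobs)
    return [title for chunk in results for title in chunk]
```

Call `fetch_all` from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.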

About comment counts

Reddit hides or collapses some comments, so the number of comments actually retrieved may be lower than the number displayed.

Demo

# Create an instance
import praw

reddit = praw.Reddit(
    client_id="xxx",
    client_secret="xxx",
    password="",
    user_agent="xxx (by u/USERNAME)",
    username="",
)

# Fetch the 100 newest posts from popular (hot/top work the same way)
for submission in reddit.subreddit("popular").new(limit=100):
    print(submission.title)

# Stream new submissions; the first pass returns up to 100 existing ones
for submission in reddit.subreddit("popular").stream.submissions():
    print(submission)

# Continuously fetch new comments
# https://praw.readthedocs.io/en/stable/code_overview/other/subredditstream.html#praw.models.reddit.subreddit.SubredditStream
for comment in reddit.subreddit("popular").stream.comments():
    print(comment)

# Fetch all comments of a submission
submission = reddit.submission("15ztk07")
# submission.comments contains MoreComments objects; replace_more swaps them for real comments
submission.comments.replace_more(limit=None)
comments = submission.comments.list()

# Check the comment count; the number retrieved may be lower than the number displayed
print(submission.num_comments == len(comments))