# reddit-crawler

```sh
npm install reddit-crawler
```
Iterate over all submissions in a subreddit.
- Uses Reddit's CloudSearch API.
- Auto-renews OAuth access token as it crawls.
## Usage
Until Node gets native async iterators, the crawler approximates one with its `next(): Promise<Array | null>` method.

If the result is falsey (`null`), the crawler is done crawling the subreddit.

The array of submissions may be empty. The crawler will expand its search interval until it finds results, attempting to hover around 50-99 results per request.
```js
const makeCrawler = require('reddit-crawler')

const creds = {
  username: 'foo',
  password: 'secret',
  appId: 'xxx',
  appSecret: 'yyy',
}

async function run() {
  const crawler = makeCrawler('somesubreddit', creds)
  while (true) {
    const submissions = await crawler.next()
    if (!submissions) {
      console.log('done crawling')
      break
    }
    for (const sub of submissions) {
      await handleSubmission(sub) // e.g. save each submission to a database
    }
  }
}

run().catch((err) => {
  console.error(err)
})
```
Credentials are for a Reddit app and the user that owns it. By giving the crawler your creds, it can renew its access token as it crawls.

The access token expires after one hour, but large subreddits take a while to crawl (while respecting Reddit's rate limit).
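Renewal uses Reddit's standard OAuth password grant for script apps: HTTP Basic auth with the app id/secret, plus the owning user's username and password. A sketch of the request involved — `buildTokenRequest` is a made-up helper, not part of this library's API:

```js
// Sketch: the request a script app sends to Reddit's token endpoint
// to obtain (or renew) an access token via the `password` grant.
function buildTokenRequest(creds) {
  const basic = Buffer.from(`${creds.appId}:${creds.appSecret}`).toString('base64')
  return {
    url: 'https://www.reddit.com/api/v1/access_token',
    method: 'POST',
    headers: {
      Authorization: `Basic ${basic}`,
      'Content-Type': 'application/x-www-form-urlencoded',
    },
    body: new URLSearchParams({
      grant_type: 'password',
      username: creds.username,
      password: creds.password,
    }).toString(),
  }
}
```

The JSON response carries `access_token` and `expires_in` (3600 seconds), which is the one-hour expiry mentioned above.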
Reddit's API requires a user-agent header (https://github.com/reddit/reddit/wiki/API) of the form:

```
<platform>:<app ID>:<version string> (by /u/<reddit username>)
```
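Following that template, a user-agent might be assembled like so (the field values here are placeholders, and `buildUserAgent` is a made-up helper):

```js
// Sketch: fill in Reddit's user-agent template.
function buildUserAgent({ platform, appId, version, username }) {
  return `${platform}:${appId}:${version} (by /u/${username})`
}

buildUserAgent({ platform: 'nodejs', appId: 'xxx', version: 'v1.0.0', username: 'foo' })
// → 'nodejs:xxx:v1.0.0 (by /u/foo)'
```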
## Options
```js
const makeCrawler = require('reddit-crawler')

// Intervals are expressed in milliseconds. This `Duration` helper is just
// for readability; pass plain numbers if you prefer.
const Duration = {
  hours: (n) => n * 60 * 60 * 1000,
  days: (n) => n * 24 * 60 * 60 * 1000,
}

const crawler = makeCrawler('somesubreddit', creds, {
  initInterval: Duration.hours(1),
  minInterval: Duration.hours(1),
  maxInterval: Duration.days(30),
  initMax: new Date(), // start here and crawl backwards
})
```
- `initInterval`: the crawler starts off requesting submissions created within this span of time.
- `minInterval` / `maxInterval`: the crawler shrinks/grows its interval to hover around 50-99 results per request, never exceeding the min nor the max. `maxInterval` also tells the crawler when to give up: if the interval has grown to its max yet the crawler still finds no results, it assumes there are no more submissions.
- `initMax`: the crawler starts at the `initMax` date and crawls backwards into the past. Useful when you want to resume progress without re-crawling the top N submissions.
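The shrink/grow behavior described above can be sketched as a pure function. The doubling/halving factors are assumptions — the real crawler may scale its interval differently:

```js
// Sketch: adjust the search interval based on how many results the last
// request returned, aiming for 50-99 results, clamped to [min, max].
function nextInterval(interval, resultCount, { minInterval, maxInterval }) {
  if (resultCount < 50) {
    // Too few results: widen the window, but never beyond maxInterval.
    return Math.min(interval * 2, maxInterval)
  }
  if (resultCount > 99) {
    // Too many results: narrow the window, but never below minInterval.
    return Math.max(Math.floor(interval / 2), minInterval)
  }
  return interval // in the sweet spot; keep it
}
```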
## Notes
- Reddit asks that you hit its API no more than once per second. The crawler has a rudimentary, built-in sleep after each CloudSearch request.
- Set `DEBUG=reddit-crawler` to see debug logging.
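The once-per-second pacing mentioned above amounts to sleeping between requests; a minimal sketch (the helper names are made up, not this library's internals):

```js
// Sketch: promise-based sleep, used to pace requests to roughly
// one per second.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Run a request function, then wait `ms` before allowing the next one.
async function paced(fn, ms = 1000) {
  const result = await fn()
  await sleep(ms)
  return result
}
```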