SitemapRequestLoader
Hierarchy
- RequestLoader
  - SitemapRequestLoader
 
Methods
__aenter__
- Enter the context manager.
- Returns SitemapRequestLoader
__aexit__
- Exit the context manager.
- Parameters
  - exc_type: type[BaseException] | None
  - exc_value: BaseException | None
  - exc_traceback: TracebackType | None
- Returns None
__init__
- Initialize the sitemap request loader.
- Parameters
  - sitemap_urls: list[str] - URLs of the sitemaps to load.
  - http_client: HttpClient - The HttpClient instance to use for fetching sitemaps.
  - proxy_info: ProxyInfo | None = None (optional, keyword-only) - Optional proxy to use for fetching sitemaps.
  - include: list[re.Pattern[Any] | Glob] | None = None (optional, keyword-only) - List of glob or regex patterns that URLs must match to be included.
  - exclude: list[re.Pattern[Any] | Glob] | None = None (optional, keyword-only) - List of glob or regex patterns for URLs to exclude.
  - max_buffer_size: int = 200 (optional, keyword-only) - Maximum number of URLs to buffer in memory.
  - persist_state_key: str | None = None (optional, keyword-only) - A key for persisting the loader's state in the KeyValueStore. When provided, the loader can resume from where it left off after an interruption. If None, no state is persisted.
- Returns None
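To illustrate how include and exclude patterns might interact, here is a minimal standard-library sketch. It is not the loader's implementation: the helper names `matches` and `is_allowed` are hypothetical, plain strings stand in for Glob objects, and the precedence shown (exclusion wins, an empty include list admits everything) is an assumption rather than documented behavior.

```python
from __future__ import annotations

import re
from fnmatch import fnmatchcase


def matches(url: str, pattern: re.Pattern[str] | str) -> bool:
    """Return True if the URL matches a compiled regex or a glob string."""
    if isinstance(pattern, re.Pattern):
        return pattern.search(url) is not None
    # Plain strings stand in for Glob objects here (an assumption).
    return fnmatchcase(url, pattern)


def is_allowed(url: str, include=None, exclude=None) -> bool:
    """Assumed semantics: exclusion wins; with no include list, every URL passes."""
    if exclude and any(matches(url, p) for p in exclude):
        return False
    if include:
        return any(matches(url, p) for p in include)
    return True


urls = [
    'https://example.com/docs/intro',
    'https://example.com/docs/private/keys',
    'https://example.com/blog/post-1',
]
kept = [
    u for u in urls
    if is_allowed(u, include=[re.compile(r'/docs/')], exclude=['*/private/*'])
]
print(kept)
```

With these inputs only the first URL is kept: the second is excluded by the glob, and the third does not match the include regex.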
abort_loading
- Abort the sitemap loading process.
- Returns None
close
- Close the request loader.
- Returns None
fetch_next_request
- Fetch the next request to process.
- Returns Request | None
get_handled_count
- Return the number of URLs that have been handled.
- Returns int
get_total_count
- Return the total number of URLs found so far.
- Returns int
is_empty
- Check if there are no more URLs to process.
- Returns bool
is_finished
- Check if all URLs have been processed.
- Returns bool
mark_request_as_handled
- Mark a request as successfully handled.
- Parameters
  - request: Request
- Returns ProcessedRequest | None
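The fetch and mark-as-handled methods suggest a simple consumption loop. The sketch below runs that loop against an in-memory stand-in, not the real class: `FakeLoader` and `FakeRequest` are hypothetical, the methods are assumed to be coroutines, and the split between `is_empty` (nothing left to fetch) and `is_finished` (nothing left to fetch and nothing still in progress) is one plausible reading of the descriptions above.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRequest:
    url: str


class FakeLoader:
    """In-memory stand-in that mirrors the method names listed above."""

    def __init__(self, urls):
        self._pending = [FakeRequest(u) for u in urls]
        self._in_progress = set()
        self._handled = 0

    async def fetch_next_request(self):
        if not self._pending:
            return None
        request = self._pending.pop(0)
        self._in_progress.add(request.url)
        return request

    async def mark_request_as_handled(self, request):
        self._in_progress.discard(request.url)
        self._handled += 1

    async def is_empty(self):
        return not self._pending

    async def is_finished(self):
        return not self._pending and not self._in_progress

    async def get_handled_count(self):
        return self._handled


async def consume(loader):
    seen = []
    while (request := await loader.fetch_next_request()) is not None:
        seen.append(request.url)  # real processing would go here
        await loader.mark_request_as_handled(request)
    return seen


loader = FakeLoader(['https://example.com/a', 'https://example.com/b'])
seen = asyncio.run(consume(loader))
finished = asyncio.run(loader.is_finished())
print(seen, finished)
```

Marking each request as handled right after processing keeps the handled count and the finished check in step with the work actually done.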
start
- Start the sitemap loading process.
- Returns None
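One way to picture background loading with a bounded buffer is a producer task feeding an `asyncio.Queue`. This is a sketch of the general pattern only, not the loader's internals: `produce`, the `None` sentinel, and the tiny `maxsize` (standing in for max_buffer_size) are all assumptions made for illustration.

```python
import asyncio


async def produce(urls, buffer):
    """Stand-in for sitemap parsing: push URLs, pausing while the buffer is full."""
    for url in urls:
        await buffer.put(url)  # waits whenever maxsize items are already queued
    await buffer.put(None)  # sentinel: loading has finished


async def main():
    buffer = asyncio.Queue(maxsize=2)  # plays the role of max_buffer_size
    urls = [f'https://example.com/page-{i}' for i in range(5)]
    loading = asyncio.create_task(produce(urls, buffer))  # runs in the background

    consumed = []
    while (url := await buffer.get()) is not None:
        consumed.append(url)  # consumption begins before loading completes
    await loading
    return consumed


consumed = asyncio.run(main())
print(len(consumed))
```

Because the queue holds at most two items, the producer cannot run far ahead of the consumer, which bounds memory use regardless of sitemap size.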
to_tandem
- Combine the loader with a request manager to support adding and reclaiming requests.
- Parameters
  - request_manager: RequestManager | None = None (optional) - Request manager to combine the loader with. If None is given, the default request queue is used.
- Returns RequestManagerTandem
A request loader that reads URLs from sitemap(s).
The loader is designed to handle sitemaps that follow the format described in the Sitemaps protocol (https://www.sitemaps.org/protocol.html). It supports both XML and plain text sitemap formats. Note that HTML pages containing links are not supported; those should be handled by regular crawlers and the enqueue_links functionality.
The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
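For readers unfamiliar with the two sitemap formats, the following standard-library sketch extracts URLs from each. It does not reflect how the loader parses sitemaps; `parse_xml_sitemap` and `parse_text_sitemap` are hypothetical helpers, and the element names and namespace come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def parse_xml_sitemap(text):
    """Return (page_urls, nested_sitemap_urls) from a urlset or sitemapindex."""
    root = ET.fromstring(text)
    locs = [
        loc.text.strip()
        for loc in root.iter(f'{SITEMAP_NS}loc')
        if loc.text
    ]
    # A sitemap index points at further sitemaps rather than at pages.
    if root.tag == f'{SITEMAP_NS}sitemapindex':
        return [], locs
    return locs, []


def parse_text_sitemap(text):
    """Plain text sitemaps list one URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


xml_doc = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

index_doc = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

print(parse_xml_sitemap(xml_doc))
print(parse_xml_sitemap(index_doc))
print(parse_text_sitemap('https://example.com/a\n\nhttps://example.com/b\n'))
```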
The loader supports state persistence: when a persist_state_key is provided during initialization, it can resume from where it left off after an interruption.
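The idea of resuming under a persistence key can be sketched with a plain dictionary standing in for a key-value store. Everything here is hypothetical, including the `ResumableUrlSource` class, its method names, and the JSON layout of the saved state; the real loader's persisted state is not described in this reference.

```python
import json


class ResumableUrlSource:
    """Hypothetical loader core that checkpoints its progress under a key."""

    def __init__(self, urls, store, persist_state_key=None):
        self._urls = list(urls)
        self._store = store  # a dict standing in for a key-value store
        self._key = persist_state_key
        saved = store.get(persist_state_key) if persist_state_key else None
        self._handled = set(json.loads(saved)['handled']) if saved else set()

    def next_url(self):
        """Return the first URL not yet handled, or None when all are done."""
        for url in self._urls:
            if url not in self._handled:
                return url
        return None

    def mark_handled(self, url):
        self._handled.add(url)
        if self._key:  # with no key, nothing is persisted
            self._store[self._key] = json.dumps({'handled': sorted(self._handled)})


store = {}
urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']

first_run = ResumableUrlSource(urls, store, persist_state_key='sitemap-state')
first_run.mark_handled(first_run.next_url())  # handle /a, then simulate a crash

second_run = ResumableUrlSource(urls, store, persist_state_key='sitemap-state')
print(second_run.next_url())
```

The second instance reads the saved state under the same key and continues with the first unhandled URL instead of starting over.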