Crawl4AI: Build Async Web Crawlers with Extraction & JS

Crawl4AI simplifies advanced web scraping in Python: async crawling, markdown cleaning via pruning/BM25, CSS/LLM structured extraction, JS execution, deep/concurrent crawls, sessions, and screenshots, all powered by Playwright.

Environment Setup for Reliable Crawling

Install Crawl4AI v0.8.x in Colab with system deps (libnss3, libatk1.0-0, etc.), pip packages (crawl4ai, nest_asyncio, pydantic), and Playwright Chromium via playwright install chromium followed by playwright install-deps. Apply nest_asyncio so awaits work inside the notebook's already-running event loop, and run everything through the AsyncWebCrawler() context manager, as sketched below.
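A minimal setup sketch for a Colab/Jupyter cell, assuming the package list above (shell commands shown as comments; the exact system-dependency list varies by host):

```python
# One-time setup (uncomment in a Colab cell):
# !apt-get install -y libnss3 libatk1.0-0   # system deps; full list varies by host
# !pip install -q crawl4ai nest_asyncio pydantic
# !playwright install chromium
# !playwright install-deps                  # browser libraries on Debian/Ubuntu

import nest_asyncio

nest_asyncio.apply()  # allow awaited crawls inside the notebook's running event loop
```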

Basic crawl: await crawler.arun(url="https://example.com") yields result.success, result.metadata['title'], and result.markdown.raw_markdown. Configure via BrowserConfig(headless=True, viewport_width=1920, user_agent=...) and CrawlerRunConfig(cache_mode=CacheMode.BYPASS, page_timeout=30000, wait_until="networkidle") to handle dynamic sites like httpbin.org/html and capture fully JS-rendered content.
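Putting those pieces together, a minimal run might look like this sketch (values taken from the summary above):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_cfg = BrowserConfig(headless=True, viewport_width=1920)
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,   # always fetch fresh while developing
        page_timeout=30000,            # ms to wait before giving up on the page
        wait_until="networkidle",      # let JS-driven requests settle first
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        if result.success:
            print(result.metadata["title"])
            print(result.markdown.raw_markdown[:300])

asyncio.run(main())
```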

Markdown Cleaning and Query Filtering

Generate clean markdown with DefaultMarkdownGenerator(content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed", min_word_threshold=20)). On Wikipedia's Web_scraping page, the raw markdown shrinks by roughly 50-70% in fit_markdown as noise is pruned away; a sketch follows.
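A sketch of the pruning setup; the import paths follow recent Crawl4AI releases and may shift between versions:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

run_cfg = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold=0.4,           # drop nodes scoring below this
            threshold_type="fixed",
            min_word_threshold=20,   # ignore very short text blocks
        )
    )
)

async def clean_markdown():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping", config=run_cfg
        )
        # raw_markdown keeps everything; fit_markdown is the filtered version
        print(len(result.markdown.raw_markdown), len(result.markdown.fit_markdown))

asyncio.run(clean_markdown())
```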

For relevance, apply BM25ContentFilter(user_query="legal aspects privacy data protection", bm25_threshold=1.2), which filters Wikipedia down to query-matched sections only, e.g., 800+ chars of privacy-focused content. Combine with css_selector="article, main", excluded_tags=["nav", "footer"], and remove_overlay_elements=True to target main content, yielding concise markdown (e.g., a 500-char preview free of nav junk); see the sketch below.
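The BM25 variant swaps in a query-aware filter, and the page-targeting options go on the same run config. A sketch using the selectors named above:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

bm25_cfg = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(
            user_query="legal aspects privacy data protection",
            bm25_threshold=1.2,  # higher keeps only strongly matching chunks
        )
    ),
    css_selector="article, main",      # limit processing to main content areas
    excluded_tags=["nav", "footer"],   # drop navigation chrome
    remove_overlay_elements=True,      # strip cookie banners and modals
)
```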

Structured Extraction: CSS, LLM, and JS Handling

CSS extraction via JsonCssExtractionStrategy(schema): define a baseSelector (e.g., "div.mw-parser-output h2") and fields like {"name": "heading_text", "selector": "span.mw-headline", "type": "text"} or attribute fields. This extracts 10+ Wikipedia Python headings or Hacker News top stories (rank, title, url, site) as a JSON list, fast and with no LLM needed.
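A sketch of the Wikipedia-headings case, with the schema fields given above:

```python
import asyncio, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "wiki_headings",
    "baseSelector": "div.mw-parser-output h2",  # one item per matching element
    "fields": [
        {"name": "heading_text", "selector": "span.mw-headline", "type": "text"},
    ],
}

async def extract_headings():
    run_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Python_(programming_language)",
            config=run_cfg,
        )
        headings = json.loads(result.extracted_content)  # JSON string -> list of dicts
        print(headings[:5])

asyncio.run(extract_headings())
```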

LLM extraction: define a Pydantic schema, e.g., class Article(BaseModel) with title: str, summary: str, topics: List[str]. Then use LLMExtractionStrategy(llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=...), schema=Article.model_json_schema(), instruction="Extract article titles and summaries.") on HN for structured JSON.
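A sketch of the LLM path; extraction_type="schema" asks the model for JSON matching the Pydantic schema, and the API key is assumed to be in the environment:

```python
import os
from typing import List
from pydantic import BaseModel
from crawl4ai import CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Article(BaseModel):
    title: str
    summary: str
    topics: List[str]

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),  # assumed to be set beforehand
    ),
    schema=Article.model_json_schema(),
    extraction_type="schema",  # return JSON conforming to the schema
    instruction="Extract article titles and summaries.",
)
run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)
```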

JS execution: Inject js_code=["window.scrollTo(0, document.body.scrollHeight); await new Promise(r => setTimeout(r, 1000));"] with wait_for="css:body", delay_before_return_html=1.0 to load dynamic content.
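The JS hooks slot into the same run config; a minimal sketch:

```python
from crawl4ai import CrawlerRunConfig

js_cfg = CrawlerRunConfig(
    js_code=[
        # scroll to the bottom, then give lazy-loaded content a second to arrive
        "window.scrollTo(0, document.body.scrollHeight);"
        " await new Promise(r => setTimeout(r, 1000));"
    ],
    wait_for="css:body",           # block until this selector is present
    delay_before_return_html=1.0,  # extra settle time (seconds) before capture
)
```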

Scaling: Deep Crawls, Concurrency, Sessions, and Outputs

Deep crawl with BFSDeepCrawlStrategy(max_depth=2, max_pages=5, filter_chain=FilterChain([DomainFilter(allowed_domains=["docs.crawl4ai.com"]), URLPatternFilter(patterns=["*quickstart*"])])); this crawls 5 targeted docs.crawl4ai.com pages, as in the sketch below.
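A deep-crawl sketch; the filter imports follow recent releases and may move between versions, and the start URL is assumed:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, DomainFilter, URLPatternFilter

strategy = BFSDeepCrawlStrategy(
    max_depth=2,   # the start page counts as depth 0
    max_pages=5,
    filter_chain=FilterChain([
        DomainFilter(allowed_domains=["docs.crawl4ai.com"]),
        URLPatternFilter(patterns=["*quickstart*"]),
    ]),
)

async def deep_crawl():
    async with AsyncWebCrawler() as crawler:
        # with a deep-crawl strategy, arun() returns a list of results
        results = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=CrawlerRunConfig(deep_crawl_strategy=strategy),
        )
        for r in results:
            print(r.url, r.success)

asyncio.run(deep_crawl())
```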

Concurrent: await crawler.arun_many(urls=["https://httpbin.org/html", ...]) processes 5 URLs in parallel, reporting success and content length for each.
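A concurrency sketch (the URL list is illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_many():
    urls = ["https://httpbin.org/html", "https://example.com"]  # extend to 5 URLs
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_cfg)
        for r in results:
            length = len(r.markdown.raw_markdown) if r.success else 0
            print(r.url, r.success, length)

asyncio.run(crawl_many())
```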

Sessions: Share session_id="my_session" across arun() calls to persist cookies (e.g., set/read via httpbin.org/cookies).
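A session sketch using httpbin's cookie endpoints, as above (the demo cookie value is arbitrary):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def cookie_roundtrip():
    async with AsyncWebCrawler() as crawler:
        # first call sets a cookie; session_id keeps the same browser tab alive
        await crawler.arun(
            url="https://httpbin.org/cookies/set?demo=1",
            config=CrawlerRunConfig(session_id="my_session"),
        )
        # second call reuses the session, so the cookie should still be present
        result = await crawler.arun(
            url="https://httpbin.org/cookies",
            config=CrawlerRunConfig(session_id="my_session"),
        )
        print(result.markdown.raw_markdown)

asyncio.run(cookie_roundtrip())
```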

Extras: screenshot=True captures a base64-encoded PNG; result.media['images'] lists image srcs; result.links['internal'] and result.links['external'] reveal site structure (e.g., 20+ internal links from docs.crawl4ai.com).
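A sketch capturing all three outputs in one pass:

```python
import asyncio, base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def inspect_page():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=CrawlerRunConfig(screenshot=True),
        )
        if result.screenshot:  # base64-encoded PNG
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))
        print(len(result.media["images"]), "images")
        print(len(result.links["internal"]), "internal links")
        print(len(result.links["external"]), "external links")

asyncio.run(inspect_page())
```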

Real-world: combine the CSS schema for HN stories with pruning to get 15 clean stories as JSON, saved to hacker_news_stories.json via json.dump (which takes an open file object, not a filename); a sketch follows. Trade-offs: bypassing the cache speeds development but re-fetches pages; headless=True hides the browser but gives up visual debugging.
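A sketch of the combined pipeline and save step; the Hacker News selectors here are illustrative assumptions, not verified against the live markup:

```python
import asyncio, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical HN schema; adjust the selectors to the live markup.
hn_schema = {
    "name": "hn_stories",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "span.titleline > a", "type": "text"},
        {"name": "url", "selector": "span.titleline > a",
         "type": "attribute", "attribute": "href"},
    ],
}

async def save_stories():
    run_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(hn_schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com/", config=run_cfg)
        stories = json.loads(result.extracted_content)[:15]
        with open("hacker_news_stories.json", "w") as f:
            json.dump(stories, f, indent=2)  # json.dump writes to a file object

asyncio.run(save_stories())
```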
