Resources

Resources for building automated data workflows

Guides, tutorials, and technical documentation for using Citrusiq to build web data extraction pipelines, automate workflows, and power AI systems.

citrusiq — terminal
$ citrusiq extract https://example.com/products \
    --schema product.json --output ./dataset
✓ JavaScript rendering enabled (Chrome headless)
✓ 1,240 records extracted across 52 pages
✓ Schema validation passed — 0 rejected
✓ Dataset exported → ./dataset/products.jsonl
Throughput: 847 records/min · Elapsed: 1m 28s
$
Browse by type

Resource categories

Start with a category or dive directly into a specific guide, tutorial, or reference document.

Guides

12

Step-by-step walkthroughs for building and deploying data pipelines using Citrusiq.

Browse Guides

Tutorials

9

Hands-on tutorials covering extraction, processing, schema design, and delivery.

Browse Tutorials

Technical Articles

18

Deep-dive technical content on web extraction, pipeline architecture, and AI integration.

Browse Technical Articles

Help Center

31

Documentation, API reference, troubleshooting guides, and setup instructions.

Browse Help Center
Featured Guides

Start with a guide

End-to-end walkthroughs covering the most common data pipeline use cases with Citrusiq.

Guide · 12 min read

How to build a web data pipeline from scratch

Walk through the complete pipeline lifecycle — from configuring your first extractor and defining a schema, to scheduling runs and delivering structured datasets to downstream systems.

What you'll learn

Configure extractor
Define schema
Schedule pipeline
Deliver dataset
Read guide
Guide · 9 min read

Extracting structured datasets from any website

Learn how Citrusiq handles JavaScript rendering, authentication flows, pagination, and anti-bot measures — and how to define schemas that map web content to typed, validated output.

What you'll learn

JS rendering setup
Auth configuration
Pagination handling
Schema mapping
Read guide
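The schema-mapping step this guide covers can be sketched in a few lines of Python: raw scraped fields are coerced to the types a schema declares, and records that fail coercion are rejected rather than passed downstream. The schema format and function names below are illustrative, not Citrusiq's actual API.

```python
# Minimal sketch of schema mapping: coerce raw scraped fields to
# declared types and reject records that fail validation.
TYPE_COERCERS = {
    "string": str,
    "number": float,
    "boolean": lambda v: str(v).lower() in ("true", "1", "yes"),
}

def map_record(raw: dict, schema: dict):
    """Return a typed record, or None if any field is missing or fails coercion."""
    record = {}
    for field, type_name in schema.items():
        if field not in raw:
            return None  # missing field -> reject the record
        try:
            record[field] = TYPE_COERCERS[type_name](raw[field])
        except (ValueError, TypeError):
            return None  # un-coercible value -> reject the record
    return record

schema = {"name": "string", "price": "number", "in_stock": "boolean"}
raw = {"name": "Widget", "price": "19.99", "in_stock": "true"}
typed = map_record(raw, schema)
```

Rejecting whole records on coercion failure keeps the output strictly typed; a production pipeline would also log the rejects for inspection.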
Guide · 15 min read

Building AI training datasets from web data

A complete guide to collecting, deduplicating, and quality-scoring domain-specific content for model training, fine-tuning, and RAG system construction using Citrusiq pipelines.

What you'll learn

Source selection
Deduplication
Quality scoring
Training delivery
Read guide
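The deduplication and quality-scoring steps this guide walks through can be sketched as below, using exact content hashing and a naive length-based score. Real training pipelines would use fuzzier near-duplicate detection (e.g. MinHash) and richer scoring; nothing here is a Citrusiq API.

```python
import hashlib

def dedupe_and_score(docs, min_score=0.5):
    """Drop exact duplicates by content hash, then keep documents whose
    (toy, length-based) quality score clears a threshold."""
    seen = set()
    kept = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        # Toy quality score: fraction of a 200-char target length, capped at 1.0
        score = min(len(text) / 200, 1.0)
        if score >= min_score:
            kept.append(text)
    return kept

docs = ["short", "a" * 300, "a" * 300, "b" * 150]
result = dedupe_and_score(docs)
```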
Guide · 11 min read

Automating competitor monitoring with Citrusiq

Set up a continuous monitoring pipeline that detects changes on competitor pages — pricing updates, feature launches, job postings — and routes structured alerts to your team automatically.

What you'll learn

Define watch targets
Configure change detection
AI summarization
Alert routing
Read guide
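At its core, the change detection this guide describes compares a hash of the current page content against the last stored snapshot. A minimal sketch, assuming an in-memory snapshot store (a real monitor would persist snapshots between scheduled runs):

```python
import hashlib

def detect_change(url, content, snapshots):
    """Return True if the page content differs from the stored snapshot.
    `snapshots` maps URL -> last-seen content hash; the first fetch only
    records a baseline and is not reported as a change."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = snapshots.get(url)
    snapshots[url] = digest  # update the snapshot for the next run
    return previous is not None and previous != digest

snapshots = {}
detect_change("https://rival.example/pricing", "Pro: $49/mo", snapshots)  # baseline run
changed = detect_change("https://rival.example/pricing", "Pro: $59/mo", snapshots)
```

A monitoring pipeline would route `changed` pages into summarization and alerting rather than just returning a boolean.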
Technical Articles

Go deeper

Architecture patterns, scaling strategies, and technical deep-dives for engineers building on Citrusiq.

Architecture · 8 min

Handling JavaScript-heavy websites at scale

How Citrusiq manages headless Chrome worker pools, session reuse, and render caching to extract from complex SPAs without reliability trade-offs.

Read article
Scaling · 10 min

Scaling web data extraction pipelines

Concurrency models, rate limiting strategies, distributed crawl queues, and checkpoint recovery patterns for high-volume pipeline deployments.

Read article
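One of the rate-limiting strategies this article covers is the token bucket: workers draw a token before each request, capping sustained throughput while still allowing short bursts. A minimal single-process sketch (a distributed crawl would back this with shared state):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: acquire() blocks until a token is
    available, capping sustained throughput at `rate` requests/second
    with bursts of up to `capacity` requests."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full: allow an initial burst
        self.last = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, up to capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # wait for the next token

bucket = TokenBucket(rate=5, capacity=5)
for _ in range(5):
    bucket.acquire()  # initial burst drains the bucket without blocking
```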
AI Integration · 12 min

Structuring datasets for AI model training

Schema design principles, record normalization, token budgeting, and metadata strategies that improve model quality when training on web-collected data.

Read article
Workflows · 9 min

Automating market intelligence workflows

Pipeline patterns for continuous signal collection, AI-powered classification, and scheduled report delivery to dashboards and data warehouses.

Read article
Data Engineering · 7 min

Schema versioning and backward compatibility

Managing schema evolution across pipeline versions — additive changes, deprecation strategies, and maintaining backward compatibility with downstream consumers.

Read article
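The core rule behind additive schema evolution is easy to check mechanically: a new version is backward compatible when every existing field keeps its name and type, and only new fields are added. A hypothetical sketch of that check (not a Citrusiq function):

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """A new schema version is backward compatible with downstream
    consumers when it only *adds* fields: every existing field must keep
    its name and type. Removals and type changes are breaking."""
    return all(new_schema.get(field) == type_name
               for field, type_name in old_schema.items())

v1 = {"name": "string", "price": "number"}
v2 = {"name": "string", "price": "number", "sku": "string"}  # additive: safe
v3 = {"name": "string", "price": "string"}                   # type change: breaking

ok_v2 = is_backward_compatible(v1, v2)
ok_v3 = is_backward_compatible(v1, v3)
```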
Reliability · 11 min

Pipeline observability and failure recovery

Setting up health checks, structured logging, alerting thresholds, and idempotent retry logic for production-grade extraction pipelines.

Read article
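The idempotent-retry pattern this article covers combines two ideas: skip records already checkpointed as complete, and retry transient failures with exponential backoff. A minimal sketch, using an in-memory set as a stand-in for a persistent checkpoint store (all names here are illustrative):

```python
import time

def run_with_retry(step, record_id, completed, max_attempts=3, base_delay=0.01):
    """Idempotent retry: skip records already marked complete, then retry
    the step with exponential backoff on transient failures."""
    if record_id in completed:
        return None  # already done on a previous run -> no duplicate work
    for attempt in range(max_attempts):
        try:
            result = step(record_id)
            completed.add(record_id)  # checkpoint only after success
            return result
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted retries: surface the failure for alerting
            time.sleep(base_delay * (2 ** attempt))

calls = []
def flaky_step(record_id):
    calls.append(record_id)
    if len(calls) < 2:
        raise RuntimeError("transient failure")  # fails on the first attempt
    return f"ok:{record_id}"

completed = set()
first = run_with_retry(flaky_step, "r1", completed)
second = run_with_retry(flaky_step, "r1", completed)  # skipped: already checkpointed
```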
pipeline.config.json
Example configuration
{
  "pipeline": "product-monitor",
  "schedule": "0 */6 * * *",
  "extractor": {
    "url": "https://store.example.com/products",
    "js_render": true,
    "pagination": "auto"
  },
  "schema": {
    "name": "string",
    "price": "number",
    "in_stock": "boolean",
    "sku": "string"
  },
  "export": {
    "format": "jsonl",
    "destination": "warehouse"
  }
}

What this pipeline does

  • Runs every 6 hours on a cron schedule
  • Renders JavaScript with headless Chrome
  • Auto-detects and follows pagination
  • Enforces typed schema on every record
  • Exports JSONL to your data warehouse
Help Center

Documentation and support

Reference documentation, setup guides, and support resources for teams using Citrusiq in production.

Getting started with Citrusiq

Account setup, first pipeline, and core concepts.

View docs

API reference

REST API endpoints, authentication, and response schemas.

View docs

Workflow automation setup

Scheduling, triggers, and pipeline orchestration.

View docs

Troubleshooting extraction issues

Common errors, auth failures, and pagination edge cases.

View docs

Schema design reference

Field types, validation rules, and output format options.

View docs

Contact support

Open a ticket or reach the engineering team directly.

View docs
Quick reference — common CLI commands

citrusiq init

Initialize a new project

citrusiq run <id>

Execute a pipeline run

citrusiq status

Check pipeline run status

citrusiq export --format jsonl

Export dataset as JSONL

Get started

Start building automated data workflows

Get your first pipeline running in under 30 minutes. Talk to our team to find the right setup for your use case.