Resources for building automated data workflows
Guides, tutorials, and technical documentation for using Citrusiq to build web data extraction pipelines, automate workflows, and power AI systems.
Resource categories
Start with a category or dive directly into a specific guide, tutorial, or reference document.
Guides (12)
Step-by-step walkthroughs for building and deploying data pipelines using Citrusiq.
Tutorials (9)
Hands-on tutorials covering extraction, processing, schema design, and delivery.
Technical Articles (18)
Deep-dive technical content on web extraction, pipeline architecture, and AI integration.
Help Center (31)
Documentation, API reference, troubleshooting guides, and setup instructions.
Start with a guide
End-to-end walkthroughs covering the most common data pipeline use cases with Citrusiq.
How to build a web data pipeline from scratch
Walk through the complete pipeline lifecycle — from configuring your first extractor and defining a schema, to scheduling runs and delivering structured datasets to downstream systems.
Extracting structured datasets from any website
Learn how Citrusiq handles JavaScript rendering, authentication flows, pagination, and anti-bot measures — and how to define schemas that map web content to typed, validated output.
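To make the typed-output idea concrete, here is a minimal sketch of how a record could be checked against a schema of the `{"field": "type"}` form shown in the pipeline config below. This is an illustrative example, not Citrusiq's actual validation code; the field names and `validate` helper are hypothetical.

```python
# Hypothetical sketch of typed schema validation for extracted records.
# Schema shape mirrors the {"field": "type"} mapping used in pipeline configs.
SCHEMA = {"name": "string", "price": "number", "in_stock": "boolean"}

# Map schema type names to the Python types they should accept.
PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of validation errors for one extracted record."""
    errors = []
    for field, type_name in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], PY_TYPES[type_name]):
            errors.append(f"{field}: expected {type_name}")
    return errors

# A record whose price was extracted as text fails the "number" check.
errors = validate({"name": "Widget", "price": "19.99", "in_stock": True}, SCHEMA)
```

Running every record through a check like this is what turns raw scraped content into a dataset downstream systems can trust.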
Building AI training datasets from web data
A complete guide to collecting, deduplicating, and quality-scoring domain-specific content for model training, fine-tuning, and RAG system construction using Citrusiq pipelines.
Automating competitor monitoring with Citrusiq
Set up a continuous monitoring pipeline that detects changes on competitor pages — pricing updates, feature launches, job postings — and routes structured alerts to your team automatically.
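The core of a monitoring pipeline like this is a diff between the current snapshot and the previous one. The sketch below shows one way that comparison could look, assuming records keyed by SKU; the function and field names are hypothetical, not part of the Citrusiq API.

```python
# Illustrative change detection between two pipeline runs.
# Snapshots are dicts keyed by SKU; field names are hypothetical.
def detect_changes(previous: dict, current: dict) -> list[str]:
    """Compare two snapshots and describe each detected change."""
    alerts = []
    for sku, record in current.items():
        old = previous.get(sku)
        if old is None:
            alerts.append(f"new product: {sku}")
        elif record["price"] != old["price"]:
            alerts.append(f"price change on {sku}: {old['price']} -> {record['price']}")
    return alerts

prev = {"A1": {"price": 10.0}}
curr = {"A1": {"price": 12.0}, "B2": {"price": 5.0}}
alerts = detect_changes(prev, curr)
```

Each alert string would then be routed to a channel of your choice, e.g. Slack, email, or a webhook.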
Go deeper
Architecture patterns, scaling strategies, and technical deep-dives for engineers building on Citrusiq.
Handling JavaScript-heavy websites at scale
How Citrusiq manages headless Chrome worker pools, session reuse, and render caching to extract from complex SPAs without reliability trade-offs.
Scaling web data extraction pipelines
Concurrency models, rate limiting strategies, distributed crawl queues, and checkpoint recovery patterns for high-volume pipeline deployments.
Structuring datasets for AI model training
Schema design principles, record normalization, token budgeting, and metadata strategies that improve model quality when training on web-collected data.
Automating market intelligence workflows
Pipeline patterns for continuous signal collection, AI-powered classification, and scheduled report delivery to dashboards and data warehouses.
Schema versioning and backward compatibility
Managing schema evolution across pipeline versions — additive changes, deprecation strategies, and maintaining backward compatibility with downstream consumers.
Pipeline observability and failure recovery
Setting up health checks, structured logging, alerting thresholds, and idempotent retry logic for production-grade extraction pipelines.
// citrusiq pipeline config
{
  "pipeline": "product-monitor",
  "schedule": "0 */6 * * *",
  "extractor": {
    "url": "https://store.example.com/products",
    "js_render": true,
    "pagination": "auto"
  },
  "schema": {
    "name": "string",
    "price": "number",
    "in_stock": "boolean",
    "sku": "string"
  },
  "export": {
    "format": "jsonl",
    "destination": "warehouse"
  }
}

What this pipeline does
- Runs every 6 hours on a cron schedule
- Renders JavaScript with headless Chrome
- Auto-detects and follows pagination
- Enforces typed schema on every record
- Exports JSONL to your data warehouse
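JSONL means one JSON object per line, which makes exports easy to stream into downstream tooling. As a sketch of what consuming such an export could look like (file path and helper are hypothetical, not part of Citrusiq):

```python
import json
import os
import tempfile

def load_records(path: str) -> list[dict]:
    """Parse a JSONL file: one JSON object per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Write a two-record sample export to a temp file, then read it back.
sample = [
    {"name": "Widget", "price": 19.99, "in_stock": True},
    {"name": "Gadget", "price": 4.5, "in_stock": False},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write("\n".join(json.dumps(r) for r in sample))
    path = f.name

records = load_records(path)
os.unlink(path)
```

Because each line is independent, JSONL files can be split, appended to, and loaded in parallel without parsing the whole file.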
Documentation and support
Reference documentation, setup guides, and support resources for teams using Citrusiq in production.
Getting started with Citrusiq
Account setup, first pipeline, and core concepts.
API reference
REST API endpoints, authentication, and response schemas.
Workflow automation setup
Scheduling, triggers, and pipeline orchestration.
Troubleshooting extraction issues
Common errors, auth failures, and pagination edge cases.
Schema design reference
Field types, validation rules, and output format options.
Contact support
Open a ticket or reach the engineering team directly.
citrusiq init
Initialize a new project
citrusiq run <id>
Execute a pipeline run
citrusiq status
Check pipeline run status
citrusiq export --format jsonl
Export dataset as JSONL
Start building automated data workflows
Get your first pipeline running in under 30 minutes. Talk to our team to find the right setup for your use case.