Data Scraper
The Data Scraper is the core data collection engine within the Mirox-Agent, actively retrieving real-time information from all monitored equipment at your plant. It connects to your loggers, inverters, meters and battery systems through a library of vendor-specific adapters, normalizes everything into one consistent metric vocabulary, and forwards the result to the rest of the platform — while running a growing set of edge analytics (performance ratio, curtailment tracking, clear-sky baselines, forecasting and network monitoring) directly at the plant.
Purpose and Role
The Data Scraper serves a single, focused purpose: actively collect raw measurements from equipment and forward them for processing. It acts as the bridge between diverse manufacturer equipment and the unified Mirox platform, translating proprietary data formats into standardized metrics.
Core Responsibilities:
- Connect to data loggers and monitoring devices through vendor-specific adapters
- Retrieve raw measurements on configurable schedules
- Transform manufacturer-specific data into standardized metric format
- Automatically discover and track installation components
- Monitor component activity and operational status
- Run edge analytics (expected power, performance ratio, curtailment, clear-sky and forecasts)
- Inspect the plant's local network and audit device access
- Forward metrics to the Time-Series Database and Digital Twin service
- Report component health and operational status to the IoT Cloud
This separation of concerns keeps the Data Scraper lightweight, focused, and independently deployable.
Architecture Overview
The Data Scraper operates as an asynchronous, event-driven service in which multiple data collection tasks run concurrently.
Key Architectural Principles:
- Stateless: No persistent state between restarts
- Adapter-Based: A dedicated, vendor-specific adapter speaks each device's protocol
- Self-Healing: Automatic error recovery with exponential backoff
- Concurrent: Each data source collected independently
- Edge-Deployed: Runs on or near the plant, close to the equipment it reads
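The concurrency principle above can be illustrated with a minimal sketch, assuming an asyncio-style design (the document does not name the framework). The function names, source names, and intervals are hypothetical; the point is only that each source runs in its own task and a failure in one does not stop the others.

```python
import asyncio

async def collect(name, poll, interval_s, cycles):
    """Poll one data source on its own schedule; failures stay local to this task."""
    results = []
    for _ in range(cycles):  # a real service would loop until cancelled
        try:
            results.append(await poll())
        except Exception as exc:  # one failing source must not stop the others
            results.append(f"{name} failed: {exc}")
        await asyncio.sleep(interval_s)
    return results

async def run_all(sources, cycles=2):
    """Start one independent task per source, then wait for all of them."""
    tasks = {
        name: asyncio.create_task(collect(name, poll, interval_s, cycles))
        for name, (poll, interval_s) in sources.items()
    }
    return {name: await task for name, task in tasks.items()}

async def good_poll():
    return "ok"

async def bad_poll():
    raise ConnectionError("device unreachable")

# Hypothetical sources: one healthy logger and one unreachable meter.
demo = asyncio.run(run_all({"logger": (good_poll, 0.01), "meter": (bad_poll, 0.01)}))
```

In this sketch the unreachable meter only affects its own result list; the logger task keeps collecting.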
Network Access Requirements
The Data Scraper requires direct TCP/IP network access to each data source. This typically means:
- Direct Ethernet/WiFi connectivity to the device's IP address
- Open network ports for the device's protocol (e.g., TCP 80/443 for HTTP and WebSocket APIs)
- Proper network routing between the Data Scraper host and devices
If direct network access is not possible (e.g., isolated OT networks, air-gapped systems, serial-only devices), we may need to implement an intermediate data collector such as:
- Third-party data logger with network connectivity
- Protocol gateway (Serial-to-Ethernet, fieldbus bridge, etc.)
- Custom hardware solution for specialized interfaces
Consult with our engineering team to evaluate connectivity options for your specific installation.
The Adapter System
Device-specific protocols → Adapter → Standardized data format
Concept
An adapter is a vendor-specific connector module that knows how to communicate with one particular family of device. Each adapter is hand-built and reverse-engineered against that device's own web, API, or database interface — there is no single "speaks-any-protocol" engine. The adapter system is the core extensibility mechanism: supporting a new device means adding a new adapter, which we do on request (see below).
Each adapter is a self-contained module responsible for:
- Connection Management - Establishing and maintaining communication
- Data Retrieval - Fetching measurements using the appropriate protocol
- Data Transformation - Converting to standardized metric format
All adapters inherit from a base class that provides health monitoring, automatic retry logic, exponential backoff on failures, metric validation, and status reporting to the IoT platform.
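As a rough illustration of exponential backoff, the retry delay might be computed as below. This is a sketch under assumptions: the base delay, growth factor, and cap are hypothetical values, not the base class's actual settings.

```python
def backoff_delay(failures, base_s=1.0, factor=2.0, max_s=300.0):
    """Seconds to wait before the next retry after `failures` consecutive failures."""
    if failures <= 0:
        return 0.0  # no failure yet, no extra delay
    # Delay doubles with each failure (1 s, 2 s, 4 s, ...) and is capped at max_s.
    return min(max_s, base_s * factor ** (failures - 1))
```

Capping the delay keeps a long outage from pushing the next retry arbitrarily far into the future.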
Health Management
Each adapter implements an automatic state machine:
- INITIALIZING: Starting up and establishing initial connections
- HEALTHY: Operating normally with successful data collections
- UNHEALTHY: Experiencing errors but attempting to continue
- RECONNECTING: Performing recovery actions after repeated failures
- FROZEN: The device is returning stale data — the same values repeat, or no new reading has arrived within the expected window
- PAUSED: Temporarily paused by user command; resumes automatically when the pause expires
The system automatically transitions between states, reports status to the platform, and attempts recovery without manual intervention. The FROZEN state is what lets the platform tell a genuinely-down logger apart from one that is merely repeating a stuck value.
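The states above could be modelled roughly as follows. The state names come from the list; the decision order, the failure threshold, and the staleness window are assumptions made for illustration only.

```python
from enum import Enum

class Health(Enum):
    INITIALIZING = "INITIALIZING"
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    RECONNECTING = "RECONNECTING"
    FROZEN = "FROZEN"
    PAUSED = "PAUSED"

def next_state(has_collected, paused, consecutive_failures, seconds_since_new_value,
               reconnect_after=3, frozen_after_s=900.0):
    """Pick the adapter state from a few observations (thresholds are hypothetical)."""
    if paused:
        return Health.PAUSED
    if consecutive_failures >= reconnect_after:
        return Health.RECONNECTING  # repeated failures: start recovery actions
    if consecutive_failures > 0:
        return Health.UNHEALTHY     # errors seen, but still trying to collect
    if not has_collected:
        return Health.INITIALIZING
    if seconds_since_new_value > frozen_after_s:
        return Health.FROZEN        # device answers, but the values are stale
    return Health.HEALTHY
```

Note that FROZEN is only reached when requests succeed, which is what separates a stuck logger from an unreachable one in this sketch.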
Supported Devices
The Data Scraper ships with around fifteen vendor-specific adapters, each built for a particular device family. The list below reflects what is supported today and grows whenever a new device is integrated.
| Device family | What it is | How it is read |
|---|---|---|
| Bluelog data logger | Meteocontrol-style logger (sensors + strings) | HTTP login plus a live WebSocket feed |
| SMA Sunny Central | Central inverter controller | Vendor HTTP API (with shutdown detection) |
| SMA Power Manager | SMA plant controller | Vendor HTTP API |
| Sungrow logger | Sungrow inverter data logger | Live WebSocket connection |
| Huawei SmartLogger | SmartLogger 1000 / 3000 / 4000 | HTTP web interface (zero-config onboarding) |
| Janitza meters | Power-quality meters | HTTP, no credentials (zero-config onboarding) |
| Phoenix Contact PLC | PLCnext / SPS controller | Vendor HTTPS REST API (zero-config onboarding) |
| Dexcon controller | Plant controller | Vendor HTTPS REST API |
| Zebotec | Inverters and sensors | Vendor HTTP API |
| FREQCON BESS | Battery storage system | Vendor HTTP plus a time-series query interface |
| PRTG | Network-monitoring server | PRTG HTTP API |
| Becker historian | Historian database (Microsoft SQL Server) | Microsoft SQL Server connection |
| Object storage / files | S3 or S3-compatible storage and local files | File scanning with CSV/Excel parsing and gap detection |
| Weather model | Open-Meteo weather + on-device PV power model | HTTP (clear-sky and irradiance modelling) |
Real Transports in Use
Across these adapters, the actual communication methods are vendor HTTP/HTTPS APIs (most common), live WebSocket connections (Bluelog and Sungrow), Microsoft SQL Server (the Becker historian), S3 / file access, and a time-series query interface used only by the FREQCON battery adapter. There is no generic Modbus, MQTT, FTP/SFTP, or WebDAV ingestion — each integration is purpose-built for its device.
Creating New Adapters
New adapters can be developed to support additional devices or protocols. The modular design and base class functionality significantly reduce development time.
Compatibility with Legacy Equipment: We can create adapters for older devices that were never specifically designed for data export. As long as the device provides its data in any accessible way—whether through a REST API, web interface, database, file system, or any other mechanism—we can extract and integrate that data into the platform.
Unrestricted Data Collection: Our adapters are not limited to the pre-defined data-export formats that data loggers typically provide. We can collect any data that the device makes available, going beyond the standard set of metrics a manufacturer's logger might expose. If a device has additional diagnostic information, advanced parameters, or hidden data points accessible through its interface, we can retrieve and standardize them.
Custom Adapters on Demand
We can create new adapters for virtually any data source at any time upon customer request. The adapter system is designed for rapid extensibility: support for a new device can typically be implemented within days, depending on complexity. If you have equipment from a manufacturer not yet supported, contact us to discuss custom adapter development.
No Vendor Documentation Required
Adapter development does not strictly require vendor API documentation. Through network traffic analysis, protocol reverse engineering (where legally permitted), and empirical testing, we can often create functional adapters even for devices with undocumented interfaces. This capability is particularly valuable for legacy equipment or systems with proprietary protocols.
Onboarding Through the Platform
For a subset of devices, you can bring a logger online from the platform without hand-writing any configuration. An onboarding wizard asks the agent to dry-run-connect to the device and streams the live probe results back to you, so you see immediately whether the connection works before committing it. There are two flavours:
- Zero-config onboarding — the adapter already owns the device's full reading set, so the wizard just shows a read-only live preview and you save. Available today for Janitza meters, Huawei SmartLogger, and Phoenix Contact controllers. Janitza needs no credentials at all.
- Interactive mapping for generic loggers — some loggers expose arbitrary raw values the platform can't interpret on its own. For these the wizard asks the agent to enumerate every raw value the device exposes (group, name, unit, live sample), the operator maps each one to a known metric, and a mapping-driven dry run previews the exact metrics that would be produced before saving. QReader is the first generic logger supported this way.
Not Universal Plug-and-Play
Only the device families above are wizard-onboardable today (the three zero-config families plus generic loggers like QReader). All other adapters still require a per-device configuration delivered with the agent, so treat onboarding automation as device-specific rather than universal.
Metric Standardization
All collected data is transformed into a standardized metric format defined by the platform's metric taxonomy. This ensures consistency across all data sources and enables unified processing downstream.
Metric Structure
Each metric follows a standardized structure compatible with modern time-series databases:
Components:
- Name: Standardized metric identifier from predefined taxonomy
- Value: Numeric measurement in the platform's standard unit for that quantity (see Unit Conventions)
- Labels: Key-value pairs for component identification and grouping
- Timestamp: Optional preservation of original device timestamp
Standard Labels:
- Source adapter type and instance number
- Human-readable names
- Component identifiers (inverter ID, string number, etc.)
- Physical location or grouping information
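A metric with the components listed above might be represented like this. It is a sketch only: the field names mirror the list, while the metric name and label keys in the example are hypothetical and not taken from the platform's taxonomy.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Metric:
    name: str                          # identifier from the metric taxonomy
    value: float                       # numeric measurement in the standard unit
    labels: Dict[str, str] = field(default_factory=dict)
    timestamp: Optional[float] = None  # original device timestamp, if preserved

    def series_key(self):
        """Name plus sorted labels identify one time series."""
        return (self.name, tuple(sorted(self.labels.items())))

# Hypothetical example: AC power of one inverter read through one adapter instance.
sample = Metric(
    name="inverter_power_ac",
    value=12500.0,
    labels={"adapter": "sungrow", "instance": "1", "inverter_id": "INV-01"},
    timestamp=1717236000.0,
)
```

Two samples with the same name and labels belong to the same series regardless of the order in which the labels were added.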
Unit Conventions
All metrics use one fixed, unprefixed unit per quantity (SI units where applicable), regardless of what the manufacturer's device reports:
- Power: Watts (W)
- Energy: Watt-hours (Wh)
- Voltage: Volts (V)
- Current: Amperes (A)
- Temperature: Celsius (°C)
- Irradiance: Watts per square meter (W/m²)
Adapters automatically convert from manufacturer-specific units (kW, MWh, etc.) to these standards during the transformation phase.
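The conversion step can be pictured as a small lookup table. This is a minimal sketch, assuming a simple factor-based table; the function name and the set of accepted input units are illustrative, not the adapters' actual code.

```python
# Manufacturer unit -> (multiplication factor, standard unit)
_SCALE = {
    "W": (1.0, "W"), "kW": (1e3, "W"), "MW": (1e6, "W"),
    "Wh": (1.0, "Wh"), "kWh": (1e3, "Wh"), "MWh": (1e6, "Wh"),
    "V": (1.0, "V"), "kV": (1e3, "V"),
    "A": (1.0, "A"), "mA": (1e-3, "A"),
    "°C": (1.0, "°C"), "W/m²": (1.0, "W/m²"),
}

def to_standard(value, unit):
    """Convert a reading to the standard unit; returns (value, unit)."""
    if unit == "°F":
        return (value - 32.0) * 5.0 / 9.0, "°C"
    if unit == "K":
        return value - 273.15, "°C"
    try:
        factor, target = _SCALE[unit]
    except KeyError:
        # Refuse unknown units instead of guessing a factor.
        raise ValueError(f"no conversion known for unit {unit!r}") from None
    return value * factor, target
```

Raising on an unknown unit is a deliberate choice in this sketch: a silently unconverted value would be harder to spot downstream than a failed collection.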
Metric Categories
The platform defines 376 standardized metric types organized into 11 families:
| Family | Count | What it covers |
|---|---|---|
| Powerplant | 156 | Grid, AC output, inverters, combiner boxes, strings, irradiation |
| Battery | 86 | Battery box, storage, module and cell measurements |
| Weather | 46 | Weather inputs and measurements |
| Agent | 19 | Agent self-telemetry and health |
| Weather Model | 16 | Modelled PV production from weather |
| Network SNMP | 16 | SNMP readings from network devices |
| Network Monitor | 11 | Local-network monitoring measurements |
| AI Usage | 11 | AI feature usage at the edge |
| Operator | 7 | Operator-fleet telemetry |
| Network | 4 | Basic connectivity |
| Scraper | 4 | Data Scraper self-metrics |
Label expansions (per-string, per-phase, per-inverter, and so on) multiply these into far more individual time series at a real plant. For the complete metric taxonomy and definitions, see Metric Collection.
Data Collection Flow
Polling Strategies
Adapters support two polling modes:
Interval-Based (default): Executes every N seconds after the previous collection completes. Simple and responsive to varying collection durations.
Static Time-Based: Executes at fixed intervals counted from midnight, with an optional offset (e.g., at 00:01, 00:06, 00:11 for 5-minute intervals with a 1-minute offset). Useful for alignment with external systems.
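For the static time-based mode, the next run time can be derived from the time elapsed since midnight. The sketch below is an assumption about how such a schedule could be computed; the function and parameter names are hypothetical.

```python
import math
from datetime import datetime, timedelta

def next_static_run(now, interval_s, offset_s=0):
    """Next slot strictly after `now`; slots fire at midnight + offset + k * interval."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - midnight).total_seconds()
    # Index of the first slot that lies strictly after `now` (never below 0).
    k = max(0, math.floor((elapsed - offset_s) / interval_s) + 1)
    return midnight + timedelta(seconds=offset_s + k * interval_s)
```

With a 300-second interval and a 60-second offset, a call at 00:03:30 yields 00:06:00, and a call at 23:58:00 rolls over to 00:01:00 on the following day.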
Processing Pipeline
After collection, metrics pass through several processing stages:
Metric Preparation: Source labels are added, timestamps applied, and structure validated.
Filtering: Configured filters can modify values, validate ranges, or skip metrics based on rules.
Calculations: Automated calculators derive additional metrics:
- Solar radiation power integrated to irradiation energy
- String voltage × current calculated to power
- Power values integrated to energy over time
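Two of these calculations are sketched below. The integration method is not stated in this document, so the trapezoidal rule used here is an assumption, and the function names are hypothetical.

```python
def string_power_w(voltage_v, current_a):
    """String power from voltage and current (P = V x I)."""
    return voltage_v * current_a

def integrate_energy_wh(samples):
    """Integrate (timestamp_s, power_w) samples, sorted by time, to watt-hours."""
    total_wh = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Trapezoidal rule: mean power over the interval times its length in hours.
        total_wh += (p0 + p1) / 2.0 * (t1 - t0) / 3600.0
    return total_wh
```

For example, 1000 W held for half an hour followed by a linear ramp to 2000 W over the next half hour integrates to 500 Wh + 750 Wh = 1250 Wh.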
Component Discovery: As metrics flow through, the Data Scraper automatically discovers and identifies installation components. This is a crucial feature—since the Data Scraper is the layer that actively collects data, it inherently knows which components exist and are providing data. The system automatically discovers:
- Inverters (from inverter power metrics)
- String combiner boxes / GAKs (from GAK metrics)
- Individual strings (from string voltage/current metrics)
- Irradiation sensors (from radiation metrics)
- Grid connection points (from grid energy metrics)
Discovered components are synchronized to the IoT platform for inventory management, creating a real-time, self-maintaining equipment registry without manual configuration.
Component Activity Tracking
Because the Data Scraper continuously polls data sources, it can detect which components are actively providing data and which have stopped responding. When a component stops sending metrics, the Data Scraper marks it as inactive in the IoT Cloud. When it resumes, it's automatically marked active again. This provides real-time awareness of equipment operational status—not just whether the Data Scraper can reach the data logger, but whether individual components within the installation are functioning and reporting data.
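Activity tracking of this kind can be reduced to comparing each component's last-seen time against a timeout. The following is a minimal sketch; the timeout value and the data shapes are assumptions for illustration.

```python
def activity_changes(last_seen, active, now, timeout_s=600):
    """Return {component: new_flag} for components whose active status flipped.

    last_seen: component -> timestamp of its most recent metric
    active:    component -> status currently recorded in the registry
    """
    changes = {}
    for component, seen in last_seen.items():
        is_active = (now - seen) <= timeout_s
        if active.get(component) != is_active:
            changes[component] = is_active  # only report actual transitions
    return changes
```

Returning only the transitions keeps updates to the registry small: a component that stays active or stays inactive produces no report.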
Production Detection: The system monitors plant operational state:
- Detects when production begins based on irradiance and power
- Identifies unexpected shutdowns during production hours
- Reports state transitions for alerting
Metric Grouping: Metrics are batched by time series to optimize database insertion performance.
Network Traffic Optimization
After metric grouping and batching, the Data Scraper applies additional compression before transmitting data to the Mirox-Cloud. This significantly reduces network traffic volume, which is particularly beneficial when internet bandwidth is limited or metered. For more details on bandwidth considerations, see On-Site Deployment.
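The effect of grouping and compression can be illustrated with a short sketch. The wire format and compression algorithm are not specified here, so JSON and gzip are assumptions chosen purely for illustration, as are the function and field names.

```python
import gzip
import json
from collections import defaultdict

def group_by_series(samples):
    """Group samples so each time series carries its name and labels only once."""
    grouped = defaultdict(list)
    for s in samples:
        key = (s["name"], tuple(sorted(s["labels"].items())))
        grouped[key].append([s["ts"], s["value"]])
    return [
        {"name": name, "labels": dict(labels), "points": points}
        for (name, labels), points in grouped.items()
    ]

def encode_batch(samples):
    """Return (raw, compressed) byte payloads for one batch."""
    raw = json.dumps(group_by_series(samples), separators=(",", ":")).encode("utf-8")
    return raw, gzip.compress(raw)

# Hypothetical batch: 100 readings each from two inverters.
batch = [
    {"name": "inverter_power_ac", "labels": {"inverter_id": f"INV-{i % 2}"},
     "ts": 1717236000 + i, "value": 1000.0 + i}
    for i in range(200)
]
raw, packed = encode_batch(batch)
```

Grouping already removes the repeated name and labels from every sample; compression then shrinks the remaining, highly regular payload further.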
Data Export
Processed metrics are forwarded to two destinations:
Time-Series Database: Metrics are pushed in batches with rate limiting and retry logic for long-term storage and historical querying.
Digital Twin Webhook: A separate background task continuously forwards the latest metric values to the Digital Twin service (a completely separate microservice) for real-time analysis. The Data Scraper has no knowledge of what the Digital Twin does with the data—it simply provides the metrics. For information about Digital Twin processing, see Digital Twin.
Stateless Operation
The Data Scraper is designed to be completely stateless:
- No persistent storage—all state exists only in memory
- Can be stopped and restarted without data loss
- Multiple instances can run independently for different parks
- Each polling cycle is independent of previous cycles
- Crash-resilient with no risk of corrupting persistent state
The only persistent state exists externally:
- Configuration files (version-controlled)
- Time-Series Database (external system)
- IoT Cloud component registry (external system)
This design ensures operational simplicity, reliability, and easy horizontal scaling.
Separation of Concerns
The Data Scraper has a narrow, focused responsibility that enables clear separation from other platform components:
Data Scraper:
- Collects raw measurements from equipment
- Transforms data to standard format
- Discovers and tracks components
- Monitors component activity status
- Forwards metrics to other services
Digital Twin: Validates against physics models and detects anomalies and losses
Time-Series Database: Stores historical data, provides query interface
IoT Cloud: Maintains component registry, tracks device status, manages equipment inventory
This separation enables independent development, testing, deployment, and scaling of each component while ensuring each service focuses on its core competency.
Advanced Features
Automatic Health Monitoring
Each adapter implements a state machine that tracks operational health with automatic reporting to the platform and exposure via the metrics API for operational monitoring.
Automatic Component Discovery
The Data Scraper's position as the data collection layer gives it a unique advantage: it inherently knows which components exist at an installation because it directly interacts with the metrics they produce. As metrics flow through the system, components are automatically discovered from metric labels and registered with the IoT platform.
Discovery Process:
- Metrics arrive with identifying labels (inverter ID, string number, sensor location, etc.)
- The Data Scraper extracts component information from these labels
- New components are automatically registered with the IoT Cloud
- Component metadata (type, identifier, location) is synchronized
- The platform maintains an up-to-date equipment inventory without manual entry
This self-discovery mechanism ensures the platform always knows what equipment exists at the installation, eliminating the need for manual configuration and reducing deployment time.
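The discovery steps above can be sketched as a pass over metric labels. The label keys and component type names below are hypothetical; the real label vocabulary is defined by the platform's metric taxonomy.

```python
# Hypothetical mapping from identifying label to component type.
LABEL_TO_TYPE = {
    "inverter_id": "inverter",
    "gak_id": "combiner_box",
    "string_id": "string",
    "sensor_id": "irradiation_sensor",
}

def discover_components(label_sets, known):
    """Return components seen for the first time and add them to `known`."""
    new = []
    for labels in label_sets:
        for label, component_type in LABEL_TO_TYPE.items():
            if label in labels:
                component = (component_type, labels[label])
                if component not in known:
                    known.add(component)
                    new.append(component)  # would be registered with the IoT Cloud
    return new
```

Because `known` is updated in place, feeding the same batch through a second time yields no new components, so registration happens once per component.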
Production State Detection
The service monitors plant operational state and detects production starts, unexpected shutdowns during production hours, and state transitions for alerting and analysis. It reports only when the state actually changes. It also watches for overproduction, meaning output above the clear-sky model for a sustained period, which can indicate a logger that is returning frozen values. In that case the service falls back to the clear-sky model so the data stream stays sensible. Together this provides real-time operational awareness beyond raw measurements.
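A sustained-overproduction check could look roughly like this. The tolerance and the number of consecutive samples are hypothetical thresholds; the document does not state the actual values.

```python
def sustained_overproduction(samples, tolerance=1.05, min_consecutive=10):
    """True if actual power exceeds the clear-sky model for a sustained run.

    samples: iterable of (actual_w, clear_sky_w) pairs in time order.
    """
    run = 0
    for actual_w, clear_sky_w in samples:
        if clear_sky_w > 0 and actual_w > clear_sky_w * tolerance:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a single plausible sample resets the streak
    return False
```

Requiring a consecutive run avoids flagging short, legitimate spikes above the model, such as brief cloud-edge enhancement.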
Calculated Metrics
Several calculators automatically derive metrics from raw measurements—solar radiation integrated to irradiation energy, string power calculated from voltage and current, and power values integrated to energy over time. These calculations happen transparently, enriching the data stream without requiring explicit configuration.
Edge Analytics
Beyond raw collection, the Data Scraper runs a set of analytics directly at the plant, computed from the live metric feed and exported as chartable time series alongside the raw data.
Expected Power and Performance Ratio
The agent computes the expected power for each plant from its irradiance sources (on-site pyranometer, satellite radiation, or weather model) and compares it against actual output to derive the performance ratio (PR) — a normalized measure of how well the plant is converting available sunlight into electricity. PR is calculated per day and accounts for clipping, frost, and data-quality filters so that anomalous days do not distort the result.
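In its simplest form, a daily performance ratio is actual energy divided by expected energy over the samples that pass the quality filters. The sketch below assumes equally spaced samples and a single `valid` flag standing in for the clipping, frost, and data-quality filters; both are simplifications, and the field names are hypothetical.

```python
def daily_performance_ratio(samples):
    """PR for one day from equally spaced samples; None if nothing usable.

    Each sample is a dict with 'actual_w', 'expected_w', and a 'valid' flag.
    """
    usable = [s for s in samples if s["valid"]]
    expected = sum(s["expected_w"] for s in usable)
    if expected <= 0:
        return None  # no expected production, PR is undefined
    actual = sum(s["actual_w"] for s in usable)
    # With equal spacing, the ratio of power sums equals the ratio of energies.
    return actual / expected
```

For example, two valid samples of 800 W and 900 W against 1000 W expected each give a PR of 0.85, while a filtered sample has no influence on the result.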
Live Curtailment Tracking
When a plant produces less than it could, the agent attributes the foregone production to its cause: curtailment by the marketer (a deliberate market-driven cap) versus curtailment by the grid operator. This distinction matters for loss accounting and contractual reporting. Curtailment is tracked per minute and emitted as both instantaneous power and cumulative energy. See Loss Detection for how curtailment fits into overall loss attribution.
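Per-minute curtailment accounting can be sketched as follows. The cause labels and the input shape are assumptions; how the agent actually determines the possible power and the cause is not described here.

```python
def curtailment_energy_wh(minutes):
    """Accumulate foregone energy per cause from one entry per minute.

    minutes: list of (possible_w, actual_w, cause), where cause is
    'marketer', 'grid_operator', or None when the plant is not curtailed.
    """
    totals = {"marketer": 0.0, "grid_operator": 0.0}
    for possible_w, actual_w, cause in minutes:
        if cause in totals:
            curtailed_w = max(0.0, possible_w - actual_w)  # instantaneous power
            totals[cause] += curtailed_w / 60.0            # one minute in hours
    return totals
```

A minute capped from 6000 W to 3000 W by the marketer adds 50 Wh to the marketer total; a full shutdown from 6000 W by the grid operator adds 100 Wh to the grid-operator total.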
Clear-Sky Baseline and Forecasting
- Clear-sky baseline: a theoretical ideal-conditions PV curve maintained as a long-term record, giving you a stable reference to compare real output against.
- Day-ahead forecast: a short-horizon PV production forecast derived from weather data, so you can anticipate the next day's output. Like the clear-sky baseline, it is based on weather physics rather than statistical guesses.
Historical Backfill
When a plant is first connected or after a gap, the agent can backfill historical data — replaying raw readings and re-deriving the analytics above for a requested window, then handing the result to the live pipelines so charts are complete from day one rather than starting empty.
Network Monitoring
The Data Scraper includes a built-in local-network inspector that runs alongside data collection and maps the plant's on-site network. It discovers devices on the configured network ranges by sweeping for reachable hosts, reading the address table, identifying vendors from hardware addresses, and probing devices for their identity. Discovered devices are classified against a large library of known network-equipment profiles, and an AI device-identification step helps recognize device families that simple rules miss.
Once devices are known, the inspector polls them for reachability and health (response time, interface and resource status) and reports the results to the platform. You can trigger or stop a discovery scan, re-check an individual device, and review the discovered network — see Local Network Inspector.
Proxy Auditing
For plants reachable through the Mirox Browser Proxy, the agent audits human access to local device interfaces. It groups each person's activity into sessions, redacts sensitive query data before anything is stored, and can produce an AI-generated summary of what a session did. This feeds the platform's access audit trail — see Access Audit Log.
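Redacting sensitive query data before storage might work along these lines. The list of sensitive parameter names and the replacement text are hypothetical; the sketch only shows the general idea using Python's standard `urllib.parse` module.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters treated as sensitive.
SENSITIVE = {"password", "token", "session"}

def redact_url(url):
    """Replace the values of sensitive query parameters before the URL is stored."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() in SENSITIVE else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `http://10.0.0.5/login?user=anna&password=s3cret` becomes `http://10.0.0.5/login?user=anna&password=REDACTED`, so the audit trail keeps the request shape without the secret.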
Performance Characteristics
Typical Performance:
- Polling frequencies: 1-300 seconds per adapter (configurable)
- Concurrent adapters: 20+ running simultaneously
- Throughput: 10,000+ metrics per minute sustained
- Latency: Sub-100ms from collection to database insertion
- Resource usage: 5-15% CPU, 100-500 MB memory on edge hardware
The asynchronous architecture ensures high concurrency without blocking, enabling efficient collection from many sources simultaneously.
Related Features
- Digital Twin — the physics-based analysis engine that consumes the metrics the Data Scraper collects
- Mirox-Agent Overview — how the Data Scraper fits within the wider edge agent
- Deployment Options — on-site versus cloud deployment trade-offs for the agent
- Metric Collection — the full standardized metric taxonomy
- Local Network Inspector — the on-site network monitoring surface
- Loss Detection — how curtailment and other losses are attributed