Title: Manual Health System – Complete Function-by-Function Reference

Location (face): extractly/extractly/management/commands/manual_health.py
Location (heart): manual_agregator/server_monitor/engine.py
Support (optional): manual_agregator/server_monitor/manual_notify.py (draft)

Purpose
The manual health system provides a single command-line entry point (manual_health) that
orchestrates a suite of diagnostics over the manual ingestion pipeline. It measures:
- Network fetch health
- Parser quality and field completeness
- Freshness/throughput and backlog
and exposes utilities to:
- Generate a per-portal health snapshot
- Create debug bundles (HTML/selectors/errors) for investigation
- Lint selector configurations against model fields

Core Data Models Referenced
- SourceManual: portal configuration and selectors
- NetworkMonitoredPage: fetched pages (html, sliced_html, is_active, created_at, url, etc.)
- NetworkPageError: network/page-level errors
- AdsManual: parsed ads (has_data, price, title, city, description, …)

================================================================================
CLI FACE: manual_health (extractly/extractly/management/commands/manual_health.py)
================================================================================

Overview
This Django management command is the single CLI “face” of the system. It defines
subcommands and delegates execution to ServerMonitorEngine in engine.py. A minimal
sketch of that delegation appears at the end of this section.

Subcommands and Arguments

1) health (default if no subcommand is provided)
   - --name/-n : limit to specific portal names
   - --all : include all portals
   - --limit : maximum portals in the output (default 25)
   - --threshold-error <0..1> : error-rate threshold above which a portal is considered unhealthy (default 0.3)
   - --json : write JSON to a file instead of printing the console summary
   Behavior: calls engine.health_snapshot(portals, limit, thr_error) and prints a
   one-line summary per portal, or writes JSON to the given file.

2) run
   - --portal/-p : portal name (required)
   - --hours : lookback window for checks (default 24)
   - --json : write JSON to a file
   Behavior: calls engine.run_all(portal, hours), which returns a dict with three
   sections: network, parser, freshness.

3) debug-dump
   - --name/-n : specific portals (omit when using --all)
   - --all : process all portals
   - --limit : max pages per portal (default 25)
   - --only-unparsed : include only pages without a parsed ad
   - --only-errors : include only pages that have errors
   - --out : output root directory (default debug/manual_dump)
   - --check-selectors : count CSS selector hits using BeautifulSoup
   - --json : write the summary JSON to a file
   Behavior: calls engine.debug_dump(...), writes page bundles to disk, and returns a
   summary of how many bundles were created.

4) lint
   - --name/-n : specific portals (omit when using --all)
   - --all : lint all portals
   - --json : write JSON to a file
   Behavior: calls engine.lint_selectors(...) and reports unknown selector field keys
   and disallowed configuration properties.
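The command itself stays thin: it parses a subcommand plus its flags and hands everything
to ServerMonitorEngine. The following is a minimal sketch of that shape, assuming standard
Django BaseCommand subparsers; flag names follow the list above, but the exact wiring in
manual_health.py may differ.

    # Sketch only: a thin management command delegating to ServerMonitorEngine.
    # Only the "run" subparser is shown; health/debug-dump/lint are wired the same way.
    import json

    from django.core.management.base import BaseCommand
    from manual_agregator.server_monitor.engine import ServerMonitorEngine


    class Command(BaseCommand):
        help = "Health diagnostics for the manual ingestion pipeline"

        def add_arguments(self, parser):
            sub = parser.add_subparsers(dest="subcommand")
            run = sub.add_parser("run")
            run.add_argument("--portal", "-p", required=True)
            run.add_argument("--hours", type=int, default=24)
            run.add_argument("--json", dest="json_path")

        def handle(self, *args, **opts):
            engine = ServerMonitorEngine()
            if opts.get("subcommand") == "run":
                result = engine.run_all(opts["portal"], hours=opts["hours"])
            else:
                # "health" snapshot is the default when no subcommand is given
                result = engine.health_snapshot()
            if opts.get("json_path"):
                with open(opts["json_path"], "w", encoding="utf-8") as fh:
                    json.dump(result, fh, indent=2, default=str)
            else:
                self.stdout.write(json.dumps(result, indent=2, default=str))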
================================================================================
ENGINE HEART: manual_agregator/server_monitor/engine.py
================================================================================

Overview
engine.py consolidates all diagnostic logic. It is organized into three checkers and one
orchestrator:
- NetworkHealthChecker: fetch-layer health
- ParserHealthChecker: parsing quality and field completeness
- FreshnessChecker: lag, backlog, and throughput health
- ServerMonitorEngine: orchestrates the checks and provides utilities (snapshot, dump, lint)

-----------------------------
Class: NetworkHealthChecker
-----------------------------

Function: check_fetch_health(portal: str, hours: int = 24) -> Dict

What it checks:
- Filters NetworkMonitoredPage by source name and created_at >= now - hours
- Computes metrics:
  * total_pages
  * empty_html_count / empty_html_rate (html IS NULL)
  * empty_sliced_count / empty_sliced_rate (sliced_html IS NULL)
  * undersized_pages (html length < 10,000 chars)
  * duplicate_html_rate (hash density across up to 50 pages)
  * consecutive_error_streak (last 20 NetworkPageError rows)

Health decision:
- healthy == True if:
  empty_html_rate < 0.10 AND empty_sliced_rate < 0.15 AND consecutive_error_streak < 10

Warnings emitted when:
- empty_html_rate > 0.20 (high)
- duplicate_html_rate > 0.80 (critical – likely captcha/bot-block)
- undersized_pages > 30% of total_pages (warning)
- consecutive_error_streak > 5 (critical)

Edge cases:
- If total_pages == 0 → {healthy: False, reason: "no_pages_fetched", metrics: {}}

Helper: _check_duplicate_hashes(pages) -> float
- Computes the most common html hash across the first ~50 pages and returns its share
  (0..1). Used to detect mass-duplicate HTML.

Helper: _check_error_streaks(portal: str) -> int
- Inspects the most recent ~20 NetworkPageError entries for the portal and counts the
  streak of consecutive errors.

Helper: _generate_fetch_warnings(metrics: Dict) -> List[str]
- Builds the warning list based on the thresholds listed above.

----------------------------
Class: ParserHealthChecker
----------------------------

Function: check_parser_health(portal: str, hours: int = 24) -> Dict

What it checks:
- Filters AdsManual joined to NetworkMonitoredPage by portal and created_at within the
  lookback window.
- Computes metrics:
  * total_ads
  * parsed_ads, parse_rate (has_data=True)
  * error_count, error_rate (NetworkPageError in window)
  * critical_fields_score (weighted fill rate)
  * secondary_fields_score (mean fill rate)
  * price_anomalies: zero, negative, suspiciously low (<100), suspiciously high (>10,000,000)
  * duplicate_titles_rate (density of duplicates when total >= 10)
  * truncated_descriptions (description ends with … or ...)
  * partial_records, partial_rate (has_data=True but missing price OR title OR city)
  * field_breakdown (per-field filled count and fill_rate)

Health decision:
- healthy == True if:
  parse_rate > 0.70 AND error_rate < 0.15 AND critical_fields_score > 0.75

Warnings emitted when:
- parse_rate < 0.50 (critical)
- 0.50 <= parse_rate < 0.70 (high)
- error_rate > 0.30 (critical)
- critical_fields_score < 0.60 (high)
- partial_rate > 0.20 (warning)
- significant count of zero/negative prices (warning)

Edge cases:
- If total_ads == 0 → {healthy: False, reason: "no_ads_parsed", metrics: {}}

Helper: _fill_rate(ads, field: str) -> float
- Computes the fill rate for a given field, handling text vs. numeric types. Excludes
  NULLs and empty strings for text fields. A sketch of this kind of calculation follows
  below.
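For intuition, a fill-rate calculation of this shape could look roughly like the
following. This is a minimal sketch over an AdsManual queryset, not the actual
implementation; the TEXT_FIELDS set is an illustrative assumption for how text and
numeric fields might be told apart.

    # Sketch only: fill rate for one field over an AdsManual queryset.
    # TEXT_FIELDS is an assumption; the real checker may detect field types differently.
    TEXT_FIELDS = {"title", "city", "address", "description", "currency", "advertiser_name"}

    def fill_rate(ads, field: str) -> float:
        """Share (0..1) of ads whose `field` is filled; empty strings don't count for text fields."""
        total = ads.count()
        if total == 0:
            return 0.0
        filled = ads.exclude(**{f"{field}__isnull": True})
        if field in TEXT_FIELDS:
            filled = filled.exclude(**{f"{field}__exact": ""})
        return filled.count() / total

The critical-fields score described next is then essentially a weighted sum of such
per-field rates, and the secondary score an unweighted mean.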
Helper: _check_critical_fields(ads) -> float
- Weighted sum over CRITICAL_FIELDS:
  price: 0.25, currency: 0.10, title: 0.20, square_footage: 0.15, address: 0.15,
  city: 0.10, description: 0.05

Helper: _check_secondary_fields(ads) -> float
- Mean fill rate of SECONDARY_FIELDS: rooms, bathrooms, floor, estate_condition, lon,
  lat, advertiser_name

Helper: _detect_price_anomalies(ads) -> Dict
- Counts zero, negative, low, and high price outliers.

Helper: _check_duplicate_titles(ads) -> float
- Share of duplicate titles among all ads (when total >= 10).

Helper: _check_truncated_descriptions(ads) -> int
- Counts descriptions ending with an ellipsis.

Helper: _count_partial_records(ads) -> int
- Count of has_data=True ads missing price OR title OR city.

Helper: _get_field_breakdown(ads) -> Dict
- For the critical and secondary fields, returns {filled, fill_rate} per field.

Helper: _generate_parser_warnings(metrics: Dict) -> List[str]
- Creates a warning list matching the thresholds described above.

--------------------------
Class: FreshnessChecker
--------------------------

Function: check_freshness(portal: str) -> Dict

What it checks:
- last_page_fetched (max created_at among NetworkMonitoredPage for the portal)
- last_ad_created (max created_at among AdsManual for the portal)
- processing_lag_hours = (last_page_fetched - last_ad_created) in hours (if both exist)
- backlog_size = count of active pages without a parsed ad
- 24h throughput: pages_per_hour_24h, ads_per_hour_24h
- processing_efficiency = ads_24h / pages_24h (0 if pages_24h == 0)

Health decision:
- healthy == True if:
  (lag is None or lag < 4h) AND backlog_size < 1000 AND pages_24h > 100

Warnings emitted when:
- lag > 4h (critical)
- backlog_size > 1000 (high)
- pages_per_hour_24h < 5 (warning)
- ads_per_hour_24h < 5 (warning)
- processing_efficiency < 0.5 with sufficient fetch volume (warning)

Helper: _generate_freshness_warnings(metrics: Dict) -> List[str]
- Creates warning strings for the above conditions.

--------------------------------
Class: ServerMonitorEngine (Orchestrator)
--------------------------------

Function: run_all(portal: str, hours: int = 24) -> Dict

What it does:
- Calls:
  network.check_fetch_health(portal, hours)
  parser.check_parser_health(portal, hours)
  freshness.check_freshness(portal)
- Returns: { portal, network, parser, freshness }

Function: health_snapshot(portals: Optional[List[str]] = None, limit: int = 25,
                          thr_error: float = 0.3) -> Dict

What it does:
- Selects SourceManual rows (all, or filtered by name/source.name) and computes, per portal:
  * page counts (total, parsed, error_pages, is_active_pages)
  * ad counts (total_ads via the linked network_ad_manual)
  * last_ad_created (ISO timestamp)
  * parse_rate = parsed_pages / total_pages
  * error_rate = error_pages / total_pages
  * healthy = error_rate <= thr_error
  * fill rates/counts for CORE_FIELDS: price, currency, price_per_m2, title, description,
    address, square_footage, rooms, estate_type, offer_type, city
- Sorts rows by (error_rate, -total_pages) in descending order and truncates to limit.
- Returns: { ts, thresholds: { error_rate: thr_error }, portals: [rows...] }
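To make the snapshot rows concrete, a simplified version of the per-portal rate
calculation and sorting could look like this. It is a sketch built only from the
description above: snapshot_row and the example counts are illustrative, while the rate
formulas, the healthy rule, and the sort key mirror the reference.

    # Sketch only: turning raw per-portal counts into one snapshot row.
    # The counts would come from SourceManual / NetworkMonitoredPage / AdsManual queries.
    def snapshot_row(portal: str, total_pages: int, parsed_pages: int,
                     error_pages: int, thr_error: float = 0.3) -> dict:
        parse_rate = parsed_pages / total_pages if total_pages else 0.0
        error_rate = error_pages / total_pages if total_pages else 0.0
        return {
            "portal": portal,
            "total_pages": total_pages,
            "parsed_pages": parsed_pages,
            "error_pages": error_pages,
            "parse_rate": round(parse_rate, 3),
            "error_rate": round(error_rate, 3),
            "healthy": error_rate <= thr_error,
        }

    # Sort by (error_rate, -total_pages) descending, as the snapshot does (example numbers are made up).
    rows = [snapshot_row("otodom", 1200, 1100, 40), snapshot_row("morizon", 300, 150, 120)]
    rows.sort(key=lambda r: (r["error_rate"], -r["total_pages"]), reverse=True)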
Function: debug_dump(portals: Optional[List[str]] = None, all_portals: bool = False,
                     limit: int = 25, only_unparsed: bool = False, only_errors: bool = False,
                     out_dir: str = "debug/manual_dump", check_selectors: bool = False) -> Dict

What it does:
- Selects portals (all or specific).
- For each portal, selects recent NetworkMonitoredPage rows with optional filters:
  * only_unparsed → pages with network_ad_manual IS NULL
  * only_errors → pages that have a related NetworkPageError
- For each page, creates a folder under <out_dir>/<portal>/page_<…>/ containing:
  * page.html: full HTML (html or sliced_html)
  * selectors.json: JSON dump of SourceManual.selectors for that portal
  * info.json: metadata {url, name, is_active, created_at, has_ad_manual, errors_count, errors: [...]}
  * If check_selectors=True, also adds selector_hits = count of CSS matches per configured
    selector, computed with BeautifulSoup (see the sketch after the Operational Notes below)
- Returns: { out: <output root>, bundles: <number of bundles written> }

Function: lint_selectors(portals: Optional[List[str]] = None, all_portals: bool = False) -> List[Dict]

What it does:
- Validates selector configuration trees (SourceManual.selectors) against AdsManual fields
  and a whitelist of allowed selector config properties (KNOWN_CFG_KEYS).
- For each portal, returns:
  * portal: name
  * unknown_field_paths: paths to selector keys that do not match AdsManual fields
    (apart from a few permitted exceptions such as geo_json)
  * bad_property_paths: paths to properties not in KNOWN_CFG_KEYS

KNOWN_CFG_KEYS include: selector, selectors, fieldType, label, valueType, keyMap, ifMissing,
defaultValue, trueOptions, falseOptions, paragraphs, joinWith, currencyField, altLabels, cast,
splitBy, fromMain, splitIndex, maxField, rangeMode, allowRange, specialMap, isMain, labelField

================================================================================
Operational Notes
================================================================================
- The CLI supports JSON output via --json, which is honored by all subcommands of manual_health.
- health_snapshot produces a compact per-portal overview and is appropriate for dashboards.
- run_all (via the run subcommand) produces detailed per-layer metrics and warnings for deep diagnosis.
- debug_dump writes investigation-ready bundles to disk; check_selectors is useful when
  verifying selector breakage.
- lint_selectors prevents drift between the selector configuration and the AdsManual schema.
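For the check_selectors pass, counting CSS hits per selector is conceptually simple. Below
is a minimal sketch assuming a flat {field: css_selector} mapping has already been pulled
out of the (richer, nested) SourceManual.selectors tree; count_selector_hits and that
flattening are illustrative, not the engine's actual code.

    # Sketch only: count how many elements each configured CSS selector matches on a page.
    from bs4 import BeautifulSoup

    def count_selector_hits(html: str, flat_selectors: dict) -> dict:
        soup = BeautifulSoup(html or "", "html.parser")
        hits = {}
        for field, css in flat_selectors.items():
            try:
                hits[field] = len(soup.select(css))
            except Exception as exc:
                # Guarded like the engine's selector counting: record the error, don't crash.
                hits[field] = f"error: {exc}"
        return hits

A result such as {"title": 1, "price": 0} in a bundle's info.json points directly at the
broken selector.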
Edge-case Handling Summary
- No pages in window (network) → healthy=False with reason=no_pages_fetched
- No ads in window (parser) → healthy=False with reason=no_ads_parsed
- Safe fill-rate calculation for text vs. numeric fields
- Selector counting guarded by try/except; the error is recorded in info.json if parsing fails

Threshold Summary (defaults used by the engine)
- Network
  * empty_html_rate: warn >20%, healthy <10%
  * empty_sliced_rate: healthy <15%
  * duplicate_html_rate: critical >80%
  * consecutive_error_streak: warn >5, healthy <10
- Parser
  * parse_rate: critical <50%, high <70%, healthy >70%
  * error_rate: critical >30%, healthy <15%
  * critical_fields_score: high <60%, healthy >75%
  * partial_rate: warning >20%
- Freshness
  * processing_lag_hours: critical >4h, healthy <4h
  * backlog_size: high >1000
  * pages_per_hour_24h: warning <5
  * ads_per_hour_24h: warning <5
  * processing_efficiency: warning <50% when fetch volume is adequate

================================================================================
How to Run (quick reference)
================================================================================
From the project root (where manage.py resides), PowerShell examples:

1) Health snapshot (default)
   python manage.py manual_health --all
   python manage.py manual_health health --name "otodom,morizon" --limit 50 --threshold-error 0.25
   python manage.py manual_health --all --json "debug\health_snapshot.json"

2) Full check for one portal
   python manage.py manual_health run --portal otodom --hours 24 --json "debug\otodom_health.json"

3) Debug dump
   python manage.py manual_health debug-dump --name "otodom" --only-unparsed --check-selectors --out "debug\manual_dump"

4) Lint selectors
   python manage.py manual_health lint --all --json "debug\lint_report.json"

================================================================================
Optional Support: manual_notify (draft)
================================================================================
manual_agregator/server_monitor/manual_notify.py contains a webhook notifier template
(Slack/Discord/Teams) that is not currently wired up as a Django command. Before production
use it would need to be relocated under an app’s management/commands directory and refit to
use ServerMonitorEngine outputs or a defined “unhealthy” predicate.
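As an illustration of how such a notifier could consume the engine's output, here is a
minimal sketch: an "unhealthy" predicate over a run_all() result plus a generic webhook
POST. The predicate, the WEBHOOK_URL environment variable, and the use of the requests
library are all assumptions; the draft manual_notify.py may be structured differently.

    # Sketch only: a possible shape for a notifier built on ServerMonitorEngine.run_all().
    import os
    import requests

    def is_unhealthy(report: dict) -> bool:
        """True if any layer of a run_all() report flags itself as unhealthy."""
        return any(not report.get(section, {}).get("healthy", False)
                   for section in ("network", "parser", "freshness"))

    def notify(report: dict) -> None:
        if not is_unhealthy(report):
            return
        text = f"manual_health: portal '{report.get('portal')}' is unhealthy"
        # Slack/Discord/Teams incoming webhooks all take a small JSON payload, though the
        # exact field names differ; {"text": ...} matches Slack's simple format.
        requests.post(os.environ["WEBHOOK_URL"], json={"text": text}, timeout=10)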