An HTTP / HTML spider
🇩🇪 This post is also available in German.
I occasionally grab stuff from the internet and re-organize the info contained in that stuff. One example: I didn’t like the sort order of the program of a German amateur radio convention, so I produced my own.
My problem was: One page had the timetable, the other the abstracts. I unified that info into one page.
Here’s how I did it.
I wrote a script that does the following things:
- Grab the pertinent pages (two in this case) from the internet and cache them. I expect to run half-baked versions of my script often until it’s done, and I don’t want to put load on the server each time.
- Parse the HTML and extract the two types of information I’m interested in: time information and abstracts.
- I didn’t plan to do this, but it unfortunately proved necessary: 17 out of the 36 items had conflicts between timetable and abstracts that needed to be resolved, because the author or the title didn’t quite agree. This started with humble typos and ended with “workshops” that didn’t have an abstract.
- Then join the two sources of info and write the result as a Markdown file, straight into my blog sources. (A sketch of the join idea follows this list.)
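The heart of that join is unspectacular: a dict keyed by the concatenation of author and title. Here is a minimal sketch of the idea, with toy data in place of the real pages (the names and room are made up):

```python
# Minimal sketch of the join idea, with toy data in place of the real pages.
plan = [("10:00", "Raum A", "Alice Example, DL0XYZ", "Some talk")]
abstracts = {
    f"{op} {title}": lines
    for op, title, lines in [("Alice Example, DL0XYZ", "Some talk", ["An abstract."])]
}

for time, room, op, title in plan:
    key = f"{op} {title}"  # join key: author and title, space-separated
    for line in abstracts[key]:  # a KeyError here means an unresolved inconsistency
        print(time, room, op, title, line)
```

The real script below raises a RuntimeError with a helpful message instead of letting a bare KeyError escape.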
I do similar stuff occasionally, so I document it here comparatively cleanly; I plan to use this as raw material when I do it again.
Software used
I use Python, currently version 3.11.2, and installed the following software into a venv via pip:

```
requests beautifulsoup4
```
Including their dependencies, as of today, this leads to this installation:

```
beautifulsoup4==4.13.5
certifi==2025.8.3
charset-normalizer==3.4.3
idna==3.10
requests==2.32.5
soupsieve==2.8
typing_extensions==4.15.0
urllib3==2.5.0
```
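For completeness, one way to set up such a venv (assuming a POSIX shell; only the two package names come from this post):

```
python3 -m venv venv
. venv/bin/activate
pip install requests beautifulsoup4
```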
I need to have the BeautifulSoup documentation handy. (If you know the DOM: this is DOM software for Python.)
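To illustrate that DOM flavor, a minimal sketch (the toy HTML is mine, not from the convention pages):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><h4>42 Alice<br/>Title</h4><p>Text</p></body></html>", "html.parser")
for h4 in soup.find_all("h4"):  # all <h4> elements, much like getElementsByTagName
    print(h4.get_text(" "))  # markup stripped: "42 Alice Title"
print(soup.body.p.string)  # navigation by attribute access: "Text"
```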
Polishing
Before publication, I polished the script a bit, something that I more or less habitually do when coding Python.
For that, I used this software:
```
pip install isort flake8 black types-requests mypy
```
And I saw to it that the following commands all come through without errors:
```
isort sort_ukwtagung_programm
flake8 --max-line-length=120 --color=never sort_ukwtagung_programm
black -l 120 sort_ukwtagung_programm
mypy --strict sort_ukwtagung_programm
```
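Since these four commands are always run together, a tiny wrapper script can enforce the “all without errors” requirement in one go; a sketch (the wrapper is my addition, not part of the original workflow):

```
#!/bin/sh
set -e  # stop at the first check that fails
isort sort_ukwtagung_programm
flake8 --max-line-length=120 --color=never sort_ukwtagung_programm
black -l 120 sort_ukwtagung_programm
mypy --strict sort_ukwtagung_programm
```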
The program itself
This is the code of my program sort_ukwtagung_programm:
```python
#!/usr/bin/env python
# This program is in the public domain,
# so usable by anyone for any purpose without restrictions,
# see https://creativecommons.org/publicdomain/zero/1.0/ .

import os
import re
from dataclasses import dataclass
from os.path import isfile
from typing import cast

import requests
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

DEFAULT_CACHE_DIR = "cache"


def get_html_page(
    session: requests.Session, uri: str, cache_filename: str, cache_dir: str = DEFAULT_CACHE_DIR
) -> BeautifulSoup:
    """Download an HTML page from the internet and extract the DOM tree.

    This caches on first access and never bothers to invalidate the cache,
    that is, assumes the cache is always fresh."""
    cache_fq_filename = f"{cache_dir}/{cache_filename}"
    if isfile(cache_fq_filename):
        # We downloaded the file earlier, so don't bother the HTTP server again:
        pass
    else:
        page_response = session.get(uri)
        page_response.raise_for_status()
        content_type = page_response.headers["content-type"].lower()
        if "text/html" in content_type:
            with open(cache_fq_filename, "w") as cache_file:
                cache_file.write(page_response.text)
        else:
            raise RuntimeError(f"Expected text/html, found {content_type} for {uri}")
    with open(cache_fq_filename, "r") as cache_file:
        return BeautifulSoup(cache_file, "html.parser")


@dataclass
class PlanItem:
    """Data class for items from the plan (timetable)."""

    time: str
    room: str
    op: str
    title: str


def grab_plan(session: requests.Session, uri: str, cache_filename: str) -> list[PlanItem]:
    """Grab plan from the internet and extract the plan."""
    plan_html = get_html_page(session, uri, cache_filename)
    table = plan_html.table
    if table is None:
        raise RuntimeError(f"No table in {uri} content")
    tbody = table.tbody
    if tbody is None:
        raise RuntimeError(f"No table.tbody in {uri} content")
    trs = tbody.find_all("tr")

    def parse_tr(tr: Tag) -> list[str]:
        """Split a <tr> into <td>s and grab the string content of each <td>."""
        tds = tr.find_all("td")
        # Consistency check, true for the table we want to parse:
        if len(tds) not in [5, 6]:
            raise RuntimeError(f"Expecting 5 or 6 columns in row {tr.prettify()}")
        result: list[str] = []
        for td in tds:
            # Remove all HTML markup and map every <td> content to a single-line string.
            result.append(" ".join(cast(Tag, td).strings).replace("\n", "").strip())
        return result

    # The table puts info that belongs on top of the table in the table header / <td>
    # and info that belongs in the table header in the initial row:
    initial_row = parse_tr(cast(Tag, trs[0]))
    result = []
    # Each logical table row is coded as two HTML table rows.
    for row_i in range(3, len(trs) - 2, 2):
        time_row = parse_tr(cast(Tag, trs[row_i]))
        # This row does not have the initial column.
        title_row = parse_tr(cast(Tag, trs[row_i + 1]))
        time = time_row[0]
        # Merge info from the initial row and the two data rows,
        # processing columnwise.
        for room, op, title in zip(initial_row[1:], time_row[1:], title_row):
            if op and title:
                # There is a lot of inconsistency between the table and the abstracts,
                # regarding op and title.
                # In cases where the abstracts info was considered better,
                # we replace that info here.
                if "Erich H. Franke" in op:
                    op = "Erich H. Franke, DK6II"
                    title = "Künstliche Intelligenz in der Elektronik-Entwicklung. Ernsthaftes Hilfsmittel oder Hype?"
                elif "Satelliten-Funk – quo vadis? 52 Jahre AMSAT DL E.V." in title:
                    title = "Satelliten-Funk – quo vadis? 52 Jahre AMSAT DL e.V."
                elif "Umweltsensordaten des Urban Weather Project im „Digitalen Zwilling“ der mrn" in title:
                    title = (
                        "Umweltsensordaten des Urban Weather Project im „Digitalen Zwilling“ "
                        "der Metropolregion Rhein-Neckar"
                    )
                elif "Hol mehr aus dem Si5351 heraus" in title:
                    op = "Pieter-Tjerk de Boer, PA3FWM"
                    title = "Hol mehr aus dem Si5351 heraus: höhere Frequenzauflösung, Messungen und Modulation"
                elif "Ein modularer Mehrkanal-VNA von 9 kHz bis (evt.) 26.5 GHz" in title:
                    op = "Paul Boven, PE1NUT"
                    title = "Ein modularer Mehrkanal-VNA von 9 kHz bis (evt.) 26.5 GHz: Erste Schritte"
                elif "DJ1NG" in op:
                    op = "Guido Liedtke, DJ1NG"
                    title = (
                        "Jedermannfunkgeräte für den Notfunk – welche u.U. auch als Amateurfunkgeräte interessant sind"
                    )
                elif "Paul Boven" in op:
                    op = "Wolfgang Herrmann, Paul Boven, PE1NUT"
                elif "Ein Streifzug durch die Geoinformatik für Funkamateure und Dxer" in title:
                    title = "Ein Streifzug durch die Geoinformatik für Funkamateure und DXer"
                result.append(PlanItem(time=time, room=room, op=op, title=title))
    return result


@dataclass
class Abstract:
    """Data class for items from the abstracts page."""

    op: str
    title: str
    abstract_lines: list[str]


def grab_abstracts(session: requests.Session, uri: str, cache_filename: str) -> list[Abstract]:
    """Grab abstracts from the internet and extract the individual abstracts."""

    def split_tag_in_lines(t: Tag) -> list[str]:
        """Helper that splits a tag's contents in individual lines."""
        result: list[str] = []
        for c in t.contents:
            if type(c) is Tag:
                for sub_line in split_tag_in_lines(c):
                    # Do this recursively.
                    # This hopes they didn't use <span> or <a> or similar inline stuff.
                    result.append(sub_line)
            elif type(c) is NavigableString:
                result.append(" ".join(c.strings))
            else:
                raise RuntimeError(f"Didn't expect type {type(c)} of {c}")
        return result

    abstracts_html = get_html_page(session, uri, cache_filename)
    # They put a two-digit number in front of the author that I want to remove:
    split_away_number = re.compile(r"\d\d\s+([^\s].+[^\s])\s*")
    result: list[Abstract] = []
    for h4_raw in abstracts_html.find_all("h4"):
        h4: Tag = cast(Tag, h4_raw)
        abstract_lines: list[str] = []
        num_and_op, _br, title = h4.contents
        if type(num_and_op) is NavigableString:
            num_and_op_s = num_and_op.string
        elif type(num_and_op) is Tag:
            num_and_op_maybe_s = num_and_op.string
            if num_and_op_maybe_s is None:
                raise RuntimeError(f"num_and_op not found in {h4.prettify()}")
            else:
                num_and_op_s = num_and_op_maybe_s
        else:
            raise SystemError(f"Unexpected type {type(num_and_op)} of {num_and_op}")
        if num_and_op_mo := split_away_number.fullmatch(num_and_op_s):
            op = num_and_op_mo.group(1)
        else:
            raise SystemError(f"Could not parse {num_and_op_s}")
        # Now harvest the lines of the abstract.
        # This is simply all the stuff that follows, until the next h4
        # or the end; all abstracts are in a <div>.
        sib = h4.next_sibling
        while True:
            while sib is not None and type(sib) is not Tag:
                if type(sib) is NavigableString:
                    if "" == str(sib).strip():
                        pass
                    else:
                        abstract_lines.append(str(sib))
                sib = sib.next_sibling
            if sib is None or "h4" == sib.name:
                break
            else:
                # <p> or something
                for line in split_tag_in_lines(sib):
                    abstract_lines.append(line)
                sib = sib.next_sibling
        title_s = cast(Tag, title).string
        if title_s is None:
            raise RuntimeError(f"No title for abstract of {op}")
        else:
            result.append(Abstract(op=op, title=title_s, abstract_lines=abstract_lines))
    # We have a few things in the plan that don't have their own abstract:
    NO_COMMENT = ["(Keine weitere Beschreibung)"]
    result.append(
        Abstract(
            op="Charly Eichhorn, DK3ZL",
            title="Live-QSO mit der Neumayer III Südpolstation über QO-100",
            abstract_lines=NO_COMMENT,
        )
    )
    result.append(Abstract(op="Michael Dörr", title="Workshop NeoPixel", abstract_lines=["Siehe Vortrag 12:30-13:15"]))
    result.append(
        Abstract(
            op="Alex Knochel DK3HD",
            title="Vorbereitung eines Stratosphärenballons mit SSTV auf der Wiese der DBS",
            abstract_lines=NO_COMMENT,
        )
    )
    result.append(
        Abstract(
            op="Alex Knochel DK3HD",
            title="Start eines Stratosphärenballons mit SSTV auf der Wiese der DBS",
            abstract_lines=NO_COMMENT,
        )
    )
    return result


def main() -> None:
    # Create the cache dir on first run:
    os.makedirs(DEFAULT_CACHE_DIR, exist_ok=True)
    with requests.Session() as session:
        plan_items = grab_plan(session, "https://ukw-tagung.org/vortragsplan-ukw-tagung-2025/", "plan.html")
        abstracts = grab_abstracts(
            session, "https://ukw-tagung.org/abstracts-der-vortraege-der-70-ukw-tagung-2025/", "abstracts.html"
        )
    # Provide the abstracts in a dict with key a concatenation of op and title, with a " " intervening:
    op_title2abstract_lines: dict[str, list[str]] = {}
    for abstract in abstracts:
        # Fix inconsistencies.
        # Where we think the plan has the better representation of the op and/or the title,
        # we use that.
        if abstract.op == "Rüdiger Lang, Bernd Sierk":
            op_title = (
                "Bernd Sierk, EUMETSAT Die Erde vom Weltraum aus gesehen – "
                "was man von Satelliten alles messen kann (und wie)"
            )
        elif "Dopplerpeiler-Konzepts" in abstract.title:
            op_title = (
                "Michael Kugel, DC1PAA Realisierung des Relais / QRG-Monitors als "
                "ein Basis – Modul des Dopplerpeiler-Konzepts"
            )
        elif "Einstieg in CircuitPython mit dem Raspberry-Pi-Pico und NeoPixel-Matrizen" == abstract.title:
            op_title = "Michael Dörr Einstieg in CircuitPython mit dem Rasperry-Pico und NeoPixel-Matrizen"
        elif "DK5LV" in abstract.op:
            op_title = "Henning-Christof Weddig, DK5LV Mein erstes Funkgerät für das 2m Band"
        elif "WSPR – wie Amateurfunkverfahren Luft- und Raumfahrt in entlegenen Gebieten hilft" in abstract.title:
            op_title = (
                "Robert Westphal, DJ4FF WSPR – wie Amateurfunkverfahren Schiff- und Luftfahrt "
                "in entlegenen Gebieten hilft"
            )
        else:
            op_title = f"{abstract.op} {abstract.title}"
        op_title2abstract_lines[op_title] = abstract.abstract_lines
    # Now write the result as a Markdown file for my blog, to my blog:
    with open("../content/posts/2025/ukw_tagung_toc.de.md", "w") as toc:
        # Head matter.
        toc.write(
            """title: Programm UKW-Tagung 2025
slug: ukw_tagung_toc
date: 2025-09-03 01:32:00 UTC+02:00
modified: 2025-09-03 13:23:00 UTC+02:00
type: text
special_copyright: <p>Die Rechte an den Abstracts gehören den jeweiligen Autoren.</p>

## Was ist das?

Ich wollte das Programm der diesjährigen [UKW-Tagung](https://ukw-tagung.org/) anders sortiert haben.

Weil: Wenn ein Vortrag zu Ende ist und ich überlege, wo ich als nächstes hingehe,
möchte ich aus den Vorträgen, die als nächstes anfangen, einen aussuchen.
Dazu will ich die hintereinander weg sehen können
*einschließlich der Vortragszusammenfassungen.*

Das leistet weder der [Vortragsplan](https://ukw-tagung.org/vortragsplan-ukw-tagung-2025/)
(Abstracts fehlen) noch die
[Seite der Abstracts](https://ukw-tagung.org/abstracts-der-vortraege-der-70-ukw-tagung-2025/)
(liefert keine Raum- und Zeitinformation).

Also habe ich ein [Pythonskript gebaut](/de/posts/2025/ukw_tagung_toc_software/),
das diese beiden Seiten der UKW-Tagung einliest, parst,
17 nickelige Inkonsistenzen bereinigt
und die resultierenden Daten sortiert hier wieder ausgibt.
"""
        )
        last_time = None
        for plan_item in plan_items:
            if plan_item.time != last_time:
                if last_time is not None:
                    toc.write("------\n")
                toc.write(f"\n\n## {plan_item.time} Uhr\n\n")
                last_time = plan_item.time
            toc.write(f"\n### {plan_item.title}\n\n**{plan_item.room}**\n\n")
            toc.write(f"{plan_item.op}, {plan_item.time} Uhr\n\n")
            # Retrieve and output the abstract:
            op_title = f"{plan_item.op} {plan_item.title}"
            if op_title in op_title2abstract_lines:
                for line in op_title2abstract_lines[op_title]:
                    toc.write(f"{line}\n\n")
            else:
                raise RuntimeError(f'"{op_title}" not found in abstracts')
        toc.write(
            "_Wer diesen Blogbeitrag kommentieren will und\n"
            "einen Fediverse-Zugang hat, kann\n"
            "[https://mastodon.radio/@dj3ei/115140232578322877]"
            "(https://mastodon.radio/@dj3ei/115140232578322877) kommentieren._\n\n"
        )


if __name__ == "__main__":
    main()
```
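One known limitation, admitted in the docstring of get_html_page: the cache never invalidates itself. For a one-shot project like this, deleting the cache directory is good enough; were I to reuse the script, a freshness check might look like the following sketch (the helper name and the 24-hour default are my invention, not part of the script above):

```python
import time
from os.path import getmtime, isfile


def cache_is_fresh(cache_fq_filename: str, max_age_s: float = 24 * 3600.0) -> bool:
    """True if the cache file exists and is younger than max_age_s seconds."""
    return isfile(cache_fq_filename) and (time.time() - getmtime(cache_fq_filename)) < max_age_s
```

get_html_page would then test cache_is_fresh(cache_fq_filename) instead of isfile(cache_fq_filename).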
In deviation from what holds for the rest of the text on this page, I place this code in the public domain: anyone can use it for anything they want. Needless to say, there is no warranty.
If you want to comment on this and have a Fediverse account, you can leave your comment as an answer to https://mastodon.radio/@dj3ei/115140305869717657.