An HTTP / HTML spider
🇩🇪 This post is also available in German.
I occasionally grab stuff from the internet and re-organize the info contained in that stuff. One example: I didn’t like the sort order of the program of a German amateur radio convention, so I produced my own.
My problem was: One page had the timetable, the other the abstracts. I unified that info into one page.
Here’s how I did it.
I wrote a script that does the following things:
- Grab the pertinent pages (two in this case) from the internet and cache them. I expect to run half-baked versions of my script often until it’s done, and I don’t want to put load on the server each time.
- Parse the HTML and extract the two types of information I’m interested in: time information and abstracts.
- I didn’t plan to do this, but it unfortunately proved necessary: 17 out of the 36 items had conflicts between timetable and abstracts that needed to be resolved, because the author or the title didn’t quite agree. This started with humble typos and ended with “workshops” that didn’t have an abstract.
- Then join the two sources of info and write the result as a Markdown file, straight into my blog sources. (A sketch of the join idea follows this list.)
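The heart of that join is unspectacular: a dict keyed by the concatenation of author and title. Here is a minimal sketch of the idea, with toy data in place of the real pages (the names and room are made up):

```python
# Minimal sketch of the join idea, with toy data in place of the real pages.
plan = [("10:00", "Raum A", "Alice Example, DL0XYZ", "Some talk")]
abstracts = {
    f"{op} {title}": lines
    for op, title, lines in [("Alice Example, DL0XYZ", "Some talk", ["An abstract."])]
}

for time, room, op, title in plan:
    key = f"{op} {title}"  # join key: author and title, space-separated
    for line in abstracts[key]:  # a KeyError here means an unresolved inconsistency
        print(time, room, op, title, line)
```

The real script below raises a RuntimeError with a helpful message instead of letting a bare KeyError escape.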
I do similar stuff occasionally, so I document it here comparatively cleanly; I plan to use this as raw material when I do it again.
Software used
I use Python, currently version 3.11.2, and installed the following software into a venv via pip:

```
requests beautifulsoup4
```
Including their dependencies, as of today, this leads to this installation:

```
beautifulsoup4==4.13.5
certifi==2025.8.3
charset-normalizer==3.4.3
idna==3.10
requests==2.32.5
soupsieve==2.8
typing_extensions==4.15.0
urllib3==2.5.0
```
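For completeness, one way to set up such a venv (assuming a POSIX shell; only the two package names come from this post):

```
python3 -m venv venv
. venv/bin/activate
pip install requests beautifulsoup4
```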
I need to have the BeautifulSoup documentation handy. (If you know the DOM: this is DOM software for Python.)
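To illustrate that DOM flavor, a minimal sketch (the toy HTML is mine, not from the convention pages):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><h4>42 Alice<br/>Title</h4><p>Text</p></body></html>", "html.parser")
for h4 in soup.find_all("h4"):  # all <h4> elements, much like getElementsByTagName
    print(h4.get_text(" "))  # markup stripped: "42 Alice Title"
print(soup.body.p.string)  # navigation by attribute access: "Text"
```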
Polishing
Before publication, I polished the script a bit, something that I more or less habitually do when coding Python.
For that, I used this software:
```
pip install isort flake8 black types-requests mypy
```
And I saw to it that the following commands all come through without errors:
```
isort sort_ukwtagung_programm
flake8 --max-line-length=120 --color=never sort_ukwtagung_programm
black -l 120 sort_ukwtagung_programm
mypy --strict sort_ukwtagung_programm
```
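Since these four commands are always run together, a tiny wrapper script can enforce the “all without errors” requirement in one go; a sketch (the wrapper is my addition, not part of the original workflow):

```
#!/bin/sh
set -e  # stop at the first check that fails
isort sort_ukwtagung_programm
flake8 --max-line-length=120 --color=never sort_ukwtagung_programm
black -l 120 sort_ukwtagung_programm
mypy --strict sort_ukwtagung_programm
```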
The program itself
This is the code of my program sort_ukwtagung_programm:
```python
#!/usr/bin/env python
# This program is in the public domain,
# so usable by anyone for any purpose without restrictions,
# see https://creativecommons.org/publicdomain/zero/1.0/ .

import os
import re
from dataclasses import dataclass
from os.path import isfile
from typing import cast

import requests
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

DEFAULT_CACHE_DIR = "cache"


def get_html_page(
    session: requests.Session, uri: str, cache_filename: str, cache_dir: str = DEFAULT_CACHE_DIR
) -> BeautifulSoup:
    """Download an HTML page from the internet and extract the DOM tree.

    This caches on first access and never bothers to invalidate the cache,
    that is, assumes the cache is always fresh."""
    cache_fq_filename = f"{cache_dir}/{cache_filename}"
    if isfile(cache_fq_filename):
        # We downloaded the file earlier, so don't bother the HTTP server again:
        pass
    else:
        page_response = session.get(uri)
        page_response.raise_for_status()
        content_type = page_response.headers["content-type"].lower()
        if "text/html" in content_type:
            with open(cache_fq_filename, "w") as cache_file:
                cache_file.write(page_response.text)
        else:
            raise RuntimeError(f"Expected text/html, found {content_type} for {uri}")
    with open(cache_fq_filename, "r") as cache_file:
        return BeautifulSoup(cache_file, "html.parser")


@dataclass
class PlanItem:
    """Data class for items from the plan (timetable)."""

    time: str
    room: str
    op: str
    title: str


def grab_plan(session: requests.Session, uri: str, cache_filename: str) -> list[PlanItem]:
    """Grab plan from the internet and extract the plan."""
    plan_html = get_html_page(session, uri, cache_filename)
    table = plan_html.table
    if table is None:
        raise RuntimeError(f"No table in {uri} content")
    tbody = table.tbody
    if tbody is None:
        raise RuntimeError(f"No table.tbody in {uri} content")
    trs = tbody.find_all("tr")

    def parse_tr(tr: Tag) -> list[str]:
        """Split a <tr> into <td>s and grab the string content of each <td>."""
        tds = tr.find_all("td")
        # Consistency check, true for the table we want to parse:
        if len(tds) not in [5, 6]:
            raise RuntimeError(f"Expecting 5 or 6 columns in row {tr.prettify()}")
        result: list[str] = []
        for td in tds:
            # Remove all HTML markup and map every <td> content to a single-line string.
            result.append(" ".join(cast(Tag, td).strings).replace("\n", "").strip())
        return result

    # The table puts info that belongs on top of the table in the table header / <td>
    # and info that belongs in the table header in the initial row:
    initial_row = parse_tr(cast(Tag, trs[0]))
    result = []
    # Each logical table row is coded as two HTML table rows.
    for row_i in range(3, len(trs) - 2, 2):
        time_row = parse_tr(cast(Tag, trs[row_i]))
        # This row does not have the initial column.
        title_row = parse_tr(cast(Tag, trs[row_i + 1]))
        time = time_row[0]
        # Merge info from the initial row and the two data rows,
        # processing columnwise.
        for room, op, title in zip(initial_row[1:], time_row[1:], title_row):
            if op and title:
                # There is a lot of inconsistency between the table and the abstracts,
                # regarding op and title.
                # In cases where the abstracts info was considered better,
                # we replace that info here.
                if "Erich H. Franke" in op:
                    op = "Erich H. Franke, DK6II"
                    title = "Künstliche Intelligenz in der Elektronik-Entwicklung. Ernsthaftes Hilfsmittel oder Hype?"
                elif "Satelliten-Funk – quo vadis? 52 Jahre AMSAT DL E.V." in title:
                    title = "Satelliten-Funk – quo vadis? 52 Jahre AMSAT DL e.V."
                elif "Umweltsensordaten des Urban Weather Project im „Digitalen Zwilling“ der mrn" in title:
                    title = (
                        "Umweltsensordaten des Urban Weather Project im „Digitalen Zwilling“ "
                        "der Metropolregion Rhein-Neckar"
                    )
                elif "Hol mehr aus dem Si5351 heraus" in title:
                    op = "Pieter-Tjerk de Boer, PA3FWM"
                    title = "Hol mehr aus dem Si5351 heraus: höhere Frequenzauflösung, Messungen und Modulation"
                elif "Ein modularer Mehrkanal-VNA von 9 kHz bis (evt.) 26.5 GHz" in title:
                    op = "Paul Boven, PE1NUT"
                    title = "Ein modularer Mehrkanal-VNA von 9 kHz bis (evt.) 26.5 GHz: Erste Schritte"
                elif "DJ1NG" in op:
                    op = "Guido Liedtke, DJ1NG"
                    title = (
                        "Jedermannfunkgeräte für den Notfunk – welche u.U. auch als Amateurfunkgeräte interessant sind"
                    )
                elif "Paul Boven" in op:
                    op = "Wolfgang Herrmann, Paul Boven, PE1NUT"
                elif "Ein Streifzug durch die Geoinformatik für Funkamateure und Dxer" in title:
                    title = "Ein Streifzug durch die Geoinformatik für Funkamateure und DXer"
                result.append(PlanItem(time=time, room=room, op=op, title=title))
    return result


@dataclass
class Abstract:
    """Data class for items from the abstracts page."""

    op: str
    title: str
    abstract_lines: list[str]


def grab_abstracts(session: requests.Session, uri: str, cache_filename: str) -> list[Abstract]:
    """Grab abstracts from the internet and extract the individual abstracts."""

    def split_tag_in_lines(t: Tag) -> list[str]:
        """Helper that splits a tag's contents in individual lines."""
        result: list[str] = []
        for c in t.contents:
            if type(c) is Tag:
                for sub_line in split_tag_in_lines(c):
                    # Do this recursively.
                    # This hopes they didn't use <span> or <a> or similar inline stuff.
                    result.append(sub_line)
            elif type(c) is NavigableString:
                result.append(" ".join(c.strings))
            else:
                raise RuntimeError(f"Didn't expect type {type(c)} of {c}")
        return result

    abstracts_html = get_html_page(session, uri, cache_filename)
    # They put a two-digit number in front of the author that I want to remove:
    split_away_number = re.compile(r"\d\d\s+([^\s].+[^\s])\s*")
    result: list[Abstract] = []
    for h4_raw in abstracts_html.find_all("h4"):
        h4: Tag = cast(Tag, h4_raw)
        abstract_lines: list[str] = []
        num_and_op, _br, title = h4.contents
        if type(num_and_op) is NavigableString:
            num_and_op_s = num_and_op.string
        elif type(num_and_op) is Tag:
            num_and_op_maybe_s = num_and_op.string
            if num_and_op_maybe_s is None:
                raise RuntimeError(f"num_and_op not found in {h4.prettify()}")
            else:
                num_and_op_s = num_and_op_maybe_s
        else:
            raise SystemError(f"Unexpected type {type(num_and_op)} of {num_and_op}")
        if num_and_op_mo := split_away_number.fullmatch(num_and_op_s):
            op = num_and_op_mo.group(1)
        else:
            raise SystemError(f"Could not parse {num_and_op_s}")
        # Now harvest the lines of the abstract.
        # This is simply all the stuff that follows, until the next h4
        # or the end; all abstracts are in a <div>.
        sib = h4.next_sibling
        while True:
            while sib is not None and type(sib) is not Tag:
                if type(sib) is NavigableString:
                    if "" == str(sib).strip():
                        pass
                    else:
                        abstract_lines.append(str(sib))
                sib = sib.next_sibling
            if sib is None or "h4" == sib.name:
                break
            else:
                # <p> or something
                for line in split_tag_in_lines(sib):
                    abstract_lines.append(line)
                sib = sib.next_sibling
        title_s = cast(Tag, title).string
        if title_s is None:
            raise RuntimeError(f"No title for abstract of {op}")
        else:
            result.append(Abstract(op=op, title=title_s, abstract_lines=abstract_lines))
    # We have a few things in the plan that don't have their own abstract:
    NO_COMMENT = ["(Keine weitere Beschreibung)"]
    result.append(
        Abstract(
            op="Charly Eichhorn, DK3ZL",
            title="Live-QSO mit der Neumayer III Südpolstation über QO-100",
            abstract_lines=NO_COMMENT,
        )
    )
    result.append(Abstract(op="Michael Dörr", title="Workshop NeoPixel", abstract_lines=["Siehe Vortrag 12:30-13:15"]))
    result.append(
        Abstract(
            op="Alex Knochel DK3HD",
            title="Vorbereitung eines Stratosphärenballons mit SSTV auf der Wiese der DBS",
            abstract_lines=NO_COMMENT,
        )
    )
    result.append(
        Abstract(
            op="Alex Knochel DK3HD",
            title="Start eines Stratosphärenballons mit SSTV auf der Wiese der DBS",
            abstract_lines=NO_COMMENT,
        )
    )
    return result


def main() -> None:
    # Create the cache dir on first run:
    os.makedirs(DEFAULT_CACHE_DIR, exist_ok=True)
    with requests.Session() as session:
        plan_items = grab_plan(session, "https://ukw-tagung.org/vortragsplan-ukw-tagung-2025/", "plan.html")
        abstracts = grab_abstracts(
            session, "https://ukw-tagung.org/abstracts-der-vortraege-der-70-ukw-tagung-2025/", "abstracts.html"
        )
    # Provide the abstracts in a dict with key a concatenation of op and title, with a " " intervening:
    op_title2abstract_lines: dict[str, list[str]] = {}
    for abstract in abstracts:
        # Fix inconsistencies.
        # Where we think the plan has the better representation of the op and/or the title,
        # we use that.
        if abstract.op == "Rüdiger Lang, Bernd Sierk":
            op_title = (
                "Bernd Sierk, EUMETSAT Die Erde vom Weltraum aus gesehen – "
                "was man von Satelliten alles messen kann (und wie)"
            )
        elif "Dopplerpeiler-Konzepts" in abstract.title:
            op_title = (
                "Michael Kugel, DC1PAA Realisierung des Relais / QRG-Monitors als "
                "ein Basis – Modul des Dopplerpeiler-Konzepts"
            )
        elif "Einstieg in CircuitPython mit dem Raspberry-Pi-Pico und NeoPixel-Matrizen" == abstract.title:
            op_title = "Michael Dörr Einstieg in CircuitPython mit dem Rasperry-Pico und NeoPixel-Matrizen"
        elif "DK5LV" in abstract.op:
            op_title = "Henning-Christof Weddig, DK5LV Mein erstes Funkgerät für das 2m Band"
        elif "WSPR – wie Amateurfunkverfahren Luft- und Raumfahrt in entlegenen Gebieten hilft" in abstract.title:
            op_title = (
                "Robert Westphal, DJ4FF WSPR – wie Amateurfunkverfahren Schiff- und Luftfahrt "
                "in entlegenen Gebieten hilft"
            )
        else:
            op_title = f"{abstract.op} {abstract.title}"
        op_title2abstract_lines[op_title] = abstract.abstract_lines
    # Now write the result as a Markdown file for my blog, to my blog:
    with open("../content/posts/2025/ukw_tagung_toc.de.md", "w") as toc:
        # Head matter.
        toc.write(
            """title: Programm UKW-Tagung 2025
slug: ukw_tagung_toc
date: 2025-09-03 01:32:00 UTC+02:00
modified: 2025-09-03 13:23:00 UTC+02:00
type: text
special_copyright: <p>Die Rechte an den Abstracts gehören den jeweiligen Autoren.</p>

## Was ist das?

Ich wollte das Programm der diesjährigen [UKW-Tagung](https://ukw-tagung.org/) anders sortiert haben.

Weil: Wenn ein Vortrag zu Ende ist und ich überlege, wo ich als nächstes hingehe,
möchte ich aus den Vorträgen, die als nächstes anfangen, einen aussuchen.
Dazu will ich die hintereinander weg sehen können
*einschließlich der Vortragszusammenfassungen.*

Das leistet weder der [Vortragsplan](https://ukw-tagung.org/vortragsplan-ukw-tagung-2025/)
(Abstracts fehlen) noch die
[Seite der Abstracts](https://ukw-tagung.org/abstracts-der-vortraege-der-70-ukw-tagung-2025/)
(liefert keine Raum- und Zeitinformation).

Also habe ich ein [Pythonskript gebaut](/de/posts/2025/ukw_tagung_toc_software/),
das diese beiden Seiten der UKW-Tagung einliest, parst,
17 nickelige Inkonsistenzen bereinigt
und die resultierenden Daten sortiert hier wieder ausgibt.
"""
        )
        last_time = None
        for plan_item in plan_items:
            if plan_item.time != last_time:
                if last_time is not None:
                    toc.write("------\n")
                toc.write(f"\n\n## {plan_item.time} Uhr\n\n")
                last_time = plan_item.time
            toc.write(f"\n### {plan_item.title}\n\n**{plan_item.room}**\n\n")
            toc.write(f"{plan_item.op}, {plan_item.time} Uhr\n\n")
            # Retrieve and output the abstract:
            op_title = f"{plan_item.op} {plan_item.title}"
            if op_title in op_title2abstract_lines:
                for line in op_title2abstract_lines[op_title]:
                    toc.write(f"{line}\n\n")
            else:
                raise RuntimeError(f'"{op_title}" not found in abstracts')
        toc.write(
            "_Wer diesen Blogbeitrag kommentieren will und\n"
            "einen Fediverse-Zugang hat, kann\n"
            "[https://mastodon.radio/@dj3ei/115140232578322877]"
            "(https://mastodon.radio/@dj3ei/115140232578322877) kommentieren._\n\n"
        )


if __name__ == "__main__":
    main()
```
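One known limitation, admitted in the docstring of get_html_page: the cache never invalidates itself. For a one-shot project like this, deleting the cache directory is good enough; were I to reuse the script, a freshness check might look like the following sketch (the helper name and the 24-hour default are my invention, not part of the script above):

```python
import time
from os.path import getmtime, isfile


def cache_is_fresh(cache_fq_filename: str, max_age_s: float = 24 * 3600.0) -> bool:
    """True if the cache file exists and is younger than max_age_s seconds."""
    return isfile(cache_fq_filename) and (time.time() - getmtime(cache_fq_filename)) < max_age_s
```

get_html_page would then test cache_is_fresh(cache_fq_filename) instead of isfile(cache_fq_filename).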
In deviation from what holds for the rest of the text on this page, I place this code in the public domain: anyone can use it for anything they want. Needless to say, there is no warranty.
If you want to comment on this and have a Fediverse account, you can leave your comment as an answer to https://mastodon.radio/@dj3ei/115140305869717657.