Tag: Data Science

  • APEX Updates, 1: Building a Dataset

    Every big project starts with a deceptively small question. For me, it was: how do you turn a carved letter into data?

    APEX (Alphabetic Paleography Explorer) is my attempt to map how the Greek alphabet developed and spread—first across Greek-speaking regions, then into other scripts entirely. But before I can compare, model, or visualize anything, I need something more fundamental: a dataset that doesn’t just record letters, but understands them. That’s where things get tricky.

    Step 0: Drawing the Inscriptions

    Most corpora don’t offer clean, high-res images. They give us facsimiles—drawn reconstructions, often made by epigraphers decades ago. I tried using automated skeletonization on those, but the results were messy and inconsistent. So I went manual: scanning documents and tracing letters by hand on my iPad.

    It’s slow. But it gives me clean, consistent vector forms that reflect how letters were actually drawn—and forces me to look closely at every curve, stroke, and variation. In a sense, this is my own kind of excavation.

    What I Track

    Each inscription gets logged with basic info: where it was found, what it was written on, when it was made (as best we can tell), and how damaged it is. But the real heart of the project is the letters.

    For each character, I record:

    • Visual traits (curvature, symmetry, stroke count, proportions)
    • Layout (spacing, alignment, writing direction)
    • Function (sound value, graphemic identity)
    • Notes on ambiguity or damage

    From this, I can start comparing how different regions handled the same letter (did their rho have a loop? was their epsilon closed?) and asking whether that tells us something about cultural contact or local invention.
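    As a rough illustration of the per-letter record and the regional comparison described above, here is a minimal Python sketch. The class name, field names, trait keys, and the placeholder regions are all assumptions made for the example, not the actual APEX schema, and the sample records are invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LetterRecord:
    """One traced character (hypothetical field names)."""
    grapheme: str        # graphemic identity, e.g. "rho"
    sound_value: str     # function, e.g. "r"
    region: str          # find region of the parent inscription
    stroke_count: int    # one of the visual traits
    traits: dict = field(default_factory=dict)  # e.g. {"has_loop": True}
    direction: str = "ltr"                      # writing direction
    notes: str = ""                             # ambiguity or damage


def trait_share_by_region(records, grapheme, trait):
    """Per region, the fraction of a grapheme's records where a boolean trait is true."""
    counts = defaultdict(lambda: [0, 0])  # region -> [trait true, total seen]
    for rec in records:
        if rec.grapheme != grapheme or trait not in rec.traits:
            continue
        counts[rec.region][1] += 1
        if rec.traits[trait]:
            counts[rec.region][0] += 1
    return {region: yes / total for region, (yes, total) in counts.items()}


# Invented sample data, for illustration only
records = [
    LetterRecord("rho", "r", "Region A", 2, {"has_loop": True}),
    LetterRecord("rho", "r", "Region A", 2, {"has_loop": False}),
    LetterRecord("rho", "r", "Region B", 3, {"has_loop": True}),
    LetterRecord("epsilon", "e", "Region B", 4, {"closed": False}),
]
shares = trait_share_by_region(records, "rho", "has_loop")
print(shares)
```

    Keeping the traits in a free-form dictionary is one possible design choice: it lets each letter carry only the traits that make sense for it (a loop for rho, closure for epsilon) without a fixed column for every trait.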

    The Workflow

    The data entry pipeline looks like this:

    1. Scan + trace the letterform
    2. Enter the inscription’s metadata
    3. Manually mark letter positions and reading direction
    4. Extract geometric features automatically
    5. Save everything as structured, nestable JSON
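    Steps 4 and 5 might look something like the sketch below. It assumes each traced letter is available as a list of strokes, each stroke a list of (x, y) points; the feature names, the record layout, and the example ID are hypothetical, and the real pipeline may compute different features.

```python
import json
import math


def extract_features(strokes):
    """Compute simple geometric features from traced strokes.

    strokes: list of polylines, each a list of (x, y) points.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # Total drawn length: sum of segment lengths within each stroke
    path_length = sum(
        math.dist(p, q) for stroke in strokes for p, q in zip(stroke, stroke[1:])
    )
    return {
        "stroke_count": len(strokes),
        "width": width,
        "height": height,
        "aspect_ratio": width / height if height else None,
        "path_length": path_length,
    }


# A made-up lambda-like form: two straight strokes meeting at the top
strokes = [[(0, 0), (3, 4)], [(3, 4), (6, 0)]]
features = extract_features(strokes)

# Nest the letter inside its inscription, then serialize
record = {
    "inscription": {"id": "example-001", "direction": "rtl"},
    "letters": [{"position": 0, "grapheme": "lambda", "features": features}],
}
text = json.dumps(record, indent=2, ensure_ascii=False)
print(text)
```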

    It’s part computer vision, part field notes, and part quiet staring at a very old alpha until you start to feel like it’s looking back.

    Why This Level of Detail?

    Because I want to ask big questions (how alphabets travel, which forms are innovations and which are imitations), but I don’t want to ask them fuzzily. Too much work on writing systems either leans purely qualitative or strips out the messiness for the sake of clean data. APEX is an attempt to hold both: interpretive richness and formal structure.

    This dataset—AlphaBase, soon to be expanded to other open-access museum collections and public domain corpora—is the scaffolding. It’s how I’ll test transmission models later on. But even on its own, it’s already revealing things—like which letterforms stay stable across centuries, and which are quick to splinter under pressure.

    APEX begins here: not with theory, but with tracing. With building a system that doesn’t just store letterforms, but actually listens to what they’re doing. That’s what this first trench is for. Now I get to start digging.

  • Monthly Reads, 1: March 2025

    There’s no single unifying theme to this list, but there is a feeling. I’m reaching, at once, toward the origins of writing and the frontiers of language technology. It’s structure that’s defining me at the moment: how systems encode meaning, whether that’s Greek orthography or neural networks. And in between, I let myself breathe with fiction: stories that themselves play with form, time, and voice.

    Recently Finished:

    • Epigraphic Evidence
      A technical addition to my current work on inscriptions. Like black coffee: not always easy to imbibe, but quite efficient.
    • Data Science (MIT Press Essential Knowledge series)
      A clean introduction—refreshing for thinking about data both ancient and modern.
    • Ripley’s Game (Patricia Highsmith)
      Cold, elegant, amoral. Hilarious at points. A good palate cleanser between denser texts.
    • The Sequel (Jean Hanff Korelitz)
      Read this mostly for plot, not language—but I love thinking about narrative structures and the great Second Novel Problem.
    • The English Understand Wool (Helen DeWitt)
      Sharp, strange, and delightful. I love a novel about an out-of-touch eccentric navigating the world.

    Currently Reading:

    • Kairos (Jenny Erpenbeck)
      A novel about political and personal time, and a very complicated affair. Thorny for sure.
    • Word by Word: The Secret Life of Dictionaries (Kory Stamper)
      The theme of choices made at every stage of lexicography resonates deeply with me as I encode my own systemic information. Chapters like “Bitch” and “Posh” capture this especially well.
    • Writing and the Origins of Greek Literature (Barry B. Powell)
      I keep coming back to this one in small sips. Chapters go down easy.
    • Greek Writing from Knossos to Homer (Roger D. Woodard)
      Foundational for understanding the transmission of the Greek alphabet. Very well written and thoroughly researched.
    • Algorithms (MIT Press Essential Knowledge series)
      A manageable way to reframe my thinking on rules and structure—not unlike real-life syntactical derivations.
    • Machine Learning (MIT Press Essential Knowledge series)
      Challenging. Still finding where I fit in here. Hoping I can apply it to my APEX project by Stage 3.
    • JSON for Beginners
      Very practical for APEX, which is structured with JSON and makes heavy use of standoff annotation. This allows me to encode uncertainty and multiple readings, reducing the number of assumptions baked into the dataset.
    • Co-Intelligence: Living and Working with AI (Ethan Mollick)
      For someone working on ancient inscriptional data, the future of coworking with AI is too relevant to ignore.
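    To make the standoff annotation mentioned under JSON for Beginners a little more concrete, here is a minimal Python sketch. The key names, certainty labels, and sample text are all invented for illustration and are not the APEX format; the point is only that the base transcription stays untouched while competing readings sit alongside it as separate annotations.

```python
import json

# Hypothetical standoff layout: the base text is never edited; every
# interpretation lives in a separate annotation that points at it by offset.
doc = {
    "text": "ΑΒ?Δ",  # "?" stands in for a damaged sign
    "annotations": [
        {"start": 2, "end": 3, "type": "reading", "value": "Γ", "certainty": "probable"},
        {"start": 2, "end": 3, "type": "reading", "value": "Λ", "certainty": "possible"},
        {"start": 2, "end": 3, "type": "damage", "value": "surface abraded"},
    ],
}


def candidate_texts(doc):
    """Return one full transcription per competing reading annotation."""
    text = doc["text"]
    return [
        text[: a["start"]] + a["value"] + text[a["end"] :]
        for a in doc["annotations"]
        if a["type"] == "reading"
    ]


candidates = candidate_texts(doc)
serialized = json.dumps(doc, ensure_ascii=False, indent=2)
print(candidates)
```

    Because neither reading overwrites the base text, both stay available to later analysis, and the damage note can be recorded without committing to either.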

    I wouldn’t call this a reading list so much as a reading state, a snapshot of what it feels like to be in the thick of things: academic work, blog writing, thesis planning, and whatever this slow journey toward modern spoken French is shaping up to be.

    Picture: I’ve been stacking my recent reads as a kind of personal monument—hoping to match my own height before summer.