Text mining and visualizations

Friday, May 23rd, 2008 | rob nelson

I’d like to second Laura’s suggestion about a session on textual visualizations and document archives, and to add to it the issue of text mining. I’ll soon be beginning a text mining project using a couple of collections of nineteenth-century American documents: the Valley of the Shadow Archive and the complete run of the Daily Dispatch (a Richmond, Virginia, paper) during the Civil War. I emphasize the word beginning. I’m really suggesting this panel as a humble supplicant. I know that Bill’s done some work on text mining (and I’m already indebted to him for his very thoughtful blog posts on the subject), and that Dan, Jeremy, and Sean Takats are beginning a major text mining project at CHNM. I was hoping that those of you who have been thinking about text mining and have some expertise in it would be interested in such a session. A particular issue I’d be interested in brainstorming about is how to use text mining for analysis, i.e. producing visualizations drawn from thousands, maybe tens or even hundreds of thousands, of documents that shed new light on historical questions.

I’ve been doing a bit of thinking about using text mining for analyzing the Valley Archive. Next to large online archives like Google Books and the American Periodical Series, the Valley Archive is comparatively modest in size. But as a curated collection it does offer something that I think is likely to be useful: a number of pre-existing axes that offer what I expect will be some analytically interesting opportunities to contrast different caches of documents. One obvious point of contrast is northern vs. southern documents. Another axis would involve the nature of the document, from public documents (any published source), to very private documents (e.g. diaries), to semi-private (e.g. letters). Being able to throw up visualizations in six quadrants (well, that’s not the right word, but I can tell you that “sextrants” definitely isn’t right) and compare language used in northern vs. southern public writings, southern public vs. southern semi-private vs. southern very private documents, etc., might immediately offer what I’m hoping will be interesting and useful interpretative possibilities. What are the differences between the way northerners and southerners write about “nation”? How does that change (if it does) over the course of the war? Do we see convergence or divergence between the sections/nations? Is there any sectional difference in the language involving or surrounding death? How does that change over time? And what are the differences in the language around “death” in public vs. private documents?
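
To make the idea of contrasting caches of documents a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the four sample texts are invented (they are not drawn from the Valley Archive or the Daily Dispatch), the `section` and `doc_type` labels are hypothetical stand-ins for the archive's own categories, and a simple per-1,000-token rate is just one of many possible measures.

```python
import re
from collections import Counter

# Invented toy records: (section, document type, text).
# These are NOT real archive documents; they only illustrate the grouping.
DOCS = [
    ("north", "public", "The nation must be preserved, and the nation will endure."),
    ("north", "private", "I fear death less than I fear the loss of our nation."),
    ("south", "public", "Our new nation asks only to be left alone."),
    ("south", "private", "Death came to the camp again last night."),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_rate_by_group(docs, term):
    """Return occurrences of `term` per 1,000 tokens for each (section, doc_type)."""
    hits = Counter()
    totals = Counter()
    for section, doc_type, text in docs:
        tokens = tokenize(text)
        key = (section, doc_type)
        totals[key] += len(tokens)
        hits[key] += tokens.count(term)
    return {key: 1000 * hits[key] / totals[key] for key in totals}


if __name__ == "__main__":
    for key, rate in sorted(term_rate_by_group(DOCS, "nation").items()):
        print(key, round(rate, 1))
```

Adding a year (or month) to the grouping key would be one way to approach the questions about change over the course of the war, and the resulting table of rates is the sort of thing a set of side-by-side charts could be drawn from.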

I was really interested in Adam’s post on “Scholarship and Digital Texts.” Perhaps we should have a pair of sessions on texts: one focused on deep markup (XML, TEI, etc.) and innovative presentation of particular documents (making the most of micro collections like a critical edition), and another focused on using text mining and visualizations to tackle and make use of the abundance of digitized documents now available (macro collections).

3 Responses to “Text mining and visualizations”

  1. Laura Mandell Says:

    Rob: I like the idea of the two sessions. I can bring a poem marked up as deeply as possible — fit for Monk — using TEI P5. I’m getting it prepared for Ira Greenberg (www.iragreenberg.com) to create a visualization of the poem: he’s a digital artist, and I’ll be able to show at least one of his visualizations when I come.
    Thinking about data mining, about huge amounts of data and visualizations, that seems to me more important than anything in thinking about how computation will change literary and historical analysis (I completely buy Moretti’s arguments). Laura

  2. Lisa Spiro Says:

    Sign me up! Your project sounds fascinating, Rob. I’m exploring the use of text mining tools for my project on 19th C culture and bachelorhood, so I’d love to participate in this conversation (and pretty much every other one taking place at THAT Camp too, sigh).

  3. Non-exhaustive list of topics covered at the THATCamps | ThatCamp Paris 2010 Says:

    […] thatcamp.org/2008/05/text-mining-and-visualizations/ […]