The Victorian Sage

"Many shall run to and fro, and knowledge shall be increased"

Tag: digital humanities

Analyzing Thomas Carlyle’s Writings with Voyant

A useful and user-friendly tool for basic digital analysis of texts is Voyant. I used it to analyze five works of Thomas Carlyle, taken from Project Gutenberg. The works chosen were:

Sartor Resartus (1834)

The French Revolution (1838)

On Heroes, Hero-worship and the Heroic in History (1841)

Past and Present (1843)

Latter-day Pamphlets (1850)

These were partly chosen as they are perhaps Carlyle’s most important works, but also because Gutenberg doesn’t have all Carlyle’s works. For example, I would have considered Chartism (1840) had it been there, but it wasn’t (though it can be accessed online via Google Books). Similarly, the massively influential Critical and Miscellaneous Essays (1838) were not there.

There are a couple of other minor caveats:

1) The version of Latter-day Pamphlets used was not the complete version. Like many versions, it consists of only five essays, omitting the final three.

2) The Gutenberg pages analyzed contained not only the texts of the works, but also various paratexts: title and publication details, Gutenberg’s copyright statement, and so on. This is most important regarding Past and Present, which contained an introduction by Ralph Waldo Emerson from the first US edition of the work. For a proper academic analysis, one would have to work on finding or creating a webpage or file with no such paratexts, but for the purposes of this blog, the superfluous material wasn’t enough to seriously upset the findings.

So, I simply copied and pasted the five links to the relevant pages on Gutenberg, then Voyant did the rest, returning a page filled with analysis of Carlyle’s works. First is a word cloud:

This can be adjusted to include from 25 words up. The adjustment bar, however, is very fiddly (at least on my iPad), and it’s hard to adjust the number of words with accuracy or to tell how many words are being shown. The cloud above has about 100 words, the 100 most common words across the texts. The larger a word is drawn, the greater its frequency. A quick look tells you that the most frequent word across all the texts is man. Still more pointedly, the second most frequent word is men. By clicking on the words in the cloud, we find that man gets 2293 mentions, men 1815. This already tells us a lot about Carlyle’s writing: he was interested in the male experience, and he was troubled and obsessed by ideas of manhood, constantly working through them. The words women and woman get only 182 and 56 mentions respectively. Already we see how Carlyle’s thought is out of kilter with our own times.
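For readers who want to reproduce this kind of count outside Voyant, here is a minimal Python sketch of a word-frequency list. It is not Voyant's own code: the tokenizer (runs of letters only), the tiny stop-word set, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def top_words(text, n=5, stopwords=frozenset()):
    """Return the n most frequent lower-cased words, skipping stop words."""
    # Assumption: a "word" is any run of the letters a-z after lower-casing.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Invented sample sentence, not a quotation from Carlyle.
sample = "The man spoke to the men, and the men heard the man."
print(top_words(sample, n=2, stopwords={"the", "to", "and"}))
```

Run over the full Gutenberg files, a list like this would also pick up the paratexts mentioned earlier, so the totals would only approximate what a clean edition gives.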

We can toggle between cloud view and list view of most popular words, and while the former is perhaps more immediately striking and certainly more redolent of digital humanities, the latter view is better for a more exact picture. It allows us to establish that the third most popular word is world. This presence illustrates the grandeur of Carlyle’s ambitions. He was a wide-gazing sage, not the narrowly focused expert that is valued in the 21st century. The frequency with which the word world occurs defines perhaps the most important difference between the Victorian intellectual and the contemporary scholar: he is not an expert in any particular thing, but rather strives to comprehend the world as a totality.

Shall is also in the top five. By clicking on the word, we can also see which work it is most popular in. In this case, it’s The French Revolution by quite a distance. This work is ostensibly one of history, but Carlyle is using shall to slip back and forth in time, to predict the future of the past, such as in the word’s very first appearance. This comes in a passage which is very typical of Carlyle, an address to the poverty-stricken masses of pre-revolutionary France on the occasion of a police crackdown on public protests/riots:

O ye poor naked wretches! and this, then, is your inarticulate cry to Heaven, as of a dumb tortured animal, crying from the uttermost depths of pain and debasement? Do these azure skies, like a dead crystalline vault, only reverberate the echo of it on you? Respond to it only by ‘hanging on the following days?’ –Not so: not forever! Ye are heard in Heaven. And the answer too will come,–in a horror of great darkness, and shakings of the world, and a cup of trembling from which all the nations shall drink. [My italics and underlining]

The cup of trembling was of course the French Revolution itself, which struck fear into the rich and privileged of all countries, and Carlyle is here tapping into the fear among his British readers that the Revolution could spread. So the use of shall here and in other parts of this work is a function of Carlyle’s particular mode, which might be called retroactive prophecy. It harnesses the power of the prophetical voice, with little of the epistemological risk (that is, it can hardly be wrong, because the things prophesied have for the most part already happened).

Table in Voyant showing relative frequency of “shall” in Carlyle’s works.

Voyant also supplies a word count for each text. The French Revolution is the longest; Latter-day Pamphlets the shortest, though it is, as noted above, missing part of the originally published material. Not much to analyze there. Potentially more interestingly, there is considerable variation in vocabulary density across the works. Vocabulary density refers to the ratio of different words used to total word count. Carlyle’s highest vocabulary density occurs in Sartor, indicating that it is a more linguistically varied text, perhaps a more demanding and difficult text. As a particular admirer of Sartor, I think it also indicates that this work is the product of a more supple and questioning mind than the other works. The lowest vocabulary density is found in On Heroes. When one remembers that this work began as a series of lectures, this seems a deliberate choice by Carlyle, streamlining his vocabulary to make his ideas more accessible to a listening audience without the possibility of going back and reading over difficult parts.
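The ratio described above (different words divided by total words) is simple enough to sketch in a few lines of Python. The tokenizer is an assumption, and Voyant may tokenize differently, so the figures would not necessarily match its table exactly.

```python
import re

def vocabulary_density(text):
    """Ratio of distinct words to total words (0.0 for an empty text)."""
    # Assumption: a "word" is any run of the letters a-z after lower-casing.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented sample: 4 distinct words out of 5 in total.
print(vocabulary_density("the man and the men"))
```

One caution when comparing works this way: the ratio tends to fall as a text gets longer, because common words keep repeating, so a very long work and a short one are not being measured on quite equal terms.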

Average words per sentence is another indicator of complexity. Here On Heroes has the lowest wps, showing it again as the least complex text. The highest wps, though, belongs to Pamphlets. This is an interesting development, as Carlyle’s wps had previously fallen from the heights of Sartor, but here hit a new peak. This anomalous situation warrants more developed study than I can give it here.
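Average words per sentence can be approximated along the same lines. This sketch assumes that a sentence ends at any run of full stops, question marks or exclamation marks, which is cruder than a proper sentence splitter (abbreviations such as "Mr." would be miscounted) and is not claimed to be Voyant's rule.

```python
import re

def average_words_per_sentence(text):
    """Mean number of words per sentence, using a naive sentence split."""
    # Assumption: sentences end at ., ! or ? (one or more in a row).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_total = sum(len(re.findall(r"[A-Za-z]+", s)) for s in sentences)
    return word_total / len(sentences)

# Invented sample: sentences of 3, 4 and 2 words.
print(average_words_per_sentence("Work is noble. Who shall say otherwise? Not I!"))
```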

In the screenshot above, the final category is Distinctive Words. This means the words which characterize individual works but rarely or never appear in the other texts analyzed. Most of the words involved are proper nouns, generally the names of the works’ main characters: so Teufelsdrockh is the most distinctive word in Sartor, because Diogenes Teufelsdrockh is the book’s protagonist; abbot is the most distinctive word in Past and Present, because Abbot Samson is that book’s focus. Thus, this category seems too predictable to be really insightful, at least in the examples here.
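One common way to score words that characterize a single text against the rest of a corpus is TF-IDF: a word's relative frequency in the text, weighted down the more texts it appears in. The sketch below uses that measure as an assumption; this post does not establish how Voyant itself computes the category, and the two miniature "texts" are invented.

```python
import math
import re
from collections import Counter

def distinctive_words(docs, n=3):
    """For each named text, return its n highest-scoring words by TF-IDF."""
    counts = {name: Counter(re.findall(r"[a-z]+", text.lower()))
              for name, text in docs.items()}
    # Number of texts in which each word occurs at least once.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    total_docs = len(docs)
    result = {}
    for name, word_counts in counts.items():
        length = sum(word_counts.values())
        scores = {w: (k / length) * math.log(total_docs / doc_freq[w])
                  for w, k in word_counts.items()}
        # Highest score first; ties broken alphabetically for stable output.
        result[name] = sorted(scores, key=lambda w: (-scores[w], w))[:n]
    return result

# Invented miniature texts standing in for two of the works.
toy = {
    "sartor": "teufelsdrockh wrote of clothes and teufelsdrockh wandered",
    "past": "abbot samson ruled and abbot samson worked",
}
print(distinctive_words(toy, n=1))
```

A word found in every text scores zero here, which is why proper nouns confined to one book dominate such lists.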

I have only scraped the surface of the many possibilities of Voyant, not only for studies of a single author, but also, and perhaps especially, for comparison between authors. Thus I will undoubtedly return to this tool sooner rather than later, perhaps to compare Carlyle’s texts to those of some of his contemporaries. The most impressive things about the tool, in my opinion, are its astonishing ease of use (fiddly bar accompanying word cloud aside) and user-friendliness, and the fact that it is, as of now, totally free.

The Most Frequently Taught Fictional Texts in (US) Universities

The Open Syllabus Project collects the booklists of over 1m syllabi (mostly US) and one can browse a list of the books used on all of these syllabi (ordered by frequency). There are over 933,000 books listed altogether, starting with the most commonly assigned book of all: Strunk and White’s Elements of Style, on almost 3,400 course lists. There are all sorts of investigations that can be done on this list, but for the moment I am just going to look at the works of fiction (mostly novels, but some that would be classed as novellas or short stories) that appear in the top 100 of the list. They are:

5. Frankenstein (Shelley)

15. Heart of Darkness (Conrad)

24. Things Fall Apart (Achebe)

36. The Great Gatsby (Fitzgerald)

43. Beloved (Morrison)

47. Huckleberry Finn (Twain)

50. The Yellow Wallpaper (Gilman)

55. The Awakening (Chopin)

57. Candide (Voltaire)

67. Invisible Man (Ellison)

70. Pride and Prejudice (Austen)

71. Their Eyes Were Watching God (Hurston)

76. Brave New World (Huxley)

87. Mrs Dalloway (Woolf)

91. The Metamorphosis (Kafka)

97. Adventures of Huckleberry Finn (Twain)

98. The Scarlet Letter (Hawthorne)

There are 17 works of fiction listed above, but Huckleberry Finn actually appears twice (with slight variations on the title), at 47 and at 97, so there are really 16. Though the Open Syllabus Project appears to be well researched and well presented, this is a somewhat glaring oversight. Adding the scores together for both entries, it is clear that Huckleberry Finn should appear much higher on the list (in 10th position, by my calculations).
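The correction described above, adding together the counts for variant titles and then re-ranking, can be sketched as follows. The syllabus counts are made-up placeholders (the real figures are not given in this post), and the alias table is a hypothetical helper, not part of the Open Syllabus Project.

```python
from collections import defaultdict

def merge_and_rank(entries, aliases):
    """Sum counts for titles that share a canonical name, then rank by total."""
    totals = defaultdict(int)
    for title, count in entries:
        totals[aliases.get(title, title)] += count
    # Highest total first; ties broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Placeholder counts for illustration only.
entries = [
    ("Frankenstein", 900),
    ("Heart of Darkness", 700),
    ("Huckleberry Finn", 500),
    ("Adventures of Huckleberry Finn", 450),
]
aliases = {"Adventures of Huckleberry Finn": "Huckleberry Finn"}

for rank, (title, total) in enumerate(merge_and_rank(entries, aliases), start=1):
    print(rank, title, total)
```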

Some facts:

  • Among the 16 texts, 7 are by women, 9 by men.
  • 8 are by Americans, 4 English, 1 Pole (Conrad, although he was living in England throughout his writing career), 1 Nigerian, 1 French, 1 German-language writer from Prague (Kafka).
  • 8 from the 20th century, 7 from the 19th century, 1 from the 18th century (Voltaire).

The theme of race is what really jumps out in this selection: Huckleberry Finn, Invisible Man, Beloved, Their Eyes Were Watching God, Heart of Darkness, Things Fall Apart. The last-named two are also about colonialism, and they are also 2 of the 3 most frequently assigned fictional texts, which illustrates the centrality of the subject in US academia.

There is certainly some sense of the much-discussed “opening up the canon” here, but there is obviously a marked Eurocentrism to the choices. The exception is Achebe’s novel, though I would point out that the irony here is that Things Fall Apart is an obvious rebuttal of the super-canonical Heart of Darkness (of which Achebe was famously critical), so it is the most European of all African novels. There’s nothing from South America or Asia.

The old Leavisite canon is still there: Austen and Conrad make it; George Eliot and Henry James don’t. Dickens, liminal and semi-canonical in Leavis’ opinion, bubbles just under the top 100 here too (Great Expectations at 112). The continuing relevance of Conrad is clear: it is his emphasis on race and colonial relations in Heart of Darkness that keeps interest in his work alive, rather than the purely literary qualities that Leavis sought (indeed, he thought Heart of Darkness one of Conrad’s lesser efforts, as he discusses in The Great Tradition). As for Austen, the case is less clear-cut. Certainly, she explored female subjectivity and the social relations of upper-class women in depth, but unlike Conrad she knew nothing of class or racial struggle. I think Raymond Williams’ account of Austen’s viewpoint as a social novelist, from which some things, some activities, some people, simply could not be seen, still stands up:

The land is seen primarily as an index of revenue and position; its visible order and control are a valued product, while the process of working it is hardly seen at all.


She is concerned with the conduct of people who, in the complications of improvement, are repeatedly trying to make themselves into a class. But where only one class is seen, no classes are seen. Her people are selected though typical individuals, living well or badly within a close social dimension. Cobbett never, of course, saw them as closely or as finely; but what he saw was what they had in common: the underlying economic process. (The Country and the City)

The fact that such a partial view has such resonance among the educationally privileged of 21st-century America is revealing in itself, and worthy of further reflection. And there is much more to provoke reflection in the results of the Open Syllabus Project, well worth a look for anyone interested in the nature of contemporary third level education.

Data on Historical Accuracy in Hollywood Films

Interesting (but also not) structuralist approach to assessing historical accuracy in recent movies from the website Information is Beautiful. Selma is 100% historically accurate. I haven’t seen Selma but it sounded implausible to me that any film could be described as 100% historically accurate (even documentary footage has undergone selection of some sort), though I then noticed that IisB have a pedantry setting, and if set to maximum pedantry, Selma “only” gets 81%. Each film is divided into 50-ish scenes, and each scene gets a short commentary and comparison to documented history.


Each scene is then scored on a simple 4-option colour-coded scale and the percentage is arrived at from this. It’s a pretty straightforward methodology (if relatively time-consuming and requiring a lot of knowledge), and is mildly diverting, though I would tend to agree with Alex von Tunzelmann in the Guardian’s piece on the data: “The results are mostly in the right ballpark, but I’d be reluctant to issue such precise percentage-point scores on historical accuracy. It’s a nice touch that you can alter the pedantry level on the site. Even so, historical truth isn’t a binary: you need fuzzy logic.”
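To make the arithmetic concrete, here is a hedged sketch of how a percentage could fall out of per-scene ratings. Everything specific in it is an assumption: the 0 to 3 numbering of the four options, the idea that the pedantry level sets the lowest rating still counted as accurate, and the sample ratings. The site's actual formula is not given in this post.

```python
def accuracy_percentage(scene_ratings, min_rating):
    """Share of scenes (as a percentage) rated at or above min_rating."""
    if not scene_ratings:
        return 0.0
    accurate = sum(1 for rating in scene_ratings if rating >= min_rating)
    return 100.0 * accurate / len(scene_ratings)

# Hypothetical ratings for ten scenes on an assumed scale:
# 0 = false, 1 = mostly false, 2 = mostly true, 3 = true.
ratings = [3, 3, 2, 2, 1, 3, 2, 3, 3, 2]

print(accuracy_percentage(ratings, min_rating=2))  # lenient setting
print(accuracy_percentage(ratings, min_rating=3))  # maximum pedantry
```

The same ten scenes give very different totals depending on where the cut-off sits, which is the binary thresholding that von Tunzelmann's remark about fuzzy logic objects to.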

