The Victorian Sage

"Many shall run to and fro, and knowledge shall be increased"

Tag: data analysis

Comparing Dickens and Carlyle using Voyant

My last post did some basic analysis of a selection of Thomas Carlyle’s writings using Voyant. Now I want to use Voyant to compare Carlyle’s writings to those of his contemporary Charles Dickens. Dickens was primarily a novelist, and I am going to use here four novels and one novella for analysis. Specifically:

Oliver Twist (1838)

The Chimes (1844)

Bleak House (1853)

Hard Times (1854)

A Tale of Two Cities (1859)

Charles Dickens (1812-1870)

Dickens is, then, generically different from Carlyle. Carlyle was not a novelist or fiction writer. Indeed, from our point of view, it is difficult to place him generically at all. However, to his contemporaries he was a Sage. I have earlier noted that the Sage exhibited features of both the novelist and of the philosopher. Like the philosopher, he was concerned with life in the widest sense, but unlike the philosopher, the Sage did not employ logical argument to prove his validity as an interpreter of life. Rather, he used a myriad of techniques, including several from the novelist’s toolbox: narrative, characterization, dialogism, irony, sarcasm, parable, exhortation, sermonizing, and, in Carlyle’s case, sheer abuse. The abusive mode is one that is now rarely used, but it is not without power. Take this example from Carlyle:

Get out of that, you ugly and foolish windbags: do you think the Eternal God of Nature will suffer you to stand in the way of His work? If you cannot open your eyes and see that this is a thing that must be done, you had better betake yourselves elsewhere – to the lowest Gehenna were fittest – there is no place for you in a world which is ruled, in the long run, by fact and not by chimera. (Latter-day Pamphlets)

Carlyle is here contemptuous of his readers, the “ugly and foolish windbags” referred to. He does not try to convince through logic, but by the strength of his contempt for any opposing position. He almost orders the reader to convince themselves: If you cannot open your eyes… His position holds little logical authority, but its intensity is often effective. Ruskin, Carlyle’s disciple, also used this mode, as I have discussed elsewhere.

Dickens is an interesting comparison with Carlyle, both because he is the pre-eminent novelist of the time (in the Anglophone world, at least), and because his debt of influence to Carlyle is well established. He inscribed Hard Times (1854) “To Thomas Carlyle” and claimed to have read Carlyle’s French Revolution five hundred times. They had certain of the same social and perhaps even artistic aims, yet they were received very differently by the public and the press. Perhaps by comparing Carlyle with the great novelist, we can get a better idea of what the Sage was doing, and how he was doing it.

Most frequent words:

In the selective corpus inputted to Voyant, the most frequently used word is Mr, and it is followed by said, little, sir and know in that order. Remember Carlyle’s most used words were man, men, world, like, and shall. A major overlap appears to be the overwhelming male bias in their lexica. Both authors are far more interested in a specifically male experience of the world, with the female equivalents being far less commonly used. This bias is more pronounced in Carlyle, though, as woman, Miss and Mrs do also feature fairly high in Dickens’ list. The most surprising word on Dickens’ list is little, which appears 1959 times (for comparison, large is at 237, and big at 22). There is probably no other writer in whose corpus this adjective would be so prominent – and the books analyzed don’t even include Little Dorrit or The Old Curiosity Shop (protagonist: Little Nell), so the results could have been even more striking. The concept of littleness, then, is clearly central to Dickens’ work. Other than that, Carlyle’s choices are more distinctive and revealing than Dickens’. I will not repeat what I have already written about Carlyle, but regarding Dickens it is really striking how commonplace and unliterary all of his most frequent words are. Forty of the top 50 words are monosyllables, and the only entries of more than two syllables are the trisyllabic gentleman and Oliver (as in Twist, the only character name in the top 50).

Word cloud on Voyant showing Dickens’ most frequent words.

Vocabulary density:

Carlyle’s most dense text was Sartor Resartus at 0.137, with French Revolution the least dense at 0.073. With Dickens the range was from The Chimes at 0.138 to Bleak House at 0.065. Even from my few initial Voyant analyses, I can see that this measure is rather misleading if taken in isolation, as a shorter text will almost always have a higher density than a longer one: the more an author writes, the more words get repeated. So the two authors’ longest works are also the ones with the most repeated words and the lowest density. At the other end, the comparison is more revealing, as Chimes and Sartor have almost equal density, though the latter is much longer: 85251 words as opposed to 34124. To sustain the same density over roughly two and a half times the length, Carlyle must be drawing on a much wider vocabulary than Dickens. The totals bear this out: Carlyle uses 32294 unique words across the corpus, Dickens 22432, a strikingly large gap.
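One way to take text length out of the comparison is to measure density over equal-sized stretches of text rather than over the whole work. The Python sketch below is only an illustration of that idea, not something Voyant reports: the function names, the tokenizing rule and the window size are all my own assumptions.

```python
import re

def tokenize(text):
    # Assumption: a word is a run of letters/apostrophes, case-folded.
    # Voyant's own tokenizer may split words differently.
    return re.findall(r"[a-z']+", text.lower())

def raw_density(tokens):
    """Unique words divided by total words, over the whole text."""
    return len(set(tokens)) / len(tokens)

def windowed_density(tokens, window=1000):
    """Mean unique/total ratio over consecutive full windows of `window` tokens."""
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not ratios:
        raise ValueError("text is shorter than one window")
    return sum(ratios) / len(ratios)

# Toy demonstration: the same four words, written once and then twice.
short = tokenize("alpha beta gamma delta")
long_ = short * 2
print(raw_density(short), raw_density(long_))                      # 1.0 0.5
print(windowed_density(short, 4), windowed_density(long_, 4))      # 1.0 1.0
```

The raw ratio halves simply because the text doubled in length, while the windowed figure stays put. Applied to full texts with a window of a few thousand words, this would give figures for The Chimes and Sartor Resartus that can be compared directly.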

Words per sentence:

I noted in the last post that Carlyle’s average wps ranged from 22.6 to 31.5 across the selective corpus. Dickens’ wps ranges from 15.7 in The Chimes to 18.6 in A Tale of Two Cities and Oliver Twist. In fact, apart from Chimes having a noticeably lower wps, there is little variation across Dickens’ texts. But they all have much lower wps than Carlyle. Carlyle was particularly fond of long sentences and complex structures. At the same time, there may be a generic reason for the big difference here: Dickens’ fiction has a lot of dialogue, and this will generally be composed of much shorter sentences, including one-word sentences (replies like “yes”, “no”, etc.).
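The effect of dialogue on this average is easy to see in miniature. The sketch below is a rough approximation: it assumes a sentence ends at any run of full stops, question marks or exclamation marks, which is cruder than whatever rule Voyant applies, and the function name is my own.

```python
import re

# Assumption: a word is a run of letters, optionally with one internal apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def words_per_sentence(text):
    """Average number of words per sentence, splitting on . ! ? only."""
    counts = []
    for chunk in re.split(r"[.!?]+", text):
        n = len(WORD.findall(chunk))
        if n:  # skip fragments that hold only quotation marks or spaces
            counts.append(n)
    return sum(counts) / len(counts) if counts else 0.0

narration = "It was the best of times, it was the worst of times."
dialogue = '"Yes." "No." "Are you sure?"'

print(words_per_sentence(narration))                   # 12.0
print(words_per_sentence(narration + " " + dialogue))  # 4.25
```

Three short replies are enough to pull a twelve-word average down to just over four, which is why a novel full of conversation will score lower than continuous prose even if its narrative sentences are just as long.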

To ascertain the role played by such factors as genre on wps would of course require analysis of a much wider range and larger number of texts. This initial analysis does raise several interesting points about the differences between Carlyle and Dickens. The biggest surprise for me is the degree to which the statistics seem to suggest a greater sophistication in Carlyle’s works. I may perform further comparisons using other Victorian writers – novelists, Sages and others – to get a more nuanced understanding of this.

Dickens Voyant analysis: https://voyant-tools.org/?corpus=dcc74d10fbfc6d00c4dc79b07670a90c

Carlyle Voyant analysis: https://voyant-tools.org/?corpus=38b0c430d5a5179d802fac046003b23d

Voyant analysis of my PhD thesis https://voyant-tools.org/?corpus=f259039874058130cc7d18fbf033b91d

Analyzing Thomas Carlyle’s Writings with Voyant

A useful and user-friendly tool for basic digital analysis of texts is Voyant. I used it to analyze five works of Thomas Carlyle, taken from Project Gutenberg. The works chosen were:

Sartor Resartus (1834)

The French Revolution (1838)

On Heroes, Hero-worship and the Heroic in History (1841)

Past and Present (1843)

Latter-day Pamphlets (1850)

These were partly chosen as they are perhaps Carlyle’s most important works, but also because Gutenberg doesn’t have all Carlyle’s works. For example, I would have considered Chartism (1840) had it been there, but it wasn’t (though it can be accessed online via Google Books). Similarly, the massively influential Critical and Miscellaneous Essays (1838) were not there.

There are a couple of other minor caveats:

1) The version of Latter-day Pamphlets used was not the complete version. Like many versions, it consists of only five essays, omitting the final three.

2) The Gutenberg pages analyzed contained not only the texts of the works, but also various paratexts: title and publication details, Gutenberg’s copyright statement, and so on. This is most important regarding Past and Present, which contained an introduction by Ralph Waldo Emerson from the first US edition of the work. For a proper academic analysis, one would have to work on finding or creating a webpage or file with no such paratexts, but for the purposes of this blog, the superfluous material wasn’t enough to seriously upset the findings.

So, I simply copied and pasted the five links to the relevant pages on Gutenberg, then Voyant did the rest, returning a page filled with analysis of Carlyle’s works. First is a word cloud:

This can be adjusted to include from 25 words up. The adjustment bar, however, is very fiddly (at least on my iPad), and it’s hard to adjust the number of words with accuracy or tell how many words are being shown. The cloud above has about 100 words, the 100 most common words across the texts. The larger a word appears in the cloud, the greater its frequency. A quick look tells you that the most frequent word across all the texts is man. Still more pointedly, the second most frequent word is men. By clicking on the words in the cloud, we find that man gets 2293 mentions, men 1815. This tells us already a lot about Carlyle’s writing: he was interested in the male experience, he was troubled and obsessed by ideas of manhood, constantly working through these ideas. The words women and woman get only 182 and 56 mentions respectively. Already we see how Carlyle’s thought is out of kilter with these times.
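For anyone who wants to reproduce this kind of count outside Voyant, the underlying operation is simple. The following Python sketch is a minimal illustration under my own assumptions: the tiny stopword list is hypothetical (Voyant's is far longer), and the sample sentence is invented rather than taken from Carlyle.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only.
STOPWORDS = {"the", "and", "of", "a", "to", "in", "is", "that", "it", "was"}

def top_words(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Invented sample text, not a quotation.
sample = "The man and the men of the world. A man is a man, and men are men."
print(top_words(sample, n=3))
```

Without the stopword filter, function words such as the and of would crowd out everything else, which is why the choice of stopword list affects what a word cloud appears to say.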

We can toggle between cloud view and list view of most popular words, and while the former is perhaps more immediately striking and certainly more redolent of digital humanities, the latter view is better for a more exact picture. It allows us to establish that the third most popular word is world. This presence illustrates the grandeur of Carlyle’s ambitions. He was a wide-gazing sage, not the narrowly focused expert that is valued in the 21st century. The frequency with which the word world occurs defines perhaps the most important difference between the Victorian intellectual and the contemporary scholar: the former is not an expert in any particular thing, but rather strives to comprehend the world as a totality.

Shall is also in the top five. By clicking on the word, we can also see which work it is most popular in. In this case, it’s The French Revolution by quite a distance. This work is ostensibly one of history, but Carlyle is using shall to slip back and forth in time, to predict the future of the past, such as in the word’s very first appearance. This comes in a passage which is very typical of Carlyle, an address to the poverty-stricken masses of pre-revolutionary France on the occasion of a police crackdown on public protests/riots:

O ye poor naked wretches! and this, then, is your inarticulate cry to Heaven, as of a dumb tortured animal, crying from the uttermost depths of pain and debasement? Do these azure skies, like a dead crystalline vault, only reverberate the echo of it on you? Respond to it only by ‘hanging on the following days?’ –Not so: not forever! Ye are heard in Heaven. And the answer too will come,–in a horror of great darkness, and shakings of the world, and a cup of trembling from which all the nations shall drink. [My italics and underlining]

The cup of trembling was of course the French Revolution itself, which struck fear into the rich and privileged of all countries, and Carlyle is here tapping into the fear among his British readers that the Revolution could spread. So the use of shall here and in other parts of this work is a function of Carlyle’s particular mode, which might be called retroactive prophecy. It harnesses the power of the prophetical voice, with little of the epistemological risk (that is, it can hardly be wrong, because the things prophesied have for the most part already happened).

Table in Voyant showing relative frequency of “shall” in Carlyle’s works.

Voyant also supplies word count for each text. The French Revolution is the longest; Latter-day Pamphlets the shortest – though it is, as noted above, missing part of the originally published material. Not much to analyze there. Potentially more interestingly, there is considerable variation in vocabulary density across the works. Vocabulary density refers to the ratio of different words used to total word count. Carlyle’s highest vocabulary density occurs in Sartor, indicating that it is a more linguistically varied text, perhaps a more demanding and difficult text. As a particular admirer of Sartor, I think it also indicates that this work is the product of a more supple and questioning mind than the other works. The lowest vocabulary density is found in On Heroes. When one remembers that this work began as a series of lectures, this seems a deliberate choice by Carlyle, streamlining his vocabulary to make his ideas more accessible to a listening audience without the possibility of going back and reading over difficult parts.
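The ratio described here can be written out in a few lines of Python. This is a minimal sketch of the definition as given above, assuming a simple rule for what counts as a word; Voyant's tokenizer may differ in detail, so figures will not match its output exactly.

```python
import re

def vocabulary_density(text):
    """Number of different words divided by total number of words."""
    # Assumption: words are runs of letters/apostrophes, compared case-insensitively.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Six words, five of them different ("the" occurs twice).
print(round(vocabulary_density("The cat sat on the mat"), 3))  # 0.833
```

A density of 1.0 would mean no word is ever repeated; the more a text reuses its words, the closer the figure falls towards zero.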

Average words per sentence is another indicator of complexity. Here On Heroes has the lowest wps, showing it again as the least complex text. The highest wps, though, is in Pamphlets. This is an interesting development, as Carlyle’s wps had previously fallen from the heights of Sartor, but here hit a new peak. This anomalous situation warrants more developed study than I can give it here.

In the screenshot above, the final category is Distinctive Words. This means the words which characterize individual works but rarely or never appear in the other texts analyzed. Most of the words involved are proper nouns, generally the names of the works’ main characters: so Teufelsdrockh is the most distinctive word in Sartor, because Diogenes Teufelsdrockh is the book’s protagonist; abbot is the most distinctive word in Past and Present, because Abbot Samson is that book’s focus. Thus, this category seems too predictable to be really insightful, at least in the examples here.
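One common way to score distinctiveness of this kind is TF-IDF, which rewards words that are frequent in one text and absent from the others. The sketch below assumes that approach purely for illustration. I have not checked the exact formula Voyant uses, and the two miniature "texts" are invented stand-ins, not excerpts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def distinctive_words(docs, n=3):
    """docs maps a title to its text; returns each title's top-n words by TF-IDF."""
    counts = {title: Counter(tokenize(text)) for title, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()            # in how many texts each word occurs
    for c in counts.values():
        doc_freq.update(c.keys())
    result = {}
    for title, c in counts.items():
        total = sum(c.values())
        # term frequency times log of inverse document frequency
        scores = {w: (k / total) * math.log(n_docs / doc_freq[w]) for w, k in c.items()}
        result[title] = sorted(scores, key=scores.get, reverse=True)[:n]
    return result

# Invented toy corpus for illustration only.
toy = {
    "Sartor": "teufelsdrockh clothes man world teufelsdrockh",
    "Past and Present": "abbot samson man world abbot",
}
print(distinctive_words(toy, n=1))
```

Words that occur in every text, like man and world in the toy corpus, score zero under this scheme, which is consistent with proper nouns rising to the top of the list.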

I have only scratched the surface of the many possibilities of Voyant, not only for studies of a single author, but also, and perhaps especially, for comparison between authors. Thus I will undoubtedly return to this tool sooner rather than later, perhaps to compare Carlyle’s texts to those of some of his contemporaries. The most impressive things about the tool, in my opinion, are its astonishing ease of use (fiddly bar accompanying word cloud aside) and the fact that it is, as of now, totally free.

Data on Historical Accuracy in Hollywood Films

Interesting (but also not) structuralist approach to assessing historical accuracy in recent movies from website Information is Beautiful. Selma, according to the site, is 100% historically accurate. I haven’t seen Selma but it sounded implausible to me that any film could be described as 100% historically accurate (even documentary footage has undergone selection of some sort), though I then noticed that IisB have a pedantry setting, and if set to maximum pedantry, Selma “only” gets 81%. Each film is divided into 50-ish scenes, and each scene gets a short commentary and comparison to documented history.

Each scene is then scored on a simple 4-option colour-coded scale and the percentage is arrived at from this. It’s a pretty straightforward methodology (if relatively time-consuming and requiring a lot of knowledge), and is mildly diverting, though I would tend to agree with Alex von Tunzelmann in the Guardian’s piece on the data: “The results are mostly in the right ballpark, but I’d be reluctant to issue such precise percentage-point scores on historical accuracy. It’s a nice touch that you can alter the pedantry level on the site. Even so, historical truth isn’t a binary: you need fuzzy logic.”
