It would seem that I have far too much time on my hands. After the post about a Star Trek "test", I started wondering if there could be any data to back it up and... well here we go:
The Next Generation
Name | Percentage of Lines |
---|---|
PICARD | 20.16 |
RIKER | 11.64 |
DATA | 10.1 |
LAFORGE | 6.93 |
WORF | 6.14 |
TROI | 5.4 |
CRUSHER | 5.11 |
WESLEY | 2.32 |
DS9
Name | Percentage of Lines |
---|---|
SISKO | 13.0 |
KIRA | 8.23 |
BASHIR | 7.79 |
O'BRIEN | 7.31 |
ODO | 7.26 |
QUARK | 6.98 |
DAX | 5.73 |
WORF | 3.18 |
JAKE | 2.31 |
GARAK | 2.29 |
NOG | 2.01 |
ROM | 1.89 |
DUKAT | 1.76 |
EZRI | 1.53 |
Voyager
Name | Percentage of Lines |
---|---|
JANEWAY | 17.7 |
CHAKOTAY | 8.76 |
EMH | 8.34 |
PARIS | 7.63 |
TUVOK | 6.9 |
KIM | 6.57 |
TORRES | 6.45 |
SEVEN | 6.1 |
NEELIX | 4.99 |
KES | 2.06 |
Enterprise
Name | Percentage of Lines |
---|---|
ARCHER | 24.52 |
T'POL | 13.09 |
TUCKER | 12.72 |
REED | 7.34 |
PHLOX | 5.71 |
HOSHI | 4.63 |
TRAVIS | 3.83 |
SHRAN | 1.26 |
Discovery
Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.
Name | Percentage of Lines |
---|---|
BURNHAM | 22.92 |
SARU | 8.2 |
BOOK | 6.21 |
STAMETS | 5.44 |
TILLY | 5.17 |
LORCA | 4.99 |
TARKA | 3.32 |
TYLER | 3.18 |
GEORGIOU | 2.96 |
CULBER | 2.83 |
RILLAK | 2.17 |
DETMER | 1.97 |
OWOSEKUN | 1.79 |
ADIRA | 1.63 |
COMPUTER | 1.61 |
ZORA | 1.6 |
VANCE | 1.07 |
CORNWELL | 1.07 |
SAREK | 1.06 |
T'RINA | 1.02 |
If anyone is interested, here's the (rather hurried) Python used:
#!/usr/bin/env python
#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#
import re
from collections import defaultdict
from pathlib import Path

EPISODE_REGEX = re.compile(r"^\d+\.html?$")
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

    def collect(self) -> None:
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
                for line in episode.read_text().split("\n"):
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1

    @property
    def as_percentages(self) -> dict[str, float]:
        total = sum(self.line_count.values())
        r = {}
        for k, v in self.line_count.items():
            percentage = round(v * 100 / total, 2)
            if percentage > 1:
                r[k] = percentage
        return {k: v for k, v in reversed(sorted(r.items(), key=lambda _: _[1]))}

    def render(self) -> None:
        print(self.path.name)
        print("| Name | Percentage of Lines |")
        print("| ---------------- | ------------------- |")
        for character, pct in self.as_percentages.items():
            print(f"| {character:16} | {pct} |")


if __name__ == "__main__":
    for series in (TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
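The header comment asks for a handful of files to be converted to UTF-8 by hand. A possible alternative is to let the reader fall back to another codec when UTF-8 decoding fails. This is only a sketch: `read_transcript` is a hypothetical helper that is not part of the script above, and `cp1252` as the fallback is an assumption, since nothing here says which encoding those files actually use.

```python
from pathlib import Path


def read_transcript(path: Path, fallback: str = "cp1252") -> str:
    """Decode a transcript as UTF-8, retrying with a fallback codec.

    Hypothetical helper. The fallback codec is a guess, not something
    confirmed for the chakoteya.net files.
    """
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for any byte the fallback
        # codec cannot map, so one odd byte does not abort the whole run.
        return raw.decode(fallback, errors="replace")
```

Swapping `episode.read_text()` for `read_transcript(episode)` inside `collect` would be the only change needed, assuming the guess about the fallback codec holds.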
Fascinating stuff, I love that you did this. I'm surprised Morn didn't rank higher considering how chatty he is in every scene.
Counting lines, counting words spoken, and measuring time spent speaking would probably give quite different results.
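For anyone who wants to try the word-count variant, here is a rough sketch that reuses the speaker pattern from the posted script but also captures the speech after the colon. `SPEECH_REGEX` and `count_lines_and_words` are names made up for this example, and the sample dialogue is invented rather than taken from a real transcript. It only counts words on the same line as the speaker name, so speeches that wrap onto later lines in the HTML would be undercounted, and any markup on the line would be counted as words.

```python
import re
from collections import defaultdict

# Same speaker pattern as the posted script, extended to capture the speech.
SPEECH_REGEX = re.compile(r"^(?P<name>[A-Z']+): (?P<speech>.*)$")


def count_lines_and_words(text: str) -> dict[str, tuple[int, int]]:
    """Return {speaker: (line_count, word_count)} for one transcript."""
    lines = defaultdict(int)
    words = defaultdict(int)
    for line in text.split("\n"):
        if m := SPEECH_REGEX.match(line):
            name = m.group("name")
            lines[name] += 1
            # Whitespace-separated tokens are a crude stand-in for words.
            words[name] += len(m.group("speech").split())
    return {name: (lines[name], words[name]) for name in lines}


# Invented sample, for illustration only.
sample = (
    "PICARD: Make it so.\n"
    "DATA: Aye, sir. Course laid in, warp six.\n"
    "PICARD: Engage."
)
print(count_lines_and_words(sample))
# → {'PICARD': (2, 4), 'DATA': (1, 7)}
```

Even in this tiny sample the ranking flips: PICARD has more lines, DATA has more words. Speaking time is not recoverable from the transcripts at all, so that comparison would need a different source.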