app mangatrans

Manga transcription data format and ways to render them into readable formats, statistics and more

1 stable release

1.0.1 Nov 18, 2022
1.0.0 Aug 18, 2022

GPL-3.0-or-later

Mangatrans

Goals

The goal of this project is to be able to write manga transcriptions as data rather than as the final result. The program should parse this data and construct various forms of output from it, such as readables, language reports and statistics.

Transcription

A transcription should contain the literal text from the manga. Along with it, it can contain auxiliary information such as the text with the kanji replaced, romanized text, a translation, or notes.

The goal is to be able to keep the transcription at hand while reading the manga, or to go back to it later without redoing work such as figuring out the reading of a kanji.

Language Report

An extra tool for learning Japanese while reading and transcribing manga. It could contain lists such as common kanji and their mappings into hiragana. Simple frequency statistics and sorted output may reveal patterns in the language used, such as the most common words and kanji, which is useful for learning.

Statistics Report

A tool of novelty interest. Knowing which locations show up most in the manga, who speaks most, which characters converse a lot with each other, and other such details may be interesting for fans of the manga. In my experience, as you go through a chapter to transcribe it, transcribing the language, translating it, understanding it and dealing with kanji take much more effort than simply stating the location and which characters are present. You might as well record these things too: they add little effort and make it possible to compute insights that are just really fun.

Configuration

Transcribing into a data format rather than into a format meant to be consumed directly has some advantages. Configuration allows different output to be generated from the same data, whether to suit different people's preferences or different goals. If you have done a lot of work and realize you want different formatting, you can generate a different output instead of reformatting your work. This is one of the main goals of the project, since I kept changing my formatting at the beginning of my transcribing journey.

Feature list

  • parse data
    • pictures data: location, characters, nr, page
    • transcription
    • translation
    • kanji map
    • from/to (which character a text comes from and which character it is directed to)
    • One or more pattern
  • general
    • report: volumes and chapters included, picture and morae counts
    • customization with config file
    • recursive locations
  • transcribe
    • original text and translation
    • kanji replacement
    • automatic romanization
    • automatic indentation
    • text consistency improvements
    • page headers
  • statistics
    • ranked locations: appearances, morae spoken in
    • characters ranked on number of appearances
    • characters ranked on morae spoken
    • characters ranked on morae spoken to by other characters
    • character pairs ranked on number of morae spoken in their interactions
    • characters ranked on overall prominence
  • language report
    • hiragana/katakana characters ranked by count
    • kanji ranked by count
    • words ranked by count

Data format

The data is written down in a toml file.
Every chapter should be in its own toml file.
Every file starts with some information about the chapter:

# this is a comment and will be ignored by the program
manga = "日常"
author = "あらゐけいいち"
volume = 1
chapter = 1
subchapter = 0.5
title = "日常の1.5"

The subchapter field is optional; the other fields are required.
After that you supply an array of pictures like this:

[[pic]]
    # picture data
[[pic]]
    # picture data
[[pic]]
    # picture data

An example of picture data:

[[pic]]
nr = 2
page = 1
location = "shinonome house"
characters = ["nano"]
    [[pic.text]]
        # text data
    [[pic.text]]
        # text data
    [[pic.text]]
        # text data

Picture number (nr) is required.
page is optional and sets the page number. If page is not present, the picture is assumed to be on the same page as the last declared page number. The first picture of every chapter must be assigned a page number so the program knows where to continue from.
location, characters and the array of texts are optional.
Every chapter's first page must have an initial location.

An example of text data:

[[pic.text]]
from = "nano"
to = "hakase"
lines = ["今日", "日直 でしたー"]
kmap = [
    ["今日", "きょう"],
    ["日", "につ"],
    ["直", "ちょく"],
]
transl = ["Today", "is my shift!"]
notes = ["This is an optional note, the transcriber may want to say something about this text."]
todo = true

from describes which character says the current text. I personally regard the narrator as a "character" in this field.
to holds the character to which this text is directed. I personally regard the audience as a "character" in this field. The field is optional. If a character speaks to themselves or thinks internally, you can just leave this field out.
lines is a mandatory field and should be an array of strings, each containing a line of transcribed text. This text should be transcribed literally from the manga, apart from spaces, and is used for many calculations and transformations. To separate words you can insert ASCII spaces " " between characters. These are removed in the transcription output but are used for things such as the romanized version.
kmap is an optional field that defines a mapping of kanji to hiragana or katakana. This is used for the substitution.
transl holds the translation of lines. It's optional, and the transl array may have a different number of entries than the lines array.
notes is optional and you may write down notes about this text here.
todo is optional and when it's set to true, it will be logged that this text needs work. You may want to set todo = true if there is an error, something is incomplete, etc. This way it will be logged every time, so you won't forget it after a while and you know which items need some work. Setting todo = false doesn't do anything; it's the same as leaving todo out.
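
The page carry-over rule described for the page field can be sketched in Python as follows. This is only an illustration under assumptions, not the program's actual implementation: resolve_pages is a hypothetical helper, and pictures are modelled as plain dicts, roughly as they would look after loading the toml file.

```python
def resolve_pages(pics):
    """Return (nr, page) pairs, filling a missing page from the last declared one."""
    resolved = []
    current = None  # last page number that was declared explicitly
    for pic in pics:
        if "page" in pic:
            current = pic["page"]
        elif current is None:
            # The first picture of a chapter has nothing to inherit from.
            raise ValueError(f"picture {pic['nr']} needs a page number")
        resolved.append((pic["nr"], current))
    return resolved


# Hypothetical picture data: only pictures 1 and 3 declare a page.
pics = [
    {"nr": 1, "page": 1},
    {"nr": 2},
    {"nr": 3, "page": 2},
    {"nr": 4},
]
print(resolve_pages(pics))  # [(1, 1), (2, 1), (3, 2), (4, 2)]
```

Raising an error for a chapter whose first picture has no page is one possible way to enforce the rule; the real program may report this differently.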

When a kanji appears multiple times in a text, you must give the correct mapping as many times as it appears. This is so that it is possible to have sentences that have the same kanji with different mappings. An example of this:

[[pic.text]]
from = "yukko"
lines = ["そう考 えると", "不幸中 の 幸いって", "ヤツだね"]
kmap = [
    ["考", "かんが"],
    ["不", "ふ"],
    ["幸", "こう"],
    ["中", "ちゅ"],
    ["幸", "さいわ"],
]
transl = ["If you think about it that way it's a blessing in disguise."]
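
Why every occurrence needs its own mapping becomes clearer with a small Python sketch of the substitution. It assumes that kmap entries are listed in order of appearance and are consumed one at a time from left to right; apply_kmap is a hypothetical name and the real program may work differently. The data is the yukko example, where 幸 is read first as こう and later as さいわ.

```python
def apply_kmap(lines, kmap):
    """Replace kanji by their readings, consuming one kmap entry per occurrence."""
    text = "\n".join(lines)  # search across line boundaries
    pieces = []
    cursor = 0
    for kanji, reading in kmap:
        idx = text.find(kanji, cursor)
        if idx == -1:
            raise ValueError(f"kanji {kanji!r} not found after position {cursor}")
        pieces.append(text[cursor:idx])  # untouched text before the kanji
        pieces.append(reading)
        cursor = idx + len(kanji)
    pieces.append(text[cursor:])
    # ASCII spaces only separate words; they are dropped in this output.
    return "".join(pieces).replace(" ", "").split("\n")


lines = ["そう考 えると", "不幸中 の 幸いって", "ヤツだね"]
kmap = [
    ["考", "かんが"],
    ["不", "ふ"],
    ["幸", "こう"],
    ["中", "ちゅ"],
    ["幸", "さいわ"],
]
print(apply_kmap(lines, kmap))
# ['そうかんがえると', 'ふこうちゅのさいわいって', 'ヤツだね']
```

Because the cursor only moves forward, the second 幸 entry can never be applied to the first occurrence, which is what allows one kanji to carry two readings in the same text.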

One Or More pattern

The 'One Or More' pattern has been implemented with backwards compatibility. This means that in fields that can take multiple values as an array, you can now also leave out the array if you have just one value. For example, when denoting which characters appear in a picture you could write characters = ["yukko", "mai"]. For a single character you can write characters = ["nano"]. With the 'One Or More' pattern you can write the latter as characters = "nano". Fields that support the pattern are: characters, from, to, lines, kmap, transl and notes.
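
A reader of the data can handle this pattern by normalizing every such field to a list before use. The Python sketch below shows one way to do that. Both helper names are hypothetical, and the handling of a single kmap pair written without the outer array is an assumption about how the pattern applies to kmap.

```python
def one_or_more(value):
    """Normalize a string-valued 'One Or More' field to a list of strings."""
    if value is None:
        return []  # field left out
    if isinstance(value, str):
        return [value]  # single value written without the array
    return list(value)


def one_or_more_pairs(value):
    """Normalize kmap: either one [kanji, reading] pair or a list of pairs."""
    if not value:
        return []
    if isinstance(value[0], str):
        return [list(value)]  # a lone pair (assumed spelling of the short form)
    return [list(pair) for pair in value]


print(one_or_more("nano"))                      # ['nano']
print(one_or_more(["yukko", "mai"]))            # ['yukko', 'mai']
print(one_or_more_pairs(["今日", "きょう"]))     # [['今日', 'きょう']]
```

Existing files that already use arrays pass through unchanged, which is what keeps the pattern backwards compatible.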

Example

For an example check out example.toml. For more information on the toml language visit https://toml.io/en/

All example material such as example lines and characters in example.toml and this readme are from the manga 日常 by あらゐけいいち. The material is used for educational purposes.

Usage

The program is used through a command line interface (CLI).

USAGE:
    mangatrans [OPTIONS] <INPUTFILES>...

ARGS:
    <INPUTFILES>...

OPTIONS:
    -d, --outputdir <OUTPUTDIR>
    -h, --help                       Print help information
    -l, --log <log>                  [default: true]
    -m, --mode <MODE>                [default: transcribe] [possible values: transcribe, stats,
                                     language]
    -o, --outputmode <OUTPUTMODE>    [default: stdout] [possible values: stdout, file]
    -V, --version                    Print version information

Sample output

Sample output generated from chapter 1 of the manga 日常.

Partial sample output of the transcription mode

  • picture 65
    • text 1
      • なのちゃん
        休みなんて
        めずらしいね
      • なのちゃん
        やすみなんて
        めずらしいね
      • nano chan
        yasuminante
        mezurashii ne
      • It's rare for Nano chan not to be present, right?
    • text 2
      • 故障かな?
      • こしょうかな?
      • koshou kana?
      • Malfunction maybe?
    • text 3
      • ちょっとやめなよー
        本人バレてないと
        思ってるんだから
      • ちょっとやめなよー
        ほんにんバレてないと
        おもってるんだから
      • chotto yamenayo-
        honnin baretenaito
        omotterundakara
      • Stop it, she doesn't know we found out yet.

Sample output of the statistics mode

Manga: 日常
Volumes: 1
Chapters: 1
Pictures: 74
Morae spoken: 880
Locations:
    street: 30 appearances, 339 morae spoken in.
    roof: 19 appearances, 130 morae spoken in.
    classroom: 17 appearances, 311 morae spoken in.
    shinonome house: 3 appearances, 73 morae spoken in.
    chimney: 2 appearances, 7 morae spoken in.
    school hallway: 1 appearances, 12 morae spoken in.
    school grounds: 1 appearances, 8 morae spoken in.
    aioi lookalike family garden: 1 appearances, 0 morae spoken in.
Character appearances:
    yukko: 33
    nano: 24
    mio: 19
    mai: 9
    kokeshi: 6
    person: 5
    akabeko: 5
    headphones guy: 4
    aioi mom lookalike: 2
    nakanojo: 2
    izumi: 1
    crow wug: 1
    hakase: 1
    mono: 1
    chissan lookalike: 1
Morae spoken:
    yukko: 309
    nano: 238
    mio: 143
    narator: 74
    izumi: 65
    headphones guy: 16
    mai: 14
    person: 11
    nakanojo: 9
    hakase: 1
Morae spoken to:
    audience: 74
    yukko: 73
    class: 65
    mio: 55
    hakase: 48
    mai: 42
    person: 17
    nakanojo: 3
Conversation pairs in morae:
    mio, yukko: 114
    audience, narator: 74
    class, izumi: 65
    hakase, nano: 48
    mai, yukko: 40
    mai, mio: 16
    nakanojo, person: 12
    person, person: 8

Partial sample output of the language mode

Kanji frequencies:
    : : 5
    : せい: 4
    : こう: 4
    : : 3
    : : 3
    今日: きょう: 3
    : あか: 3
    : : 3
    : のめ: 2
    : : 2
    : せん: 2
    : くだ: 2
    : おも: 2
    : につ: 2
    : ほん: 2
    : ちょく: 2
    : にん: 2
    : しの: 2
    : : 2
    : はら: 2
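
A ranking of this kind, with one "kanji: reading: count" entry per mapping, can be derived by counting the kmap pairs of all texts. The Python sketch below assumes that approach. kanji_frequencies is a hypothetical helper, and breaking ties by kanji is an arbitrary choice made here to keep the order deterministic.

```python
from collections import Counter


def kanji_frequencies(texts):
    """Count (kanji, reading) pairs over all kmaps, most frequent first."""
    counts = Counter()
    for text in texts:
        for kanji, reading in text.get("kmap", []):
            counts[(kanji, reading)] += 1
    # Highest count first; ties are ordered by kanji, then reading.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical texts; only the kmap field matters here.
texts = [
    {"kmap": [["今日", "きょう"], ["日", "につ"], ["直", "ちょく"]]},
    {"kmap": [["今日", "きょう"]]},
    {},  # a text without kmap contributes nothing
]
for (kanji, reading), count in kanji_frequencies(texts):
    print(f"{kanji}: {reading}: {count}")
# 今日: きょう: 2
# 日: につ: 1
# 直: ちょく: 1
```

Counting the pair rather than the kanji alone keeps different readings of the same kanji as separate entries.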

License

Copyright (C) 2022 Cody Bloemhard

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.
