立ち浪に真向き兎 ('frontwards facing rabbit in a standing wave')
This repository contains kamon (Japanese family crest) data from three sources:

- The Edo-period Ansei Bukan (安政武鑑, Armory of the Ansei Reign Years) from the Center for Open Data in the Humanities.
- Various open-source images from Wikimedia (see here for licensing details).
- Images from https://github.com/Rebolforces/kamondataset. Since we are uncertain about the copyright status of these data, we do not provide these images directly. Instead, please navigate to that site, download the tarball, and install all the images in the subdirectory `train` directly under `data/mon-white-224` here.
All of these are "blazoned" with descriptions following the standard methods for describing kamon. With the exception of the Wikimedia data, which already came with descriptions, all examples were blazoned by hand.
Total dataset size is 7,410 images paired with descriptions. The data includes (machine-generated) dependency parses and English translations for all descriptions.
To install the required packages:

```
pip3 install -r requirements.txt
```
`kamon_dataset.py` contains a wrapper that presents the data as a `torch.utils.data.Dataset`. For example, the following loads the validation set. Each entry maps from a tensor representing the image to a sequence of vocabulary items corresponding to the phrase describing the crest:
```python
val = kamon_dataset.KamonDataset(division="val", one_hot=False)
val[0]
```

```
(tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]],

         [[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]],

         [[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]]]), [49, 1366, 1252])
```
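As a minimal sketch (not part of the repository), assuming each item is an (image tensor, token-id list) pair as shown above, the dataset can be batched with a standard PyTorch `DataLoader` and a padding collate function; the padding id used here is a placeholder and may not match the dataset's actual convention:

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

import kamon_dataset

# Hypothetical padding id; the dataset's real padding convention may differ.
PAD_ID = 0

def collate(batch):
    # Each item is (image tensor, list of vocabulary ids) as in the example above.
    images = torch.stack([img for img, _ in batch])
    tokens = pad_sequence(
        [torch.tensor(ids, dtype=torch.long) for _, ids in batch],
        batch_first=True, padding_value=PAD_ID)
    return images, tokens

val = kamon_dataset.KamonDataset(division="val", one_hot=False)
loader = DataLoader(val, batch_size=8, shuffle=False, collate_fn=collate)

images, tokens = next(iter(loader))
print(images.shape, tokens.shape)
```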
The dependency parses of the crest descriptions and their English translations, found in `index_parsed_claude_all.jsonl` and `index_parsed_claude_all_translated_claude.jsonl` respectively, were produced using Claude 3.5 Sonnet.
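As a sketch, assuming these files are standard JSON Lines (one JSON object per line; the exact field names are not documented here), the parsed descriptions can be inspected like this:

```python
import json

# Read one record per line from the (assumed) JSON Lines file.
with open("index_parsed_claude_all.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))
print(records[0])  # Inspect the first record to see the available fields.
```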
Kamon are theoretically open-ended since one can create new designs by combining existing motifs (or even creating new motifs) in new ways. Some of the ways of constructing new crests can be described by simple grammatical rules. Here we provide a simple grammar-based generator to create synthetic examples (some more plausible than others):
```
python synthetic_examples.py --num=10
```

See the examples in the `synthetic` subdirectory, for example:
月輪に覗き尻合わせ三つ紅葉 ('Peeking bottoms-together three maple leaves in a moon ring')
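To illustrate the idea only (this is a toy sketch, not the grammar used by `synthetic_examples.py`; the fragments and rules below are invented for illustration), a tiny generator might combine a container, an arrangement, and a motif with simple rewrite rules:

```python
import random

# Toy fragments; the repository's actual grammar and vocabulary are richer.
CONTAINERS = ["丸に", "月輪に", "亀甲に", ""]                 # ring, moon ring, hexagon, none
ARRANGEMENTS = ["三つ", "頭合わせ三つ", "尻合わせ三つ", ""]   # three, heads-together three, bottoms-together three, single
MOTIFS = ["紅葉", "柏", "桔梗", "兎"]                         # maple leaf, oak, bellflower, rabbit

def generate():
    # Descriptions largely proceed from the outside inwards:
    # container first, then arrangement, then motif.
    return random.choice(CONTAINERS) + random.choice(ARRANGEMENTS) + random.choice(MOTIFS)

if __name__ == "__main__":
    for _ in range(10):
        print(generate())
```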
One challenge is to generate the description of a crest given an image of that crest. Vision models are not particularly well tuned for this sort of data, and there are some important differences between scene-to-text and this problem. The motifs in kamon are usually highly stylized, so that recognizing, say, a wave requires knowing what a typical kamon stylization of a wave looks like. Motifs may be modified and arranged in various ways, and while these modifications and arrangements are quite restricted, they also often require some amount of reasoning. For example, a motif such as a plant leaf may be arranged three in a circle, with either the heads pointed to the center (頭合わせ) or the bottoms pointed to the center (尻合わせ). But *head* here means the top of the motif as it would normally be displayed, and *bottom* the reverse. This requires knowing for each motif what the typical display arrangement is, which is not obvious from the geometry of the motif. This, coupled with the fact that the dataset for kamon is small, makes crest-to-text conversion challenging.
A baseline model using VGG is provided. The architecture is given schematically below:

The input image is replicated N times, where N is the maximum token length of the output text. The image is optionally masked with a position-specific mask, then passed to a VGG model that is shared across positions. The final layer of the VGG model is removed so that the penultimate layer can be used for features. The VGG model is trainable by default. The current VGG features, along with the `--ngram_length - 1` previous VGG features and the `--ngram_length - 1` previous logits, are the input features for predicting the logits at the current position. The intuition behind the masking is that since the descriptions of the crests largely proceed from the outside inwards, the model should focus on different parts of the image at different times; thus, when considering the first output term, it might learn to mask out the parts of the image that are usually less relevant for predicting that term. For example, many crests are surrounded by some sort of ring, hexagon, or other container, and this is described first. For that description, the motifs inside the container are irrelevant. In practice, it should be noted that the masking does not (yet) seem to make much difference.
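The following is a minimal sketch of that description, not the repository's actual implementation: the class name, the sigmoid-gated learnable masks, and the use of untrained VGG16 weights are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision

class KamonBaselineSketch(nn.Module):
    """Sketch of the described baseline: a shared VGG trunk applied once per
    output position, with an optional learnable position-specific mask and
    n-gram conditioning on previous features and logits."""

    def __init__(self, vocab_size, max_len, ngram_length=2,
                 image_size=224, use_mask=True):
        super().__init__()
        self.max_len = max_len
        self.ngram_length = ngram_length
        self.use_mask = use_mask
        self.vocab_size = vocab_size
        self.feat_dim = 4096

        # VGG16 with its final classification layer removed, so the
        # penultimate 4096-dim layer serves as the feature extractor.
        vgg = torchvision.models.vgg16(weights=None)  # pretrained weights omitted here
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.vgg = vgg

        if use_mask:
            # One learnable mask per output position, broadcast over channels.
            self.masks = nn.Parameter(torch.ones(max_len, 1, image_size, image_size))

        # Predict logits from the current feature, the (ngram_length - 1)
        # previous features, and the (ngram_length - 1) previous logits.
        in_dim = ngram_length * self.feat_dim + (ngram_length - 1) * vocab_size
        self.head = nn.Linear(in_dim, vocab_size)

    def forward(self, images):
        batch = images.shape[0]
        prev_feats = [torch.zeros(batch, self.feat_dim, device=images.device)
                      for _ in range(self.ngram_length - 1)]
        prev_logits = [torch.zeros(batch, self.vocab_size, device=images.device)
                       for _ in range(self.ngram_length - 1)]
        outputs = []
        for pos in range(self.max_len):
            # Position-specific mask (sigmoid gating here is an assumption).
            x = images * torch.sigmoid(self.masks[pos]) if self.use_mask else images
            feat = self.vgg(x)
            context = torch.cat(prev_feats + [feat] + prev_logits, dim=-1)
            logits = self.head(context)
            outputs.append(logits)
            prev_feats = (prev_feats + [feat])[1:]
            prev_logits = (prev_logits + [logits])[1:]
        return torch.stack(outputs, dim=1)  # (batch, max_len, vocab_size)

model = KamonBaselineSketch(vocab_size=2000, max_len=8)
logits = model(torch.rand(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 8, 2000])
```

Sharing one VGG trunk across positions keeps the parameter count independent of the output length; in this sketch only the masks are position-specific, matching the description above.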
A training script (with masking on) can be found in `train.sh` and an inference script in `test.sh`.
Decoding output on the test set from one training run can be seen here. Note that this training and evaluation omit the Edo-period data.
A script for generating an HTML page visualizing the inference output can be found in `visualize_outputs.py`.