In this section, we explain how our experimental framework works. A comparison with other methods is shown in Fig. 1.
Let ${\mathit{\boldsymbol{t}}}$ denote the ground truth image, ${\mathit{\boldsymbol{x}}}$ the input image, ${\mathit{\boldsymbol{y}}}$ the fake image, and ${\mathit{\boldsymbol{c}}}$ the target font label. Let $G$ be the generator; $G({\mathit{\boldsymbol{x, c}}}) \to {\mathit{\boldsymbol{y}}}$ means that ${\mathit{\boldsymbol{x}}}$ and ${\mathit{\boldsymbol{c}}}$ are combined and fed into $G$ to generate a fake image ${\mathit{\boldsymbol{y}}}$. Let $D$ be the discriminator; $D({\mathit{\boldsymbol{x}}}) \to \{ {D_{{\rm{dis}}}}({\mathit{\boldsymbol{x}}}), {D_{{\rm{cls}}}}({\mathit{\boldsymbol{x}}})\} $ means that an image ${\mathit{\boldsymbol{x}}}$ is given to the discriminator, where ${D_{{\rm{dis}}}}$ is $D$'s judgment of whether the image is real and ${D_{{\rm{cls}}}}$ is the probability distribution over target font labels that $D$ produces.
Framework steps. In our model, we take an input image ${\mathit{\boldsymbol{x}}}$ in one style and a target font label ${\mathit{\boldsymbol{c}}}$ encoded as a one-hot vector. We merge ${\mathit{\boldsymbol{x}}}$ and ${\mathit{\boldsymbol{c}}}$ into a single matrix, which is fed to $G$ to generate fake images, as sketched below. Finally, $D$ judges whether each image is real or fake and assigns it a probability distribution over font labels.
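As a concrete illustration, the following minimal sketch (ours, not the authors' code) shows one common way to merge an image with a one-hot label: the label is tiled over the spatial grid and concatenated channel-wise. The tensor shapes and the seven-font setting are illustrative assumptions.

```python
import torch

def merge_image_and_label(x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W) input images; c: (N, K) one-hot target-font labels."""
    n, _, h, w = x.shape
    # Tile each label over the spatial grid: (N, K) -> (N, K, H, W).
    c_map = c.view(n, -1, 1, 1).expand(n, c.size(1), h, w)
    # Channel-wise concatenation gives G an input with C + K channels.
    return torch.cat([x, c_map], dim=1)

x = torch.randn(8, 1, 64, 64)                 # a batch of grayscale character images
c = torch.eye(7)[torch.randint(0, 7, (8,))]   # one-hot labels for 7 target fonts
g_input = merge_image_and_label(x, c)         # shape (8, 8, 64, 64)
```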
To obtain good experimental results, we use the adversarial loss (Eqs. (1) and (2)) to ensure that the model produces high-quality images, the semantic consistency loss (Eq. (7)) to keep the content of the input and output images consistent, and the font classification loss (Eqs. (3) and (4)) to help the model generate and transform the fonts correctly.
Adversarial loss. The basic adversarial loss below ensures that the images generated by the generator can "fool" the discriminator:
$$L_{{\rm{adv}}}^{\rm{d}} = {E_{\mathit{\boldsymbol{t}}}}[\log {D_{{\rm{dis}}}}({\mathit{\boldsymbol{t}}})] + {E_{{\mathit{\boldsymbol{x, c}}}}}[\log (1 - {D_{{\rm{dis}}}}(G({\mathit{\boldsymbol{x, c}}})))]$$ (1)
Here $D$ tries to distinguish real images from the photo-realistic images generated by $G$. We use ${\mathit{\boldsymbol{t}}}$ as supervision so that $D$ maximizes this objective and thereby develops the strongest possible ability to tell real images from fake ones.
The following loss assists $G$ in producing photo-realistic images that fool the discriminator:
$$L_{{\rm{adv}}}^{\rm{g}} = {E_{{\mathit{\boldsymbol{x, c}}}}}[\log (1 - {D_{{\rm{dis}}}}(G({\mathit{\boldsymbol{x, c}}})))]$$ (2)
Here $G$ tries to minimize this objective.
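For concreteness, a minimal sketch of Eqs. (1) and (2) follows, assuming ${D_{{\rm{dis}}}}$ outputs the probability that an image is real (e.g. after a sigmoid); the callables and tensors are placeholders rather than the authors' implementation.

```python
import torch

def d_adv_loss(D_dis, G, t, x, c, eps=1e-8):
    # D maximizes Eq. (1); equivalently, it minimizes the negated objective.
    real = torch.log(D_dis(t) + eps).mean()
    fake = torch.log(1 - D_dis(G(x, c)) + eps).mean()
    return -(real + fake)

def g_adv_loss(D_dis, G, x, c, eps=1e-8):
    # G minimizes Eq. (2).
    return torch.log(1 - D_dis(G(x, c)) + eps).mean()
```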
Font classification loss. We give the ground truth images ${\mathit{\boldsymbol{t}}}$ and ground truth labels ${\mathit{\boldsymbol{c'}}}$ as supervision so that $D$ learns to classify the fonts by minimizing the objective below, where ${D_{{\rm{cls}}}}({\mathit{\boldsymbol{c'}}}|{\mathit{\boldsymbol{t}}})$ is the probability distribution over target font labels computed by $D$. The loss is defined as
$$L_{{\rm{cls}}}^{\rm{d}} = {E_{{\mathit{\boldsymbol{t, c'}}}}}[ - \log {D_{{\rm{cls}}}}({\mathit{\boldsymbol{c'|t}}})]$$ (3)
The loss function for font classification of fake images is defined as
$$L_{{\rm{cls}}}^{\rm{g}} = {E_{{\mathit{\boldsymbol{x, c}}}}}[ - \log {D_{{\rm{cls}}}}({\mathit{\boldsymbol{c}}}|G({\mathit{\boldsymbol{x, c}}}))]$$ (4)
Here $G$ tries to minimize this objective to generate fake images that will be classified as the target labels.
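Since Eqs. (3) and (4) are negative log-likelihoods, they reduce to cross-entropy on the font classifier's outputs. The sketch below illustrates this under the assumptions that ${D_{{\rm{cls}}}}$ returns logits and that labels are given as integer class indices rather than one-hot vectors.

```python
import torch.nn.functional as F

def d_cls_loss(D_cls, t, c_true_idx):
    # Eq. (3): teach D to classify real images t into their true font c'.
    return F.cross_entropy(D_cls(t), c_true_idx)

def g_cls_loss(D_cls, G, x, c, c_idx):
    # Eq. (4): push G's fakes to be classified as the target font c.
    return F.cross_entropy(D_cls(G(x, c)), c_idx)
```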
Gradient penalty loss. We use a gradient penalty [14] to obtain faster convergence and produce higher-quality photo-realistic samples. ${\nabla _{{\mathit{\boldsymbol{x'}}}}}$ denotes the gradient with respect to ${\mathit{\boldsymbol{x'}}}$, $\alpha$ is sampled uniformly from $[0, 1]$ following [14], and ${\lambda _{{\rm{gp}}}}$ weights the penalty. The loss is defined as
$$x' = \alpha t + (1 - \alpha ) x$$ (5)
$$ {\rm{G}}{{\rm{P}}_{{\mathit{\boldsymbol{x, t}}}}} = {\lambda _{{\rm{gp}}}}{(||{\nabla _{{\mathit{\boldsymbol{x'}}}}}{D_{{\rm{dis}}}}({\mathit{\boldsymbol{x'}}})|{|_2} - 1)^2} $$ (6)
Semantic consistency loss. In our model, we want the generated Chinese characters to have the same content as the given ones, so we use the L1 loss. The semantic consistency loss is defined as
$$ {L_{{\rm{feat}}}} = {E_{{\mathit{\boldsymbol{x, c}}}}}[||{\mathit{\boldsymbol{t}}} - G({\mathit{\boldsymbol{x, c}}})|{|_1}] $$ (7)
Minimizing this objective makes $G$ keep the content consistent.
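A sketch of Eqs. (5)-(7), following the usual WGAN-GP recipe of [14], is given below; the interface of ${D_{{\rm{dis}}}}$ and all tensor shapes are assumptions for illustration.

```python
import torch

def gradient_penalty(D_dis, t, x, lambda_gp=10.0):
    alpha = torch.rand(x.size(0), 1, 1, 1, device=x.device)   # Eq. (5): alpha ~ U[0, 1]
    x_hat = (alpha * t + (1 - alpha) * x).requires_grad_(True)
    out = D_dis(x_hat)
    grads = torch.autograd.grad(outputs=out, inputs=x_hat,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True)[0]
    # Eq. (6): penalize deviation of the gradient norm from 1.
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

def semantic_consistency_loss(G, t, x, c):
    # Eq. (7): L1 distance between ground truth and generated image.
    return (t - G(x, c)).abs().mean()
```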
Final optimization objective function. Combining all of the loss functions, we train with the final optimization objective
$$ \mathop {\min }\limits_G \mathop {\max }\limits_D {L_{DG}} = {L_D} + {L_G} $$ (8)
with
$$ {L_D} = {\lambda _{{\rm{adv}}}}L_{{\rm{adv}}}^{\rm{d}} + {\lambda _{{\rm{cls}}}}(L_{{\rm{cls}}}^{\rm{d}} + G{P_{{\mathit{\boldsymbol{x, t}}}}}) $$ (9)
$$ {L_G} = {\lambda _{{\rm{adv}}}}L_{{\rm{adv}}}^{\rm{g}} + {\lambda _{{\rm{cls}}}}L_{{\rm{cls}}}^{\rm{g}} + {\lambda _{{\rm{feat}}}}{L_{{\rm{feat}}}} $$ (10)
where ${\lambda _{{\rm{adv}}}}$, ${\lambda _{{\rm{cls}}}}$ and ${\lambda _{{\rm{feat}}}}$ are weights applied to the losses to obtain a better trade-off between the adversarial, classification and semantic terms.
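The two totals can be assembled as in the sketch below, which reuses the loss sketches from the previous blocks; the default weight values are illustrative assumptions, not the authors' published settings.

```python
def discriminator_loss(D_dis, D_cls, G, t, x, c, c_true_idx,
                       lambda_adv=1.0, lambda_cls=1.0):
    # Eq. (9); d_adv_loss already returns the negated Eq. (1), so minimizing
    # this total performs the max over D in Eq. (8).
    return (lambda_adv * d_adv_loss(D_dis, G, t, x, c)
            + lambda_cls * (d_cls_loss(D_cls, t, c_true_idx)
                            + gradient_penalty(D_dis, t, x)))

def generator_loss(D_dis, D_cls, G, t, x, c, c_idx,
                   lambda_adv=1.0, lambda_cls=1.0, lambda_feat=10.0):
    # Eq. (10): adversarial + classification + semantic consistency terms.
    return (lambda_adv * g_adv_loss(D_dis, G, x, c)
            + lambda_cls * g_cls_loss(D_cls, G, x, c, c_idx)
            + lambda_feat * semantic_consistency_loss(G, t, x, c))
```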
Algorithm. The algorithm of the proposed method is as follows.
Input: Source image ${\mathit{\boldsymbol{x}}}$ and target label ${\mathit{\boldsymbol{c}}}$; target image ${\mathit{\boldsymbol{t}}}$ and ground truth label ${\mathit{\boldsymbol{c'}}}$
Randomly initialize a generator $G$ and a discriminator $D$
repeat
for number of training epochs do
for each mini-batch do
//for generator
${\theta _G} \leftarrow {\theta _G} - \mu \frac{{\partial {L_{DG}}}}{{\partial {\theta _G}}}$, ${L_{DG}}$ as Eq. (8)
//for discriminator
${\theta _{{D_{{\rm{dis}}}}}} \leftarrow {\theta _{{D_{{\rm{dis}}}}}} + \mu \frac{{\partial L_{{\rm{adv}}}^{\rm{d}}}}{{\partial {\theta _{{D_{{\rm{dis}}}}}}}}$, $L_{{\rm{adv}}}^{\rm{d}}$ as Eq. (1)
${\theta _{{D_{{\rm{cls}}}}}} \leftarrow {\theta _{{D_{{\rm{cls}}}}}} - \mu \frac{{\partial L_{{\rm{cls}}}^{\rm{d}}}}{{\partial {\theta _{{D_{{\rm{cls}}}}}}}}$, $L_{{\rm{cls}}}^{\rm{d}}$ as Eq. (3)
${\theta _D} \leftarrow \{ {\theta _{{D_{{\rm{dis}}}}}}, {\theta _{{D_{{\rm{cls}}}}}}\} $
end for
end for
until convergence
$${\hat \theta _D} \leftarrow {\theta _D}, \quad {\hat \theta _G} \leftarrow {\theta _G}$$
Output: the optimized $G$ and $D$ parameterized by ${\hat \theta _G}, {\hat \theta _D}$
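The following condensed sketch shows how one pass of this algorithm might look in practice, reusing discriminator_loss and generator_loss from the sketch after Eq. (10); the optimizers, learning rate, data loader, and the assumption that $D$ exposes its two heads as D_dis and D_cls are ours, not the authors' published settings.

```python
import torch

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

for epoch in range(num_epochs):                     # assumed epoch count
    for x, t, c, c_idx, c_true_idx in loader:       # assumed data loader
        # Discriminator update (Eq. (9)).
        loss_d = discriminator_loss(D_dis, D_cls, G, t, x, c, c_true_idx)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator update (Eq. (10)).
        loss_g = generator_loss(D_dis, D_cls, G, t, x, c, c_idx)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```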
Learning to Write Multi-Stylized Chinese Characters by Generative Adversarial Networks
Abstract: With the development of Generative Adversarial Networks (GAN), more and more research has been conducted in the field of Chinese font transformation, and researchers are able to generate high-quality images of Chinese characters. These font transformation models can transform a source font into a target font using a GAN. However, current methods have two limitations: 1) the generated images are often blurry, and 2) a model can only learn and produce one target font at a time. To address these problems, we have developed a brand-new model for Chinese font transformation. First, font information is attached to the images to tell the generator which fonts we want to transform to. Then, the generator extracts and learns feature mappings through convolutional networks and generates photo-realistic images using transposed convolutional networks. The ground truth images are used as supervisory information to ensure that the generated characters and fonts are consistent with themselves. This model needs to be trained only once, yet it is able to transform one font into multiple fonts and to produce new fonts. Extensive experiments on seven Chinese font datasets show the superiority of the proposed method over several other methods in Chinese font transformation.
[1] SUN Dan-yang, REN Tong-zheng, LI Chong-xuan, et al. Learning to write stylized Chinese characters by reading a handful of examples[C]//Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Stockholm, Sweden: [s.n.], 2018: 920-927.
[2] ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks[C]//IEEE Conference on Computer Vision & Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5967-5976.
[3] CHOI Yun-jey, CHOI Min-je, KIM Mun-young, et al. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation[C]//IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2018: 8789-8797.
[4] XU Song-hua, JIN Tao, JIANG Hao, et al. Automatic generation of personal Chinese handwriting by capturing the characteristics of personal handwriting[C]//Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence. Pasadena, California, USA: [s.n.], 2009.
[5] XU Song-hua, JIANG Hao, JIN Tao, et al. Automatic generation of Chinese calligraphic writings with style imitation[J]. IEEE Intelligent Systems, 2009, 4(2): 44-53.
[6] MIRZA M, OSINDERO S. Conditional generative adversarial nets[EB/OL]. [2014-11-06]. https://arxiv.org/abs/1411.1784.
[7] SUN Han-fei, LUO Yi-ming, LU Zhang. Unsupervised typography transfer[EB/OL]. [2018-02-07]. https://arxiv.org/abs/1802.02595.
[8] CHANG Jie, GU Yu-jun, ZHANG Ya. Chinese typography transfer[EB/OL]. [2017-07-16]. https://arxiv.org/abs/1707.04904.
[9] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[EB/OL]. [2014-06-04]. https://arxiv.org/abs/1406.2661v1.
[10] ZHU Jun-yan, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]//IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2242-2251.
[11] KIM T, CHA M, KIM H, et al. Learning to discover cross-domain relations with generative adversarial networks[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: [s.n.], 2017: 1857-1865.
[12] TZENG E, HOFFMAN J, SAENKO K, et al. Adversarial discriminative domain adaptation[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2962-2971.
[13] HE Kai-ming, ZHANG Xiang-yu, REN Shao-qing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.