Abstract of doctoral dissertation Computer science: Enhancing performance of mathematical expression detection in scientific document images

Shared by: Minh Tú | Date: | File type: PDF | Pages: 27


The thesis addresses the following tasks. First, it extensively analyzes a wide range of existing approaches to mathematical expression (ME) detection in scientific document images. Then, it investigates and proposes novel methods to improve the detection accuracy of MEs. Finally, it investigates and proposes a framework to improve the accuracy of ME recognition in scientific document images.


MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF SCIENCE AND TECHNOLOGY

BUI HAI PHONG

ENHANCING PERFORMANCE OF MATHEMATICAL EXPRESSION DETECTION IN SCIENTIFIC DOCUMENT IMAGES

Major: Computer Science
Code: 9480101

ABSTRACT OF DOCTORAL DISSERTATION COMPUTER SCIENCE

Hanoi - 2021
This study is completed at: Hanoi University of Science and Technology

Supervisors:
1. Assoc. Prof. Hoang Manh Thang
2. Assoc. Prof. Le Thi Lan

Reviewer 1:
Reviewer 2:
Reviewer 3:

This dissertation will be defended before the approval committee at Hanoi University of Science and Technology:
Time , date month year 2021

This dissertation can be found at:
1. Ta Quang Buu Library - Hanoi University of Science and Technology
2. Vietnam National Library
INTRODUCTION

Motivation

A huge number of scientific documents have been produced to date, and they provide valuable information for the research community. These documents need to be digitized so that users can retrieve information efficiently. Most recent documents are published in PDF format, but a large number are still available only in raster format. PDF processing techniques cannot be applied to such raster document images, so image processing is needed to digitize them.

The key steps of document digitization are document analysis, optical character recognition and content searching [2]. The digitization of standard text-rich documents is considered a solved problem. However, the digitization of scientific documents that contain many MEs is a non-trivial task. Scientific documents usually consist of heterogeneous components: tables, figures, texts and MEs. MEs may be mixed with various components, and their sizes and styles may vary frequently. Therefore, improving the accuracy of the detection and recognition of MEs is an important step in the digitization of scientific documents. Motivated by these observations, the thesis aims to improve the accuracy of detection and recognition of MEs in scientific document images.

Introduction of ME detection and recognition in document images

In mathematics, an expression or mathematical expression (ME) is a finite combination of symbols that is well-formed according to rules that depend on the context [5]. In scientific documents, MEs are classified into two categories: isolated (displayed) and inline (embedded) expressions. Isolated expressions are displayed on separate lines, whereas inline expressions are mixed with other components in document pages, e.g. texts and figures. The detection of expressions aims to locate MEs in document images.
The recognition of MEs, in turn, aims to convert expressions from image format to a string (represented in LaTeX). An example of ME detection and recognition is illustrated in Figure 1. The detection and recognition of MEs in document images are closely related: accurate detection is a prerequisite for accurate recognition, whereas incorrect detection may cause errors in the recognition of MEs.

Figure 1. Example of the detection (a) and a detected ME in a document image (b). Isolated and inline MEs are denoted in red and blue, respectively. The extracted ME is recognized and represented in LaTeX (c).

The thesis makes the following assumptions: (1) The thesis focuses on the detection and recognition of MEs in scientific document images that are written in a formal way. It aims to detect MEs in the body of documents; the detection of MEs contained in other document components such as tables and figures is addressed by other problems (table or figure detection). Moreover, the size of an ME should not exceed the size of the whole document. (2) Scientific documents can be generated in various ways: camera-captured images, handwritten documents, scanned pages or PDF conversion. Moreover, the detection accuracy highly depends on the quality of the documents. Like conventional methods in document analysis, the thesis focuses on the detection of MEs in document images that are scanned at high resolution and are non-skew. (3) Detected MEs are represented by bounding boxes. The detected MEs are then recognized and represented in LaTeX format [4].

The main challenges of ME recognition are as follows: (1) Accurate recognition of a large number of mathematical symbols is a difficult task. (2) Some symbols in MEs may play different roles in different contexts. (3) Operator symbols can be explicit or implicit. When consecutive operator symbols exist in an expression, operator precedence rules can be applied to group the symbols into units. (4) Mathematical notation has many dialects. As with natural languages, it is impossible to design a system that can recognize all dialects. As a result, our systems are developed for a subset of mathematical notation only.

Contributions

The main scientific contributions of the thesis are threefold: (1) First, a two-stage hybrid method is proposed for the effective detection of MEs. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. The merit of the method is that it operates directly on ME images without employing character recognition. (2) Second, an end-to-end framework for mathematical expression detection in scientific document images is proposed that, unlike conventional methods, uses no Optical Character Recognition (OCR) or document analysis techniques.
The distance transform is first applied to input document images in order to take advantage of the distinctive spatial layout features of MEs. The transformed images are then fed into the Faster Region-based Convolutional Neural Network (Faster R-CNN), which has been optimized to improve the detection accuracy. (3) Finally, the detection and recognition of MEs are integrated into one system. The MEs in document images are detected and recognized, and the recognition results are represented in LaTeX.

Thesis structure

The Introduction first presents the basic information and definitions of ME detection and recognition. Then, the scope of the thesis is presented and its main contributions are summarized. In Chapter 1, significant related works on the detection and recognition of MEs are reviewed; the contributions of the thesis are proposed on the basis of their limitations. Chapter 2 presents ME detection using the fusion of handcrafted and deep learning features. Chapter 3 presents ME detection using the combination of the Distance Transform (DT) of images and the Faster R-CNN; the framework achieves high detection accuracy in an end-to-end way. Chapter 4 presents the system of ME detection and recognition. The Conclusion gives the conclusions and future work of the thesis.

CHAPTER 1
LITERATURE REVIEW

In this chapter, significant works on the detection and recognition of MEs in document images are analysed.

1.1 Document analysis

Traditional approaches to ME detection in document images normally consist of two steps [10, 14]: document analysis and ME detection. The first step focuses on obtaining the text lines and words of text paragraphs, whereas the second focuses on separating MEs from normal text. Document layout analysis can be defined as the task of segmenting a given document into semantically meaningful regions. Page segmentation, a well-researched topic of document analysis, aims to identify regions in documents and classify them into physical components such as tables, figures and texts.
Page segmentation remains an active research topic and has attracted increasing attention. First, image preprocessing (noise removal and skew correction) is performed. Then, each component (e.g. text, figure or table) is separated based on its structural layout. Traditional page segmentation techniques can be divided into four types: top-down, bottom-up, multi-scale resolution and hybrid methods. In recent years, deep learning approaches have also been used for page segmentation. Their advantage is that page segmentation is performed without prior knowledge of the document structure.
Table 1.1 Summary of significant handcrafted features for isolated ME detection

Feature of text line | Description
Density [14] | The density of black pixels in a text line
Height/Width of text line [19] | Ratio of height of a text line to the document
Left and right indent [14, 20] | Left and right indent of text lines
Variation of centers of characters [14] | Variation of centers of characters in the text line
Below and above space [23] | Space between the text line and adjacent text lines

1.2 ME detection methods in document images

Various approaches to ME detection have been proposed. They can be divided into three categories: rule-based, handcrafted feature extraction and deep neural network (DNN) methods.

1.2.1 Rule-based detection

Early research on ME detection relied on different rules. The rules are normally derived from the differences in layout and morphology between MEs and text. Many heuristic rules and predefined thresholds have been proposed for the detection. In general, these early methods were tested on small private datasets. They can detect MEs in some specific cases, but many errors occur when detecting MEs in documents with complex layouts.

1.2.2 Handcrafted feature extraction methods for the ME detection

Handcrafted feature extraction methods design a set of features for ME detection. Table 1.1 summarizes features that have been designed for isolated ME detection [9, 23, 14], and Table 1.2 summarizes features that have been designed for inline ME detection. After the feature extraction, various machine learning classifiers such as k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) have been tuned to detect MEs.
Table 1.2 Summary of significant handcrafted features for inline ME detection

Feature of word | Description
Special symbol [15] | Whether the word contains special symbols or not
Density [14] | The density of black pixels in a word
Height/Width of word [14] | Ratio of height of a word to the whole document
Variation of centers of characters [14] | Variation of centers of characters in a word
Space between characters [23] | Inner space between characters in a word

1.2.3 Deep neural network for ME detection

In recent years, DNNs have shown outstanding performance in the recognition and detection of mathematical expressions. The work in [21] takes advantage of CNNs to detect isolated and inline expressions in document images. A CNN architecture based on the U-net is used for detecting mathematical expressions, and post-processing is performed to obtain accurate expressions. For the CNN, training on diverse datasets can improve the detection accuracy. Moreover, the detection accuracy depends on the size of the image blocks used to train the CNN. The precision and recall of ME detection achieved by the method are 95.2% and 91%, respectively. The limitation of the method is that, although mathematical symbols are detected with high accuracy, the layout analysis of symbols needed to construct complete expressions has not been solved. The works in [22] have applied the SSD-512 and YOLO v3 neural networks to ME detection.

1.3 ME recognition

1.3.1 Traditional approaches for ME recognition

Expression recognition has been researched since the 1960s. In the literature, various approaches to ME recognition have been proposed based on three steps, which are also the basis of the survey in [1]: (1) symbol segmentation, (2) symbol recognition and (3) structure analysis. Symbol segmentation techniques are based on the analysis of connected components or of the projection profiles of images. Existing segmentation techniques have difficulties with complex symbols (e.g. fraction, sum function) and touching symbols (e.g. exponential function). Symbol recognition has been developed using handcrafted feature extraction and various classifiers.

In summary, traditional approaches to ME recognition have extensively investigated three stages: symbol segmentation, symbol recognition and structure analysis. The main drawbacks of such methods are as follows: (1) The accuracy of ME recognition is still low. Errors in segmentation, recognition and structural analysis may cause errors in the final recognition. (2) Much human effort is required for each stage. In particular, the computation in the structural analysis of symbols is complex. (3) Recognition algorithms have been designed for specific ME datasets, so they are difficult to evaluate across datasets.

1.3.2 Neural network approaches for ME recognition

The work in [24] applied the combination of a CNN and an RNN to recognize isolated MEs in an end-to-end way.
The CNN was trained on isolated images captured by cameras. Recently, a combined CNN and RNN model [25] has been designed for handwritten expression recognition. So far, much research has investigated the recognition of segmented MEs, whereas the recognition of MEs embedded in document images still needs to be investigated. The work in [3] proposed a neural network based on scale paired adversarial learning for ME recognition. In recent years, several DNNs based on the encoder-decoder architecture have shown outstanding performance in the ME recognition task [25, 17, 6]. These DNNs are designed to address the challenge of recognizing the complicated two-dimensional structures of MEs.

1.4 Datasets and evaluation metrics

1.4.1 Datasets

In the literature, most methods have been evaluated on private datasets (e.g. [20, 15]) that have not been released to the research community. Therefore, it is difficult to compare the performance of various ME detection methods. In the thesis, two recently published public datasets are used for the performance evaluation of ME detection: Marmot [13] and GTDB [21].

Table 1.3 Statistics of the Marmot and GTDB datasets

Statistic | GTDB (training / testing) | Marmot (training / testing)
Number of pages | 569 / 236 | 330 / 70
Number of isolated expressions | 4218 / 2488 | 1322 / 253
Number of inline expressions | 22178 / 9397 | 6951 / 956
Number of text fonts | 30 | 18
Average number of MEs per page | 47.55 | 23.70

The Marmot dataset consists of 400 non-skew scientific document pages with 1575 isolated and 7907 inline expressions. The resolution of each page image is around 500 dpi. The GTDB dataset has recently been used for performance evaluation in [22]. It consists of diverse font and mathematical symbol styles. The training and testing sets are described in Table 1.3.

1.4.2 Evaluation metrics

To evaluate the performance of ME detection, the Precision (P), Recall (R) and F1 score metrics have been applied. Precision is the proportion of true positives among all positive results; Recall is the proportion of true positives among all ground-truth positives; and the F1 score is the harmonic mean of precision and recall. Precision and Recall are widely used; however, to obtain an in-depth analysis of ME detection, the Intersection over Union (IoU) metric has also been applied in the thesis. The Word error rate (WER) and Expression recognition rate (ExpRate) metrics [25] are used to evaluate the accuracy of ME recognition. ExpRate is the proportion of recognized MEs whose LaTeX strings match the ground truth. WER is based on the number of actions (deletion, substitution or insertion) that must be performed to obtain the correct strings.
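The metrics above can be illustrated with a minimal Python sketch. It is not the evaluation code of the thesis: the box format `(x1, y1, x2, y2)`, the function names, and the normalisation of WER by the number of reference tokens are assumptions made for illustration only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) (assumed format)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and their harmonic mean (F1) from detection counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def edit_distance(ref, hyp):
    """Minimum number of deletions, substitutions and insertions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by the reference length (assumed normalisation)."""
    if not ref_tokens:
        return 0.0
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

For example, a predicted LaTeX token list `["x", "2"]` compared with the reference `["x", "^", "2"]` needs one insertion, which gives a WER of 1/3 under the assumed normalisation.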
CHAPTER 2
Detection of MEs using the late fusion of handcrafted and deep learning features

2.1 Introduction

Scientific document images usually consist of heterogeneous components (e.g., figures, tables, text and MEs). Conventional approaches have applied page segmentation and handcrafted feature extraction techniques to ME detection in document images, and they have faced many difficulties in the detection of inline MEs. Therefore, in this chapter, a two-stage hybrid method is proposed for the effective detection of mathematical expressions. First, the layout analysis of entire document images is introduced to improve the accuracy of text line and word segmentation. Then, both isolated and inline expressions in document images are detected. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy.

The proposed system for ME detection is illustrated in Figure 2.1. It takes a binary document image as input and outputs an image with the position information of detected MEs. As in other document analysis and expression detection methods, the input of the proposed method is a non-skew document image. The algorithm can handle camera-captured and scanned document images. After pre-processing, the document is analyzed to obtain text lines for isolated expression detection. Text lines that are not isolated expressions are segmented into words for inline expression detection. After the segmentation, the late fusion of handcrafted and deep learning features is applied in the isolated and inline expression detection modules. Finally, post-processing is performed to obtain accurate position information of MEs in document images.

Figure 2.1 Overall description of the proposed system for mathematical expression detection.

2.2 Page segmentation

In this section, the projection profiles of images are estimated recursively to analyze the structure of documents [8]. The horizontal and vertical projection profiles of an image are the horizontal and vertical distributions of black pixels, respectively. Thus, the technique is useful for the analysis of scanned documents.

2.3 Handcrafted feature extraction for ME detection

Figure 2.2 The flowchart of the isolated and inline expression detection by using handcrafted feature extraction

The flowchart of the isolated and inline expression classification is described in Figure 2.2. In the handcrafted feature extraction approach, powerful feature extraction and classifiers are applied to improve the accuracy of the classification of both isolated and inline expressions.

2.3.1 Handcrafted feature extraction for isolated ME detection

For isolated expression detection, a text line image is represented in the frequency domain by using the Fourier transform. Given an image a of size M × N and its Discrete Fourier Transform (DFT) A(Ω, ψ), the DFT [7] is defined as follows:

$$A(\Omega, \psi) = \sum_{m=1}^{M} \sum_{n=1}^{N} a(m, n)\, e^{-j(\Omega m + \psi n)} \qquad (2.1)$$

The Fast Fourier Transform (FFT) [11] is used to transform input document images to the frequency domain efficiently. For the detection of isolated MEs, the FFT phase and magnitude are used as features. After that, SVM, kNN, Decision tree and Random Forest (RF) classifiers are optimized. These are popular machine learning models for binary classification.

2.3.2 Handcrafted feature extraction for inline ME detection

To determine whether a word extracted from a text line is an inline ME (variable, operator, function) or a textual word, a binary classification method is proposed. A key step in the classification is to extract the dominant features of the observed words. The important feature used to discriminate inline MEs from textual words is the italic font style: in scientific documents, inline MEs are typically set in italic font. For feature extraction, the vertical projection profile (VPP) and horizontal projection profile (HPP) of each variable or textual word image are first computed. Then, the peaks and valleys (troughs) of the VPP and HPP are determined. Mathematically, peaks and valleys are local maxima and minima, respectively. The feature extraction method is based on the Gaussian distribution of the peaks and valleys. The feature vector is formed from features of the projection profiles of variable and textual word images. For each image, these features are as follows: (1) The number of peaks in the VPP and HPP. (2) The mean (average) of the peak values in the VPP and HPP. (3) The standard deviation of the peak values in the VPP and HPP. (4) The number of valleys in the VPP and HPP. (5) The mean (average) of the valley values in the VPP and HPP. (6) The standard deviation of the valley values in the VPP and HPP. For an image of size m × n, the complexity of the feature extraction is O(m) and O(n) for the VPP and HPP, respectively, because the feature extraction is performed by finding and analyzing the peaks and valleys in the VPP and HPP of the image. After the feature extraction, different machine learning algorithms are used to discriminate variables from textual words.
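The six projection-profile features listed in Section 2.3.2 can be sketched as follows. This is an illustrative sketch rather than the thesis implementation: it assumes a word image stored as a list of rows with 1 for a black pixel, treats only strict interior local extrema as peaks and valleys, and uses hypothetical function names.

```python
from statistics import mean, pstdev


def projection_profiles(image):
    """image: list of rows, 1 = black (ink) pixel, 0 = background (assumed encoding)."""
    hpp = [sum(row) for row in image]        # black pixels per row
    vpp = [sum(col) for col in zip(*image)]  # black pixels per column
    return hpp, vpp


def peaks_and_valleys(profile):
    """Values of the strict local maxima and minima at interior positions."""
    peaks, valleys = [], []
    for i in range(1, len(profile) - 1):
        left, mid, right = profile[i - 1], profile[i], profile[i + 1]
        if mid > left and mid > right:
            peaks.append(mid)
        elif mid < left and mid < right:
            valleys.append(mid)
    return peaks, valleys


def profile_features(profile):
    """Count, mean and standard deviation of the peak values, then of the valley values."""
    features = []
    for values in peaks_and_valleys(profile):
        features.append(len(values))
        features.append(mean(values) if values else 0.0)
        features.append(pstdev(values) if values else 0.0)
    return features


def word_feature_vector(image):
    """12 values: the six features for the VPP followed by the six for the HPP."""
    hpp, vpp = projection_profiles(image)
    return profile_features(vpp) + profile_features(hpp)
```

The resulting vector could then be passed to any of the classifiers named in the text (SVM, kNN, Decision tree or RF); which library to use for that step is left open here.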
In this section, different classification models are applied: SVM, kNN, Decision tree and RF. For these classifiers, parameter tuning plays an important role in achieving high performance. Therefore, different parameters of each classifier are considered in order to determine the optimal values for the classification.

2.4 Deep learning method for ME detection

To improve the detection accuracy of both isolated and inline expressions, transfer learning is applied to AlexNet and ResNet-18 [18], two popular neural networks. Compared with AlexNet, the architecture of ResNet-18 is deeper, and ResNet-18 normally shows better results in classification tasks.

Figure 2.3 illustrates the flowchart of the transfer learning of CNNs for the isolated and inline expression detection module. The dominant features are automatically extracted by the network without any domain-specific knowledge. Then, the classification is performed by the softmax layer of the network.

Figure 2.3 The flowchart of the isolated and inline expression classification by using the transfer learning of CNNs

2.5 Fusion of handcrafted and deep learning features for ME detection

In recent years, the fusion of multiple modalities has shown better performance than a single modality in classification tasks, and fusion techniques have been proposed to improve object classification. In this work, the confidence scores obtained from handcrafted features with RF and from CNN features with softmax are combined by score-based fusion; the min, max, average and product operators have been applied. The flowchart of the fusion is described in Figure 2.4. The fused scores are used to classify expressions and texts.
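The score-based late fusion of Section 2.5 can be sketched in a few lines. The sketch assumes that each classifier returns one confidence score per class in a dictionary; the class labels `"ME"` and `"text"` and the function names are hypothetical.

```python
from math import prod

# The four fusion operators named in the text, applied to a list of scores.
FUSION_OPERATORS = {
    "min": min,
    "max": max,
    "average": lambda scores: sum(scores) / len(scores),
    "product": prod,
}


def fuse_scores(model_scores, operator="product"):
    """model_scores: one {class_label: confidence} dict per classifier (assumed layout)."""
    combine = FUSION_OPERATORS[operator]
    labels = model_scores[0].keys()
    return {label: combine([scores[label] for scores in model_scores])
            for label in labels}


def classify(model_scores, operator="product"):
    """Return the class label with the highest fused score."""
    fused = fuse_scores(model_scores, operator)
    return max(fused, key=fused.get)


# Hypothetical scores from the RF (handcrafted features) and a CNN softmax layer.
rf_scores = {"ME": 0.4, "text": 0.6}
cnn_scores = {"ME": 0.9, "text": 0.1}
label = classify([rf_scores, cnn_scores], operator="product")
```

With these illustrative scores the RF alone would favour "text", while the product of the two scores favours "ME", which shows how a confident second model can correct a weak first one.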
Figure 2.4 The flowchart of the late fusion of handcrafted and deep learning features in the classification of isolated and inline MEs.

2.6 Post-processing for ME detection

In the detection of MEs, it is not rare for a large isolated expression to be split into several text lines. The merging strategies rely on the results of character recognition to determine the conditions for merging successive text lines into one expression. Figure 2.5 shows an example of the post-processing.
Figure 2.5 Example of the post-processing for ME detection: (a) before post-processing, (b) after post-processing.

2.7 Experimental results

Table 2.1 Performance comparison between the proposed and existing methods of isolated expression detection on the Marmot dataset (highest scores are in bold)

Method | Detected: Correct | Detected: Partial | Detected: Total | Error: Missed | Error: False | Error: Total
Method in [14] | 26.87% | 44.89% | 71.76% | 9.89% | 18.35% | 28.24%
Proposed methods: | | | | | |
FFT and RF | 31.02% | 42.32% | 73.34% | 9.04% | 17.62% | 26.66%
Using AlexNet | 47.22% | 41.44% | 88.66% | 2.78% | 8.56% | 11.34%
Using ResNet-18 | 50.89% | 39.27% | 90.16% | 3.55% | 6.29% | 9.84%
Average operator | 51.34% | 39.45% | 90.79% | 3.55% | 5.66% | 9.21%
Product operator | 51.34% | 39.84% | 91.18% | 3.14% | 5.68% | 8.82%

Table 2.2 Performance comparison between the proposed and existing methods of inline expression detection on the Marmot dataset (highest scores are in bold)

Method | Detected: Correct | Detected: Partial | Detected: Total | Error: Missed | Error: False | Error: Total
Method in [14] | 1.74% | 28.87% | 30.61% | 9.93% | 59.46% | 69.39%
Proposed methods: | | | | | |
Projection profile and RF | 11.05% | 41.40% | 52.45% | 8.36% | 39.19% | 47.55%
Using AlexNet | 21.54% | 56.25% | 77.79% | 7.60% | 14.61% | 22.21%
Using ResNet-18 | 22.68% | 57.06% | 79.74% | 5.59% | 14.67% | 20.26%
Average operator | 22.79% | 57.96% | 79.85% | 5.79% | 14.36% | 20.15%
Product operator | 22.90% | 58.45% | 81.35% | 5.40% | 13.25% | 18.65%

Figure 2.6 Examples of the expression detection in a sample page in the GTDB dataset. The detected and ground-truth expressions are marked in blue and red, respectively.

The performance comparison between the proposed and conventional methods of isolated and inline expression detection on the Marmot dataset is shown in Tables 2.1 and 2.2, respectively. The proposed system outperforms the conventional method owing to its effective document analysis strategies and novel classification techniques. In particular, the transfer learning of CNNs obtains the highest accuracy among the single-model methods because the CNNs extract more visual features from images than the other methods. The method in [14] focuses on extracting features of the bounding boxes of characters in word images. It is not effective for the detection of inline expressions because there is little variation in the visual appearance of inline expressions. The method using FFT and projection profiles of images obtains higher accuracy than the method in [14] because it can extract two-dimensional layout features of MEs. Table 2.2 clearly shows that the detection accuracy of inline expressions is much improved by the transfer learning of CNNs. The performance of the method using the transfer learning of ResNet-18 is slightly higher than that of AlexNet. This improvement is obtained because the deeper architecture of ResNet-18 extracts visual features better than that of AlexNet. The fusion of RF and ResNet-18 obtains the highest performance in isolated expression detection because the predicted scores of the two models are aggregated for the final classification, which reduces misclassification.

2.8 Summary of the chapter

The chapter has presented a fusion approach that detects both isolated and inline MEs in document images. The improvements in the page segmentation and in the classification of MEs and texts are combined to improve the performance of the overall detection system. The main results of this chapter have been published in publications 6 and 7.

CHAPTER 3
The detection of MEs by using the combination of the Distance Transform and Faster R-CNN

3.1 Overview of the proposed method for ME detection using the DT and the Faster R-CNN

The previous chapter presented an ME detection method that consists of multiple steps.
In this chapter, the use of the DT and the optimization of the anchor boxes of the Faster R-CNN are proposed to detect MEs. Compared with the multi-step method, the proposed method improves the accuracy of ME detection in an end-to-end way in which human effort is reduced. The proposed method for the detection of isolated and inline MEs is described in Figure 3.1. Figure 3.2 describes the components of the Faster R-CNN.

Figure 3.1 Flowchart of the proposed method for the ME detection using the DT and the Faster R-CNN. The detected isolated and inline MEs are denoted in blue and black, respectively.

Figure 3.2 The Faster R-CNN based on ResNet-50 in this study consists of a Region Proposal Network (RPN) and fully connected detection sub-networks. The detected isolated and inline MEs are marked in blue and black boxes, respectively.

First, the input binary document images are transformed to gray images by using the distance transform to enhance the difference between the spatial layout of MEs and the background. Then, the transformed images are fed into the Faster R-CNN for ME detection. To improve the accuracy of ME detection, the anchor boxes generated by the RPN are optimized. The detection of isolated MEs is performed on the transformed images. Window masks are used to mark the regions of detected isolated MEs. Then, the detection of inline MEs is performed on the non-isolated ME regions of the document images.

3.2 The detection of MEs using the DT and the Faster R-CNN

3.2.1 Distance transform of document image

The input document image is transformed to enhance the features of the ME regions. The transformation of document images allows a more accurate detection of MEs by the Faster R-CNN because the model was initially designed for object detection in natural images. In this study, the DT [16] is applied to document images. The greyscale image is then converted to an RGB one; the DT is applied to each channel of the RGB image. For a binary image of size m × n, the complexity of the DT of the image is O(m × n). The DT calculates the distance between each pixel of an ME and its 8 neighbor pixels, so the complexity of the algorithm depends on the number of pixels of the input image.

Figure 3.3 Page image after RGB conversion of a binary document image using the (a) Euclidean and (b) city block metrics. The width and height of the image are shown on the x- and y-axes. The colour bar shows the colour scale of the image vertically.

3.2.2 ME detection using a Faster R-CNN

The Faster R-CNN consists of two sub-networks: the RPN and the fully connected detection network. This section describes the two networks, which have been optimised to detect MEs in this study.

3.2.2.1 Region proposal network

The RPN aims to generate candidate regions for MEs. The input of the RPN is an n × n window of the feature map. For each location of the input window, k region proposals are predicted simultaneously. The proposals are called anchor boxes. The position of an anchor box is represented by the coordinates of its center and by its width and height. Thus, 2k classification scores (ME or not) and 4k regression outputs (coordinates and sizes of regions) are generated. By default, k is set to 9 and the scales and sizes of the boxes are predefined by the Faster R-CNN.
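The distance transform of Section 3.2.1 can be illustrated with a small pure-Python sketch for the city block metric. It is a sketch under stated assumptions, not the thesis implementation: the image is a list of rows with 1 for an ink pixel, the distance is measured from every pixel to the nearest ink pixel, and the function names are hypothetical. The two passes visit each pixel a constant number of times, which matches the O(m × n) complexity stated in the text.

```python
def city_block_distance_transform(binary):
    """City block distance from each pixel to the nearest ink pixel.

    binary: list of rows, 1 = ink (foreground), 0 = background (assumed convention).
    If the image has no ink pixel, every entry stays at rows + cols.
    """
    rows, cols = len(binary), len(binary[0])
    far = rows + cols  # larger than any city block distance inside the image
    dist = [[0 if binary[r][c] else far for c in range(cols)] for r in range(rows)]

    # Forward pass: propagate distances from the top and left neighbours.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)

    # Backward pass: propagate distances from the bottom and right neighbours.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist


def to_gray_levels(dist):
    """Scale a distance map linearly to integer grey levels in 0..255."""
    peak = max(max(row) for row in dist) or 1
    return [[round(255 * d / peak) for d in row] for row in dist]


def to_rgb(gray):
    """Replicate the grey level into three identical channels per pixel."""
    return [[(g, g, g) for g in row] for row in gray]
```

For the Euclidean metric shown in Figure 3.3(a), an exact transform needs a different algorithm; in practice a library routine would normally be used instead of hand-written passes.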
To obtain the optimal set of anchor boxes, the statistics of both isolated and inline MEs are analyzed in our work. Document images in the training set are normalized to the size of 600x900 (pixels), and the bounding boxes of MEs are resized according to the scales of the documents. The number and sizes of anchor boxes are estimated based on the overlap ratio between the proposed anchor boxes and the MEs. The optimal numbers of anchor boxes have been selected as 15 and 12 for isolated and inline ME detection, respectively.

3.2.2.2 Fully connected detection network

The region proposals obtained by the RPN are fed into the fully connected detection network. The network classifies whether a region proposal is an expression or not using a softmax layer. The position information of MEs is also obtained using a Box Regression layer. ResNet-50 is used as the backbone of the Faster R-CNN in this work. The ResNet-50 consists of 177 layers corresponding to 50 residual layers. The architecture of the Faster R-CNN is formed by adding the Region Proposal, RoI max pooling and Box Regression layers to the ResNet-50.
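The anchor-box analysis of Section 3.2.2.1 can be sketched as follows. The text does not give the exact estimation procedure, so this sketch only shows the two ingredients it names: rescaling ME boxes to the normalized page size and measuring the overlap ratio between candidate anchors and ME boxes. It assumes that 600x900 means width × height, that boxes are `(x, y, w, h)` tuples, and that overlap is the IoU of boxes sharing the same centre; all function names are hypothetical.

```python
def rescale_box(box, page_size, target_size=(600, 900)):
    """Scale an (x, y, w, h) box from page_size=(width, height) to target_size."""
    sx = target_size[0] / page_size[0]
    sy = target_size[1] / page_size[1]
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)


def size_iou(box_size, anchor_size):
    """IoU of two (width, height) boxes placed on the same centre."""
    inter = min(box_size[0], anchor_size[0]) * min(box_size[1], anchor_size[1])
    union = box_size[0] * box_size[1] + anchor_size[0] * anchor_size[1] - inter
    return inter / union


def mean_best_iou(box_sizes, anchor_sizes):
    """Average, over all ME boxes, of the IoU with the best-matching anchor."""
    best = [max(size_iou(b, a) for a in anchor_sizes) for b in box_sizes]
    return sum(best) / len(best)
```

One plausible way to use such a score, stated here as an assumption rather than as the thesis procedure, is to evaluate candidate anchor sets of increasing size and keep the smallest set beyond which the mean overlap stops improving noticeably.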
3.3 Experimental results

Table 3.1 Performance comparison between the proposed and existing methods of isolated expression detection on the Marmot dataset

| Method | Correct | Partial | Total detection | Missed | False | Total error |
|---|---|---|---|---|---|---|
| Method in ([14]) | 26.87% | 44.89% | 71.76% | 9.89% | 18.35% | 28.24% |
| Method using FFT (chapter 2) | 31.02% | 42.32% | 73.34% | 9.04% | 17.62% | 26.66% |
| The fusion method (chapter 2) | 51.34% | 39.84% | 91.18% | 3.14% | 5.68% | 8.82% |
| Proposed method | 84.80% | 8.10% | 92.90% | 2.27% | 4.83% | 7.10% |

Table 3.2 Performance comparison between the proposed and existing methods of inline expression detection on the Marmot dataset

| Method | Correct | Partial | Total detection | Missed | False | Total error |
|---|---|---|---|---|---|---|
| Method in ([14]) | 1.74% | 28.87% | 30.61% | 9.93% | 59.46% | 69.39% |
| Method using PP (chapter 2) | 11.05% | 41.40% | 52.45% | 8.36% | 39.19% | 47.55% |
| The fusion method (chapter 2) | 22.90% | 58.45% | 81.35% | 5.40% | 13.25% | 18.65% |
| Proposed method | 75.95% | 9.95% | 85.90% | 6.25% | 8.20% | 14.10% |

3.3.1 Comparison of the proposed and state-of-the-art methods used in ME detection

Tables 3.1 and 3.2 compare the proposed method with the conventional multistage methods, including that in ([14]), on the Marmot dataset, while Tables 3.3 and 3.4 give the same comparison on the GTDB dataset. The method in ([14]) extracts features from the bounding boxes of characters in word images. It is not effective for the detection of inline expressions because there is not much variation in the visual appearance of inline expressions.

Table 3.3 Performance comparison between the proposed and existing methods of isolated expression detection on the GTDB dataset

| Method | Correct | Partial | Total detection | Missed | False | Total error |
|---|---|---|---|---|---|---|
| Method in ([14]) | 26.22% | 44.87% | 71.09% | 9.91% | 19.00% | 28.91% |
| Method using FFT (chapter 2) | 30.86% | 42.12% | 72.98% | 9.25% | 17.77% | 27.02% |
| The fusion method (chapter 2) | 50.37% | 39.14% | 89.51% | 3.16% | 7.33% | 10.49% |
| Proposed method | 83.79% | 7.25% | 91.04% | 2.15% | 6.81% | 8.96% |

Table 3.4 Performance comparison between the proposed and existing methods of inline expression detection on the GTDB dataset

| Method | Correct | Partial | Total detection | Missed | False | Total error |
|---|---|---|---|---|---|---|
| Method in ([14]) | 1.56% | 28.67% | 30.23% | 9.97% | 59.80% | 69.77% |
| Method using PP (chapter 2) | 10.48% | 41.36% | 51.84% | 8.26% | 39.90% | 48.16% |
| The fusion method (chapter 2) | 22.76% | 57.44% | 80.20% | 5.46% | 14.34% | 19.80% |
| Proposed method | 75.20% | 9.95% | 85.15% | 6.15% | 8.70% | 14.85% |

The methods using the Fast Fourier Transform (FFT) and the projection profile (PP) of images obtain higher accuracy than the method in ([14]) because they can extract two-dimensional layout features of MEs. The method applying page segmentation and transfer learning with the AlexNet obtains higher accuracy than the handcrafted-feature methods because the CNN extracts features efficiently. The tables clearly show that expression detection employing the Faster R-CNN and the DT achieves the highest accuracy. In particular, the percentage of correctly detected expressions is significantly improved by the proposed method, for example from 51.34% to 84.80% for isolated expressions on the Marmot dataset (Table 3.1). The results demonstrate the effectiveness of the combination of the DT and the Faster R-CNN.

Table 3.5 shows the performance comparison between the proposed and state-of-the-art methods on the GTDB dataset. The Samsung R&D method obtains the highest accuracy because character information is integrated into the detection.
Our proposed method outperforms the remaining methods. The DT and the optimization of the anchor boxes of the Faster R-CNN yield better accuracy than the methods using the SSD512 and the YOLOv3 in ([22]). The Michiking system in ([22]) obtains the lowest accuracy because it applies traditional image processing and handcrafted feature extraction to the detection. Figure 3.4 illustrates the detection of expressions in a document page of the GTDB dataset.
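The IoU thresholds used in the comparison of Table 3.5 can be made concrete with a small sketch. The exact matching protocol behind the reported scores is not specified here, so the greedy one-to-one matching below is an assumption, and the boxes and function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def count_matches(detections, ground_truth, threshold):
    """Greedily match each ground-truth box to at most one unused detection."""
    used = set()
    hits = 0
    for gt in ground_truth:
        best_i, best = None, threshold
        for i, det in enumerate(detections):
            if i in used:
                continue
            score = iou(det, gt)
            if score >= best:
                best_i, best = i, score
        if best_i is not None:
            used.add(best_i)
            hits += 1
    return hits

# Two hypothetical ground-truth MEs; the second detection is cut short on the right.
ground_truth = [(0, 0, 100, 20), (0, 40, 100, 60)]
detections = [(0, 0, 100, 20), (0, 40, 60, 60)]

print(count_matches(detections, ground_truth, 0.5),
      count_matches(detections, ground_truth, 0.75))  # 2 1
```

The truncated detection has an IoU of 0.6 with its ground-truth box, so it counts as a match at IoU ≥ 0.5 but not at IoU ≥ 0.75, which is why scores generally drop at the stricter threshold.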
Table 3.5 Performance comparison of the proposed and state-of-the-art methods on the GTDB dataset

| Method | IoU ≥ 0.5 | IoU ≥ 0.75 |
|---|---|---|
| Samsung based on graph theory ([12]) | 94.36% | 94.17% |
| RIT 2 based on SSD512 ([22]) | 83.14% | 75.29% |
| RIT 1 based on Yolov3 ([22]) | 74.4% | 63.20% |
| Michiking system ([22]) | 36.87% | 19.10% |
| Proposed method | 83.79% | 77.20% |

Figure 3.4 Example of ME detection in a page of the GTDB dataset. The detected and ground-truth MEs are marked in blue and red, respectively.

3.4 Summary of the chapter

This chapter has presented an end-to-end framework that detects isolated and inline MEs in document images. The DT with various distance metrics, including the Euclidean, city block, and chessboard metrics, was applied to document images to take advantage of the ME layouts. Moreover, strategies for generating and optimising the anchor boxes of the RPN of the Faster R-CNN were proposed to improve the detection accuracy. The main results of this chapter have been published in publication 8.

CHAPTER 4 Detection and recognition of MEs in document images

4.1 Overview of the proposed system for the detection and recognition of MEs

The proposed system consists of two modules. First, MEs are detected by the detection module. Then, the detected MEs are recognized by the recognition module. Figure 4.1 describes the overall system for the detection and recognition of expressions in document images. MEs are detected by the end-to-end framework based on the Faster R-CNN presented in chapter 3. The detected MEs are then fed directly into the WAP [25] for recognition. The WAP is an advanced encoder-decoder model for the image-to-markup problem, in which it is especially important to ensure the translation of each local region of the input image.