Multi-lingual Optical Character Recognition Seminar

Usual meeting time: Friday, 11:00am - 11:50am. Room: ENR2 S375.

Spring 2020 Schedule of Meetings and Talks

# Date Room Speaker or Topic Agenda or Title and Abstract
44 Friday, 2-21-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • An implementation of CTC prefix search. (Marek Rychlik, Dwight Nwaigwe)
  • How to decode Latin alphabets with PCA and nearest neighbor search? (Marek Rychlik)
  • BiLSTM+CTC experiments with the Bromello font, a Latin cursive font. (Marek Rychlik)
  • Evaluation of OCR results from commercial OCR software. (Yan Han)
  • Discussion of other ongoing research.
44 Friday, 2-14-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
43 Friday, 2-7-2020 MATH 402 Odin Fernando Eufracio Vazquez
  • Speaker: Odin Fernando Eufracio Vazquez
  • Affiliation: Centro de Investigación en Matemáticas A.C. (CIMAT)
  • Title: Nonnegative Matrix Factorization with low-rank regularization for automatic feature extraction.
  • Room: MATH 402 (not ENR2 S375)
  • Join Zoom Meeting: ID 989-919-119
  • Abstract:

    In machine learning, Nonnegative Matrix Factorization (NMF) is a method of dimensionality reduction whose nonnegativity constraints permit only additive combinations. One of the challenges in NMF is to determine the rank of the factorization; the correct choice of the rank would allow us to extract better features and thus promote a part-based representation of the data.

    In this work, we propose to include a diagonal matrix D and minimize the rank of factorization through the penalization of the elements in the diagonal of D. We derive an iterative algorithm with closed formulas by alternately minimizing local cost functions.

    We demonstrate the efficacy of our algorithm by performing experiments on synthetic data, images, texts, and gene expression data sets. We show that our proposed algorithm not only estimates the factors with high precision but also, by minimizing the rank of the factorization, learns interpretable features from the data.
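
For readers unfamiliar with NMF, the following is a minimal sketch of the classical fixed-rank factorization X ≈ WH using the well-known multiplicative updates of Lee and Seung. It is not the speaker's algorithm: it has no diagonal matrix D and no rank penalty, and the rank must be supplied by hand, which is exactly the difficulty the abstract addresses. The function names and the small constant `eps` are illustrative choices.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(X, rank, iters=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix X (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


def frobenius_error(X, W, H):
    """Frobenius norm of the residual X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for xr, yr in zip(X, WH) for x, y in zip(xr, yr)) ** 0.5
```

Because every update multiplies by a ratio of nonnegative quantities, W and H stay nonnegative throughout, which is what yields the additive, part-based representation mentioned in the abstract.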

42 Friday, 1-30-2020 ENR2 S375 Research, funding, collaborations Group discussion and research updates on the following topics:
  • Report on conversation with SGCI.
  • TRIPODS 2 Translational Lab proposal.
  • Current research.
41 Friday, 1-23-2020 ENR2 S375 Organizational meeting. Group discussion and research updates on the following topics:
  • A review of research papers.
  • Funding strategies.

2019 Schedule of Meetings and Talks

# Date Room Speaker Title and Abstract
40 Friday, 12-06-2019 ENR2 S375 Semester Wrap-Up Group discussion and research updates on the following topics:
  • A review of implemented OCR-related algorithms.
  • Sequence-to-label mapping with LSTM as means of recognizing isolated characters.
  • Planning for the next semester.
40 Friday, 11-21-2019 ENR2 S375 Research Updates Group discussion and research updates on the following topics:
  • A review of best algorithms for finding distance between damaged characters.
  • Automatic differentiation in MATLAB 2019b.
  • Generating W-language samples with a chaotic dynamical system (interval map).
  • Learning parameters of chaotic Lorenz attractors with standard Deep Learning tools.
39 Friday, 11-15-2019 ENR2 S375 Research Updates
  • Marek Rychlik: New features in the MATLAB Deep Learning Toolkit. Abstract: The R2019b version of MATLAB has a number of new, impressive features for machine learning. I will focus on the automatic differentiation features.
  • Group discussion of dynamic time warping for multiple outline cycles (teleportation).
38 Friday, 11-01-2019 ENR2 S375 Research Updates
  • Marek Rychlik: Dynamic time warping with and without teleportation. Abstract: Dynamic Time Warping (DTW) may be used to align time while traversing similar data. In OCR we also need teleportation (instant transfer to another signal space). Rudimentary C++ code was used for benchmarking (without teleportation yet).
  • New videos! Arabic characters on the YouTube channel.
  • Dwight Nwaigwe will report on current progress on the Hessian of the multi-class logistic regression loss.
  • Dylan Murphy will update us on the sweep implementation of the character outline algorithm, and related topics.
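
As background for the DTW talk above, here is a minimal sketch of plain DTW between two numeric sequences, written in Python rather than the C++ used for the benchmarks. It covers only the standard algorithm without teleportation; the absolute-difference local cost and the function name are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classical dynamic time warping cost between sequences a and b.

    D[i][j] is the cheapest cost of aligning a[:i] with b[:j]; each step
    may advance in a, in b, or in both (the three predecessors below).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local mismatch (assumed metric)
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero, because the repeated sample is absorbed by stretching time instead of being counted as an error.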
37 Friday, 10-25-2019 ENR2 S375 Research updates
  • Marek Rychlik: Update on current performance on Traditional Chinese, rotated text, Latin text, and font mapping. Survey of new videos on the YouTube channel.
36 Friday, 10-18-2019 ENR2 S375 Research updates
  • NOTE: Marek Rychlik will give a talk at the TRIPODS seminar on Monday, 10-21-2019.
  • Marek Rychlik will demonstrate full OCR processing based on the character outline algorithm and cross-correlation, for Latin alphabets.
35 Friday, 10-11-2019 ENR2 S375 Dylan Murphy
  • TITLE: Line-sweeping and other incremental improvements Abstract: In the cycle-detection algorithm, the expensive step is graph-traversal. One way to improve the speed of this algorithm, as we have seen, is to use a more efficient representation of the adjacency matrix in terms of linked lists. I will present another method, which avoids the graph-traversal step entirely by connecting cycles during the edge-detection step using a linesweep-style algorithm. A naive implementation of this approach in Python produced speedups similar to the linked-list approach, reducing the processing time for a page to several seconds, down from hundreds of seconds.
  • Marek Rychlik will report on a modified algorithm capable of processing many Chinese pages in real (MATLAB) time, thanks to non-uniform sampling of character outlines. If there is time, I will discuss Dynamic Time Warping (DTW).
34 Friday, 10-04-2019 ENR2 S375 Marek Rychlik, Group Discussion
  • TITLE: Multi-page clustering of Traditional Chinese text Abstract: I will report on recent advances of Chinese text processing. The main advance is increasing the speed of computing character outlines by a factor of 100 as compared to the last report. This allows processing of entire books in acceptable time, without expensive hardware resources.
  • Research and software development updates.
34 Friday, 9-27-2019 ENR2 S375 Raymundo Navarette
  • TITLE: Bias reduction in multi-class logistic regression and the problem of separation. Abstract: We will go over the basics of Firth's method for bias reduction in maximum likelihood estimates and apply them to multi-class logistic regression. We will discuss how this and other penalization approaches remove the problem of separation, which occurs when sample sizes are small or when all classes can be separated with linear classifiers, and leads to non-existent (infinite) optimal parameters.
  • Research and software development updates.
33 Friday, 9-20-2019 ENR2 S375 Marek Rychlik, Group
  • TITLE: Latin and Chinese character outlines as means of extracting features Abstract: An important idea of OCR present in Tesseract and research papers is that character outlines are the source of features for character classification. We will discuss Fourier transform and splines as means of smoothing and approximating character boundaries. An inherent instability due to character damage will be described, and ways to address it.
  • Research and software development updates.
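
One common way to smooth a closed outline with the Fourier transform, as mentioned in the abstract above, is to treat the boundary points as complex numbers and discard the high-frequency coefficients. The sketch below illustrates that idea only; it is not the seminar's code, it uses a direct O(n²) transform for clarity, and the function name and the `keep` parameter are hypothetical.

```python
import cmath


def smooth_outline(points, keep):
    """Low-pass filter a closed outline given as a list of (x, y) points.

    Only Fourier coefficients with frequency |k| <= keep are retained.
    """
    n = len(points)
    z = [complex(x, y) for x, y in points]
    # Forward discrete Fourier transform of the outline.
    coeffs = [
        sum(z[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n)) / n
        for k in range(n)
    ]
    # Index k and index n - k carry frequencies +k and -k respectively.
    for k in range(n):
        if min(k, n - k) > keep:
            coeffs[k] = 0j
    # Inverse transform back to points in the plane.
    smoothed = [
        sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * j / n) for k in range(n))
        for j in range(n)
    ]
    return [(w.real, w.imag) for w in smoothed]
```

With `keep=0` every point collapses to the centroid of the outline, and larger values of `keep` restore progressively finer detail. Small boundary damage lives mostly in the high frequencies, which is why truncation smooths it.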
32 Friday, 9-13-2019 ENR2 S375 Dwight Nwaigwe
  • TITLE: Character Matching and some problems with topological classification. Abstract: Ray Smith's overview of Tesseract mentions the use of features in classification, noting that topological classification is not robust. We briefly go into some examples. Further, we note that his paper discusses some of the mechanics of character matching, which is based on clustering. We compare a clustering method devised for cursive fonts.
31 Friday, 9-6-2019 ENR2 S375 Group
  • Discussion of Tesseract architecture based on Ray Smith's paper.
    • The use of polygonal approximations
    • Maximally chopped characters
    • (x,y,theta) as features
    • Re-assembling chopped characters
  • Character outline calculation with MATLAB.
  • Top-down, bottom-up, adaptive processing.
  • Research updates.
30 Friday, 8-30-2019 ENR2 S375 Group
  • Planning for the semester; papers to read. One paper: Overview of Tesseract OCR Engine by Ray Smith.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Research updates.
  • MATLAB image datastores - the MATLAB ways to prepare training data.
30 Friday, 8-23-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Report on 20% improvement on digits 0-3 of the MNIST dataset. Abstract: Using the new regularization technique in the "Patternnet" paper I was able to reduce the number of errors from 610 to about 490, a reduction of approximately 20%. The regularization shares some features with the LASSO method in statistics.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
29 Friday, 8-16-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Simplifying the "Patternnet" approach; linear programming;
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
28 Friday, 8-09-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates: Connection between the "Patternnet" multi-class logistic regression and Support Vector Machines, and Linear Programming aspects.
  • Tesseract 5 new features (Yan Han).
27 Friday, 8-02-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on multi-logistic regression paper (Marek Rychlik).
  • Tesseract 5 new features (Yan Han).
26 Friday, 7-26-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Caching Unicode character images in MATLAB (Marek Rychlik). Abstract: Rendering Traditional Chinese characters from Unicode glyphs is a time-consuming operation, especially in MATLAB. We need the characters to be at least 60 pixels in size, which generates about 4,000 bits per character (one of 60,000+). Therefore, rather than generating characters on the fly, we reuse generated images by caching them. Two caching strategies are implemented: in RAM and in an SQLite database. Bit packing is used as a form of compression for the database implementation. The speedup achieved is 5- to 10-fold.
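
The two-level caching scheme described in the abstract can be sketched as follows. This is a Python illustration under stated assumptions, not the MATLAB implementation discussed in the talk: `render_glyph` is a hypothetical stand-in for the slow renderer, and the table layout and class name are invented for the example.

```python
import sqlite3

SIZE = 60  # glyph side in pixels, as in the abstract (60 * 60 = 3600 bits)


def pack_bits(pixels):
    """Pack a flat list of 0/1 pixels into bytes, 8 pixels per byte, MSB first."""
    out = bytearray((len(pixels) + 7) // 8)
    for i, p in enumerate(pixels):
        if p:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(blob, count):
    """Inverse of pack_bits: recover the first `count` pixels."""
    return [(blob[i // 8] >> (7 - i % 8)) & 1 for i in range(count)]


def render_glyph(codepoint):
    """Hypothetical slow renderer; returns a deterministic fake bitmap."""
    return [1 if (codepoint + i * i) % 7 == 0 else 0 for i in range(SIZE * SIZE)]


class GlyphCache:
    """Look up a glyph in RAM first, then in SQLite, and render only on a miss."""

    def __init__(self, path=":memory:"):
        self.ram = {}
        self.renders = 0  # counts calls to the slow renderer
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS glyphs (codepoint INTEGER PRIMARY KEY, bits BLOB)"
        )

    def get(self, codepoint):
        if codepoint in self.ram:
            return self.ram[codepoint]
        row = self.db.execute(
            "SELECT bits FROM glyphs WHERE codepoint = ?", (codepoint,)
        ).fetchone()
        if row is not None:
            pixels = unpack_bits(row[0], SIZE * SIZE)
        else:
            pixels = render_glyph(codepoint)
            self.renders += 1
            self.db.execute(
                "INSERT INTO glyphs VALUES (?, ?)", (codepoint, pack_bits(pixels))
            )
            self.db.commit()
        self.ram[codepoint] = pixels
        return pixels
```

Packing stores a 60 x 60 bitmap in 450 bytes instead of one byte or more per pixel, which keeps the database small even with tens of thousands of characters.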
25 Friday, 7-19-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • MATLAB MEX interface to Tesseract 4 (Marek Rychlik). Abstract: I wrote C++ code which effectively wraps Tesseract 4 in MATLAB. It entirely bypasses the Vision Toolbox toolkit wrapper, which only supports version 3 of Tesseract. I will explain the implementation and capabilities. I will also demonstrate its application to provide a complete OCR system for Traditional Chinese, using custom page segmentation (class PageScan) and LSTM-based character recognition.
  • Update on Sayyed's work (Sayyed Vazirizade) on Arabic/Persian/Pashto page segmentation.
  • Review of OCR papers listed at the Tesseract site (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
24 Friday, 7-12-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on Sayyed's work.
  • New code for Chinese page segmentation (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
23 Friday, 7-05-2019 ENR2 S375 NO MEETING. 4th of July break.
22 Friday, 6-28-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Preparing for the meeting with Rep. Grijalva's office.
  • Update on Sayyed's work.
  • MATLAB techniques for preparing training datasets: structures, cell arrays, .mat files.
  • Unsupervised classification of Chinese characters.
22 Friday, 6-21-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Updates on the multi-class logistic regression research. Kronecker product and Hadamard product.
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
21 Friday, 6-14-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
20 Friday, 6-7-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • Review of current segmentation code for Pashto/Persian. MATLAB 'system' call under Windows.
  • MATLAB code for image processing and page segmentation algorithms.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
19 Friday, 5-31-2019 ENR2 S375 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
18 Friday, 5-24-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
17 Friday, 5-17-2019 ENR2 S395 Group discussion This first meeting of Summer 2019 will be devoted to:
  • Summer research themes; everyone is invited to talk for a few minutes about their plans and problems;
  • Review the new collection at the Library of Congress of Chinese documents, needing OCR.
16 Friday, 5-11-2019 ENR2 S395 NO MEETING NO MEETING.
15 Friday, 5-03-2019 ENR2 S395 Group discussion Due to Exam Session, the agenda is tentative:
  • Research progress (Marek Rychlik, Dwight Nwaigwe, Aaron Peterson, Ryan Coatney)
  • New document choices (Yan Han)
  • Tesseract training (Dylan Murphy)
14 Friday, 4-26-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • The problem of "loose clouds" in Chinese text (Yan Han, Marek Rychlik)
  • Building Web applications with MATLAB (Marek Rychlik)
  • MATLAB as an application delivery system: in-MATLAB, standalone with free MATLAB Runtime, Web with MATLAB Web application server
  • Ideas and obstacles of the Tier 2 proposal
  • Research and development ideas for the summer
13 Friday, 4-19-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • Chinese character experiments (Yan Han and his students)
  • Building standalone applications with MATLAB (Marek Rychlik)
  • GitHub limitation on file size
  • New installer site on BitBucket
  • Conversation with an NEH Officer regarding Tier 2 proposal
12 Friday, 4-12-2019 ENR2 S395 Marek Rychlik Research talk, and a group discussion of current topics:
  • Marek Rychlik (University of Arizona) TITLE: Character recognition with multi-class logistic regression. ABSTRACT: A multi-class logistic regression network is simple, has a good training algorithm, and achieves quite impressive accuracy on the MNIST dataset. I will discuss my recent paper on this subject https://arxiv.org/abs/1903.12600 and the implementation in our GitHub repository.
11 Friday, 4-5-2019 ENR2 S395 Group discussion, Marek Rychlik, Yan Han We will cover a variety of topics:
  • NEH Grant proposal Tier 2 proposal content.
  • Report on meeting with Clayton Morrison's group (Marek Rychlik).
  • Tier 2 proposal and the ISO standard (Yan Han).
  • Report on a program im2latex (Marek Rychlik); this program can convert images to math equations.
10 Friday, 3-29-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • The mechanics of training Tesseract.
  • Preparations for Tesseract training runs on multi-core computer(s).
  • NEH Grant proposal writing - Tier 2.
9 Friday, 3-22-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • New code on GitHub: an implementation of CTC in MATLAB
  • The page segmentation algorithm
  • Training Tesseract
  • NEH Grant proposal writing - Tier 2
8 Friday, 3-15-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will speak on CTC. ABSTRACT: CTC (Connectionist Temporal Classification) is a probability model for segmentation of outputs of recurrent neural networks into meaningful chunks. The technique is used for handwriting and script language segmentation, and for natural speech. In Deep Learning, CTC is simply a kind of rather sophisticated loss function. I will discuss the paper of Alex Graves (currently at DeepMind) who introduced this method about 10 years ago to handwriting recognition and OCR.
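
To make the "loss function" view of CTC concrete, here is a minimal sketch of the standard forward recursion. It computes the total probability of all frame-level paths that collapse to a given label sequence (repeats merged, blanks removed). It works with plain probabilities rather than log-probabilities, so it suits short examples only; the function names and the convention that symbol 0 is the blank are assumptions for illustration.

```python
import math


def ctc_probability(probs, label, blank=0):
    """Probability that the per-frame outputs collapse to `label`.

    probs[t][k] is the network's probability of symbol k at frame t;
    label is a list of non-blank symbol ids.
    """
    # Extended label: blanks interleaved around every symbol.
    ext = [blank]
    for s in label:
        ext += [s, blank]
    S, T = len(ext), len(probs)
    # A path may start on the leading blank or on the first symbol.
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]  # stay on the same extended symbol
            if s >= 1:
                total += alpha[s - 1]  # advance by one position
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # A valid path ends on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_loss(probs, label, blank=0):
    """Negative log-likelihood of the label; undefined if it is unreachable."""
    return -math.log(ctc_probability(probs, label, blank))
```

With two frames and uniform probabilities over {blank, a}, the label "a" is produced by three of the four possible paths (a-blank, blank-a, a-a), so its probability is 0.75.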
7 Friday, 3-1-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will talk about the Tesseract C++ API. Some C++ programming examples will be discussed. One example allows translation of Pashto ligatures to Unicode.
6 Friday, 2-22-2019 ENR2 S395 Dylan Murphy
  • Dylan Murphy (University of Arizona) will discuss OCR software. The open source packages Kraken and Tesseract will be discussed. The talk will cover the use and system architecture of these systems, as well as the process of training for new language recognition.
5 Friday, 2-15-2019 ENR2 S395 Sayyed Vazirizade
  • Sayyed Vazirizade (University of Arizona) will review Persian OCR software.
4 Friday, 2-8-2019 ENR2 S395 Ryan Coatney, Yan Han
  • Ryan Coatney (University of Arizona) will continue talking about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (est. 25 min).
  • Yan Han (University of Arizona) will talk about APIs for embedding text in PDF (est. 25 min).
3 Friday, 2-1-2019 ENR2 S395 Mike Maizels, Ryan Coatney
  • Mike Maizels (Harvard and University of Arkansas) will discuss an arts-related project involving OCR (15 min).
  • Ryan Coatney (University of Arizona) will talk about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (30 min).
2 Friday, 1-25-2019 ENR2 S395 Marek Rychlik A method for Chinese OCR using Hough and Fourier transforms. I will explain the algorithm published in our GitHub repository. Also, I will briefly describe the selected papers which we can collectively study. The slides of this talk are available.
1 Friday, 1-18-2019 ENR2 S395 Organizational meeting. Agenda will include:
  • Introductions
  • Description of the NEH grant research
  • Resources for Pashto and Chinese

Zoom recordings

They are available on the restricted page of this website. However, you need to ask the organizers for the credentials to access this page.

The organizers