Multi-lingual Optical Character Recognition Seminar

Spring/Summer 2019 Schedule of Meetings and Talks

Usual meeting time: Friday, 11:00am - 11:50am. Room: ENR2 S375.
# Date Room Speaker Title and Abstract
30 Friday, 8-23-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Report on 20% improvement on digits 0-3 of the MNIST dataset. Abstract: Using the new regularization technique in the "Patternnet" paper I was able to reduce the number of errors from 610 to about 490, a reduction of approximately 20%. The regularization shares some features with the LASSO method in statistics.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
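As a quick check of the figure quoted in the research update above, the relative error reduction can be computed directly. This is a minimal Python sketch; the function name `relative_reduction` is illustrative only, and 490 is the approximate count given in the abstract.

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction by which the error count dropped."""
    return (before - after) / before


errors_before = 610
errors_after = 490  # "about 490" in the abstract, so the result is approximate
reduction = relative_reduction(errors_before, errors_after)
print(f"{reduction:.1%}")  # prints 19.7%
```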
29 Friday, 8-16-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Simplifying the "Patternnet" approach; linear programming;
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
28 Friday, 8-09-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates: Connection between the "Patternnet" multi-class logistic regression and Support Vector Machines, and Linear Programming aspects.
  • Tesseract 5 new features (Yan Han).
27 Friday, 8-02-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on multi-logistic regression paper (Marek Rychlik).
  • Tesseract 5 new features (Yan Han).
26 Friday, 7-26-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Caching Unicode character images in MATLAB (Marek Rychlik). Abstract: Rendering Traditional Chinese characters from Unicode glyphs is a time-consuming operation, especially in MATLAB. We need the characters to be at least 60 pixels in size, which generates about 4,000 bits per character, and there are more than 60,000 characters. Therefore, rather than generating characters on the fly, we reuse generated images by caching them. Two caching strategies are implemented: in RAM and in an SQLite database. Bit packing is used as a form of compression in the database implementation. The speedup achieved is 5- to 10-fold.
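The caching idea in the abstract above can be sketched as follows. The seminar code is in MATLAB; this is a Python illustration under stated assumptions, not the actual implementation. The names `pack_bits`, `unpack_bits`, `GlyphCache`, the table layout, and the `render` callback are all hypothetical, and a 60-by-60 one-bit bitmap (3,600 bits, 450 bytes when packed) is assumed.

```python
import sqlite3

SIZE = 60  # assumed glyph side in pixels (the abstract says "at least 60")


def pack_bits(pixels):
    """Pack a flat sequence of 0/1 pixels into bytes, 8 pixels per byte, MSB first."""
    out = bytearray((len(pixels) + 7) // 8)
    for i, bit in enumerate(pixels):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(blob, count):
    """Inverse of pack_bits: recover the first `count` pixels."""
    return [(blob[i // 8] >> (7 - i % 8)) & 1 for i in range(count)]


class GlyphCache:
    """Two-level cache: a dict in RAM in front of an SQLite table."""

    def __init__(self, render, path=":memory:"):
        self.render = render  # hypothetical slow renderer: codepoint -> list of 0/1
        self.ram = {}
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS glyphs "
            "(codepoint INTEGER PRIMARY KEY, bits BLOB)"
        )

    def get(self, codepoint):
        # 1. Fastest path: already in RAM.
        if codepoint in self.ram:
            return self.ram[codepoint]
        # 2. Next: packed bitmap stored in the database.
        row = self.db.execute(
            "SELECT bits FROM glyphs WHERE codepoint = ?", (codepoint,)
        ).fetchone()
        if row is not None:
            pixels = unpack_bits(row[0], SIZE * SIZE)
        else:
            # 3. Slow path: render once, then store the packed bits.
            pixels = self.render(codepoint)
            self.db.execute(
                "INSERT INTO glyphs (codepoint, bits) VALUES (?, ?)",
                (codepoint, pack_bits(pixels)),
            )
            self.db.commit()
        self.ram[codepoint] = pixels
        return pixels


def fake_render(codepoint):
    """Stand-in for the slow glyph renderer (a test pattern, not real font rendering)."""
    return [(codepoint + i) % 2 for i in range(SIZE * SIZE)]


cache = GlyphCache(fake_render)
glyph = cache.get(0x4E2D)        # rendered, then stored in SQLite and RAM
glyph_again = cache.get(0x4E2D)  # served from RAM
```

Storing one bit per pixel instead of one byte per pixel shrinks each stored image by a factor of eight, which is the "packing bits" compression the abstract refers to.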
25 Friday, 7-19-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • MATLAB MEX interface to Tesseract 4 (Marek Rychlik). Abstract: I wrote C++ code which effectively wraps Tesseract 4 in MATLAB. It entirely bypasses the Computer Vision Toolbox wrapper, which supports only version 3 of Tesseract. I will explain the implementation and capabilities. I will also demonstrate its application to provide a complete OCR system for Traditional Chinese, using custom page segmentation (class PageScan) and LSTM-based character recognition.
  • Update on Sayyed's work (Sayyed Vazirizade) on Arabic/Persian/Pashto page segmentation.
  • Review of OCR papers listed at the Tesseract site (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
24 Friday, 7-12-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on Sayyed's work.
  • New code for Chinese page segmentation (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
23 Friday, 7-05-2019 ENR2 S375 NO MEETING. 4th of July break.
22 Friday, 6-28-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Preparing for the meeting with Rep. Grijalva's office.
  • Update on Sayyed's work.
  • MATLAB techniques for preparing training datasets: structures, cell arrays, .mat files.
  • Unsupervised classification of Chinese characters.
22 Friday, 6-21-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Updates on the multi-class logistic regression research. Kronecker product and Hadamard product.
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
21 Friday, 6-14-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
20 Friday, 6-7-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • Review of current segmentation code for Pashto/Persian. MATLAB 'system' call under Windows.
  • MATLAB code for image processing and page segmentation algorithms.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
19 Friday, 5-31-2019 ENR2 S375 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
18 Friday, 5-24-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
17 Friday, 5-17-2019 ENR2 S395 Group discussion This first meeting of Summer 2019 will be devoted to:
  • Summer research themes; everyone is invited to talk for a few minutes about their plans and problems;
  • Review of the new Library of Congress collection of Chinese documents that need OCR.
16 Friday, 5-10-2019 ENR2 S395 NO MEETING.
15 Friday, 5-03-2019 ENR2 S395 Group discussion Due to Exam Session, the agenda is tentative:
  • Research progress (Marek Rychlik, Dwight Nwaigwe, Aaron Peterson, Ryan Coatney)
  • New document choices (Yan Han)
  • Tesseract training (Dylan Murphy)
14 Friday, 4-26-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • The problem of "loose clouds" in Chinese text (Yan Han, Marek Rychlik)
  • Building Web applications with MATLAB (Marek Rychlik)
  • MATLAB as an application delivery system: in-MATLAB, standalone with free MATLAB Runtime, Web with MATLAB Web application server
  • Ideas and obstacles of the Tier 2 proposal
  • Research and development ideas for the summer
13 Friday, 4-19-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • Chinese character experiments (Yan Han and his students)
  • Building standalone applications with MATLAB (Marek Rychlik)
  • GitHub limitation on file size
  • New installer site on BitBucket
  • Conversation with an NEH Officer regarding Tier 2 proposal
12 Friday, 4-12-2019 ENR2 S395 Marek Rychlik Research talk, and a group discussion of current topics:
  • Marek Rychlik (University of Arizona) TITLE: Character recognition with multi-class logistic regression. ABSTRACT: The multi-class logistic regression network is simple, has a good training algorithm, and achieves quite impressive accuracy on the MNIST dataset. I will discuss my recent paper on this subject, https://arxiv.org/abs/1903.12600, and the implementation in our GitHub repository.
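For readers new to the model, the prediction step of multi-class logistic regression can be sketched in a few lines. This is the generic textbook (softmax) form, written in Python for illustration; it is not necessarily the formulation or the training algorithm of the arXiv paper, and the names `softmax` and `predict_proba` and the toy weights are assumptions.

```python
import math


def softmax(scores):
    """Convert real-valued class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Class probabilities for feature vector x.

    weights[k] is the weight vector of class k and biases[k] its bias.
    """
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy example: three classes, two features (values chosen arbitrarily).
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_biases = [0.0, 0.0, 0.0]
toy_probs = predict_proba(toy_weights, toy_biases, [2.0, 0.0])
predicted_class = max(range(len(toy_probs)), key=toy_probs.__getitem__)
```

For digit recognition, `x` would be the flattened pixel vector of an image and there would be one weight vector per digit class.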
11 Friday, 4-5-2019 ENR2 S395 Group discussion, Marek Rychlik, Yan Han We will cover a variety of topics:
  • Content of the NEH Tier 2 grant proposal.
  • Report on meeting with Clayton Morrison's group (Marek Rychlik).
  • Tier 2 proposal and the ISO standard (Yan Han).
  • Report on a program im2latex (Marek Rychlik); this program can convert images to math equations.
10 Friday, 3-29-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • The mechanics of training Tesseract.
  • Preparations for Tesseract training runs on multi-core computer(s).
  • NEH Grant proposal writing - Tier 2.
9 Friday, 3-22-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • New code on GitHub: an implementation of CTC in MATLAB
  • The page segmentation algorithm
  • Training Tesseract
  • NEH Grant proposal writing - Tier 2
8 Friday, 3-15-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will speak on CTC. ABSTRACT: CTC (Connectionist Temporal Classification) is a probability model for segmentation of outputs of recurrent neural networks into meaningful chunks. The technique is used for handwriting and script language segmentation, and for natural speech. In Deep Learning, CTC is simply a kind of rather sophisticated loss function. I will discuss the paper of Alex Graves (currently at DeepMind), who introduced this method to handwriting recognition and OCR about 10 years ago.
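To make the "loss function" view of CTC concrete, here is a minimal Python sketch of the standard forward recursion: it sums the probabilities of all frame-level paths that collapse to a given label sequence, and the loss is the negative logarithm of that sum. The function names, the choice of index 0 for the blank symbol, and the toy probabilities are assumptions for illustration; this is not the seminar's MATLAB implementation.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """CTC collapse map: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_probability(probs, label):
    """Probability of `label` given per-frame distributions probs[t][k]."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [BLANK]
    for c in label:
        ext += [c, BLANK]
    S = len(ext)

    # alpha[s]: total probability of path prefixes ending at ext[s].
    alpha = [0.0] * S
    alpha[0] = probs[0][BLANK]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_loss(probs, label):
    """Negative log-likelihood of the label sequence."""
    return -math.log(ctc_probability(probs, label))


# Toy input: 3 frames, 3 symbols (blank, 1, 2); each row sums to 1.
toy_frames = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
toy_loss = ctc_loss(toy_frames, [1, 2])
```

The skip transition is disallowed between identical neighbouring labels, which is why a repeated character such as `[1, 1]` needs at least one blank frame between its two occurrences.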
7 Friday, 3-1-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will talk about the Tesseract C++ API. Some C++ programming examples will be discussed. One example allows translation of Pashto ligatures to Unicode.
6 Friday, 2-22-2019 ENR2 S395 Dylan Murphy
  • Dylan Murphy (University of Arizona) will discuss OCR software, in particular the open-source packages Kraken and Tesseract. The talk will cover the use and system architecture of these systems, as well as the process of training them to recognize a new language.
5 Friday, 2-15-2019 ENR2 S395 Sayyed Vazirizade
  • Sayyed Vazirizade (University of Arizona) will review Persian OCR software.
4 Friday, 2-8-2019 ENR2 S395 Ryan Coatney, Yan Han
  • Ryan Coatney (University of Arizona) will continue talking about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (est. 25 min).
  • Yan Han (University of Arizona) will talk about APIs for embedding text in PDF (est. 25 min).
3 Friday, 2-1-2019 ENR2 S395 Mike Maizels, Ryan Coatney
  • Mike Maizels (Harvard and University of Arkansas) will discuss an arts-related project involving OCR (15 min).
  • Ryan Coatney (University of Arizona) will talk about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (30 min).
2 Friday, 1-25-2019 ENR2 S395 Marek Rychlik A method for Chinese OCR using Hough and Fourier transforms. I will explain the algorithm published in our GitHub repository. Also, I will briefly describe the selected papers which we can collectively study. The slides of this talk are available.
1 Friday, 1-18-2019 ENR2 S395 Organizational meeting. Agenda will include:
  • Introductions
  • Description of the NEH grant research
  • Resources for Pashto and Chinese

Zoom recordings

The recordings are available on the restricted page of this website. To access that page, ask the organizers for the credentials.

The organizers