Multi-lingual Optical Character Recognition Seminar

Fall 2020 Schedule of Meetings and Talks

Usual meeting time: Friday, 11:00am - 11:50am. The meetings will be held exclusively online via Zoom. Use this link to connect.
# Date Room Speaker or Topic Agenda or Title and Abstract
83 Friday, 11-20-2020 Zoom Group Discussion Agenda items:
  • Funding updates.
  • Research updates.
82 Friday, 11-13-2020 Zoom Group Discussion Agenda items:
  • CTC implementation in MATLAB (Dwight Nwaigwe, Marek Rychlik). CTC with logits can be implemented without patching MATLAB installation. Some built-in classes are overridden. This approach may work with other projects.
  • Research updates.
81 Friday, 11-06-2020 Zoom Elsayed Issa
  • Speaker: Elsayed Issa
  • Affiliation: University of Arizona, School of Middle Eastern and North African Studies
  • Title: Machine-Extracted Text Summaries for Arabic L2 Learning
  • Abstract: Text summarization is the process of creating a concise and coherent summary of a longer text while preserving the meaning and the important information in the text. Automatic summaries reduce reading time, improve the effectiveness of indexing, and help in question-answering systems. In this talk, we will discuss a line of research on automatic text summarization for L2 microlearning, where summaries serve as small learning pieces that L2 learners read instead of larger documents. We use Probabilistic Topic Modeling (PTM) and its Latent Dirichlet Allocation (LDA) algorithm, as well as a sentence extraction approach, to implement our system. Topic modeling is used to discover the underlying topics in a text document or in several documents. The basic assumption behind it is that there is a set of latent topics, each a multinomial distribution over words, and that each document can be described as a mixture of these topics. Each document then has a set of topics and probability distributions associated with them. At the same time, each topic has a set of words and their probabilities of occurrence given that document and topic, i.e., topic models build bags for topics to extract information. The extractive method selects and extracts the pieces or sentences of a longer text that are more relevant than the others.
80 Friday, 10-30-2020 Zoom Group Discussion Agenda items:
  • New approaches to Chinese; attention in neural nets (Yan Han, Dylan Murphy, Dwight Nwaigwe).
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe, Marek Rychlik).
  • Research updates.
79 Friday, 10-23-2020 Zoom Group Discussion Agenda items:
  • New approaches to Chinese (Yan Han, Dylan Murphy, Dwight Nwaigwe).
  • Continuation of MRI Proposal and the POWER architecture (Marek Rychlik).
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe, Marek Rychlik).
  • Research updates.
78 Friday, 10-16-2020 Zoom Group Discussion Agenda items:
  • MRI Proposal and the POWER architecture (Marek Rychlik).
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe).
  • Research updates.
77 Friday, 10-09-2020 Zoom Group Discussion Agenda items:
76 Friday, 10-02-2020 Zoom Group Discussion Agenda items:
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe).
  • Research updates.
75 Friday, 9-25-2020 Zoom Group Discussion Agenda items:
  • The upcoming Middle East Studies Association (MESA) meeting (Marek Rychlik). Abstract: I am scheduled to participate in a panel discussion and present a paper on Tuesday, 10/06/2020, 1:30pm. The program of the meeting is available on-line.
  • CTC implementation strategies (Dwight Nwaigwe, Marek Rychlik, Ryan Coatney).
  • New papers in paper repository.
  • Research updates.
74 Friday, 9-18-2020 Zoom Group Discussion Agenda items:
  • The DoD White Paper update (Marek Rychlik).
  • Research updates.
73 Friday, 9-11-2020 Zoom Group Discussion Agenda items:
  • DoD funding pursuit update (Marek Rychlik, Yan Han). Items needed (deadline for submission: Sep. 21):
    • Project description (3-page limit)
    • Approximate yearly budget
  • Updates to Git (non-public) repository (Marek Rychlik):
    • Moved seminar website into the repository
    • Planning to integrate the 'Papers' folder, so that there is only one collection of papers
  • Strategy to develop CTC with logits in MATLAB (Dwight Nwaigwe, Marek Rychlik).
  • A new paper with spectral estimates (Dwight Nwaigwe).
  • Research updates.
72 Friday, 9-04-2020 Zoom Group Discussion Agenda items:
  • DoD funding pursuit update (Marek Rychlik, Yan Han).
  • Strategy to develop CTC with logits in MATLAB (Dwight Nwaigwe).
  • Research updates.
71 Friday, 8-28-2020 Zoom Group Discussion Agenda items:
  • DoD funding pursuit --- collaborator and mentor search update (Marek Rychlik, Yan Han).
  • DoD two-page project summary writing effort (Marek Rychlik). NOTE: Ideas beyond current project may be developed.
  • Preliminary report on Fourier and Cepstral Analysis incorporation for handling vertical jitter (Marek Rychlik).
  • Video of Chinese character recognition posted on YouTube (Dwight Nwaigwe).
  • Announcement: the "deductron paper" published (Marek Rychlik).
  • Research updates.
70 Friday, 8-21-2020 Zoom Group Discussion Agenda items:
  • DoD funding --- collaborator and mentor search update (Marek Rychlik).
  • C++ and ImageMagick for OCR (Marek Rychlik). Abstract: Working with frameworks (e.g. MATLAB Deep Learning Toolkit) typically leads to insurmountable problems due to framework design flaws and limitations. This is why one eventually wants to take advantage of the power and flexibility of C++. I will demonstrate this with bits of code written in C++ using ImageMagick and the Boost multiarray class.
  • Research updates.
Usual meeting time: Friday, 11:00am - 11:50am. Room: ENR2 S375.

Summer 2020 Schedule of Meetings and Talks

# Date Room Speaker or Topic Agenda or Title and Abstract
69 Friday, 8-14-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Using sequence-to-label mapping for OCR (Dwight Nwaigwe).
  • DoD funding and grant writing update (Marek Rychlik).
  • Implementing the snake scanning pattern to implement baseline-free learning of cursive scripts (Marek Rychlik).
  • Research updates.
68 Friday, 8-07-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Research topics suggestions for the Fall. Papers to read in the Git repository.
  • DoD funding and grant writing plan (Marek Rychlik).
  • Classifying English characters with RNN in sequence-to-label mapping mode (Marek Rychlik). The code was posted to git.
  • Research updates.
67 Friday, 7-31-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Conference in Seoul in 2021 (Yan Han).
  • DoD funding and grant writing plan (Marek Rychlik).
  • Classifying English characters with RNN in sequence-to-label mapping mode (Marek Rychlik).
  • Papers to read in the Git repository
  • Research updates.
66 Friday, 7-24-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • DoD funding opportunities in Machine Learning (Marek Rychlik). NOTE: I e-mailed a copy of the materials from the Webinar I attended.
  • Sharing files with iSCSI (Marek Rychlik). Abstract: The RSNA dataset is over 400GB, which is more disk space than most laptops have. I will discuss the use of the iSCSI protocol to share a disk across the network from a server. I will briefly compare it to the NFS protocol popular in the U*nix world.
  • Research updates.
65 Friday, 7-17-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Search for new funding opportunities (Marek Rychlik).
  • Updates to hardware and software resources.
  • Research updates.
64 Friday, 7-10-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Calling Python from MATLAB and improved generation of Unicode strings (Marek Rychlik).
  • New theoretical results on eigenvalues of the Hessian in multi-class logistic regression (Dwight Nwaigwe).
  • Research updates.
63 Friday, 7-3-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Medical imaging - Kaggle RSNA intracranial hemorrhage dataset (Marek Rychlik). Abstract: I will report on downloading the dataset and basic processing in MATLAB. I will discuss a preparatory step for ML: creation of a custom DICOM data store class.
  • Research updates.
62 Friday, 6-26-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Medical imaging - similarities and differences with OCR (Marek Rychlik). Abstract: A 2019 Kaggle-style challenge dealing with radiological data concluded with 3 best solutions being a combination of CNN and LSTM/GRU, things commonly used in OCR. The major difference is that we deal with Big Data. I will introduce a half-a-terabyte training dataset, used in the Kaggle competition.
  • Determinant and eigenvalue identities and inequalities in machine learning (Dwight Nwaigwe).
  • Research updates.
61 Friday, 6-19-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras, ROM and 15% speedup of training (Dylan Murphy).
  • A determinant lemma for sums of Kronecker products (Dwight Nwaigwe).
  • Dissertation on dynamic responsibility (Ryan Coatney).
  • Other research developments.
60 Friday, 6-12-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras (Dylan Murphy).
  • Deductron as a replacement for LSTM in an NLP application (Marek Rychlik, Dylan Murphy).
  • A determinant lemma for sums of Kronecker products (Dwight Nwaigwe).
  • Other research developments.

Spring 2020 Schedule of Meetings and Talks

# Date Room Speaker or Topic Agenda or Title and Abstract
59 Friday, 6-5-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras (Dylan Murphy).
  • Keras-based OCR in Python for the Bromello font (Marek Rychlik).
  • Other research developments.
58 Friday, 5-29-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras (Dylan Murphy).
  • Keras-based OCR in Python for the Bromello font (Marek Rychlik).
  • Other research developments.
57 Friday, 5-22-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Future funding.
  • Research updates.
56 Friday, 5-15-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • NEH proposal submission.
  • Research updates.
55 Friday, 5-8-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Grant proposal writing.
  • White paper submission.
  • Research updates.
54 Friday, 5-1-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper submission.
  • Grant proposal writing.
53 Friday, 4-23-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper progress.
  • Transition to working on the next proposal.
52 Friday, 4-17-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • New code and new datasets (Marek Rychlik):
    • Drawing Chinese characters in MATLAB efficiently.
    • Breaking up pages with new LineBreakerApp into lines and characters.
  • Training deep networks on Chinese characters (Dwight Nwaigwe).
  • White paper update.
51 Friday, 4-10-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Creation of new training data for Pashto (Yan Han, Marek Rychlik).
  • The line-breaking algorithm for Farsi and Pashto based on bounding box overlaps (Marek Rychlik, Sayyed Vazirizade).
  • Chinese rendering with Python (Dwight Nwaigwe).
  • White paper progress.
50 Friday, 4-03-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper and NEH reporting.
49 Friday, 3-27-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper and NEH reporting (Marek Rychlik).
  • OCR on Chinese characters using multi-class logistic regression and new approaches (Dwight Nwaigwe).
  • Data augmentation approach to handling warped text.
  • Discussion of other ongoing research.
48 Friday, 3-20-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • A deep learning pipeline for OCR on Farsi (Marek Rychlik). Abstract:

    The Arabic writing system (used, e.g., by Persian/Farsi) uses 3 forms of a character (initial, medial and final) to reflect its position in a ligature. In addition, the characters can be richly decorated with diacritics. We applied a deep learning workflow similar to a generic video processing pipeline to perform OCR on Farsi. Following the Latin script approach, we trained the system on unigrams, bigrams and augmented characters: ordinary letters decorated with diacritics. We performed preliminary validation on OCR_GS_Data, a publicly available "gold standard" dataset. We utilized only the labels of the dataset so far, and generated synthetic data by typesetting those labels (lines of text in Farsi). The performance is as expected: the system behaves well on short ligatures not involving the medial form. It is expected that after seeing the medial form in training data, the system will attain the desired performance.

  • OCR on Chinese characters using multi-class logistic regression and new approaches (Dwight Nwaigwe).
  • OCR of Latin script font Bromello achieves 100% accuracy (progress report).
  • Perfected CTC implementation in MATLAB (progress report). The use of parallelism and improved visual progress tracking.
  • Discussion of other ongoing research.
47 Friday, 3-6-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Decoding Bromello font bigrams with BiLSTM and CTC (Marek Rychlik). Abstract:

    As in the previous talk of 2/28, Bromello is used as an example of a Latin cursive font. We are working with bidirectional LSTM to perform sequence-to-sequence mapping (in contrast, in the last talk we performed sequence-to-label mapping). We use our home-grown implementation of the CTC (Connectionist Temporal Classification) layer to perform complete OCR. The results indicate that the system is not capable of inserting Graves's blank, i.e. it fails to produce "strongly predicted blanks". As a consequence, some characters disappear. An especially acute problem is repetition of the same character. The working hypothesis is that the problem is fundamental, and that it reflects the limitations of Graves's probabilistic model.

  • A recent improvement of numerical stability in our CTC implementation.
  • Discussion of other ongoing research.
  • NEH reporting.
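The role of the blank described in the 3/6 abstract above can be illustrated with a minimal Python sketch of the standard CTC collapsing rule (merge consecutive repeats, then drop blanks). The function name `ctc_collapse` and the blank symbol `-` are assumptions made for illustration; they are not taken from the group's MATLAB implementation.

```python
BLANK = "-"  # hypothetical blank symbol, chosen only for this illustration

def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level path: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# A blank between the two runs of 'l' keeps both letters.
print(ctc_collapse("hhe-ll-lloo"))  # hello
# Without that blank the repeated 'l' frames merge into one letter.
print(ctc_collapse("hhe-lllloo"))   # helo
```

If the network never predicts a blank between two identical characters, the second character cannot survive collapsing, which is consistent with the disappearing characters reported in the abstract.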
46 Friday, 2-28-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Decoding Bromello font bigrams without CTC with high accuracy (Marek Rychlik). Abstract: Bromello is a Latin cursive font for creating decorative, script texts in English. It poses similar problems to Arabic scripts for OCR. By suitably preparing training data I am able to decode Bromello texts with bidirectional LSTM without using CTC. Furthermore, LSTM is only used in sequence-to-label mapping mode.
  • Acquisition of a new Chinese labeled character database (Dwight Nwaigwe).
  • Discussion of ongoing research.
45 Friday, 2-21-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • An implementation of CTC prefix search. (Marek Rychlik, Dwight Nwaigwe)
  • How to decode Latin alphabets with PCA and nearest neighbor search? (Marek Rychlik)
  • BiLSTM+CTC experiments with the Bromello font, a Latin cursive font. (Marek Rychlik)
  • Results evaluation of OCR from commercial OCR software. (Yan Han)
  • Discussion of other ongoing research.
44 Friday, 2-14-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
43 Friday, 2-7-2020 MATH 402 Odin Fernando Eufracio Vazquez
  • Speaker: Odin Fernando Eufracio Vazquez
  • Affiliation: Centro de Investigación en Matemáticas A.C. (CIMAT)
  • Title: Nonnegative Matrix Factorization with low-rank regularization for automatic feature extraction.
  • Room: MATH 402 (not ENR2 S375)
  • Join Zoom Meeting: ID 989-919-119
  • Abstract:

    In machine learning, Nonnegative Matrix Factorization (NMF) is a method of dimensionality reduction where the nonnegative constraints in NMF impose only additive combinations. One of the challenges in NMF is to determine the rank of the factorization; the correct choice of the rank would allow us to extract better features and thus promote a part-based representation of the data.

    In this work, we propose to include a diagonal matrix D and minimize the rank of factorization through the penalization of the elements in the diagonal of D. We derive an iterative algorithm with closed formulas by alternately minimizing local cost functions.

    We demonstrate the efficacy of our algorithm by performing experiments on synthetic data, images, texts, and gene expression data sets. We show that our proposed algorithm not only estimates the factors with high precision but, by minimizing the rank of the factorization, can also learn interpretable features from the data.

42 Friday, 1-30-2020 ENR2 S375 Research, funding, collaborations Group discussion and research updates on the following topics:
  • Report on conversation with SGCI.
  • TRIPODS 2 Translational Lab proposal.
  • Current research.
41 Friday, 1-23-2020 ENR2 S375 Organizational meeting. Group discussion and research updates on the following topics:
  • A review of research papers.
  • Funding strategies.

2019 Schedule of Meetings and Talks

# Date Room Speaker Title and Abstract
40 Friday, 12-06-2019 ENR2 S375 Semester Wrap-Up Group discussion and research updates on the following topics:
  • A review of implemented OCR-related algorithms.
  • Sequence-to-label mapping with LSTM as means of recognizing isolated characters.
  • Planning for the next semester.
40 Friday, 11-21-2019 ENR2 S375 Research Updates Group discussion and research updates on the following topics:
  • A review of best algorithms for finding distance between damaged characters.
  • Automatic differentiation in MATLAB 2019b.
  • Generating W-language samples with a chaotic dynamical system (interval map).
  • Learning parameters of chaotic Lorenz attractors with standard Deep Learning tools.
39 Friday, 11-15-2019 ENR2 S375 Research Updates
  • Marek Rychlik: New features in the MATLAB Deep Learning Toolkit. Abstract: The R2019b version of MATLAB has a number of new, impressive features for machine learning. I will focus on the automatic differentiation features.
  • Group discussion of dynamic time warping for multiple outline cycles (teleportation).
38 Friday, 11-01-2019 ENR2 S375 Research Updates
  • Marek Rychlik: Dynamic time warping with and without teleportation. Abstract: Dynamic Time Warping (DTW) may be used to align time while traversing similar data. In OCR we also need teleportation (instant transfer to another signal space). Rudimentary C++ code was used for benchmarking (without teleportation yet).
  • New videos! Arabic characters on Youtube channel
  • Dwight Nwaigwe will report on current progress on the Hessian of the multi-class logistic regression loss.
  • Dylan Murphy will update us on the sweep implementation of the character outline algorithm, and related topics.
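For reference, the textbook form of Dynamic Time Warping mentioned in the abstract above can be sketched in a few lines of Python. This is the classic dynamic-programming recurrence on one-dimensional sequences with absolute difference as the local cost; it does not include the teleportation extension, and the function name `dtw_distance` is an assumption for illustration rather than the seminar's C++ code.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost of matching two samples
            # Extend the cheapest of: advance in a, advance in b, advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# The second sequence repeats one sample; warping absorbs the repeat at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The table has (n+1)(m+1) entries, so the running time is quadratic in the sequence lengths.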
37 Friday, 10-25-2019 ENR2 S375 Research updates
  • Marek Rychlik: Update on current performance on Traditional Chinese, Rotated text, Latin text, font mapping. Survey of new videos on the Youtube channel.
36 Friday, 10-18-2019 ENR2 S375 Research updates
  • NOTE: Marek Rychlik will give a talk at the TRIPODS seminar on Monday, 10-21-2019.
  • Marek Rychlik will demonstrate full OCR processing based on the character outline algorithm and cross-correlation, for Latin alphabets.
35 Friday, 10-11-2019 ENR2 S375 Dylan Murphy
  • TITLE: Line-sweeping and other incremental improvements Abstract: In the cycle-detection algorithm, the expensive step is graph-traversal. One way to improve the speed of this algorithm, as we have seen, is to use a more efficient representation of the adjacency matrix in terms of linked lists. I will present another method, which avoids the graph-traversal step entirely by connecting cycles during the edge-detection step using a linesweep-style algorithm. A naive implementation of this approach in Python produced speedups similar to the linked-list approach, reducing the processing time for a page to several seconds, down from hundreds of seconds.
  • Marek Rychlik will report on a modified algorithm capable of processing many Chinese pages in real (MATLAB) time, thanks to non-uniform sampling of character outlines. If there is time, I will discuss Dynamic Time Warping (DTW).
34 Friday, 10-04-2019 ENR2 S375 Marek Rychlik, Group Discussion
  • TITLE: Multi-page clustering of Traditional Chinese text Abstract: I will report on recent advances of Chinese text processing. The main advance is increasing the speed of computing character outlines by a factor of 100 as compared to the last report. This allows processing of entire books in acceptable time, without expensive hardware resources.
  • Research and software development updates.
34 Friday, 9-27-2019 ENR2 S375 Raymundo Navarette
  • TITLE: Bias reduction in multi-class logistic regression and the problem of separation. Abstract: We will go over the basics of Firth's method for bias reduction in maximum likelihood estimates and apply them to multi-class logistic regression. We will discuss how this and other penalization approaches remove the problem of separation, which occurs when sample sizes are small or when all classes can be separated with linear classifiers, and leads to non-existent (infinite) optimal parameters.
  • Research and software development updates.
33 Friday, 9-20-2019 ENR2 S375 Marek Rychlik, Group
  • TITLE: Latin and Chinese character outlines as means of extracting features Abstract: An important idea of OCR present in Tesseract and research papers is that character outlines are the source of features for character classification. We will discuss Fourier transform and splines as means of smoothing and approximating character boundaries. An inherent instability due to character damage will be described, and ways to address it.
  • Research and software development updates.
32 Friday, 9-13-2019 ENR2 S375 Dwight Nwaigwe
  • TITLE: Character Matching and some problems with topological classification. Abstract: Ray Smith's overview of Tesseract mentions the use of features in classification, noting that topological classification is not robust. We briefly go into some examples. Further, we note that his paper discusses some of the mechanics of character matching, which is based on clustering. We compare a clustering method devised for cursive fonts.
31 Friday, 9-6-2019 ENR2 S375 Group
  • Discussion of Tesseract architecture based on Ray Smith's paper.
    • The use of polygonal approximations
    • Maximally chopped characters
    • (x,y,theta) as features
    • Re-assembling chopped characters
  • Character outline calculation with MATLAB.
  • Top-down, bottom-up, adaptive processing.
  • Research updates.
30 Friday, 8-30-2019 ENR2 S375 Group
  • Planning for the semester; papers to read. One paper: Overview of Tesseract OCR Engine by Ray Smith.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Research updates.
  • MATLAB image datastores - the MATLAB ways to prepare training data.
30 Friday, 8-23-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Report on 20% improvement on digits 0-3 of the MNIST dataset. Abstract: Using the new regularization technique in the "Patternnet" paper I was able to reduce the number of errors from 610 to about 490, which is approximately 20%. The regularization shares some features of the LASSO method in statistics.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
29 Friday, 8-16-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Simplifying the "Patternnet" approach; linear programming;
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
28 Friday, 8-09-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates: Connection between the "Patternnet" multi-class logistic regression and Support Vector Machines, and Linear Programming aspects.
  • Tesseract 5 new features (Yan Han).
27 Friday, 8-02-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on multi-logistic regression paper (Marek Rychlik).
  • Tesseract 5 new features (Yan Han).
26 Friday, 7-26-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Caching Unicode character images in MATLAB (Marek Rychlik). Abstract: Rendering Traditional Chinese characters from Unicode glyphs is a time-consuming operation, especially in MATLAB. We need the characters to be at least 60 pixels in size, which generates about 4,000 bits per character (one of 60,000+). Therefore, rather than generating characters on the fly, we reuse generated images by caching them. Two caching strategies are implemented: in RAM and in an SQLite database. Bit packing is used as a form of compression for the database implementation. The speedup achieved is 5- to 10-fold.
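The caching idea in the abstract above (a RAM cache backed by an SQLite table, with bit packing of binary glyph images) can be sketched as follows. This is a minimal Python illustration and not the MATLAB code presented in the talk; the class `GlyphCache`, the table layout, and the `render` callback are all hypothetical names introduced here.

```python
import sqlite3

def pack_bits(pixels):
    """Pack a flat list of 0/1 pixels into bytes, 8 pixels per byte (MSB first)."""
    padded = pixels + [0] * (-len(pixels) % 8)
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def unpack_bits(data, n):
    """Inverse of pack_bits: recover the first n pixels."""
    bits = []
    for byte in data:
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits[:n]

class GlyphCache:
    """Two-level cache: a dict in RAM backed by an SQLite table (hypothetical layout)."""

    def __init__(self, path=":memory:", size=60):
        self.size = size
        self.ram = {}
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS glyphs (ch TEXT PRIMARY KEY, bits BLOB)")

    def get(self, ch, render):
        """Return the pixels of ch, calling the slow render(ch) only on a full miss."""
        if ch in self.ram:
            return self.ram[ch]
        row = self.db.execute(
            "SELECT bits FROM glyphs WHERE ch = ?", (ch,)).fetchone()
        if row is not None:
            pixels = unpack_bits(row[0], self.size * self.size)
        else:
            pixels = render(ch)  # slow path: render the glyph once
            self.db.execute(
                "INSERT INTO glyphs VALUES (?, ?)", (ch, pack_bits(pixels)))
            self.db.commit()
        self.ram[ch] = pixels
        return pixels

# A 60x60 binary image is 3,600 bits, i.e. 450 bytes after packing.
print(len(pack_bits([0] * 3600)))  # 450
```

Packing stores one pixel per bit instead of one per byte, which is where the compression in the database comes from in this sketch.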
25 Friday, 7-19-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • MATLAB MEX interface to Tesseract 4 (Marek Rychlik). Abstract: I wrote C++ code which effectively wraps Tesseract 4 in MATLAB. It entirely bypasses the Vision Toolbox toolkit wrapper, which only supports version 3 of Tesseract. I will explain the implementation and capabilities. I will also demonstrate its application to provide a complete OCR system for Traditional Chinese, using custom page segmentation (class PageScan) and LSTM-based character recognition.
  • Update on Sayyed's work (Sayyed Vazirizade) on Arabic/Persian/Pashto page segmentation.
  • Review of OCR papers listed at the Tesseract site (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
24 Friday, 7-12-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on Sayyed's work.
  • New code for Chinese page segmentation (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
23 Friday, 7-05-2019 ENR2 S375 NO MEETING. 4th of July break.
22 Friday, 6-28-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Preparing for the meeting with Rep. Grijalva's office.
  • Update on Sayyed's work.
  • MATLAB techniques for preparing training datasets: structures, cell arrays, .mat files.
  • Unsupervised classification of Chinese characters.
22 Friday, 6-21-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Updates on the multi-class logistic regression research. Kronecker product and Hadamard product.
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
21 Friday, 6-14-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
20 Friday, 6-7-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • Review of current segmentation code for Pashto/Persian. MATLAB 'system' call under Windows.
  • MATLAB code for image processing and page segmentation algorithms.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
19 Friday, 5-31-2019 ENR2 S375 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
18 Friday, 5-24-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
17 Friday, 5-17-2019 ENR2 S395 Group discussion This first meeting of Summer 2019 will be devoted to:
  • Summer research themes; everyone is invited to talk for a few minutes about their plans and problems;
  • Review the new collection at the Library of Congress of Chinese documents, needing OCR.
16 Friday, 5-11-2019 ENR2 S395 NO MEETING NO MEETING.
15 Friday, 5-03-2019 ENR2 S395 Group discussion Due to Exam Session, the agenda is tentative:
  • Research progress (Marek Rychlik, Dwight Nwaigwe, Aaron Peterson, Ryan Coatney)
  • New document choices (Yan Han)
  • Tesseract training (Dylan Murphy)
14 Friday, 4-26-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • The problem of "loose clouds" in Chinese text (Yan Han, Marek Rychlik)
  • Building Web applications with MATLAB (Marek Rychlik)
  • MATLAB as an application delivery system: in-MATLAB, standalone with free MATLAB Runtime, Web with MATLAB Web application server
  • Ideas and obstacles of the Tier 2 proposal
  • Research and development ideas for the summer
13 Friday, 4-19-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • Chinese character experiments (Yan Han and his students)
  • Building standalone applications with MATLAB (Marek Rychlik)
  • GitHub limitation on file size
  • New installer site on BitBucket
  • Conversation with an NEH Officer regarding Tier 2 proposal
12 Friday, 4-12-2019 ENR2 S395 Marek Rychlik Research talk, and a group discussion of current topics:
  • Marek Rychlik (University of Arizona) TITLE: Character recognition with multi-class logistic regression. ABSTRACT: Multi-class logistic regression network is simple and it has a good training algorithm and quite impressive accuracy on the MNIST dataset. I will discuss my recent paper on this subject https://arxiv.org/abs/1903.12600 and the implementation in our GitHub repository.
11 Friday, 4-5-2019 ENR2 S395 Group discussion, Marek Rychlik, Yan Han We will cover a variety of topics:
  • NEH Grant proposal Tier 2 proposal content.
  • Report on meeting with Clayton Morrison's group (Marek Rychlik).
  • Tier 2 proposal and the ISO standard (Yan Han).
  • Report on a program im2latex (Marek Rychlik); this program can convert images to math equations.
10 Friday, 3-29-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • The mechanics of training Tesseract.
  • Preparations for Tesseract training runs on multi-core computer(s).
  • NEH Grant proposal writing - Tier 2.
9 Friday, 3-22-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • New code on GitHub: an implementation of CTC in MATLAB
  • The page segmentation algorithm
  • Training Tesseract
  • NEH Grant proposal writing - Tier 2
8 Friday, 3-15-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will speak on CTC. ABSTRACT: CTC (Connectionist Temporal Classification) is a probability model for segmentation of outputs of recurrent neural networks into meaningful chunks. The technique is used for handwriting and script language segmentation, and for natural speech. In Deep Learning, CTC is simply a kind of rather sophisticated loss function. I will discuss the paper of Alex Graves (currently at DeepMind) who introduced this method about 10 years ago to handwriting recognition and OCR.
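To make the "loss function" view of CTC in the abstract above concrete, here is a minimal Python sketch of the standard forward (alpha) recursion, computed in log space for numerical stability. It is an illustration under stated assumptions, not the group's MATLAB code: the function names, the choice of index 0 for the blank, and the list-of-lists input format are all introduced here.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_neg_log_likelihood(log_probs, label, blank=0):
    """CTC loss for one sequence.

    log_probs[t][k] is the log-probability of symbol k at frame t;
    label is a list of non-blank symbol indices.
    """
    # Extended label: blanks before, between, and after the symbols.
    ext = [blank]
    for k in label:
        ext += [k, blank]
    S, T = len(ext), len(log_probs)
    ninf = float("-inf")
    alpha = [ninf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [ninf] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping the blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(terms) + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end in the last symbol or in the trailing blank.
    tail = [alpha[S - 1]]
    if S > 1:
        tail.append(alpha[S - 2])
    return -logsumexp(tail)

# Two frames, symbols {blank, a}, uniform probabilities: the paths
# (a,a), (a,blank), (blank,a) all collapse to "a", so p = 3/4.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
print(math.exp(-ctc_neg_log_likelihood(uniform, [1])))
```

With the same uniform inputs, the label `[1, 1]` has probability zero over two frames, because a repeated character needs at least one blank frame between its two occurrences.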
7 Friday, 3-1-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will talk about the Tesseract C++ API. Some C++ programming examples will be discussed. One example allows translation of Pashto ligatures to Unicode.
6 Friday, 2-22-2019 ENR2 S395 Dylan Murphy
  • Dylan Murphy (University of Arizona) will discuss OCR software. Open source packages Kraken and Tesseract will be discussed. The talk will cover use and system architecture of these systems, as well as the process of training for new language recognition.
5 Friday, 2-15-2019 ENR2 S395 Sayyed Vazirizade
  • Sayyed Vazirizade (University of Arizona) will review Persian OCR software.
4 Friday, 2-8-2019 ENR2 S395 Ryan Coatney, Yan Han
  • Ryan Coatney (University of Arizona) will continue talking about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (est. 25 min).
  • Yan Han (University of Arizona) will talk about APIs for embedding text in PDF (est. 25 min).
3 Friday, 2-1-2019 ENR2 S395 Mike Maizels, Ryan Coatney
  • Mike Maizels (Harvard and University of Arkansas) will discuss an arts-related project involving OCR (15 min).
  • Ryan Coatney (University of Arizona) will talk about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (30 min).
2 Friday, 1-25-2019 ENR2 S395 Marek Rychlik A method for Chinese OCR using Hough and Fourier transforms. I will explain the algorithm published in our GitHub repository. Also, I will briefly describe the selected papers which we can collectively study. The slides of this talk are available.
1 Friday, 1-18-2019 ENR2 S395 Organizational meeting. Agenda will include:
  • Introductions
  • Description of the NEH grant research
  • Resources for Pashto and Chinese

Zoom recordings

They are available on the restricted page of this website. However, you need to ask the organizers for the credentials to access this page.

The organizers