#
|
Date
|
Room
|
Speaker
|
Title and Abstract
|
40
|
Friday, 12-06-2019
|
ENR2 S375
|
Semester Wrap-Up
|
Group discussion and research updates on the following topics:
-
A review of implemented OCR-related algorithms.
-
Sequence-to-label mapping with LSTM as a means of
recognizing isolated characters.
-
Planning for the next semester.
|
40
|
Friday, 11-21-2019
|
ENR2 S375
|
Research Updates
|
Group discussion and research updates on the following topics:
-
A review of the best algorithms for finding the
distance between damaged characters.
-
Automatic differentiation in MATLAB 2019b.
-
Generating W-language samples with a chaotic
dynamical system (interval map).
-
Learning parameters of chaotic Lorenz
attractors with standard Deep Learning tools.
|
39
|
Friday, 11-15-2019
|
ENR2 S375
|
Research Updates
|
-
Marek Rychlik: New features in the MATLAB Deep
Learning Toolbox. Abstract: The R2019b version of
MATLAB has a number of new, impressive features for
machine learning. I will focus on the automatic
differentiation features.
-
Group discussion of dynamic time warping
for multiple outline cycles (teleportation).
|
38
|
Friday, 11-01-2019
|
ENR2 S375
|
Research Updates
|
-
Marek Rychlik: Dynamic time warping with and without
teleportation. Abstract: Dynamic Time Warping (DTW) may
be used to align time while traversing similar data.
In OCR we also need teleportation (instant transfer
to another signal space). Rudimentary C++ code was
used for benchmarking (without teleportation yet).
- New videos! Arabic characters on the YouTube channel.
-
Dwight Nwaigwe will report on current progress on
the Hessian of the multi-class logistic regression
loss.
-
Dylan Murphy will update us on the sweep implementation
of the character outline algorithm, and related
topics.
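The Dynamic Time Warping idea from the first abstract above can be illustrated with a minimal sketch. This is the textbook DTW recurrence without teleportation, not the C++ benchmark code mentioned in the talk; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    Runs in O(len(a) * len(b)) time. The local cost |a[i] - b[j]| is an
    illustrative choice, not necessarily the one used in the talk.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Advance in a only, in b only, or in both sequences at once.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero, because the repeated sample is absorbed by the warping of the time axis.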
|
37
|
Friday, 10-25-2019
|
ENR2 S375
|
Research updates
|
-
Marek Rychlik: Update on current performance on
Traditional Chinese, Rotated text, Latin text, font
mapping. Survey of new videos on the YouTube channel.
|
36
|
Friday, 10-18-2019
|
ENR2 S375
|
Research updates
|
-
NOTE: Marek Rychlik will give a talk at the TRIPODS
seminar on Monday, 10-21-2019.
-
Marek Rychlik will demonstrate full OCR processing
based on the character outline algorithm and
cross-correlation, for Latin alphabets.
|
35
|
Friday, 10-11-2019
|
ENR2 S375
|
Dylan Murphy
|
-
TITLE:
Line-sweeping and other incremental improvements
Abstract:
In the cycle-detection algorithm,
the expensive step is graph-traversal. One way to
improve the speed of this algorithm, as we have
seen, is to use a more efficient representation of
the adjacency matrix in terms of linked lists. I
will present another method, which avoids the
graph-traversal step entirely by connecting cycles
during the edge-detection step using a
linesweep-style algorithm. A naive implementation of
this approach in Python produced speedups similar to
the linked-list approach, reducing the processing
time for a page to several seconds, down from
hundreds of seconds.
-
Marek Rychlik will report on a modified algorithm
capable of processing many Chinese pages in real (MATLAB) time,
thanks to non-uniform sampling of character outlines. If there
is time, I will discuss Dynamic Time Warping (DTW).
|
34
|
Friday, 10-04-2019
|
ENR2 S375
|
Marek Rychlik, Group Discussion
|
-
TITLE:
Multi-page clustering of Traditional
Chinese text
Abstract:
I will report on recent advances in
Chinese text processing. The main advance is
increasing the speed of computing character outlines
by a factor of 100 as compared to the last report.
This allows processing of entire books in acceptable
time, without expensive hardware resources.
-
Research and software development updates.
|
34
|
Friday, 9-27-2019
|
ENR2 S375
|
Raymundo Navarette
|
-
TITLE:
Bias reduction in multi-class logistic
regression and the problem of separation.
Abstract:
We will go over the basics of
Firth's method for bias reduction in maximum
likelihood estimates and apply them to multi-class
logistic regression. We will discuss how this and
other penalization approaches remove the problem of
separation, which occurs when sample sizes are small
or when all classes can be separated with linear
classifiers, and leads to non-existent (infinite)
optimal parameters.
-
Research and software development updates.
|
33
|
Friday, 9-20-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
-
TITLE:
Latin and Chinese character outlines as
a means of extracting features
Abstract:
An important idea in OCR, present in
Tesseract and in research papers, is that character
outlines are the source of features for character
classification. We will discuss Fourier transform
and splines as a means of smoothing and approximating
character boundaries. An inherent instability due
to character damage will be described, along with
ways to address it.
-
Research and software development updates.
|
32
|
Friday, 9-13-2019
|
ENR2 S375
|
Dwight Nwaigwe
|
-
TITLE:
Character Matching and some problems with topological classification.
Abstract:
Ray Smith's overview of Tesseract
mentions the use of features in classification,
noting that topological classification is not
robust. We briefly go into some examples. Further,
we note that his paper discusses some of the
mechanics of character matching, which is based on
clustering. We compare this with a clustering method
devised for cursive fonts.
|
31
|
Friday, 9-6-2019
|
ENR2 S375
|
Group
|
-
Discussion of Tesseract architecture based on Ray Smith's paper.
- The use of polygonal approximations
- Maximally chopped characters
- (x,y,theta) as features
- Re-assembling chopped characters
-
Character outline calculation with MATLAB.
-
Top-down, bottom-up, adaptive processing.
-
Research updates.
|
30
|
Friday, 8-30-2019
|
ENR2 S375
|
Group
|
-
Planning for the semester; papers to read. One paper:
An Overview of the Tesseract OCR Engine
by Ray Smith.
-
New MATLAB code updates: a parameter-tweaking GUI.
-
Research updates.
-
MATLAB image datastores: the MATLAB way to prepare training
data.
|
30
|
Friday, 8-23-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Research updates (Marek Rychlik): Report on a 20% improvement
on digits 0-3 of the MNIST dataset. Abstract: Using the new
regularization technique from the "Patternnet" paper, I was able
to reduce the number of errors from 610 to about 490, a reduction
of approximately 20%. The regularization shares some features with
the LASSO method in statistics.
-
New MATLAB code updates: a parameter-tweaking GUI.
-
Planning for the new semester.
|
29
|
Friday, 8-16-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Research updates (Marek Rychlik): Simplifying the "Patternnet"
approach; linear programming.
-
New MATLAB code updates: a parameter-tweaking GUI.
-
Planning for the new semester.
|
28
|
Friday, 8-09-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Research updates: Connection between the "Patternnet"
multi-class logistic regression and Support Vector Machines,
and Linear Programming aspects.
-
Tesseract 5 new features (Yan Han).
|
27
|
Friday, 8-02-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Update on the multi-class logistic regression paper (Marek Rychlik).
-
Tesseract 5 new features (Yan Han).
|
26
|
Friday, 7-26-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Caching Unicode character images in MATLAB (Marek
Rychlik). Abstract: Rendering Traditional Chinese
characters from Unicode glyphs is a time-consuming
operation, especially in MATLAB. We need the
characters to be at least 60 pixels in size, which
generates about 4,000 bits per character (one of
60,000+). Therefore, rather than generating
characters on the fly, we reuse generated images by
caching them. Two caching strategies are
implemented: in RAM and in an SQLite
database. Bit packing is used as a form of
compression for the database implementation. The
speedup achieved is 5- to 10-fold.
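The two caching strategies and the bit packing described in this abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB implementation from the talk: the class `GlyphCache`, the table name `glyphs`, and the `render` callback (standing in for the slow Unicode glyph renderer) are all hypothetical. A 60x60 binary glyph has 3,600 pixels, which pack into 450 bytes.

```python
import sqlite3

SIZE = 60  # glyph side in pixels, as in the abstract; 60 * 60 = 3600 bits


def pack_bits(bits):
    """Pack a flat list of 0/1 pixels into bytes, 8 pixels per byte (MSB first)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(blob, count):
    """Inverse of pack_bits: recover the first `count` pixels."""
    return [(blob[i // 8] >> (7 - i % 8)) & 1 for i in range(count)]


class GlyphCache:
    """Two-level cache: a dict in RAM, backed by an SQLite table."""

    def __init__(self, render, path=":memory:"):
        # `render` is a hypothetical slow renderer: code point -> list of 0/1.
        self.render = render
        self.ram = {}
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS glyphs (cp INTEGER PRIMARY KEY, bits BLOB)"
        )

    def get(self, cp):
        """Return the glyph for code point `cp`, rendering it at most once."""
        if cp in self.ram:
            return self.ram[cp]
        row = self.db.execute(
            "SELECT bits FROM glyphs WHERE cp = ?", (cp,)
        ).fetchone()
        if row is not None:
            bits = unpack_bits(row[0], SIZE * SIZE)
        else:
            bits = self.render(cp)
            self.db.execute(
                "INSERT INTO glyphs (cp, bits) VALUES (?, ?)",
                (cp, pack_bits(bits)),
            )
            self.db.commit()
        self.ram[cp] = bits
        return bits
```

Passing a file path instead of `":memory:"` would let the packed glyphs persist between sessions, which is where a database-backed cache pays off.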
|
25
|
Friday, 7-19-2019
|
ENR2 S375
|
Marek Rychlik, Group
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
MATLAB MEX interface to Tesseract 4 (Marek Rychlik).
Abstract: I wrote C++ code which effectively wraps Tesseract 4
in MATLAB. It entirely bypasses the Computer Vision Toolbox
wrapper, which only supports version 3 of Tesseract. I will explain
the implementation and capabilities. I will also demonstrate its
application to provide a complete OCR system for Traditional Chinese,
using custom page segmentation (class PageScan) and LSTM-based
character recognition.
-
Update on Sayyed's work (Sayyed Vazirizade) on
Arabic/Persian/Pashto page segmentation.
-
Review of OCR papers listed at the Tesseract site (Marek Rychlik).
-
Commentary on the "Patternnet" paper: the existence question.
-
Forthcoming meeting with Rep. Grijalva's office.
|
24
|
Friday, 7-12-2019
|
ENR2 S375
|
Group discussion
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Update on Sayyed's work.
-
New code for Chinese page segmentation (Marek Rychlik).
-
Commentary on the "Patternnet" paper: the existence question.
-
Forthcoming meeting with Rep. Grijalva's office.
|
23
|
Friday, 7-05-2019
|
ENR2 S375
|
NO MEETING.
|
4th of July break.
|
22
|
Friday, 6-28-2019
|
ENR2 S375
|
Group discussion
|
Note that for the rest of the summer the room is
ENR2 S375
.
Themes to be covered:
-
Preparing for the meeting with Rep. Grijalva's office.
-
Update on Sayyed's work.
-
MATLAB techniques for preparing training datasets: structures,
cell arrays, .mat files.
-
Unsupervised classification of Chinese
characters.
|
22
|
Friday, 6-21-2019
|
ENR2 S375
|
Group discussion
|
Note that for the rest of the summer the room is
ENR2 S375
.
This meeting will be devoted to:
-
Updates on the multi-class logistic regression research.
Kronecker product and Hadamard product.
-
Image processing training.
-
Sayyed's adaptive thresholding code.
-
Unsupervised classification of Chinese
characters.
|
21
|
Friday, 6-14-2019
|
ENR2 S375
|
Group discussion
|
Note that for the rest of the summer the room is
ENR2 S375
.
This meeting will be devoted to:
-
Image processing training.
-
Sayyed's adaptive thresholding code.
-
Unsupervised classification of Chinese
characters.
-
Serialization and using .mat files.
|
20
|
Friday, 6-7-2019
|
ENR2 S395
|
Group discussion
|
This meeting will be devoted to:
-
Image processing training.
-
Review of current segmentation code for Pashto/Persian.
MATLAB 'system' call under Windows.
-
MATLAB code for image processing and
page segmentation algorithms.
-
Unsupervised classification of Chinese
characters.
-
Serialization and using .mat files.
|
19
|
Friday, 5-31-2019
|
ENR2 S375
|
Group discussion
|
This meeting will be devoted to:
-
Image processing training.
-
MATLAB code for image processing.
-
Page segmentation algorithms.
|
18
|
Friday, 5-24-2019
|
ENR2 S395
|
Group discussion
|
This meeting will be devoted to:
-
Image processing training.
-
MATLAB code for image processing.
-
Page segmentation algorithms.
|
17
|
Friday, 5-17-2019
|
ENR2 S395
|
Group discussion
|
This first meeting of Summer 2019 will be
devoted to:
-
Summer research themes; everyone
is invited to talk for a few minutes about their plans
and problems;
-
Review the new collection of Chinese documents at the
Library of Congress that need OCR.
|
16
|
Friday, 5-11-2019
|
ENR2 S395
|
NO MEETING
|
NO MEETING.
|
15
|
Friday, 5-03-2019
|
ENR2 S395
|
Group discussion
|
Due to Exam Session, the agenda is tentative:
- Research progress
(Marek Rychlik, Dwight Nwaigwe, Aaron Peterson, Ryan Coatney)
- New document choices (Yan Han)
- Tesseract training (Dylan Murphy)
|
14
|
Friday, 4-26-2019
|
ENR2 S395
|
Group discussion
|
We will cover a variety of topics, including:
- The problem of "loose clouds" in Chinese text (Yan Han, Marek Rychlik)
- Building Web applications with MATLAB (Marek Rychlik)
- MATLAB as an application delivery system: in-MATLAB, standalone
with free MATLAB Runtime, Web with MATLAB Web application server
- Ideas and obstacles of the Tier 2 proposal
- Research and development ideas for the summer
|
13
|
Friday, 4-19-2019
|
ENR2 S395
|
Group discussion
|
We will cover a variety of topics, including:
- Chinese character experiments (Yan Han and his students)
- Building standalone applications with MATLAB (Marek Rychlik)
- GitHub limitation on file size
- New installer site on BitBucket
- Conversation with an NEH Officer regarding Tier 2 proposal
|
12
|
Friday, 4-12-2019
|
ENR2 S395
|
Marek Rychlik
|
Research talk, and a group discussion of current topics:
-
Marek Rychlik (University of Arizona)
TITLE:
Character recognition with multi-class
logistic regression.
ABSTRACT:
The multi-class logistic regression network is simple,
has a good training algorithm, and achieves quite impressive
accuracy on the MNIST dataset. I will discuss my recent
paper on this subject,
https://arxiv.org/abs/1903.12600
and the implementation in our GitHub repository.
|
11
|
Friday, 4-5-2019
|
ENR2 S395
|
Group discussion, Marek Rychlik, Yan Han
|
We will cover a variety of topics:
- NEH Grant Tier 2 proposal content.
- Report on meeting with Clayton Morrison's group (Marek Rychlik).
- Tier 2 proposal and the ISO standard (Yan Han).
-
Report on a program im2latex (Marek Rychlik); this
program can convert images to math equations.
|
10
|
Friday, 3-29-2019
|
ENR2 S395
|
Group discussion
|
We will discuss various on-going efforts. This includes:
- The mechanics of training Tesseract.
- Preparations for Tesseract training runs on multi-core computer(s).
- NEH Grant proposal writing - Tier 2.
|
9
|
Friday, 3-22-2019
|
ENR2 S395
|
Group discussion
|
We will discuss various on-going efforts. This includes:
- New code on GitHub: an implementation of CTC in MATLAB
- The page segmentation algorithm
- Training Tesseract
- NEH Grant proposal writing - Tier 2
|
8
|
Friday, 3-15-2019
|
ENR2 S395
|
Marek Rychlik
|
-
Marek Rychlik (University of Arizona)
will speak on CTC.
ABSTRACT:
CTC (Connectionist Temporal Classification)
is a probability model for segmentation of outputs
of recurrent neural networks into meaningful
chunks. The technique is used for handwriting and
script language segmentation, and for natural
speech. In Deep Learning, CTC is simply a kind of
rather sophisticated loss function. I will discuss
the paper of Alex Graves (currently at DeepMind)
who introduced this method about 10 years ago to
handwriting recognition and OCR.
|
7
|
Friday, 3-1-2019
|
ENR2 S395
|
Marek Rychlik
|
-
Marek Rychlik (University of Arizona)
will talk about the Tesseract C++ API. Some C++ programming
examples will be discussed. One example allows translation
of Pashto ligatures to Unicode.
|
6
|
Friday, 2-22-2019
|
ENR2 S395
|
Dylan Murphy
|
-
Dylan Murphy (University of Arizona)
will discuss OCR software. The open-source packages
Kraken and Tesseract will be discussed. The talk
will cover the use and system architecture of these
systems, as well as the process of training for new
language recognition.
|
5
|
Friday, 2-15-2019
|
ENR2 S395
|
Sayyed Vazirizade
|
-
Sayyed Vazirizade (University of Arizona)
will review Persian OCR software.
|
4
|
Friday, 2-8-2019
|
ENR2 S395
|
Ryan Coatney, Yan Han
|
-
Ryan Coatney (University of Arizona)
will continue talking about a paper by Kobus et al., applying Gaussian
processes to modeling 1-dimensional structures (plants), and potential applications to OCR (est. 25 min).
-
Yan Han (University of Arizona)
will talk about APIs for embedding text in PDF (est. 25 min).
|
3
|
Friday, 2-1-2019
|
ENR2 S395
|
Mike Maizels, Ryan Coatney
|
-
Mike Maizels (Harvard and University of Arkansas)
will discuss an arts-related project involving OCR (15 min).
-
Ryan Coatney (University of Arizona)
will talk about a paper by Kobus et al., applying Gaussian
processes to modeling 1-dimensional structures (plants), and potential applications to OCR (30 min).
|
2
|
Friday, 1-25-2019
|
ENR2 S395
|
Marek Rychlik
|
A method for Chinese OCR using Hough and Fourier
transforms.
I will explain the algorithm published in our GitHub repository.
Also, I will briefly describe the selected papers which we can
collectively study. The slides of this talk are available.
|
1
|
Friday, 1-18-2019
|
ENR2 S395
|
|
Organizational meeting. Agenda will include:
- Introductions
- Description of the NEH grant research
- Resources for Pashto and Chinese
|