Multi-lingual Optical Character Recognition Seminar

Spring 2023 Schedule of Meetings and Talks (semester theme: the language of medical forms)

Usual meeting time: Monday, 10:00am - 11:00am, Room TBD. Meetings will be dual mode (live and on Zoom). Use this link to connect.
# Date Room Speaker or Topic Agenda or Title and Abstract
164 Monday, 01-30-2023 Zoom Only, Zoom Group discussion Agenda items:
    • Soteria updates (Marek Rychlik)
      • MATLAB, other software installed
      • UNOS data accessible
      • Test run: extraction of formatted text from pure PDF completed
    • Information on new projects related to liver (Marek Rychlik, Yan Han)
      • Collaboration with Ali Bilgin
      • Data request made
      • Main component is processing CT scans
    • Talks
      • Forthcoming talk on parallel processing with MATLAB - 1/20, RTG seminar (Marek)
      • Past talk (Wednesday, 1/25) at Environmental Health Sciences (Marek, Bekir)
      • Forthcoming talk at Northeastern University planned in May (Marek, Bekir?)
    • Graduate student project updates.
164 Monday, 01-23-2023 Zoom Only, Zoom Group discussion Agenda items:
    • Updates
      • Soteria and forthcoming meeting with staff
      • Adding liver transplant data, CT scans, meeting with UNOS
      • Accessing UNOS data
      • Talk at the Environmental Health Sciences on 1/25 (Drachman A118, Co-speakers: Marek Rychlik, Bekir Tanriover)
      • Talk to IARPA (Marek Rychlik)
    • Graduate student project updates.
    • Review of summer research (Marek Rychlik)
      • Motivation 1: methods relevant to off-line handwriting recognition
      • Motivation 2 (original): robust algorithms for identifying table boundaries
      • Graph theory methods, MATLAB functions: bwdist, bwdistgeodesic, minspantree
163 Monday, 01-16-2023 Zoom Only, Zoom Group discussion NOTE: This organizational meeting is Zoom-only due to a school holiday. Agenda items:
  • Updates
    • Soteria is on-line
    • UNOS data are available (2015-2022)
  • Graduate student project updates.

Fall 2022 Schedule of Meetings and Talks (semester theme: the language of medical forms)

Usual meeting time: Monday, 12:00pm - 12:50pm, Room ENR2 S375. Meetings will be dual mode (live and on Zoom). Use this link to connect.
# Date Room Speaker or Topic Agenda or Title and Abstract
162 Monday, 12-05-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Semester wrap-up; what's next?
    • Soteria is still being set up
    • A special topics course is running, course description in D2L
    • Grant decision in February
    • Other funding
  • Graduate student project updates.
  • Update on the PDF parser: decoding of compressed streams implemented (Marek Rychlik).
161 Monday, 11-28-2022 ENR2 S375, Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Building a true PDF parser (Part 2)
  • Abstract: The parser discussed in the previous talk now has the ability to parse sizable PDF files, such as the medical forms. It is capable of constructing an AST (Abstract Syntax Tree) of a PDF file and reconstructing the PDF file from it (a powerful test of correctness!). I will discuss or review:
    1. The PDF structure based on COS (Carousel Object System, ~1993).
    2. Indirect objects and the 'xref' (Cross-Reference Table)
    3. Streams, compression and resulting challenges in lexical analysis
    4. Next steps: interpreting graphics operations
    5. Tools, such as PDFExplorer (Windows only), pdfreader ("a Pythonic API to PDF documents") and RUPS (built on the iText Java library).
  • Graduate student project updates.
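Item 2 of the abstract above (indirect objects and the 'xref' table) can be illustrated with a small example. The following is a minimal sketch in Python, not the bison/flex parser presented in the talk. It assumes a classic, uncompressed cross-reference section given as text; the function name `parse_xref` and the sample data are hypothetical.

```python
def parse_xref(text):
    """Parse a classic (uncompressed) PDF cross-reference section.

    Returns {object_number: (first_field, generation, in_use)}.
    For in-use entries ('n') the first field is the byte offset of the
    object; for free entries ('f') it is the next free object number.
    """
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("not an xref section")
    table = {}
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and entry count.
        first, count = map(int, lines[i].split())
        for k in range(count):
            field, gen, flag = lines[i + 1 + k].split()
            table[first + k] = (int(field), int(gen), flag == "n")
        i += 1 + count
    return table


# Hypothetical sample section with one free and two in-use objects.
SAMPLE = """xref
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
trailer
"""
```

Calling `parse_xref(SAMPLE)` maps object 1 to byte offset 17 and object 2 to byte offset 81. Cross-reference streams, which are themselves compressed, would need the stream decoding discussed in item 3.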
160 Monday, 11-21-2022 ENR2 S375, Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Building a true PDF parser
  • Abstract: A new software tool will be discussed and demonstrated: a PDF parser utilizing compiler development tools bison and flex. In previous talks we applied these tools to build a parser for the text extracted from PDF. In this talk the same tools will be applied to parse the PDF language. I will talk about PDF as a programming language, including parsing problems it creates, and how they are solved.
  • Graduate student project updates.
159 Monday, 11-14-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Updates on Soteria accounts and overview (Marek Rychlik, Yan Han).
  • The first impression of the data (Marek Rychlik, Yan Han)
  • Graduate student project updates.
  • PDF standard discussion and updates.
158 Monday, 11-07-2022 ENR2 S375, Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Parsing medical forms with compiler development tools (continuation)
  • Abstract: In the first part of this talk we focused on making a lexical analyzer using flex. In this talk, we will discuss building a parser using bison (the successor of yacc). We will discuss the grammar rules and the attached semantic actions. We will discuss the semantics. Our parser builds a tree which is called an abstract syntax tree, or AST. We will discuss the data structures stored in the nodes of AST. A real document (the DCD form) will be parsed.
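The two stages named in this abstract (lexical analysis, then parsing into an AST) can be sketched in a few lines. This is a hand-written Python illustration, not the flex/bison code from the talk. The token set, the toy "key: value" grammar and the tuple-based AST nodes are all assumptions made for the example.

```python
import re

# Token specification, playing the role of flex rules (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z_]+"),
    ("COLON", r":"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Lexical analysis: turn text into a list of (kind, value) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse(tokens):
    """Parse tokens into an AST of nested tuples.

    Hypothetical grammar:
        form  -> entry*
        entry -> WORD+ COLON value NEWLINE
        value -> NUMBER | WORD+
    """
    entries, i = [], 0
    while i < len(tokens):
        if tokens[i][0] == "NEWLINE":  # allow blank lines
            i += 1
            continue
        key_words = []
        while i < len(tokens) and tokens[i][0] == "WORD":
            key_words.append(tokens[i][1])
            i += 1
        if not key_words or i >= len(tokens) or tokens[i][0] != "COLON":
            raise SyntaxError("expected 'key:'")
        i += 1  # consume COLON
        if i < len(tokens) and tokens[i][0] == "NUMBER":
            value = ("number", float(tokens[i][1]))
            i += 1
        else:
            words = []
            while i < len(tokens) and tokens[i][0] == "WORD":
                words.append(tokens[i][1])
                i += 1
            if not words:
                raise SyntaxError("expected a value")
            value = ("text", " ".join(words))
        if i < len(tokens):
            if tokens[i][0] != "NEWLINE":
                raise SyntaxError("expected end of line")
            i += 1
        # Semantic action: build one AST node per entry.
        entries.append(("entry", " ".join(key_words), value))
    return ("form", entries)
```

For example, `parse(tokenize("heart rate: 72\nstatus: stable\n"))` yields a `form` node with two `entry` children. A generated bison parser would attach the same kind of node-building code to grammar rules as semantic actions.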
157 Monday, 10-31-2022 ENR2 S375, Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Parsing medical forms with compiler development tools
  • Abstract: An important medical form in kidney transplant is the DCD flowsheet form, including tabular content in the form of a time series. I will explain how to retrieve the data from this kind of form, when it is "pure PDF" (no OCR necessary). After extracting text from PDF while preserving layout, the problem becomes that of parsing text (NLP). As the text is well structured, we construct its grammar and build a lexical analyzer, using tools normally used by compiler developers: bison and flex. I will discuss the first implementation of this software.
156 Monday, 10-24-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Update on data, Soteria and funding (a brief summary)
  • Pending projects updates, Q&A (grad students)
  • An intro to regular grammars and regular expressions (Marek Rychlik)
    • Basics of formal language theory
    • What is a regular grammar?
    • What is a regular expression?
    • An example: parsing numerals
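The numeral-parsing example lends itself to a short illustration. The following is a minimal sketch in Python, not the material shown at the meeting. The pattern and the helper name `parse_numeral` are assumptions chosen for the example.

```python
import re

# Regular expression for a decimal numeral: optional sign, digits with an
# optional fractional part (or a bare fraction), and an optional exponent.
NUMERAL = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")


def parse_numeral(text):
    """Return the numeral as a float if the whole string matches, else None."""
    if NUMERAL.fullmatch(text):
        return float(text)
    return None
```

Here `parse_numeral("-3.5e2")` is accepted while `parse_numeral("1.2.3")` is rejected, because `fullmatch` requires the entire string to belong to the regular language.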
155 Monday, 10-17-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • The arrival of the data, priorities, Soteria (a brief summary)
  • Pending projects updates, Q&A (grad students)
  • Programming in MATLAB (Marek Rychlik)
    • RANSAC implementation
    • Programmable debugger
154 Monday, 10-10-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Pending projects updates, Q&A (grad students)
  • Programming in MATLAB (Marek Rychlik)
    • MATLAB classes, by-value semantics, handles
    • RANSAC implementation
    • Programmable debugger
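For readers new to RANSAC, the idea can be shown on the simplest model, a line fitted to points with outliers. This is a sketch in Python rather than the MATLAB implementation discussed at the meeting. The function name `ransac_line`, the tolerance and the iteration count are assumptions.

```python
import random


def ransac_line(points, n_iter=200, tol=0.1, seed=0):
    """Fit y = a*x + b by RANSAC; return (a, b, inliers).

    Each iteration fits a line through two random points and counts the
    points within vertical distance tol; the best-supported line wins.
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair; cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best
```

With ten points on y = 2x + 1 and two gross outliers, the returned inlier set contains exactly the ten points on the line. A production version would usually refit the model to the inliers by least squares.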
153 Monday, 10-03-2022 ENR2 S375, Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Using MATLAB effectively, by example.
  • Abstract: Although MATLAB is commonly used by students of Mathematics and the instructors, both are mostly self-taught and are non-programmers. The kidney project is aiming to use MATLAB as a development platform (very much like thousands of big companies are). In this talk I will focus on how to bring one's MATLAB programming skill to the next level. I will cover some of these examples (in this and planned follow up talks):
    • Notebooks (.mlx) vs. scripts vs. M-files (.m)
    • MATLAB functions and documentation standard
    • MATLAB classes
    • Deep Learning and layers as classes
    • Programmable debugger
    • Unit testing
152 Monday, 9-26-2022 ENR2 S375, Zoom Duncan Bennett Agenda items:
  • Speaker: Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Variational Autoencoders
  • Abstract: Generative models have many uses in machine learning, one of them being data augmentation/regularization. The idea is that if we can approximate the distribution of our data in a way that can be sampled, then we can generate more "synthetic" data to be used in a discriminative task. A variational autoencoder (VAE) is one such model. A VAE learns a joint distribution between an observed variable X (usually with complicated distribution) and a latent variable Z (with simpler distribution) where Z is thought of as a "causal" variable of X. For example, if X are handwritten digits then Z may be handedness, pen thickness, pen pressure, etc. The VAE achieves this by learning an encoder (distribution of Z|X) and a decoder (distribution of X|Z), giving a way to translate between the two variables. In the end, a sample from Z will allow us to sample from X|Z. In this talk, we give a brief introduction to VAEs in the context of the kidney transplant project.
151 Monday, 9-19-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Grant proposal matters (Marek Rychlik, Yan Han)
  • Student progress reports and troubleshooting obstacles.
  • Latent variable models, ELBO and Variational Autoencoders (Marek Rychlik, Duncan Bennett) Abstract: This will be an introductory talk. We will introduce the basic theory of latent variable models and define the Variational Autoencoder Architecture.
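For reference, the ELBO (evidence lower bound) named in the last agenda item has the following standard form. The notation is an assumption, not taken from the talk: x is the observed variable, z the latent variable, q_phi(z|x) the encoder, p_theta(x|z) the decoder and p(z) the prior.

```latex
\[
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
\;-\; D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
\]
```

The first term rewards accurate reconstruction of x from z; the second keeps the encoder distribution close to the prior. A VAE is trained by maximizing the right-hand side over theta and phi.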
150 Monday, 9-12-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Grant proposal discussion (Yan Han, Marek Rychlik)
  • Student progress reports and troubleshooting obstacles.
  • Decoding numerals in form context (Marek Rychlik) Abstract: I will outline the strategy for decoding numerals in the hand-filled forms. This project has some urgency (deadline: 2 weeks), as a preliminary report would be helpful in our funding effort.
149 Monday, 8-29-2022 ENR2 S375, Zoom Group discussion Agenda items:
  • Welcome and introductions.
  • A survey of the problem of decoding forms and PDF documents.
  • Information for newcomers:
    • Git repositories: WorldlyOcr, GitHub (worldly-ocr), kidney
    • The private part of the Ocr website (this site!)
    • Overleaf paper
    • MATLAB On-line as a communication tool
    • Java and how to integrate with MATLAB; reason for Java: best PDF resources are in Java, including PDFBox
    • Maybe Python; can be called from MATLAB.
  • Discussion of the methods, questions, project suggestions for graduate students.

Spring 2022 Schedule of Meetings and Talks (semester theme: NLP)

Usual meeting time: Friday, 11:00am - 11:50am. Until February, the meetings will be exclusively on line by Zoom. Use this link to connect.
# Date Room Speaker or Topic Agenda or Title and Abstract
148 Friday, 8-19-2022 Zoom Group discussion Agenda items:
  • Plans for the next semester. Introductions.
  • Decoding forms and other complex layouts with 100% accuracy (Marek Rychlik, 10min). Abstract: I will show how I did it for several form examples. The key is to use context-sensitive dictionaries and a type system. With these hints, OCR errors are corrected successfully.
  • Possible elimination of legacy OCR? A deep learning "gold standard" dataset for medical forms (Marek Rychlik). Abstract: Due to high error rate on complex layout and limited use of language context, OCR is used only to parse the results of layout analysis or image segmentation. A byproduct of this approach is a labeled dataset which can be used in a project attempting to construct one's "better OCR".
  • Report REU Workshop results (Marek Rychlik). Abstract: During the summer I supervised research of 3 undergraduates who explored a number of approaches to page segmentation and other aspects of form decoding, resulting in a paper. I will highlight some results.
148 Friday, 8-12-2022 Zoom Group discussion Agenda items:
  • The "preproposal" discussion (Marek Rychlik, Yan Han).
  • The 8/9 UNOS meeting report and collaboration opportunities (Yan Han, Marek Rychlik).
  • Most accurate strategies for form decoding from images (Marek Rychlik). Abstract: I will show how to decode two forms with 100% accuracy despite numerous OCR errors, by applying the idea of maximum likelihood decoding.
  • Report REU Workshop results (Marek Rychlik). Abstract: During the summer I supervised research of 3 undergraduates who explored a number of approaches to page segmentation and other aspects of form decoding, resulting in a paper. I will highlight some results.
147 Friday, 8-05-2022 Zoom Group discussion Agenda items:
  • The "preproposal" discussion.
  • A review of MATLAB NLP tools and their application to form parsing (Marek Rychlik). A potential list of tools of interest:
    • Document tokenization (splitting into words) and token classification (part of speech, e.g. "numeral", type, e.g. "letters" "digits", language, sentence information, e.g. "punctuation", "stop" words, etc.)
    • KNN search; two levels: string and token. Edit distance between tokenized documents measures distance in words, while string-level distance uses individual characters.
    • Latent Dirichlet Allocation language model (LDA), and why we may need it; review of the main idea: classifying sentences of a tokenized document by a latent (hidden) variable called "topic"; main variables: position of a token in a sentence and the topic number to which they belong; words in a topic should be "tied together", that is, if a word is in a topic, all other words in the topic should have equal probability of occurring in a sentence.
  • The "Tagged PDF" format revisited (Yan Han, Marek Rychlik)
146 Friday, 7-29-2022 Zoom Group discussion Agenda items:
  • Review of typed form parsing results and JSON output (Marek Rychlik)
    • Custom OCR model more effective than general "English" model.
    • Areas of OCR confusion: digits '1', '0' vs. letters 'l', 'O'; letter 'w' vs 'vv'.
    • NLP tools: dictionary of 'known words'; a 'phrasebook' of phrases used as keys; N-gram substitution; editDistance and knnsearch; parsing with regular expressions.
    • Clustering tools: dbscan
    • Custom techniques for decoding key/value type forms with complex layout.
  • The "Tagged PDF" format (Yan Han, Marek Rychlik)
    • Allows access to the structure and text of a PDF file, except the filled out parts
    • Original purpose - accessibility (?)
    • Open source and commercial tools for extraction
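The digit/letter confusions listed in this agenda ('1' vs 'l', '0' vs 'O') can be resolved when the expected type of a form field is known. The following Python sketch illustrates that idea only; it is not the MATLAB code reviewed at the meeting, and the field type names and the function `correct_field` are hypothetical.

```python
# Map letters that OCR commonly confuses with digits to those digits.
TO_DIGIT = str.maketrans({"l": "1", "O": "0"})


def correct_field(text, field_type):
    """Repair OCR confusions in a field whose type is known in advance.

    For a "number" field, look-alike letters are replaced by digits, but
    only if the result is purely numeric; otherwise the text is returned
    unchanged. Other field types are left as they are.
    """
    if field_type == "number":
        fixed = text.translate(TO_DIGIT)
        return fixed if fixed.isdigit() else text
    return text
```

For instance, `correct_field("l0O", "number")` gives `"100"`, while the same string in a text field is left alone. The check with `isdigit` guards against rewriting a value that was not a number to begin with.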
145 Friday, 7-22-2022 Zoom Group discussion Agenda items:
  • News
    • UNOS 2nd meeting delayed until early August but still on (Yan Han, Marek Rychlik)
    • The Soteria system to come on line - U of A way of handling privacy in an HPC environment
  • New code summary and challenges (Marek Rychlik, Yan Han)
    • Back-to-back processing of typed forms - 2000 lines of MATLAB code base
    • NLP topic - matching small vocabularies with knnsearch and rangesearch algorithms on strings utilizing edit distance
    • The PDF encryption challenge and why OCR is important
    • Clustering approaches based on dbscan
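The NLP topic above, matching OCR output against a small vocabulary by edit distance, can be sketched as follows. This is plain Python written for illustration, not MATLAB's knnsearch or rangesearch. The function names, the threshold `max_dist` and the sample vocabulary are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nearest_word(token, vocabulary, max_dist=2):
    """Return the closest vocabulary word within max_dist edits, else None."""
    best = min(vocabulary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else None
```

With the vocabulary `["creatinine", "potassium", "sodium"]`, the misread token `"creatinlne"` maps back to `"creatinine"`. The nearest-neighbor step plays the role of a 1-NN search and the threshold plays the role of a range search; a linear scan is adequate because the vocabulary is small.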
144 Friday, 7-15-2022 Zoom Group discussion Agenda items:
  • UNOS 2nd meeting preparations (Yan Han, Marek Rychlik)
  • New data (Marek Rychlik)
  • Report on REU Workshop projects (Marek Rychlik)
    • Detection of defective tables in grayscale images using Gabor Wavelets
  • New code summary and challenges (Marek Rychlik)
    • RANSAC-based table detection (multi-model RANSAC)
    • The performance challenge of RANSAC and improvement strategy ideas
    • Extension of RANSAC to postprocess the models and accuracy improvement
    • Clustering approaches, Hough transform.
143 Friday, 7-08-2022 Zoom Group discussion Agenda items:
  • Various
142 Friday, 7-01-2022 Zoom Group discussion Agenda items:
  • UNOS meeting summary and next steps (Yan Han, Marek Rychlik)
  • Report on REU Workshop projects (Marek Rychlik)
    • Trimming spurs
    • Detection of defective tables in grayscale images
  • New code summary and challenges (Marek Rychlik)
    • Table detection strategies
    • Stroke width and detection of bold face text
    • The underlined heading problem
141 Friday, 6-24-2022 Zoom Group discussion Agenda items:
  • Progress in medical form parsing (Marek Rychlik). Abstract: I will discuss some recent strategies and results in parsing forms, mostly focused on table traversal and page segmentation.
  • Discussion of goals for the meeting with UNOS researchers.
  • Discussion of OHDSI ("Odyssey"), agenda for software explorations at the OHDSI site (Yan Han, Marek Rychlik).
140 Friday, 6-17-2022 Zoom Group discussion Agenda items:
  • Decoding hand-filled forms via image registration (Marek Rychlik). Abstract: Image registration is a known technique of "stitching together" images of the same scene from different viewpoints, e.g. aerial images of the ground. It turns out that
  • Topic for group discussion (take 2): Pros and cons of manual, semi-automatic and fully automatic generation of JSON from forms.
139 Friday, 6-10-2022 Zoom Group discussion Agenda items:
  • Updates on parsing tables in images (Marek Rychlik). Abstract: Machine vision and graph theory have been applied to automate the discovery of table structure. The algorithm was developed using a blank form PDF, but was very successful in parsing a "noisy" table with numerous artifacts, including document skew, variation in background intensity and others.
  • Topic for group discussion: Pros and cons of manual, semi-automatic and fully automatic generation of JSON from forms.
138 Friday, 6-03-2022 Zoom Group discussion Agenda items:
  • New data - blank transplant-related forms in MS Word and PDF (Yan Han, Marek Rychlik).
  • Storing objects (of any object-oriented language) in a document-oriented storage with MongoDB (Marek Rychlik).
  • What is a JSON schema - why have one and how to design one? (Yan Han)
  • Extraction of images from PDF with PDFBox, Java and MATLAB (Marek Rychlik). Calling Java from MATLAB (and vice versa) will be discussed.
137 Friday, 5-27-2022 Zoom Group discussion Agenda items:
  • Updates on the medical records project (Yan Han)
  • A short review of document specific Tesseract language model (Marek Rychlik)
  • Basics of the Mongo database and processing with MATLAB (Marek Rychlik). Abstract: MongoDB is the database that is likely to be used with the data of the project. It differs from older databases in that it uses JSON rather than SQL. But a MATLAB user can think of it as an 'object storage', as I will demonstrate.
  • The blank form sample (Marek Rychlik). We received a blank form which can be used for image processing / OCR practice as it is not bound by privacy rules.
136 Friday, 5-20-2022 Zoom Marek Rychlik, Group discussion Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Decoding tables and forms with OCR and NLP techniques using MATLAB
  • Abstract: Tables in scientific papers, human-fillable forms and other documents are an important way to present and collect data. Traditionally, when tables and forms are captured as an image, it has been very difficult to recover the information accurately by machine. In this talk I will make the case that this is possible to do on a mass "big data" scale using a combination of techniques from OCR, Machine Vision and Learning, and Natural Language Processing tools. MATLAB has a large collection of those tools. A workflow based on MATLAB will be presented as an example involving a form containing medical information.
  • Discussion of problems, projects and challenges will follow
135 Friday, 5-13-2022 Zoom Group Discussion Agenda items:
  • Decoding forms and tables
  • Applications to healthcare
134 Friday, 5-6-2022 Zoom Group Discussion Agenda items:
  • Semester wrap up
  • Potential OCR projects
  • Broader Machine Learning agenda
133 Friday, 4-29-2022 Zoom Duncan Bennett Agenda items:
  • Speaker: Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Modern Hopfield Networks (Definition and Convergence Properties)
  • Abstract: Hochreiter et al. have taken ideas of Krotov and Hopfield (2016, 2018) and Demircigil et al. (2017) to define a network with exponential energy that takes continuous states. In this talk we discuss the definition, motivation and convergence properties of this modern Hopfield network. We plan to continue with a discussion of implementation into deep networks.
132 Friday, 4-22-2022 Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Hopfield network performance on the task of optical character recognition. (Continuation)
131 Friday, 4-15-2022 Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Hopfield network performance on the task of optical character recognition.
  • Abstract: In this talk I will discuss my recent experiments in storing and retrieving characters with a Hopfield neural network. The methodology will be described, which is closer to the traditional, Ising-model motivated approach. Two datasets were used:
    1. Approximately 1,000 grayscale, 40x36 images of English characters, extracted from microfilm.
    2. A MATLAB dataset "Digits", which is a synthetic dataset consisting of 5,000 digits (28x28) featuring substantial rotation and other irregularities.
    The experiment with 5k digits shows that the Hopfield network can memorize all 5k digits with recall accuracy of approximately 7%, which is much higher than predicted by the theoretical results with random patterns. I will also present some conjectures on how one could improve network performance by automatically deleting some digits from the training dataset.
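Storing and retrieving patterns in a traditional Hopfield network can be sketched in a few lines. This is a pure-Python illustration with Hebbian weights and asynchronous sign updates, not the MATLAB code used in the experiments. The function names and the tiny 8-neuron patterns are assumptions.

```python
def train_hopfield(patterns):
    """Hebbian weight matrix for +/-1 patterns, with zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, max_sweeps=10):
    """Update neurons one at a time until a fixed point or max_sweeps."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(w[i][j] * s[j] for j in range(len(s)))  # local field
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


# Two orthogonal 8-bit patterns (hypothetical toy data).
P1 = [1, 1, 1, 1, -1, -1, -1, -1]
P2 = [1, -1, 1, -1, 1, -1, 1, -1]
```

After `w = train_hopfield([P1, P2])`, calling `recall` on a copy of `P1` with its first bit flipped restores `P1`. With images, each pixel would be thresholded to +/-1 and flattened into one long pattern.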
129 Friday, 4-8-2022 Zoom Duncan Bennett Agenda items:
  • Speaker: Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Unsupervised prototype learning in an associative-memory network
  • Abstract: Inspired by Hopfield and Krotov, we talk about some more results involving Hopfield networks and the prototype regime. We show that the weights of a Hopfield network form clusters when trained to a set number of patterns. The clusters of the weights coincide with the clusters of the data. Furthermore, the centroids and concepts of these clusters form prototypes for each cluster. The concepts, each defined as an eigenvector of an outer product of certain weights, are shown empirically to better reconstruct generators of clusters.
131 Friday, 4-1-2022 Zoom Duncan Bennett Agenda items:
  • Speaker: Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Dense Associative Memory and MNIST
  • Abstract: In this talk we will discuss some results by Hopfield and Krotov. They show a duality between dense associative memory and neural networks such that the choice of Hamiltonian on the dense associative memory side corresponds to the choice of activation function in a neural network. Furthermore, the degree of the Hamiltonian has an impact on how the network learns (feature extraction or prototype regime). These results are illustrated through an application to the MNIST data set.
128 Friday, 3-25-2022 Zoom Duncan Bennett Agenda items:
  • Speaker: Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Capacity of Hopfield Networks and Modern Hopfield Networks
  • Abstract: A major issue with traditional Hopfield networks is that the number of memories one can store grows slowly as the number of neurons is increased. For random memories, we can only store about 0.14N memories reliably, or even less depending on our sense of "reliable". Modern Hopfield networks eliminate this problem by choice of Hamiltonian. A higher degree Hamiltonian will have sharper minima at the stored memories, increasing stability and capacity. We talk about a few of these facts in the discrete activation case.
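The "choice of Hamiltonian" mentioned in this abstract can be written down explicitly. The notation below is an assumption, following the usual presentation of dense associative memory rather than the slides of the talk: sigma is the state of N binary neurons, xi^mu are the K stored memories, and n is the degree.

```latex
\[
E(\sigma) \;=\; -\sum_{\mu=1}^{K} F\bigl(\xi^{\mu} \cdot \sigma\bigr),
\qquad F(x) = x^{n}, \qquad \sigma,\ \xi^{\mu} \in \{-1,+1\}^{N}
\]
```

For n = 2 this reduces to the traditional Hopfield energy, up to normalization and an additive constant. Larger n makes the energy fall off more steeply around each stored memory, which is the sense in which the minima become sharper.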
127 Friday, 3-18-2022 Zoom Duncan Bennett/Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik/Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: An Overview of Hopfield Networks
  • Abstract: A Hopfield network is an artificial neural network which has roots in statistical mechanics. It is also known as an Ising model, a mathematical model of ferromagnetism, but it has also found applications in machine learning and optimization. In machine learning, we may wish to store images and recover them after perturbation. To do this, we construct a Hamiltonian that has a stable minimum at the image. In optimization, we can solve the travelling salesman problem by constructing a Hamiltonian that is minimized at the shortest path through all cities. In this talk we give a brief overview of Hopfield networks and their applications.
126 Friday, 3-4-2022 Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: PROgramming in LOGic (PROLOG): Implementing the paradigm "proof=program"
  • Abstract: The programming language PROLOG is different from other programming languages because it equates a proof of a theorem with a programmatic solution of a problem. Another feature of PROLOG is the ability to use the "declarative" programming style: I declare the properties of the solution and PROLOG finds it. In this talk I will explain how PROLOG relates first-order logic (the language of proofs) to programming. Interestingly, the performance of PROLOG is measured in the number of logical inferences per second. I will give examples and explain unification, which is a form of pattern matching used by PROLOG, and its role in problem solving.
125 Friday, 2-25-2022 Zoom Marek Rychlik, Duncan Bennett Agenda items:
  • Speaker: Marek Rychlik/Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Word Senses and Word Sense Disambiguation
  • Abstract: One major difficulty of natural language is that words are ambiguous. The same word can be used to mean different things. For example, a mouse can refer to a rodent or a device to control your cursor, and usually we can use context to figure out which word sense is being used in a sentence. This process of word sense disambiguation (WSD) is useful when dealing with semantic tasks. We will discuss techniques to define word sense and machine learning methods for WSD.
124 Friday, 2-18-2022 Zoom Marek Rychlik, Duncan Bennett Agenda items:
  • Speaker: Marek Rychlik/Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Information Extraction and Machine Learning
  • Abstract: Information extraction (IE) is the process of turning unstructured information embedded in texts into structured data. For example, we may take text from a work email and turn it into a table containing a meeting time, location, speaker and topic. Many approaches to this task involve the use of deep neural networks (BERT) that are pre-trained with an added layer on top which is finetuned for the particular task. We will discuss some IE tasks such as relation extraction and event extraction and give an overview of the BERT architecture.
123 Friday, 2-11-2022 Zoom Marek Rychlik, Duncan Bennett Agenda items:
  • Speaker: Marek Rychlik/Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: Logical Representations of Sentence Meaning
  • Abstract: An important part of natural language is the link between the linguistic elements and non-linguistic real world knowledge. For example, if you read a restaurant menu then you have some real world knowledge to help you choose a dish. In this talk, we describe how to formalize the meaning of a sentence in a computationally tractable way. If time allows, we will discuss semantic augmentations to context-free grammars. Marek Rychlik will also give some examples in Prolog.
122 Friday, 2-04-2022 Zoom Marek Rychlik Agenda items:
  • Speaker: Marek Rychlik/Duncan Bennett
  • Affiliation: University of Arizona, Department of Mathematics
  • Title: The Cocke-Kasami-Younger (CKY) Algorithm for Syntactic Parsing
  • Abstract: Syntactic parsing is the task of assigning syntactic structure to a sentence. An immediate application of parsing is grammar checking but it is also a useful intermediate stage to semantic analysis such as question answering. We discuss the Cocke-Kasami-Younger (CKY) algorithm for syntactic parsing and some of the ambiguity issues in parsing. (Duncan Bennett)
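A compact version of the CKY algorithm is sketched below as a recognizer, which answers only whether a sentence is derivable. It is written in Python for illustration and is not the material from the talk. It assumes a grammar already in Chomsky normal form; the dictionary-based grammar encoding and the toy grammar are hypothetical.

```python
def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognizer for a grammar in Chomsky normal form.

    lexical: dict word -> set of nonterminals A with a rule A -> word
    binary:  dict (B, C) -> set of nonterminals A with a rule A -> B C
    """
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= binary.get((b, c), set())
    return start in table[0][n]


# Toy grammar: S -> NP VP, NP -> Det N, VP -> V NP, plus a small lexicon.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "sees": {"V"}}
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
```

The sentence "the dog sees the cat" is accepted by `cky_recognize("the dog sees the cat".split(), LEXICAL, BINARY)`. Storing back-pointers in each table cell would turn the recognizer into a parser, and an ambiguous sentence would then show up as a cell with more than one derivation.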
121 Friday, 1-28-2022 Zoom Group Discussion Agenda items:
  • A brief overview of context-free grammars, grammar equivalence, and Chomsky normal form with some examples in Prolog. (Duncan Bennett)
  • If time permits we may discuss CKY parsing and converting CFGs to CNF.
120 Friday, 1-21-2022 Zoom Organizational meeting. Agenda items:
  • Identifying research topics and speakers for the initial NLP talks.

Summer 2021 Schedule of Meetings and Talks

Usual meeting time: Friday, 11:00am - 11:50am. Still, the meetings will be exclusively on line by Zoom.
# Date Room Speaker or Topic Agenda or Title and Abstract
109 Friday, 8-13-2021 Zoom Group Discussion Agenda items:
  • Semantic segmentation, medical imaging experiment and potential use in OCR (Marek Rychlik).
  • New GitHub repository activity - matlab-CTC
  • Request for the segmentation app to process Pashto content.
  • Research updates.
109 Friday, 8-06-2021 Zoom Group Discussion Agenda items:
  • LineBreaker App for page segmentation - software release, issues (Marek Rychlik)
  • Research updates.
109 Friday, 7-23-2021 Zoom Group Discussion Agenda items:
  • LineBreaker App for page segmentation - first software release report (Marek Rychlik)
  • Research updates.
109 Friday, 7-16-2021 Zoom Group Discussion Agenda items:
  • Semantic segmentation, medical imaging experiment and potential use in OCR (Marek Rychlik).
  • New GitHub repository activity - matlab-CTC
  • Request for the segmentation app to process Pashto content.
  • Research updates.
108 Friday, 7-09-2021 Zoom Group Discussion Agenda items:
  • New dissertation report, job interviewing (Dwight Nwaigwe).
  • Research updates.
107 Friday, 6-25-2021 Zoom Group Discussion Agenda items:
  • New dissertation report (Dwight Nwaigwe).
  • Research updates.
106 Friday, 6-18-2021 Zoom Group Discussion Agenda items:
  • Report on Calamari OCR (Dylan Murphy).
  • Research updates.
105 Friday, 6-04-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status (post-submission).
  • Research updates.
104 Friday, 5-28-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status (post-submission).
  • Planned speech research.
  • Planning outreach.
  • Research updates.

Spring 2021 Schedule of Meetings and Talks

Usual meeting time: Friday, 11:00am - 11:50am. The meetings will be exclusively on line by Zoom. Use this link to connect.
# Date Room Speaker or Topic Agenda or Title and Abstract
103 Friday, 5-21-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status (post-submission).
  • Planned speech research.
  • Research updates.
102 Friday, 5-14-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status
    • Current state of the application
    • Task list updates
  • Planned speech research.
  • Research updates.
101 Friday, 5-07-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status
    • Current state of the application
    • Task list updates
  • Research updates.
100 Friday, 4-30-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status.
  • Research updates.
99 Friday, 4-23-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status.
  • Research updates.
98 Friday, 4-16-2021 Zoom Group Discussion Agenda items:
  • NEH Tier 2 application status.
  • Research updates.
97 Friday, 4-09-2021 Zoom Group Discussion Agenda items:
  • Research updates.
96 Friday, 3-19-2021 Zoom Group Discussion Agenda items:
  • NEH support matters: report, next application.
  • Research updates.
95 Friday, 3-12-2021 Zoom Group Discussion Agenda items:
  • Research updates.
94 Friday, 2-26-2021 Zoom Group Discussion Agenda items:
  • Updates on the Borderlands proposal (Yan Han, Marek Rychlik).
  • Gaussian process modeling (Dylan Murphy).
  • A storage patent to be awarded. The end of a 3 year journey. (Marek Rychlik)
  • Research updates.
93 Friday, 2-19-2021 Zoom Group Discussion Agenda items:
  • Updates on the Borderlands proposal (Yan Han, Marek Rychlik).
  • Updates on Navajo speech samples (Marek Rychlik, Yan Han).
  • Gaussian process modeling (Dylan Murphy).
  • Research updates.
92 Friday, 2-12-2021 Zoom Group Discussion Agenda items:
  • Developments in Navajo language opportunity: past meeting with Prof. Aresta Tsosie-Paddock, forthcoming meeting with Special Collections, Overleaf of the Borderlands proposal (Yan Han, Marek Rychlik).
  • Navajo speech samples, Unicode and encoding issues (Marek Rychlik, Yan Han).
  • Gaussian process modeling ideas (Dylan Murphy, Marek Rychlik).
  • Research updates.
91 Friday, 2-5-2021 Zoom Group Discussion Agenda items:
  • Navajo language opportunity (Yan Han, Marek Rychlik).
  • Results on Multi-class logistic regression (Dwight Nwaigwe, Marek Rychlik).
  • Gaussian process modeling ideas (Dylan Murphy, Marek Rychlik).
  • Research updates.
90 Friday, 1-29-2021 Zoom Group Discussion Agenda items:
  • A new funding opportunity, the Navajo language possibilities (Yan Han, Marek Rychlik).
  • Forthcoming papers (Dwight Nwaigwe, Marek Rychlik).
  • Research updates.
89 Friday, 1-22-2021 Zoom Group Discussion Agenda items:
  • A new funding opportunity (report on workshop, Marek Rychlik).
  • Forthcoming papers (Dwight Nwaigwe, Marek Rychlik).
  • Research updates.
88 Friday, 1-15-2021 Zoom Group Discussion Agenda items:
  • A new funding opportunity (Yan Han, Marek Rychlik).
  • Research updates.
87 Friday, 1-8-2021 Zoom Dwight Nwaigwe and Group Discussion Agenda items:
  • Report on OCR with attention-based approach (Dwight Nwaigwe).
  • Report on MATLAB implementation of CTC with logarithmic scaling of probabilities (Marek Rychlik).
  • A new funding opportunity (Yan Han).
  • Research updates.
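A note on the CTC item above: working with logarithms of probabilities avoids the underflow that affects the forward recursion on long inputs. The following is a minimal Python sketch of that idea, not the MATLAB implementation reported at the meeting. The function names (logsumexp, ctc_log_likelihood) and the convention that index 0 is the blank are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, label, blank=0):
    """Return log p(label | input) using the CTC forward recursion in log space.

    log_probs[t][k] is the log-probability of symbol k at time step t.
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S, T = len(ext), len(log_probs)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    if S == 1:
        return alpha[0]
    return logsumexp(alpha[S - 1], alpha[S - 2])
```

With 2000 time steps and uniform probabilities over two symbols, the likelihood of a one-character label is far below the smallest positive double, so a linear-scale recursion would return zero, while the log-scale value stays finite.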

Fall 2020 Schedule of Meetings and Talks

Usual meeting time: Friday, 11:00am - 11:50am. The meetings will be held exclusively online via Zoom. Use this link to connect.
# Date Room Speaker or Topic Agenda or Title and Abstract
86 Friday, 12-18-2020 Zoom Group Discussion Agenda items:
  • Feedback from DEPSCorp application (Marek Rychlik).
  • Progress reports.
  • Research updates.
85 Friday, 12-11-2020 Zoom Group Discussion Agenda items:
  • Progress reports.
  • Research updates.
84 Friday, 12-04-2020 Zoom Group Discussion Agenda items:
  • Progress reports.
  • Research updates.
83 Friday, 11-20-2020 Zoom Group Discussion Agenda items:
  • Funding updates.
  • Research updates.
82 Friday, 11-13-2020 Zoom Group Discussion Agenda items:
  • CTC implementation in MATLAB (Dwight Nwaigwe, Marek Rychlik). CTC with logits can be implemented without patching MATLAB installation. Some built-in classes are overridden. This approach may work with other projects.
  • Research updates.
81 Friday, 11-06-2020 Zoom Elsayed Issa
  • Speaker: Elsayed Issa
  • Affiliation: University of Arizona, School of Middle Eastern and North African Studies
  • Title: Machine-Extracted Text Summaries for Arabic L2 Learning
  • Abstract: Text summarization is the process of creating a concise and coherent summary of a longer text while preserving the meaning and the important information in the text. Automatic summaries reduce reading time, improve the effectiveness of indexing, and help in question-answering systems. In this talk, we will discuss a line of research on automatic text summarization for L2 microlearning where summaries serve as small learning pieces that L2 learners read instead of larger documents. We use Probabilistic Topic Modeling (PTM) and its Latent Dirichlet Allocation (LDA) algorithm as well as a sentence extraction approach to implement our system. Topic modeling is used to discover the underlying topics in a text document or several documents. The basic assumption behind it is that a document can be represented by a set of latent topics, which are multinomial distributions over words, and that each document can be described as a mixture of these topics. Each document then has a set of topics and probability distributions associated with them. At the same time, each topic has a set of words and their probabilities of occurrence given that document and topic, i.e., topic models build bags for topics to extract information. The extractive method selects and extracts the pieces or sentences of a longer text that are more relevant than the others.
80 Friday, 10-30-2020 Zoom Group Discussion Agenda items:
  • New approaches to Chinese; attention in neural nets (Yan Han, Dylan Murphy, Dwight Nwaigwe).
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe, Marek Rychlik).
  • Research updates.
79 Friday, 10-23-2020 Zoom Group Discussion Agenda items:
  • New approaches to Chinese (Yan Han, Dylan Murphy, Dwight Nwaigwe).
  • Continuation of MRI Proposal and the POWER architecture (Marek Rychlik).
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe, Marek Rychlik).
  • Research updates.
78 Friday, 10-16-2020 Zoom Group Discussion Agenda items:
  • MRI Proposal and the POWER architecture (Marek Rychlik).
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe).
  • Research updates.
77 Friday, 10-09-2020 Zoom Group Discussion Agenda items:
76 Friday, 10-02-2020 Zoom Group Discussion Agenda items:
  • Implementing CTC with logits in MATLAB (Dwight Nwaigwe).
  • Research updates.
75 Friday, 9-25-2020 Zoom Group Discussion Agenda items:
  • The upcoming Middle East Studies Association (MESA) meeting (Marek Rychlik). Abstract: I am scheduled to participate in a panel discussion and present a paper on Tuesday, 10/06/2020, 1:30pm. The program of the meeting is available on-line.
  • CTC implementation strategies (Dwight Nwaigwe, Marek Rychlik, Ryan Coatney).
  • New papers in paper repository.
  • Research updates.
74 Friday, 9-18-2020 Zoom Group Discussion Agenda items:
  • The DoD White Paper update (Marek Rychlik).
  • Research updates.
73 Friday, 9-11-2020 Zoom Group Discussion Agenda items:
  • DoD funding pursuit update (Marek Rychlik, Yan Han). Items needed (deadline for submission: Sep. 21):
    • Project description (3-page limit)
    • Approximate yearly budget
  • Updates to Git (non-public) repository (Marek Rychlik):
    • Moved seminar website into the repository
    • Planning to integrate the 'Papers' folder, so that there is only one folder holding the collection of papers
  • Strategy to develop CTC with logits in MATLAB (Dwight Nwaigwe, Marek Rychlik).
  • A new paper with spectral estimates (Dwight Nwaigwe).
  • Research updates.
72 Friday, 9-04-2020 Zoom Group Discussion Agenda items:
  • DoD funding pursuit update (Marek Rychlik, Yan Han).
  • Strategy to develop CTC with logits in MATLAB (Dwight Nwaigwe).
  • Research updates.
71 Friday, 8-28-2020 Zoom Group Discussion Agenda items:
  • DoD funding pursuit --- collaborator and mentor search update (Marek Rychlik, Yan Han).
  • DoD two-page project summary writing effort (Marek Rychlik). NOTE: Ideas beyond current project may be developed.
  • Preliminary report on Fourier and Cepstral Analysis incorporation for handling vertical jitter (Marek Rychlik).
  • Video of Chinese character recognition posted on YouTube (Dwight Nwaigwe).
  • Announcement: the "deductron paper" published (Marek Rychlik).
  • Research updates.
70 Friday, 8-21-2020 Zoom Group Discussion Agenda items:
  • DoD funding --- collaborator and mentor search update (Marek Rychlik).
  • C++ and ImageMagick for OCR (Marek Rychlik). Abstract: Working with frameworks (e.g. the MATLAB Deep Learning Toolbox) typically leads to insurmountable problems due to framework design flaws and limitations. This is why one eventually wants to take advantage of the power and flexibility of C++. I will demonstrate this with bits of code written in C++ using ImageMagick and the Boost multiarray class.
  • Research updates.
Usual meeting time: Friday, 11:00am - 11:50am. Room: ENR2 S375.

Summer 2020 Schedule of Meetings and Talks

# Date Room Speaker or Topic Agenda or Title and Abstract
69 Friday, 8-14-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Using sequence-to-label mapping for OCR (Dwight Nwaigwe).
  • DoD funding and grant writing update (Marek Rychlik).
  • Implementing the snake scanning pattern for baseline-free learning of cursive scripts (Marek Rychlik).
  • Research updates.
68 Friday, 8-07-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Research topics suggestions for the Fall. Papers to read in the Git repository.
  • DoD funding and grant writing plan (Marek Rychlik).
  • Classifying English characters with RNN in sequence-to-label mapping mode (Marek Rychlik). The code was posted to git.
  • Research updates.
67 Friday, 7-31-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Conference in Seoul in 2021 (Yan Han).
  • DoD funding and grant writing plan (Marek Rychlik).
  • Classifying English characters with RNN in sequence-to-label mapping mode (Marek Rychlik).
  • Papers to read in the Git repository
  • Research updates.
66 Friday, 7-24-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • DoD funding opportunities in Machine Learning (Marek Rychlik). NOTE: I e-mailed a copy of the materials from the Webinar I attended.
  • Sharing files with iSCSI (Marek Rychlik). Abstract: The RSNA dataset is over 400GB, which is more disk space than most laptops have. I will discuss the use of the iSCSI protocol to share a disk across the network from a server. I will briefly compare it to the NFS protocol popular in the U*nix world.
  • Research updates.
65 Friday, 7-17-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Search for new funding opportunities (Marek Rychlik).
  • Updates to hardware and software resources.
  • Research updates.
64 Friday, 7-10-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Calling Python from MATLAB and improved generation of Unicode strings (Marek Rychlik).
  • New theoretical results on eigenvalues of the Hessian in multi-class logistic regression (Dwight Nwaigwe).
  • Research updates.
63 Friday, 7-3-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Medical imaging - Kaggle RSNA intracranial hemorrhage dataset (Marek Rychlik). Abstract: I will report on downloading the dataset and basic processing in MATLAB. I will discuss a preparatory step for ML: creation of a custom DICOM data store class.
  • Research updates.
62 Friday, 6-26-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Medical imaging - similarities and differences with OCR (Marek Rychlik). Abstract: A 2019 Kaggle-style challenge dealing with radiological data concluded with 3 best solutions being a combination of CNN and LSTM/GRU, things commonly used in OCR. The major difference is that we deal with Big Data. I will introduce a half-a-terabyte training dataset, used in the Kaggle competition.
  • Determinant and eigenvalue identities and inequalities in machine learning (Dwight Nwaigwe).
  • Research updates.
61 Friday, 6-19-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras, ROM and 15% speedup of training (Dylan Murphy).
  • A determinant lemma for sums of Kronecker products (Dwight Nwaigwe).
  • Dissertation on dynamic responsibility (Ryan Coatney).
  • Other research developments.
60 Friday, 6-12-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras (Dylan Murphy).
  • Deductron as a replacement for LSTM in an NLP application (Marek Rychlik, Dylan Murphy).
  • A determinant lemma for sums of Kronecker products (Dwight Nwaigwe).
  • Other research developments.

Spring 2020 Schedule of Meetings and Talks

# Date Room Speaker or Topic Agenda or Title and Abstract
59 Friday, 6-5-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras (Dylan Murphy).
  • Keras-based OCR in Python for the Bromello font (Marek Rychlik).
  • Other research developments.
58 Friday, 5-29-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Deductron implementation in Python/Keras (Dylan Murphy).
  • Keras-based OCR in Python for the Bromello font (Marek Rychlik).
  • Other research developments.
57 Friday, 5-22-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Future funding.
  • Research updates.
56 Friday, 5-15-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • NEH proposal submission.
  • Research updates.
55 Friday, 5-8-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Grant proposal writing.
  • White paper submission.
  • Research updates.
54 Friday, 5-1-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper submission.
  • Grant proposal writing.
53 Friday, 4-23-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper progress.
  • Transition to working on the next proposal.
52 Friday, 4-17-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • New code and new datasets (Marek Rychlik):
    • Drawing Chinese characters in MATLAB efficiently.
    • Breaking up pages with new LineBreakerApp into lines and characters.
  • Training deep networks on Chinese characters (Dwight Nwaigwe).
  • White paper update.
51 Friday, 4-10-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Creation of new training data for Pashto (Yan Han, Marek Rychlik).
  • The line-breaking algorithm for Farsi and Pashto based on bounding box overlaps (Marek Rychlik, Sayyed Vazirizade).
  • Chinese rendering with Python (Dwight Nwaigwe).
  • White paper progress.
50 Friday, 4-03-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper and NEH reporting.
49 Friday, 3-27-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • White paper and NEH reporting (Marek Rychlik).
  • OCR on Chinese characters using multi-class logistic regression and new approaches (Dwight Nwaigwe).
  • Data augmentation approach to handling warped text.
  • Discussion of other ongoing research.
48 Friday, 3-20-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • A deep learning pipeline for OCR on Farsi (Marek Rychlik). Abstract:

    The Arabic writing system (used, e.g., by Persian/Farsi) uses 3 forms of a character (initial, medial and final) to reflect its position in a ligature. In addition, the characters can be richly decorated with diacritics. We applied a deep learning workflow similar to a generic video processing pipeline to perform OCR on Farsi. Following the approach used for Latin script, we trained the system on unigrams, bigrams and augmented characters: ordinary letters decorated with diacritics. We performed preliminary validation on OCR_GS_Data, a publicly available "gold standard" dataset. We utilized only the labels of the dataset so far, and generated synthetic data by typesetting those labels (lines of text in Farsi). The performance is as expected: the system behaves well on short ligatures not involving the medial form. It is expected that after seeing the medial form in training data, the system will attain the desired performance.

  • OCR on Chinese characters using multi-class logistic regression and new approaches (Dwight Nwaigwe).
  • OCR of Latin script font Bromello achieves 100% accuracy (progress report).
  • Perfected CTC implementation in MATLAB (progress report). The use of parallelism and improved visual progress tracking.
  • Discussion of other ongoing research.
47 Friday, 3-6-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Decoding Bromello font bigrams with BiLSTM and CTC (Marek Rychlik). Abstract:

    As in the previous talk of 2/28, Bromello is used as an example of a Latin cursive font. We are working with bidirectional LSTM to perform sequence-to-sequence mapping (in contrast, in the last talk we performed sequence-to-label mapping). We use our home-grown implementation of the CTC (Connectionist Temporal Classification) layer to perform complete OCR. The results indicate that the system is not able to insert Graves's blank, i.e. it fails to produce "strongly predicted blanks". As a consequence, some characters disappear. An especially acute problem is repetition of the same character. The working hypothesis is that the problem is fundamental, and that it reflects the limitations of Graves's probabilistic model.

  • A recent improvement of numerical stability in our CTC implementation.
  • Discussion of other ongoing research.
  • NEH reporting.
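The difficulty with repeated characters described in the abstract above can be seen directly from the CTC collapsing map, which merges runs of identical symbols and then deletes blanks. Below is a minimal Python sketch; the blank symbol "-" and the function name ctc_collapse are illustrative assumptions, not part of the group's implementation.

```python
from itertools import groupby

BLANK = "-"  # illustrative blank symbol, an assumption of this sketch

def ctc_collapse(path, blank=BLANK):
    """Apply the CTC map: merge runs of equal symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)

# A doubled letter survives only if a blank separates the two runs.
print(ctc_collapse("hheel-llo"))  # hello
print(ctc_collapse("hheelllo"))   # helo
```

If the network never emits a confident blank between the two runs of "l", the second "l" is merged away, which matches the disappearing characters reported above.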
46 Friday, 2-28-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • Decoding Bromello font bigrams without CTC with high accuracy (Marek Rychlik). Abstract: Bromello is a Latin cursive font for creating decorative, script texts in English. It poses similar problems to Arabic scripts for OCR. By suitably preparing training data I am able to decode Bromello texts with bidirectional LSTM without using CTC. Furthermore, LSTM is only used in sequence-to-label mapping mode.
  • Acquisition of a new Chinese labeled character database (Dwight Nwaigwe).
  • Discussion of ongoing research.
45 Friday, 2-21-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
  • An implementation of CTC prefix search. (Marek Rychlik, Dwight Nwaigwe)
  • How to decode Latin alphabets with PCA and nearest neighbor search? (Marek Rychlik)
  • BiLSTM+CTC experiments with the Bromello font, a Latin cursive font. (Marek Rychlik)
  • Results evaluation of OCR from commercial OCR software. (Yan Han)
  • Discussion of other ongoing research.
44 Friday, 2-14-2020 Zoom Only Group Discussion Today's meeting is Zoom Only. Use this link to connect. Agenda items:
43 Friday, 2-7-2020 MATH 402 Odin Fernando Eufracio Vazquez
  • Speaker: Odin Fernando Eufracio Vazquez
  • Affiliation: Centro de Investigación en Matemáticas A.C. (CIMAT)
  • Title: Nonnegative Matrix Factorization with low-rank regularization for automatic feature extraction.
  • Room: MATH 402 (not ENR2 S375)
  • Join Zoom Meeting: ID 989-919-119
  • Abstract:

    In machine learning, Nonnegative Matrix Factorization (NMF) is a method of dimensionality reduction in which the nonnegativity constraints allow only additive combinations. One of the challenges in NMF is to determine the rank of the factorization; the correct choice of the rank would allow us to extract better features and thus promote a part-based representation of the data.

    In this work, we propose to include a diagonal matrix D and minimize the rank of factorization through the penalization of the elements in the diagonal of D. We derive an iterative algorithm with closed formulas by alternately minimizing local cost functions.

    We demonstrate the efficacy of our algorithm by performing experiments on synthetic data, images, texts, and gene expression data sets. We show that our proposed algorithm not only estimates the factors with high precision but, by minimizing the rank of the factorization, can also learn interpretable features from the data.

42 Friday, 1-30-2020 ENR2 S375 Research, funding, collaborations Group discussion and research updates on the following topics:
  • Report on conversation with SGCI.
  • TRIPODS 2 Translational Lab proposal.
  • Current research.
41 Friday, 1-23-2020 ENR2 S375 Organizational meeting. Group discussion and research updates on the following topics:
  • A review of research papers.
  • Funding strategies.

2019 Schedule of Meetings and Talks

# Date Room Speaker Title and Abstract
40 Friday, 12-06-2019 ENR2 S375 Semester Wrap-Up Group discussion and research updates on the following topics:
  • A review of implemented OCR-related algorithms.
  • Sequence-to-label mapping with LSTM as means of recognizing isolated characters.
  • Planning for the next semester.
40 Friday, 11-21-2019 ENR2 S375 Research Updates Group discussion and research updates on the following topics:
  • A review of best algorithms for finding distance between damaged characters.
  • Automatic differentiation in MATLAB 2019b.
  • Generating W-language samples with a chaotic dynamical system (interval map).
  • Learning parameters of chaotic Lorenz attractors with standard Deep Learning tools.
39 Friday, 11-15-2019 ENR2 S375 Research Updates
  • Marek Rychlik: New features in the MATLAB Deep Learning Toolbox. Abstract: The R2019b version of MATLAB has a number of new, impressive features for machine learning. I will focus on the automatic differentiation features.
  • Group discussion of dynamic time warping for multiple outline cycles (teleportation).
38 Friday, 11-01-2019 ENR2 S375 Research Updates
  • Marek Rychlik: Dynamic time warping with and without teleportation. Abstract: Dynamic Time Warping (DTW) may be used to align time while traversing similar data. In OCR we also need teleportation (instant transfer to another signal space). Rudimentary C++ code was used for benchmarking (without teleportation yet).
  • New videos! Arabic characters on the YouTube channel.
  • Dwight Nwaigwe will report on current progress on the Hessian of the multi-class logistic regression loss.
  • Dylan Murphy will update us on the sweep implementation of the character outline algorithm, and related topics.
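For readers unfamiliar with the Dynamic Time Warping mentioned in the first item above, here is a minimal Python sketch of the classic algorithm. It is not the C++ benchmarking code from the talk and it does not include teleportation; the function name dtw_distance and the use of the absolute difference as the local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# The same shape traversed at different speeds aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The table has (n+1)(m+1) entries, so the cost is quadratic in the sequence length; this is one reason compiled code is attractive for benchmarking.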
37 Friday, 10-25-2019 ENR2 S375 Research updates
  • Marek Rychlik: Update on current performance on Traditional Chinese, rotated text, Latin text, font mapping. Survey of new videos on the YouTube channel.
36 Friday, 10-18-2019 ENR2 S375 Research updates
  • NOTE: Marek Rychlik will give a talk at the TRIPODS seminar on Monday, 10-21-2019.
  • Marek Rychlik will demonstrate full OCR processing based on the character outline algorithm and cross-correlation, for Latin alphabets.
35 Friday, 10-11-2019 ENR2 S375 Dylan Murphy
  • TITLE: Line-sweeping and other incremental improvements Abstract: In the cycle-detection algorithm, the expensive step is graph-traversal. One way to improve the speed of this algorithm, as we have seen, is to use a more efficient representation of the adjacency matrix in terms of linked lists. I will present another method, which avoids the graph-traversal step entirely by connecting cycles during the edge-detection step using a linesweep-style algorithm. A naive implementation of this approach in Python produced speedups similar to the linked-list approach, reducing the processing time for a page to several seconds, down from hundreds of seconds.
  • Marek Rychlik will report on a modified algorithm capable of processing many Chinese pages in real (MATLAB) time, thanks to non-uniform sampling of character outlines. If there is time, I will discuss Dynamic Time Warping (DTW).
34 Friday, 10-04-2019 ENR2 S375 Marek Rychlik, Group Discussion
  • TITLE: Multi-page clustering of Traditional Chinese text Abstract: I will report on recent advances of Chinese text processing. The main advance is increasing the speed of computing character outlines by a factor of 100 as compared to the last report. This allows processing of entire books in acceptable time, without expensive hardware resources.
  • Research and software development updates.
34 Friday, 9-27-2019 ENR2 S375 Raymundo Navarette
  • TITLE: Bias reduction in multi-class logistic regression and the problem of separation. Abstract: We will go over the basics of Firth's method for bias reduction in maximum likelihood estimates and apply them to multi-class logistic regression. We will discuss how this and other penalization approaches remove the problem of separation, which occurs when sample sizes are small or when all classes can be separated with linear classifiers, and leads to non-existent (infinite) optimal parameters.
  • Research and software development updates.
33 Friday, 9-20-2019 ENR2 S375 Marek Rychlik, Group
  • TITLE: Latin and Chinese character outlines as means of extracting features Abstract: An important idea of OCR present in Tesseract and research papers is that character outlines are the source of features for character classification. We will discuss Fourier transform and splines as means of smoothing and approximating character boundaries. An inherent instability due to character damage will be described, and ways to address it.
  • Research and software development updates.
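As an illustration of the Fourier approach to smoothing character boundaries mentioned in the abstract above, the following Python sketch treats a closed outline as a sequence of complex numbers and discards high-frequency Fourier coefficients. The function names (dft, smooth_outline) and the parameter keep are hypothetical, and the naive quadratic-time transform is used only to keep the example self-contained; it is not the code discussed at the meeting.

```python
import cmath

def dft(z):
    """Naive discrete Fourier transform, normalized by the number of samples."""
    n = len(z)
    return [
        sum(z[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n)) / n
        for f in range(n)
    ]

def smooth_outline(points, keep):
    """Low-pass filter a closed outline given as a list of (x, y) points.

    Only the frequencies -keep..keep are retained.
    """
    z = [complex(x, y) for x, y in points]
    n = len(z)
    coeffs = dft(z)
    # Index f and index n - f carry the frequencies +f and -f respectively.
    kept = [c if min(f, n - f) <= keep else 0j for f, c in enumerate(coeffs)]
    out = [
        sum(kept[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n))
        for k in range(n)
    ]
    return [(w.real, w.imag) for w in out]
```

A small value of keep yields a heavily smoothed outline, while keep equal to half the number of points reproduces the input. Damage to a character changes many coefficients at once, which is one way to see the instability mentioned in the abstract.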
32 Friday, 9-13-2019 ENR2 S375 Dwight Nwaigwe
  • TITLE: Character Matching and some problems with topological classification. Abstract: Ray Smith's overview of Tesseract mentions the use of features in classification, noting that topological classification is not robust. We briefly go into some examples. Further, we note that his paper discusses some of the mechanics of character matching, which is based on clustering. We compare it with a clustering method devised for cursive fonts.
31 Friday, 9-6-2019 ENR2 S375 Group
  • Discussion of Tesseract architecture based on Ray Smith's paper.
    • The use of polygonal approximations
    • Maximally chopped characters
    • (x,y,theta) as features
    • Re-assembling chopped characters
  • Character outline calculation with MATLAB.
  • Top-down, bottom-up, adaptive processing.
  • Research updates.
30 Friday, 8-30-2019 ENR2 S375 Group
  • Planning for the semester; papers to read. One paper: Overview of Tesseract OCR Engine by Ray Smith.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Research updates.
  • MATLAB image datastores - the MATLAB ways to prepare training data.
30 Friday, 8-23-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Report on 20% improvement on digits 0-3 of the MNIST dataset. Abstract: Using the new regularization technique in the "Patternnet" paper I was able to reduce the number of errors from 610 to about 490, which is approximately 20%. The regularization shares some features of the LASSO method in statistics.
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
29 Friday, 8-16-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates (Marek Rychlik): Simplifying the "Patternnet" approach; linear programming;
  • New MATLAB code updates: a parameter-tweaking GUI.
  • Planning for the new semester.
28 Friday, 8-09-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Research updates: Connection between the "Patternnet" multi-class logistic regression and Support Vector Machines, and Linear Programming aspects.
  • Tesseract 5 new features (Yan Han).
27 Friday, 8-02-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on multi-logistic regression paper (Marek Rychlik).
  • Tesseract 5 new features (Yan Han).
26 Friday, 7-26-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Caching Unicode character images in MATLAB (Marek Rychlik). Abstract: Rendering Traditional Chinese characters from Unicode glyphs is a time-consuming operation, especially in MATLAB. We need the characters to be at least 60 pixels in size, which generates about 4,000 bits per character (one of 60,000+). Therefore, rather than generating characters on the fly, we reuse generated images by caching them. Two caching strategies are implemented: in RAM and in an SQLite database. Packing bits is used as a form of compression for the database implementation. A speedup achieved is 5-10 fold.
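The two caching strategies described in the abstract above (RAM and an SQLite database with bit packing) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the MATLAB code from the talk: the class GlyphCache, the helpers pack_bits and unpack_bits, the table layout, and the render callback are all hypothetical names, and images are represented as lists of rows of 0/1 values.

```python
import sqlite3

def pack_bits(rows):
    """Pack a binary image (list of rows of 0/1) into bytes, 8 pixels per byte."""
    flat = [bit for row in rows for bit in row]
    flat += [0] * (-len(flat) % 8)  # pad to a whole number of bytes
    return bytes(
        sum(bit << (7 - k) for k, bit in enumerate(flat[i:i + 8]))
        for i in range(0, len(flat), 8)
    )

def unpack_bits(blob, height, width):
    """Inverse of pack_bits for an image of known height and width."""
    bits = [(byte >> (7 - k)) & 1 for byte in blob for k in range(8)]
    return [bits[r * width:(r + 1) * width] for r in range(height)]

class GlyphCache:
    """Two-level cache: a dict in RAM backed by an SQLite table."""

    def __init__(self, render, path=":memory:"):
        self.render = render  # slow function: code point -> binary image
        self.ram = {}
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS glyphs "
            "(codepoint INTEGER PRIMARY KEY, height INTEGER, width INTEGER, bits BLOB)"
        )

    def get(self, codepoint):
        if codepoint in self.ram:
            return self.ram[codepoint]
        row = self.db.execute(
            "SELECT height, width, bits FROM glyphs WHERE codepoint = ?", (codepoint,)
        ).fetchone()
        if row is not None:
            image = unpack_bits(row[2], row[0], row[1])
        else:
            # Cache miss at both levels: render once and store the packed bits.
            image = self.render(codepoint)
            self.db.execute(
                "INSERT INTO glyphs VALUES (?, ?, ?, ?)",
                (codepoint, len(image), len(image[0]), pack_bits(image)),
            )
            self.db.commit()
        self.ram[codepoint] = image
        return image
```

A 60 by 60 binary image has 3,600 pixels and packs into 450 bytes, consistent with the figure of about 4,000 bits per character quoted in the abstract.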
25 Friday, 7-19-2019 ENR2 S375 Marek Rychlik, Group Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • MATLAB MEX interface to Tesseract 4 (Marek Rychlik). Abstract: I wrote C++ code which effectively wraps Tesseract 4 in MATLAB. It entirely bypasses the Vision Toolbox toolkit wrapper, which only supports version 3 of Tesseract. I will explain the implementation and capabilities. I will also demonstrate its application to provide a complete OCR system for Traditional Chinese, using custom page segmentation (class PageScan) and LSTM-based character recognition.
  • Update on Sayyed's work (Sayyed Vazirizade) on Arabic/Persian/Pashto page segmentation.
  • Review of OCR papers listed at the Tesseract site (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
24 Friday, 7-12-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Update on Sayyed's work.
  • New code for Chinese page segmentation (Marek Rychlik).
  • Commentary on the "Patternnet" paper: the existence question.
  • Forthcoming meeting with Rep. Grijalva's office.
23 Friday, 7-05-2019 ENR2 S375 NO MEETING. 4th of July break.
22 Friday, 6-28-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. Themes to be covered:
  • Preparing for the meeting with Rep. Grijalva's office.
  • Update on Sayyed's work.
  • MATLAB techniques for preparing training datasets: structures, cell arrays, .mat files.
  • Unsupervised classification of Chinese characters.
22 Friday, 6-21-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Updates on the multi-class logistic regression research. Kronecker product and Hadamard product.
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
21 Friday, 6-14-2019 ENR2 S375 Group discussion Note that for the rest of the summer the room is ENR2 S375. This meeting will be devoted to:
  • Image processing training.
  • Sayyed's adaptive thresholding code.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
20 Friday, 6-7-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • Review of current segmentation code for Pashto/Persian. MATLAB 'system' call under Windows.
  • MATLAB code for image processing and page segmentation algorithms.
  • Unsupervised classification of Chinese characters.
  • Serialization and using .mat files.
19 Friday, 5-31-2019 ENR2 S375 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
18 Friday, 5-24-2019 ENR2 S395 Group discussion This meeting will be devoted to:
  • Image processing training.
  • MATLAB code for image processing.
  • Page segmentation algorithms.
17 Friday, 5-17-2019 ENR2 S395 Group discussion This first meeting of Summer 2019 will be devoted to:
  • Summer research themes; everyone is invited to talk for a few minutes about their plans and problems;
  • Review of the new collection of Chinese documents at the Library of Congress, which need OCR.
16 Friday, 5-11-2019 ENR2 S395 NO MEETING NO MEETING.
15 Friday, 5-03-2019 ENR2 S395 Group discussion Due to Exam Session, the agenda is tentative:
  • Research progress (Marek Rychlik, Dwight Nwaigwe, Aaron Peterson, Ryan Coatney)
  • New document choices (Yan Han)
  • Tesseract training (Dylan Murphy)
14 Friday, 4-26-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • The problem of "loose clouds" in Chinese text (Yan Han, Marek Rychlik)
  • Building Web applications with MATLAB (Marek Rychlik)
  • MATLAB as an application delivery system: in-MATLAB, standalone with free MATLAB Runtime, Web with MATLAB Web application server
  • Ideas and obstacles of the Tier 2 proposal
  • Research and development ideas for the summer
13 Friday, 4-19-2019 ENR2 S395 Group discussion We will cover a variety of topics, including:
  • Chinese character experiments (Yan Han and his students)
  • Building standalone applications with MATLAB (Marek Rychlik)
  • GitHub limitation on file size
  • New installer site on BitBucket
  • Conversation with an NEH Officer regarding Tier 2 proposal
12 Friday, 4-12-2019 ENR2 S395 Marek Rychlik Research talk, and a group discussion of current topics:
  • Marek Rychlik (University of Arizona) TITLE: Character recognition with multi-class logistic regression. ABSTRACT: The multi-class logistic regression network is simple, has a good training algorithm, and achieves quite impressive accuracy on the MNIST dataset. I will discuss my recent paper on this subject, https://arxiv.org/abs/1903.12600, and the implementation in our GitHub repository.
11 Friday, 4-5-2019 ENR2 S395 Group discussion, Marek Rychlik, Yan Han We will cover a variety of topics:
  • NEH Grant proposal Tier 2 proposal content.
  • Report on meeting with Clayton Morrison's group (Marek Rychlik).
  • Tier 2 proposal and the ISO standard (Yan Han).
  • Report on the program im2latex (Marek Rychlik); this program converts images of math equations to LaTeX markup.
10 Friday, 3-29-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • The mechanics of training Tesseract.
  • Preparations for Tesseract training runs on multi-core computer(s).
  • NEH Grant proposal writing - Tier 2.
9 Friday, 3-22-2019 ENR2 S395 Group discussion We will discuss various on-going efforts. This includes:
  • New code on GitHub: an implementation of CTC in MATLAB
  • The page segmentation algorithm
  • Training Tesseract
  • NEH Grant proposal writing - Tier 2
8 Friday, 3-15-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will speak on CTC. ABSTRACT: CTC (Connectionist Temporal Classification) is a probability model for segmenting the outputs of recurrent neural networks into meaningful chunks. The technique is used for segmentation of handwriting and cursive scripts, and for natural speech. In Deep Learning, CTC is simply a rather sophisticated loss function. I will discuss the paper of Alex Graves (currently at DeepMind), who introduced this method to handwriting recognition and OCR about 10 years ago.
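    For readers unfamiliar with CTC, the idea in the abstract above can be sketched in a few lines of Python. This is an illustration only, not the code presented in the talk or the MATLAB implementation in the group's repository. The function names, the choice of index 0 for the blank symbol, and the use of plain (non-logarithmic) probabilities are assumptions of this sketch. It shows the collapsing map (merge repeated symbols, then drop blanks) and the forward recursion that sums the probabilities of all frame-level paths collapsing to a given label sequence; the loss is the negative logarithm of that sum.

```python
import math

BLANK = 0  # index of the blank symbol (an assumption of this sketch)


def ctc_collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_probability(probs, target):
    """Total probability of all paths that collapse to `target`.

    probs[t][k] is the probability of symbol k at frame t
    (for example, a softmax output of a recurrent network).
    """
    # Extended target: blanks before, between, and after the labels.
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]

    # alpha[i]: probability of being at position i of `ext` after frame t.
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for i, s in enumerate(ext):
            total = alpha[i]                  # stay on the same symbol
            if i >= 1:
                total += alpha[i - 1]         # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if i >= 2 and s != BLANK and s != ext[i - 2]:
                total += alpha[i - 2]
            new[i] = total * probs[t][s]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def ctc_loss(probs, target):
    """CTC loss: negative log-probability of the target labeling."""
    return -math.log(ctc_probability(probs, target))
```

    For example, ctc_collapse([1, 1, 0, 1, 2, 2]) yields [1, 1, 2]: the blank between the two 1s is what keeps a genuinely repeated label from being merged. Practical implementations work with log-probabilities to avoid underflow on long sequences.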
7 Friday, 3-1-2019 ENR2 S395 Marek Rychlik
  • Marek Rychlik (University of Arizona) will talk about the Tesseract C++ API. Some C++ programming examples will be discussed. One example allows translation of Pashto ligatures to Unicode.
6 Friday, 2-22-2019 ENR2 S395 Dylan Murphy
  • Dylan Murphy (University of Arizona) will discuss the OCR software. Open source packages Kraken and Tesseract will be discussed. The talk will cover use and system architecture of these systems, as well as the process of training for new language recognition.
5 Friday, 2-15-2019 ENR2 S395 Sayyed Vazirizade
  • Sayyed Vazirizade (University of Arizona) will review Persian OCR software.
4 Friday, 2-8-2019 ENR2 S395 Ryan Coatney, Yan Han
  • Ryan Coatney (University of Arizona) will continue talking about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (est. 25 min).
  • Yan Han (University of Arizona) will talk about APIs for embedding text in PDF (est. 25 min).
3 Friday, 2-1-2019 ENR2 S395 Mike Maizels, Ryan Coatney
  • Mike Maizels (Harvard and University of Arkansas) will discuss an arts-related project involving OCR (15 min).
  • Ryan Coatney (University of Arizona) will talk about a paper by Kobus et al., applying Gaussian processes to modeling 1-dimensional structures (plants), and potential applications to OCR (30 min).
2 Friday, 1-25-2019 ENR2 S395 Marek Rychlik A method for Chinese OCR using Hough and Fourier transforms. I will explain the algorithm published in our GitHub repository. Also, I will briefly describe the selected papers which we can collectively study. The slides of this talk are available.
1 Friday, 1-18-2019 ENR2 S395 Organizational meeting. Agenda will include:
  • Introductions
  • Description of the NEH grant research
  • Resources for Pashto and Chinese

Zoom recordings

They are available on the restricted page of this website. However, you need to ask the organizers for the credentials to access this page.

The organizers