Multi-lingual Optical Character Recognition Seminar

Fall 2023 Schedule of Meetings and Talks

Semester themes: medical forms, NLP, Large Language Models, ML

Usual meeting time: Tuesday, 11:00am - 12:00pm, Room ENR2 S395.
Meetings will be dual mode (live and on Zoom).
D2L website is set up for the seminar, with useful information:
- Link to the seminar D2L site
The Zoom meetings will be on either a public Zoom or a HIPAA-compliant Zoom; the latter is only for those who are authorized to view private data and are enrolled in the Seminar course on D2L. Below are the Zoom links:
- public Zoom link
- UITS HIPAA-Zoom login - start HIPAA-compliant Zoom here
- HIPAA-compliant Zoom link

#	Date	Room	Speaker or Topic	Agenda or Title and Abstract
192	Tuesday, 12-05-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items: An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik Project updates
191	Tuesday, 11-28-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items: What happened this week? (in the world of AI) An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik How to mount Soteria? - Marek Rychlik Abstract: "Mounting" is the UNIX (Linux, Mac OS) term that refers to transparently accessing the storage of a remote system on a local computer. In recent days I explored a particular way of accessing Soteria using SSHFS ("Secure Shell File System"). This file system uses an encrypted data protocol (SFTP) to access files on the remote system. There is a great advantage of "mounting" over explicit file transfers, as the workflow for handling remote files is exactly the same as for local files. There is a version of SSHFS for Windows, too. In this talk I will demonstrate a mount of the Soteria file system.
191	Tuesday, 11-21-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items: Java code updates - Laasya Nellore An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik Liver data synchronization in PDF PDF attachments and CT scans - Marek Rychlik, Duncan Bennett Fuzzy matching approaches for correcting OCR errors - Marek Rychlik Abstract: Finding strings in text obtained by OCR leads to the problem of "fuzzy matching". One variant can be formulated as follows: Given a string X and a text T, find all substrings of T that approximately match X. The most suitable definition of matching is based on the edit distance (Levenshtein metric). It appears that there is no good solution to the problem in terms of optimal computational complexity. The best solution I found is the Wagner–Fischer algorithm, which gives the distances to all prefixes of a string for free. A demo of processing medical forms will be shown.
190	Tuesday, 11-14-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items (tentative): An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik UITS shared storage (5TB) to house the data (Marek) First look at the data (Marek, Yan) Using U of A network for data transfers: data rates, VPN (Marek) Java code updates - Laasya Nellore Liver data synchronization in PDF PDF attachments and CT scans - Marek Rychlik, Duncan Bennett
189	Tuesday, 11-07-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items (tentative): An OCR challenge: membership directory data (updates) - Yan Han Putting OCR and LLM on a drone (updates) - Marek Rychlik, David Ryan, Jack Stevens Liver data synchronization in PDF PDF attachments and CT scans - Marek Rychlik Grok, PromptIDE ChatGPT 'Advanced Data Analysis' examples Updates on various projects
188	Tuesday, 10-31-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items (tentative): An OCR challenge: membership directory data (updates) - Yan Han Putting OCR and LLM on a drone (updates) - Marek Rychlik, David Ryan, Jack Stevens Liver data synchronization in PDF PDF attachments and CT scans - Marek Rychlik Abstract: This data is our next goal towards extraction of DD data from "PDF attachments". The data features most of the difficulties that we have studied so far. Most of the data require OCR. It features tables with hand-written digits and gesture input (manually encircled text). Thus, this content requires upscaling the techniques we have developed to a larger dataset (est. 15,000 records). Updates on various projects
187	Tuesday, 10-24-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items (tentative): An OCR challenge: membership directory data (continuation) - Yan Han Putting OCR and LLM on a drone - Marek Rychlik Abstract: A newly funded project, with the participation of an undergraduate and a graduate student, aims to do OCR on a mobile platform - a drone! Scheduled to be finished by the end of the Spring semester, the project will explore the ability to incorporate OCR and LLM (such as ChatGPT) into the control loop of a drone. The "Advanced Data Analysis" feature of ChatGPT - Marek Rychlik Abstract: This feature offers a low-effort way to play with neural networks, statistics and mathematical modeling. This flavor of ChatGPT performs computations directly, using Python (including PyTorch, etc.). I will show how one can solve a classical puzzle in AI, the XOR problem, without any programming. The ChatGPT session can be accessed here: Train 3-Layer Perceptron Network Updates on various projects
186	Tuesday, 10-17-2023	ENR2 S395, public Zoom link	Group Discussion	Agenda items: An OCR challenge: membership directory data - Yan Han How to write your own ChatGPT plugin? - Marek Rychlik Abstract: The 9-27-2023 version of ChatGPT features a plugin architecture. Hundreds of plugins are already available, from hundreds of vendors. With the assistance of ChatGPT, we explored the path to writing our own plugin, thus extending ChatGPT functionality to our liking. I will describe the plugin architecture and show some sample code in Python. Updates on various projects
185	Tuesday, 10-10-2023	ENR2 S395, HIPAA-Zoom	Group Discussion	Agenda items: OCR and the case for custom language models - Marek Rychlik Abstract: Medical forms provide interesting examples in the area of OCR. In this segment, I would like to discuss an example in which the English language model results in bad page segmentation (rendering it useless for data extraction) while a custom language model (English + some Unicode characters) fixes the problem. Thus, the seemingly independent tasks of spotting characters and words and segmenting text into lines in reality depend closely on each other. Sample code written in MATLAB with aid from ChatGPT will be discussed. Updates on various projects
184	Tuesday, 10-03-2023	ENR2 S395, public Zoom link	Duncan Bennett, Marek Rychlik	Agenda items: Report on ChatGPT version of 9-27-2023 - Duncan Bennett and Marek Rychlik Abstract: This version of ChatGPT vastly expands the capabilities through the use of plugins. The chats interact with real-time data by using Web browser plugins. The OCR plugins allow inputting scanned documents. Also, documents can be generated in a variety of formats (PowerPoint, image, diagram). A tutorial and examples will be given. In addition, this capability may be similar to an LLM called a Toolformer. The Toolformer learns to use API calls as tools via the finetuning of a standard LLM. These API calls can range from a simple database query to a query of Wolfram Alpha or a specialized LLM. OCR and the case for custom language models - Marek Rychlik Abstract: Medical forms provide interesting examples in the area of OCR. In this segment, I would like to discuss an example in which the English language model results in bad page segmentation (rendering it useless for data extraction) while a custom language model (English + some Unicode characters) fixes the problem. Thus, the seemingly independent tasks of spotting characters and words and segmenting text into lines in reality depend closely on each other. Sample code written in MATLAB with aid from ChatGPT will be discussed.
183	Tuesday, 09-28-2023	ENR2 S395, Public Zoom	Duncan Bennett	Abstract: Much of the recent progress in large language models can be attributed to larger models and larger datasets. For example, the first 3 iterations of the Generative Pre-trained Transformer (GPT) have had near identical architecture with increasing model size (100 million to 100 billion parameters) and increasing training set size (not disclosed). However, this type of scaling cannot go on forever, and recent speculation (or leaks) in the AI community indicates a change in the architecture of GPT4. In this talk, we'll discuss the history of the GPT model and some speculated changes in the GPT4 model and training. These speculated changes in GPT4 include mixture of experts (MoE), instruction finetuning and speculative decoding.
182	Tuesday, 09-19-2023	ENR2 S395, HIPAA-Zoom	Group discussion	Agenda items: Updates on various projects (Marek Rychlik, Yan Han) Chat GPT 4 as a force multiplier (Marek Rychlik)
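
A note on the fuzzy matching abstract from the 11-21-2023 meeting: the following is a minimal Python sketch of approximate substring search, not the code shown in the talk. It assumes a common variant of the Wagner–Fischer table (usually attributed to Sellers) in which the entry for the empty pattern is always zero, so that a match may start anywhere in the text T. The function name fuzzy_find and the sample strings are hypothetical.

```python
def fuzzy_find(pattern, text, max_dist):
    """Return (end, distance) pairs: some substring of text ending at
    index `end` (exclusive) is within max_dist edits of pattern."""
    m = len(pattern)
    # prev[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the previous position.
    prev = list(range(m + 1))  # before any text is read
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] == 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
                prev[i - 1] + cost,  # match or substitution
            )
        if cur[m] <= max_dist:
            hits.append((end, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    # Hypothetical OCR output in which "i" was misread as "1".
    print(fuzzy_find("Patient", "Name of Pat1ent: Doe", 1))
```

The running time is proportional to len(pattern) * len(text), and only two columns of the table are kept in memory. The sketch reports end positions only; recovering the start of each match would require a backward pass or storing the full table.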

History of meetings

The seminar has met continuously since 2019 and has covered a variety of topics. The currently unmaintained "prior meetings" page has a record of these meetings.

Zoom recordings

Recordings of public meetings are available for a limited time (180 days) in D2L.

The organizers

Marek Rychlik - Department of Mathematics (rychlik@arizona.edu)
Yan Han - Library Science