Multi-lingual Optical Character Recognition Seminar

Fall 2023 Schedule of Meetings and Talks

Semester themes: medical forms, NLP, Large Language Models, ML

# Date Room Speaker or Topic Agenda or Title and Abstract
192 Tuesday, 12-05-2023 ENR2 S395, public Zoom link Group Discussion Agenda items:
  • An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik
  • Project updates
191 Tuesday, 11-28-2023 ENR2 S395, public Zoom link Group Discussion Agenda items:
  • What happened this week? (in the world of AI)
  • An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik
  • How to mount Soteria? - Marek Rychlik

    Abstract: "Mounting" is the UNIX (Linux, Mac OS) term for transparently accessing the storage of a remote system from a local computer. In recent days I explored a particular way of accessing Soteria using SSHFS ("Secure Shell File System"). This file system uses an encrypted data protocol (SFTP) to access files on the remote system. The great advantage of "mounting" over explicit file transfers is that the workflow for handling remote files is exactly the same as for local files. There is a version of SSHFS for Windows, too. In this talk I will demonstrate a mount of the Soteria file system.
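
    As an illustration of the mount described above, here is a minimal Python sketch that only assembles the `sshfs` and unmount command lines; it does not run them. The user name, host name, and paths are hypothetical placeholders, not the actual Soteria settings, and the `reconnect` option is merely a commonly used choice.

```python
import shlex


def build_sshfs_command(user, host, remote_path, mount_point, options=("reconnect",)):
    """Assemble the argument list for an SSHFS mount (nothing is executed)."""
    cmd = ["sshfs", f"{user}@{host}:{remote_path}", mount_point]
    for opt in options:
        cmd += ["-o", opt]  # each mount option is passed as "-o <option>"
    return cmd


def build_unmount_command(mount_point, linux=True):
    """Assemble the unmount command: fusermount -u on Linux, umount on macOS."""
    return ["fusermount", "-u", mount_point] if linux else ["umount", mount_point]


# Hypothetical account, host, and directories, for illustration only.
mount_cmd = build_sshfs_command(
    "alice", "soteria.example.edu", "/data", "/home/alice/soteria"
)
print(shlex.join(mount_cmd))
print(shlex.join(build_unmount_command("/home/alice/soteria")))
# To actually mount, one could pass mount_cmd to subprocess.run(mount_cmd, check=True).
```

    Once mounted, files under the mount point can be opened with ordinary local-file tools, which is the workflow advantage the abstract refers to.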

191 Tuesday, 11-21-2023 ENR2 S395, public Zoom link Group Discussion Agenda items:
  • Java code updates - Laasya Nellore
  • An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik
  • Liver data synchronization in PDF attachments and CT scans - Marek Rychlik, Duncan Bennett
  • Fuzzy matching approaches for correcting OCR errors - Marek Rychlik

    Abstract: Finding strings in text obtained by OCR leads to the problem of "fuzzy matching". One variant can be formulated as follows: given a string X and a text T, find all substrings of T that approximately match X. The most suitable definition of matching is based on the edit distance (Levenshtein metric). It appears that there is no good solution to the problem in terms of optimal computational complexity. The best solution I found uses the Wagner–Fischer algorithm, which gives the distances from all prefixes of a string at no extra cost. A demo of processing medical forms will be shown.
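
    The fuzzy matching problem stated above can be sketched in Python as follows. This is not necessarily the code shown in the talk: it is a minimal sketch of the Wagner–Fischer dynamic program with one assumed modification (the cost of starting a match is zero at every text position, often attributed to Sellers), so that the pattern may match anywhere in the text. The function name `fuzzy_find` and the sample strings are hypothetical.

```python
def fuzzy_find(pattern, text, max_dist):
    """Return (end, distance) pairs: some substring of text ending at index
    `end` (exclusive) is within `max_dist` edits of `pattern`."""
    m = len(pattern)
    # prev[i] = edit distance between pattern[:i] and the best-matching
    # substring of text that ends at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere in text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i] + 1,         # text has an extra character
                cur[i - 1] + 1,      # text is missing a pattern character
                prev[i - 1] + cost,  # match or substitution
            )
        if cur[m] <= max_dist:
            hits.append((end, cur[m]))
        prev = cur
    return hits


# "hepatlc" is a typical OCR misreading of "hepatic" (i read as l).
hits = fuzzy_find("hepatic", "the hepatlc vein", max_dist=1)
print(hits)
```

    The running time is proportional to len(pattern) × len(text), and each column of the table also contains the distances for every prefix of the pattern, which is the "for free" by-product mentioned in the abstract.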

190 Tuesday, 11-14-2023 ENR2 S395, public Zoom link Group Discussion Agenda items (tentative):
  • An OCR challenge: membership directory data (updates) - Yan Han, Marek Rychlik
    • UITS shared storage (5TB) to house the data (Marek)
    • First look at the data (Marek, Yan)
    • Using U of A network for data transfers: data rates, VPN (Marek)
  • Java code updates - Laasya Nellore
  • Liver data synchronization in PDF attachments and CT scans - Marek Rychlik, Duncan Bennett
189 Tuesday, 11-07-2023 ENR2 S395, public Zoom link Group Discussion Agenda items (tentative):
  • An OCR challenge: membership directory data (updates) - Yan Han
  • Putting OCR and LLM on a drone (updates) - Marek Rychlik, David Ryan, Jack Stevens
  • Liver data synchronization in PDF attachments and CT scans - Marek Rychlik
  • Grok, PromptIDE
  • ChatGPT 'Advanced Data Analysis' examples
  • Updates on various projects
188 Tuesday, 10-31-2023 ENR2 S395, public Zoom link Group Discussion Agenda items (tentative):
  • An OCR challenge: membership directory data (updates) - Yan Han
  • Putting OCR and LLM on a drone (updates) - Marek Rychlik, David Ryan, Jack Stevens
  • Liver data synchronization in PDF attachments and CT scans - Marek Rychlik

    Abstract: This dataset is our next step towards the extraction of DD data from "PDF attachments". The data features most of the difficulties that we have studied so far. Most of the data requires OCR. It features tables with hand-written digits and gesture input (manually encircled text). Thus, this content requires scaling up the techniques we have developed to a larger dataset (est. 15,000 records).

  • Updates on various projects
187 Tuesday, 10-24-2023 ENR2 S395, public Zoom link Group Discussion Agenda items (tentative):
  • An OCR challenge: membership directory data (continuation) - Yan Han
  • Putting OCR and LLM on a drone - Marek Rychlik

    Abstract: A newly funded project, with the participation of an undergraduate and a graduate student, aims to do OCR on a mobile platform: a drone! Scheduled to be finished by the end of the Spring semester, the project will explore the ability to incorporate OCR and an LLM (such as ChatGPT) into the control loop of a drone.

  • The "Advanced Data Analysis" feature of ChatGPT - Marek Rychlik

    Abstract: This feature offers a low-effort way to play with neural networks, statistics and mathematical modeling. This flavor of ChatGPT performs computations directly, using Python (including PyTorch, etc.). I will show how one can solve a classical puzzle in AI, the XOR problem, without any programming. The ChatGPT session can be accessed here: Train 3-Layer Perceptron Network

  • Updates on various projects
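
    For readers unfamiliar with the XOR problem mentioned in the "Advanced Data Analysis" abstract above, the following Python sketch shows why a hidden layer is enough to represent XOR. It is not the network trained in the ChatGPT session: the weights here are picked by hand rather than learned, and a step activation is assumed for simplicity.

```python
def step(z):
    """Threshold activation: 1 if the weighted input is positive, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Input layer -> two hidden units -> one output unit, hand-set weights."""
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1 fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2 fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: OR but not AND, which is XOR


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

    A single threshold unit cannot compute XOR because the two classes are not linearly separable; training, as in the session, amounts to finding weights like these automatically.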
186 Tuesday, 10-17-2023 ENR2 S395, public Zoom link Group Discussion Agenda items:
  • An OCR challenge: membership directory data - Yan Han
  • How to write your own ChatGPT plugin? - Marek Rychlik

    Abstract: The 9-27-2023 version of ChatGPT features a plugin architecture. Hundreds of plugins are already available, from hundreds of vendors. With the assistance of ChatGPT, we explored the path to writing our own plugin, thus extending ChatGPT functionality to our liking. I will describe the plugin architecture and show some sample code in Python.

  • Updates on various projects
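
    To make the plugin architecture from the abstract above more concrete: as documented at the time, a ChatGPT plugin consisted of a manifest served at `/.well-known/ai-plugin.json`, an OpenAPI description of the service, and the web API itself. The Python sketch below only builds such a manifest; it is not the sample code from the talk, and the plugin name, URLs, and e-mail address are hypothetical placeholders.

```python
import json


def make_plugin_manifest(name, description, base_url, contact_email):
    """Build a plugin manifest as a dict (all values here are illustrative)."""
    return {
        "schema_version": "v1",
        "name_for_human": name,
        "name_for_model": name.lower().replace(" ", "_"),
        "description_for_human": description,
        "description_for_model": description,
        "auth": {"type": "none"},  # assumes a service that needs no authentication
        "api": {"type": "openapi", "url": f"{base_url}/openapi.yaml"},
        "logo_url": f"{base_url}/logo.png",
        "contact_email": contact_email,
        "legal_info_url": f"{base_url}/legal",
    }


# Hypothetical plugin, for illustration only.
manifest = make_plugin_manifest(
    "OCR Helper",
    "Extracts text from scanned forms.",
    "https://plugin.example.org",
    "owner@example.org",
)
print(json.dumps(manifest, indent=2))
```

    The model reads `description_for_model` together with the OpenAPI file to decide when and how to call the service, so in practice that description deserves more care than this placeholder.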
185 Tuesday, 10-10-2023 ENR2 S395, HIPAA-Zoom Group Discussion Agenda items:
  • OCR and the case for custom language models - Marek Rychlik

    Abstract: Medical forms provide interesting examples in the area of OCR. In this segment, I would like to discuss an example in which the English language model results in bad page segmentation (rendering it useless for data extraction) while a custom language model (English + some Unicode characters) fixes the problem. Thus, the seemingly independent tasks of spotting characters and words and segmenting text into lines in reality depend closely on each other. Sample code written in MATLAB with the aid of ChatGPT will be discussed.

  • Updates on various projects
184 Tuesday, 10-03-2023 ENR2 S395, public Zoom link Duncan Bennett, Marek Rychlik Agenda items:
  • Report on ChatGPT version of 9-27-2023 - Duncan Bennett and Marek Rychlik

    Abstract: This version of ChatGPT vastly expands the capabilities through the use of plugins. The chats interact with real-time data by using Web browser plugins. The OCR plugins allow inputting scanned documents. Also, documents can be generated in a variety of formats (PowerPoint, image, diagram). A tutorial and examples will be given.

    In addition, this capability may be similar to an LLM called Toolformer. Toolformer learns to use API calls as tools via the finetuning of a standard LLM. These API calls can range from a simple database query to a query of Wolfram Alpha or a specialized LLM.

  • OCR and the case for custom language models - Marek Rychlik

    Abstract: Medical forms provide interesting examples in the area of OCR. In this segment, I would like to discuss an example in which the English language model results in bad page segmentation (rendering it useless for data extraction) while a custom language model (English + some Unicode characters) fixes the problem. Thus, the seemingly independent tasks of spotting characters and words and segmenting text into lines in reality depend closely on each other. Sample code written in MATLAB with the aid of ChatGPT will be discussed.

183 Tuesday, 09-28-2023 ENR2 S395, Public Zoom Duncan Bennett Abstract: Much of the recent progress in large language models can be attributed to larger models and larger datasets. For example, the first 3 iterations of the Generative Pre-trained Transformer (GPT) have had nearly identical architectures with increasing model size (100 million to 100 billion parameters) and increasing training set size (not disclosed). However, this type of scaling cannot go on forever, and recent speculation (or leaks) in the AI community indicates a change in the architecture of GPT-4. In this talk, we'll discuss the history of the GPT model and some speculated changes in the GPT-4 model and training. These speculated changes in GPT-4 include mixture of experts (MoE), instruction finetuning, and speculative decoding.
182 Tuesday, 09-19-2023 ENR2 S395, HIPAA-Zoom Group discussion Agenda items:
  • Updates on various projects (Marek Rychlik, Yan Han)
  • ChatGPT 4 as a force multiplier (Marek Rychlik)

History of meetings

The seminar has met continuously since 2019 and has covered a variety of topics. The currently unmaintained page "Prior meetings" has a record of these meetings.

Zoom recordings

Recordings of public meetings are available for a limited time (180 days) in D2L.

The organizers