Introduction: Why Proof-of-Delivery Receipt Recognition is a Tough Nut to Crack for Printing Industry Digitalization
The production process in the printing industry relies heavily on the circulation of paper documents. From work orders issued by sales and factory-side receipts (delivery confirmations, shipment notes, on-site process verification forms) to proof-of-delivery receipts from logistics, these documents carry critical information such as order specifications, quantities, delivery dates, and responsibility attribution. When printing factories attempt to digitalize scheduling, capacity, and accounting, receipt recognition is often the first hurdle and the one most likely to trip them up. The difficulty lies not in 'reading the characters' but in inconsistent layouts, formats that vary from vendor to vendor, frequent handwritten remarks and alterations, and uneven image quality from on-site photography [1].
In recent years, the maturation of generative AI and multimodal models has popularized the notion that 'the OCR problem has long been solved.' However, directly applying Vision Language Models (VLMs) in a real production environment is a fundamentally different proposition from achieving high scores on clean datasets. A study on a dataset constructed from receipts photographed by mobile devices in Japan indicated that even with specialized fine-tuning for structured invoice data extraction, model performance still heavily depends on the representativeness and layout diversity of the dataset [2]. In other words, benchmark numbers cannot be directly extrapolated to the specific document types of any given factory
This article addresses three research questions:
・First, through which generations has proof-of-delivery receipt recognition technology evolved, and what are the applicable boundaries of each generation?
・Second, why might 'the latest model' not be 'the most suitable solution,' and what factors determine technology selection?
・Third, for small and medium-sized Taiwanese printing factories with limited resources, what architectural principles and workflow split logic should be followed to implement a functional proof-of-delivery receipt recognition system?
This article uses a firsthand account of a Taiwanese engineer's successful OCR implementation for proof-of-delivery receipts as its core case study [1], and critically integrates literature on invoice OCR and AI adoption governance.
The contribution of this article is that it reframes proof-of-delivery receipt recognition not as a simple model selection problem but as a system engineering challenge involving a collaborative three-layer architecture of 'recognition, structured extraction, and review.' It also proposes actionable workflow split principles. For printing factories evaluating the digitalization of work order processes, this article provides a rare local implementation perspective

Literature and Current Status Review: The Shift from Model-Centric to System-Centric Discourse
Existing discussions on document recognition fall into three clusters according to their core concerns, and the stances of these clusters are in clear tension with one another.
The first cluster is the model capability-centric view. This approach focuses on how to achieve higher scores for a single model in invoice extraction tasks. The aforementioned Japanese mobile receipt study falls into this category: it constructs a labeled dataset at roughly the 1.3K scale and fine-tunes a VLM to output structured receipt fields, demonstrating that 'dataset quality combined with targeted fine-tuning' can significantly improve structured extraction accuracy [2][4]. The value of this type of research lies in providing reproducible methodologies and quantitative benchmarks, but its implicit premise is a 'relatively consistent data distribution.' When faced with the long-tail distribution common in printing factories, where each vendor has a unique format and new formats are constantly added, both the maintenance cost and the generalization ability of a single fine-tuned model come under strain.
The second cluster is tools and engineering practice. With the popularization of AI coding agents, developers can connect OCR, LLM, and backend logic at a lower cost. Relevant practical literature records the collaborative modes and limitations of AI coding agents in real development scenarios, indicating that they can accelerate the generation of boilerplate code and tool integration, but human intervention is still required for judgments involving domain knowledge [5]. There are also package implementations that integrate AI coding agents into specific analysis environments (such as RStudio), showing that 'assisting data processing pipelines with agents' has become a viable engineering paradigm [3]. This cluster shifts the focus from 'how strong the model is' to 'how the system is built,' forming a complementary rather than a replacement relationship with the first cluster
The third cluster is AI adoption governance. This approach steps outside technical details to explore how organizations should 'wisely manage AI.' Related research emphasizes that the success or failure of AI systems depends not only on algorithmic accuracy but also on the division of responsibilities between humans and the system, and the institutionalized handling of uncertainty [6]. This perspective is especially critical for proof-of-delivery receipt recognition: when a model cannot reliably interpret a poor-quality photo, system designers must decide in advance 'who handles this situation and what fallback process is used,' rather than expecting the model to achieve an impossible 100% accuracy
Synthesizing the three clusters reveals a trend of discursive shift: early discussions leaned towards model capability-centric views, assuming that a strong enough model would solve the problem; recent discussions have gradually shifted towards system and governance-centric views, acknowledging the limitations of models and recognizing that the success of implementation is truly determined by the design of preprocessing, workflow splitting mechanisms, and human review. However, existing literature mostly remains within its own cluster: model research rarely discusses the long-tail and fallback scenarios in production environments, engineering practices rarely discuss quantitative accuracy boundaries, and governance research tends to be abstract and lacks concrete technical implementation details. This article's analysis suggests that the intersection of these three is precisely the research gap in the discussion of proof-of-delivery receipt recognition implementation, and a complete local implementation record can fill this gap [1]

Three Generations of Evolution: Each Still Alive, Differentiated by Scenario
The technological evolution of proof-of-delivery receipt recognition can be divided into three generations. The key point is that this is not a linear story of 'who replaces whom' but one of coexistence: each generation remains relevant, depending on the scenario and security requirements [1].
The first generation is the OCR plus Regular Expression (Regex) approach. Its method is to first convert images into text using traditional OCR engines (such as Tesseract, Google Document AI), and then extract fields one by one using Python regular expressions: where the order number is, what the date format is, what rules the address conforms to [1]. The advantages of this approach are clear: low cost, offline capability, fast speed, very stable and predictable when formats are fixed, easy to debug, requires no LLM, and has no token cost [1]. However, its fragility is equally clear: the format changes and it breaks; a different type of document requires rewriting a new set of regexes; if OCR misrecognizes or misses a single character, the entire regex match fails; the more clients and the more diverse the formats, the longer and more brittle the regex becomes, ultimately leading to a maintenance nightmare. This article argues that the fundamental limitation of the first generation is its complete lack of semantic understanding, relying solely on rigid string matching, and thus unable to cope with the long-tail of document formats in the printing industry
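The fragility described above can be illustrated with a minimal sketch. The field names, patterns, and sample text below are hypothetical and stand in for one vendor's layout; they are not taken from the case study. The example shows a regex set that works on clean OCR output and then loses the order number when a single digit is misread.

```python
import re

# Hypothetical patterns for ONE vendor's layout; every new layout needs its own set.
PATTERNS = {
    "order_no": re.compile(r"Order\s*No[.:]?\s*([A-Z]{2}\d{6})"),
    "date": re.compile(r"Date[.:]?\s*(\d{4}-\d{2}-\d{2})"),
    "quantity": re.compile(r"Qty[.:]?\s*(\d+)"),
}


def extract_fields(ocr_text: str) -> dict:
    """Return each field's captured value, or None when its pattern does not match."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        result[name] = match.group(1) if match else None
    return result


# Clean OCR output: all three patterns match.
clean = "Order No: PO123456\nDate: 2025-03-14\nQty: 500"

# Same receipt, but OCR misread the digit 0 as the letter O in the order number.
noisy = "Order No: PO1234O6\nDate: 2025-03-14\nQty: 500"

print(extract_fields(clean))
print(extract_fields(noisy))
```

A single misread character is enough for the `order_no` pattern to fail entirely, while the other fields still match. Loosening the pattern to tolerate such errors is possible, but each relaxation has to be repeated per field and per layout, which is where the maintenance burden comes from.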
The second generation is the OCR plus text LLM approach. This also starts by converting images to text using OCR, but instead of hardcoding regexes, the OCR output text is fed to a text-based LLM, which understands the semantics, extracts fields, and fills in gaps [1]. According to a firsthand account, this method significantly improved accuracy immediately due to four reasons: format changes do not require rewriting regexes as the LLM self-learns semantics; it can use context to recover characters missed by OCR; it can recognize synonymous or alias fields ('order number,' 'consignment number' can both be identified); development is fast, and maintenance costs are greatly reduced [1]. More critically, both OCR and text LLMs have mature on-premise solutions, ensuring data does not leave the company, which is a decisive advantage for personal data and sensitive documents [1]. This point echoes the 'data sovereignty and responsibility boundaries' emphasized in AI adoption governance literature [6]
However, the ceiling of the second generation is locked by the preceding OCR stage. If OCR misreads first, the LLM receives incorrect text, leading to 'garbage in, garbage out'; layout and color information are lost during the OCR process—red/blue ink, table structures, and handwritten lines all disappear, leaving the LLM completely unaware; handwritten content, signatures, and alterations, which can only be understood by looking at the image, become distorted once converted to text [1]. This article argues that the value and limitations of the second generation are two sides of the same coin: it solves the regex pain point and can run entirely on-premise, but at the cost of the entire pipeline's recognition ceiling being constrained by the quality of the initial OCR layer
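The second-generation flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the engineer's actual implementation: the field list, the prompt wording, and the `llm` callable are all hypothetical, and `fake_llm` merely stands in for whatever on-premise text model is deployed.

```python
import json
from typing import Callable

# Illustrative target fields; a real schema would come from the factory's own documents.
FIELDS = ["order_no", "delivery_date", "quantity"]


def build_prompt(ocr_text: str) -> str:
    """Wrap raw OCR text in an instruction asking for JSON-only output."""
    return (
        "Extract the following fields from the receipt text and reply with JSON only. "
        f"Fields: {', '.join(FIELDS)}. Use null for anything you cannot find.\n\n"
        f"Receipt text:\n{ocr_text}"
    )


def extract_with_llm(ocr_text: str, llm: Callable[[str], str]) -> dict:
    """Send OCR text to a text LLM and keep only the expected fields.

    `llm` is any function that takes a prompt and returns the model's reply as a string.
    Unparseable or non-object replies yield None for every field instead of raising.
    """
    raw = llm(build_prompt(ocr_text))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in FIELDS}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs without any model installed.
    return '{"order_no": "PO123456", "delivery_date": "2025-03-14", "quantity": 500}'


print(extract_with_llm("Order No PO123456 ...", fake_llm))
```

Note that the function only ever sees `ocr_text`. Whatever the OCR stage dropped or misread (ink color, table lines, handwriting) is already gone by the time the prompt is built, which is the ceiling discussed above.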
The third generation is direct judgment by Vision LLM. The latest approach bypasses OCR, directly feeding proof-of-delivery receipt images to multimodal models (such as GPT-4o, Claude), allowing them to simultaneously 'see' the image and understand semantics, outputting structured fields in one step [1]. Its value lies in directly addressing most pain points of the first two generations: it understands layouts, tables, colors, and handwritten lines; it can interpret handwriting, alterations, checkboxes, signatures, and red/blue ink; it can use logic and context to differentiate similar-looking characters (1 and l, O and 0) and infer semantics; it is template-free, regex-free, and can handle format changes [1]. This is consistent with the research conclusions of fine-tuning VLMs specifically for structured invoice data extraction, which also confirmed the advantage of multimodal models in processing real invoices with complex layouts [2]
But the cost of the third generation lies elsewhere: slow inference speed—image input and heavy inference are significantly slower than pure text processes; high vision token costs, which are very noticeable at scale; powerful vision models are mostly cloud-based, and achieving fully on-premise operation with data not leaving the company is still difficult, which is why the second generation remains valuable today; moreover, it still cannot achieve 100% accuracy—poor-quality photos due to moisture or casual mobile phone shots simply don't capture the information, and the model cannot salvage it [1]. This article argues that the limitations of the third generation precisely confirm the core proposition of governance literature: model uncertainty is structurally inherent and must be absorbed by institutional frameworks and processes, rather than expecting the model to eliminate it on its own [6]

Toolbox and Selection Logic: The Trilemma of Cost, On-Premise Capability, and Recognition Accuracy
When the abstract three-generation evolution is mapped onto concrete tools, a clear trade-off triangle emerges: low cost, on-premise capability, and recognition accuracy are difficult to achieve simultaneously. Selection is essentially a matter of prioritizing these three dimensions for the scenario at hand.
In the traditional OCR engine layer (the front-end of the first and second generations), the practical record lists three solutions that have been used [1]. Tesseract is the oldest open-source engine, purely on-premise, free, and has many language packs. Its advantages are stability, offline capability, and a large community, but it struggles with Chinese, handwriting, and complex layouts. Recognition rates significantly drop for distorted or poor-quality photos taken on-site. It is suitable for clean formats, primarily printed text, as a baseline [1]. PaddleOCR, open-sourced by Baidu, can be deployed on-premise (supporting NVIDIA GPU, Intel CPU, and various other hardware backends) and supports over 100 languages. Its greatest value lies in its exceptional performance with Chinese and tables, making it superior to Tesseract for scenarios like receipts with mixed Traditional Chinese and tables. It has also integrated the entire pipeline from 'PDF or image to structured JSON or Markdown,' including layout analysis. If a fully on-premise solution for Chinese documents is needed, PaddleOCR is almost the preferred baseline [1]. Google Cloud Vision or Document AI offers high recognition rates, mature layout analysis, easy API integration, and handles handwriting and complex documents well, providing an excellent development experience. However, its major drawback is that it's a cloud service, meaning data must leave the company, which inherently conflicts with the need for 'sensitive documents to be processed on-premise' [1]
In the on-premise Vision LLM layer (third generation), the open-source community has rapidly caught up, with several models from 2025 to 2026 worth noting [1]:
・Qwen2.5-VL (Alibaba) has parameter sizes from 7B to 72B, achieves 95.7 in DocVQA, is strong in handwriting, tables, and multilingual document parsing, and has the most mature ecosystem. It is a leading candidate for general documents and receipts [1].
・PaddleOCR-VL (Baidu), in its latest version, has approximately 0.9B parameters, scores over 96% in OmniDocBench v1.6, outperforms many cutting-edge large models in native OCR benchmarks, and supports 109 languages. It is suitable for purely on-premise scenarios aiming for OCR accuracy and lightweight deployment [1].
・dots.ocr (rednote) has approximately 1.7B parameters, integrates layout detection and content recognition into a single process, supports over 100 languages, and has been officially integrated by vLLM. It is considered SOTA among small models [1].
・MiniCPM-V 2.6 has about 8B parameters, with a size of approximately 5.5GB, easily fitting onto a single card or even edge devices, with excellent OCR performance. It is suitable for scenarios with limited resources requiring on-premise deployment on small machines [1].
・olmOCR 2 (AllenAI) has approximately 7B parameters, is trained with RLVR, and is fully open-source (including data and code) [1].
This article's analysis suggests that this toolbox reveals a selection logic different from the model capability-centric view: the problem is not 'which model scores highest,' but 'which dimension is non-negotiable for your scenario.' If sensitive data cannot leave the company, on-premise capability is a hard constraint, and selection converges directly to PaddleOCR plus text LLM or on-premise Vision LLM. If handwriting and alterations are frequent, and data can be uploaded to the cloud, then recognition accuracy takes precedence, making cloud Vision LLM a reasonable choice [1]. The aforementioned research on fine-tuning VLMs also indirectly supports this judgment: datasets and models must align with the target scenario, and discussing model superiority in isolation from the scenario has limited meaning [2][4]
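The selection logic can be written down as a small decision function. This is only a sketch of the reasoning in the paragraph above: the `Scenario` type and its two flags are hypothetical simplifications. The split between the two on-premise options by handwriting frequency, and the default for clean documents when cloud upload is allowed, are assumptions added for illustration rather than rules from the source.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    data_may_leave_company: bool  # False means on-premise is a hard constraint
    frequent_handwriting: bool    # handwriting and alterations are common


def choose_pipeline(s: Scenario) -> str:
    """Pick a starting pipeline by checking the non-negotiable constraint first."""
    if not s.data_may_leave_company:
        # Hard constraint: only on-premise options remain.
        # Assumption: handwriting-heavy documents justify the heavier on-premise model.
        if s.frequent_handwriting:
            return "on-premise Vision LLM"
        return "PaddleOCR + on-premise text LLM"
    if s.frequent_handwriting:
        # Data may be uploaded, so recognition accuracy takes precedence.
        return "cloud Vision LLM"
    # Assumption: clean documents stay on the inexpensive on-premise path.
    return "PaddleOCR + on-premise text LLM"


print(choose_pipeline(Scenario(data_may_leave_company=False, frequent_handwriting=True)))
print(choose_pipeline(Scenario(data_may_leave_company=True, frequent_handwriting=True)))
```

The order of the checks is the point: the hard constraint is evaluated before any accuracy consideration, so a benchmark score never gets the chance to override it.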
A more pragmatic conclusion is that the two are often used in combination: clear documents follow an inexpensive on-premise process, while difficult ones are sent to Vision LLM [1]. This hybrid approach is essentially a cost-splitting strategy, reserving expensive high-level inference resources for the few truly necessary difficult cases, rather than indiscriminately applying the heaviest model to every document

Architectural Philosophy: Minimize Recognition, Maximize System, Defer to Human for Uncertainty
The practical record condenses the lessons learned into an architectural philosophy: minimize recognition, maximize system, and defer to human for uncertainty [1]. This article argues that this statement can be broken down into three layers of system design principles, each of which echoes the governance literature.
The first layer is preprocessing standardization. A significant portion of proof-of-delivery receipt recognition failures does not occur in the model but in the input. Information is simply not fully captured in damp, distorted, or poorly photographed images, and even the most powerful model cannot conjure something from nothing [1]. Therefore, the system's first engineering task is to standardize the input as much as possible before recognition: deskewing, cropping, contrast enhancement, and filtering out images of unacceptable quality. This article argues that the design philosophy of this layer is to 'intercept uncertainty upfront.' Rather than allowing poor input to pollute the entire pipeline, it's better to route it out at the entry point. The issue of dataset layout diversity emphasized by the Japanese mobile receipt study is essentially a reminder that input variations must be systematically handled, rather than entirely offloaded to the model [2]
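A minimal sketch of 'intercepting uncertainty upfront' is shown below. It is an assumption-laden illustration: images are represented as plain nested lists of grayscale values (0 to 255) so the example runs with the standard library alone, the contrast threshold is hypothetical, and deskewing and cropping are omitted. A production system would more likely rely on an image-processing library.

```python
from statistics import pstdev

MIN_CONTRAST = 20.0  # hypothetical threshold on the standard deviation of pixel values


def flatten(img):
    """Turn a 2-D grayscale image (list of rows) into a flat list of pixels."""
    return [p for row in img for p in row]


def is_acceptable(img) -> bool:
    """Quality gate: reject images whose pixels are nearly uniform (washed out or blank)."""
    return pstdev(flatten(img)) >= MIN_CONTRAST


def stretch_contrast(img):
    """Linearly rescale pixel values so the darkest becomes 0 and the brightest 255."""
    pixels = flatten(img)
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [row[:] for row in img]  # flat image: nothing to stretch
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


washed_out = [[120, 122], [121, 123]]  # almost no contrast
readable = [[10, 240], [245, 15]]      # strong dark/bright separation

for name, img in (("washed_out", washed_out), ("readable", readable)):
    if is_acceptable(img):
        print(name, "-> enhance and continue:", stretch_contrast(img))
    else:
        print(name, "-> route out at the entry point (retake or manual handling)")
```

The gate runs before any recognition step, so a rejected image never consumes OCR or LLM resources and never produces a plausible-looking but unfounded extraction.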
The second layer is LLM structured extraction. This layer corresponds to the spirit of 'minimize recognition': rather than demanding that the model complete all judgments at once, it lets the model focus on converting layout content into structured fields. Whether using the second-generation text LLM or the third-generation Vision LLM, the core is to map unstructured images or text to a clear schema (order number, item name, quantity, delivery date, receipt status, etc.) [1]. This article argues that schema-fying the extraction task has two benefits:
・First, the output can be directly consumed by downstream systems, reducing post-processing costs.
・Second, the schema provides a verifiable anchor, allowing the system to determine whether a field has been reliably extracted.
AI coding agents can particularly accelerate development at this layer, automating integration and templating logic, allowing engineers to focus on schema and validation rule design [5][3].
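The idea of a schema as a 'verifiable anchor' can be sketched as follows. The field names follow the schema listed above, but the types, the allowed status values, and the specific checks are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical set of valid statuses; a real list would come from the factory's workflow.
ALLOWED_STATUS = {"received", "partial", "rejected"}


@dataclass
class ReceiptRecord:
    order_no: Optional[str]
    item_name: Optional[str]
    quantity: Optional[int]
    delivery_date: Optional[str]  # expected as ISO 8601, e.g. "2025-03-14"
    receipt_status: Optional[str]


def validate(record: ReceiptRecord) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not record.order_no:
        problems.append("order_no missing")
    if record.quantity is None or record.quantity <= 0:
        problems.append("quantity missing or not positive")
    try:
        date.fromisoformat(record.delivery_date or "")
    except ValueError:
        problems.append("delivery_date is not a valid ISO date")
    if record.receipt_status not in ALLOWED_STATUS:
        problems.append("receipt_status not recognized")
    return problems


good = ReceiptRecord("PO123456", "A4 flyer", 500, "2025-03-14", "received")
bad = ReceiptRecord(None, "A4 flyer", 500, "2025-02-30", "received")

print(validate(good))
print(validate(bad))
```

Because the checks are tied to the schema rather than to any particular model, the same validation applies whether the fields came from a text LLM or a Vision LLM.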
The third layer is the human review gateway. This is the key to the entire architecture and the institutionalized manifestation of 'defer to human for uncertainty.' The extraction of each field by the model should be accompanied by a confidence score or validation result. When the confidence level falls below a threshold, or logical inconsistencies arise between fields (e.g., quantity and amount do not match), the system should not automatically approve but should route the document for manual review [1]. This article argues that this layer's design transforms the model's structural uncertainty into a manageable human workflow, precisely embodying the 'wise management of AI' advocated by governance literature: the system does not pretend to be perfect but designs in advance the allocation of responsibility and fallback paths for uncertain situations [6]
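The gateway logic described above can be sketched in a few lines. The threshold value, the field names, and the assumption that every extracted field arrives with a confidence score between 0 and 1 are all illustrative. The quantity and amount check uses a hypothetical `unit_price` field and integer values to keep the comparison exact.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cut-off; would need tuning on real review data


@dataclass
class ExtractedField:
    value: Any
    confidence: float  # assumed to be a score in [0, 1] supplied by the extraction step


def review_reasons(fields: Dict[str, ExtractedField]) -> List[str]:
    """Collect every reason this document should not be auto-approved."""
    reasons = []
    for name, field in fields.items():
        if field.confidence < CONFIDENCE_THRESHOLD:
            reasons.append(f"low confidence: {name}")
    qty = fields.get("quantity")
    price = fields.get("unit_price")
    amount = fields.get("amount")
    # Cross-field consistency check, applied only when all three fields are present.
    if qty is not None and price is not None and amount is not None:
        if qty.value * price.value != amount.value:
            reasons.append("quantity x unit_price does not match amount")
    return reasons


def route(fields: Dict[str, ExtractedField]) -> str:
    """Send the document to a human whenever any reason is recorded."""
    return "manual_review" if review_reasons(fields) else "auto_approve"


ok = {
    "quantity": ExtractedField(500, 0.99),
    "unit_price": ExtractedField(3, 0.97),
    "amount": ExtractedField(1500, 0.95),
}
blurry = dict(ok, quantity=ExtractedField(500, 0.40))
mismatch = dict(ok, amount=ExtractedField(1800, 0.95))

for name, doc in (("ok", ok), ("blurry", blurry), ("mismatch", mismatch)):
    print(name, route(doc), review_reasons(doc))
```

Returning the reasons alongside the routing decision gives the reviewer a starting point and leaves a record of why a document was held back.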
Considering the three layers together, a typical workflow splitting scenario can be inferred. Suppose a printing factory receives 1000 proof-of-delivery receipts daily. Approximately 80% are clean, printed documents that can be processed quickly and cost-effectively by on-premise OCR plus text LLM; about 15% are moderately difficult documents with handwriting or alterations, routed to Vision LLM; the remaining approximately 5% are of excessively poor quality or internally contradictory and are sent directly for manual review [1]. In this estimated scenario, the most expensive cloud Vision LLM only needs to process about 15% of the volume, and human effort is focused only on the most challenging few cases. This article argues that this layered splitting is not only an optimization of accuracy but also an optimization of cost structure, allowing the system's marginal cost to grow with the distribution of difficulty rather than linearly with total volume.
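The cost argument can be made concrete with a small worked calculation. The shares below assume the three tiers together cover every document (80%, 15%, and 5%), and the per-document costs are hypothetical units chosen only to show the shape of the comparison; none of these figures are measured values.

```python
DAILY_VOLUME = 1000

# Illustrative split in whole percent; the three shares must sum to 100.
SHARE_PERCENT = {"ocr_text_llm": 80, "vision_llm": 15, "manual_review": 5}

# Hypothetical per-document costs in arbitrary integer units (not measured values).
COST_PER_DOC = {"ocr_text_llm": 1, "vision_llm": 20, "manual_review": 100}


def documents_per_tier(volume: int, share_percent: dict) -> dict:
    """Convert percentage shares into document counts per tier."""
    assert sum(share_percent.values()) == 100, "shares must cover every document"
    return {tier: volume * pct // 100 for tier, pct in share_percent.items()}


def daily_cost(counts: dict, cost_per_doc: dict) -> int:
    """Total daily cost given how many documents each tier handles."""
    return sum(counts[tier] * cost_per_doc[tier] for tier in counts)


tiered_counts = documents_per_tier(DAILY_VOLUME, SHARE_PERCENT)
tiered_cost = daily_cost(tiered_counts, COST_PER_DOC)

# Comparison case: no cheap tier, so everything not manually reviewed goes to the Vision LLM.
vision_only_counts = documents_per_tier(
    DAILY_VOLUME, {"ocr_text_llm": 0, "vision_llm": 95, "manual_review": 5}
)
vision_only_cost = daily_cost(vision_only_counts, COST_PER_DOC)

print("documents per tier:", tiered_counts)
print("tiered cost:", tiered_cost)
print("vision-only cost:", vision_only_cost)
```

Under these assumed numbers the tiered split costs well under half of the vision-only setup, and the gap widens as the share of clean documents grows. The actual ratio depends entirely on a factory's real difficulty distribution and pricing.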

Implications for Taiwan's Design and Printing Industry
The architectural philosophy outlined above has distinct, actionable implications for each role within Taiwan's design and printing industry.
For small and medium-sized printing factories, the most important revelation is not to treat proof-of-delivery receipt recognition as a procurement problem of 'buying a model and it's solved,' but as a process problem of 'building a workflow splitting system.' Specifically, it is recommended to use PaddleOCR combined with an on-premise text LLM as a baseline to automate clear, high-volume, regular documents. This part incurs almost no token cost and keeps data within the company, addressing the concerns of most printing factories regarding the sensitivity of customer orders [1]. Building on this, for difficult documents with intensive handwriting and alterations, selectively integrate cloud Vision LLM, and be sure to set confidence thresholds and human review gateways [1]. This article argues that with this gradual introduction timeline, manufacturers can get the baseline running within weeks to handle 80% of the volume, and then gradually increase the automation rate for difficult cases, rather than aiming for full automation from the start
For designers, the digitalization of receipts and work orders means that specification information (dimensions, paper type, special processing) can be more reliably transferred from paper to digital systems, reducing specification errors caused by manual transcription. This article argues that when the recognition system can stably extract structured fields, the alignment of specifications between the design and production ends will be more immediate, and communication costs for proofing and revisions can be expected to decrease. Furthermore, if designers understand the recognition system's preference for 'clean layouts,' they can adopt fixed fields and print-first layouts when designing work order templates, thereby inversely reducing the difficulty of backend recognition
For brands, the significance of receipt digitalization lies in supply chain visibility and accountability. When every receipt and shipment note is structured and recorded, brands can track the status of orders flowing through the printing supply chain and retrieve credible digital proofs in case of disputes. This article argues that this also echoes the core of AI adoption governance literature: the value of a system is not just in automation efficiency, but in how it reallocates responsibility and trust boundaries between humans and systems [6]. When implementing, brands should pay particular attention to whether the audit trail of the review gateway is complete, to ensure that automation does not come at the expense of accountability
A common point for all roles is the trade-off between information security and on-premise capabilities. Taiwan's printing industry handles a large volume of documents containing personal and commercial sensitive information (such as bill printing, member data, financial report printing), which often makes 'data not leaving the company' an uncompromising constraint. This article argues that this is precisely why the second-generation OCR plus text LLM approach is particularly important in the Taiwanese industrial context: it preserves data sovereignty through on-premise deployment with acceptable recognition capabilities, which current pure cloud Vision LLM solutions struggle to achieve [1]
Conclusion and Limitations
This article, centered on a real-world case study of a Taiwanese printing factory implementing OCR for proof-of-delivery receipts, responds to the three research questions posed in the introduction:
・First, proof-of-delivery receipt recognition has undergone three generations of evolution: OCR plus regex, OCR plus text LLM, and direct judgment by Vision LLM. These three generations are not mutually exclusive but coexist based on scenario and security requirements [1].
・Second, the latest model is not necessarily the most suitable; the determining factors for selection are the prioritized trade-offs among cost, on-premise capability, and recognition accuracy, rather than a single benchmark score [1][2].
・Third, implementation success depends on the synergy of a three-layered architecture ('preprocessing standardization, LLM structured extraction, human review gateway') and the workflow split principle of 'minimize recognition, maximize system, and defer to human for uncertainty' [1].
The core argument of this article is that proof-of-delivery receipt recognition should shift from a model-centric mindset to a system and governance-centric mindset [6].
This research has several limitations that must be honestly disclosed:
・First, the core case study is a firsthand account by a single engineer. Although its context (Taiwanese printing factory receipts) is representative, the benchmark figures (e.g., DocVQA 95.7, OmniDocBench over 96%) are cited from public model claims and have not been independently replicated in the target scenario of this article; caution should be exercised when extrapolating [1].
・Second, the invoice OCR literature cited in this article focuses on Japanese mobile receipts, which differ from Traditional Chinese printing factory receipts in language and layout, and the portability of its conclusions requires further verification [2][4].
・Third, the aforementioned '1000 receipts workflow split' scenario is an estimation based on the principles in this article, and the proportions are illustrative; actual distributions vary by factory and have not been empirically measured.
Three directions for future research follow:
・First, constructing a labeled dataset of Traditional Chinese printing industry receipts to replace extrapolation with localized benchmarks, which can be cross-referenced with the methodology of the Japanese receipt dataset study [2].
・Second, quantitatively evaluating the cost-effectiveness of the three-layered architecture in a real production environment, especially the optimal threshold settings for the human review gateway.
・Third, concretizing the AI adoption governance framework into actionable audit and responsibility division guidelines for the printing industry, bridging the gap between technological implementation and organizational governance [6][5].
Key Takeaways
・The three generations of recognition technology for proof-of-delivery receipts (OCR+Regex, OCR+Text LLM, Vision LLM) are not mutually exclusive but coexist, depending on scenario and security requirements.
・Selection is determined by the prioritized trade-offs among cost, on-premise capability, and accuracy, not by a single benchmark score; the latest model is not necessarily the most suitable.
・Implementation success depends on the synergy of a three-layered architecture ('preprocessing standardization, structured extraction, human review gateway'), not on the strength of a single model.
・'Minimize recognition, maximize system, defer to human for uncertainty' is the core philosophy for transforming the model's structural uncertainty into manageable processes.
・For sensitive document scenarios in Taiwan, the on-premise OCR+Text LLM approach is particularly important because it preserves data sovereignty while keeping recognition capability acceptable, a balance that pure cloud Vision LLM solutions currently struggle to strike.
Further Thoughts
For printing manufacturing, the real leverage of receipt OCR lies not in the model but in system design: handling the roughly 80% of regular documents with a low-cost on-premise process, and reserving cloud Vision LLM and human review for long-tail difficult cases, lets marginal costs grow with difficulty rather than with total volume. For the design end, this means work order templates should be designed with fixed fields and printed text prioritized, which in turn reduces recognition difficulty. For AI adoption and SaaS providers, the opportunity lies in packaging the 'three-layered architecture plus workflow splitting engine plus audit trail' into a product directly usable by the printing industry, rather than just selling model APIs. Unresolved questions include the lack of localized benchmarks for Traditional Chinese printing receipts, the lack of empirical evidence for optimal human review gateway settings, and how to balance automation and accountability at the governance level.
References
[2] Nathan S. (2025). Japanese-Mobile-Receipt-OCR-1.3K: A Comprehensive Dataset Analysis and Fine-tuned Vision-Language Model for Structured Receipt Data Extraction. DOI: 10.36227/techrxiv.175616889.90325672/v1
[3] Rodriguez J. (2025). myownrobs: AI Coding Agent for 'RStudio'. CRAN: Contributed Packages. DOI: 10.32614/cran.package.myownrobs
[4] Nathan S. (2025). Japanese-Mobile-Receipt-OCR-1.3K: A Comprehensive Dataset Analysis and Fine-tuned Vision-Language Model for Structured Receipt Data Extraction. DOI: 10.21203/rs.3.rs-7357197/v1
[5] Wienholt N. (2025). Using an AI Coding Agent. GitHub Copilot and AI Coding Tools in Practice. DOI: 10.1007/979-8-8688-1784-7_2
[6] Waardenburg L., Huysman M., Agterberg M. (2021). Introduction to managing AI wisely. Managing AI Wisely. DOI: 10.4337/9781800887671.00010
FAQ
- Is the latest Vision LLM mandatory for proof-of-delivery receipt OCR?
- Not necessarily. While Vision LLMs can interpret handwriting and alterations, they are slow, costly, and powerful models are mostly cloud-based, making full on-premise deployment difficult. For sensitive documents that cannot leave the company, on-premise OCR plus text LLM is more suitable. A common approach is to combine the two, routing by difficulty
- Why can't proof-of-delivery receipt recognition achieve 100% accuracy?
- Because damp, distorted, or poorly photographed images might not have captured the information at all; no model can conjure something from nothing. The correct design is to absorb this uncertainty with confidence thresholds and human review gateways, rather than expecting the model to achieve perfection on its own
- What is the three-layered architecture for proof-of-delivery receipt OCR?
- It refers to preprocessing standardization (deskewing, enhancing, filtering poor images), LLM structured extraction (mapping content to a clear schema), and a human review gateway (routing low-confidence or logically contradictory documents to humans). The synergy of these three layers is key to implementation, not just a single model
- Where should small and medium-sized Taiwanese printing factories begin with proof-of-delivery receipt OCR implementation?
- It is recommended to start with PaddleOCR plus an on-premise text LLM as a baseline to automate clear, high-volume, regular documents. This incurs almost no token cost and keeps data within the company. Then, gradually integrate Vision LLM for difficult handwritten or altered documents, ensuring a human review is in place
- Why is on-premise deployment important for the printing industry?
- Because the printing industry handles a large volume of documents containing personal and commercial sensitive information, keeping data within the company is often an uncompromising constraint. This makes mature on-premise solutions like OCR plus text LLM particularly valuable in the Taiwanese industrial context, as pure cloud Vision LLM currently struggles to ensure data sovereignty
