The Basics - PyMuPDF 1.24.10 documentation
PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Opening a File
To open a file, do the following:
import pymupdf
doc = pymupdf.open("a.pdf")  # open a document
Merging PDF files
To merge PDF files, do the following:
import pymupdf
doc_a = pymupdf.open("a.pdf")  # open the 1st document
doc_b = pymupdf.open("b.pdf")  # open the 2nd document
doc_a.insert_pdf(doc_b)  # merge the docs
doc_a.save("a+b.pdf")  # save the merged document with a new filename
Merging PDF files with other types of file
Use Document.insert_file() to merge other supported file types into a PDF. For example:
import pymupdf
doc_a = pymupdf.open("a.pdf")  # open the 1st document
doc_b = pymupdf.open("b.svg")  # open the 2nd document
doc_a.insert_file(doc_b)  # merge the docs
doc_a.save("a+b.pdf")  # save the merged document with a new filename
Note
Taking it further
It is easy to join PDFs with Document.insert_pdf() and Document.insert_file(). Given open PDF documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, reverse the page sequence, and change page rotation. This Wiki article contains a full description.
The GUI script join.py uses this method to join a list of files while also joining the respective table of contents segments.
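The page-range options mentioned above can be sketched as a small helper. This is a minimal illustration, not the join.py script: the helper name copy_page_range is hypothetical, and it assumes both arguments are open PyMuPDF Document objects. The keyword arguments from_page, to_page, start_at and rotate belong to Document.insert_pdf().

```python
def copy_page_range(target, source, first, last, start_at=-1, rotate=-1):
    """Copy pages first..last (zero-based, inclusive) of source into target.

    target and source are expected to be open PyMuPDF Document objects.
    start_at=-1 appends the pages at the end of target.
    rotate=-1 leaves the rotation of the copied pages unchanged.
    Passing first > last copies the range in reverse order.
    """
    target.insert_pdf(
        source,
        from_page=first,
        to_page=last,
        start_at=start_at,
        rotate=rotate,
    )
```

For example, copy_page_range(doc_a, doc_b, 4, 2, start_at=0, rotate=90) would place pages 5, 4 and 3 of doc_b at the front of doc_a, each rotated by 90 degrees.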
API reference
Working with Coordinates
There is one mathematical term that you should feel comfortable with when using PyMuPDF: "coordinates". Please have a quick look at the Coordinates section to understand the coordinate system; it will help you position objects and understand your document space.
Adding a watermark to a PDF
To add a watermark to a PDF file, do the following:
import pymupdf
doc = pymupdf.open("document.pdf")  # open a document
for page_index in range(len(doc)):  # iterate over pdf pages
    page = doc[page_index]  # get the page
    # insert an image watermark from a file name to fit the page bounds
    page.insert_image(page.bound(), filename="watermark.png", overlay=False)
doc.save("watermarked-document.pdf")  # save the document with a new filename
Note
Taking it further
Adding watermarks is essentially as simple as adding an image at the base of each PDF page. You should ensure that the image has the required opacity and aspect ratio to make it look the way you need it to.
In the example above a new image is created from each file reference. To save memory and file size, the image data should be referenced only once; see the code example and explanation on Page.insert_image() for the implementation.
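One way to reference the image only once is to feed the xref returned by Page.insert_image() back into the next call. The following is a minimal sketch under that assumption; the helper name watermark_all_pages is hypothetical, and doc is assumed to be an open PyMuPDF Document.

```python
def watermark_all_pages(doc, image_path):
    """Insert the same image on every page while storing its data only once.

    Page.insert_image() returns the xref of the stored image. Passing that
    xref to later calls reuses the stored image instead of reading the file
    again.
    """
    xref = 0  # 0 means: no image has been stored yet
    for page in doc:
        xref = page.insert_image(
            page.bound(),         # fit the page bounds
            filename=image_path,  # only needed for the first insertion
            overlay=False,        # put the image behind the page content
            xref=xref,            # reuse the image after the first page
        )
    return xref
```

After calling watermark_all_pages(doc, "watermark.png"), save the document under a new filename as in the example above.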
API reference
Adding an image to a PDF
To add an image to a PDF file, for example a logo, do the following:
import pymupdf
doc = pymupdf.open("document.pdf")  # open a document
for page_index in range(len(doc)):  # iterate over pdf pages
    page = doc[page_index]  # get the page
    # insert an image logo from a file name at the top left of the document
    page.insert_image(pymupdf.Rect(0, 0, 50, 50), filename="my-logo.png")
doc.save("logo-document.pdf")  # save the document with a new filename
Note
Taking it further
As with the watermark example, improve performance by referencing the image only once if possible; see the code example and explanation on Page.insert_image().
API reference
Rotating a PDF
To add a rotation to a page, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open document
page = doc[0]  # get the 1st page of the document
page.set_rotation(90)  # rotate the page
doc.save("rotated-page-1.pdf")
Cropping a PDF
To crop a page to a defined Rect, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open document
page = doc[0]  # get the 1st page of the document
page.set_cropbox(pymupdf.Rect(100, 100, 400, 400))  # set a cropbox for the page
doc.save("cropped-page-1.pdf")
Attaching Files
To attach another file to a page, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open main document
attachment = pymupdf.open("my-attachment.pdf")  # open document you want to attach
page = doc[0]  # get the 1st page of the document
point = pymupdf.Point(100, 100)  # create the point where you want to add the attachment
attachment_data = attachment.tobytes()  # get the document byte data as a buffer
# add the file annotation with the point, data and the file name
file_annotation = page.add_file_annot(point, attachment_data, "attachment.pdf")
doc.save("document-with-attachment.pdf")  # save the document
Note
Taking it further
When adding the file with Page.add_file_annot(), note that the third parameter, filename, should include the actual file extension. Without it, the attachment may not be recognized as something that can be opened. For example, if the filename is just "attachment", viewing the resulting PDF and attempting to open the attachment may well produce an error. With "attachment.pdf", however, PDF viewers can recognize and open it as a valid file type.
The default icon for the attachment is a "push pin"; you can change this by setting the icon parameter.
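Both points can be combined in a small wrapper. This is a sketch only: attach_file is a hypothetical helper, page is assumed to be a PyMuPDF Page, and "Paperclip" is used as one of the icon names accepted by Page.add_file_annot().

```python
import os


def attach_file(page, point, data, filename, icon="Paperclip"):
    """Add a file attachment annotation, insisting on a file extension.

    point and data are passed through unchanged to Page.add_file_annot().
    A filename without an extension is rejected up front, because viewers
    may not be able to open such an attachment.
    """
    if not os.path.splitext(filename)[1]:
        raise ValueError(f"filename {filename!r} has no extension")
    return page.add_file_annot(point, data, filename, icon=icon)
```

Calling attach_file(page, point, attachment_data, "attachment") raises ValueError, while "attachment.pdf" is passed on to the page.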
API reference
Embedding Files
To embed a file in a document, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open main document
embedded_doc = pymupdf.open("my-embed.pdf")  # open document you want to embed
embedded_data = embedded_doc.tobytes()  # get the document byte data as a buffer
# embed with the file name and the data
doc.embfile_add("my-embedded_file.pdf", embedded_data)
doc.save("document-with-embed.pdf")  # save the document
Deleting Pages
To delete a page from a document, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open a document
doc.delete_page(0)  # delete the 1st page of the document
doc.save("test-deleted-page-one.pdf")  # save the document
To delete multiple pages from a document, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open a document
doc.delete_pages(from_page=9, to_page=14)  # delete a page range from the document
doc.save("test-deleted-pages.pdf")  # save the document
What happens if I delete a page referred to by bookmarks or hyperlinks?
- A bookmark (entry in the Table of Contents) will become inactive and will no longer navigate to any page.
- A hyperlink will be removed from the page that contains it. The visible content on that page will not otherwise be changed in any way.
Note
Taking it further
The page index is zero-based, so to delete page 10 of a document you would call doc.delete_page(9).
Similarly, doc.delete_pages(from_page=9, to_page=14) will delete pages 10 - 15 inclusive.
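The conversion from the page numbers a reader sees to the zero-based indices can be captured in a tiny helper. The function name delete_range_args is hypothetical; it only prepares the keyword arguments used in the example above.

```python
def delete_range_args(first_page, last_page):
    """Translate 1-based, inclusive page numbers into zero-based keyword
    arguments suitable for doc.delete_pages(**kwargs)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return {"from_page": first_page - 1, "to_page": last_page - 1}
```

With this, doc.delete_pages(**delete_range_args(10, 15)) is equivalent to doc.delete_pages(from_page=9, to_page=14).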
API reference
Re-Arranging Pages
To change the sequence of pages, i.e. re-arrange pages, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open a document
doc.move_page(1, 0)  # move the 2nd page of the document to the start of the document
doc.save("test-page-moved.pdf")  # save the document
Copying Pages
To copy pages, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open a document
doc.copy_page(0)  # copy the 1st page and put it at the end of the document
doc.save("test-page-copied.pdf")  # save the document
Selecting Pages
To select pages, do the following:
import pymupdf
doc = pymupdf.open("test.pdf")  # open a document
doc.select([0, 1])  # select the 1st & 2nd page of the document
doc.save("just-page-one-and-two.pdf")  # save the document
Note
Taking it further
With PyMuPDF you have all options to copy, move, delete or re-arrange the pages of a PDF. Intuitive methods exist that allow you to do this on a page-by-page level, like the Document.copy_page() method.
Alternatively, you can prepare a complete new page layout in the form of a Python sequence that contains the page numbers you want, in the sequence you want, and as many times as you want each page. The following illustrates what can be done with Document.select():
doc.select([1, 1, 1, 5, 4, 9, 9, 9, 0, 2, 2, 2])
Now let's prepare a PDF for double-sided printing (on a printer not directly supporting this):
The number of pages is given by len(doc) (equal to doc.page_count). The following lists represent the even and the odd page numbers, respectively:
p_even = [p for p in range(doc.page_count) if p % 2 == 0]
p_odd = [p for p in range(doc.page_count) if p % 2 == 1]
This snippet creates the respective sub documents which can then be used to print the document:
doc.select(p_even)  # only the even pages left over
doc.save("even.pdf")  # save the "even" PDF
doc.close()  # recycle the file
doc = pymupdf.open(doc.name)  # re-open
doc.select(p_odd)  # and do the same with the odd pages
doc.save("odd.pdf")
For more information also have a look at this Wiki article.
The following example will reverse the order of all pages (extremely fast: sub-second time for the 756 pages of the Adobe PDF References):
lastPage = doc.page_count - 1
for i in range(lastPage):
    doc.move_page(lastPage, i)  # move current last page to the front
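To see why this loop reverses the page order, it can help to replay it on a plain Python list. The sketch below is a stand-in, not PyMuPDF code: the list-based move_page only mimics "put page pno in front of page to" for the case pno > to, which is the only case the loop uses.

```python
def move_page(pages, pno, to):
    """List-based stand-in for Document.move_page().

    Moves the entry at index pno in front of the entry at index to.
    Only valid for pno > to, which always holds in the reversal loop.
    """
    pages.insert(to, pages.pop(pno))


def reversed_order(page_count):
    """Replay the reversal loop on a list of page numbers."""
    pages = list(range(page_count))
    last_page = page_count - 1
    for i in range(last_page):
        move_page(pages, last_page, i)  # current last page goes to slot i
    return pages


print(reversed_order(5))  # prints [4, 3, 2, 1, 0]
```

Each pass takes whatever page is currently last and places it at position i, so the original last page ends up first, the original second-to-last page second, and so on.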
This snippet duplicates the PDF with itself so that it will contain the pages 0, 1, …, n, 0, 1, …, n (extremely fast and without noticeably increasing the file size!):
page_count = len(doc)
for i in range(page_count):
    doc.copy_page(i)  # copy this page to after last page
API reference
Adding Blank Pages
To add a blank page, do the following:
import pymupdf
doc = pymupdf.open(...)  # some new or existing PDF document
page = doc.new_page(-1,  # insertion point: end of document
                    width = 595,  # page dimension: A4 portrait
                    height = 842)
doc.save("doc-with-new-blank-page.pdf")  # save the document
Note
Taking it further
Use this to create the page with another pre-defined paper format:
w, h = pymupdf.paper_size("letter-l")  # 'Letter' landscape
page = doc.new_page(width = w, height = h)
The convenience function paper_size() knows over 40 industry standard paper formats to choose from. To see them, inspect the dictionary paperSizes. Pass the desired dictionary key to paper_size() to retrieve the paper dimensions. Upper and lower case is supported. If you append "-L" to the format name, the landscape version is returned.
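The naming convention can be illustrated with a small stand-alone lookup. This is not the library's implementation: SIZES is a hypothetical two-entry table (A4 and Letter in points) and lookup_paper_size is an illustrative name; real code should call pymupdf.paper_size().

```python
# Hypothetical two-entry subset (width, height in points, portrait);
# the real table is the paperSizes dictionary.
SIZES = {"a4": (595, 842), "letter": (612, 792)}


def lookup_paper_size(name):
    """Case-insensitive lookup; a trailing "-l" selects landscape."""
    key = name.lower()
    landscape = key.endswith("-l")
    if landscape:
        key = key[:-2]  # strip the "-l" suffix
    width, height = SIZES[key]
    # landscape simply swaps width and height
    return (height, width) if landscape else (width, height)


print(lookup_paper_size("letter-l"))  # prints (792, 612)
print(lookup_paper_size("A4"))        # prints (595, 842)
```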
Here is a 3-liner that creates a PDF with one empty page. Its file size is 460 bytes:
doc = pymupdf.open()
doc.new_page()
doc.save("A4.pdf")
API reference
- paperSizes
Inserting Pages with Text Content
The Document.insert_page() method also inserts a new page and accepts the same width and height parameters. In addition, it lets you insert arbitrary text into the new page and returns the number of inserted lines.
import pymupdf
doc = pymupdf.open(...)  # some new or existing PDF document
n = doc.insert_page(-1,  # default insertion point
                    text = "The quick brown fox jumped over the lazy dog",
                    fontsize = 11,
                    width = 595,
                    height = 842,
                    fontname = "Helvetica",  # default font
                    fontfile = None,  # any font file name
                    color = (0, 0, 0))  # text color (RGB)
Note
Taking it further
The text parameter can be a (sequence of) string (assuming UTF-8 encoding). Insertion will start at Point (50, 72), which is one inch below top of page and 50 points from the left. The number of inserted text lines is returned.
API reference
Splitting Single Pages
This deals with splitting up pages of a PDF into arbitrary pieces. For example, you may have a PDF with Letter format pages which you want to print with a magnification factor of four: each page is split up into 4 pieces, each going to a separate PDF page in Letter format again.
import pymupdf
src = pymupdf.open("test.pdf")
doc = pymupdf.open()  # empty output PDF
for spage in src:  # for each page in input
    r = spage.rect  # input page rectangle
    d = pymupdf.Rect(spage.cropbox_position,  # CropBox displacement if not
                     spage.cropbox_position)  # starting at (0, 0)
    #--------------------------------------------------------------------------
    # example: cut input page into 2 x 2 parts
    #--------------------------------------------------------------------------
    r1 = r / 2  # top left rect
    r2 = r1 + (r1.width, 0, r1.width, 0)  # top right rect
    r3 = r1 + (0, r1.height, 0, r1.height)  # bottom left rect
    r4 = pymupdf.Rect(r1.br, r.br)  # bottom right rect
    rect_list = [r1, r2, r3, r4]  # put them in a list
    for rx in rect_list:  # run thru rect list
        rx += d  # add the CropBox displacement
        page = doc.new_page(-1,  # new output page with rx dimensions
                            width = rx.width,
                            height = rx.height)
        page.show_pdf_page(
            page.rect,  # fill all new page with the image
            src,  # input document
            spage.number,  # input page number
            clip = rx,  # which part to use of input page
        )
# that's it, save output file
doc.save("poster-" + src.name,
         garbage=3,  # eliminate duplicate objects
         deflate=True,  # compress stuff where possible
)
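The 2 x 2 rectangle arithmetic in the script can be checked without a PDF at hand. The sketch below uses plain (x0, y0, x1, y1) tuples instead of Rect objects; quadrants is a hypothetical helper that returns the four parts in the same order as rect_list above.

```python
def quadrants(x0, y0, x1, y1):
    """Split a rectangle into top-left, top-right, bottom-left and
    bottom-right quarters, each given as an (x0, y0, x1, y1) tuple."""
    mx = (x0 + x1) / 2  # vertical cut line
    my = (y0 + y1) / 2  # horizontal cut line
    return [
        (x0, y0, mx, my),  # top left
        (mx, y0, x1, my),  # top right
        (x0, my, mx, y1),  # bottom left
        (mx, my, x1, y1),  # bottom right
    ]


# A Letter page measures 612 x 792 points.
for part in quadrants(0, 0, 612, 792):
    print(part)
```

Each quarter of a Letter page is 306 x 396 points, which is the size of the output pages the script creates before the clipped content is shown on them.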
Combining Single Pages
This deals with joining PDF pages to form a new PDF with pages each combining two or four original ones (also called "2-up", "4-up", etc.). This could be used to create booklets or thumbnail-like overviews.
import pymupdf
src = pymupdf.open("test.pdf")
doc = pymupdf.open()  # empty output PDF
width, height = pymupdf.paper_size("a4")  # A4 portrait output page format
r = pymupdf.Rect(0, 0, width, height)
# define the 4 rectangles per page
r1 = r / 2  # top left rect
r2 = r1 + (r1.width, 0, r1.width, 0)  # top right
r3 = r1 + (0, r1.height, 0, r1.height)  # bottom left
r4 = pymupdf.Rect(r1.br, r.br)  # bottom right
# put them in a list
r_tab = [r1, r2, r3, r4]
# now copy input pages to output
for spage in src:
    if spage.number % 4 == 0:  # create new output page
        page = doc.new_page(-1,
                            width = width,
                            height = height)
    # insert input page into the correct rectangle
    page.show_pdf_page(r_tab[spage.number % 4],  # select output rect
                       src,  # input document
                       spage.number)  # input page number
# by all means, save new file using garbage collection and compression
doc.save("4up.pdf", garbage=3, deflate=True)
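The placement rule in the loop above (a new output page every fourth input page, slot chosen by the remainder) can be written down on its own. The helper names n_up_slot and sheets_needed are hypothetical and only restate that index arithmetic.

```python
def n_up_slot(page_number, per_sheet=4):
    """Return (output page index, slot index) for a zero-based input page."""
    return divmod(page_number, per_sheet)


def sheets_needed(page_count, per_sheet=4):
    """Number of output pages required, rounding up."""
    return -(-page_count // per_sheet)


print(n_up_slot(5))      # prints (1, 1): second output page, top right slot
print(sheets_needed(9))  # prints 3
```

The slot index corresponds to the position in r_tab: 0 is top left, 1 top right, 2 bottom left and 3 bottom right.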
PDF Encryption & Decryption
Starting with version 1.16.0, PDF decryption and encryption (using passwords) are fully supported. You can do the following:
- Check whether a document is password protected / (still) encrypted (Document.needs_pass, Document.is_encrypted).
- Gain access authorization to a document (Document.authenticate()).
- Set encryption details for PDF files using Document.save() or Document.write() and
  - decrypt or encrypt the content
  - set password(s)
  - set the encryption method
  - set permission details
Note
A PDF document may have two different passwords:
- The owner password provides full access rights, including changing passwords, encryption method, or permission detail.
- The user password provides access to document content according to the established permission details. If present, opening the PDF in a viewer will require providing it.
Method Document.authenticate() will automatically establish access rights according to the password used.
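A typical way to combine Document.needs_pass and Document.authenticate() is sketched below. The helper name open_with_password is hypothetical, and doc is assumed to be a PyMuPDF Document that has already been opened.

```python
def open_with_password(doc, password):
    """Return True if the document content is accessible.

    Authentication is attempted only when the document asks for a password.
    Document.authenticate() returns 0 on failure and a positive code on
    success.
    """
    if not doc.needs_pass:
        return True
    return doc.authenticate(password) > 0
```

If the function returns False, the password matched neither the user nor the owner password and the content stays inaccessible.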
The following snippet creates a new PDF and encrypts it with separate user and owner passwords. Permissions are granted to print, copy and annotate, but no changes are allowed to someone authenticating with the user password.
import pymupdf
text = "some secret information"  # keep this data secret
perm = int(
    pymupdf.PDF_PERM_ACCESSIBILITY  # always use this
    | pymupdf.PDF_PERM_PRINT  # permit printing
    | pymupdf.PDF_PERM_COPY  # permit copying
    | pymupdf.PDF_PERM_ANNOTATE  # permit annotations
)
owner_pass = "owner"  # owner password
user_pass = "user"  # user password
encrypt_meth = pymupdf.PDF_ENCRYPT_AES_256  # strongest algorithm
doc = pymupdf.open()  # empty pdf
page = doc.new_page()  # empty page
page.insert_text((50, 72), text)  # insert the data
doc.save(
    "secret.pdf",
    encryption=encrypt_meth,  # set the encryption method
    owner_pw=owner_pass,  # set the owner password
    user_pw=user_pass,  # set the user password
    permissions=perm,  # set permissions
)
Note
Taking it further
Opening this document with some viewer (Nitro Reader 5) reflects these settings.
As before, decryption happens automatically on save when no encryption parameters are provided.
To keep the encryption method of a PDF, save it using encryption=pymupdf.PDF_ENCRYPT_KEEP. If doc.can_save_incrementally() == True, an incremental save is also possible.
To change the encryption method, specify the full range of options above (encryption, owner_pw, user_pw, permissions). An incremental save is not possible in this case.
API reference
Getting Page Links
The links on a Page can be retrieved as Link objects.
<span></span><span>import</span> <span>pymupdf</span>
<span>doc</span> <span>=</span> <span>pymupdf</span><span>.</span><span>open</span><span>(</span><span>"a.pdf"</span><span>)</span> <span># open a document</span>
<span>for</span> <span>page</span> <span>in</span> <span>doc</span><span>:</span> <span># iterate the document pages</span>
    <span>link</span> <span>=</span> <span>page</span><span>.</span><span>first_link</span> <span># a `Link` object or `None`</span>
    <span>while</span> <span>link</span><span>:</span> <span># iterate over the links on page</span>
        <span># do something with the link, then:</span>
        <span>link</span> <span>=</span> <span>link</span><span>.</span><span>next</span> <span># get next link, last one has `None` in its `next`</span>
Getting All Annotations from a Document
Annotations (Annot) on pages can be retrieved with the page.annots() method.
<span></span><span>import</span> <span>pymupdf</span>
<span>doc</span> <span>=</span> <span>pymupdf</span><span>.</span><span>open</span><span>(</span><span>"a.pdf"</span><span>)</span> <span># open a document</span>
<span>for</span> <span>page</span> <span>in</span> <span>doc</span><span>:</span>
    <span>for</span> <span>annot</span> <span>in</span> <span>page</span><span>.</span><span>annots</span><span>():</span>
        <span>print</span><span>(</span><span>f</span><span>'Annotation on page: </span><span>{</span><span>page</span><span>.</span><span>number</span><span>}</span><span> with type: </span><span>{</span><span>annot</span><span>.</span><span>type</span><span>}</span><span> and rect: </span><span>{</span><span>annot</span><span>.</span><span>rect</span><span>}</span><span>'</span><span>)</span>
Redacting content from a PDF
Redactions are a special type of annotation that marks an area of a document page for secure removal. Marking an area with a rectangle flags it for redaction; once the redaction is applied, the content in that area is securely removed.
For example, if we wanted to redact all instances of the name "Jane Doe" from a document, we could do the following:
<span></span><span>import</span> <span>pymupdf</span>
<span># Open the PDF document</span>
<span>doc</span> <span>=</span> <span>pymupdf</span><span>.</span><span>open</span><span>(</span><span>'test.pdf'</span><span>)</span>
<span># Iterate over each page of the document</span>
<span>for</span> <span>page</span> <span>in</span> <span>doc</span><span>:</span>
    <span># Find all instances of "Jane Doe" on the current page</span>
    <span>instances</span> <span>=</span> <span>page</span><span>.</span><span>search_for</span><span>(</span><span>"Jane Doe"</span><span>)</span>
    <span># Redact each instance of "Jane Doe" on the current page</span>
    <span>for</span> <span>inst</span> <span>in</span> <span>instances</span><span>:</span>
        <span>page</span><span>.</span><span>add_redact_annot</span><span>(</span><span>inst</span><span>)</span>
    <span># Apply the redactions to the current page</span>
    <span>page</span><span>.</span><span>apply_redactions</span><span>()</span>
<span># Save the modified document</span>
<span>doc</span><span>.</span><span>save</span><span>(</span><span>'redacted_document.pdf'</span><span>)</span>
<span># Close the document</span>
<span>doc</span><span>.</span><span>close</span><span>()</span>
Another example is redacting an area of a page while leaving any line art (i.e. vector graphics) within the defined area untouched, by setting a parameter flag as follows:
<span></span><span>import</span> <span>pymupdf</span>
<span># Open the PDF document</span>
<span>doc</span> <span>=</span> <span>pymupdf</span><span>.</span><span>open</span><span>(</span><span>'test.pdf'</span><span>)</span>
<span># Get the first page</span>
<span>page</span> <span>=</span> <span>doc</span><span>[</span><span>0</span><span>]</span>
<span># Add an area to redact</span>
<span>rect</span> <span>=</span> <span>[</span><span>0</span><span>,</span><span>0</span><span>,</span><span>200</span><span>,</span><span>200</span><span>]</span>
<span># Add a redaction annotation which will have a red fill color</span>
<span>page</span><span>.</span><span>add_redact_annot</span><span>(</span><span>rect</span><span>,</span> <span>fill</span><span>=</span><span>(</span><span>1</span><span>,</span><span>0</span><span>,</span><span>0</span><span>))</span>
<span># Apply the redactions to the current page, but ignore vector graphics</span>
<span>page</span><span>.</span><span>apply_redactions</span><span>(</span><span>graphics</span><span>=</span><span>0</span><span>)</span>
<span># Save the modified document</span>
<span>doc</span><span>.</span><span>save</span><span>(</span><span>'redacted_document.pdf'</span><span>)</span>
<span># Close the document</span>
<span>doc</span><span>.</span><span>close</span><span>()</span>
Warning
Once a redacted version of a document is saved, the redacted content in the PDF is irretrievable. Thus, a redacted area in a document has its text and graphics removed completely.
Note
Taking it further
There are a few options for creating and applying redactions to a page. For full details of the parameters that control these options, refer to the API reference.
API reference
Converting PDF Documents
We recommend the pdf2docx library which uses PyMuPDF and the python-docx library to provide simple document conversion from PDF to DOCX format.