Open PDF and Get the Text

Text to Speech: For C# and VB NET Students

 

In the previous lesson, you opened a text file and had the Speech Synthesizer read it out. In this lesson, you'll open up a PDF file, get the text, and have it read out for you. We need the NuGet Package Manager for this lesson.

Manipulating a PDF file with code is quite advanced stuff. That's why there are lots of third-party plugins you can use. Instead of writing the code ourselves, we'll use a plugin called GemBox. It's free to use, as long as your PDF has no more than 2 pages. Longer than that, and you need a license. 2 pages is plenty for our purpose, though. You can get the plugin using the NuGet Package Manager in Visual Studio. Let's see how it works.

Click on the Tools menu at the top of Visual Studio. From the Tools menu, click on NuGet Package Manager. From the Sub menu, select Manage NuGet Packages for Solution:

You'll see a new tab appear in your main window.

Click on Browse at the top. In the Search box, type GemBox:

Select the GemBox PDF item. You can scroll down on the area on the right to see the license details. Make sure you check the box on the right for Project. The Install button will then become available. Click Install, agree to the license details, and you're good to go.

Now have a look at the Solution Explorer in Visual Studio. Expand the References section and you should see one for GemBox.pdf:

We can now write the code to open up a PDF and grab the text.

Create a new private void method in C#. Call it GetPdfFile. Pass in a file path and your method will look like this:

private void GetPdfFile(string filePath)
{
}

In VB Net, create a Private Sub. Again, pass in a file path. Here's what your Sub should look like:

Private Sub GetPdfFile(filePath As String)
End Sub

Before you can use GemBox, you need to add a using or Imports statement at the top of your code. Add this in C#:

using GemBox.Pdf;

And this in VB Net:

Imports GemBox.Pdf

Back to your GetPdfFile Sub/method. Clear the text box as the first line with this (delete the semicolon on the end in VB):

txtSpeechText.Text = "";

So that you can use GemBox, you need to set a license. Add this line (again, delete the semicolon in VB):

ComponentInfo.SetLicense("FREE-LIMITED-KEY");

If you actually had a license, you would enter the key between the round brackets of SetLicense. For the free version, you can just enter "FREE-LIMITED-KEY".

There is a class in GemBox called PdfDocument. You use this to load a PDF. Add this line in C#:

PdfDocument doc = PdfDocument.Load(filePath);

And this one in VB Net:

Dim doc As PdfDocument = PdfDocument.Load(filePath)

The line sets up a variable called doc. The doc variable has a Pages collection we can loop through. We can get the contents of each page.

In the free version, though, you need to check if the 2-page limit has been reached. Otherwise, your program will crash.

Add this code in C#:

int numOfPages = doc.Pages.Count;

if (numOfPages <= 2)
{
}
else
{
    MessageBox.Show("Too many pages");
}

And this in VB Net:

Dim numOfPages As Integer = doc.Pages.Count

If numOfPages <= 2 Then
Else
    MessageBox.Show("Too many pages")
End If

We're just setting up an integer variable called numOfPages. The pages count from the PDF is stored inside of this variable.
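If you'd rather not have the number 2 typed straight into your if statement, here is one possible way to wrap the check up. This is only a sketch: FreePageLimit, MaxPages and IsWithin are names made up for this example and are not part of GemBox. The value 2 is just the free-version limit mentioned in this lesson.

```csharp
// Hypothetical helper (not part of GemBox): keeps the page limit in one place.
public static class FreePageLimit
{
    // The free-version limit described in this lesson.
    public const int MaxPages = 2;

    // Returns true when a document with this many pages is within the limit.
    public static bool IsWithin(int pageCount)
    {
        return pageCount >= 0 && pageCount <= MaxPages;
    }
}
```

With this in your project, the if statement could read if (FreePageLimit.IsWithin(numOfPages)) instead of if (numOfPages <= 2).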

Inside of your if statement, you can add a for each loop. Here's the loop to add in C#:

foreach (var page in doc.Pages)
{
    txtSpeechText.Text += page.Content.ToString();
}

And here it is in VB Net:

For Each page In doc.Pages
    txtSpeechText.Text += page.Content.ToString()
Next

So we're just looping round each page in the document and adding its contents to the text box.
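The loop above joins the pages together with nothing in between, so the last word of page one can run straight into the first word of page two. If that bothers you, one option is to collect the text of each page first and join it with a line break. The sketch below assumes a helper class called PageText, which is a made-up name for this example and not something GemBox provides.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of GemBox): joins the text of several pages.
public static class PageText
{
    // Puts a line break between each page's text and returns one string.
    public static string Join(IEnumerable<string> pageTexts)
    {
        return string.Join(Environment.NewLine, pageTexts);
    }
}
```

Inside GetPdfFile you could then build a List<string>, add page.Content.ToString() to it in the for each loop, and set txtSpeechText.Text = PageText.Join(yourList) once the loop has finished.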

Finally, we can close the doc. Add this line just outside of the for each loop (delete the semicolon in VB Net):

doc.Close();
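As an alternative to calling Close yourself, C# has a using block that closes a disposable object for you, even if something goes wrong part-way through. PdfDocument can be used this way. The sketch below is not the code from this lesson: PdfTextReader and ReadAllText are hypothetical names, and returning null for a document that is too long is just one choice among several.

```csharp
using System.Text;
using GemBox.Pdf;

// Hypothetical helper: reads the text of a PDF with no more than 2 pages.
public static class PdfTextReader
{
    // Returns the text of every page, or null if the PDF has more than 2 pages.
    public static string ReadAllText(string filePath)
    {
        // The free key from this lesson. Set it before loading a document.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var text = new StringBuilder();

        // The using block closes the document when the block ends,
        // including when an exception is thrown inside it.
        using (PdfDocument doc = PdfDocument.Load(filePath))
        {
            if (doc.Pages.Count > 2)
            {
                return null; // too many pages for the free version
            }

            foreach (var page in doc.Pages)
            {
                text.Append(page.Content.ToString());
            }
        }

        return text.ToString();
    }
}
```

A caller could then check for null and show the "Too many pages" message box, or place the returned string in txtSpeechText.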

And that's it for the GetPdfFile Sub/method. Go back to the if statement in your Open File button. Delete the MessageBox.Show("PDF") line and replace it with a call to your new Sub/method (again, no semicolon on the end in VB):

GetPdfFile(fileName);

Your Open File button and GetPdfFile method should look like this in C#:

C# code to open up a PDF file and have the text read out by the Speech Synthesizer
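For reference, here is the GetPdfFile method put together from the steps in this lesson. It's a sketch rather than a copy of the original listing. It assumes your form class is called Form1, that it sits outside any namespace (otherwise wrap the class in the same namespace as your form), and that the text box is called txtSpeechText. Close is placed after the for each loop but inside the if statement, which is one reading of "just outside of the for each loop". The Open File button code from the previous lesson is not repeated here.

```csharp
using System.Windows.Forms;
using GemBox.Pdf;

// Assumption: your form is a partial class called Form1
// with a text box named txtSpeechText on it.
public partial class Form1 : Form
{
    private void GetPdfFile(string filePath)
    {
        // Clear any text left over from a previous file.
        txtSpeechText.Text = "";

        // Use the free, 2-page version of GemBox.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        PdfDocument doc = PdfDocument.Load(filePath);

        int numOfPages = doc.Pages.Count;

        if (numOfPages <= 2)
        {
            // Add the text from each page to the text box.
            foreach (var page in doc.Pages)
            {
                txtSpeechText.Text += page.Content.ToString();
            }

            doc.Close();
        }
        else
        {
            MessageBox.Show("Too many pages");
        }
    }
}
```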

And here's the VB Net code:

VB Net code to open up a PDF file and have the text read out by the Speech Synthesizer

Give it a try. Open a PDF on your computer. (It's probably easier if you create a Word document of 1 or 2 pages then save it as a PDF.) If it has more than 2 pages, you'll get the message box. If it has 2 pages or fewer, the text from the PDF will appear in the text box, ready for a voice to read it out.

In the next lesson, we'll open a Word file. If you haven't got Microsoft Word on your computer, you can skip that lesson.
