Oops, Design Flaw!

I ran into a couple of problems. Remember the code we used to loop through the PDFs in a folder? Here it is.

DirectoryInfo ds = new DirectoryInfo(toolFolderName.Text);
foreach (FileInfo fi in ds.EnumerateFiles("*.pdf"))
{
    //process
}

The first problem is that EnumerateFiles does not guarantee the order of the files it returns. The order usually seemed to be by file name, but I found exceptions, and I can’t tell whether those exceptions are consistent or not.

Then I discovered in my corporate environment at work that the code returned some hidden, non-existent files that popped up in every system directory, like Downloads, Desktop, Pictures, Documents, etc. This only happened in system folders, but then again, people don’t think of Desktop and Documents as system folders. They use them as ordinary folders.

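Finally, it occurred to me that the code might catch files that end in “.pdf.something-else”. I don’t think the search pattern is guaranteed to match all the way to the end of the name. In fact, on .NET Framework a pattern with a three-character extension such as "*.pdf" is documented to also match files whose extension merely begins with those characters, such as “.pdfx”. So, different code has to be used. Here’s what I came up with.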

//using System.Linq;

var dir = new DirectoryInfo(toolFolderName.Text);
IOrderedEnumerable<FileInfo> files = dir.GetFiles()
    .Where(s => s.Extension.ToLower() == ".pdf" &&
        (s.Attributes & FileAttributes.Hidden) !=
            FileAttributes.Hidden)
    .OrderBy(s => s.Name);
foreach (FileInfo file in files)
{
    //process
}

We’re going to venture into LINQ. This lets you use SQL-like queries on non-database data. We start with an instance of DirectoryInfo, but instead of looping through EnumerateFiles we create an array of FileInfo objects with GetFiles. And that’s where LINQ comes into play. As in SQL, first we have a where clause, and then we have an order-by clause.
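Since I keep comparing LINQ to SQL, here is a sketch of the same filter written in C#'s query syntax, which looks even more like a SQL statement. The class name PdfQuery and the method name FindVisiblePdfs are names I made up for illustration, and the folder path is passed in as a parameter rather than read from toolFolderName.Text.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PdfQuery
{
    // Hypothetical helper: returns the non-hidden PDFs in a folder, sorted by name.
    public static List<FileInfo> FindVisiblePdfs(string folder)
    {
        var dir = new DirectoryInfo(folder);

        // Query syntax: the compiler turns this into Where and OrderBy calls.
        IEnumerable<FileInfo> query =
            from file in dir.GetFiles()
            where file.Extension.ToLower() == ".pdf"
                && (file.Attributes & FileAttributes.Hidden) != FileAttributes.Hidden
            orderby file.Name
            select file;

        // ToList runs the query right away and stores the results.
        return query.ToList();
    }
}
```

Both styles should behave the same way, so which one to use is mostly a matter of taste.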

Let’s look at the where clause. What is that “s =>” thing? It’s something called a lambda expression, which is a small inline function. I can’t claim to know all that much about them myself, but s basically represents each FileInfo object in turn. You could call it s or anything else. In fact, I should find a more descriptive name for it. But the point is that it can be any label.
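To take some of the mystery out of the lambda, here is a rough sketch showing that it is just a compact way of writing a small method. The names LambdaDemo, IsLong, WithNamedMethod, and WithLambda are invented for illustration, and it filters plain strings instead of files to keep things simple. Both versions should produce the same result.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LambdaDemo
{
    // An ordinary named method that tests one item.
    public static bool IsLong(string word)
    {
        return word.Length > 3;
    }

    public static List<string> WithNamedMethod(IEnumerable<string> words)
    {
        // Where calls IsLong once for each item in the sequence.
        return words.Where(IsLong).ToList();
    }

    public static List<string> WithLambda(IEnumerable<string> words)
    {
        // The same test written inline. "word" plays the role that "s"
        // plays in the PDF code; the label is arbitrary.
        return words.Where(word => word.Length > 3).ToList();
    }
}
```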

Then we see that the file extension must be “.pdf”. Note how we convert to lower case to catch files that end in “.PDF”. We also want only files that are not hidden. To accomplish this, we bit-and the attributes with the hidden flag and check that the result is not the hidden flag, which means the hidden attribute is not set. Note the single ampersand for bit operations rather than the usual double ampersand for logical and. Finally, we order by file name, and only then do we start to process each FileInfo object in the sorted results.
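The two conditions can be tried out on their own, without touching the file system. The sketch below packages them into a hypothetical helper named IsVisiblePdf (my own name, not part of .NET) that takes a file name and a set of attributes directly.

```csharp
using System.IO;

public static class PdfFilter
{
    // Hypothetical helper mirroring the where clause: true for a non-hidden
    // file whose extension is ".pdf" in any letter case.
    public static bool IsVisiblePdf(string fileName, FileAttributes attributes)
    {
        // GetExtension returns only the last extension, including the dot.
        bool isPdf = Path.GetExtension(fileName).ToLower() == ".pdf";

        // FileAttributes is a set of bit flags. AND-ing with Hidden keeps only
        // the Hidden bit, so the result equals Hidden exactly when it is set.
        bool isHidden = (attributes & FileAttributes.Hidden) == FileAttributes.Hidden;

        return isPdf && !isHidden;
    }
}
```

With this helper, "Report.PDF" with only the Archive attribute passes, the same file with Hidden and Archive both set does not, and "report.pdf.bak" does not either, because its extension is ".bak".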

You might wonder how I knew about IOrderedEnumerable. I didn’t. I first started the code with just var, like this.

//using System.Linq;

var dir = new DirectoryInfo(toolFolderName.Text);
var files = dir.GetFiles()
    .Where(s => s.Extension.ToLower() == ".pdf" &&
        (s.Attributes & FileAttributes.Hidden) !=
            FileAttributes.Hidden)
    .OrderBy(s => s.Name);
foreach (FileInfo file in files)
{
    //process
}

I hovered the mouse over files to discover its actual type, and then I edited the code to reflect that.

I also made some changes to the code that calculates the number of pages. It now looks like this.

private void toolCalcPages_Click(object sender, EventArgs e)
{
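    int counter = 0;
    foreach (DataRow dr in dataSet1.Files.Rows)
    {
        // Open the PDF just long enough to read its page count.
        var pdf = new PdfDocument(new PdfReader(dr.Field<string>(dataSet1.Files.FullNameColumn)));
        dr.SetField(dataSet1.Files.PageCountColumn, pdf.GetNumberOfPages());
        pdf.Close(); // release the file before moving on to the next one
        // Let the window process pending messages every ten files.
        if (counter++ % 10 == 0) Application.DoEvents();
    }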
}

What’s different is that we’re adding a counter, and every ten files processed, we call Application.DoEvents() in order to keep the interface responsive. This way, the window won’t lock up when processing large numbers of files. However, DoEvents() has drawbacks, such as letting the user click the same button again while the loop is still running, that will probably lead us to replace it with something better in the future.
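For the curious, one common alternative to DoEvents() is to move the slow work onto a background thread with Task.Run and report progress through IProgress. The sketch below is only an illustration of that idea, not the code this tool uses. PageCounter, CountAllAsync, and the countPages parameter are hypothetical names, and the actual page counting is passed in as a delegate so the example does not depend on the PDF library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PageCounter
{
    // Hypothetical helper: runs countPages for every path on a background
    // thread and reports how many files are done after each one.
    public static Task<Dictionary<string, int>> CountAllAsync(
        IEnumerable<string> paths,
        Func<string, int> countPages,
        IProgress<int> progress)
    {
        return Task.Run(() =>
        {
            var results = new Dictionary<string, int>();
            int done = 0;
            foreach (string path in paths)
            {
                results[path] = countPages(path);
                // progress may be null if the caller does not care.
                progress?.Report(++done);
            }
            return results;
        });
    }
}
```

In a click handler marked async, the call could be awaited, with a lambda that wraps the PdfDocument code passed as countPages. The DataRow updates would then happen back on the UI thread after the await, and the window stays responsive without any call to DoEvents().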

