atashbahar.com

thoughts, ideas and ...

Can we build a document-sharing website like Scribd with .Net?

This is a question I had in my mind for a while. I was very interested to know how document-sharing websites work in general and if it is possible to build one using .Net and other free and open source technologies.

After spending one Internet, readay searching through dig bnlogs and forums I came to the conclusion that it is possible! At least in theory it is possible.

What websites like Scribd do is that they allow their users to upload documents in various formats, then this documents will be shown in a control that is implemented using Adobe Flash. So they convert documents from various formats to Adobe Flash but how they do it and what goes on behind the scene?

I will try to explain different steps of the process and show you with which tool you can do the same with .Net. Just before we start this is not something official about how Scribd or any other websites of this type work, this is only how I think they work and how it is possible to do the same.

Step 1: Convert documents from various formats to PDF

OpenOffice supports various formats (including Microsoft Office formats) and it can export them to PDF format which is what we need.

So we have the tool but how to call it from our code. Luckily there is a great article in CodeProject which solves this problem in an elegant way. This article provides solution alongside the necessary code to easily call OpenOffice to convert documents to PDF format.

Step 2: Index document contents to be able to search through them

Another aspect of document-sharing sites is search. They allow their users to search for documents using keywords. They store hundreds of thousands of documents so having a proficient searching mechanism is very important.

To index documents we need to extract text from the documents. In first step we convert all documents to PDF, so we need to find a way to extract text from PDF files. One way to do this is to use the IFilter interface which was designed by Microsoft for use in its Indexing Service. You can find a great article in CodeProject named Using IFilter in C# which describes how you can use them to extract text from various document formats (which PDF is one of them).

After extracting text we need a tool for index the extracted text and also be able to search through this index in a proficient way. I have been using Lucene.Net which is a port of famous Lucene library for a while and I can say it is amazingly fast and also easy to use. Lucene.Net is exactly what we want here. It has all the necessary features to index the extracted text from documents and search through them very fast.

Step 3: Create thumbnails for the documents

This may look unnecessary but all of document-sharing websites have it and I think it's cool. There are commercial libraries for .Net that allow you to create an image from pages of a PDF document. But we are looking for a free solution here so using commercial components is not an option.

Ghostscript has the ability to raster PDF files. What we need is a .Net wrapper that can call it. There are many implementations that can be easily found by doing a simple search using your favorite search engine. GhostscriptSharp is one of them.

Step4: Convert the PDF files to Adobe Flash

SWFTools is a collection of utilities which one of them is PDF2SWF. PDF2SWF is a command line utility which simply converts your PDF to SWF. I couldn't find a .Net wrapper for this utility but calling a command line program from .Net is very easy.

After the conversion is done we get a SWF file of our PDF that simply is a collection of pages with no navigation. PDF2SWF provides the option to combine a simple navigation UI with the generated SWF but I think it would be better to use a more advanced viewer like SWF Document Viewer and customize it based on your needs.

Conclusion

I think the answer to this article title is yes and we can build such a website. I have not tried to do so but it seems we have all the tools necessary to do so.

I would be glad to hear your thoughts about this and if anyone has tried this approach. I would also like to hear about alternate ways of doing each step in the process.