CiteSeerX DataSet


Steps for downloading the full dataset from CiteSeerX:

Gregor has kindly provided a CiteSeerX fetcher that works much better than the old approach.  Thanks :)


P.S.  I am in the process of setting up a new web server at the lab, so I will try to make a whole dump available one of these days.  If you need it sooner, email me.



Many thanks to this blog post

I have added slightly more detail and made some minor changes (that seemed to work better in my case).

  1. Download and extract the "Demo" from
  2. Download Xerces and place xerces.jar in the same directory as Demo
  3. Go to the Demo directory and type the following command (all on one line) to download the full CiteSeerX dataset to the file "citeseerx_alldata.xml":
    java -classpath .:oaiharvester.jar:xerces.jar org.acme.oai.OAIReaderRawDump -o citeseerx_alldata.xml

    Note: you will most likely not see any output right away, so you may want to check the file citeseerx_alldata.xml to make sure that data is being added to it.  The data is about 520 MB in size, so the download may take a while.
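
One way to check on the download from a second terminal is sketched below. This is only a rough sketch: the helper name dump_progress is made up for illustration and is not part of the harvester, and it assumes that each harvested entry appears in the dump as an OAI-PMH style `<record>` element, which I have not verified for this particular output format.

```shell
# Hypothetical helper (not part of the harvester): report how far the dump has got.
# Assumption: each harvested entry shows up as a <record> element in the output file.
dump_progress() {
  file="${1:-citeseerx_alldata.xml}"
  if [ ! -f "$file" ]; then
    echo "$file does not exist yet"
    return 1
  fi
  # Current size in bytes; tr strips the padding some wc implementations add.
  bytes=$(wc -c < "$file" | tr -d ' ')
  # Count opening <record> tags, even when several of them share one line.
  records=$(grep -o '<record[ >]' "$file" | wc -l | tr -d ' ')
  echo "$file: $bytes bytes, $records records so far"
}
```

Running `dump_progress citeseerx_alldata.xml` every few minutes should show both numbers growing while the download is still alive. If the byte count grows but the record count stays at zero, the assumption about `<record>` elements probably does not hold for your dump.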

For more information, you may also want to look at the original website for the OAI harvester here

