Java: Download/Extract all images from a website URL


        In this tutorial, let’s see how to get all the images from a website URL. This can be achieved using the HTMLEditorKit and HTMLDocument classes in Java. Let’s walk through the example step by step to get a clear idea of what is done here.
Step 1: Connect to URL and get the Input Stream

                First, create a communication link between the application and the URL using the URLConnection class; its instance can be used to read the HTML content of the URL. The connection object is created by invoking the openConnection() method on a URL.

 String webUrl = "http://www.hdwallpapers.in/";
URL url = new URL(webUrl);
URLConnection connection = url.openConnection();
          
From the connection object created above, get the input stream using getInputStream(). The program then wraps the input stream in a BufferedReader so the HTML content can be read.

InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);

Step 2: Read the HTML content from the Input Stream using the HTMLEditorKit class

                Create an empty HTML document using the HTMLDocument class, then parse the content from the BufferedReader into it. The parser (a ParserDelegator) loads the document with the website's HTML content through an HTMLEditorKit.ParserCallback obtained from the document's reader.

HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
parser.parse(br, callback, true);

Step 3: Iterate the HTML document for IMG tags and download all images

                Iterate over the HTML document, searching for IMG tags and reading the SRC attribute from each tag.

for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.IMG); iterator.isValid(); iterator.next()) {
    AttributeSet attributes = iterator.getAttributes();
    String imgSrc = (String) attributes.getAttribute(HTML.Attribute.SRC);
    // ... check the extension and download the image (see below)
}

Then we need to check whether the SRC attribute ends with one of the image formats. If it does, the image is downloaded with ImageIO.read(URL) and saved with ImageIO.write(Image, imageFormat, FilePath), and the loop continues until the last IMG tag has been parsed.
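
Here is a minimal sketch of that check and save (not the full program). It assumes imgSrc already holds the SRC value from the iterator and imageUrl holds the absolute image URL built in the next step; the target path and file name are placeholders.

// Sketch only: imgSrc is the SRC attribute value, imageUrl the absolute image URL
if (imgSrc != null && (imgSrc.endsWith(".jpg") || imgSrc.endsWith(".jpeg")
        || imgSrc.endsWith(".png") || imgSrc.endsWith(".bmp") || imgSrc.endsWith(".ico"))) {
    BufferedImage image = ImageIO.read(new URL(imageUrl));   // download the image
    if (image != null) {
        String imageFormat = imgSrc.substring(imgSrc.lastIndexOf(".") + 1);
        File target = new File("C:/your/local/folder/downloaded-image." + imageFormat);   // placeholder path
        ImageIO.write(image, imageFormat, target);
    }
}
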
I have written a condition to form the absolute image URL. At times the SRC attribute contains only the relative path of the image; in those cases we need to build the absolute URL from the website URL before downloading, using the code snippet below.

// url here is the page URL passed in as a String; imgSrc is the SRC attribute value
if (!imgSrc.startsWith("http")) {
    url = url + imgSrc;      // relative path: prepend the website URL
} else {
    url = imgSrc;            // already an absolute URL
}
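
The simple concatenation above works here because the website URL ends with a slash and the SRC paths don't climb directories. If you run into more complex relative paths (for example ../images/pic.jpg), java.net.URI can resolve them against the page URL; this is just an optional alternative, not part of the original program:

// Optional: resolve a relative SRC against the page URL with java.net.URI
String absoluteSrc = java.net.URI.create(webUrl).resolve(imgSrc).toString();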

To save each image under the same name it has on the website, trim the file name from the SRC attribute and build the file path from it. Make sure to change imgPath to a folder on your local machine.
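
A minimal sketch of that trimming, assuming imgSrc is the SRC attribute value; the folder in imgPath is only a placeholder:

// e.g. imgSrc = "wallpapers/nature/sunset.jpg"  ->  fileName = "sunset.jpg", imageFormat = "jpg"
String fileName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
String imageFormat = fileName.substring(fileName.lastIndexOf(".") + 1);
String imgPath = "C:/your/local/folder/" + fileName;   // change to a folder on your machine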

Download Files
ExtractAllImages.java

import java.awt.image.BufferedImage;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import javax.imageio.ImageIO;
import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ExtractAllImages {

    public static void main(String args[]) throws Exception {

        // Step 1: open a connection to the website and get its input stream
        String webUrl = "http://www.hdwallpapers.in/";
        URL url = new URL(webUrl);
        URLConnection connection = url.openConnection();
        InputStream is = connection.getInputStream();
        InputStreamReader isr = new InputStreamReader(is);
        BufferedReader br = new BufferedReader(isr);

        // Step 2: parse the HTML content into an HTMLDocument
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
        parser.parse(br, callback, true);

        // Step 3: iterate over all IMG tags and download the images
        for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.IMG); iterator.isValid(); iterator.next()) {
            AttributeSet attributes = iterator.getAttributes();
            String imgSrc = (String) attributes.getAttribute(HTML.Attribute.SRC);

            if (imgSrc != null && (imgSrc.endsWith(".jpg") || imgSrc.endsWith(".jpeg")
                    || imgSrc.endsWith(".png") || imgSrc.endsWith(".bmp") || imgSrc.endsWith(".ico"))) {
                try {
                    downloadImage(webUrl, imgSrc);
                } catch (IOException ex) {
                    System.out.println(ex.getMessage());
                }
            }
        }
    }

    private static void downloadImage(String url, String imgSrc) throws IOException {
        BufferedImage image = null;
        try {
            // Build the absolute image URL when SRC holds only a relative path
            if (!imgSrc.startsWith("http")) {
                url = url + imgSrc;
            } else {
                url = imgSrc;
            }
            // Keep the file name used on the website and derive the image format from its extension
            imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
            String imageFormat = imgSrc.substring(imgSrc.lastIndexOf(".") + 1);
            // Change imgPath to a folder on your local machine
            String imgPath = "C:/Users/Machine2/Desktop/CTE/Java-WebsiteRead/" + imgSrc;
            URL imageUrl = new URL(url);
            image = ImageIO.read(imageUrl);
            if (image != null) {
                File file = new File(imgPath);
                ImageIO.write(image, imageFormat, file);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Comments:

  1. it keeps returning this error to me:

     Exception in thread "main" java.lang.NullPointerException
         at ExtractAllImages.main(ExtractAllImages.java:34)

     help :<

     Reply: Hi Felica, we couldn't replicate this error. Can you give us some more details to debug? What URL are you trying to access? You can also write to us at cte.opinion@gmail.com

  2. This works for images in IMG tags. However, I could not find a way to extract background images. Would that be possible?