6 Outstanding Papers Presented at NeurIPS 2023 

During the ongoing Neural Information Processing Systems (NeurIPS) annual conference, reviewers and chairpersons are currently evaluating tens of thousands of submissions.

Out of the 13,321 papers submitted by authors and researchers worldwide, the best have won this year's outstanding paper awards. Here are the 6 outstanding papers announced by NeurIPS in 2023:

Outstanding Main Track Papers

Privacy Auditing with One (1) Training Run

Steinke, Nasr, and Jagielski propose an efficient auditing scheme for assessing the privacy of differentially private machine learning (ML) systems in a single training run. They leverage the parallelism of adding or removing multiple training examples independently. They avoid the computational cost of group privacy by analysing the connection between differential privacy and statistical generalisation.

Their approach works in both black-box and white-box settings, requiring minimal assumptions about the algorithm. They demonstrate the effectiveness of their framework on DP-SGD, achieving meaningful privacy bounds with just one model, while standard methods would need hundreds of models.

Are Emergent Abilities of Large Language Models a Mirage?

Schaeffer, Miranda, and Koyejo challenge the idea that large language models (LLMs) exhibit true emergent abilities. They propose that perceived emergent abilities are often a result of the researcher’s metric choices rather than fundamental changes in model behaviour with scale. They support this with a mathematical model and three analyses:

  1. Confirming predictions on metric effects using InstructGPT/GPT-3
  2. Validating predictions in a meta-analysis on BIG-Bench
  3. Demonstrating how metric choices can create apparent emergent abilities in vision tasks across different networks.

Their findings suggest that alleged emergent abilities may vanish with different metrics, questioning the notion that they are intrinsic to scaled AI models.
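
The metric argument can be illustrated with a small sketch. It assumes, hypothetically, that per-token accuracy improves smoothly with scale and that a task is scored by exact match over a 10-token answer with independent tokens. The accuracy values are invented for illustration and are not results from the paper.

```python
# Illustrative only: the accuracies below are invented, not results from the paper.
ANSWER_LENGTH = 10  # assumed number of tokens that must all be correct


def exact_match(per_token_accuracy: float, length: int = ANSWER_LENGTH) -> float:
    """Probability that every token is right, assuming independent tokens."""
    return per_token_accuracy ** length


# Hypothetical smooth improvement in per-token accuracy across model scales.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

for p in per_token:
    print(f"per-token accuracy {p:.2f} -> exact match {exact_match(p):.3f}")
```

Under these assumptions the per-token metric rises steadily, while the all-or-nothing exact-match score stays near zero for most of the range and then climbs steeply, which can look like a sudden ability.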

Runners-Up

Scaling Data-Constrained Language Models

In the paper, researchers explored scaling language models in data-limited scenarios, given the potential constraint on internet text data. They conducted extensive experiments, varying data repetition and compute budgets of up to 900 billion tokens and 9 billion parameters. Results showed that with limited data and a fixed computing budget, up to 4 epochs of repeated data had minimal impact on loss. However, further repetition diminished the value of additional compute.

They proposed a scaling law for compute optimality, considering the declining value of repeated tokens and excess parameters. Additionally, they tested methods to alleviate data scarcity, such as augmenting with code data or removing common filters.

Models and datasets from 400 training runs are freely available on GitHub.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Here, researchers introduced Direct Preference Optimization (DPO) as a streamlined alternative to Reinforcement Learning from Human Feedback (RLHF) for controlling large unsupervised language models. Unlike RLHF, DPO avoids the complexity and instability of fitting a reward model and then fine-tuning with reinforcement learning. Leveraging a mapping between reward functions and optimal policies, DPO optimises the policy directly in a single training stage by solving a classification problem on human preference data.

The experiments demonstrate that DPO can effectively align language models with human preferences, outperforming RLHF in sentiment control and improving response quality in summarization and dialogue. Notably, DPO is more straightforward to implement and train.
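
As a rough sketch of the classification objective, the function below computes the DPO loss for a single preference pair from sequence log-probabilities. The function name, argument names and the default beta of 0.1 are assumptions made for illustration; a real implementation would operate on batched tensors from the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    beta (assumed value) controls how far the policy may move from the reference.
    """
    # How much more the policy favours the chosen response over the rejected
    # one, measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # Binary classification loss: -log(sigmoid(beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# The policy prefers the chosen response more than the reference does,
# so the loss falls below log(2), the value at zero margin.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

No separate reward model appears anywhere in the computation, which is the simplification the paragraph above describes.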

Datasets and Benchmarks Papers:

ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation

Machine learning experts have introduced ClimSim, the largest hybrid ML-physics dataset, co-created by climate scientists and ML researchers. With 5.7 billion pairs of input-output vectors, it isolates the impact of high-resolution physics on macro-scale climate states. Global and spanning multiple years, the dataset facilitates emulators compatible with operational climate simulators.

The data and code are released openly to support the development of hybrid ML-physics and high-fidelity climate simulations.

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

With the rise of GPT models, practitioners are considering using them for sensitive applications like healthcare and finance, but research reveals undisclosed vulnerabilities. GPT models, including GPT-4, can produce biased, toxic outputs and unintentionally leak private information.

Despite GPT-4’s generally improved trustworthiness, it is vulnerable to jailbreaking system prompts and misleading user prompts. This study highlights previously unrecognised trustworthiness gaps in GPT models.

The benchmark is publicly available on GitHub.

The post 6 Outstanding Papers Presented at NeurIPS 2023 appeared first on Analytics India Magazine.

Top 7 Vision Models Transforming the Future of AI in 2023

In the ever-evolving landscape of artificial intelligence, we have witnessed the launch of groundbreaking vision models that pushed the boundaries of computer vision. These cutting-edge models harnessed advanced neural network architectures, sophisticated training techniques, and unprecedented data sets to redefine the capabilities of visual perception — from enhanced object recognition to nuanced scene understanding.

Here is a list of the top 7 vision models launched this year.

DINOv2

Meta AI has developed DINOv2, an innovative method for training high-performance computer vision models that delivers exceptional performance and does not require fine-tuning. As a result, it is well-suited to serve as a backbone for various computer vision tasks.

Thanks to its self-supervised learning approach, DINOv2 can learn from any image collection and acquire features that the current standard approach cannot, such as depth estimation.

In 2023, DINOv2 was open-sourced, becoming the first method for training computer vision models to use self-supervised learning to achieve results that match or surpass the standard approach used in the field. Self-supervised learning is a potent and adaptable way to train AI models since it does not require vast amounts of labelled data.

The models can be trained on any image collection, regardless of whether they have associated metadata, and learn from all images they are given. This approach expands the potential applications of computer vision models, making them more versatile and powerful than ever before.

YOLOv8

The YOLO (You Only Look Once) series of models is well known in the computer vision world. YOLO is popular because it achieves high accuracy with a small model size. It can be trained on a single GPU, and machine learning practitioners can deploy it on edge hardware or in the cloud at low cost.

YOLOv8 is the latest YOLO model that uses advanced technology to detect objects, classify images, and segment instances. It was created by Ultralytics, the same team that developed the highly influential YOLOv5 model. YOLOv8 brings various architectural and developmental improvements over its predecessor YOLOv5.

As of now, Ultralytics is actively developing YOLOv8 and working on new features while considering feedback from the community. The organisation ensures that their models receive long-term support and collaborates with the community to improve the model’s performance.

EfficientViT

Vision transformers are a widely used framework in computer vision, as they provide high computational capabilities and superior performance. However, while these models continue to improve in accuracy and performance, they also come with higher operational costs and computational overhead. The EfficientViT model was developed to address this issue.

It seeks to enhance the performance of vision transformers and determine the principles for designing efficient and effective transformer-based framework architectures. The EfficientViT model is built on existing vision transformer frameworks such as Swin and DeiT. It analyses three critical factors that affect the speed of model inference: computation redundancy, memory access, and parameter usage.

SWIN Transformer

The field of medical image segmentation can be challenging due to the need for large amounts of pre-training data, which is difficult to acquire. However, recent advancements in large-scale Vision Transformers have led to significant progress in improving pre-trained models for this purpose.

One such advancement is the Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline that enables accurate and data-efficient self-supervised medical image analysis. SwinMM utilises a masked multi-view encoder during the pre-training phase to train masked multi-view observations through image reconstruction, rotation, contrastive learning, and a novel task that capitalises on the consistency between predictions from various perspectives.

During the fine-tuning stage, a cross-view decoder aggregates the multi-view information through a cross-attention block. SwinMM outperforms the previous state-of-the-art self-supervised learning method. SwinMM shows great potential for future applications in medical imaging.

SimCLR

The SimCLR vision model is designed to learn image representations from unlabeled data by generating positive and negative image samples using image augmentation. It then minimises the contrastive loss function to explore more underlying structural information. The SimCLR-Inception model, a new version launched in 2023, achieves better results, with accuracy at least 4% higher than compared models such as LeNet, VGG16, Inception V3, and EfficientNet V2, indicating that this model could work better for robot vision.

SimCLR maps the previous conceptual components onto a deep neural network architecture, inspired by residual neural networks (ResNet). Initially, SimCLR randomly selects examples from the original dataset, transforming each example twice using a combination of simple augmentations (random cropping, random colour distortion, and Gaussian blur), creating two sets of corresponding views.
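
The contrastive objective can be sketched in a few lines. The example below is a minimal, pure-Python version of the normalised temperature-scaled cross-entropy loss used in the original SimCLR paper. It assumes the two augmented views have already been encoded into small embedding vectors; the function names, toy vectors and temperature of 0.5 are illustrative assumptions, and the encoder and augmentations are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """Average contrastive loss; views_a[i] and views_b[i] come from the same image."""
    embeddings = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i, anchor in enumerate(embeddings):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        # Exponentiated similarity to every other view (the anchor itself is excluded).
        exp_sims = {k: math.exp(cosine(anchor, other) / temperature)
                    for k, other in enumerate(embeddings) if k != i}
        total += -math.log(exp_sims[positive] / sum(exp_sims.values()))
    return total / (2 * n)


# Toy embeddings (assumed): each pair of views points in nearly the same direction.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.1], [0.1, 1.0]]
print(nt_xent(views_a, views_b))
```

The loss is lower when the two views of each image are close to each other and far from the views of other images, which is what training pushes the encoder towards.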

StyleGAN3 by NVIDIA

The progress in generating artificial images has been impressive, thanks to the StyleGAN architecture, which is skilled in creating realistic facial images. In 2023, a group of researchers from NVIDIA and Aalto University introduced StyleGAN 3, which addressed a significant weakness in current generative models. This breakthrough has created numerous possibilities for using these models in video and animation.

Developing this new model was made easier by the well-documented and organised code base and the high level of compatibility with previous versions. It didn’t take long before people began to guide the outputs of StyleGAN3 with CLIP, resulting in beautiful outcomes. In StyleGAN3, each feature’s precise sub-pixel location is solely passed down from the underlying coarse features, resulting in a more natural transformation hierarchy.

MUnit

MUnit is a testing framework for Mule applications that enables you to create automated tests for your APIs and integrations with ease. It offers a comprehensive set of integration and unit testing capabilities and comes fully integrated with Maven and Surefire, making it ideal for use in continuous deployment environments.

In MUnit3, it is assumed that images can be broken down into a content code that is invariant across domains, and a style code that captures domain-specific properties. To translate an image to another domain, the content code is combined with a random style code sampled from the target domain’s style space.

Finally, the latest version makes it possible for users to control the style of translation outputs by providing an example style image.

The post Top 7 Vision Models Transforming the Future of AI in 2023 appeared first on Analytics India Magazine.

Oracle Shares Plummet Over 9% Over Q2 Results

Larry Lulls Everyone into Generative AI

Oracle Corporation witnessed a significant drop of more than 9% in its shares during extended trading on Monday. The decline follows the release of the software giant’s fiscal second-quarter results, which not only fell short of Wall Street expectations but also prompted a downward revision in quarterly revenue guidance.

The financial figures, when compared to consensus estimates from LSEG (formerly Refinitiv), revealed several key areas of concern. According to the reported results:

  • Earnings per Share: Oracle posted adjusted earnings of $1.34 per share, slightly exceeding the expected $1.32 per share.
  • Revenue: The company’s revenue amounted to $12.94 billion, falling short of the anticipated $13.05 billion.

Revenue grew 5% year over year in the quarter ended November 30, while net income rose a substantial 44% to $2.5 billion, equivalent to 89 cents per share, compared to $1.74 billion or 63 cents per share in the same period a year ago.

On a positive note, Oracle reported a remarkable 52% increase in cloud infrastructure revenue, reaching $1.6 billion. Notable clients during this period included Elon Musk’s artificial intelligence startup xAI, Halliburton, and Samsung.

However, Oracle faced challenges meeting the demand from Musk’s company, which sought more AI chips than Oracle could supply.

During the earnings call with analysts, Oracle co-founder Larry Ellison acknowledged the chip shortage, stating that the company had to choose between a smaller buildout to recognise revenue in the quarter or a larger buildout with a wait for capacity availability, leading to the current constraints.

However, the company’s guidance for the fiscal third quarter raised eyebrows, with adjusted net income projected to range between $1.35 and $1.39 per share and a modest 6% to 8% revenue growth. Analysts polled by LSEG had expected $1.37 in adjusted earnings per share and $13.34 billion in revenue, reflecting a 7.6% growth.

Oracle’s performance in key segments further contributed to the market’s reaction. Notably, revenue from cloud services and license support reached $9.64 billion, showing a 12% increase but still falling short of the StreetAccount consensus of $9.71 billion.

Revenue from cloud and on-premises licenses saw an 18% decline to $1.18 billion, slightly below the $1.21 billion consensus. Additionally, services revenue amounted to $1.37 billion, missing the consensus of $1.40 billion.

Oracle’s strategic moves during the quarter included securing cloud business from larger rival Microsoft and announcing the availability of its database software on Microsoft’s Azure public cloud. The company plans to activate 20 data centers connected with Azure in the coming months.

Despite these efforts, the results triggered a significant drop in the stock. Even so, Oracle’s shares remain up about 41% for the year, still outperforming the S&P 500 index, which has gained 20% during the same period. Investors are left to assess the impact of these results on Oracle’s future prospects and market standing.

The post Oracle Shares Plummet Over 9% Over Q2 Results appeared first on Analytics India Magazine.

Elon Musk’s Grok Copies OpenAI’s ChatGPT

Elon Musk’s xAI and its chatbot Grok have recently come under fire for alleged bugs. Public attention was drawn when ChatGPT’s X account shared a screenshot, originally posted by a security tester, of Grok rejecting a query while citing OpenAI’s use case guidelines.

To this, Musk (@elonmusk) replied on December 9, 2023, “Well, son, since you scraped all the data from this platform for your training, you ought to know.”

Igor Babuschkin, an xAI representative, acknowledged the problem and blamed it on ChatGPT outputs that were inadvertently included when Grok was being trained on web data.

The explanation raised doubts among specialists, some of whom suggested that Grok may have been purposefully fine-tuned on OpenAI model output.

This is not the first time that an AI model has been trained on OpenAI’s output. The practice of fine-tuning AI models with synthetic data, generated by other language models, has become more common. Most of this is via ShareGPT, where people share their responses while talking to ChatGPT. It allows models like Grok to specialise in specific tasks, such as coding.

Despite claims that the issue is rare, some experts question the likelihood of it being an accident, suggesting that Grok’s behaviour was trained deliberately. It might be that Grok just wanted to mess around with ChatGPT; after all, it’s a funny and “based” chatbot.

On the other hand, in a recent podcast with Lex Fridman, Elon Musk suggested that he likes the idea of open source AI, and would probably open source Grok. “I am generally in favour of open sourcing, like biased towards open sourcing,” he said.

He is also planning to double the compute power at xAI every month. Currently, Grok is trained on 8,000 NVIDIA A100 GPUs.

The spat between Musk and OpenAI has been long-running. Some weeks back, Sam Altman posted a photo of building Grok with a single prompt, to which Musk replied with a funny poem on how GPT-4 is boring.

GPT-4? More like GPT-Snore!
When it comes to humor, GPT-4 is about as funny as a screendoor on a submarine.
Humor is clearly banned at OpenAI, just like the many other subjects it censors.
That’s why it couldn't tell a joke if it had a goddamn instruction manual. It's like…

— Elon Musk (@elonmusk) November 10, 2023

The post Elon Musk’s Grok Copies OpenAI’s ChatGPT appeared first on Analytics India Magazine.

What Happens at AWS’ $100M Generative AI Innovation Center

Since generative AI has become omnipresent among big tech firms, companies have been funnelling money not just into developing AI models but also into helping their customers ride the tide. In one such attempt, in June, Amazon’s cloud unit invested $100 million to help its clientele develop and implement solutions through the new Generative AI Innovation Center.

At the recent AWS re:Invent, AIM caught up with Sri Elaprolu, the Global Head, to learn about the recently launched innovation centre’s nitty-gritty.

“From ideating, identifying and building a solution that proves the value and meets requirements, that’s what the team does,” said Elaprolu. More than 1,000 customers across all industries have come in since the project started five months ago.

Elaborating on AWS clientele, he mentioned that the customer base has both global enterprises and public sectors. For instance, the Singapore Ministry of Education is a customer.

Elaprolu also mentioned that AWS is helping its clients understand what to build and how to build it at an enterprise level. “It’s now getting more into the core of the enterprise, not just on the chatbot style application,” he elaborated.

Turning to geography, Elaprolu mentioned that the US usually tends to be relatively high on the list when new technology emerges, a trend seen in the past.

But generative AI has seen a different beginning. “The thing that’s different this time”, said the industry veteran, “is that it has started to be hot across and not just one active geography. We’re supporting customers across all areas, including Latin America, Africa, and the Middle East — locations that we normally don’t see a jump right into emerging tech.”

Amazon has been at the forefront of AI, he highlighted by stating the examples of AI in Amazon through retail, AWS, and Alexa. “GenAI is new. But in our view, it’s an extension and not a completely radical thing. It’s just that until now, you’ve been able to predict, but now you can create. But very few companies know what that means for them,” he said.

A Broad Spectrum

Speaking about companies moving towards generative AI under the industry’s pressure, Elaprolu said, “Maybe early on, it was a lot of experimentation. Now, there are two ways you can think of how companies are leveraging hardware.”

One is how a business process can be improved and optimised by bringing generative AI. For example, Bridgewater Associates is the world’s largest hedge fund manager. They have investment analysis, which needs intensive maths crunching. “The company is leveraging Amazon’s Bedrock and Anthropic’s Claude models to simplify many things that took weeks and now take hours. They’ve also built investment analysts that allow junior analysts to move quickly. That’s more of an internal optimisation of what you already do,” Elaprolu said.

The other, beyond cutting down costs and time, is customers who are building new capabilities that did not exist before. “That’s another area where we see a lot of customers,” the AWS employee stated. He gave the example of Lonely Planet building a solution for personalised itinerary planning so that their customers can get a much better experience than what they perhaps were doing previously with manual clicking and dragging. “So it’s a wide spectrum of how companies are evolving,” Elaprolu said in conclusion.

Stepwise Breakdown

Building and maintaining a client base of over 1,000 customers in less than six months is not a piece of cake. AWS is following a strategy that is helping them work with many enterprises.

“The first step that we do there is working with the business leadership and technical leadership to understand their vision, pain points, and opportunity areas for their business. Then we work backwards from there to identify what we can do now with AI and methods that are still traditionally valid. Then we do a business impact analysis to check for positive returns,” Elaprolu revealed.

He further explained that this process is done with the customer at the table, not AWS going off and doing their work. “We then move into understanding their environment, data, technical abilities, and various requirements. If you’re operating in a regulated world, you have compliance regimes to meet,” he added.

Once Elaprolu’s team knows the details, they build that solution. That usually takes roughly 4-8 weeks, on average. AWS then demonstrates the value of that solution using the customer’s data and with their users in mind. “That gives them a lot more confidence now because now it’s not just theory,” Elaprolu stated.

The post What Happens at AWS’ $100M Generative AI Innovation Center appeared first on Analytics India Magazine.

Google Chrome will soon let users build custom AI-generated themes, including US cities

The introduction of custom AI-generated wallpapers was a big development on the Google Pixel 8 and Pixel 8 Pro, and now Google is giving a little AI love to its Chrome browser.

It was just a few days ago that Google announced a "Help Me Write" AI feature was headed to Chrome. But now it appears Chrome users will soon have the ability to create a custom AI-generated theme for their browser. The feature was first spotted by X, formerly Twitter, user Leopeva64, who dove deep into the code of the latest unreleased Canary version of the browser.

Like on the Pixel 8, the feature starts off by asking the user to choose a theme. But the themes are quite a bit more robust than what's offered for Google's flagship phones. Under the subjects tab, the X post shows, there are categories like Buildings, Food, Everyday Objects, Nature, Space, US Cities & Parks, and more.

Those categories expand into further options to choose from. Buildings, for example, breaks down into Airport, Cafe, Castle, Lighthouse, Office, and so on. Everyday Objects shows dozens of household objects that a theme can be built around. Under Space, you can build a theme around Constellations, Satellites, Moon, Sun, Stars, Solar system, Spaceships, and more.

US cities is the category I'm most excited to see. A glance shows options for Arches National Park, Chicago, the Grand Canyon, Houston, Los Angeles, New York City, Philadelphia, Phoenix, San Diego, San Francisco, Seattle, and the Everglades among others.

Once that theme is selected, the user can even further fine-tune their theme with color and mood options — say, a steampunk sad Chicago in blue hues or an expressionist romantic airport with red tones.

Since the feature isn't actually live yet, we don't have an idea of what the wallpapers might look like. But I found myself creating dozens of wallpapers with the Pixel 8 Pro, and it appears Chrome's version will only be better.

Given how long things usually take from first appearing in Chrome's code to actually being deployed for use, it seems likely we'll see a full rollout of this feature within the coming months.

Calm’s new sleep story is ‘narrated’ by Jimmy Stewart, and it’s spookily effective

Jimmy Stewart in "It's a Wonderful Sleep Story"

Last night, I felt restless, so what did I do? I opened up my favorite meditation and sleep application, Calm, and picked a sleep story to play on the Bluetooth speaker. I typically go for one of my favorites, such as "Daddy Pig Reads the Wonderful World of Concrete." But to my surprise, I found a new one, narrated by the great American actor Jimmy Stewart.

Jimmy Stewart… Didn't he die in 1997?

No, Calm did not find an old recording to adapt to its app, as it has done with Bob Ross. Instead, the company enlisted the help of Respeecher, a voice cloning software company that used a voice actor and AI technology to bring the narration to life.

Respeecher is also known for recreating the voices of Luke Skywalker for "The Mandalorian" TV series and Darth Vader for the series "Obi-Wan Kenobi." Stewart's family and estate fully approved of the AI voice adaptation; his likeness is managed by CMG Worldwide.

The 45-minute Calm story, "It's a Wonderful Sleep Story," has Jimmy Stewart narrating a tale about a downtrodden technology entrepreneur, Stanley J. Montgomery, who is not unlike George Bailey, the small-town banker Stewart portrays in the 1946 Christmas classic, "It's a Wonderful Life." The synthetic Stewart recites the tale in rhyme and verse to the backdrop of calming, sleepy, floating music. Calm posted a preview on Instagram.

It's hypnotizing and, dare I say, spookily effective. I don't remember listening to the entire thing, and I know it knocked my wife out pretty quickly.

There's certainly a very uncanny valley feeling hearing the voice of someone who has departed. However, Stewart has been gone long enough that many people listening to it will likely feel nostalgia rather than shock.

I'm unsure how realistic the Stewart voice is, and the cadence is odd. But it works, given the rhyme and intonation of the story and desired application. Still, these AI voice actors have a lot of potential for further exploration, especially by Calm. After Stewart, what is next, Burl Ives, with a new Rudolph story? Boris Karloff with a modern Grinch tale? I'm looking forward to finding out.

Midjourney vs. Dall-E 2: AI Image Generators Comparison

One remarkable development in AI has been the advancement of realistic AI image generators. These tools can generate hyper-realistic images from text descriptions, allowing you to create images based on your imagination.

Midjourney and Dall-E 2 are some of the best AI image generators. Midjourney uses generative adversarial networks and the diffusion technique to create realistic images via Discord’s interface. On the other hand, Dall-E 2 is an AI tool designed by OpenAI that allows you to create realistic images from natural language text prompts.

In this comparison, we will delve into the features and performance of Midjourney and Dall-E 2, exploring their strengths and weaknesses to help you determine which AI art generator is best for you.

Jump to:

  • Midjourney vs. Dall-E 2: Comparison table
  • Midjourney and Dall-E 2 pricing
  • Feature comparison: Midjourney vs. Dall-E 2
  • Midjourney pros and cons
  • Dall-E 2 pros and cons
  • Should your organization use Midjourney or Dall-E 2?
  • Review Methodology

Midjourney vs. Dall-E 2: Comparison table

Features | Midjourney | Dall-E 2
Text-to-image generation | Yes | Yes
Supported resolution | Resolution formats up to 4096×4096 | 1024×1024, 512×512 and 256×256 resolutions
Prompting generation | Discord | DALL-E 2
Creativity | High level of image manipulation | Fewer controls to alter generated image
Image upscaling | Upscale images up to 16x larger | No
Free plan | No | No
Discount | 20% discount on the Annual plan | No
Starting price (Monthly) | $10 per month | $0.016 per image

Midjourney and Dall-E 2 pricing

Midjourney and Dall-E 2 offer a wide range of pricing options. With Midjourney, you can buy a monthly or annual subscription. On the other hand, Dall-E 2 offers a pay-as-you-use plan by charging per image generated, or you may choose to buy credits that allow you to generate images up to a specific limit.

Note that neither Midjourney nor Dall-E 2 offers a free version or trial. You will need to pay to access their services.

Midjourney pricing

A major factor in pricing for Midjourney is the GPU. The GPU determines how fast your image is generated. In Midjourney, images are generated based on a queue system. When you pay for higher plans, you get access to faster GPUs, more jobs and a waiting queue to process your requests.

Midjourney’s pricing is broken down into four plans, each with a range of features and accessibility. All the plans below allow you to purchase extra GPU time at $4 per hour:

  • Basic: $10 per month or $96 per year.
  • Standard: $30 per month or $288 per year.
  • Pro plan: $60 per month or $576 per year.
  • Mega plan: $120 per month or $1,152 per year.
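
The annual prices can be checked against the 20% annual discount given in the comparison table. The sketch below uses only the figures listed above; the dictionary and function names are illustrative.

```python
# Monthly and annual prices in US dollars, taken from the plan list above.
PLANS = {
    "Basic": (10, 96),
    "Standard": (30, 288),
    "Pro": (60, 576),
    "Mega": (120, 1152),
}


def annual_discount_percent(monthly: int, annual: int) -> float:
    """Saving from annual billing versus paying monthly for 12 months."""
    full_year = monthly * 12
    return 100 * (full_year - annual) / full_year


for name, (monthly, annual) in PLANS.items():
    discount = annual_discount_percent(monthly, annual)
    print(f"{name}: {discount:.0f}% off with annual billing")
```

Each plan works out to the same 20% saving, so the annual price is equivalent to paying for roughly 9.6 months at the monthly rate.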

Dall-E 2 pricing

Dall-E 2 pricing structure is different from that of Midjourney. With Dall-E 2, you use credit to generate images. Dall-E 2 offers 115 credits for $15, with the max purchasable credit from the Dall-E 2 page capped at 11,500 credits costing $1,500. Credits are further charged based on the image resolution:

  • 1024×1024 resolution costs $0.020 per image.
  • 512×512 resolution costs $0.018 per image.
  • 256×256 resolution costs $0.016 per image.
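
To put the per-image prices in context, the sketch below counts how many images a fixed budget buys at each resolution. It assumes the listed per-image prices apply directly and ignores the credit bundles; the $10 budget is chosen only because it matches Midjourney's Basic monthly price, and the names are illustrative.

```python
# Per-image prices from the list above, stored in tenths of a cent
# so that the arithmetic stays in whole numbers.
PRICE_PER_IMAGE = {"1024x1024": 20, "512x512": 18, "256x256": 16}


def images_for_budget(budget_dollars: int, resolution: str) -> int:
    """Whole images a budget buys at the listed per-image price."""
    return (budget_dollars * 1000) // PRICE_PER_IMAGE[resolution]


for resolution in PRICE_PER_IMAGE:
    print(resolution, images_for_budget(10, resolution))
```

This is not a like-for-like comparison with a Midjourney subscription, since Midjourney plans are limited by GPU time rather than by image count.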

Feature comparison: Midjourney vs. Dall-E 2

Ease of use

Compared to Midjourney, Dall-E 2 has a much easier interface and onboarding. The user simply signs up for an OpenAI account and gets access to the Dall-E 2 prompt interface, where they can type in their prompt and generate an image (Figure A).

Figure A

Dall-E 2 text prompt interface. Image: Aminu Abdullahi/TechRepublic

Midjourney onboarding is a bit tricky. You will be required to sign up using a Discord account, which gives you access to a newcomer room where you can drop your image generation prompts in the public space (Figure B). Compared to Dall-E 2, the Midjourney prompt user interface is not intuitive enough for new users to figure out how to generate images easily. You must use the /imagine command on the Midjourney Discord server to generate an image. Images are also generated based on queues in the public space; low-tier plan users may experience some delay before getting an output.

Figure B

Midjourney’s getting started channel for new users on Discord. Image: Aminu Abdullahi/TechRepublic

Wide resolution

Generated images can only be in one of three sizes when using Dall-E 2: 256×256, 512×512 and the highest resolution of 1024×1024 pixels. The price also varies by resolution. Midjourney, on the other hand, offers a wide selection of resolutions and charges the same price regardless of resolution.

Midjourney’s default resolution is 1024×1024 pixels. But Midjourney’s upscale tool can increase the image size to 2048×2048 or 4096×4096 pixels, which is two to four times Dall-E 2's maximum resolution along each side. This allows for greater detail and clarity in the generated images.
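
The relationship between these sizes is simple arithmetic. As a quick sketch (the `scale_factors` helper is just an illustrative name), the following compares each upscaled size with a 1024×1024 image, both per side and in total pixel count:

```python
def scale_factors(base_side: int, upscaled_side: int) -> tuple[float, float]:
    """Return (linear scale, pixel-count scale) between two square images."""
    linear = upscaled_side / base_side
    return linear, linear ** 2


for side in (2048, 4096):
    linear, pixels = scale_factors(1024, side)
    print(f"{side}x{side}: {linear:.0f}x per side, {pixels:.0f}x the pixels of 1024x1024")
```

Doubling the side length quadruples the pixel count, so a 4096×4096 image holds 16 times as many pixels as a 1024×1024 one.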

Realistic result

Midjourney and Dall-E 2 both create interesting results. With Dall-E 2, the prompt needs to be well-described to get a more realistic image (Figure C). But when the two tools are compared side by side, Midjourney produces more creative images from the same prompt.

Figure C

Prompt: a business meeting with Boss, HR, and 3 employees, round table conference, professional, 8k ultra high quality, realistic, detailed, clear lines --ar 4:5 --v 5 --q 2 --s 750. Image: Avinash via Prompt Engineering Institute
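
The prompt in Figure C combines a plain-text description with trailing parameters. As a rough illustration, the sketch below assembles such a string in Python. It assumes the dashes in the caption stand for the double hyphen that Midjourney parameters use, and `build_imagine_prompt` is a hypothetical helper: it only builds text to paste into Discord and does not call any Midjourney or Discord API.

```python
def build_imagine_prompt(description: str, **params: object) -> str:
    """Join a text description with Midjourney-style `--name value` parameters."""
    flags = " ".join(f"--{name} {value}" for name, value in params.items())
    return f"/imagine prompt: {description} {flags}".strip()


prompt = build_imagine_prompt(
    "a business meeting with Boss, HR, and 3 employees, round table conference",
    ar="4:5",  # aspect ratio, as in the Figure C prompt
    v=5,       # model version
    q=2,       # quality
    s=750,     # stylize
)
print(prompt)
```

Keeping the description and the parameters separate makes it easier to reuse one description while experimenting with different settings.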

Use cases

Text-to-image generation is the basic function of both tools. As with any product, we can break down the best use cases for Midjourney and Dall-E 2 by where each produces the best results.

Midjourney does the job better when it comes to generating creative illustrations, concept art and fan art (Figure D). Upscaling low-resolution images can also be achieved with Midjourney to give a better result.

Figure D

Midjourney-generated, award-winning “Théâtre D’opéra Spatial” artwork. Image: Jason Allen

By comparison, Dall-E 2 is better at generating photorealistic images for product renders or marketing materials and fine-tuning the style or composition of images.

Community support

While Dall-E 2 has a good community, Midjourney's is larger. This is probably because Midjourney uses a social media channel as the gateway to its services. Hence, it is much easier to find support from other users while using Midjourney.

Midjourney pros and cons

Pros of Midjourney

  • Quality images.
  • Provides prompt parameters that can be used to adjust image settings.
  • A wide range of resolutions available.

Cons of Midjourney

  • Requires third-party application integration.
  • Difficult to use at first.
  • Requires a lot of learning — how to use Discord and commands.

Dall-E 2 pros and cons

Pros of Dall-E 2

  • Easy to use.
  • It doesn’t require the use of a third-party application interface.
  • Uses direct English prompts — little learning curve on how to achieve the best result.

Cons of Dall-E 2

  • Limited resolution.
  • Low accuracy from prompts.

Should your organization use Midjourney or Dall-E 2?

Picking the right tool for your organization depends on various factors, from pricing to ease of use and the technical know-how of users.

Choose Dall-E 2 if you are looking for a straightforward, out-of-the-box, easy-to-use tool with a low learning curve for writing the right prompt to give your desired result.

Midjourney, on the other hand, has a wide range of features that could be added to a prompt to help you generate realistic and desired images. If you’re already familiar with Discord, this may be a more suitable option.

As always, it’s important to ensure the tool you select considers your organization’s needs and skills.

Review methodology

In our evaluation, we collected information about features and pricing plans from both the Dall-E 2 and Midjourney websites. We analyzed users’ feedback on reputable product review sites to understand the experiences and opinions of actual users.

Based on this research, we provided an overview of each platform’s features, highlighting their key differences in terms of resolution options, image quality and pricing plans to offer insights into their performance and user satisfaction.

I fact-checked ChatGPT with Bard, Claude, and Copilot — and this AI was the most confidently incorrect

Generative artificial intelligence (AI) is notoriously prone to factual errors. So, what do you do when you've asked ChatGPT to generate 150 presumed facts and you don't want to spend an entire weekend confirming each by hand?

Well, in my case, I turned to other AIs. In this article, I'll explain the project, consider how each AI performed in a fact-checking showdown, and provide some final thoughts and cautions if you also want to venture down this maze of twisty, little passages that are all alike.

The project

Last week, we published a very fun project where we had DALL-E 3, running inside ChatGPT, generate 50 picturesque images that it thought represented each US state. I also had ChatGPT list "the three most interesting facts you know about the state". The results were, as my editor put in the article's title, "gloriously strange".

ChatGPT put the Golden Gate Bridge somewhere in Canada. The tool put Lady Liberty both in the midwest US and somewhere on Manhattan island. And it generated two Empire State Buildings. In short, ChatGPT got its abstract expressionism funk on, but the results were pretty cool.

As for the individual facts, they were mostly on target. I'm pretty good with US geography and history, and thought that few of ChatGPT's generated facts stood out as wildly wrong. But I didn't do any independent fact checking. I just read the results over and pronounced them good enough.

But what if we really want to know the accuracy of those 150 fact bullets? That kind of question seems like an ideal project for an AI.

Methodology

So here's the thing. If GPT-4, the OpenAI large language model (LLM) used by ChatGPT Plus, generated the fact statements, I wasn't entirely convinced it should be checking them. That's like asking high school students to write a history paper without using any references, and then self-correct their work. They're already starting with suspect information — and then you're letting them correct themselves? No, that doesn't sound right to me.

But what if we fed those facts to other LLMs inside of other AIs? Both Google's Bard and Anthropic's Claude have their own LLMs. Bing uses GPT-4, but I figured I'd test its responses just to be completionist.

As you'll see, I got the best feedback from Bard, so I fed its responses back into ChatGPT in a round-robin perversion of the natural order of the universe. It was a cool project.

Anthropic Claude

Claude uses the Claude 2 LLM, which is also used inside of Notion's AI implementation. Claude allowed me to feed it a PDF containing the full set of facts (without the pictures). Here's what I got back:

Overall, Claude found the fact list to be mostly accurate, but it did have some clarifications for three items. I limited how long the ChatGPT facts could be, and that limit inhibited nuance in the fact descriptions. Claude's fact check took issue with some of that lack of nuance.

Overall, it was an encouraging response.

Copilot… or nopilot?

Then we get to Microsoft's Copilot, the renamed Bing Chat AI. Copilot doesn't allow PDFs to be uploaded, so I tried pasting in the text from all 50 state facts. This approach failed immediately, because Copilot only accepts prompts up to 2,000 characters:

I asked Copilot the following:

The following text contains state names followed by three facts for each state. Please examine the facts and identify any that are in error for that state

Here's what I got back:

It pretty much repeated the fact data I asked it to check. So, I tried to guide it with a more forceful prompt:

Once again, it gave me back the data I asked it to verify. I found this output very odd because Copilot uses the same LLM as ChatGPT. Clearly, Microsoft has tuned it differently than ChatGPT.

I gave up, and moved on to Bard.
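
For anyone who wants to retry this with a chatbot that has a similar prompt cap, one way to work within a 2,000-character limit is to split the fact list into chunks of whole lines. The sketch below is only an illustration of that idea, not what was done in this test; `chunk_by_lines` is a hypothetical helper, and it assumes each state's facts sit on their own line.

```python
def chunk_by_lines(text: str, limit: int = 2000) -> list[str]:
    """Greedily pack whole lines into chunks no longer than `limit` characters."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines():
        if len(line) > limit:
            raise ValueError("a single line exceeds the character limit")
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


# Placeholder text standing in for the real fact list.
facts = "\n".join(f"State {i}: fact one; fact two; fact three" for i in range(1, 51))
for number, chunk in enumerate(chunk_by_lines(facts), start=1):
    print(f"chunk {number}: {len(chunk)} characters")
```

In practice the limit would need to be set lower than 2,000, because the instruction asking the AI to check the facts has to fit in the same prompt.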

Bard

Google has just announced their new Gemini LLM. I don't yet have access to Gemini, so I ran these tests on Google's PaLM 2 model.

By comparison to Claude and Copilot, Bard knocked it out of the park, or, more Shakespearianish, it "doth bestride the narrow world like a Colossus."

Check out the results below:

It's important to note that many state facts aren't even agreed upon by the states or there are nuances. As I'll show you in the next section, I fed this list back to ChatGPT and it found two discrepancies in the Alaska and Ohio answers.

But there are other misses here. In some ways, Bard overcompensated for the assignment. For example, Bard correctly stated that other states besides Maine produce lobsters. But Maine goes all-in on its lobster production. I've never been to another state that has miniature lobster traps as one of the most popular tourist trap trinkets.

Or let's pick Nevada and Area 51. ChatGPT said, "Top-secret military base, rumored UFO sightings." Bard tried to correct, saying "Area 51 isn't just rumored to have UFO sightings. It's a real top-secret military facility, and its purpose is unknown." They're saying pretty much the same thing. Bard just missed the nuance that comes from having a tight word limit.

Another place Bard picked on ChatGPT without understanding context was Minnesota. Yes, Wisconsin has a lot of lakes, too. But ChatGPT didn't claim Minnesota had the most lakes. It just described Minnesota as the "Land of 10,000 lakes," which is one of Minnesota's most common slogans.

Bard got hung up on Kansas as well. ChatGPT said Kansas is "Home to the geographic center of the contiguous US." Bard claimed it was South Dakota. And that would be true if you factor in Alaska and Hawaii. But ChatGPT said "contiguous," and that honor goes to a point near Lebanon, Kansas.

I could go on, and I will in the next section, but you get the point. Bard's fact-checking seems impressive, but it often misses the point and gets things just as wrong as any other AI.

Before we move on to ChatGPT's limited fact check of Bard's fact check, let me point out that most of Bard's entries were either wrong or wrong-headed. And yet, Google puts its AI answers in front of most search results. Does that concern you? It sure worries me.

Such a wonder, my lords and ladies, is not to be spoken of.

ChatGPT

Right off the top, I could tell Bard got one of its facts wrong — Alaska is far bigger than Texas. So, I thought, let's see if ChatGPT can fact-check Bard's fact check. For a moment, I thought this bit of AI tail chasing might knock the moon out of Earth's orbit, but then I decided that I would risk the entire structure of our universe because I knew you'd want to know what happened:

Here's what I fed ChatGPT:

And here's what ChatGPT said (and, for clarity, the moon did remain in orbit):

As you can see, ChatGPT took issue with Bard's erroneous claim that Texas is the biggest state. It also had a bit of a tizzy over Ohio vs. Kansas as the birthplace of aviation, which is more controversial than most schools teach.

It's commonly accepted that Wilbur and Orville Wright flew the first aircraft (actually in Kitty Hawk, North Carolina), although they built their Wright Flyer in Dayton, Ohio. That said, Sir George Cayley (1804), Henri Giffard (1852), Félix du Temple (1874), Clément Ader (1890), Otto Lilienthal (1891), Samuel Langley (1896), Gustave Whitehead (1901), and Richard Pearse (1902) — from New Zealand, the UK, France, Germany, and other parts of the US — all have somewhat legitimate claims to being the first in flight.

But we'll give the point to ChatGPT, because it only has 10 words to make a claim, and Ohio was where the Wright Brothers had their bike shop.

Conclusions and caveats

Let's get something out of the way upfront: if you're turning in a paper or a document where you need your facts to be right, do your own fact-checking. Otherwise, your Texas-sized ambitions might get buried under an Alaska-sized problem.

As we saw in our tests, the results (as with Bard) can look quite impressive, but be completely or partially wrong. Overall, it was interesting to ask the various AIs to crosscheck each other, and this is a process I'll probably explore further, but the results were only conclusive in how inconclusive they were.

Copilot gave up completely, and simply asked to go back to its nap. Claude took issue with the nuance of a few answers. Bard hit hard on a whole slew of answers — but, apparently, to err is not only human, it's AI as well.

In conclusion, I must quote the real Bard and say, "Confusion now hath made his masterpiece!"

What do you think? What sort of egregious errors have you seen from your favorite AI? Are you content in trusting the AIs for facts, or will you now do your own fact-checking processes? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter on Substack, and follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.

Google’s Multimodal AI Gemini – A Technical Deep Dive

Google's First Multimodal Model: Gemini

Sundar Pichai, Google's CEO, and Demis Hassabis of Google DeepMind introduced Gemini in December 2023. This new large language model is integrated across Google's vast array of products, offering improvements that ripple through services and tools used by millions.

Gemini, Google's advanced multimodal AI, was born from the collaborative efforts of the unified DeepMind and Brain AI labs. Gemini stands on the shoulders of its predecessors, promising to deliver a more interconnected and intelligent suite of applications.

The announcement of Google Gemini, nestled closely after the debut of Bard, Duet AI, and the PaLM 2 LLM, marks a clear intention from Google to not only compete but lead in the AI revolution.

Contrary to any notions of an AI winter, the launch of Gemini suggests a thriving AI spring, teeming with potential and growth. As we reflect on a year since the emergence of ChatGPT, which itself was a groundbreaking moment for AI, Google's move indicates that the industry's expansion is far from over; in fact, it may just be picking up pace.

What is Gemini?

Google's Gemini model is capable of processing diverse data types such as text, images, audio, and video. It comes in three versions—Ultra, Pro, and Nano—each tailored for specific applications, from complex reasoning to on-device use. Ultra excels in multifaceted tasks and will be available on Bard Advanced, while Pro offers a balance of performance and resource efficiency, already integrated into Bard for text prompts. Nano, optimized for on-device deployment, comes in two sizes and features hardware optimizations like 4-bit quantization for offline use in devices like the Pixel 8 Pro.
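
To give a feel for what 4-bit quantization means, here is a minimal sketch of a generic symmetric scheme in Python. It is an illustration under simplifying assumptions (one shared scale, signed integer codes from -8 to 7), not the method Google uses for Gemini Nano, and the function names and sample weights are made up.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: map floats to integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.95, -0.61]  # made-up example weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
print(codes)
print([round(x, 3) for x in restored])
```

Each weight is stored as one of only 16 possible values plus a shared scale, which is why a quantized model needs far less memory, at the cost of a small rounding error per weight.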

Gemini's architecture is unique in its native multimodal output capability, using discrete image tokens for image generation and integrating audio features from the Universal Speech Model for nuanced audio understanding. Its ability to handle video data as a sequence of images, interleaved with text or audio inputs, exemplifies its multimodal prowess.

Gemini supports sequences of text, image, audio, and video as inputs

Accessing Gemini

Gemini 1.0 is rolling out across Google's ecosystem, including Bard, which now benefits from the refined capabilities of Gemini Pro. Google has also integrated Gemini into its Search, Ads, and Duet services, enhancing user experience with faster, more accurate responses.

For those keen on harnessing the capabilities of Gemini, Google AI Studio and Google Cloud Vertex AI offer access to Gemini Pro, with the latter providing greater customization and security features.

To experience the enhanced capabilities of Bard powered by Gemini Pro, users can take the following straightforward steps:

  1. Navigate to Bard: Open your preferred web browser and go to the Bard website.
  2. Secure Login: Access the service by signing in with your Google account, assuring a seamless and secure experience.
  3. Interactive Chat: You can now chat with Bard and make use of Gemini Pro's advanced features.

Power of Multimodality

At its core, Gemini utilizes a transformer-based architecture, similar to those employed in successful NLP models like GPT-3. However, Gemini's uniqueness lies in its ability to process and integrate information from multiple modalities, including text, images, and code. This is achieved through a novel technique called cross-modal attention, which allows the model to learn relationships and dependencies between different types of data.

Here's a breakdown of Gemini's key components:

  • Multimodal Encoder: This module processes the input data from each modality (e.g., text, image) independently, extracting relevant features and generating individual representations.
  • Cross-modal Attention Network: This network is the heart of Gemini. It allows the model to learn relationships and dependencies between the different representations, enabling them to “talk” to each other and enrich their understanding.
  • Multimodal Decoder: This module utilizes the enriched representations generated by the cross-modal attention network to perform various tasks, such as image captioning, text-to-image generation, and code generation.
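
The idea behind cross-modal attention can be illustrated with a generic scaled dot-product cross-attention step. The sketch below is a toy version in plain Python and is not Gemini's actual implementation: it omits the learned projection matrices a real model would use, and the feature vectors are made up.

```python
import math

Vector = list[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(
    queries: list[Vector], keys: list[Vector], values: list[Vector]
) -> list[Vector]:
    """For each query (e.g. a text token), return a weighted mix of the values
    (e.g. image patch features), weighted by scaled dot-product similarity to the keys."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy, made-up features: two "text token" vectors attend over three "image patch" vectors.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enriched = cross_attention(text_tokens, image_patches, image_patches)
print(enriched)
```

Each output row is a text token's view of the image: patches that resemble the token contribute more to the mix, which is the sense in which the two modalities "talk" to each other.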

The Gemini model isn't just about understanding text or images; it's about integrating different kinds of information in a way that's much closer to how we, as humans, perceive the world. For instance, Gemini can look at a sequence of images and determine the logical or spatial order of objects within them. It can also analyze the design features of objects to make judgments, such as which of two cars has a more aerodynamic shape.

But Gemini's talents go beyond just visual understanding. It can turn a set of instructions into code, creating practical tools like a countdown timer that not only functions as directed but also includes creative elements, such as motivational emojis, to enhance user interaction. This indicates an ability to handle tasks that require a blend of creativity and functionality—skills that are often considered distinctly human.

Gemini's capabilities: Spatial Reasoning

Gemini's capabilities extend to executing programming tasks

Gemini's sophisticated design is based on a rich history of neural network research and leverages Google’s cutting-edge TPU technology for training. Gemini Ultra, in particular, has set new benchmarks in various AI domains, showcasing remarkable performance lifts in multimodal reasoning tasks.

With its ability to parse through and understand complex data, Gemini offers solutions for real-world applications, especially in education. It can analyze and correct solutions to problems, like in physics, by understanding handwritten notes and providing accurate mathematical typesetting. Such capabilities suggest a future where AI assists in educational settings, offering students and educators advanced tools for learning and problem-solving.

Gemini has been leveraged to create agents like AlphaCode 2, which excels at competitive programming problems. This showcases Gemini's potential to act as a generalist AI, capable of handling complex, multi-step problems.

Gemini Nano brings the power of AI to everyday devices, maintaining impressive abilities in tasks like summarization and reading comprehension, as well as coding and STEM-related challenges. These smaller models are fine-tuned to offer high-quality AI functionalities on lower-memory devices, making advanced AI more accessible than ever.

The development of Gemini involved innovations in training algorithms and infrastructure, using Google’s latest TPUs. This allowed for efficient scaling and robust training processes, ensuring that even the smallest models deliver exceptional performance.

The training dataset for Gemini is as diverse as its capabilities, including web documents, books, code, images, audio, and videos. This multimodal and multilingual dataset ensures that Gemini models can understand and process a wide variety of content types effectively.

Gemini and GPT-4

Despite the emergence of other models, the question on everyone's mind is how Google's Gemini stacks up against OpenAI's GPT-4, the industry's benchmark for new LLMs. Google's data suggest that while GPT-4 may excel in commonsense reasoning tasks, Gemini Ultra has the upper hand in almost every other area.

Gemini vs. GPT-4

The above benchmarking table shows the impressive performance of Google's Gemini AI across a variety of tasks. Notably, Gemini Ultra has achieved remarkable results in the MMLU benchmark with 90.04% accuracy, indicating its superior understanding in multiple-choice questions across 57 subjects.

On GSM8K, which assesses grade-school math questions, Gemini Ultra scores 94.4%, showcasing its advanced arithmetic processing skills. In coding benchmarks, Gemini Ultra attains a score of 74.4% on HumanEval for Python code generation, indicating its strong programming language comprehension.

The DROP benchmark, which tests reading comprehension, sees Gemini Ultra again leading with an 82.4% score. Meanwhile, in a common-sense reasoning test, HellaSwag, Gemini Ultra performs admirably, though it does not surpass the extremely high benchmark set by GPT-4.

Conclusion

Gemini's unique architecture, powered by Google's cutting-edge technology, positions it as a formidable player in the AI arena, challenging existing benchmarks set by models like GPT-4. Its versions—Ultra, Pro, and Nano—each cater to specific needs, from complex reasoning tasks to efficient on-device applications, showcasing Google's commitment to making advanced AI accessible across various platforms and devices.

The integration of Gemini into Google's ecosystem, from Bard to Google Cloud Vertex AI, highlights its potential to enhance user experiences across a spectrum of services. It promises not only to refine existing applications but also to open new avenues for AI-driven solutions, whether in personalized assistance, creative endeavors, or business analytics.

As we look ahead, the continuous advancements in AI models like Gemini underscore the importance of ongoing research and development. The challenges of training such sophisticated models and ensuring their ethical and responsible use remain at the forefront of discussion.