Open Letter Led by LAION Urges European Parliament to Protect Open-Source AI in Europe

LAION (the Large-scale Artificial Intelligence Open Network), Stability.AI, and other renowned research institutions and developers have united to address the European Parliament in an open letter emphasizing the possible negative side effects the draft AI Act might have on open-source research and development (R&D) in artificial intelligence (AI).

The letter underlines the essential role that open-source R&D plays in guaranteeing the safety, security, and competitiveness of AI throughout Europe, while also cautioning against the repercussions of inhibiting such groundbreaking work.

The letter addresses the following points, as outlined by LAION.

The Importance of Open-Source AI

The letter outlines three main reasons why open-source AI is worth protecting:

  1. Safety through transparency: Open-source AI promotes safety by enabling researchers and authorities to audit model performance, identify risks, and establish mitigations or countermeasures.
  2. Competition: Open-source AI allows small to medium enterprises to build on existing models and drive productivity, rather than relying on a few large firms for essential technology.
  3. Security: Public and private organizations can adapt open-source models for specialized applications without sharing sensitive data with proprietary firms.

Concerns with the Draft AI Act

The draft AI Act may introduce new requirements for foundation models, which could negatively impact open-source R&D in AI. The letter argues that “one size fits all” rules will stifle open-source R&D and could:

  • Entrench proprietary gatekeepers, often large firms, to the detriment of open-source researchers and developers
  • Limit academic freedom and prevent the European research community from studying models of public significance
  • Reduce competition between model providers and drive investment in AI overseas

Recommendations for the European Parliament

The open letter makes three key recommendations:

  1. Ensure open-source R&D can comply with the AI Act: The Act should promote open-source R&D and recognize the distinctions between closed-source AI models offered as a service and AI models released as open-source code. Where appropriate, the Act should exempt open-source models from regulations intended for closed-source models.
  2. Impose requirements proportional to risk: The Act should impose rules for foundation models that are proportional to their actual risk. A “one size fits all” framework could make it impossible to field low-risk and open-source models in Europe.
  3. Establish public research facilities for compute resources: The EU should establish large-scale supercomputing facilities for AI research, enabling the European research community to study open-source foundation models under controlled conditions with public oversight.

The Future of AI in Europe

The letter concludes with a call to action for the European Parliament to consider the points raised and foster a legislative environment that supports open-source R&D. This approach will promote safety through transparency, drive innovation and competition, and accelerate the development of a sovereign AI capability in Europe.

With numerous esteemed supporters, including the European Laboratory for Learning and Intelligent Systems (ELLIS), a pan-European AI network of excellence, and the German AI Association (KI-Bundesverband), the letter serves as a powerful reminder of the importance of protecting open-source AI for the future of Europe.

Supporters

  • European Laboratory for Learning and Intelligent Systems (ELLIS) – Pan-European AI Network of Excellence
  • German AI Association (KI-Bundesverband) – With more than 400 companies, the largest AI network in Germany
  • Prof. Jürgen Schmidhuber: Scientific Director of the Swiss AI Lab IDSIA (USI & SUPSI), Co-Founder & Chief Scientist of NNAISENSE, Inventor of LSTM Networks
  • Prof. Sepp Hochreiter: JKU Linz, Inventor of LSTM Networks
  • Prof. Bernhard Schölkopf: Director, Max Planck Institute for Intelligent Systems and ELLIS Institute, Tübingen, Germany
  • Prof. Serge Belongie: University of Copenhagen; Director, Pioneer Centre for AI
  • Prof. Andreas Geiger: University of Tübingen and Tübingen AI Center
  • Prof. Irina Rish: Full Professor at Université de Montréal, Canada Excellence Research Chair (CERC) in Autonomous AI and Canada CIFAR AI Chair, core member of Mila – Quebec AI Institute.
  • Prof. Antonio Krüger: CEO of the German Research Center for AI (DFKI) and Professor at the Saarland University
  • Prof. Kristian Kersting: Full Professor at Technical University of Darmstadt and Co-Director, Hessian Center for AI (hessian.AI)
  • Jörg Bienert: CEO of German AI Association, CPO of Alexander Thamm GmbH
  • Patrick Schramowski: Researcher at German Center for Artificial Intelligence (DFKI) and Hessian Center for AI (hessian.AI)
  • Dr. Jenia Jitsev: Lab Leader at Juelich Supercomputing Center, Research Center Juelich, Helmholtz Association, ELLIS member
  • Dr. Sampo Pyysalo: Research Fellow at the University of Turku, Finland
  • Robin Rombach: Co-Developer of Stable Diffusion, PhD Candidate at LMU Munich
  • Prof. Michael Granitzer: Chair of Data Science University of Passau, Germany and Coordinator of OpenWebSearch.eu
  • Prof. Dr. Jens Meiler: Leipzig University, ScaDS.AI Center for Scalable Data Analytics and Artificial Intelligence
  • Prof. Dr. Martin Potthast: Leipzig University, ScaDS.AI Center for Scalable Data Analytics and Artificial Intelligence, and OpenWebSearch.EU
  • Prof. Dr. Holger Hoos: Alexander von Humboldt Professor in AI at RWTH Aachen University (Germany) and Professor of Machine Learning at Universiteit Leiden (Netherlands)
  • Prof. Dr. Henning Wachsmuth: Chair of Natural Language Processing at the Institute of Artificial Intelligence, Leibniz University Hannover
  • Prof. Dr. Wil van der Aalst: Alexander von Humboldt Professor in Process and Data Science at RWTH Aachen University and Chief Scientist at Celonis
  • Prof. Dr. Bastian Leibe: Chair of Computer Vision at RWTH Aachen University (Germany)
  • Prof. Dr. Martin Grohe: Chair for Logic and the Theory of Discrete Systems, RWTH Aachen University
  • Prof. Ludwig Schmidt: Paul G. Allen School of Computer Science & Engineering, University of Washington
  • Dr Morten Irgens: Vice Rector, Kristiania, Co-founder and board member of CLAIRE (the Confederation of Laboratories of AI Research in Europe), Adra (the AI, Data and Robotics Association) and NORA (the Norwegian AI Research Consortium)
  • Prof. Dr. Hector Geffner: Alexander von Humboldt Professor in AI at RWTH Aachen University (Germany), and Wallenberg Guest Professor in AI at Linköping University, Sweden
  • Prof. Dr. Hilde Kuehne: Goethe University Frankfurt (Germany), MIT-IBM Watson AI Lab (USA)
  • Prof. Gerhard Lakemeyer, Ph.D.: Head of the Knowledge-based Systems Group and Chair of the Computer Science Department, RWTH Aachen University, Germany
  • Sebastian Nagel: Crawl Engineer, Common Crawl, Konstanz, Germany

While not officially on the supporters list, Unite.AI also supports this open letter.

Learn how to use ChatGPT for just $20 and boost your bottom line

TL;DR: ChatGPT is the latest AI-powered tool to improve your business operations or offer more services to clients. Learn valuable strategies in the Introduction to ChatGPT course while it’s discounted by 81% to $19.99.

Whether you’re a business owner who would like to increase customer engagement or a freelancer who wants to offer additional services to clients, you should definitely learn how to use ChatGPT. It’s one of the latest tools powered by artificial intelligence and there’s a reason everyone’s talking about it. After taking Introduction to ChatGPT, you can be part of the discussion, too. Best of all, it’s on sale right now for just $19.99.

In just nine lectures totaling 25 hours, Introduction to ChatGPT will teach you how to utilize this groundbreaking technology effectively to support your skills. This AI tool can have an enormous impact when applied to different types of work, such as client services, sales and marketing, HR management, content creation, lead generation and data analytics.

Enhancing your knowledge of ChatGPT can unlock a wealth of new economic prospects. It can help you to improve and streamline your business operations, offer more value to clients and enhance and refine your writing skills.

The Introduction to ChatGPT course is presented by International Open Academy, a leader in the online learning marketplace. IO delivers diverse, user-friendly courses that can help you make a career change or develop a wide spectrum of skills.

All of the courses are accredited by the International Council for Online Educational Standards, and you may obtain an ICOES certificate of completion to validate your new skills with clients and customers. Many courses can earn you credits for Continuing Professional Development/Continuing Education that are recognized and accepted by academic institutions and professional member bodies, both in the U.S. and abroad.

If you want to thrive in today’s rapidly-changing digital environment, you need to stay current with the latest tools. Start right now by learning how to master ChatGPT for all of your personal and professional projects.

Grab Introduction to ChatGPT today while it’s on sale for just $19.99, an 81% discount off the regular retail price of $109.

Prices and availability are subject to change.

Introducing Healthcare-Specific Large Language Models from John Snow Labs

Sponsored Post

By David Talby and Veysel Kocaman, John Snow Labs

Large language models (LLMs) unlock new use cases in healthcare NLP. From automating clinical data abstraction and real-world evidence to improving patient safety and clinical decision support, many use cases focused on the text modality of data stand to benefit. However, since medical systems usually require privacy, truthfulness, compliance, and freshness above what other domains do, achieving the potential of this new technology is a challenge.

As part of its ongoing commitment to keep the healthcare and life science industries at the state of the art of natural language processing (NLP), John Snow Labs recently released a new LLM called BioGPT-JSL, along with capabilities tuned specifically to the medical domain. This article summarizes three things you should know about them.

1. They cover a range of common healthcare use cases:

  • Inquire about medical topics: Use BioGPT-JSL, the first closed-book medical Q&A LLM based on BioGPT.
  • Interpret medical research: Provide a PubMed abstract for extracting key results.
  • Produce clinical text: Offer introductory phrases for completion.
  • Condense clinical encounters: Transform clinical notes (visit summaries, discharge notes, radiology reports, or pathology reports) into a single paragraph.
  • Simplify patient inquiries: Convert patient emails, calls, or posts into one-sentence questions.

a. Ask medical questions: Try asking questions of the new BioGPT-JSL (the first-ever closed-book medical question-answering LLM based on BioGPT):

Input: How to treat asthma?
Output: Asthma is itself an allergic disease due to cold or dust or pollen or grass etc. irrespective of the triggering factor. You can go for pulmonary function tests if not done. Treatment is mainly symptomatic which might require inhalation steroids, beta agonists, anticholinergics as MDI or rota haler as a regular treatment. To decrease the inflammation of bronchi and bronchioles, you might be given oral antihistamines with mast cell stabilizers (montelukast) and steroids (prednisolone) with nebulization and frequently steam inhalation. To decrease the bronchoconstriction caused by allergens, you might be given oral antihistamines with mast cell stabilizers (montelukast) and steroids (prednisolone) with nebulization and frequently steam inhalation. The best way to cure any allergy is a complete avoidance of allergen or triggering factor. Consult your pulmonologist for further advice.

b. Understand medical research: Give it a PubMed abstract and ask about what the key results were:

paper_abstract: "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65–97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus–stimulus and stimulus–response spatial compatibility."
Question: What is the effect of directing attention on memory?
Answer: The results of the present study suggest that the visual indexing theory does not fully explain the effects of spatial attention on memory performance.
Question: Does directing attention improve memory for items?
Answer: no

c. Generate clinical text: Provide an opening phrase as a prompt to complete:

Prompt: The patient is admitted to the clinic with a severe back pain
Text: The patient is admitted to the clinic with a severe back pain and we found that the patient had a large tumor in the lumbar area. The tumor was removed surgically. The patient was diagnosed to have a large tumor in a lumbar vertebra. The tumor was diagnosed as a chordoma based the pathological and immunohistochemical findings, but no evidence for metastasis or invasion to adjacent structures were observed during follow — up after operation.

d. Summarize clinical encounters: Ask it to turn a visit summary, discharge note, radiology report, or pathology report into one paragraph.

Input: Medical Specialty: Allergy / Immunology, Sample Name: Allergic Rhinitis
Description: A 23-year-old white female presents with complaint of allergies. (Medical Transcription Sample Report)
SUBJECTIVE: This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.
MEDICATIONS: Her only medication currently is Ortho Tri-Cyclen and the Allegra.
ALLERGIES: She has no known medicine allergies.
OBJECTIVE: Vitals: Weight was 130 pounds and blood pressure 124/78. HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear. Neck: Supple without adenopathy. Lungs: Clear.
ASSESSMENT: Allergic rhinitis.
PLAN:
1- She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.
2- Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well.
Summary: A 23-year-old female presents with complaints of allergies, including worsening in her home state. She has tried Claritin and Zyrtec, but they haven't worked well. She has asthma and has no known medicine allergies. Her vital signs are stable, and she has mildly erythematous and swollen nasal mucosa. She will try Zyrtec instead of Allegra and use loratadine for nasal sprays. Samples of Nasonex were given for three weeks.

e. Summarize questions from patients: With 5 models for 5 contexts, MedicalSummarizer can also turn an email or post from a patient into a one-sentence question that person is asking:

Input: Hey all. I was diagnosed with pulmonary sarcoidosis in 2011 by biopsy. I'm now 54. Prednisone works so well on me it's scary, but so are the long list of issues with it. I live in Montana and forest fire smoke is already here from other states fires and the air is at moderate. Exposure kicks of my symptoms and then I spiral down. I wait until I have a histamine reaction from my immune system with itching in my ears etc to go in. I just want as much of a quality of life as I can get not revolving around docs, meds etc. I can't find a doc who will only give me prednisone when I'm in a flare where regular modalities aren't working. I'm paleo and gluten free. Almost completely decaf no dairy, don't drink smoke or use any drugs…soooo frustrating. I bought an infrared sauna to detox, it seems to help. I wonder if anyone who also has environmental triggers has used oxygen? It seems like having non-polluted air available would be wiser?? Thanks
Summary Question: What are the treatments for pulmonary sarcoidosis?

2. They’re more accurate than general-purpose LLMs

It should come as no surprise that models trained with domain-specific data & experts outperform general-purpose models. The Python notebooks with these benchmarks are public if you need to reproduce or customize the benchmarks. We also released our own annotations (labelled datasets) for full transparency and welcome feedback and suggestions on how to improve them.

  • Clinical note summarization is 30% more accurate, in terms of BLEU and ROUGE scores, than general state-of-the-art LLMs that are fine-tuned on summarization tasks (BART, Flan-T5, Pegasus).

Demo app: Medical Summarization
Colab notebook: Medical Summarization Notebook

  • On clinical entity recognition, these models make half as many errors as ChatGPT does. Here is a comparison of F1 scores on 12 common entities, measured against manual annotation by a team of medical doctors on sentences from the MTSamples website:

Demo app: Detect clinical entities with Spark NLP
Colab Notebook: Clinical Named Entity Recognition (NER)

  • De-Identification out-of-the-box accuracy is 93% compared to ChatGPT’s 60% on detecting PHI in clinical notes. Here is a comparison on full & partial recognition of sensitive entities from a manually curated set of medical notes:

Demo app: Detecting PHI in clinical notes
Colab Notebook: Clinical Deidentification and Obfuscation

  • Extracting ICD-10-CM codes is done with a 76% success rate versus 26% for GPT-3.5 and 36% for GPT-4. Here is a comparison of entity resolution to ICD-10-CM codes across the three large language models:

Demo app: Resolve Clinical Findings using the ICD-10-CM taxonomy
Colab Notebook: Clinical entity resolution
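
The comparisons above are reported in terms of ROUGE-style overlap scores and F1. As a rough illustration only, and not the code used in the public benchmark notebooks, the following Python sketch shows how a simplified ROUGE-1 F-score and an exact-match entity F1 could be computed. The function names `rouge1_f` and `entity_f1`, the whitespace tokenization, and all example data are assumptions made for this sketch.

```python
from collections import Counter


def rouge1_f(reference, candidate):
    """Simplified ROUGE-1 F-score: unigram overlap between two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold, predicted):
    """Exact-match F1 over sets of (start, end, label) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical summary pair (not taken from the benchmark data).
summary_score = rouge1_f(
    "patient will try zyrtec instead of allegra",
    "patient will try zyrtec and nasonex",
)

# Hypothetical annotations: the model misses one entity and clips another.
gold_spans = [(0, 2, "DRUG"), (5, 6, "DOSAGE"), (9, 11, "SYMPTOM"), (14, 15, "TEST")]
pred_spans = [(0, 2, "DRUG"), (5, 6, "DOSAGE"), (9, 10, "SYMPTOM")]
ner_score = entity_f1(gold_spans, pred_spans)

print(f"ROUGE-1 F: {summary_score:.3f}, entity F1: {ner_score:.3f}")
```

Production evaluations typically use stemming, multiple references, and partial-match rules, so scores from this sketch are not comparable to the published figures.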

3. They’re production ready

Most current LLMs are just not ready to be used in healthcare: they do not have the right privacy and compliance guarantees, can return hallucinations or falsehoods with high confidence, and are not updated at the rate required for production applications. The models from John Snow Labs are designed from the ground up for high-compliance industries, and specifically they:

  • Run on your infrastructure, behind your firewall, under your security controls. No text is ever sent to any third party or cloud service.
  • Do not require a shipload of GPUs. The LLMs are engineered to run on commodity hardware, which makes them both much faster and much cheaper to scale.
  • Are regularly updated. The LLMs are regularly tuned as new research papers, clinical trials, guidelines and terminologies are published. Never go to production with a stale model.

These language models are available as part of John Snow Labs’ Healthcare NLP, a suite of Python libraries, models, and notebooks that allows data scientists to easily put these capabilities to good use. The subscription includes access to regularly refreshed models and all new releases, which is important because models will be frequently rebuilt from scratch. Research is moving fast: we’ve gone through several model architectures in the past six months, and expect to keep rebuilding as research evolves. Only one thing is certain about today’s state-of-the-art LLMs: if you train one today, it will be outdated in 3-6 months.

Innovative Bio-Inspired Sensor Detects Motion and Predicts Trajectories for Various Applications

Researchers at Aalto University have developed an innovative bio-inspired sensor that can detect moving objects in a single video frame and accurately predict their future movements. Described in a Nature Communications paper, this advanced sensor has numerous potential applications in fields such as dynamic vision sensing, automatic inspection, industrial process control, robotic guidance, and autonomous driving technology.

Traditional motion detection systems require numerous components and complex algorithms that perform frame-by-frame analyses, resulting in inefficiency and high energy consumption. To address these limitations, the Aalto University team looked to the human visual system for inspiration and created a neuromorphic vision technology that unifies sensing, memory, and processing into a single device capable of detecting motion and predicting trajectories.

Photomemristors: The Core of the New Technology

The researchers’ technology is built on an array of photomemristors, electrical devices that generate electric current in response to light. Photomemristors possess a unique characteristic: the current doesn’t stop immediately when the light is turned off, but decays gradually. This feature allows photomemristors to effectively “remember” their recent exposure to light, enabling a sensor composed of an array of these devices to capture not only instantaneous information about a scene but also a dynamic memory of preceding moments.

“The unique property of our technology is its ability to integrate a series of optical images in one frame,” explains Hongwei Tan, the research fellow who led the study. “The information of each image is embedded in the following images as hidden information. In other words, the final frame in a video also has information about all the previous frames. That lets us detect motion earlier in the video by analyzing only the final frame with a simple artificial neural network. The result is a compact and efficient sensing unit.”

Demonstrating the Technology’s Capabilities

To showcase their technology, the researchers used videos that displayed the letters of a word one at a time. Although all the words ended with the letter “E,” conventional vision sensors couldn’t discern whether the “E” on the screen had followed the other letters in “APPLE” or “GRAPE.” However, the photomemristor array could utilize hidden information in the final frame to deduce which letters had preceded it and predict the word with nearly 100% accuracy.
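
The word experiment can be imitated with a toy model. The Python sketch below is only an illustration under loud assumptions: each photomemristor is modeled as a value that keeps a fixed fraction of its current from one frame to the next (`DECAY`, a made-up constant), each letter lights exactly one device, and a nearest-match lookup stands in for the neural network the researchers used. None of the names or numbers come from the paper.

```python
DECAY = 0.5        # assumed fraction of photocurrent kept per frame (hypothetical)
ALPHABET = "AEGLPR"  # one toy "photomemristor" per letter (hypothetical layout)


def letter_frame(letter):
    """A frame in which only the device assigned to `letter` is lit."""
    return [1.0 if c == letter else 0.0 for c in ALPHABET]


def integrate_frames(frames, decay=DECAY):
    """Return the array state after the frames are shown in order.

    Old current fades by `decay` at every step and new light adds to it,
    so the final state still carries a weakened trace of earlier frames.
    """
    state = [0.0] * len(frames[0])
    for frame in frames:
        state = [decay * s + x for s, x in zip(state, frame)]
    return state


def final_state(word):
    """Sensor state after a word is displayed one letter at a time."""
    return integrate_frames([letter_frame(ch) for ch in word])


def guess_word(state, vocabulary):
    """Pick the known word whose final state is closest to `state`."""
    def distance(word):
        return sum((a - b) ** 2 for a, b in zip(state, final_state(word)))
    return min(vocabulary, key=distance)


apple = final_state("APPLE")
grape = final_state("GRAPE")

# A memoryless sensor sees the same last frame ("E") for both words,
# while the decaying states differ and can be told apart.
print(letter_frame("E"))
print(apple)
print(grape)
print(guess_word(apple, ["APPLE", "GRAPE"]))
```

In this toy model, the last frame alone is identical for both words, but the decayed traces of the earlier letters make the two final states distinguishable.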

In another experiment, the team showed the sensor videos of a simulated person moving at three different speeds. The system could not only recognize motion by analyzing a single frame, but it also accurately predicted subsequent frames.

Implications for Autonomous Vehicles and Intelligent Transport

Accurate motion detection and trajectory prediction are crucial for self-driving technology and intelligent transport systems. Autonomous vehicles rely on precise predictions of how cars, bikes, pedestrians, and other objects will move in order to make informed decisions. By incorporating a machine learning system into the photomemristor array, the researchers demonstrated that their integrated system could predict future motion based on in-sensor processing of an all-informative frame.

“Motion recognition and prediction by our compact in-sensor memory and computing solution provides new opportunities in autonomous robotics and human-machine interactions,” says Professor Sebastiaan van Dijken. “The in-frame information that we attain in our system using photomemristors avoids redundant data flows, enabling energy-efficient decision-making in real time.”

I used ChatGPT to write the same routine in these ten obscure programming languages

An instructor at the Boston Latin School uses an IBM 1130 computer to teach Fortran to students on October 4, 1968.

A few weeks ago, I took a look at using ChatGPT to write the same routine in a dozen of the most popular programming languages. But as a programming language geek, I wondered just how far ChatGPT would go. Would it program in a language from the 1950s? Would it program in a language that used its own character set? Could it write code in one of the languages that wrote its code?

And so, I dove in. I've used many of the languages I'm spotlighting here, so I'll take a little walk down memory lane and include some stories about my experience with those I've used.

While I haven't run the code itself, I've read through all the generated programs. Most look right, and show the appropriate indicators telling us that the language presented is the language I asked for.

I'm telling you this because the headers on all the screenshots are wrong. Most are listed as SQL. For some reason, BAL is shown as VBNet, and Prolog is listed as Rust. ChatGPT didn't make this error last time, but it made it today for all the languages shown here.

And with that, let's dive in.

Fortran

Fortran (or FORTRAN, as it was depicted back then) stands for Formula Translation. It was developed primarily for scientific and engineering calculations. Even though it dates back to the 1950s, it was often the first language taught to engineering students in the 1970s and 1980s.

For me, it was my fourth programming language, after BASIC, PDP-8 assembly language, and PDP-8 binary (yes, I wrote binary code so I could toggle it in on the front panel of an early minicomputer). My Dad generously drove me the hour down to Newark College of Engineering (now NJIT) so I could take their first-year programming course while I was still a sophomore in high school.

Fortran was never a favorite, although it would get most calculation-oriented jobs done. A variation of Fortran is still in use today, but it's pretty much limited to specialty scientific work since many other modern languages do Fortran-level analytics, and do it better.

Here, because of the use of the implicit keyword, it looks like ChatGPT is depicting code written in the Fortran-77 variant.

Even though the label is wrong, the code is Fortran.

COBOL

I was a teenaged COBOL programmer. I didn't know COBOL at the time, but somewhere around 1980 I saw a want ad for a COBOL programmer at the Northeast Regional Data Center of International Paper in Denville, NJ. It was about 40 minutes from my parents' home, and I needed a summer job. As soon as I managed to schedule an interview, I ran to the local bookstore and spent a weekend chowing down on all the COBOL reading material I could find.

COBOL was meant for business-oriented processing. Today, the same type of customers would be SAP or Salesforce users. As you can see from the COBOL-88 code below, COBOL was functional, but it sure wasn't elegant.

As it turns out, they wanted a full-time employee. But since I aced their test (apparently my weekend of study beat out the scores of the professionals who applied), they decided to put me to work for the summer. To say there was a cultural mismatch was an understatement. These were the very early days of microcomputers (think pre-PC) and I gloried in that stuff. Meanwhile, these folks were all about the suits and ties and giant data centers. I almost immediately picked up the nickname "Trouble." Hey, I was still a kid.

The defining moment for me came when one of the more seasoned programmers at the company took me aside to explain, and I quote, "COBOL is life." I took that message to heart and decided that I didn't want my life to be about COBOL. That summer was the last time I programmed anything in COBOL.

Even though the label is wrong, the code is COBOL, probably COBOL-88.

RPG (Report Program Generator)

No, it's not a rocket-propelled grenade or a role-playing game. RPG was a language for creating reports. You can see the evolution in the language (and computer language thinking, in general) between the two versions shown below. The original RPG dates back to the late 1950s, with RPG II following about a decade later. RPG IV arrived in the 1990s, and is considerably more modern (but still definitely dated).

I did write a few RPG III programs for the AS/400 as a side gig while in college, but it was never anything I followed all that closely.

Two RPGs, both with incorrect labels.

Lisp

I've had a long love affair with Lisp. Lisp, named for LISt Processor, was sometimes described as Lots of Insipid Stupid Parentheses. For some reason, those parentheses clicked in my brain. Lisp was my gateway drug into AI programming, and as we've learned from ChatGPT's processing of language, knowledge can be represented as strings of words, or lists.

Back in the 1980s, I was a product manager for a RISC-based Common Lisp implementation, and coded in it for a while. Based on the defun greeting and let syntax generated by ChatGPT, it looks like the code below is also Common Lisp, a later variation of John McCarthy's original Lisp implementation of the late 1950s. McCarthy was one of the founding fathers of AI research, so it's only fitting that an AI should write in his language.

The label is wrong. That is so not SQL. It's Lisp.

Algol

ALGOL (ALGOrithmic Language) was ahead of its time. It allowed the passing of functions as parameters, and allowed the use of recursion. These are two key components of modern programming that were absent from languages like Fortran and COBOL, which were popular at the time of ALGOL's introduction.
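
For readers who haven't met these two ideas, here is a minimal sketch of both. It's written in Python rather than ALGOL, and the function names are made up purely for illustration.

```python
def factorial(n):
    """Recursion: the function calls itself on a smaller input."""
    return 1 if n <= 1 else n * factorial(n - 1)


def apply_twice(func, value):
    """Function as a parameter: `func` is passed in like any other argument."""
    return func(func(value))


def increment(x):
    return x + 1


print(factorial(5))
print(apply_twice(increment, 3))
print(apply_twice(factorial, 3))
```

The last call combines the two ideas: a recursive function is itself handed to another function as an argument.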

If Fortran, BASIC, and assembly language taught me top-down programming, ALGOL taught me how to structure my code. It was a foundational language that set the stage for my use of C, Pascal, Java, JavaScript, and most of the modern structured languages we use today.

No, that's not SQL. It's ALGOL.

Simula

Simula, as you might imagine, was developed for simulation and modeling. Like ALGOL, Simula was also ahead of its time, supporting the use of objects and classes. Object-oriented programming has since become fundamental for many of today's most effective programming languages.

The use of the keyword OUTFIX tells me that ChatGPT probably wrote the following code in Simula-67.

Not SQL at all, you're looking at some Simula.

BAL (IBM 360/370 assembly language)

BAL, which stands for "Basic Assembly Language" (BAL is also the mnemonic of its "branch and link" instruction), was the assembly language for big IBM iron back in the 1970s and 1980s. Assembly language is considered "lower level" than "higher level" languages, meaning that individual instructions and keywords map to individual processor instructions. In higher level languages (like most we use today), the language is often far removed from the underlying machine architecture, and is designed more for efficiency of programming than efficient use of hardware.

I took a semester of BAL in college. It was rough. The professor spent most days standing with his back to the students, mumbling his lecture in a monotone, while writing nonstop on the blackboard. I sat in the back row next to a girl named Laura. One day, Laura and I decided to sneak out the back window to get coffee. Once we got our java, we climbed back into the window. The professor never noticed.

It's a wonder any of us passed, but it's no wonder that virtually none of us took up BAL programming as our coding weapon of choice.

This time, the AI gave the code the label VBNET. It's not, of course. It's BAL.

APL

APL, or A Programming Language, is another programming language. One of the most striking aspects of APL is that it uses a non-standard character set for some of its operators and coding. You can see that in the image below. The little glyphs are what ChatGPT was able to pull up from the font it was using, but APL is designed to use the APL385 Unicode font, which is unique to programming in APL.

Because it has a concise syntax and handles large datasets well, it's used in a fair number of scientific, financial, and analytical projects.

It's not SQL. It doesn't even use all ASCII characters. It's APL and it marches to its own drummer.

Prolog

Prolog (which is derived from the phrase Programming in Logic) is another AI-focused language. Prolog is a declarative language, listing rules and conditions, and then running queries over those declarations. According to ChatGPT:

"Prolog is one of the programming languages used in the development of ChatGPT, specifically for the generation of text. It is used in the Prolog-to-English module, which translates logical forms in Prolog into natural language sentences."

Prolog is a fun language, but it's not Rust. This is Prolog.

Smalltalk

Smalltalk is special. It was developed at Xerox PARC, the same research lab that influenced the development of the mouse and windowed environments. It incorporated some of Simula's innovations in object-oriented programming and went further. It also was one of the first languages that made active use of a GUI-based interactive programming environment. Many of today's top languages, including Java, Python, and Ruby, were highly influenced by Smalltalk.

I only dabbled in Smalltalk, mostly to get to know it. Living in Silicon Valley at the time, I got to know the designers of Smalltalk, who later spun off ParcPlace Systems in an attempt to commercialize Smalltalk outside of Xerox's corporate stodginess. I have many fond memories of talking programming theory over ribs and coleslaw at a little run-down barbecue joint in Palo Alto.

One of the VPs I worked with at the computer company where I was employed tried to talk me into starting a company with him to apply Smalltalk to the enterprise market, but I hadn't yet been bitten by the startup bug. Smalltalk never made it as a market. Still, as happy as I am with the path my life has taken, I sometimes wonder what would have happened if I had taken him up on his offer.

And, finally, this too is not SQL. It's Smalltalk.

I have thoughts

I didn't run this code, and since I don't have as much recent familiarity with these languages as I do with PHP and JavaScript, I can't be sure from a visual review alone that the code shown above would run successfully. As we saw earlier this week with my exploration of Bard's new programming chops, it's possible to write code without that code working.

We do know there's at least one error in each example: the headers are wrong. So while the ability for the AI to generate code in a bunch of relatively obscure languages is cool beyond belief, don't assume it's doing the entire job correctly.

If you decide to incorporate ChatGPT programming into your workflow, you'll need to check, test, and debug the code it generates just as you would if you hired someone to write some basic routines for you.

You can follow my day-to-day project updates on social media. Be sure to follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.

Fine-Tuning OpenAI Language Models with Noisily Labeled Data

By Chris Mauck, Jonas Mueller

This article demonstrates how data-centric AI tools can improve a fine-tuned Large Language Model (LLM; a.k.a. Foundation Model). These tools optimize the dataset itself rather than altering the model architecture/hyperparameters: running the exact same fine-tuning code on the improved dataset reduces test-set error by 37% on a politeness classification task studied here. We achieve similar accuracy gains via the same data-centric AI process across 3 state-of-the-art LLMs one can fine-tune via the OpenAI API: Davinci, Ada, and Curie. These are variants of the base LLM underpinning GPT-3/ChatGPT.

The above plot shows the test accuracy achieved for 3-class politeness classification of text by the same LLM fine-tuning code (fitting Davinci via OpenAI API) run on 3 different datasets: (1) the original dataset labeled by human annotators, (2) an auto-filtered version of this dataset in which we removed examples automatically estimated to be mislabeled via Confident Learning, (3) a cleaned version of the original data in which we manually fixed labels of examples estimated to be mislabeled (rather than filtering these examples).

Background

Labeled data powers AI/ML in the enterprise, but real-world datasets have been found to contain between 7% and 50% annotation errors. Imperfectly labeled text data hampers the training (and evaluation) of ML models across tasks like intent recognition, entity recognition, and sequence generation. Although pretrained LLMs are equipped with a lot of world knowledge, their performance is adversely affected by noisy training data (as noted by OpenAI). Here we illustrate data-centric techniques to mitigate the effect of label noise without changing any code related to model architecture, hyperparameters, or training. These data quality improvement techniques should thus remain applicable even for future advanced LLMs like GPT-10.

Why Fine-tuning?

LLMs acquire powerful generative and discriminative capabilities after being pre-trained on most text across the internet. Nonetheless, ensuring the LLM produces reliable outputs for a particular business use-case often requires additional training on actual data from this domain labeled with the desired outputs. This domain-specific training is known as fine-tuning the LLM and can be done via APIs offered by OpenAI. Imperfections in the data annotation process inevitably introduce label errors in this domain-specific training data, posing a challenge for proper fine-tuning and evaluation of the LLM.

Why Data-Centric AI?

Here are quotes from OpenAI on their strategy for training state-of-the-art AI systems:

"Since training data shapes the capabilities of any learned model, data filtering is a powerful tool for limiting undesirable model capabilities.”

“We prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.”

Clearly dataset quality is a vital consideration. Some organizations like OpenAI manually handle issues in their data to produce the very best models, but this is tons of work! Data-centric AI is an emerging science of algorithms to detect data issues, so you can systematically improve your dataset more easily with automation.

Our LLM in these experiments is the Davinci model from OpenAI, which is their most capable GPT-3 model, upon which ChatGPT is based.

Overview

Here we consider a 3-class variant of the Stanford Politeness Dataset, which has text phrases labeled as: impolite, neutral, or polite. Annotated by human raters, some of these labels are naturally low-quality.

This article walks through the following steps:

  • Use the original data to fine-tune different state-of-the-art LLMs via the OpenAI API: Davinci, Ada, and Curie.
  • Establish the baseline accuracy of each fine-tuned model on a test set with high-quality labels (established via consensus and high-agreement amongst many human annotators who rated each test example).
  • Use Confident Learning algorithms to automatically identify hundreds of mislabeled examples.
  • Remove the data with automatically-flagged label issues from the dataset, and then fine-tune the exact same LLMs on the auto-filtered dataset. This simple step reduces the error in Davinci model predictions by 8%!
  • Introduce a no-code solution to efficiently fix the label errors in the dataset, and then fine-tune the exact same LLM on the fixed dataset. This reduces the error in Davinci model predictions by 37%!

Similar gains are achieved via these same processes for the Ada and Curie models — in all cases, nothing was changed about the model nor the fine-tuning code!

Here’s a notebook you can run to reproduce the results demonstrated in this article and understand the code to implement each step.

Politeness Dataset

You can download the train and test sets here: train test

Our training dataset has 1916 examples each labeled by a single human annotator, and thus some may be unreliable. The test dataset has 480 examples each labeled by five annotators, and we use their consensus label as a high-quality approximation of the true politeness (measuring test accuracy against these consensus labels). To ensure a fair comparison, this test dataset remains fixed throughout our experiments (all label cleaning / dataset modification is only done in the training set). We reformat these CSV files into the jsonl file type required by OpenAI’s fine-tuning API.
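As a rough sketch of the reformatting step, the snippet below writes prompt/completion records to a JSONL file using only the standard library. The helper names `write_jsonl` and `read_jsonl`, the example rows, and the file name are hypothetical, and any separator or whitespace conventions added by OpenAI's data preparation tooling are not reproduced here.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, keeping only prompt and completion."""
    with open(path, "w", encoding="utf-8") as f:
        for row in records:
            line = {"prompt": row["prompt"], "completion": row["completion"]}
            f.write(json.dumps(line, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts (handy for sanity checks)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical rows standing in for the politeness CSV.
example_rows = [
    {"prompt": "Would you mind adding me as a committer?", "completion": "polite"},
    {"prompt": "Why did you revert my edit?", "completion": "neutral"},
]

jsonl_path = os.path.join(tempfile.gettempdir(), "train_example.jsonl")
write_jsonl(example_rows, jsonl_path)
round_trip = read_jsonl(jsonl_path)
```

Reading the file back and comparing it with the input is a cheap way to confirm that no rows were dropped or garbled before submitting a fine-tuning job.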

Fine-tune and Evaluate LLM

Here’s how our code looks to fine-tune the Davinci LLM for 3-class classification and evaluate its test accuracy:

!openai api fine_tunes.create -t "train_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "baseline"

>>> Created fine-tune: ft-9800F2gcVNzyMdTLKcMqAtJ5

Once the job completes, we query a fine_tunes.results endpoint to see the test accuracy achieved when fine-tuning this LLM on the original training dataset.

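!openai api fine_tunes.results -i ft-9800F2gcVNzyMdTLKcMqAtJ5 > baseline.csv

import pandas as pd

# The last row of the results file holds the final classification metrics.
df = pd.read_csv('baseline.csv')
baseline_acc = df.iloc[-1]['classification/accuracy']
print(f"Fine-tuning Accuracy: {baseline_acc}")

>>> Fine-tuning Accuracy: 0.6312500238418579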

Our baseline Davinci LLM achieves a test accuracy of 63% when fine-tuned on the raw training data with possibly noisy labels. Even a state-of-the-art LLM like the Davinci model produces lackluster results for this classification task. Is it because the data labels are noisy?

Automatically Find Label Issues

Confident Learning is a recently developed suite of algorithms to estimate which data are mislabeled in a classification dataset. These algorithms require out-of-sample predicted class probabilities for all of our training examples and apply a novel form of calibration to determine when to trust the model over the given label in the data.

To obtain these predicted probabilities we:

  1. Use the OpenAI API to compute embeddings from the Davinci model for all of our training examples. You can download the embeddings here.
  2. Fit a logistic regression model on the embeddings and labels in the original data. We use 10-fold cross-validation which allows us to produce out-of-sample predicted class probabilities for every example in the training dataset.

# Get embeddings from OpenAI.
import numpy as np
from openai.embeddings_utils import get_embedding

embedding_model = "text-similarity-davinci-001"
train["embedding"] = train.prompt.apply(lambda x: get_embedding(x, engine=embedding_model))
# Stack the per-example embedding lists into a 2D array for scikit-learn.
embeddings = np.array(train["embedding"].tolist())

# Get out-of-sample predicted class probabilities via cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression()
labels = train["completion"].values
pred_probs = cross_val_predict(estimator=model, X=embeddings, y=labels,
                               cv=10, method="predict_proba")

The cleanlab package offers an open-source Python implementation of Confident Learning. With one line of code, we can run Confident Learning using the model predicted probabilities to estimate which examples have label issues in our training dataset.

from cleanlab.filter import find_label_issues

# Get indices of examples estimated to have label issues:
issue_idx = find_label_issues(labels, pred_probs,
                              return_indices_ranked_by='self_confidence')  # sort indices by likelihood of label error

Let’s take a look at a few of the label issues automatically identified in our dataset. Here’s one example that is clearly mislabeled:

  • Phrase: I'll take a look at getLogEntries when I have time. Would you mind adding me as a committer?
  • Label: impolite

Labeling errors like this are why we might be seeing poor model results.

Caption: A few of the top errors that were automatically identified.

Note: find_label_issues is able to determine which of the given labels are potentially incorrect given only the out-of-sample pred_probs.
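To make the note above more concrete, here is a simplified sketch of the thresholding idea behind Confident Learning. It is an illustration only, not cleanlab's implementation: the function name `flag_label_issues` and the toy inputs are hypothetical, and the real `find_label_issues` performs additional estimation and ranking steps.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confidently predicted class.

    labels: list of integer class ids (0..K-1).
    pred_probs: list of per-class probability lists, ideally out-of-sample.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the examples that were labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        # A class with no labeled examples can never be "confident".
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability clears that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                issues.append(i)
    return issues


# Toy two-class example: the last item is labeled 0 but looks like class 1.
toy_labels = [0, 0, 1, 1, 0]
toy_pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
toy_issues = flag_label_issues(toy_labels, toy_pred_probs)
```

Using a separate threshold per class, rather than a fixed cutoff such as 0.5, lets the check adapt to classes the model is systematically more or less confident about.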

Filter Label Issues and Fine-tune a more Robust LLM

Now that we have the indices of potentially mislabeled examples (identified via automated techniques), let’s remove these 471 examples from our training dataset. Fine-tuning the exact same Davinci LLM on the filtered dataset achieves a test accuracy of 66% (on the same test data where our original Davinci LLM achieved 63% accuracy). We reduced the error rate of the model by 8% using less but better-quality training data!

# Remove data flagged with potential label error.
train_cl = train.drop(issue_idx).reset_index(drop=True)
format_data(train_cl, "train_cl.jsonl")

# Train a more robust classifier with less erroneous data.
!openai api fine_tunes.create -t "train_cl_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "dropped"

# Evaluate model on test data.
!openai api fine_tunes.results -i ft-InhTRQGu11gIDlVJUt0LYbEx > autofiltered.csv
df = pd.read_csv('autofiltered.csv')
dropped_acc = df.iloc[-1]['classification/accuracy']

>>> 0.6604166626930237
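The 8% figure is a relative reduction in error rate, not a gain in accuracy points. A minimal sketch of that arithmetic, using the two accuracies reported above (the function name `relative_error_reduction` is hypothetical):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error rate removed by the new model."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


baseline_acc = 0.6312500238418579  # accuracy reported for the original data
dropped_acc = 0.6604166626930237   # accuracy reported for the auto-filtered data

reduction = relative_error_reduction(baseline_acc, dropped_acc)
print(f"Relative error reduction: {reduction:.1%}")  # roughly 8%
```

In absolute terms the accuracy gain is about 3 percentage points; expressing it relative to the baseline error of about 37% gives the roughly 8% quoted in the text.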

Fixing the Label Errors

Instead of fixing the auto-detected label issues automatically via filtering, the smarter (yet more complex) way to improve our dataset would be to correct the label issues by hand. This simultaneously removes a noisy data point and adds an accurate one, but making such corrections manually is cumbersome. We did this manually using Cleanlab Studio, an enterprise data correction interface.

After replacing the bad labels we spotted with more suitable ones, we fine-tune the exact same Davinci LLM on the manually-corrected dataset. The resulting model achieves 77% accuracy (on the same test dataset as before), which is a 37% reduction in error from our original version of this model.

# Load in and format data with the manually fixed labels.
train_studio = pd.read_csv('train_corrected.csv')
format_data(train_studio, "train_corrected.jsonl")

# Train a more robust classifier with the fixed data.
!openai api fine_tunes.create -t "train_corrected_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "corrected"

# Evaluate model on test data.
!openai api fine_tunes.results -i ft-MQbaduYd8UGD2EWBmfpoQpkQ > corrected.csv
df = pd.read_csv('corrected.csv')
corrected_acc = df.iloc[-1]['classification/accuracy']

>>> 0.7729166746139526

Note: throughout this entire process, we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! All improvement strictly comes from increasing the quality of our training data, which leaves room for additional optimizations on the modeling side.

Evaluating other LLMs

We repeated this same experiment with two other recent LLMs OpenAI offers for fine-tuning: Ada and Curie. The resulting improvements look similar to those achieved for the Davinci model.

Conclusion

Data-centric AI is a powerful paradigm for handling noisy data via AI/automated techniques rather than tedious manual effort. There are now tools to help you efficiently find and fix data and label issues to improve any ML model (not just LLMs) for most types of data (not just text, but also images, audio, tabular data, etc). Such tools utilize any ML model to diagnose/fix issues in the data and then improve the data for any other ML model. These tools will remain applicable with future advances in ML models like GPT-10, and will only become better at identifying issues when used with more accurate models!

Practice data-centric AI to systematically engineer better data via AI/automation. This frees you to capitalize on your unique domain knowledge rather than fixing general data issues like label errors.
Chris Mauck is a Data Scientist at Cleanlab.

Data Visualization Best Practices & Resources for Effective Communication

Image by Author

Here’s my hot take: Good data visualization is objective. It’s an art, yes, but unlike the debate about whether modern art is good or bad, which is impossible to answer, there are definitely “good” and “bad” data visualizations.

We’ve all seen a bad graph and were able to say, objectively, that graph sucks. And likewise, we’ve all seen an amazing piece of data visualization that succinctly and cleverly communicates a complicated topic.

Here’s one of my favorites, just as an example of good data viz. It’s a map of a cholera outbreak in 1854. Using this rudimentary dot chart, physician John Snow was able to see where deaths were highest and ultimately figure out what was causing the outbreak — a contaminated well. This looks simple by today’s standards, but it was positively ground-breaking in 1854.

It’s interesting, it’s rich, it invites further investigation, and it determines a trend. With no epidemiological knowledge, you can intuit what’s happening here. In other words, it’s great data viz.

Image from Wikipedia

If you want to see a bad example, look at the graph I generated for this article, where I’m trying to figure out if people like my chart-making skills or not.

It’s bad. You have no idea what I’m trying to say; it doesn’t help spot any trends or patterns. With a single glance, you can say, “Yeah, Nate, that needs some work.”

Image by Author

The good news is that because it’s an objective art, you can learn to do it well. It’s not an innate talent that you’re either born with or not – it’s a learned skill.

To help you avoid the sin of bad data viz, I’ll talk about the best practices. Though it’s a bit of an art, there is a science you can apply to the process of data visualization to make sure you’re effectively communicating.

What is Data Visualization?

Data visualization is when you make numbers tell a story. That’s my version of the definition anyway; Wikipedia somewhat more dryly defines it as “the process of designing and creating easy-to-communicate and easy-to-understand graphic or visual representation of a large amount of complex data and information.”

Any number of jobs will end up touching data visualization, since there isn’t a single job title of “data visualist,” at least not today. Data analysts, business analysts, data scientists, and even backend developers might be tasked with creating a graphic to convey some key detail.

For example, as a backend developer, you might find that your data analytics team wants you to create a visualization representing outcomes in a binary search tree. As a data scientist, you’ll be asked to turn complex finance numbers into a chart that makes sense to C-suite execs.

Data visualization is communication, plain and simple.

Why Does Good Data Visualization Matter?

This is like asking why good communication matters. But let’s break it down even further. Good data visualization matters for a few different reasons.

Communicate complex information

Imagine you’re John Snow, back in 1854. Your patients are dying. You know there’s a pattern, you know it’s something to do with a particular contaminated well. You’re trying to explain this to beleaguered, skeptical city officials who don’t really believe that disease can spread this way.

Can you imagine trying to describe that cholera chart to someone? How would you do it? It’d be borderline impossible.

By contrast, non-experts can see that chart and instantly make sense of what’s happening. There’s a pattern of deaths that matches geography. Those households are the ones drawing water from that well. His chart communicated complex information at a glance. That’s one of the strengths of good data viz.

Identify patterns and trends

Let’s say you’re a data scientist working for a healthcare company. You’re trying to analyze patient data to improve care, so you’re looking at patient demographics, medical history, and treatment outcomes.

When you run a typical statistical analysis, you don’t notice any standout patterns. However, when you throw mortality and age onto a scatter plot, you realize that patients over the age of 65 have a sharply increased mortality rate.

Image by Author

Now you can pass those findings on to healthcare practitioners so they can investigate ways of reversing that trend.

What Are the Best Practices for Good Data Visualization?

OK, now you understand what good data viz is and why it matters. Let’s get into the data visualization best practices you can apply to make sure you create stunning, memorable, compelling graphs and charts.

Know your audience

This is the most important step. Who are you creating this data visualization for? What are they interested in? What kind of basic understanding do they already have? What do they need this data viz for?

For example, imagine you’re a data analyst trying to explain to the CTO how effective an email marketing campaign is for various segments of the brand’s audience. The outcome of this meeting will determine the entire email marketing strategy for the next quarter.

But you’ve forgotten that what’s second nature to you (CTRs, CTAs, and segments named things like “Segment A”) is not easily parsed by non-experts.

You present the following atrocity and have to spend the entire meeting explaining what all that jargon and those segment names mean. The CTO is confused, unsatisfied, and unable to make a decision.

Image by Author

Instead, you should streamline this to the one key point that decision-makers need in order to decide, and make sure that everything makes sense to that audience. Here’s what a good version of that data visualization could look like:

Image by Author

The audience can clearly understand the data and make a decision.

Keep it simple

You know how, when you watch Star Wars nowadays, it feels a little bit like filmmakers recently discovered all the PowerPoint transitions they could use, and used every single one just because of the fun and novelty?

That’s bad data viz practice. A good data viz practice is to keep things as simple as possible.

For example, a few years ago there was this big trend of doing 3D charts. It didn’t add anything to the information being conveyed. But it was fancy, so people loved it.

Image from Semantic Scholar

Good data viz means you keep the focus on the data. Don’t make it interactive if it doesn’t need to be interactive. Don’t add more colors than you really need. If you can get away with removing extra legends by making titles self-explanatory, all the better.

Choose the right chart type

Let’s say you want to show change over time. What is the best type of chart?

Your answer to that question can mean the difference between a good, clean bit of data visualization, and a monstrosity that shouldn’t see the light of day.

For the record, the right answer is a line chart. Have time on the x-axis, and whatever other factor you’re measuring on the y-axis.

Image by Author

Go back to my hideous pie chart from earlier. You can clearly see this is the wrong type of data visualization for the answer I’m trying to get. A pie chart indicates a wholeness of some kind; it’s great for adding up percentages. So if 55% of my employees think the chart is great, but 45% don’t, then a pie chart is good to convey that finding.

But for a mishmash of open text box answers? A pie chart is worse than useless.

Here is a nice table to show you, as a rough rule of thumb, which types of data viz are right to use when. Remember, you’re the expert on your own data, so take this with a pinch of salt.

Chart type      Best used for
Line charts     Trends over time
Bar charts      Comparing values between groups
Pie charts      Show the proportion of different groups
Scatter plots   Relationship between two variables
Heat maps       Visualize data in a matrix format
Tree maps       Hierarchical data
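If you build charts programmatically, the rule of thumb above can be kept as a small lookup. This is only a sketch: the names `CHART_FOR_GOAL` and `suggest_chart` are hypothetical, and the mapping simply restates the table rather than replacing your own judgment about the data.

```python
# Rule-of-thumb mapping from what you want to show to a chart type.
CHART_FOR_GOAL = {
    "trends over time": "line chart",
    "comparing values between groups": "bar chart",
    "proportion of different groups": "pie chart",
    "relationship between two variables": "scatter plot",
    "data in a matrix format": "heat map",
    "hierarchical data": "tree map",
}


def suggest_chart(goal):
    """Return the rule-of-thumb chart type for a goal, or None if it is not listed."""
    return CHART_FOR_GOAL.get(goal.strip().lower())


suggestion = suggest_chart("Trends over time")
```

Returning None for an unlisted goal, instead of guessing, keeps the final decision with the person who knows the data.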

I also encourage you to peruse data viz and make note of what you like and don’t like. Remember, data viz is objective. You can, with some thought, put your finger on what’s working and what isn’t, and take those findings for your own data visualizations.

Provide context

Finally, you should always explain the why behind your data visualizations. What are the data units? What does the data represent? What other relevant information is needed to make your case?

Look at this example for what not to do:

Image from Tableau

It’s overly complex, which already violates our second Good Data Viz commandment. But it also doesn’t give any context. What takeaway am I supposed to have here? What do those letters mean? Why are those rectangles not to scale?

If you need to provide definitions, throw them in. If you think an industry benchmark will better illustrate the significance of your findings, add it. And most importantly, remember you’re telling a story. If you just wanted to provide numbers, you could give people a table. But you’re not. You’re shaping the narrative. That’s why context matters so much.

Remember, you’re the expert on these numbers. You’re communicating an idea. You need to provide any supplementary material you think will help you make your case best.

Resources to Learn more about Data Visualization

There are two ways to learn more about data viz: learning and doing. Let’s go through both types.

Read/watch/consume content about data viz

First, you should ground yourself in data visualization basics. I recommend the following resources:

  • I love David McCandless’s YouTube lecture on the beauty of data viz as a starting point.
  • Greg Martin’s intro to Data Viz is also a really nice video to get your feet wet.
  • Simplilearn has a bite-size tutorial of just 27 minutes on YouTube.
  • IBM’s Data Visualization with Python Course is a good next step, hosted by Coursera. It’s free.

DIY style

Once you’re done listening, watching, and learning, it’s time to apply what you know. Get reputable data from sources like:

  • Statista
  • Tidy Tuesdays GitHub
  • Your own life — what you eat, how you spend your time, your mood, your career applications, anything!

Then, try to make data visualizations yourself. Consider the data and think about what questions you have, what trends you’d like to spot, and what’s confusing and could be made clearer.

You can use platforms like The Pudding or Kaggle to get inspiration for what kinds of questions you can ask or answer.

I also recommend checking out what real-life interviewers are asking in data science interviews. Platforms like StrataScratch help you practice your data viz skills on real-world examples.

Want more? The 30 Resources for Mastering Data Visualization is a great best-of list of resources all about data visualization.

The Best Practices for Good Data Viz Communication

There’s that classic quote: “A picture is worth a thousand words.” If that’s true, then good data visualization is a library’s worth of words.

Good data visualization is the backbone of almost any meaningful decision that gets made at any company. It’s what helps people from different departments communicate in a way that makes sense to all parties. It’s how you take a mess of numbers and make them tell a story.

But it’s easy to get wrong. To do data visualization the right way, remember you need to keep your audience in mind, keep things as simple as possible, choose the right type of chart, and always provide context.

Hopefully, this illustrated guide has helped you to better understand what good data viz is, and how you can make the best data visualizations moving forward.
Nate Rosidi is a data scientist who also works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.

Iowa State University: “Thinking Like a Data Scientist” Lessons Learned

I recently completed teaching my “Big Data MBA: Thinking Like a Data Scientist (TLADS)” class for the spring semester at Iowa State University. I had 17 second-year MBA students, and their diligence, passion, and creativity were evident throughout the semester and especially in the final project presentations.

This class had no tests or mid-term exams where the students were asked to memorize and regurgitate knowledge readily accessible via my books, blogs, and ChatGPT. Instead, these students were graded as a team on their ability to collaborate in defining, synthesizing, and clearly and creatively presenting where and how to apply data science to deliver meaningful, relevant, and ethical business outcomes. Yea, it was like a real-world consulting project complete with tight deadlines, team interpersonal dynamics, and changes to project scope throughout the engagement.

We created five teams of 3 to 4 students. Each team picked a company and then researched that company to identify and understand the company’s key business initiatives. The teams then applied my “Thinking Like a Data Scientist” methodology over the next 13 weeks to determine where and how data and analytics could be applied to the company’s business initiative to drive meaningful, relevant, and ethical results. The companies and the business initiatives that they selected were:

  • Lululemon: Double Men’s Sales while Maintaining Women’s Sales Growth
  • Nike: Promote Sustainability Efforts to Capture Environmentally Concerned Customers
  • Nvidia: Increase Design Wins in Gaming Console Market
  • Twitter: Improve Personalized Content Matching to Increase Twitter User Engagement
  • John Deere: Increase Hybrid/EV Tractor and Equipment Sales

As we progressed throughout the semester, new ideas were explored, resulting in modifications to existing TLADS design templates and even a new TLADS design template. Yea, we all benefit when we are fearless in exploring, collaborating, and sharing.

TLADS Lesson #1: Decision Map Template to Identify Decision “Influencers”

In Step 2 of the TLADS process, we seek to understand key stakeholders’ decisions, desired outcomes, and the KPIs and metrics against which the outcomes’ effectiveness will be measured. The Persona template in Figure 1 allows us to walk in the shoes of our key stakeholders to better understand their jobs, objectives, desired outcomes, and key decisions.

Figure 1: TLADS Persona Template

We discovered that we could complement the Persona template with a simple Decision Map to facilitate the brainstorming of the variables and metrics that might influence the Persona’s key decisions. These “influencers” became the starter set of variables and metrics (features) that we leveraged in TLADS Step 5 and Step 6 in defining the Analytic Scores that predict the likelihood of a stakeholder behavior or performance outcome (Figure 2).

Figure 2: TLADS Stakeholder Decision Map

TLADS Lesson #2: Mapping Reusable Features to Analytic Scores

In Step 5 of the TLADS methodology, we identify the features we will use to create analytic scores supporting the stakeholders’ key decisions. Not surprisingly, features (such as weather, local demographics, traffic patterns, local events, and economic conditions) can be reused across multiple analytic scores. So, we created “Template 6.1: Use Case Feature-to-Score Mapping” to help us identify and map those features that can be reused and managed by an organization’s Feature Store[1] (Figure 3).

Figure 3: Feature-to-Analytic Score Mapping
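
As a rough illustration of the mapping idea (not the actual Template 6.1), the feature-to-score relationship can be sketched in a few lines of Python. The score names below are hypothetical and invented for this example; the feature names are the ones listed above. The sketch inverts a score-to-features mapping and flags any feature that feeds more than one analytic score as a candidate for the feature store.

```python
from collections import defaultdict

# Hypothetical analytic scores and the features that feed each one.
# Score names are invented for illustration; feature names come from the text.
score_features = {
    "store_traffic_score": ["weather", "local_events", "traffic_patterns"],
    "demand_forecast_score": ["weather", "local_demographics", "economic_conditions"],
    "promotion_response_score": ["local_demographics", "local_events", "economic_conditions"],
}


def map_features_to_scores(score_features):
    """Invert a score -> features mapping into feature -> sorted list of scores."""
    feature_scores = defaultdict(set)
    for score, features in score_features.items():
        for feature in features:
            feature_scores[feature].add(score)
    return {feature: sorted(scores) for feature, scores in feature_scores.items()}


def reusable_features(feature_scores, min_scores=2):
    """Return the features used by at least `min_scores` analytic scores."""
    return sorted(
        feature for feature, scores in feature_scores.items()
        if len(scores) >= min_scores
    )


feature_scores = map_features_to_scores(score_features)
candidates = reusable_features(feature_scores)

for feature in candidates:
    print(f"{feature}: reused by {', '.join(feature_scores[feature])}")
```

In this toy mapping, `traffic_patterns` feeds only one score, so it is not flagged. The threshold of two scores is an assumption; a team could just as well rank features by how many scores depend on them.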

In the blog “Features Part 1: Are Features the New Data?” I discussed why Features are a more valuable economic asset than data. It’s nice to have a TLADS design template to support that hypothesis.

TLADS Lesson #3: Recommendation Presentation and Model Feedback Template

And now a completely new TLADS design template that I have been considering for many months. I’ve been running these TLADS workshops for several years. And each time I get to the final step, the process feels incomplete, especially with organizations trying to build AI models that continuously learn and adapt. The TLADS methodology desperately needed a formal feedback process. Consequently, welcome to the new “Template 8: Recommendations Presentation and Feedback” template (Figure 4).

Figure 4: Step 8: Recommendations Presentation and Feedback Template

I am 99.9% confident that this form will morph as I get more feedback from running these TLADS workshops. And the fact that the methodology continues to learn and adapt means that this becomes the “Bill Schmarzo Full Employment Act” in ensuring that I have a role to play when AI technologies like ChatGPT and Google Bard are putting everyone else out of work (wait, I still have to work?).

Summary

Figure 5 represents the updated version 2 of my Thinking Like a Data Scientist (TLADS) methodology with the new Step 8. This is the first significant change to that methodology since I developed it eons ago. But the realities of a world where AI can create products, processes, and policies that can continuously learn and adapt necessitated the update—it’s a brave new world, baby.

Figure 5: Updated Thinking Like a Data Scientist Methodology – Version 2.0

Special thanks to my students for being bold in exploring new concepts and techniques. Your energy and creativity were contagious. I can’t wait to do it again!

[1] A feature store is an emerging, ML-specific data system used to centralize storage, processing, and access to frequently used features, making them available for reuse in the development of future machine learning models.

How to use ChatPDF: The AI chatbot that can tell you everything about your PDF

PDFs have become the gold standard for file sharing and you're bound to encounter them in your everyday life.

Regardless of whether it's a one-page job offer or a 100-page academic paper, these files can often contain a lot of text that can be difficult to parse through and digest.

Now, ChatPDF is a free AI tool that's here to assist you with all of your PDF-reading needs.

Simply upload your PDF to ChatPDF, and the chatbot will automatically deliver a summary and suggest questions you can ask to learn more about the file.

The most convenient feature of the chatbot is that it can answer any question you ask about the PDF instantly. It even provides page references for where it got its answers, so you can do your own research.

How to use ChatPDF

Getting started is easy since you don't even need to create an account.

Free plan users, however, are limited to three PDF uploads a day, each of 120 pages or fewer. If you need more access, you can upgrade to a plus plan for $5 per month.

Once you've determined if you want to stick with the free plan or need an upgrade, you can begin.

FAQs

How much does it cost to use ChatPDF?

Using ChatPDF to sum up a file and answer any questions about your PDF is free. However, if you'd like to exceed the free plan's limit of three uploads a day, each with a maximum of 120 pages, you can upgrade to a plus plan for $5 per month.

Is it better to use ChatGPT or ChatPDF to sum up articles?

Since ChatGPT doesn't have access to the internet or to information after 2021, it is prone to making mistakes when it creates summaries. If you need to summarize a PDF, I would recommend utilizing ChatPDF instead of ChatGPT as the results are accurate, quick, free, and more reliable.

What technology does ChatPDF use?

ChatPDF runs on OpenAI's GPT-3.5 large language model, the same LLM behind ChatGPT.

Healthcare Enterprises Treads Super Cautiously with Generative AI

If you are following the world of tech, you probably know the hottest thing right now is generative AI. For better or for worse, everyone is talking about it. So it becomes important to examine whether the real-world applications of generative AI in industries such as healthcare live up to the hype, especially given concerns about the safety of data and the accuracy of the models’ results.

Nitin Aggarwal, who heads Cloud AI services at Google, recently took to LinkedIn to opine that while it is easy to create a “wow” factor with generative AI, it is not so easy to integrate it to solve an end-to-end business problem. In consumer settings, it is easy to use these models to get questions answered or generate an image, but the enterprise is a different ball game altogether.

Aggarwal notes that one common question is how mature enterprise GenAI really is. Questions are hanging in the air around “where my prompts are stored, who owns the IP or the adapter model built on my data, whether a foundational model will be tuned on my data, and so on.” But with AI getting democratised, data will become a very important asset. “The variety and quality of data you own is the biggest IP and competitive advantage you have,” said Aggarwal. And if enterprises don’t dwell on data governance and accountability, it will be easy to lose that differentiator.

In this regard, OpenAI also announced plans to launch ChatGPT Business in the coming months. The offering is aimed at enterprises looking to manage their end users, and OpenAI stressed that these users’ data will not be used to train its models by default. “ChatGPT Business will follow our API’s data usage policies,” said the company.

A cautious approach

One example of a healthcare enterprise using GenAI is Apollo, India’s largest hospital chain. At the end of last year, Apollo created an AI application called Clinical Intelligence Engine (CIE), which uses probabilistic algorithms to determine clinical diagnosis and related information. This technology is touted to be much like ChatGPT in this regard. Trained on numerous medical histories and case studies extracted from Apollo’s proprietary clinical knowledge base as well as millions of anonymised, real-world clinical data points from Apollo, CIE is an expert knowledge system with reasoning power and highly specialised, deep domain knowledge in the clinical area.

Generally, the approach has been fairly cautious, and implemented in those areas where the stakes are lower because they have less direct impact on patients. For instance, Syntegra, an AI healthcare startup, utilised generative AI to generate synthetic data. Janssen Pharmaceutical Cos’ data scientists validated the synthetic data against real data, making it especially valuable for researching less common diseases where acquiring sufficient patient data is challenging.
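
The article does not say how the synthetic data was validated against the real data. Purely as an illustration of what such a check can look like, the sketch below compares a single numeric column using the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distributions (0 means identical, 1 means no overlap). The data is randomly generated and the column is hypothetical; it does not represent any actual patient data or the method used by Syntegra or Janssen.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of the two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value,
    # so it is enough to evaluate both CDFs at every sample point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical example: one numeric measurement drawn from made-up distributions.
random.seed(0)
real = [random.gauss(120, 15) for _ in range(500)]
faithful_synth = [random.gauss(120, 15) for _ in range(500)]  # same distribution
shifted_synth = [random.gauss(140, 15) for _ in range(500)]   # visibly biased

print("faithful:", round(ks_statistic(real, faithful_synth), 3))
print("shifted: ", round(ks_statistic(real, shifted_synth), 3))
```

A faithful synthetic column should score much closer to 0 than a biased one. A real validation would also look at correlations between columns, rare categories, and privacy leakage, none of which this single-column check covers.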

A hard push

Nevertheless, LLM providers are pushing for solutions. Microsoft Azure OpenAI Service’s integration with Epic’s EHR (Electronic Health Record) platform aims to automatically fill in missing information, suggest potential diagnoses, and predict future health outcomes based on historical data. Likewise, Google is looking to explore applications of Med-PaLM 2 in ultrasound, radiotherapy, and other diagnostic and treatment planning processes.

Nvidia, along with Segmed and RadImageNet, is also working to develop models that can create high-quality synthetic images to expand the availability of training data. This will aid in the refinement of medical AI algorithms to improve the accuracy and consistency of medical diagnoses. Additionally, during GTC, the company announced that it will integrate edge AI capabilities into Medtronic’s GI Genius, an AI-assisted colonoscopy tool that helps physicians detect polyps that can lead to colorectal cancer.

Recently, a study also showed that natural language processing (NLP) and artificial intelligence (AI) technologies such as ChatGPT have the potential to produce high-quality clinical letters that are easily understood by patients, while improving efficiency, accuracy, and patient satisfaction and saving costs for the health-care system.

Ghosts of IBM Watson

The crux is that hopes are high all around. But given the challenges, the promises of any AI-focused healthcare startup should be taken with a grain of salt.

“IBM once boasted that Watson could one day find a cure for cancer. No published research has yet to show that Watson improved patient outcomes, and IBM has since abandoned all applications of Watson for healthcare,” reads a WSJ article.

The generative AI hype train might meet the same fate. University of Pittsburgh Medical Center’s Dr Robert Bart told WSJ that future uses of generative AI, such as disease diagnosis, are still far off. However, what it can do right now is improve operational processes such as patient scheduling and flow.

“There are AI algorithms already certified by the US FDA and can be safely used in medicine, but in the case of generative AI, it will be several years before they can be trusted. But then we are in for a truly massive revolution in healthcare,” said Artur Olesch, Founder of aboutDigitalHealth.com.

The post Healthcare Enterprises Treads Super Cautiously with Generative AI appeared first on Analytics India Magazine.